From 16cb9833395a4904ab6d5e37ea538caaaf70e3dd Mon Sep 17 00:00:00 2001
From: Ashwini Khade
Date: Thu, 5 Sep 2019 11:54:21 -0700
Subject: [PATCH] optimize quantize (#1762)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Add more type support for OneHot op (#1565)
* parallel build
* update quantizelinear to process int8 input (#1576) (see the NumPy sketch below)
* Remove unneeded C APIs + some refactoring. (#1555)
* Mention OrtCreateSessionFromArray in C API doc
* c api changes after review (1)
* updates...
* fixes
* Reorder include
* A few performance improvements coming out of ssd_mobilenet and ssd_resnet34 analysis (#1578)
* A few performance improvements:
  - Make the iteration in NonZero more efficient by using a raw pointer and simplifying the increment logic - add another unit test to check the new logic works with a 3-dimensional tensor - gains about 2% for ssd_mobilenet
  - Avoid floating point operations on each iteration in Concat - about 0.5% for ssd_mobilenet and ssd_resnet34
  - Put the common case first in ExecutionFrame::AllocateAsPerAllocationPlan to avoid an unnecessary call to IsSparseTensor - about 0.05% for ssd_mobilenet
  - Minor tweak to put some ctors in the TensorShape header so they can be inlined more easily
* Fix race condition issue in RNN/LSTM/GRU (#1544)
  Description: filter_desc and rnn_desc could be modified during Compute(), which may run on multiple threads, causing a race condition.
  Fix: create temporary cudnn descriptors and cache cudnn_dropout_desc_, which does not change.
* Remove memory copy between TensorRT and CUDA (#1561)
* remove memory copy between CUDA and TRT
* add info to RegisterExecutionProvider input
* use new IDeviceAllocator for trt allocator
* remove SetDefaultInputsMemoryType from TRT EP
* remove onnx-tensorrt 5.0
* add submodule onnx-tensorrt branch 5.1
* remove redundancy
* Update transformer_memcpy.cc
* Update tensorrt_execution_provider.cc
* switch to TensorRT 5.1.5.0
* update python binding
* disable failed test case on TensorRT
* Update activation_op_test.cc
* upgrade to TensorRT container 19.06
* update according to feedback
* add comments
* remove tensorrt allocator and use cuda(gpu) allocator
* update onnx-tensorrt submodule
* change ci build cuda directory name
* Optimize Fence checking performance (#1593)
* For the majority of nodes we do not need to do a fence check; we only need to do a FenceCheck for CPU<->GPU mem sync nodes, but we pay the fence check cost for every single node and every single input and output. This change minimizes the fence check to only do it when necessary.
* Added license files in the base image (#1595)
* Update Dockerfile.openvino
* Update Dockerfile.cuda
* Update Dockerfile.cuda
* Update Dockerfile.openvino
* Update Dockerfile.cuda
* added ThirdParty notice file to base image.
* corrected license file name
* Implement new LabelEncoder in opset 2 in ML domain (#1393)
* Implement new LabelEncoder in opset 2 in ML domain
* Fix compilation error
* Fix tests
* Include ONNX's fix
* Formatting and addressing a comment
* Address a minor comment
* add int64 support for less op. (#1604)
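The `update quantizelinear to process int8 input (#1576)` entry above extends QuantizeLinear handling to int8 tensors. As a rough illustration of the arithmetic involved, here is a minimal NumPy sketch of the ONNX QuantizeLinear formula for an int8 target; the function name and sample values are made up for illustration, and this is not the kernel code changed by that PR:

```python
import numpy as np

def quantize_linear_int8(x, scale, zero_point):
    # ONNX QuantizeLinear for an int8 output:
    # y = saturate(round(x / scale) + zero_point), saturated to [-128, 127].
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
print(quantize_linear_int8(x, scale=np.float32(0.0078125), zero_point=0))
```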
* put all gemmlowp common code in one place (#1590)
* put all gemmlowp common code in one place
* fix gpu build failures
* minor update
* Update nGraph to v0.22.1 (#1582)
* Update nGraph to 0.21 and adjust the EP
* Share the graph initializers between custom ops
* Update nGraph to 0.22 and exclude Gather entirely
* Enable building on Windows with nGraph v0.21.1-rc.0
* Disable the unsigned input Shrink op tests for nGraph until the next update
* Line-shortening code refactor
* Fix for the master branch merge artifact
* MKLDNN patches adjustment for Windows
* Exclude MatMulInteger for non-const zero points
* Exclude ConvInteger for non-const zero points
* Enable full Cast op support
* Use the v0.22.1 tag
* Skip ConvTranspose_InvalidKernelShape test for ngraph provider
* Create sub-graph ModelProto from fused_node
* Include io_win32.h only when building on Windows (#1587)
* Include io_win32.h only when building on Windows
* looks like include order matters
* Fix for CPU random ops seed narrowing conversion. (#1594)
* Fix perf test executable. (#1598)
* Mention OrtCreateSessionFromArray in C API doc
* Fix perf test executable due to removal of certain C APIs
* fix linux build
* Avoid duplication
* Fix mem leak
* Minor perf improvements. (#1580)
* Minor perf improvements.
  - Cache the vector sizes in IExecutionFrame and NodeIndexInfo to avoid calls to size() - 2 instructions instead of 10
  - Remove an unnecessary check in IExecutionFrame - add a check to the ctor so we guarantee it's unnecessary
  - Reserve memory for the vectors in BroadcastIterator - saves reallocs if more than one value is added - but it is rare with the mlperf models for multiple values to be added, so the benefit is limited.
  - slight tweak to the Broadcaster ctor code to make it more readable
* Serialize optimized onnx model (#1470) (see the Python sketch below)
* Model serialization
* Removed duplicate symbol
* Minor update
* Review comments
* add tests
* Model serialization
* Removed duplicate symbol
* Minor update
* Merged PR 1106437: Model Serialization in onnxruntime
* Review comments
* Merged PR 1107226: Review comments
* add tests
* Fixed merge conflict
* Correct python tests
* InferenceSession Refeed Test
* Replace use of widechar const literal-L
* Fixed failing tests
* Updated comment
* Removed unnecessary session options
* Spell check on comments
* Do not serialize when level 3 optimization specified
* Updated error logs
* Changed log severity to WARN
* Fix log message truncation on Windows when printf formatting is used. (#1599)
* Fix log message truncation and add a unit test. On Windows vsnprintf_s returns -1 when truncating, so we need to differentiate that from a real error.
* Remove copy of generator in Multinomial (#1611)
* Remove copy of generator in Multinomial so that different values are generated each time. Add ability to test
* Kezhan/execute graph refactoring (#1553)
* checking execution provider logic updated.
* fix the logic of copy input and output.
* update
* update
* update
* update
* update
* update
* fix ngraph failure.
* fix comments
* Cleanup csharp API SessionOptions and RunOptions to be consistent with other APIs (#1570)
  - Updated SessionOptions API to use properties instead of setter/getter methods.
  - Added missing APIs.
  - Added RunOptions.
* Make changes to pipeline template to include missing headers in tars/zips (#1617)
* Fix trtlogger segfault. re-enable SoftPlus unit test for TRT. add doc… (#1623)
* Fix trtlogger segfault. re-enable SoftPlus unit test for TRT. add documentation for ORT_TENSORRT* env vars.
* Update TensorRT-ExecutionProvider.md
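The `Serialize optimized onnx model (#1470)` entry above lets a session write the graph-transformed model back to disk. A hedged sketch of how this is typically driven from the Python API follows; the `optimized_model_filepath` property name is taken from the current onnxruntime Python binding and is an assumption for the exact state of this patch, and the file paths are placeholders:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Ask the session to save the optimized graph once the transformers have run.
# Per the log above, serialization is skipped when the highest optimization level is requested.
so.optimized_model_filepath = "model.optimized.onnx"

sess = ort.InferenceSession("model.onnx", so)  # loads, optimizes, and serializes
```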
* Use a friendly enum for graph optimization level. (#1586)
* Mention OrtCreateSessionFromArray in C API doc
* review changes
* use enum for graph optimization level
* Use explicit values for enums
* updates...
* Add friendly enum for graph optimization levels in C, C# and Python APIs.
* Fix linux build
* Fix build breakage due to master merge
* PR comments
* Generate documentation from the registered operator kernels (#1395)
  - Added python script for generating markdown doc from the registered op kernels.
  - Made some conditional changes in the pybind to expose the necessary python API
  - Added some missing type-constraints in the op kernel registrations
* Fix incorrect box offset computation in NMS op (#1624)
* More changes
* Fix NMS
* nits
* Integrate featurizers (#1573)
  Added Sample Featurizer and Infrastructure. Make featurizers and unit tests compile and run with GTest. Create definitions for the first featurizer kernel. Add new operator domain. Create datetime_transformer kernel and build. Move OPAQUE type definitions for featurizer kernels out to a separate cc. Register them with the type system. Provide unit tests for the new AutoML DateTimeTransformer kernel. Make necessary adjustments to the test infrastructure to make it run with new types.
* Support int64 for ReduceMax (#1625)
* update onnx to latest commit (#1622)
* update onnx to latest commit
* Disable and/or fix failing tests
* disable not yet implemented tests for opset 11
* disable tests
* fix bug in mkldnn fp16 graph check
* Copy System.Numerics.Tensors sources from dotnet/corefx into onnxruntime (#1605)
* removed --gen_doc (#1633)
* Fix parsing initial hidden state in RNN (#1626)
* Fix the way initial hidden state is used for reverse direction in RNN
* Add test case
* Updates
* Let mlas use session thread pool (#1609)
  1. Let mlas use session thread pool
  2. Remove onnxruntime_USE_MLAS cmake option
  3. Remove the win32 thread pool code inside mlas
  mlas will:
  1. use the ort thread pool if one is passed in
  2. use openmp if the threadpool parameter is nullptr
  3. run single threaded if the threadpool parameter is nullptr and openmp is disabled
* update TRT EP CIs to use latest model.zip (#1637)
* Add AutoML to 3 main builds. (#1631)
  Fix unit tests. Enable copy elision, do not move movable objects on return by value.
* MLAS: add U8U8 MatMul operation (#1644)
  Implement the first round of changes for quantization inside MLAS. This adds a MatMul operation for U8xU8=S32 for x86/x64 processors. (See the NumPy sketch below.)
* Add uint8 Support for NonZero Op (#1614)
* update MKLML to version which contains fix for thread hang. (#1636)
* update MKLML which has bugfix for thread hang. move PATCH_COMMAND outside BUILD_FOR_NATIVE_MACHINE check.
* MKLML_VERSION 2020.0.20190813 is for windows only.
* MlasGetMaximumThreadCount: plus 1 to the NumThreads from ORT thread pool (#1646)
* Update perf tool documentation to reflect the new graph optimization enums. Relax constraint for enable_all. (#1650)
* Allow user to disable multithreading (#1647)
* Update onnx test runner documentation (#1651)
* Mention OrtCreateSessionFromArray in C API doc
* Update perf tool documentation to reflect the new graph optimization enums. Relax constraint for enable_all.
* Update one more doc
* Update onnx test runner documentation
* Add default in the docs
* Fix memory leak in mlas unit test (#1654)
* fix bug on windows where ops were always getting dumped. (#1648)
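The `MLAS: add U8U8 MatMul operation (#1644)` entry above introduces a uint8 x uint8 GEMM that accumulates into int32. A minimal NumPy reference of the U8xU8=S32 accumulation it performs (an illustration of the math with made-up sample data, not the MLAS kernel itself):

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.randint(0, 256, size=(4, 8), dtype=np.uint8)  # M x K, unsigned 8-bit
b = rng.randint(0, 256, size=(8, 3), dtype=np.uint8)  # K x N, unsigned 8-bit

# Widen to int32 before multiplying so products and sums cannot overflow.
c = a.astype(np.int32) @ b.astype(np.int32)            # M x N, signed 32-bit
print(c.dtype, c.shape)
```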
* Remove --whole-archive (#1655)
* Check return value from CreateFeedsFetchesManager. (#1653)
  Also clean up a couple of unused variables.
* Update PyTorch section for supported onnx version (#1635)
  The PyTorch exporter in PyTorch 1.2 can natively support multiple opsets now.
* cudnnRNNForwardInferenceEx doesn't support 0-length sequences in the batches
  Fix the issue that cudnnRNNForwardInferenceEx doesn't support 0-length sequences in the batches.
  Solution: Reset the 0-length sequences to 1 for the batches before calling cudnnRNNForwardInferenceEx, and keep an array tracking the batch ids that have 0-length sequences. Once the result is obtained, call a CUDA kernel to mask the output using the batch ids tracked in the array.
* Add details of which node was not able to be placed on an execution provider. (#1665)
* nGraph EP Optimizations (#1630)
* Added check for unnecessary function initializations, and removed lock from unneeded areas of code.
* Added LRU cache to EP.
* Bugfixes for nGraph EP Optimization PR
* Changed default cache size to 500 and refactored mutex readability.
* Fixed unsafe environmental variable fetch for Windows.
* Cleaned up Windows environment functions and cleaned up mutexes.
* Fix a few errors in the NuGet pipeline (still broken) (#1656)
* update set fetches for execution with allocation plan. (#1668)
* Support Tensor<bool> and Tensor<int8> in C# API. Support Tensor<string> as input. Fix a bug in the InferenceSession Run() with RunOptions (#1671)
  - Support bool-Tensor and int8-Tensor in input-output of C# api
  - Support string-tensor as input in C# api
  - Fix a bug in InferenceSession.Run() -- RunOptions was not passed into the native call
* Optimize kernel index (#1672)
* update clip for opset 11 (#1661)
* update clip for opset 11
* exclude ngraph provider for clip unit tests
* exclude ngraph for all clip opset 11 tests
* fix op version
* Add support of ReduceSum int64 (#1664)
* Add support of ReduceSum int64
* add unit test for int64
* int64 support for 'where' op (#1666)
* Added some mo optimizations to improve performance (#1674)
  Signed-off-by: suryasidd
* Don't create the default allocator every single time. Rename API accordingly. Expose Session/Run log severity levels. (#1615)
* Mention OrtCreateSessionFromArray in C API doc
* Don't create the default allocator every single time. Rename API accordingly.
* Don't create the default allocator every single time. Rename API accordingly.
* updates...
* updates...
* PR comments
* fix typo in license header
* fix build
* Share default CPU allocator with Mlas preferred alignment (#1682)
  Description: make the default CPU allocator use the MLAS preferred alignment.
  Motivation and Context: this is needed for the C API to have an aligned default CPU allocator, the same as the one in the CPU provider.
* More fixes on the NuGet CPU CI pipeline (#1688)
  - Fix the Windows end-to-end test in NuGet CI
  - Skip the TestModelSerialization, because it is failing on Linux. Must be fixed before the API is released for use. Owner is notified.
* treat zero point properly (#1686)
* use MLAS for QGEMM in matmulInteger and convInteger (#1692)
* use mlas qgemm for u8u8_s32 gemms
* update test
* fix typo in max batch size error msg. (#1687)
* Python API naming and other cleanup (#1678)
  - Make the naming of properties in python SessionOptions and RunOptions consistent with other apis.
  - Remove unnecessary apis
* make gemmlowp default for arm (#1701)
* make gemmlowp default for arm
* force use_gemmlowp in header for default case
* remove unnecessary white space
* Doc updates (#1522)
* Updates
* Remove preview texts
* Update README.md
* Updates
* Update README.md
* Update README.md
* Minor wording update
* Update README.md
* Update doc on CUDA version
* revert update
* Update readme for issue #1558
* Clean up example section
* Cosmetic updates - Add an index of build instructions for browsability - Update build CUDA version from 9.1 to 10
* Fix broken link
* Update README to reflect upgrade to pip requirement
* Update CuDNN version for Linux Python packages
* Clean up content: updated ordering and added table of contents
* Minor format fixes
* Move Android NNAPI under EP section
* Add link to operator support documentation
* Fix typo
* typo fix
* remove todo section
* remove @PCGOTREL x64 usage (#1707)
  Avoid the need for @PCGOTREL relocations by annotating MLAS global data shared with assembly modules with attribute(visibility("hidden")).
* MLAS: Android sgemm kernel build fix (#1710)
  Fix the aarch64 kernel to build properly with the Android NDK (specifically clang).
* Remove TaskThreadPool (#1713)
* Allow input used across execution providers as long as they use the same allocator device (#1715)
  Description: Currently ORT throws an error when one input is used in different EPs. This change removes that restriction.
  Motivation and Context: It is now possible to share inputs across EPs now that allocations are device-based, instead of EP-based.
* Add support for int8 x uint8 for MatMulInteger, and int16 x int16 custom op (#1391)
  Description: The change adds the necessary quantization support on CPU for mixed int8/uint8, as well as int16, matrix multiply operations that output int32.
  Motivation and Context: Integer operations are critical for quantized models' performance. The current MatMulInteger implementation on CPU only supports uint8 x uint8, while the spec supports int8 x uint8. Having a default CPU implementation that fully supports the spec would help accuracy verification. Besides, some models may need to quantize to int16, but the MatMulInteger op does not support that yet. A custom op, MatMulInteger16, is added to satisfy such models. (See the NumPy sketch below.)
* Use exec form of ENTRYPOINT for docker server (#1690)
* Use exec form of ENTRYPOINT for docker server
  # Issue
  The entrypoint currently uses the shell form - this prevents users from passing in any cmdline arguments... also passing a model_path in means the server only works if the envvar is set... however this is not what the error message says!
  ```
  $ docker run -v /home/rakelkar/try/onnxzoo/style:/mnt/models -it mcr.microsoft.com/onnxruntime/server --model_path /mnt/models/model.onnx
  Version: local_build
  Commit ID: default
  model_path must be the location of a valid file
  Allowed options:
    -h [ --help ]                Shows a help message and exits
    --log_level arg (=info)      Logging level. Allowed options (case sensitive): verbose, info, warning, error, fatal
    --model_path arg             Path to ONNX model
    --address arg (=0.0.0.0)     The base HTTP address
    --http_port arg (=8001)      HTTP port to listen to requests
    --num_http_threads arg (=4)  Number of http threads
    --grpc_port arg (=50051)     GRPC port to listen to requests
  ```
  # Fix
  1. remove the env var
  2. use the exec form
* Update readme to use model_path arg
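Several entries above (`treat zero point properly (#1686)`, `use MLAS for QGEMM in matmulInteger and convInteger (#1692)`, and the int8 x uint8 MatMulInteger support in #1391) revolve around zero-point handling in integer matrix multiplication. A minimal NumPy reference of the MatMulInteger semantics, with zero points subtracted before the int32 accumulation (a sketch of the ONNX operator definition with made-up sample data, not the ORT kernels):

```python
import numpy as np

def matmul_integer(a, b, a_zero_point=0, b_zero_point=0):
    # MatMulInteger: C = (A - a_zero_point) * (B - b_zero_point), accumulated in int32.
    # A and B may each be int8 or uint8; widening to int32 first avoids overflow.
    a32 = a.astype(np.int32) - np.int32(a_zero_point)
    b32 = b.astype(np.int32) - np.int32(b_zero_point)
    return a32 @ b32

a = np.array([[-1, 2], [3, -4]], dtype=np.int8)   # signed 8-bit input
b = np.array([[5, 6], [7, 8]], dtype=np.uint8)    # unsigned 8-bit input
print(matmul_integer(a, b, a_zero_point=1, b_zero_point=128))
```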
* Support 'Bilinear' mode for 2D inputs in Resize and Upsample kernels (#1679)
* Support bilinear mode with actual 2D inputs in Resize and upsample
* Fix build break
* Fix build break
* Add test
* CUDA changes
* Resolve PR comments
* Resolve comments
* add implementation for dynamic quantize linear (#1697) (see the NumPy sketch below)
* Fix reading of onnx domain causing one of the automl models to break in 0.5 release. (#1694)
* Mention OrtCreateSessionFromArray in C API doc
* Fix registration of Equal op causing one of the automl models to break in 0.5 release.
* updates...
* Fix an issue where the CUDA EP falls back too many nodes to CPU in some cases, which causes huge data copies. If a node's inputs are all initializers, we shouldn't fall back that node to CPU. (#1727)
  https://github.com/microsoft/onnxruntime/issues/1675
  Currently, if a node's inputs are all initializers, the CUDA EP will fall back that node to CPU, and it will also fall back some nodes under it. This can cause huge data copies. In the case reported by a user, the model has several Slice ops whose inputs come from initializers, and a Concat op that concatenates the Slice outputs. The data is huge (16MB) after the Concat, which makes the copy from CPU to GPU quite costly because it is a sync copy.
  Fix: if a node's inputs are all initializers, we shouldn't fall back the node to CPU.
* Publish perf tool with nightly build (#1728)
* Update the docker file for OpenVINO (#1741)
  Update the docker file for OpenVINO, which is used for AML.
* Fix typo in NMS code
* MKL-DNN EP: control flow fix (#1740)
* moved subgraph_index to MklDnn Execution Provider
* code cleanup
* Implementation of Nuphar execution provider (#881)
* Implement Nuphar execution provider
  The Nuphar execution provider is a TVM-based compilation provider. It has shown great speedups for RNN models using Scan. This PR is mainly a preview of the shared codegen library for other TVM-based providers.
* Fix submodules
* Fix TVM submodule
* Update Nuphar to latest and resolve conflicts
* Remove stale files caused by merge -X theirs
* Revert heap buffer change to not introduce onnxruntime_framework into onnxruntime_perf_test
* Fix bad merge
* Merge from Nuphar
* Fix warning treated as error, revert some unnecessary changes
* Revert some more test changes
* Some more test reverts or comments to make review easier. New tests could be added later.
* One more revert of unnecessary changes
* More change reverts. Tests could be added back later.
* Enforce shape validation. (#1716)
* Mention OrtCreateSessionFromArray in C API doc
* Enforce shape validation.
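The `add implementation for dynamic quantize linear (#1697)` entry above adds a kernel that derives the scale and zero point from the input data before quantizing to uint8. A minimal NumPy sketch of that computation as the ONNX DynamicQuantizeLinear operator defines it (an illustration only, with made-up sample data, not the ORT kernel):

```python
import numpy as np

def dynamic_quantize_linear(x):
    # Derive scale/zero_point from the data range (adjusted to include 0), then quantize to uint8.
    qmin, qmax = 0.0, 255.0
    x_min = min(float(x.min()), 0.0)  # the range must include zero
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:                  # constant all-zero input: avoid divide-by-zero
        scale = 1.0
    zero_point = np.round(np.clip(qmin - x_min / scale, qmin, qmax)).astype(np.uint8)
    y = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return y, np.float32(scale), zero_point

x = np.array([-1.0, -0.25, 0.0, 0.5, 2.0], dtype=np.float32)
y, scale, zp = dynamic_quantize_linear(x)
print(y, scale, zp)
```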
* Update broken models * enable quantizing specific nodes (#1742) * update quantization script --- .gitmodules | 8 +- BUILD.md | 328 +- README.md | 93 +- cgmanifest.json | 2 +- cmake/CMakeLists.txt | 26 +- cmake/external/mkldnn.cmake | 10 +- cmake/external/ngraph.cmake | 8 +- cmake/external/onnx | 2 +- cmake/external/onnx-tensorrt | 2 +- cmake/external/tvm | 2 +- cmake/onnxruntime.cmake | 16 +- cmake/onnxruntime_automl_featurizers.cmake | 44 + cmake/onnxruntime_common.cmake | 9 +- cmake/onnxruntime_graph.cmake | 8 + cmake/onnxruntime_mlas.cmake | 59 +- cmake/onnxruntime_nuphar_extern.cmake | 39 + cmake/onnxruntime_providers.cmake | 87 +- cmake/onnxruntime_python.cmake | 13 + cmake/onnxruntime_unittests.cmake | 40 +- cmake/onnxruntime_util.cmake | 11 +- .../ngraph/ngraph_fix_install_error.patch | 127 - .../ngraph/ngraph_fix_library_path.patch | 33 - .../ngraph_fix_mkldnn_missing_symbol.patch | 64 + csharp/OnnxRuntime.CSharp.proj | 10 +- .../OnnxRuntime.snk | Bin .../Program.cs | 4 +- .../DisposableNamedOnnxValue.cs | 11 +- .../InferenceSession.cs | 160 +- .../Microsoft.ML.OnnxRuntime.csproj | 72 +- .../NamedOnnxValue.cs | 137 +- .../NativeMemoryAllocator.cs | 19 +- .../Microsoft.ML.OnnxRuntime/NativeMethods.cs | 65 +- .../Microsoft.ML.OnnxRuntime/OnnxRuntime.cs | 32 +- .../Microsoft.ML.OnnxRuntime/RunOptions.cs | 120 + .../SessionOptions.cs | 319 +- .../Tensors/ArrayTensorExtensions.cs | 66 + .../Tensors/ArrayUtilities.cs | 227 + .../Tensors/DenseTensor.cs | 188 + .../Tensors/Tensor.cs | 1311 ++ .../CXX_Api_Sample.cpp | 11 +- .../C_Api_Sample.cpp | 9 +- .../InferenceTest.cs | 183 +- .../Microsoft.ML.OnnxRuntime.Tests.csproj | 32 + .../Tensors/NativeMemory.cs | 119 + .../Tensors/TensorArithmetic.cs | 16201 ++++++++++++++++ .../Tensors/TensorArithmetic.tt | 249 + .../Tensors/TensorExtensions.cs | 42 + .../Tensors/TensorOperations.cs | 750 + .../Tensors/TensorOperations.tt | 251 + .../Tensors/TensorTemplate.ttinclude | 328 + .../Tensors/TensorTests.cs | 2243 +++ .../Tensors/TensorTestsBase.cs | 164 + csharp/testdata/test_types_BOOL.pb | Bin 167 -> 151 bytes csharp/testdata/test_types_INT8.pb | Bin 167 -> 151 bytes csharp/testdata/test_types_STRING.pb | Bin 167 -> 151 bytes .../Program.cs | 23 +- dockerfiles/Dockerfile.cuda | 8 +- dockerfiles/Dockerfile.openvino | 24 +- dockerfiles/Dockerfile.server | 3 +- dockerfiles/Dockerfile.source | 7 +- dockerfiles/Dockerfile.tensorrt | 8 +- dockerfiles/README.md | 21 +- .../{ => scripts}/install_common_deps.sh | 8 +- docs/ONNX_Runtime_Perf_Tuning.md | 2 +- docs/OperatorKernels.md | 470 + docs/Versioning.md | 2 +- .../Nuphar-ExecutionProvider.md | 142 + .../TensorRT-ExecutionProvider.md | 26 +- include/onnxruntime/core/common/callback.h | 17 - .../onnxruntime/core/framework/allocator.h | 6 + .../onnxruntime/core/framework/data_types.h | 3 + .../core/framework/kernel_def_builder.h | 6 + .../core/framework/kernel_registry.h | 8 + .../onnxruntime/core/framework/op_kernel.h | 7 +- include/onnxruntime/core/framework/tensor.h | 4 +- .../onnxruntime/core/framework/tensor_shape.h | 9 +- include/onnxruntime/core/graph/constants.h | 1 + include/onnxruntime/core/graph/graph.h | 1 - include/onnxruntime/core/graph/graph_viewer.h | 11 +- .../onnxruntime/core/platform/threadpool.h | 23 +- .../nuphar/nuphar_provider_factory.h | 17 + .../tensorrt/tensorrt_provider_factory.h | 2 +- .../core/session/onnxruntime_c_api.h | 79 +- .../core/session/onnxruntime_cxx_api.h | 21 +- .../core/session/onnxruntime_cxx_inline.h | 24 +- onnxruntime/__init__.py | 2 +- 
onnxruntime/automl_ops/automl_featurizers.h | 8 + onnxruntime/automl_ops/automl_types.cc | 39 + onnxruntime/automl_ops/automl_types.h | 13 + .../automl_ops/cpu/datetime_transformer.cc | 42 + onnxruntime/automl_ops/cpu_automl_kernels.cc | 25 + onnxruntime/automl_ops/cpu_automl_kernels.h | 13 + .../cpu/attnlstm/attention_wrapper.cc | 25 +- .../cpu/attnlstm/attention_wrapper.h | 4 +- .../cpu/attnlstm/bahdanau_attention.cc | 34 +- .../cpu/attnlstm/bahdanau_attention.h | 3 +- .../cpu/attnlstm/deep_cpu_attn_lstm.cc | 23 +- .../cpu/attnlstm/uni_dir_attn_lstm.cc | 8 +- .../cpu/attnlstm/uni_dir_attn_lstm.h | 4 +- .../contrib_ops/cpu/matmul_integer16.cc | 45 + .../contrib_ops/cpu/matmul_integer16.h | 22 + onnxruntime/contrib_ops/cpu/nchwc_ops.cc | 3 - onnxruntime/contrib_ops/cpu/nchwc_ops.h | 2 + .../contrib_ops/cpu/word_conv_embedding.cc | 12 +- .../contrib_ops/cpu/word_conv_embedding.h | 5 +- .../contrib_ops/cpu_contrib_kernels.cc | 2 + .../src/FeaturizerPrep/Featurizer.h | 163 + .../Featurizers/DateTimeFeaturizer.cpp | 56 + .../Featurizers/DateTimeFeaturizer.h | 101 + .../FeaturizerPrep/Featurizers/SampleAdd.cpp | 40 + .../FeaturizerPrep/Featurizers/SampleAdd.h | 95 + .../Featurizers/UnitTests/CMakeLists.txt | 48 + .../DateTimeFeaturizer_UnitTests.cpp | 125 + .../UnitTests/SampleAdd_UnitTest.cpp | 22 + .../Featurizers/UnitTests/code_coverage.yaml | 5 + .../featurizers/src/FeaturizerPrep/Traits.h | 218 + .../FeaturizerPrep/UnitTests/CMakeLists.txt | 41 + .../UnitTests/Featurizer_UnitTest.cpp | 104 + .../UnitTests/Traits_UnitTests.cpp | 40 + .../FeaturizerPrep/UnitTests/test_main.cpp | 18 + onnxruntime/core/codegen/common/common.cc | 11 +- onnxruntime/core/codegen/common/creator.h | 2 +- onnxruntime/core/codegen/common/dispatcher.h | 2 + onnxruntime/core/codegen/common/profile.h | 2 +- onnxruntime/core/codegen/common/registry.h | 2 + onnxruntime/core/codegen/common/settings.cc | 4 + onnxruntime/core/codegen/common/settings.h | 1 + onnxruntime/core/codegen/mti/math/gemm.cc | 6 +- .../core/codegen/mti/math/matmul_ops.cc | 37 +- .../core/codegen/mti/math/matmul_ops.h | 5 + .../core/codegen/mti/math/unary_ops.cc | 29 +- onnxruntime/core/codegen/mti/mti_tvm_utils.cc | 34 + onnxruntime/core/codegen/mti/mti_tvm_utils.h | 3 + .../math/quantize/matmul_integer.cc | 18 +- .../passes/op_ir_creator/tensor/crop.cc | 3 +- .../passes/op_ir_creator/tensor/transpose.cc | 10 +- .../passes/op_ir_creator/tvm_op_creator.h | 2 +- .../codegen/passes/scheduler/tvm_scheduler.h | 2 +- .../codegen/passes/utils/ort_tvm_utils.cc | 14 +- .../passes/weight_layout/weight_layout.h | 2 +- onnxruntime/core/common/logging/capture.cc | 14 +- onnxruntime/core/common/profiler.cc | 18 + onnxruntime/core/common/profiler.h | 19 + onnxruntime/core/common/task_thread_pool.h | 213 - onnxruntime/core/common/threadpool.cc | 198 +- .../core/framework/allocation_planner.cc | 99 +- onnxruntime/core/framework/allocator.cc | 55 +- onnxruntime/core/framework/bfc_arena.h | 2 +- onnxruntime/core/framework/callback.cc | 10 +- onnxruntime/core/framework/callback.h | 15 + onnxruntime/core/framework/data_types.cc | 41 + onnxruntime/core/framework/error_code.cc | 5 +- onnxruntime/core/framework/execution_frame.cc | 97 +- onnxruntime/core/framework/execution_frame.h | 7 +- .../core/framework/feeds_fetches_manager.h | 3 +- .../core/framework/graph_partitioner.cc | 5 - .../core/framework/kernel_registry_manager.cc | 18 +- onnxruntime/core/framework/mem_pattern.h | 4 +- onnxruntime/core/framework/node_index_info.cc | 4 + 
onnxruntime/core/framework/node_index_info.h | 8 +- .../framework/op_kernel_context_internal.h | 4 +- .../core/framework/parallel_executor.cc | 76 +- .../core/framework/parallel_executor.h | 1 - onnxruntime/core/framework/run_options.cc | 10 + .../framework/sequential_execution_plan.h | 9 + .../core/framework/sequential_executor.cc | 72 +- onnxruntime/core/framework/session_state.cc | 149 +- onnxruntime/core/framework/session_state.h | 60 +- .../framework/session_state_initializer.cc | 139 +- .../framework/session_state_initializer.h | 9 +- onnxruntime/core/framework/tensor.cc | 4 +- onnxruntime/core/framework/tensor_shape.cc | 12 +- .../core/framework/tensorprotoutils.cc | 4 +- onnxruntime/core/framework/utils.cc | 162 +- onnxruntime/core/framework/utils.h | 2 + .../core/graph/automl_ops/automl_defs.cc | 46 + .../core/graph/automl_ops/automl_defs.h | 30 + .../core/graph/contrib_ops/contrib_defs.cc | 39 +- onnxruntime/core/graph/graph_viewer.cc | 23 +- onnxruntime/core/graph/model.cc | 13 +- onnxruntime/core/mlas/inc/mlas.h | 21 + .../aarch64/{sgemma.s => SgemmKernelNeon.S} | 26 +- .../mlas/lib/amd64/AssembleAvx512Vnni.inc | 232 + .../mlas/lib/amd64/QgemmU8U8KernelAvx2.asm | 1241 ++ .../lib/amd64/QgemmU8U8KernelAvx512BW.asm | 114 + .../lib/amd64/QgemmU8U8KernelAvx512Common.inc | 385 + .../lib/amd64/QgemmU8U8KernelAvx512Vnni.asm | 91 + .../arm64/{sgemma.asm => SgemmKernelNeon.asm} | 41 +- onnxruntime/core/mlas/lib/erf.cpp | 2 +- onnxruntime/core/mlas/lib/logistic.cpp | 2 +- onnxruntime/core/mlas/lib/mlasi.h | 91 +- onnxruntime/core/mlas/lib/platform.cpp | 45 +- onnxruntime/core/mlas/lib/qgemm.cpp | 599 + onnxruntime/core/mlas/lib/sgemm.cpp | 2 +- onnxruntime/core/mlas/lib/tanh.cpp | 2 +- onnxruntime/core/mlas/lib/threading.cpp | 92 +- .../core/mlas/lib/x86_64/AssembleAvx512Vnni.h | 238 + .../core/mlas/lib/x86_64/ErfKernelFma3.S | 8 +- .../core/mlas/lib/x86_64/LogisticKernelFma3.S | 5 +- .../mlas/lib/x86_64/QgemmU8U8KernelAvx2.S | 1121 ++ .../mlas/lib/x86_64/QgemmU8U8KernelAvx512BW.S | 120 + .../lib/x86_64/QgemmU8U8KernelAvx512Common.h | 361 + .../lib/x86_64/QgemmU8U8KernelAvx512Vnni.S | 95 + .../core/mlas/lib/x86_64/SconvKernelAvx.S | 6 + .../core/mlas/lib/x86_64/SconvKernelAvx512F.S | 3 + .../core/mlas/lib/x86_64/SconvKernelSse2.S | 3 + .../core/mlas/lib/x86_64/SgemmKernelAvx.S | 15 +- .../core/mlas/lib/x86_64/SgemmKernelFma3.S | 9 +- .../core/mlas/lib/x86_64/SgemmKernelM1Avx.S | 5 +- .../lib/x86_64/SgemmKernelM1TransposeBAvx.S | 5 +- .../core/mlas/lib/x86_64/TanhKernelFma3.S | 5 +- .../optimizer/optimizer_execution_frame.cc | 4 +- .../optimizer/optimizer_execution_frame.h | 2 +- .../core/optimizer/transformer_memcpy.cc | 11 +- onnxruntime/core/platform/env.h | 2 +- onnxruntime/core/platform/posix/env.cc | 8 +- onnxruntime/core/platform/windows/env.cc | 2 +- onnxruntime/core/providers/common.h | 22 + .../core/providers/cpu/controlflow/loop.cc | 1 - .../providers/cpu/controlflow/scan_utils.cc | 1 - .../core/providers/cpu/controlflow/utils.h | 3 +- .../providers/cpu/cpu_execution_provider.cc | 66 +- .../core/providers/cpu/generator/random.cc | 119 +- .../core/providers/cpu/generator/random.h | 55 +- onnxruntime/core/providers/cpu/math/clip.cc | 9 +- onnxruntime/core/providers/cpu/math/clip.h | 36 +- .../providers/cpu/math/element_wise_ops.cc | 54 +- .../providers/cpu/math/element_wise_ops.h | 33 +- onnxruntime/core/providers/cpu/math/gemm.h | 8 +- .../core/providers/cpu/math/logsoftmax.cc | 7 +- onnxruntime/core/providers/cpu/math/matmul.cc | 7 +- .../core/providers/cpu/math/matmul_helper.h 
| 5 +- .../core/providers/cpu/math/matmul_integer.cc | 122 +- .../core/providers/cpu/math/matmul_integer.h | 6 +- .../cpu/math/quantize_linear_matmul.cc | 104 +- .../cpu/math/quantize_linear_matmul.h | 3 +- .../core/providers/cpu/math/softmax.cc | 6 +- .../core/providers/cpu/math/softmax_shared.cc | 5 +- .../core/providers/cpu/math/softmax_shared.h | 5 +- .../core/providers/cpu/ml/label_encoder.cc | 109 +- .../core/providers/cpu/ml/label_encoder.h | 62 + onnxruntime/core/providers/cpu/nn/Unpool.cc | 6 +- onnxruntime/core/providers/cpu/nn/conv.cc | 23 +- .../core/providers/cpu/nn/conv_integer.cc | 67 +- .../core/providers/cpu/nn/conv_transpose.cc | 9 +- onnxruntime/core/providers/cpu/nn/pool.cc | 2 +- onnxruntime/core/providers/cpu/nn/pool_base.h | 13 +- .../core/providers/cpu/nn/qlinearconv.cc | 110 +- .../core/providers/cpu/nn/qlinearconv.h | 42 +- .../object_detection/non_max_suppression.cc | 10 +- .../cpu/object_detection/roialign.cc | 2 +- .../providers/cpu/reduction/reduction_ops.cc | 10 + .../core/providers/cpu/rnn/deep_cpu_gru.cc | 29 +- .../core/providers/cpu/rnn/deep_cpu_lstm.cc | 32 +- .../core/providers/cpu/rnn/deep_cpu_lstm.h | 4 +- onnxruntime/core/providers/cpu/rnn/rnn.cc | 19 +- .../core/providers/cpu/rnn/rnn_helpers.h | 6 +- onnxruntime/core/providers/cpu/symbols.txt | 9 +- .../core/providers/cpu/tensor/cast_op.cc | 4 +- .../core/providers/cpu/tensor/compress.cc | 3 +- .../core/providers/cpu/tensor/concat.cc | 43 +- .../cpu/tensor/dynamicquantizelinear.cc | 75 + .../cpu/tensor/dynamicquantizelinear.h | 20 + .../core/providers/cpu/tensor/identity_op.cc | 3 +- .../core/providers/cpu/tensor/nonzero_op.cc | 44 +- .../core/providers/cpu/tensor/onehot.cc | 5 +- .../providers/cpu/tensor/quantize_linear.cc | 37 +- onnxruntime/core/providers/cpu/tensor/size.cc | 3 +- onnxruntime/core/providers/cpu/tensor/tile.cc | 3 +- .../core/providers/cpu/tensor/upsample.cc | 69 +- .../core/providers/cpu/tensor/upsample.h | 7 +- .../core/providers/cpu/tensor/where_op.cc | 2 +- .../core/providers/cuda/cuda_allocator.cc | 3 +- .../core/providers/cuda/cuda_allocator.h | 6 +- .../providers/cuda/cuda_execution_provider.cc | 21 +- .../core/providers/cuda/cudnn_common.h | 2 +- .../cuda/math/binary_elementwise_ops.cc | 40 +- .../core/providers/cuda/rnn/cudnn_rnn_base.cc | 159 +- .../core/providers/cuda/rnn/cudnn_rnn_base.h | 82 +- onnxruntime/core/providers/cuda/rnn/gru.h | 3 +- onnxruntime/core/providers/cuda/rnn/lstm.h | 3 +- onnxruntime/core/providers/cuda/rnn/rnn.h | 6 +- .../core/providers/cuda/rnn/rnn_impl.cu | 50 +- .../core/providers/cuda/rnn/rnn_impl.h | 7 + .../core/providers/cuda/tensor/compress.cc | 3 +- .../core/providers/cuda/tensor/resize_impl.cu | 71 +- .../core/providers/cuda/tensor/tile.cc | 3 +- .../core/providers/cuda/tensor/upsample.cc | 28 +- .../providers/cuda/tensor/upsample_impl.cu | 68 +- .../mkldnn/mkldnn_execution_provider.cc | 8 +- .../mkldnn/mkldnn_execution_provider.h | 2 + .../mkldnn/mkldnn_provider_factory.cc | 1 + .../core/providers/mkldnn/subgraph/subgraph.h | 6 +- .../core/providers/ngraph/ngraph_custom_op.cc | 95 +- .../core/providers/ngraph/ngraph_custom_op.h | 7 +- .../ngraph/ngraph_execution_provider.cc | 189 +- .../ngraph/ngraph_execution_provider.h | 9 +- .../nuphar/common/analysis/analysis.h | 45 + .../nuphar/common/analysis/graph_stats.h | 77 + .../common/analysis/output_alias_analysis.cc | 109 + .../common/analysis/output_alias_analysis.h | 43 + .../nuphar/common/analysis/shape_expr.h | 243 + .../common/analysis/subgraph_codegen_stats.cc | 63 + 
.../common/analysis/subgraph_codegen_stats.h | 34 + .../analysis/subgraph_partition_stats.cc | 28 + .../analysis/subgraph_partition_stats.h | 31 + .../common/analysis/use_count_analysis.cc | 264 + .../common/analysis/use_count_analysis.h | 83 + .../nuphar/common/nuphar_settings.cc | 132 + .../providers/nuphar/common/nuphar_settings.h | 48 + .../providers/nuphar/common/nuphar_subgraph.h | 106 + .../nuphar/common/nuphar_tvm_utils.cc | 174 + .../nuphar/common/nuphar_tvm_utils.h | 26 + .../core/providers/nuphar/common/utils.cc | 76 + .../core/providers/nuphar/common/utils.h | 23 + .../nuphar/compiler/codegen_manager.cc | 233 + .../nuphar/compiler/codegen_manager.h | 43 + .../providers/nuphar/compiler/func_info.cc | 562 + .../providers/nuphar/compiler/func_info.h | 122 + .../nuphar/compiler/initializer_info.h | 34 + .../nuphar/compiler/nuphar_codegen_ctx.cc | 247 + .../nuphar/compiler/nuphar_codegen_ctx.h | 147 + .../nuphar/compiler/nuphar_compiler.cc | 229 + .../nuphar/compiler/nuphar_compiler.h | 65 + .../providers/nuphar/compiler/nuphar_handle.h | 40 + .../nuphar/compiler/nuphar_op_ir_builder.cc | 311 + .../nuphar/compiler/nuphar_op_ir_builder.h | 34 + .../compiler/nuphar_schedule_builder.cc | 77 + .../nuphar/compiler/nuphar_schedule_builder.h | 20 + .../nuphar/compiler/traverse_shape_infer.cc | 128 + .../nuphar/compiler/traverse_shape_infer.h | 49 + .../compiler/x86/op_ir_creator/all_ops.h | 64 + .../compiler/x86/op_ir_creator/math/gemm.cc | 52 + .../x86/op_ir_creator/math/logsoftmax.cc | 32 + .../compiler/x86/op_ir_creator/math/matmul.cc | 148 + .../math/quantize/matmul_integer.cc | 126 + .../x86/op_ir_creator/math/reduce_ops.cc | 169 + .../x86/op_ir_creator/math/softmax.cc | 32 + .../x86/op_ir_creator/math/unary_ops.cc | 124 + .../x86/op_ir_creator/tensor/slice.cc | 72 + .../compiler/x86/op_ir_creator/tensor/tile.cc | 34 + .../x86/scheduler/analysis_schedule.cc | 31 + .../x86/scheduler/nuphar_scheduler.cc | 53 + .../compiler/x86/scheduler/nuphar_scheduler.h | 41 + .../x86/scheduler/ort_type_schedule.cc | 270 + .../x86/scheduler/partial_schedule.cc | 21 + .../scheduler/tensorize/intrin_gemv_16bit.cc | 100 + .../scheduler/tensorize/intrin_gemv_16bit.h | 20 + .../scheduler/tensorize/intrin_gemv_8bit.cc | 104 + .../scheduler/tensorize/intrin_gemv_8bit.h | 20 + .../tensorize/intrin_gemv_ll_extern.cc | 103 + .../tensorize/intrin_gemv_ll_extern.h | 13 + .../scheduler/tensorize/intrin_gemv_ll_ir.cc | 96 + .../scheduler/tensorize/intrin_gemv_ll_ir.h | 20 + .../x86/scheduler/tensorize/ll/gemv_impl.cpp | 18 + .../x86/scheduler/tensorize/ll/gemv_impl.h | 137 + .../x86/scheduler/tensorize/tensorize_base.h | 76 + .../tensorize/tensorize_utilities.cc | 73 + .../scheduler/tensorize/tensorize_utilities.h | 30 + .../x86/scheduler/tensorize_schedule.cc | 144 + .../x86/scheduler/tvm_rule_schedule.cc | 120 + .../nuphar/compiler/x86/x86_target_info.cc | 19 + .../nuphar/compiler/x86/x86_target_info.h | 33 + .../providers/nuphar/extern/igemv_avx2.cc | 740 + .../core/providers/nuphar/extern/igemv_avx2.h | 46 + .../core/providers/nuphar/extern/igemv_mkl.cc | 36 + .../core/providers/nuphar/extern/igemv_mkl.h | 30 + onnxruntime/core/providers/nuphar/kernel.cc | 230 + onnxruntime/core/providers/nuphar/kernel.h | 152 + .../nuphar/mti_x86/math/halide_ops.cc | 307 + .../nuphar/mti_x86/math/halide_ops.h | 49 + .../nuphar/mti_x86/math/logsoftmax.cc | 16 + .../nuphar/mti_x86/math/logsoftmax.h | 15 + .../nuphar/mti_x86/math/matmul_ops.cc | 232 + .../nuphar/mti_x86/math/matmul_ops.h | 24 + 
.../nuphar/mti_x86/math/reduce_ops.cc | 356 + .../nuphar/mti_x86/math/reduce_ops.h | 40 + .../providers/nuphar/mti_x86/math/softmax.cc | 16 + .../providers/nuphar/mti_x86/math/softmax.h | 15 + .../nuphar/mti_x86/math/softmax_internal.cc | 56 + .../nuphar/mti_x86/math/softmax_internal.h | 17 + .../nuphar/mti_x86/math/unary_ops.cc | 186 + .../providers/nuphar/mti_x86/math/unary_ops.h | 25 + .../mti_x86/quantize/imatmul16_extern.cc | 149 + .../mti_x86/quantize/imatmul16_extern.h | 29 + .../nuphar/mti_x86/quantize/imatmul_extern.cc | 143 + .../nuphar/mti_x86/quantize/imatmul_extern.h | 29 + .../nuphar/nuphar_execution_provider.cc | 410 + .../nuphar/nuphar_execution_provider.h | 163 + .../nuphar/nuphar_provider_factory.cc | 36 + .../nuphar/partition/graph_partitioner.cc | 190 + .../nuphar/partition/graph_partitioner.h | 48 + .../providers/nuphar/partition/partitioner.cc | 266 + .../providers/nuphar/partition/partitioner.h | 100 + .../nuphar/partition/subgraph_partitioner.cc | 403 + .../nuphar/partition/subgraph_partitioner.h | 56 + .../providers/nuphar/runtime/compute_ctx.cc | 43 + .../providers/nuphar/runtime/compute_ctx.h | 233 + .../runtime/control_flow/loop_exec_ctx.h | 43 + .../runtime/control_flow/scan_exec_ctx.cc | 530 + .../runtime/control_flow/scan_exec_ctx.h | 87 + .../providers/nuphar/runtime/exec_block.cc | 29 + .../providers/nuphar/runtime/exec_block.h | 54 + .../core/providers/nuphar/runtime/handle.h | 24 + .../nuphar/runtime/sequential/basic.cc | 195 + .../nuphar/runtime/sequential/basic.h | 34 + .../nuphar/runtime/sequential/loop.cc | 72 + .../nuphar/runtime/sequential/loop.h | 30 + .../core/providers/nuphar/runtime/utils.h | 72 + .../core/providers/nuphar/scripts/README.md | 25 + .../nuphar/scripts/cntk_converter.py | 81 + .../nuphar/scripts/create_shared.cmd | 65 + .../providers/nuphar/scripts/create_shared.sh | 64 + .../providers/nuphar/scripts/model_editor.py | 629 + .../nuphar/scripts/model_quantizer.py | 310 + .../providers/nuphar/scripts/node_factory.py | 153 + .../providers/nuphar/scripts/rnn_benchmark.py | 205 + .../nuphar/scripts/symbolic_shape_infer.py | 643 + onnxruntime/core/providers/nuphar/symbols.txt | 1 + .../openvino/openvino_mo/openvino_mo.py | 36 +- .../providers/tensorrt/tensorrt_allocator.h | 32 - .../tensorrt/tensorrt_execution_provider.cc | 113 +- .../tensorrt/tensorrt_execution_provider.h | 4 +- .../tensorrt/tensorrt_provider_factory.cc | 17 +- .../core/session/abi_session_options.cc | 38 +- .../session/default_cpu_allocator_c_api.cc | 14 +- onnxruntime/core/session/environment.cc | 7 + onnxruntime/core/session/inference_session.cc | 49 +- onnxruntime/core/session/inference_session.h | 11 +- onnxruntime/core/session/onnxruntime_c_api.cc | 67 +- onnxruntime/core/util/gemmlowp_common.cc | 49 + onnxruntime/core/util/gemmlowp_common.h | 65 + .../core/util/gemmlowp_common_wrapper.h | 2 + onnxruntime/core/util/math.h | 9 +- onnxruntime/core/util/math_cpu.cc | 55 +- onnxruntime/core/util/math_cpuonly.h | 14 +- .../core/util/protobuf_parsing_utils.cc | 2 + onnxruntime/core/util/qmath.cc | 33 + onnxruntime/core/util/qmath.h | 38 + .../python/onnxruntime_pybind_state.cc | 296 +- .../python/tools/quantization/quantize.py | 298 +- onnxruntime/server/environment.cc | 2 +- onnxruntime/server/executor.cc | 1 - .../automl_ops/datetimetransformer_test.cc | 97 + .../test/common/logging/logging_test.cc | 38 +- .../test/contrib_ops/matmul_integer16_test.cc | 41 + .../test/framework/TestAllocatorManager.cc | 4 +- .../test/framework/allocation_planner_test.cc | 43 +- 
.../framework/cuda/allocator_cuda_test.cc | 4 +- .../test/framework/execution_frame_test.cc | 83 +- .../test/framework/inference_session_test.cc | 83 + onnxruntime/test/framework/math_test.cc | 58 +- .../test/framework/session_state_test.cc | 147 +- .../test_tensor_loader.cc | 98 +- onnxruntime/test/framework/test_utils.cc | 3 +- onnxruntime/test/mlas/unittest.cpp | 309 +- onnxruntime/test/onnx/README.txt | 2 +- onnxruntime/test/onnx/TestCase.cc | 79 +- onnxruntime/test/onnx/TestCase.h | 5 +- onnxruntime/test/onnx/callback.cc | 16 + onnxruntime/test/onnx/callback.h | 17 + onnxruntime/test/onnx/heap_buffer.cc | 8 +- onnxruntime/test/onnx/heap_buffer.h | 8 +- onnxruntime/test/onnx/main.cc | 92 +- onnxruntime/test/onnx/mem_buffer.h | 27 + .../test/onnx/microbenchmark/model_init.cc | 222 - .../test/onnx/microbenchmark/modeltest.cc | 3 +- onnxruntime/test/onnx/runner.cc | 82 +- onnxruntime/test/onnx/tensorprotoutils.cc | 459 + onnxruntime/test/onnx/tensorprotoutils.h | 39 + .../test/optimizer/graph_transform_test.cc | 5 +- onnxruntime/test/perftest/README.md | 2 +- onnxruntime/test/perftest/TFModelInfo.cc | 2 +- .../test/perftest/command_args_parser.cc | 41 +- onnxruntime/test/perftest/ort_test_session.cc | 10 +- .../test/perftest/performance_runner.h | 2 +- .../test/perftest/test_configuration.h | 4 +- .../cpu/activation/activation_op_test.cc | 5 +- .../providers/cpu/generator/random_test.cc | 51 +- .../test/providers/cpu/math/clip_test.cc | 40 +- .../cpu/math/element_wise_ops_test.cc | 8 + .../providers/cpu/math/matmul_integer_test.cc | 63 +- .../test/providers/cpu/math/softmax_test.cc | 8 +- .../providers/cpu/ml/label_encoder_test.cc | 126 + .../cpu/nn/conv_transpose_op_test.cc | 2 + .../test/providers/cpu/nn/shrink_test.cc | 12 +- .../non_max_suppression_test.cc | 37 +- .../cpu/reduction/reduction_ops_test.cc | 35 + .../providers/cpu/rnn/deep_cpu_gru_op_test.cc | 33 + .../cpu/rnn/deep_cpu_lstm_op_test.cc | 40 + .../test/providers/cpu/rnn/rnn_op_test.cc | 54 +- .../tensor/dynamic_quantize_linear_test.cc | 51 + .../providers/cpu/tensor/nonzero_op_test.cc | 19 + .../providers/cpu/tensor/onehot_op_test.cc | 28 + .../cpu/tensor/quantize_linear_test.cc | 12 +- .../providers/cpu/tensor/resize_op_test.cc | 49 +- .../providers/cpu/tensor/upsample_op_test.cc | 30 +- onnxruntime/test/providers/memcpy_test.cc | 8 +- .../test/providers/provider_test_utils.cc | 94 +- .../test/providers/provider_test_utils.h | 55 + .../providers/tensorrt/tensorrt_basic_test.cc | 6 +- .../test/python/onnx_backend_test_series.py | 13 +- .../test/python/onnxruntime_test_python.py | 29 +- .../python/onnxruntime_test_python_nuphar.py | 111 + .../test/server/unit_tests/converter_tests.cc | 2 +- onnxruntime/test/shared_lib/test_allocator.cc | 2 +- onnxruntime/test/shared_lib/test_inference.cc | 4 +- .../test/shared_lib/test_session_options.cc | 11 +- onnxruntime/test/testdata/CNTK/gen.py | 106 +- .../test_model_with_fullonnxdomain.onnx | 18 + onnxruntime/test/util/default_providers.cc | 11 +- .../test/util/include/default_providers.h | 2 +- requirements-dev.txt | 2 + .../c_cxx/fns_candy_style_transfer/README.md | 19 +- samples/c_cxx/imagenet/main.cc | 5 +- tools/ci_build/build.py | 64 +- tools/ci_build/gen_def.py | 12 +- .../azure-pipelines-py-packaging.yml | 4 +- .../c-api-packaging-pipelines.yml | 2 +- .../azure-pipelines/linux-ci-pipeline.yml | 2 +- .../linux-gpu-tensorrt-ci-pipeline.yml | 3 +- .../azure-pipelines/mac-ci-pipeline.yml | 2 +- .../azure-pipelines/nuget/templates/cpu.yml | 10 + 
.../azure-pipelines/nuget/templates/gpu.yml | 4 +- .../nuget/templates/test_win.yml | 2 +- .../azure-pipelines/templates/esrp_dll.yml | 2 +- .../azure-pipelines/templates/esrp_nuget.yml | 2 +- .../azure-pipelines/templates/win-ci.yml | 45 +- .../azure-pipelines/templates/win-x86-ci.yml | 40 +- .../windows-build-tools-setup-steps.yml | 34 +- .../azure-pipelines/win-ci-pipeline.yml | 2 +- .../azure-pipelines/win-gpu-ci-pipeline.yml | 4 +- .../win-gpu-tensorrt-ci-pipeline.yml | 67 +- .../win-ngraph-ci-pipeline.yml | 4 +- .../github/linux/copy_strip_binary.sh | 2 + .../linux/docker/Dockerfile.ubuntu_tensorrt | 8 +- .../linux/docker/scripts/install_onnx.sh | 2 +- .../github/windows/setup_env_cuda.bat | 2 +- tools/python/gen_opkernel_doc.py | 152 + 539 files changed, 51564 insertions(+), 4146 deletions(-) create mode 100644 cmake/onnxruntime_automl_featurizers.cmake create mode 100644 cmake/onnxruntime_nuphar_extern.cmake delete mode 100644 cmake/patches/ngraph/ngraph_fix_install_error.patch delete mode 100644 cmake/patches/ngraph/ngraph_fix_library_path.patch create mode 100644 cmake/patches/ngraph/ngraph_fix_mkldnn_missing_symbol.patch rename csharp/{src/Microsoft.ML.OnnxRuntime => }/OnnxRuntime.snk (100%) create mode 100644 csharp/src/Microsoft.ML.OnnxRuntime/RunOptions.cs create mode 100644 csharp/src/Microsoft.ML.OnnxRuntime/Tensors/ArrayTensorExtensions.cs create mode 100644 csharp/src/Microsoft.ML.OnnxRuntime/Tensors/ArrayUtilities.cs create mode 100644 csharp/src/Microsoft.ML.OnnxRuntime/Tensors/DenseTensor.cs create mode 100644 csharp/src/Microsoft.ML.OnnxRuntime/Tensors/Tensor.cs create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/NativeMemory.cs create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorArithmetic.cs create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorArithmetic.tt create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorExtensions.cs create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorOperations.cs create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorOperations.tt create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTemplate.ttinclude create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTests.cs create mode 100644 csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTestsBase.cs rename dockerfiles/{ => scripts}/install_common_deps.sh (81%) create mode 100644 docs/OperatorKernels.md create mode 100644 docs/execution_providers/Nuphar-ExecutionProvider.md delete mode 100644 include/onnxruntime/core/common/callback.h create mode 100644 include/onnxruntime/core/providers/nuphar/nuphar_provider_factory.h create mode 100644 onnxruntime/automl_ops/automl_featurizers.h create mode 100644 onnxruntime/automl_ops/automl_types.cc create mode 100644 onnxruntime/automl_ops/automl_types.h create mode 100644 onnxruntime/automl_ops/cpu/datetime_transformer.cc create mode 100644 onnxruntime/automl_ops/cpu_automl_kernels.cc create mode 100644 onnxruntime/automl_ops/cpu_automl_kernels.h create mode 100644 onnxruntime/contrib_ops/cpu/matmul_integer16.cc create mode 100644 onnxruntime/contrib_ops/cpu/matmul_integer16.h create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizer.h create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.cpp create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.h 
create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/SampleAdd.cpp create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/SampleAdd.h create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/CMakeLists.txt create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/DateTimeFeaturizer_UnitTests.cpp create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/SampleAdd_UnitTest.cpp create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/code_coverage.yaml create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Traits.h create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/CMakeLists.txt create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Featurizer_UnitTest.cpp create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Traits_UnitTests.cpp create mode 100644 onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/test_main.cpp delete mode 100644 onnxruntime/core/common/task_thread_pool.h create mode 100644 onnxruntime/core/framework/callback.h create mode 100644 onnxruntime/core/graph/automl_ops/automl_defs.cc create mode 100644 onnxruntime/core/graph/automl_ops/automl_defs.h rename onnxruntime/core/mlas/lib/aarch64/{sgemma.s => SgemmKernelNeon.S} (95%) create mode 100644 onnxruntime/core/mlas/lib/amd64/AssembleAvx512Vnni.inc create mode 100644 onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx2.asm create mode 100644 onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512BW.asm create mode 100644 onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Common.inc create mode 100644 onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Vnni.asm rename onnxruntime/core/mlas/lib/arm64/{sgemma.asm => SgemmKernelNeon.asm} (91%) create mode 100644 onnxruntime/core/mlas/lib/qgemm.cpp create mode 100644 onnxruntime/core/mlas/lib/x86_64/AssembleAvx512Vnni.h create mode 100644 onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx2.S create mode 100644 onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512BW.S create mode 100644 onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Common.h create mode 100644 onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Vnni.S create mode 100644 onnxruntime/core/providers/cpu/tensor/dynamicquantizelinear.cc create mode 100644 onnxruntime/core/providers/cpu/tensor/dynamicquantizelinear.h create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/analysis.h create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/graph_stats.h create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/output_alias_analysis.cc create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/output_alias_analysis.h create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/shape_expr.h create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/subgraph_codegen_stats.cc create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/subgraph_codegen_stats.h create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/subgraph_partition_stats.cc create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/subgraph_partition_stats.h create mode 100644 onnxruntime/core/providers/nuphar/common/analysis/use_count_analysis.cc create mode 100644 
onnxruntime/core/providers/nuphar/common/analysis/use_count_analysis.h create mode 100644 onnxruntime/core/providers/nuphar/common/nuphar_settings.cc create mode 100644 onnxruntime/core/providers/nuphar/common/nuphar_settings.h create mode 100644 onnxruntime/core/providers/nuphar/common/nuphar_subgraph.h create mode 100644 onnxruntime/core/providers/nuphar/common/nuphar_tvm_utils.cc create mode 100644 onnxruntime/core/providers/nuphar/common/nuphar_tvm_utils.h create mode 100644 onnxruntime/core/providers/nuphar/common/utils.cc create mode 100644 onnxruntime/core/providers/nuphar/common/utils.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/codegen_manager.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/codegen_manager.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/func_info.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/func_info.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/initializer_info.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_codegen_ctx.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_codegen_ctx.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_compiler.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_compiler.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_handle.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_op_ir_builder.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_op_ir_builder.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_schedule_builder.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/nuphar_schedule_builder.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/traverse_shape_infer.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/traverse_shape_infer.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/gemm.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/logsoftmax.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/matmul.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/quantize/matmul_integer.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/reduce_ops.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/softmax.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/unary_ops.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/tensor/slice.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/tensor/tile.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/analysis_schedule.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/ort_type_schedule.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/partial_schedule.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_16bit.cc create mode 100644 
onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_16bit.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_8bit.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_8bit.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.cpp create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/tensorize_base.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/tensorize_utilities.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/tensorize_utilities.h create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize_schedule.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tvm_rule_schedule.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/x86_target_info.cc create mode 100644 onnxruntime/core/providers/nuphar/compiler/x86/x86_target_info.h create mode 100644 onnxruntime/core/providers/nuphar/extern/igemv_avx2.cc create mode 100644 onnxruntime/core/providers/nuphar/extern/igemv_avx2.h create mode 100644 onnxruntime/core/providers/nuphar/extern/igemv_mkl.cc create mode 100644 onnxruntime/core/providers/nuphar/extern/igemv_mkl.h create mode 100644 onnxruntime/core/providers/nuphar/kernel.cc create mode 100644 onnxruntime/core/providers/nuphar/kernel.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/halide_ops.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/halide_ops.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/logsoftmax.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/logsoftmax.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/matmul_ops.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/matmul_ops.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/reduce_ops.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/reduce_ops.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/softmax.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/softmax.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/softmax_internal.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/softmax_internal.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/unary_ops.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/math/unary_ops.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/quantize/imatmul16_extern.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/quantize/imatmul16_extern.h create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/quantize/imatmul_extern.cc create mode 100644 onnxruntime/core/providers/nuphar/mti_x86/quantize/imatmul_extern.h create mode 100644 
onnxruntime/core/providers/nuphar/nuphar_execution_provider.cc create mode 100644 onnxruntime/core/providers/nuphar/nuphar_execution_provider.h create mode 100644 onnxruntime/core/providers/nuphar/nuphar_provider_factory.cc create mode 100644 onnxruntime/core/providers/nuphar/partition/graph_partitioner.cc create mode 100644 onnxruntime/core/providers/nuphar/partition/graph_partitioner.h create mode 100644 onnxruntime/core/providers/nuphar/partition/partitioner.cc create mode 100644 onnxruntime/core/providers/nuphar/partition/partitioner.h create mode 100644 onnxruntime/core/providers/nuphar/partition/subgraph_partitioner.cc create mode 100644 onnxruntime/core/providers/nuphar/partition/subgraph_partitioner.h create mode 100644 onnxruntime/core/providers/nuphar/runtime/compute_ctx.cc create mode 100644 onnxruntime/core/providers/nuphar/runtime/compute_ctx.h create mode 100644 onnxruntime/core/providers/nuphar/runtime/control_flow/loop_exec_ctx.h create mode 100644 onnxruntime/core/providers/nuphar/runtime/control_flow/scan_exec_ctx.cc create mode 100644 onnxruntime/core/providers/nuphar/runtime/control_flow/scan_exec_ctx.h create mode 100644 onnxruntime/core/providers/nuphar/runtime/exec_block.cc create mode 100644 onnxruntime/core/providers/nuphar/runtime/exec_block.h create mode 100644 onnxruntime/core/providers/nuphar/runtime/handle.h create mode 100644 onnxruntime/core/providers/nuphar/runtime/sequential/basic.cc create mode 100644 onnxruntime/core/providers/nuphar/runtime/sequential/basic.h create mode 100644 onnxruntime/core/providers/nuphar/runtime/sequential/loop.cc create mode 100644 onnxruntime/core/providers/nuphar/runtime/sequential/loop.h create mode 100644 onnxruntime/core/providers/nuphar/runtime/utils.h create mode 100644 onnxruntime/core/providers/nuphar/scripts/README.md create mode 100644 onnxruntime/core/providers/nuphar/scripts/cntk_converter.py create mode 100644 onnxruntime/core/providers/nuphar/scripts/create_shared.cmd create mode 100644 onnxruntime/core/providers/nuphar/scripts/create_shared.sh create mode 100644 onnxruntime/core/providers/nuphar/scripts/model_editor.py create mode 100644 onnxruntime/core/providers/nuphar/scripts/model_quantizer.py create mode 100644 onnxruntime/core/providers/nuphar/scripts/node_factory.py create mode 100644 onnxruntime/core/providers/nuphar/scripts/rnn_benchmark.py create mode 100644 onnxruntime/core/providers/nuphar/scripts/symbolic_shape_infer.py create mode 100644 onnxruntime/core/providers/nuphar/symbols.txt delete mode 100755 onnxruntime/core/providers/tensorrt/tensorrt_allocator.h create mode 100644 onnxruntime/core/util/gemmlowp_common.cc create mode 100644 onnxruntime/core/util/gemmlowp_common.h create mode 100644 onnxruntime/core/util/qmath.cc create mode 100644 onnxruntime/core/util/qmath.h create mode 100644 onnxruntime/test/automl_ops/datetimetransformer_test.cc create mode 100644 onnxruntime/test/contrib_ops/matmul_integer16_test.cc rename onnxruntime/test/{shared_lib => framework}/test_tensor_loader.cc (59%) create mode 100644 onnxruntime/test/onnx/callback.cc create mode 100644 onnxruntime/test/onnx/callback.h create mode 100644 onnxruntime/test/onnx/mem_buffer.h delete mode 100644 onnxruntime/test/onnx/microbenchmark/model_init.cc create mode 100644 onnxruntime/test/onnx/tensorprotoutils.cc create mode 100644 onnxruntime/test/onnx/tensorprotoutils.h create mode 100644 onnxruntime/test/providers/cpu/tensor/dynamic_quantize_linear_test.cc create mode 100644 onnxruntime/test/python/onnxruntime_test_python_nuphar.py 
create mode 100644 onnxruntime/test/testdata/test_model_with_fullonnxdomain.onnx create mode 100644 tools/python/gen_opkernel_doc.py diff --git a/.gitmodules b/.gitmodules index 47dc9124d74dd..6eb38ef853cab 100644 --- a/.gitmodules +++ b/.gitmodules @@ -25,10 +25,6 @@ [submodule "cmake/external/re2"] path = cmake/external/re2 url = https://github.com/google/re2.git -[submodule "cmake/external/onnx-tensorrt"] - path = cmake/external/onnx-tensorrt - url = https://github.com/onnx/onnx-tensorrt.git - branch = v5.0 [submodule "cmake/external/eigen"] path = cmake/external/eigen url = https://github.com/eigenteam/eigen-git-mirror.git @@ -41,3 +37,7 @@ [submodule "cmake/external/spdlog"] path = cmake/external/spdlog url = https://github.com/gabime/spdlog.git +[submodule "cmake/external/onnx-tensorrt"] + path = cmake/external/onnx-tensorrt + url = https://github.com/onnx/onnx-tensorrt.git + branch = 5.1 diff --git a/BUILD.md b/BUILD.md index f4d1650ee03ad..f1bd922639d86 100644 --- a/BUILD.md +++ b/BUILD.md @@ -1,37 +1,9 @@ -# Build ONNX Runtime -Dockerfiles are available [here](https://github.com/microsoft/onnxruntime/tree/master/tools/ci_build/github/linux/docker) to help you get started. +# Building ONNX Runtime - Getting Started +*Dockerfiles are available [here](https://github.com/microsoft/onnxruntime/tree/master/tools/ci_build/github/linux/docker) to help you get started.* -## Supported architectures +*Pre-built packages are available at the locations indicated [here](https://github.com/microsoft/onnxruntime#official-builds).* -| | x86_32 | x86_64 | ARM32v7 | ARM64 | -|-----------|:------------:|:------------:|:------------:|:------------:| -|Windows | YES | YES | YES | YES | -|Linux | YES | YES | YES | YES | -|Mac OS X | NO | YES | NO | NO | - -## Supported dev environments - -| OS | Supports CPU | Supports GPU| Notes | -|-------------|:------------:|:------------:|------------------------------------| -|Windows 10 | YES | YES | VS2019 through the latest VS2015 are supported | -|Windows 10
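For orientation, the Windows steps above collapse into the following minimal sketch; it assumes Git, CMake 3.13+ and Visual Studio 2017 are already installed and on PATH, and it skips the optional protobuf and onnx source installs:
```
REM run from a Developer Command Prompt, in the folder where you want the source tree
git clone --recursive https://github.com/Microsoft/onnxruntime
cd onnxruntime
REM builds the shared library with the default Visual Studio 2017 generator
.\build.bat --config RelWithDebInfo --build_shared_lib --parallel
```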
Subsystem for Linux | YES | NO | | -|Ubuntu 16.x | YES | YES | Also supported on ARM32v7 (experimental) | - -* Red Hat Enterprise Linux and CentOS are not supported. -* Other version of Ubuntu might work but we don't support them officially. -* GCC 4.x and below are not supported. - -OS/Compiler Matrix: - -| OS/Compiler | Supports VC | Supports GCC | -|-------------|:------------:|:----------------:| -|Windows 10 | YES | Not tested | -|Linux | NO | YES(gcc>=5.0) | - -ONNX Runtime python binding only supports Python 3.5, 3.6 and 3.7. - -## Getting Started -You may either get a prebuilt onnxruntime from nuget.org, or do it yourself using the following steps: +## To build the baseline CPU version of ONNX Runtime from source: 1. Checkout the source tree: ``` git clone --recursive https://github.com/Microsoft/onnxruntime @@ -39,7 +11,8 @@ You may either get a prebuilt onnxruntime from nuget.org, or do it yourself usin ``` 2. Install cmake-3.13 or better from https://cmake.org/download/. -On Windows: +**On Windows:** + 3. (optional) Install protobuf 3.6.1 from source code (cmake/external/protobuf). CMake flag protobuf\_BUILD\_SHARED\_LIBS must be turned OFF. After the installation, you should have the 'protoc' executable in your PATH. 4. (optional) Install onnx from source code (cmake/external/onnx) ``` @@ -49,7 +22,10 @@ On Windows: ``` 5. Run `build.bat --config RelWithDebInfo --build_shared_lib --parallel`. -On Linux: +*Note: The default Windows CMake Generator is Visual Studio 2017, but you can also use the newer Visual Studio 2019 by passing `--cmake_generator "Visual Studio 16 2019"` to build.bat.* + +**On Linux:** + 3. (optional) Install protobuf 3.6.1 from source code (cmake/external/protobuf). CMake flag protobuf\_BUILD\_SHARED\_LIBS must be turned ON. After the installation, you should have the 'protoc' executable in your PATH. It is recommended to run `ldconfig` to make sure protobuf libraries are found. 4. If you installed your protobuf in a non standard location it would be helpful to set the following env var:`export CMAKE_ARGS="-DONNX_CUSTOM_PROTOC_EXECUTABLE=full path to protoc"` so ONNX build can find it. Also run `ldconfig ` so the linker can find protobuf libraries. 5. (optional) Install onnx from source code (cmake/external/onnx) @@ -62,46 +38,120 @@ On Linux: The build script runs all unit tests by default (for native builds and skips tests by default for cross-compiled builds). +--- + +# Supported architectures and build environments + +## Architectures + +| | x86_32 | x86_64 | ARM32v7 | ARM64 | +|-----------|:------------:|:------------:|:------------:|:------------:| +|Windows | YES | YES | YES | YES | +|Linux | YES | YES | YES | YES | +|Mac OS X | NO | YES | NO | NO | + +## Environments + +| OS | Supports CPU | Supports GPU| Notes | +|-------------|:------------:|:------------:|------------------------------------| +|Windows 10 | YES | YES | VS2019 through the latest VS2015 are supported | +|Windows 10
Subsystem for Linux | YES | NO | | +|Ubuntu 16.x | YES | YES | Also supported on ARM32v7 (experimental) | + +* Red Hat Enterprise Linux and CentOS are not supported. +* Other version of Ubuntu might work but we don't support them officially. +* GCC 4.x and below are not supported. + +### OS/Compiler Matrix: + +| OS/Compiler | Supports VC | Supports GCC | +|-------------|:------------:|:----------------:| +|Windows 10 | YES | Not tested | +|Linux | NO | YES(gcc>=5.0) | + +ONNX Runtime Python bindings support Python 3.5, 3.6 and 3.7. + +--- + +# Additional Build Instructions The complete list of build options can be found by running `./build.sh (or ./build.bat) --help` -## Build x86 - - For Windows, just add --x86 argument when launching build.bat - - For Linux, it must be built out of a x86 os, --x86 argument also needs be specified to build.sh +* [Docker on Linux](#Docker-on-Linux) +* [ONNX Runtime Server (Linux)](#Build-ONNX-Runtime-Server-on-Linux) -## Build ONNX Runtime Server on Linux +**Execution Providers** +* [NVIDIA CUDA](#CUDA) +* [NVIDIA TensorRT](#TensorRT) +* [Intel MKL-DNN/MKL-ML](#MKLDNN-and-MKLML) +* [Intel nGraph](#nGraph) +* [Intel OpenVINO](#openvino) +* [Android NNAPI](#Android) +* [Nuphar](#Nuphar) + +**Options** +* [OpenMP](#OpenMP) +* [OpenBLAS](#OpenBLAS) + +**Architectures** +* [x86](#x86) +* [ARM](#ARM) + +--- +## Docker on Linux +Install Docker: `https://docs.docker.com/install/` + +**CPU** +``` +cd tools/ci_build/github/linux/docker +docker build -t onnxruntime_dev --build-arg OS_VERSION=16.04 -f Dockerfile.ubuntu . +docker run --rm -it onnxruntime_dev /bin/bash +``` + +**GPU** +If you need GPU support, please also install: +1. nvidia driver. Before doing this please add `nomodeset rd.driver.blacklist=nouveau` to your linux [kernel boot parameters](https://www.kernel.org/doc/html/v4.17/admin-guide/kernel-parameters.html). +2. nvidia-docker2: [Install doc](`https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-2.0)`) + +To test if your nvidia-docker works: +``` +docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi +``` + +Then build a docker image. We provided a sample for use: +``` +cd tools/ci_build/github/linux/docker +docker build -t cuda_dev -f Dockerfile.ubuntu_gpu . +``` + +Then run it +``` +./tools/ci_build/github/linux/run_dockerbuild.sh +``` + +--- +## Build ONNX Runtime Server on Linux +Read more about ONNX Runtime Server [here](https://github.com/microsoft/onnxruntime/blob/master/docs/ONNX_Runtime_Server_Usage.md) 1. ONNX Runtime server (and only the server) requires you to have Go installed to build, due to building BoringSSL. See https://golang.org/doc/install for installation instructions. 2. In the ONNX Runtime root folder, run `./build.sh --config RelWithDebInfo --build_server --use_openmp --parallel` 3. ONNX Runtime Server supports sending log to [rsyslog](https://www.rsyslog.com/) daemon. To enable it, please build with an additional parameter: `--cmake_extra_defines onnxruntime_USE_SYSLOG=1`. 
The build command will look like this: `./build.sh --config RelWithDebInfo --build_server --use_openmp --parallel --cmake_extra_defines onnxruntime_USE_SYSLOG=1` +--- -## Build/Test Flavors for CI - -### CI Build Environments - -| Build Job Name | Environment | Dependency | Test Coverage | Scripts | -|--------------------|---------------------|---------------------------------|--------------------------|------------------------------------------| -| Linux_CI_Dev | Ubuntu 16.04 | python=3.5 | Unit tests; ONNXModelZoo | [script](tools/ci_build/github/linux/run_build.sh) | -| Linux_CI_GPU_Dev | Ubuntu 16.04 | python=3.5; nvidia-docker | Unit tests; ONNXModelZoo | [script](tools/ci_build/github/linux/run_build.sh) | -| Windows_CI_Dev | Windows Server 2016 | python=3.5 | Unit tests; ONNXModelZoo | [script](build.bat) | -| Windows_CI_GPU_Dev | Windows Server 2016 | cuda=9.1; cudnn=7.1; python=3.5 | Unit tests; ONNXModelZoo | [script](build.bat) | - -## Additional Build Flavors -The complete list of build flavors can be seen by running `./build.sh --help` or `./build.bat --help`. Here are some common flavors. +## Execution Providers -### Windows CMake Generator -The default generator on Windows is Visual Studio 2017, but you can also use the newer Visual Studio 2019 by passing `--cmake_generator "Visual Studio 16 2019"` to build.bat. +### CUDA +For Linux, please use [this Dockerfile](https://github.com/microsoft/onnxruntime/blob/master/tools/ci_build/github/linux/docker/Dockerfile.ubuntu_gpu) and refer to instructions above for [building with Docker on Linux](#Docker-on-Linux) -### Windows CUDA Build -ONNX Runtime supports CUDA builds. You will need to download and install [CUDA](https://developer.nvidia.com/cuda-toolkit) and [CUDNN](https://developer.nvidia.com/cudnn). +ONNX Runtime supports CUDA builds. You will need to download and install [CUDA](https://developer.nvidia.com/cuda-toolkit) and [cuDNN](https://developer.nvidia.com/cudnn). -ONNX Runtime is built and tested with CUDA 9.1 and CUDNN 7.1 using the Visual Studio 2017 14.11 toolset (i.e. Visual Studio 2017 v15.3). -CUDA versions from 9.1 up to 10.0, and CUDNN versions from 7.1 up to 7.4 should also work with Visual Studio 2017. +ONNX Runtime is built and tested with CUDA 10.0 and cuDNN 7.3 using the Visual Studio 2017 14.11 toolset (i.e. Visual Studio 2017 v15.3). +CUDA versions from 9.1 up to 10.1, and cuDNN versions from 7.1 up to 7.4 should also work with Visual Studio 2017. - The path to the CUDA installation must be provided via the CUDA_PATH environment variable, or the `--cuda_home parameter`. - - The path to the CUDNN installation (include the `cuda` folder in the path) must be provided via the CUDNN_PATH environment variable, or `--cudnn_home parameter`. The CUDNN path should contain `bin`, `include` and `lib` directories. - - The path to the CUDNN bin directory must be added to the PATH environment variable so that cudnn64_7.dll is found. + - The path to the cuDNN installation (include the `cuda` folder in the path) must be provided via the cuDNN_PATH environment variable, or `--cudnn_home parameter`. The cuDNN path should contain `bin`, `include` and `lib` directories. + - The path to the cuDNN bin directory must be added to the PATH environment variable so that cudnn64_7.dll is found. 
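As a hedged illustration of the requirements above, the environment can be prepared on Windows roughly as follows; the install paths are placeholders for wherever CUDA and cuDNN actually live on your machine:
```
REM example locations only - point these at your actual CUDA and cuDNN installs
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0
set CUDNN_PATH=C:\local\cudnn\cuda
REM make cudnn64_7.dll discoverable at runtime
set PATH=%CUDNN_PATH%\bin;%PATH%
```
With these variables set, the commands below can locate CUDA and cuDNN without the explicit `--cuda_home`/`--cudnn_home` parameters.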
You can build with: @@ -110,7 +160,7 @@ You can build with: ./build.bat --use_cuda --cudnn_home --cuda_home (Windows) ``` -Depending on compatibility between the CUDA, CUDNN, and Visual Studio 2017 versions you are using, you may need to explicitly install an earlier version of the MSVC toolset. +Depending on compatibility between the CUDA, cuDNN, and Visual Studio 2017 versions you are using, you may need to explicitly install an earlier version of the MSVC toolset. - CUDA 10.0 is known to work with toolsets from 14.11 up to 14.16 (Visual Studio 2017 15.9), and should continue to work with future Visual Studio versions - https://devblogs.microsoft.com/cppblog/cuda-10-is-now-available-with-support-for-the-latest-visual-studio-2017-versions/ - CUDA 9.2 is known to work with the 14.11 MSVC toolset (Visual Studio 15.3 and 15.4) @@ -132,30 +182,38 @@ _Side note: If you have multiple versions of CUDA installed on a Windows machine e.g. C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\Common7\IDE\VC\VCTargets\BuildCustomizations\. If you want to build with an earlier version, you must temporarily remove the 'CUDA x.y.*' files for later versions from this directory._ -### MKL-DNN/MKLML -To build ONNX Runtime with MKL-DNN support, build it with `./build.sh --use_mkldnn` -To build ONNX Runtime using MKL-DNN built with dependency on MKL small libraries, build it with `./build.sh --use_mkldnn --use_mklml` - -### nGraph -ONNX runtime with nGraph as an execution provider (released as preview) can be built on Linux as follows : `./build.sh --use_ngraph`. Similarly, on Windows use `.\build.bat --use_ngraph`. +--- ### TensorRT -ONNX Runtime supports the TensorRT execution provider (released as preview). You will need to download and install [CUDA](https://developer.nvidia.com/cuda-toolkit), [CUDNN](https://developer.nvidia.com/cudnn) and [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download). +ONNX Runtime supports the TensorRT execution provider (released as preview). You will need to download and install [CUDA](https://developer.nvidia.com/cuda-toolkit), [cuDNN](https://developer.nvidia.com/cudnn) and [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download). -The TensorRT execution provider for ONNX Runtime is built and tested with CUDA 9.0/CUDA 10.0, CUDNN 7.1 and TensorRT 5.0.2.6. +The TensorRT execution provider for ONNX Runtime is built and tested with CUDA 9.0/CUDA 10.0, cuDNN 7.1 and TensorRT 5.0.2.6. - The path to the CUDA installation must be provided via the CUDA_PATH environment variable, or the `--cuda_home parameter`. The CUDA path should contain `bin`, `include` and `lib` directories. - The path to the CUDA `bin` directory must be added to the PATH environment variable so that `nvcc` is found. - - The path to the CUDNN installation (path to folder that contains libcudnn.so) must be provided via the CUDNN_PATH environment variable, or `--cudnn_home parameter`. + - The path to the cuDNN installation (path to folder that contains libcudnn.so) must be provided via the cuDNN_PATH environment variable, or `--cudnn_home parameter`. - The path to TensorRT installation must be provided via the `--tensorrt_home parameter`. 
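To make these requirements concrete, here is a minimal sketch of preparing a Linux environment for the TensorRT build; every path is an assumption to be replaced with your own install locations, and the `./build.sh` invocation itself is shown in the next code block:
```bash
# placeholder paths - substitute the locations of your own installs
export CUDA_PATH=/usr/local/cuda-10.0           # should contain bin/, include/ and lib/
export CUDNN_PATH=/usr/lib/x86_64-linux-gnu     # folder that contains libcudnn.so
export PATH=$CUDA_PATH/bin:$PATH                # so that nvcc is found
# the TensorRT install location is passed to build.sh via --tensorrt_home
```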
You can build from source on Linux by using the following `cmd` from the onnxruntime directory: ``` -./build.sh --cudnn_home --cuda_home --use_tensorrt --tensorrt_home (Linux) - ``` -### OpenVINO Build + +--- + +### MKLDNN and MKLML +To build ONNX Runtime with MKL-DNN support, build it with `./build.sh --use_mkldnn` +To build ONNX Runtime using MKL-DNN built with dependency on MKL small libraries, build it with `./build.sh --use_mkldnn --use_mklml` + +--- + +### nGraph +ONNX Runtime with nGraph as an execution provider (released as preview) can be built on Linux as follows: `./build.sh --use_ngraph`. Similarly, on Windows use `.\build.bat --use_ngraph` + +--- + +### OpenVINO ONNX Runtime supports OpenVINO Execution Provider to enable deep learning inference using Intel® OpenVINO™ Toolkit. This execution provider supports several Intel hardware device types - CPU, integrated GPU, Intel® Movidius™ VPUs and Intel® Vision accelerator Design with 8 Intel Movidius™ MyriadX VPUs. @@ -194,58 +252,97 @@ The OpenVINO Execution Provider can be built using the following commands: | VAD-M_FP16 | Intel® Vision Accelerator Design based on 8 Movidius™ MyriadX VPUs | For more information on OpenVINO Execution Provider's ONNX Layer support, Topology support, and Intel hardware enabled, please refer to the document OpenVINO-ExecutionProvider.md in $onnxruntime_root/docs/execution_providers + +--- -### OpenBLAS -#### Windows -Instructions how to build OpenBLAS for windows can be found here https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio#build-openblas-for-universal-windows-platform. +### Android -Once you have the OpenBLAS binaries, build ONNX Runtime with `./build.bat --use_openblas` #### Cross compiling on Linux -#### Linux -For Linux (e.g. Ubuntu 16.04), install libopenblas-dev package -`sudo apt-get install libopenblas-dev` and build with `./build.sh --use_openblas` +1. Get Android NDK from https://developer.android.com/ndk/downloads. Please unzip it after downloading. -### OpenMP -``` -./build.sh --use_openmp (for Linux) -./build.bat --use_openmp (for Windows) -``` +2. Get a pre-compiled protoc: -### Build with Docker on Linux -Install Docker: `https://docs.docker.com/install/` + You may get it from https://github.com/protocolbuffers/protobuf/releases/download/v3.6.1/protoc-3.6.1-linux-x86_64.zip. Please unzip it after downloading. + +3. Denote the unzip destination in step 1 as $ANDROID_NDK, append `-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DONNX_CUSTOM_PROTOC_EXECUTABLE=path/to/protoc` to your cmake args, run cmake and make to build it. + +Note: For 32-bit devices, replace `-DANDROID_ABI=arm64-v8a` with `-DANDROID_ABI=armeabi-v7a`. + +--- -#### CPU ### Nuphar +ONNX Runtime supports the Nuphar execution provider (released as preview). It is an execution provider built on top of [TVM](https://github.com/dmlc/tvm) and [LLVM](https://llvm.org). Currently it targets x64 CPUs. + +The Nuphar execution provider for ONNX Runtime is built and tested with LLVM 6.0.1. Because of TVM's requirement when building with LLVM, you need to build LLVM from source: + +Windows with Visual Studio 2017: (Note: this builds the Release flavor; a Debug build of LLVM is needed to build the Debug flavor of ONNX Runtime) ``` -cd tools/ci_build/github/linux/docker -docker build -t onnxruntime_dev --build-arg OS_VERSION=16.04 -f Dockerfile.ubuntu . 
-docker run --rm -it onnxruntime_dev /bin/bash +REM download llvm source code 6.0.1 and unzip to \llvm\source\path, then install to \llvm\install\path +cd \llvm\source\path +mkdir build +cd build +cmake .. -G "Visual Studio 15 2017 Win64" -DLLVM_TARGETS_TO_BUILD=X86 +msbuild llvm.sln /maxcpucount /p:Configuration=Release /p:Platform=x64 +cmake -DCMAKE_INSTALL_PREFIX=\llvm\install\path -DBUILD_TYPE=Release -P cmake_install.cmake ``` -#### GPU -If you need GPU support, please also install: -1. nvidia driver. Before doing this please add `nomodeset rd.driver.blacklist=nouveau` to your linux [kernel boot parameters](https://www.kernel.org/doc/html/v4.17/admin-guide/kernel-parameters.html). -2. nvidia-docker2: [Install doc](`https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-2.0)`) +Linux: +``` +# download llvm source code 6.0.1 and unzip to /llvm/source/path, then install to /llvm/install/path +cd /llvm/source/path +mkdir build +cd build +cmake .. -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release +cmake --build . +cmake -DCMAKE_INSTALL_PREFIX=/llvm/install/path -DBUILD_TYPE=Release -P cmake_install.cmake +``` -To test if your nvidia-docker works: +Then you can build from source by using the following command from the onnxruntime directory: +Windows: ``` -docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi +build.bat --use_tvm --use_llvm --llvm_path=\llvm\install\path\lib\cmake\llvm --use_mklml --use_nuphar --build_shared_lib --build_csharp --enable_pybind --config=Release ``` -Then build a docker image. We provided a sample for use: +Linux: ``` -cd tools/ci_build/github/linux/docker -docker build -t cuda_dev -f Dockerfile.ubuntu_gpu . +./build.sh --use_tvm --use_llvm --llvm_path=/llvm/install/path/lib/cmake/llvm --use_mklml --use_nuphar --build_shared_lib --build_csharp --enable_pybind --config=Release ``` -Then run it +--- + +## Options +### OpenMP ``` -./tools/ci_build/github/linux/run_dockerbuild.sh +./build.sh --use_openmp (for Linux) +./build.bat --use_openmp (for Windows) ``` -## ARM Builds +--- + +### OpenBLAS +**Windows** +Instructions on how to build OpenBLAS for Windows can be found here: https://github.com/xianyi/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio#build-openblas-for-universal-windows-platform. + +Once you have the OpenBLAS binaries, build ONNX Runtime with `./build.bat --use_openblas` + +**Linux** +For Linux (e.g. Ubuntu 16.04), install the libopenblas-dev package +`sudo apt-get install libopenblas-dev` and build with `./build.sh --use_openblas` + +--- + +## Architectures +### x86 + - For Windows, just add the --x86 argument when launching build.bat + - For Linux, it must be built on an x86 OS; the --x86 argument also needs to be specified to build.sh + +--- + +### ARM We have experimental support for Linux ARM builds. Windows on ARM is well tested. -### Cross compiling for ARM with Docker (Linux/Windows - FASTER, RECOMMENDED) +#### Cross compiling for ARM with Docker (Linux/Windows - FASTER, RECOMMENDED) This method allows you to compile using a desktop or cloud VM. This is much faster than compiling natively and avoids out-of-memory issues that may be encountered when on lower-powered ARM devices. The resulting ONNX Runtime Python wheel (.whl) file is then deployed to an ARM device where it can be invoked in Python 3 scripts. The Dockerfile used in these instructions specifically targets Raspberry Pi 3/3+ running Raspbian Stretch. 
The same approach should work for other ARM devices, but may require some changes to the Dockerfile such as choosing a different base image (Line 0: `FROM ...`). @@ -296,7 +393,7 @@ The Dockerfile used in these instructions specifically targets Raspberry Pi 3/3+ ``` 10. Test installation by following the instructions [here](https://microsoft.github.io/onnxruntime/) -### Cross compiling on Linux (without Docker) +#### Cross compiling on Linux (without Docker) 1. Get the corresponding toolchain. For example, if your device is Raspberry Pi and the device os is Ubuntu 16.04, you may use gcc-linaro-6.3.1 from [https://releases.linaro.org/components/toolchain/binaries](https://releases.linaro.org/components/toolchain/binaries) 2. Setup env vars ```bash @@ -321,8 +418,7 @@ The Dockerfile used in these instructions specifically targets Raspberry Pi 3/3+ ``` 6. Append `-DONNX_CUSTOM_PROTOC_EXECUTABLE=/path/to/protoc -DCMAKE_TOOLCHAIN_FILE=path/to/tool.cmake` to your cmake args, run cmake and make to build it. - -### Native compiling on Linux ARM device (SLOWER) +#### Native compiling on Linux ARM device (SLOWER) Docker build runs on a Raspberry Pi 3B with Raspbian Stretch Lite OS (Desktop version will run out memory when linking the .so file) will take 8-9 hours in total. ```bash sudo apt-get update @@ -374,26 +470,10 @@ ls -l /code/onnxruntime/build/Linux/MinSizeRel/*.so ls -l /code/onnxruntime/build/Linux/MinSizeRel/dist/*.whl ``` -### Cross compiling on Windows -#### Using Visual C++ compilers +#### Cross compiling on Windows +**Using Visual C++ compilers** 1. Download and install Visual C++ compilers and libraries for ARM(64). If you have Visual Studio installed, please use the Visual Studio Installer (look under the section `Individual components` after choosing to `modify` Visual Studio) to download and install the corresponding ARM(64) compilers and libraries. 2. Use `build.bat` and specify `--arm` or `--arm64` as the build option to start building. Preferably use `Developer Command Prompt for VS` or make sure all the installed cross-compilers are findable from the command prompt being used to build using the PATH environmant variable. -### Using other compilers -(TODO) - -## Android Builds - -### Cross compiling on Linux - -1. Get Android NDK from https://developer.android.com/ndk/downloads. Please unzip it after downloading. - -2. Get a pre-compiled protoc: - - You may get it from https://github.com/protocolbuffers/protobuf/releases/download/v3.6.1/protoc-3.6.1-linux-x86_64.zip. Please unzip it after downloading. - -3. Denote the unzip destination in step 1 as $ANDROID_NDK, append `-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DONNX_CUSTOM_PROTOC_EXECUTABLE=path/to/protoc` to your cmake args, run cmake and make to build it. - -Note: For 32-bit devices, replace `-DANDROID_ABI=arm64-v8a` to `-DANDROID_ABI=armeabi-v7a`. diff --git a/README.md b/README.md index b1a715dbfe95d..5eed6744af919 100644 --- a/README.md +++ b/README.md @@ -11,15 +11,19 @@ [ONNX](https://onnx.ai) is an interoperable format for machine learning models supported by various ML and DNN frameworks and tools. The universal format makes it easier to interoperate between frameworks and maximize the reach of hardware optimization investments. 
*** +**[Key Features](#key-features)** + **Setup** * [Installation](#installation) * [APIs and Official Binaries](#apis-and-official-builds) * [Building from Source](#building-from-source) -**Getting Started** +**Usage** * [Getting ONNX Models](#getting-onnx-models) * [Deploying ONNX Runtime](#deploying-onnx-runtime) -* [Examples and Tutorials](#examples-and-tutorials) +* [Performance Tuning](#performance-tuning) + +**[Examples and Tutorials](#examples-and-tutorials)** **More Info** * [Technical Design Details](#technical-design-details) @@ -29,39 +33,43 @@ **[License](#license)** *** -## Key Features -### Run any ONNX model +# Key Features +## Run any ONNX model ONNX Runtime provides comprehensive support of the ONNX spec and can be used to run all models based on ONNX v1.2.1 and higher. See version compatibility details [here](https://github.com/microsoft/onnxruntime/blob/master/docs/Versioning.md). -*Note: Some operators not supported in the current ONNX version may be available as a [Contrib Operator](https://github.com/microsoft/onnxruntime/blob/master/docs/ContribOperators.md)* - **Traditional ML support** In addition to DNN models, ONNX Runtime fully supports the [ONNX-ML profile](https://github.com/onnx/onnx/blob/master/docs/Operators-ml.md) of the ONNX spec for traditional ML scenarios. -### High Performance +For the full set of operators and types supported, please see [operator documentation](https://github.com/microsoft/onnxruntime/blob/master/docs/OperatorKernels.md) + +*Note: Some operators not supported in the current ONNX version may be available as a [Contrib Operator](https://github.com/microsoft/onnxruntime/blob/master/docs/ContribOperators.md)* + + +## High Performance ONNX Runtime supports both CPU and GPU. Using various graph optimizations and accelerators, ONNX Runtime can provide lower latency compared to other runtimes for faster end-to-end customer experiences and minimized machine utilization costs. Currently ONNX Runtime supports the following accelerators: -* CPU - * MLAS (Microsoft Linear Algebra Subprograms) - * MKL-DNN - * MKL-ML - * [Intel nGraph](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/nGraph-ExecutionProvider.md) -* GPU - * CUDA - * [TensorRT](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/TensorRT-ExecutionProvider.md) +* MLAS (Microsoft Linear Algebra Subprograms) +* [MKL-DNN](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/MKL-DNN-ExecutionProvider.md) - [subgraph optimization](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/MKL-DNN-Subgraphs.md) +* MKL-ML +* [Intel nGraph](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/nGraph-ExecutionProvider.md) +* CUDA +* [TensorRT](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/TensorRT-ExecutionProvider.md) +* [OpenVINO](https://github.com/microsoft/onnxruntime/blob/master/docs/execution_providers/OpenVINO-ExecutionProvider.md) +* [Nuphar](docs/execution_providers/Nuphar-ExecutionProvider.md) -Not all variations are supported in the [official release builds](#apis-and-official-builds), but can be built from source following [these instructions](https://github.com/Microsoft/onnxruntime/blob/master/BUILD.md). 
+Not all variations are supported in the [official release builds](#apis-and-official-builds), but can be built from source following [these instructions](https://github.com/Microsoft/onnxruntime/blob/master/BUILD.md). Find Dockerfiles [here](https://github.com/microsoft/onnxruntime/tree/master/dockerfiles). We are continuously working to integrate new execution providers for further improvements in latency and efficiency. If you are interested in contributing a new execution provider, please see [this page](docs/AddingExecutionProvider.md). -### Cross Platform + +## Cross Platform [API documentation and package installation](https://github.com/microsoft/onnxruntime#installation) ONNX Runtime is available for Linux, Windows, Mac with Python, C#, and C APIs, with more to come! If you have specific scenarios that are not currently supported, please share your suggestions and scenario details via [Github Issues](https://github.com/microsoft/onnxruntime/issues). - +*** # Installation **Quick Start:** The [ONNX-Ecosystem Docker container image](https://github.com/onnx/onnx-docker/tree/master/onnx-ecosystem) is available on Dockerhub and includes ONNX Runtime (CPU, Python), dependencies, tools to convert from various frameworks, and Jupyter notebooks to help get started. @@ -80,7 +88,7 @@ Additional dockerfiles for some features can be found [here](https://github.com/ |---|:---|:---|:---| | **Python** | **[pypi: onnxruntime](https://pypi.org/project/onnxruntime)**

Windows (x64)
Linux (x64)
Mac OS X (x64) | -- | **[pypi: onnxruntime-gpu](https://pypi.org/project/onnxruntime-gpu)**

Windows (x64)
Linux (x64) | | **C#** | **[Nuget: Microsoft.ML.OnnxRuntime](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime/)**

Windows (x64, x86)
Linux (x64, x86)
Mac OS X (x64) | **[Nuget: Microsoft.ML.OnnxRuntime.MKLML](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.MKLML/)**

Windows (x64)
Linux (x64)
Mac OS X (x64) | **[Nuget: Microsoft.ML.OnnxRuntime.Gpu](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.Gpu/)**

Windows (x64)
Linux (x64) | -| **C** | **[Nuget: Microsoft.ML.OnnxRuntime](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime)**

**[.zip, .tgz](https://aka.ms/onnxruntime-release)**

Windows (x64, x86)
Linux (x64, x86)
Mac OS X (x64 | **[Nuget: Microsoft.ML.OnnxRuntime.MKLML](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.MKLML/)**

Windows (x64)
Linux (x64)
Mac OS X (x64) | **[Nuget: Microsoft.ML.OnnxRuntime.Gpu](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.Gpu/)**

**[.zip, .tgz](https://aka.ms/onnxruntime-release)**

Windows (x64)
Linux (x64) | +| **C/C++ wrapper** | **[Nuget: Microsoft.ML.OnnxRuntime](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime)**

**[.zip, .tgz](https://aka.ms/onnxruntime-release)**

Windows (x64, x86)
Linux (x64, x86)
Mac OS X (x64) | **[Nuget: Microsoft.ML.OnnxRuntime.MKLML](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.MKLML/)**

Windows (x64)
Linux (x64)
Mac OS X (x64) | **[Nuget: Microsoft.ML.OnnxRuntime.Gpu](https://www.nuget.org/packages/Microsoft.ML.OnnxRuntime.Gpu/)**

**[.zip, .tgz](https://aka.ms/onnxruntime-release)**

Windows (x64)
Linux (x64) | #### System Requirements (pre-requisite dependencies) * ONNX Runtime binaries in the CPU packages use OpenMP and depend on the library being available at runtime in the @@ -88,20 +96,26 @@ system. * For Windows, **OpenMP** support comes as part of VC runtime. It is also available as redist packages: [vc_redist.x64.exe](https://aka.ms/vs/15/release/vc_redist.x64.exe) and [vc_redist.x86.exe](https://aka.ms/vs/15/release/vc_redist.x86.exe) * For Linux, the system must have **libgomp.so.1** which can be installed using `apt-get install libgomp1`. -* GPU builds require the **CUDA 10.0 and cuDNN 7.3** runtime libraries being installed on the system. Older releases used 9.1/7.1 - please refer to [release notes](https://github.com/microsoft/onnxruntime/releases) for more details. -* Python binaries are compatible with **Python 3.5-3.7**. See [Python Dev Notes](https://github.com/microsoft/onnxruntime/blob/master/docs/Python_Dev_Notes.md) +* GPU builds require CUDA runtime libraries being installed on the system: + * Version: **CUDA 10.0** and **cuDNN 7.3** + * Linux Python packages require **CUDA 10.1** and **cuDNN 7.6** + * Older ONNX Runtime releases: used **CUDA 9.1** and **cuDNN 7.1** - please refer to [prior release notes](https://github.com/microsoft/onnxruntime/releases) for more details. +* Python binaries are compatible with **Python 3.5-3.7**. See [Python Dev Notes](https://github.com/microsoft/onnxruntime/blob/master/docs/Python_Dev_Notes.md). If using `pip` to be download the Python binaries, run `pip install --upgrade pip` prior to downloading. * Certain operators makes use of system locales. Installation of the **English language package** and configuring `en_US.UTF-8 locale` is required. * For Ubuntu install [language-pack-en package](https://packages.ubuntu.com/search?keywords=language-pack-en) * Run the following commands: `locale-gen en_US.UTF-8` `update-locale LANG=en_US.UTF-8` * Follow similar procedure to configure other locales on other platforms. - + ## Building from Source If additional build flavors are needed, please find instructions on building from source at [Build ONNX Runtime](BUILD.md). For production scenarios, it's strongly recommended to build from an [official release branch](https://github.com/microsoft/onnxruntime/releases). Dockerfiles are available [here](https://github.com/microsoft/onnxruntime/tree/faxu-doc-updates/tools/ci_build/github/linux/docker) to help you get started. +*** +# Usage + ## Getting ONNX Models * The [ONNX Model Zoo](https://github.com/onnx/models) has popular ready-to-use pre-trained models. * To export or convert a trained ONNX model trained from various frameworks, see [ONNX Tutorials](https://github.com/onnx/tutorials). Versioning comptability information can be found under [Versioning](docs/Versioning.md#tool-compatibility) @@ -115,8 +129,12 @@ ONNX Runtime can be deployed to the cloud for model inferencing using [Azure Mac **ONNX Runtime Server (beta)** is a hosted application for serving ONNX models using ONNX Runtime, providing a REST API for prediction. Usage details can be found [here](https://github.com/microsoft/onnxruntime/blob/master/docs/ONNX_Runtime_Server_Usage.md), and image installation instructions are [here](https://github.com/microsoft/onnxruntime/tree/master/dockerfiles#onnx-runtime-server-preview). 
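Pulling the Linux items from the System Requirements above into one place, a typical Ubuntu setup looks roughly like this; the package names and locale commands are the ones listed above, and `sudo` access is assumed:
```bash
# OpenMP runtime required by the CPU packages
sudo apt-get install -y libgomp1
# English language pack and en_US.UTF-8 locale, used by certain operators
sudo apt-get install -y language-pack-en
sudo locale-gen en_US.UTF-8
sudo update-locale LANG=en_US.UTF-8
# keep pip current before installing the Python binaries
pip install --upgrade pip
```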
-## Examples and Tutorials -### Python +## Performance Tuning +ONNX Runtime is open and extensible, supporting a broad set of configurations and execution providers for model acceleration. For performance tuning guidance, please see [this page](https://github.com/microsoft/onnxruntime/blob/master/docs/ONNX_Runtime_Perf_Tuning.md). + +*** +# Examples and Tutorials +## Python * [Basic Inferencing Sample](https://github.com/onnx/onnx-docker/blob/master/onnx-ecosystem/inference_demos/simple_onnxruntime_inference.ipynb) * [Inferencing (Resnet50)](https://github.com/onnx/onnx-docker/blob/master/onnx-ecosystem/inference_demos/resnet50_modelzoo_onnxruntime_inference.ipynb) * [Inferencing samples](https://github.com/onnx/onnx-docker/tree/master/onnx-ecosystem/inference_demos) using [ONNX-Ecosystem Docker image](https://github.com/onnx/onnx-docker/tree/master/onnx-ecosystem) @@ -127,21 +145,29 @@ ONNX Runtime can be deployed to the cloud for model inferencing using [Azure Mac **Deployment with AzureML** -* Inferencing: [Inferencing Facial Expression Recognition](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-inference-facial-expression-recognition-deploy.ipynb), [Inferencing MNIST Handwritten Digits](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-inference-mnist-deploy.ipynb), [ Resnet50 Image Classification](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-modelzoo-aml-deploy-resnet50.ipynb), [TinyYolo](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-convert-aml-deploy-tinyyolo.ipynb) -* [Train and Inference MNIST from Pytorch](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-train-pytorch-aml-deploy-mnist.ipynb) -* [FER+ on Azure Kubernetes Service with TensorRT](https://github.com/microsoft/onnxruntime/blob/master/docs/python/notebooks/onnx-inference-byoc-gpu-cpu-aks.ipynb) - - -### C# +* Inferencing using [ONNX Model Zoo](https://github.com/onnx/models) models: + * [Facial Expression Recognition](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-inference-facial-expression-recognition-deploy.ipynb) + * [MNIST Handwritten Digits](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-inference-mnist-deploy.ipynb) + * [Resnet50 Image Classification](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-modelzoo-aml-deploy-resnet50.ipynb) +* Convert existing model for Inferencing: + * [TinyYolo](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-convert-aml-deploy-tinyyolo.ipynb) +* Train a model with PyTorch and Inferencing: + * [MNIST](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/onnx/onnx-train-pytorch-aml-deploy-mnist.ipynb) + +* Inferencing with TensorRT Execution Provider on GPU (AKS) + * [FER+](https://github.com/microsoft/onnxruntime/blob/master/docs/python/notebooks/onnx-inference-byoc-gpu-cpu-aks.ipynb) + + +## C# * [Inferencing Tutorial](https://github.com/microsoft/onnxruntime/blob/master/docs/CSharp_API.md#getting-started) -### C/C++ +## C/C++ * [Basic Inferencing (SqueezeNet) - 
C](https://github.com/microsoft/onnxruntime/blob/master/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/C_Api_Sample.cpp) * [Basic Inferencing (SqueezeNet) - C++](https://github.com/microsoft/onnxruntime/blob/master/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp) * [Inferencing (MNIST) - C++](https://github.com/microsoft/onnxruntime/tree/master/samples/c_cxx/MNIST) - +*** # Technical Design Details * [High level architectural design](docs/HighLevelDesign.md) * [Versioning](docs/Versioning.md) @@ -153,6 +179,7 @@ ONNX Runtime can be deployed to the cloud for model inferencing using [Azure Mac transform](include/onnxruntime/core/optimizer/graph_transformer.h) * [Add a new rewrite rule](include/onnxruntime/core/optimizer/rewrite_rule.h) +*** # Contribute We welcome contributions! Please see the [contribution guidelines](CONTRIBUTING.md). @@ -163,6 +190,6 @@ For any feedback or to report a bug, please file a [GitHub Issue](https://github This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. - +*** # License [MIT License](LICENSE) diff --git a/cgmanifest.json b/cgmanifest.json index 2fd8a43254e26..410f8210e1c27 100644 --- a/cgmanifest.json +++ b/cgmanifest.json @@ -49,7 +49,7 @@ "component":{ "type":"git", "git": { - "commitHash": "65b8e0f9979fbade16e3becbdfa69c0764946f72", + "commitHash": "7d90796473295ca3cdf976ed772215c5980ad3e0", "repositoryUrl": "https://github.com/onnx/onnx.git" } } diff --git a/cmake/CMakeLists.txt b/cmake/CMakeLists.txt index 9ee7470b0b2b3..04ac0cb65aa5c 100644 --- a/cmake/CMakeLists.txt +++ b/cmake/CMakeLists.txt @@ -50,9 +50,10 @@ option(onnxruntime_USE_OPENVINO "Build with OpenVINO support" OFF) option(onnxruntime_USE_NSYNC "Build with NSYNC support. This option only takes effect on Linux" OFF) option(onnxruntime_USE_EIGEN_FOR_BLAS "Use eign for blas" ON) option(onnxruntime_USE_NNAPI "Build with DNNLibrary for Android NNAPI support" OFF) -option(onnxruntime_USE_MLAS "Use optimized blas library for GEMM and 2D Convolution" ON) option(onnxruntime_USE_MKLDNN "Build with MKL-DNN support" OFF) option(onnxruntime_USE_MKLML "Build MKL-DNN with MKL-ML binary dependency" OFF) +option(onnxruntime_USE_GEMMLOWP "Build with gemmlowp for quantized gemm" OFF) +option(onnxruntime_USE_AUTOML "Build AutoML support" ON) option(onnxruntime_USE_NGRAPH "Build with nGraph support" OFF) option(onnxruntime_USE_OPENBLAS "Use openblas" OFF) option(onnxruntime_DEV_MODE "Enable developer warnings and treat most of them as error." 
OFF) @@ -349,8 +350,13 @@ if (onnxruntime_USE_TVM) add_definitions(-DUSE_TVM) set(onnxruntime_tvm_libs onnxruntime_codegen_tvm) - list(APPEND onnxruntime_EXTERNAL_LIBRARIES tvm nnvm_compiler) - + # needs to link with stdc++fs in Linux + if(UNIX) + if (NOT APPLE) + set(FS_STDLIB stdc++fs) + endif() + endif() + list(APPEND onnxruntime_EXTERNAL_LIBRARIES tvm nnvm_compiler ${FS_STDLIB}) list(APPEND onnxruntime_EXTERNAL_DEPENDENCIES tvm nnvm_compiler) endif() @@ -367,10 +373,6 @@ if (onnxruntime_RUN_ONNX_TESTS) add_definitions(-DORT_RUN_EXTERNAL_ONNX_TESTS) endif() -if (onnxruntime_USE_MLAS) - add_definitions(-DUSE_MLAS) -endif() - #Adjust warning flags if (WIN32) add_definitions(-DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES) @@ -476,6 +478,10 @@ if (onnxruntime_USE_MKLDNN OR onnxruntime_USE_MKLML) include(mkldnn) endif() +if(onnxruntime_USE_GEMMLOWP) + add_definitions(-DUSE_GEMMLOWP=1) +endif() + if (onnxruntime_USE_MKLML) add_definitions(-DUSE_MKLML=1 -DUSE_MKLML_FOR_BLAS=1) if (WIN32 OR APPLE) @@ -646,6 +652,12 @@ include(onnxruntime_optimizer.cmake) include(onnxruntime_session.cmake) include(onnxruntime_mlas.cmake) +if(onnxruntime_USE_AUTOML) + add_definitions(-DMICROSOFT_AUTOML) + # Build shared featurizer library + include(onnxruntime_automl_featurizers.cmake) +endif() + if(WIN32) list(APPEND onnxruntime_EXTERNAL_LIBRARIES Shlwapi) list(APPEND onnxruntime_EXTERNAL_LIBRARIES debug Dbghelp) diff --git a/cmake/external/mkldnn.cmake b/cmake/external/mkldnn.cmake index 364ba88a891c8..e3a638cc7a183 100644 --- a/cmake/external/mkldnn.cmake +++ b/cmake/external/mkldnn.cmake @@ -11,6 +11,8 @@ if(WIN32) set(MKLDNN_SHARED_LIB mkldnn.dll) set(MKLDNN_IMPORT_LIB mkldnn.lib) if(onnxruntime_USE_MKLML) + # Windows-only updated MKLML binary which contains fix for thread cleanup hang. + set(MKLML_VERSION 2020.0.20190813) set(MKLML_SHARED_LIB mklml.dll) set(MKLML_IMPORT_LIB mklml.lib) set(IOMP5MD_SHARED_LIB libiomp5md.dll) @@ -59,15 +61,15 @@ if (onnxruntime_USE_MKLDNN) set(MKLDNN_DLL_PATH ${MKLDNN_LIB_DIR}/${MKLDNN_SHARED_LIB}) endif() set(MKLDNN_INCLUDE_DIR ${MKLDNN_INSTALL}/include) - set (MKLDNN_CMAKE_EXTRA_ARGS) + set(MKLDNN_CMAKE_EXTRA_ARGS) + set(MKLDNN_PATCH_COMMAND1 git apply ${CMAKE_SOURCE_DIR}/patches/mkldnn/mem-patch.cmake.patch) + # discard prior changes due to patching in mkldnn source to unblock incremental builds. + set(MKLDNN_PATCH_DISCARD_COMMAND cd ${MKLDNN_SOURCE} && git checkout -- .) if(NOT onnxruntime_BUILD_FOR_NATIVE_MACHINE) # pre-v1.0 list(APPEND MKLDNN_CMAKE_EXTRA_ARGS "-DARCH_OPT_FLAGS=") # v1.0 list(APPEND MKLDNN_CMAKE_EXTRA_ARGS "-DMKLDNN_ARCH_OPT_FLAGS=") - set(MKLDNN_PATCH_COMMAND1 git apply ${CMAKE_SOURCE_DIR}/patches/mkldnn/mem-patch.cmake.patch) - # discard prior changes due to patching in mkldnn source to unblock incremental builds. - set(MKLDNN_PATCH_DISCARD_COMMAND cd ${MKLDNN_SOURCE} && git checkout -- .) endif() ExternalProject_Add(project_mkldnn PREFIX mkl-dnn diff --git a/cmake/external/ngraph.cmake b/cmake/external/ngraph.cmake index 12d0b6e1431db..45aae7d44f512 100644 --- a/cmake/external/ngraph.cmake +++ b/cmake/external/ngraph.cmake @@ -11,7 +11,7 @@ set(ngraph_SRC ${CMAKE_CURRENT_BINARY_DIR}/ngraph/src/project_ngraph) set(prebuilt_ONNX_SOURCE_DIR "${PROJECT_SOURCE_DIR}/external/onnx") set(prebuilt_ONNX_BINARY_DIR "${CMAKE_CURRENT_BINARY_DIR}/onnx") set(ngraph_URL "https://github.com/NervanaSystems/ngraph.git") -set(ngraph_TAG "v0.18.1") +set(ngraph_TAG "v0.22.1") # Libraries for python package. 
if (WIN32) @@ -42,7 +42,7 @@ else() endif() # discard prior changes due to unblock incremental builds. -set(NGRAPH_PATCH_DISCARD_COMMAND cd ${ngraph_SRC} && git checkout -- .) +set(NGRAPH_PATCH_DISCARD_COMMAND cd ${ngraph_SRC} && git reset HEAD --hard && git clean -fx) if (MSVC) set(prebuilt_ONNX_BINARY_DIR "${CMAKE_CURRENT_BINARY_DIR}/onnx/${CMAKE_BUILD_TYPE}") @@ -54,12 +54,12 @@ if (MSVC) PREFIX ngraph GIT_REPOSITORY ${ngraph_URL} GIT_TAG ${ngraph_TAG} + GIT_CONFIG core.autocrlf=input PATCH_COMMAND ${NGRAPH_PATCH_DISCARD_COMMAND} COMMAND ${CMAKE_COMMAND} -E copy ${PROJECT_SOURCE_DIR}/patches/ngraph/ngraph_onnx.cmake ${ngraph_SRC}/cmake/external_onnx.cmake COMMAND git apply --ignore-space-change --ignore-whitespace ${PROJECT_SOURCE_DIR}/patches/ngraph/ngraph_protobuf.patch - COMMAND git apply --ignore-space-change --ignore-whitespace ${PROJECT_SOURCE_DIR}/patches/ngraph/ngraph_fix_install_error.patch - COMMAND git apply --ignore-space-change --ignore-whitespace ${PROJECT_SOURCE_DIR}/patches/ngraph/ngraph_fix_library_path.patch COMMAND git apply --ignore-space-change --ignore-whitespace ${PROJECT_SOURCE_DIR}/patches/ngraph/ngraph_fix_memory.patch + COMMAND git apply --ignore-space-change --ignore-whitespace ${PROJECT_SOURCE_DIR}/patches/ngraph/ngraph_fix_mkldnn_missing_symbol.patch CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} -DNGRAPH_DEX_ONLY=ON diff --git a/cmake/external/onnx b/cmake/external/onnx index 65b8e0f9979fb..7d90796473295 160000 --- a/cmake/external/onnx +++ b/cmake/external/onnx @@ -1 +1 @@ -Subproject commit 65b8e0f9979fbade16e3becbdfa69c0764946f72 +Subproject commit 7d90796473295ca3cdf976ed772215c5980ad3e0 diff --git a/cmake/external/onnx-tensorrt b/cmake/external/onnx-tensorrt index 3aa0a1cb41fae..6c37109733a9b 160000 --- a/cmake/external/onnx-tensorrt +++ b/cmake/external/onnx-tensorrt @@ -1 +1 @@ -Subproject commit 3aa0a1cb41fae88b7787b6289a729ed9046a18e4 +Subproject commit 6c37109733a9bbf8211f0ca78a85804cb376eca0 diff --git a/cmake/external/tvm b/cmake/external/tvm index fd4801612817f..b4bff71f36eca 160000 --- a/cmake/external/tvm +++ b/cmake/external/tvm @@ -1 +1 @@ -Subproject commit fd4801612817f96e890058656834deb925fc064a +Subproject commit b4bff71f36eca1e840dd280ba485cad186718844 diff --git a/cmake/onnxruntime.cmake b/cmake/onnxruntime.cmake index 91508a8aa8f57..8a6bf402e1488 100644 --- a/cmake/onnxruntime.cmake +++ b/cmake/onnxruntime.cmake @@ -19,13 +19,13 @@ foreach(f ${ONNXRUNTIME_PROVIDER_NAMES}) list(APPEND SYMBOL_FILES "${ONNXRUNTIME_ROOT}/core/providers/${f}/symbols.txt") endforeach() -add_custom_command(OUTPUT ${SYMBOL_FILE} - COMMAND ${PYTHON_EXECUTABLE} "${REPO_ROOT}/tools/ci_build/gen_def.py" --version_file "${ONNXRUNTIME_ROOT}/../VERSION_NUMBER" --src_root "${ONNXRUNTIME_ROOT}" --config ${ONNXRUNTIME_PROVIDER_NAMES} --style=${OUTPUT_STYLE} --output ${SYMBOL_FILE} +add_custom_command(OUTPUT ${SYMBOL_FILE} ${CMAKE_CURRENT_BINARY_DIR}/generated_source.c + COMMAND ${PYTHON_EXECUTABLE} "${REPO_ROOT}/tools/ci_build/gen_def.py" --version_file "${ONNXRUNTIME_ROOT}/../VERSION_NUMBER" --src_root "${ONNXRUNTIME_ROOT}" --config ${ONNXRUNTIME_PROVIDER_NAMES} --style=${OUTPUT_STYLE} --output ${SYMBOL_FILE} --output_source ${CMAKE_CURRENT_BINARY_DIR}/generated_source.c DEPENDS ${SYMBOL_FILES} WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}) -add_custom_target(onnxruntime_generate_def ALL DEPENDS ${SYMBOL_FILE}) -add_library(onnxruntime SHARED ${onnxruntime_session_srcs}) +add_custom_target(onnxruntime_generate_def ALL DEPENDS ${SYMBOL_FILE} 
${CMAKE_CURRENT_BINARY_DIR}/generated_source.c) +add_library(onnxruntime SHARED ${CMAKE_CURRENT_BINARY_DIR}/generated_source.c) set_target_properties(onnxruntime PROPERTIES VERSION ${ORT_VERSION}) add_dependencies(onnxruntime onnxruntime_generate_def ${onnxruntime_EXTERNAL_DEPENDENCIES}) target_include_directories(onnxruntime PRIVATE ${ONNXRUNTIME_ROOT}) @@ -37,12 +37,8 @@ endif() if(UNIX) if (APPLE) - set(BEGIN_WHOLE_ARCHIVE -Xlinker -all_load) - set(END_WHOLE_ARCHIVE -Xlinker -noall_load) set(ONNXRUNTIME_SO_LINK_FLAG "-Xlinker -dead_strip") else() - set(BEGIN_WHOLE_ARCHIVE -Xlinker --whole-archive) - set(END_WHOLE_ARCHIVE -Xlinker --no-whole-archive) set(ONNXRUNTIME_SO_LINK_FLAG "-Xlinker --version-script=${SYMBOL_FILE} -Xlinker --no-undefined -Xlinker --gc-sections") endif() else() @@ -59,7 +55,7 @@ endif() #The BEGIN_WHOLE_ARCHIVE/END_WHOLE_ARCHIVE part should contain the implementations of all the C API functions target_link_libraries(onnxruntime PRIVATE - ${BEGIN_WHOLE_ARCHIVE} + onnxruntime_session ${onnxruntime_libs} ${PROVIDERS_CUDA} ${PROVIDERS_MKLDNN} @@ -67,12 +63,12 @@ target_link_libraries(onnxruntime PRIVATE ${PROVIDERS_NNAPI} ${PROVIDERS_TENSORRT} ${PROVIDERS_OPENVINO} + ${PROVIDERS_NUPHAR} onnxruntime_optimizer onnxruntime_providers onnxruntime_util ${onnxruntime_tvm_libs} onnxruntime_framework - ${END_WHOLE_ARCHIVE} onnxruntime_graph onnxruntime_common onnxruntime_mlas diff --git a/cmake/onnxruntime_automl_featurizers.cmake b/cmake/onnxruntime_automl_featurizers.cmake new file mode 100644 index 0000000000000..daffe92842826 --- /dev/null +++ b/cmake/onnxruntime_automl_featurizers.cmake @@ -0,0 +1,44 @@ +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. +# This source code should not depend on the onnxruntime and may be built independently + +file(GLOB automl_featurizers_srcs CONFIGURE_DEPENDS + "${ONNXRUNTIME_ROOT}/core/automl/featurizers/src/FeaturizerPrep/*.h" + "${ONNXRUNTIME_ROOT}/core/automl/featurizers/src/FeaturizerPrep/Featurizers/*.h" + "${ONNXRUNTIME_ROOT}/core/automl/featurizers/src/FeaturizerPrep/Featurizers/*.cpp" +) + +source_group(TREE ${ONNXRUNTIME_ROOT}/core/automl/ FILES ${onnxruntime_automl_featurizers_srcs}) + +add_library(automl_featurizers ${automl_featurizers_srcs}) + +target_include_directories(automl_featurizers PRIVATE ${ONNXRUNTIME_ROOT} PUBLIC ${CMAKE_CURRENT_BINARY_DIR}) + +set_target_properties(automl_featurizers PROPERTIES FOLDER "AutoMLFeaturizers") + +# Individual featurizers unit tests added at bulk +file(GLOB automl_featurizers_tests_srcs + "${ONNXRUNTIME_ROOT}/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/*.cpp" +) + +list(APPEND automl_featurizers_tests_srcs + "${ONNXRUNTIME_ROOT}/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Traits_UnitTests.cpp" + "${ONNXRUNTIME_ROOT}/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Featurizer_UnitTest.cpp" + "${ONNXRUNTIME_ROOT}/core/automl/featurizers/src/FeaturizerPrep/UnitTests/test_main.cpp" +) + +add_executable(automl_featurizers_unittests ${automl_featurizers_tests_srcs}) +add_dependencies(automl_featurizers_unittests automl_featurizers) +target_link_libraries(automl_featurizers_unittests PRIVATE gtest automl_featurizers) +source_group(TREE ${ONNXRUNTIME_ROOT}/core/automl/ FILES ${automl_featurizers_tests_srcs}) +set_target_properties(automl_featurizers_unittests PROPERTIES FOLDER "AutoMLFeaturizers") +add_test(NAME automl_featurizers_unittests + COMMAND automl_featurizers_unittests + WORKING_DIRECTORY $ +) + + +if 
(WIN32) + # Add Code Analysis properties to enable C++ Core checks. Have to do it via a props file include. + set_target_properties(automl_featurizers PROPERTIES VS_USER_PROPS ${PROJECT_SOURCE_DIR}/ConfigureVisualStudioCodeAnalysis.props) +endif() diff --git a/cmake/onnxruntime_common.cmake b/cmake/onnxruntime_common.cmake index 0799ab9a6c79e..133ea4b60bf16 100644 --- a/cmake/onnxruntime_common.cmake +++ b/cmake/onnxruntime_common.cmake @@ -53,11 +53,12 @@ target_include_directories(onnxruntime_common PRIVATE ${CMAKE_CURRENT_BINARY_DIR if(onnxruntime_USE_NSYNC) target_compile_definitions(onnxruntime_common PUBLIC USE_NSYNC) endif() -if(onnxruntime_USE_EIGEN_THREADPOOL) - target_include_directories(onnxruntime_common PRIVATE ${eigen_INCLUDE_DIRS}) - target_compile_definitions(onnxruntime_common PUBLIC USE_EIGEN_THREADPOOL) - add_dependencies(onnxruntime_common ${onnxruntime_EXTERNAL_DEPENDENCIES}) + +target_include_directories(onnxruntime_common PUBLIC ${eigen_INCLUDE_DIRS}) +if(NOT onnxruntime_USE_OPENMP) + target_compile_definitions(onnxruntime_common PUBLIC EIGEN_USE_THREADS) endif() +add_dependencies(onnxruntime_common ${onnxruntime_EXTERNAL_DEPENDENCIES}) install(DIRECTORY ${PROJECT_SOURCE_DIR}/../include/onnxruntime/core/common DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/onnxruntime/core) set_target_properties(onnxruntime_common PROPERTIES LINKER_LANGUAGE CXX) diff --git a/cmake/onnxruntime_graph.cmake b/cmake/onnxruntime_graph.cmake index 366eadf680fff..4c05a3307bff0 100644 --- a/cmake/onnxruntime_graph.cmake +++ b/cmake/onnxruntime_graph.cmake @@ -14,6 +14,13 @@ if (onnxruntime_DISABLE_CONTRIB_OPS) ) endif() +if(NOT onnxruntime_USE_AUTOML) + list(REMOVE_ITEM onnxruntime_graph_src + "${ONNXRUNTIME_ROOT}/core/graph/automl_ops/*.h" + "${ONNXRUNTIME_ROOT}/core/graph/automl_ops/*.cc" + ) +endif() + file(GLOB_RECURSE onnxruntime_ir_defs_src CONFIGURE_DEPENDS "${ONNXRUNTIME_ROOT}/core/defs/*.cc" ) @@ -21,6 +28,7 @@ file(GLOB_RECURSE onnxruntime_ir_defs_src CONFIGURE_DEPENDS add_library(onnxruntime_graph ${onnxruntime_graph_src} ${onnxruntime_ir_defs_src}) add_dependencies(onnxruntime_graph onnx_proto gsl) onnxruntime_add_include_to_target(onnxruntime_graph onnxruntime_common gsl onnx onnx_proto protobuf::libprotobuf) + target_include_directories(onnxruntime_graph PRIVATE ${ONNXRUNTIME_ROOT}) set_target_properties(onnxruntime_graph PROPERTIES FOLDER "ONNXRuntime") set_target_properties(onnxruntime_graph PROPERTIES LINKER_LANGUAGE CXX) diff --git a/cmake/onnxruntime_mlas.cmake b/cmake/onnxruntime_mlas.cmake index 619a4c3d08dc9..a29cd85a94f7e 100644 --- a/cmake/onnxruntime_mlas.cmake +++ b/cmake/onnxruntime_mlas.cmake @@ -4,6 +4,7 @@ set(mlas_common_srcs ${ONNXRUNTIME_ROOT}/core/mlas/lib/platform.cpp ${ONNXRUNTIME_ROOT}/core/mlas/lib/threading.cpp + ${ONNXRUNTIME_ROOT}/core/mlas/lib/qgemm.cpp ${ONNXRUNTIME_ROOT}/core/mlas/lib/sgemm.cpp ${ONNXRUNTIME_ROOT}/core/mlas/lib/convolve.cpp ${ONNXRUNTIME_ROOT}/core/mlas/lib/pooling.cpp @@ -16,12 +17,10 @@ set(mlas_common_srcs ) if(MSVC) - if(CMAKE_GENERATOR_PLATFORM STREQUAL "ARM64") - - set(asm_filename ${ONNXRUNTIME_ROOT}/core/mlas/lib/arm64/sgemma.asm) - set(pre_filename ${CMAKE_CURRENT_BINARY_DIR}/sgemma.i) - set(obj_filename ${CMAKE_CURRENT_BINARY_DIR}/sgemma.obj) + set(asm_filename ${ONNXRUNTIME_ROOT}/core/mlas/lib/arm64/SgemmKernelNeon.asm) + set(pre_filename ${CMAKE_CURRENT_BINARY_DIR}/SgemmKernelNeon.i) + set(obj_filename ${CMAKE_CURRENT_BINARY_DIR}/SgemmKernelNeon.obj) if(CMAKE_BUILD_TYPE STREQUAL "Debug") set(ARMASM_FLAGS "-g") @@ -36,20 +35,18 
@@ if(MSVC) COMMAND armasm64.exe ${ARMASM_FLAGS} ${pre_filename} ${obj_filename} ) - set(mlas_platform_srcs ${obj_filename}) - elseif(CMAKE_GENERATOR_PLATFORM STREQUAL "ARM" OR CMAKE_GENERATOR MATCHES "ARM") - set(mlas_platform_srcs ${ONNXRUNTIME_ROOT}/core/mlas/lib/arm/sgemmc.cpp ) - elseif(CMAKE_GENERATOR_PLATFORM STREQUAL "x64" OR CMAKE_GENERATOR MATCHES "Win64") - enable_language(ASM_MASM) set(mlas_platform_srcs + ${ONNXRUNTIME_ROOT}/core/mlas/lib/amd64/QgemmU8U8KernelAvx2.asm + ${ONNXRUNTIME_ROOT}/core/mlas/lib/amd64/QgemmU8U8KernelAvx512BW.asm + ${ONNXRUNTIME_ROOT}/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Vnni.asm ${ONNXRUNTIME_ROOT}/core/mlas/lib/amd64/SgemmKernelSse2.asm ${ONNXRUNTIME_ROOT}/core/mlas/lib/amd64/SgemmKernelAvx.asm ${ONNXRUNTIME_ROOT}/core/mlas/lib/amd64/SgemmKernelFma3.asm @@ -67,9 +64,7 @@ if(MSVC) ${ONNXRUNTIME_ROOT}/core/mlas/lib/amd64/TanhKernelFma3.asm ${ONNXRUNTIME_ROOT}/core/mlas/lib/amd64/ErfKernelFma3.asm ) - else() - enable_language(ASM_MASM) set(CMAKE_ASM_MASM_FLAGS "${CMAKE_ASM_MASM_FLAGS} /safeseh") @@ -77,14 +72,13 @@ if(MSVC) set(mlas_platform_srcs ${ONNXRUNTIME_ROOT}/core/mlas/lib/i386/sgemma.asm ) - endif() else() if (CMAKE_SYSTEM_NAME STREQUAL "Android") if (CMAKE_ANDROID_ARCH_ABI STREQUAL "armeabi-v7a") set(ARM TRUE) elseif (CMAKE_ANDROID_ARCH_ABI STREQUAL "arm64-v8a") - set(ARM TRUE) # Android NDK fails to compile sgemma.s + set(ARM64 TRUE) elseif (CMAKE_ANDROID_ARCH_ABI STREQUAL "x86_64") set(X86_64 TRUE) elseif (CMAKE_ANDROID_ARCH_ABI STREQUAL "x86") @@ -95,8 +89,7 @@ else() COMMAND ${CMAKE_C_COMPILER} -dumpmachine OUTPUT_VARIABLE dumpmachine_output ERROR_QUIET - ) - + ) if(dumpmachine_output MATCHES "^arm.*") set(ARM TRUE) elseif(dumpmachine_output MATCHES "^aarch64.*") @@ -108,39 +101,39 @@ else() endif() endif() - if (ARM) + if(ARM) set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mfpu=neon") set(mlas_platform_srcs ${ONNXRUNTIME_ROOT}/core/mlas/lib/arm/sgemmc.cpp - ) - elseif (ARM64) + ) + elseif(ARM64) enable_language(ASM) set(mlas_platform_srcs - ${ONNXRUNTIME_ROOT}/core/mlas/lib/aarch64/sgemma.s - ) - elseif (X86) + ${ONNXRUNTIME_ROOT}/core/mlas/lib/aarch64/SgemmKernelNeon.S + ) + elseif(X86) enable_language(ASM) set(mlas_platform_srcs_sse2 ${ONNXRUNTIME_ROOT}/core/mlas/lib/x86/SgemmKernelSse2.S - ) + ) set_source_files_properties(${mlas_platform_srcs_sse2} PROPERTIES COMPILE_FLAGS "-msse2") set(mlas_platform_srcs_avx ${ONNXRUNTIME_ROOT}/core/mlas/lib/x86/SgemmKernelAvx.S - ) + ) set_source_files_properties(${mlas_platform_srcs_avx} PROPERTIES COMPILE_FLAGS "-mavx") set(mlas_platform_srcs ${mlas_platform_srcs_sse2} ${mlas_platform_srcs_avx} - ) - elseif (X86_64) + ) + elseif(X86_64) enable_language(ASM) - # The LLVM assmebler does not support the .arch directive to enable instruction + # The LLVM assembler does not support the .arch directive to enable instruction # set extensions and also doesn't support AVX-512F instructions without # turning on support via command-line option. Group the sources by the # instruction set extension and explicitly set the compiler flag as appropriate. 
@@ -164,6 +157,7 @@ else() set_source_files_properties(${mlas_platform_srcs_avx} PROPERTIES COMPILE_FLAGS "-mavx") set(mlas_platform_srcs_avx2 + ${ONNXRUNTIME_ROOT}/core/mlas/lib/x86_64/QgemmU8U8KernelAvx2.S ${ONNXRUNTIME_ROOT}/core/mlas/lib/x86_64/SgemmKernelFma3.S ${ONNXRUNTIME_ROOT}/core/mlas/lib/x86_64/SconvKernelFma3.S ${ONNXRUNTIME_ROOT}/core/mlas/lib/x86_64/LogisticKernelFma3.S @@ -179,17 +173,22 @@ else() ) set_source_files_properties(${mlas_platform_srcs_avx512f} PROPERTIES COMPILE_FLAGS "-mavx512f") + set(mlas_platform_srcs_avx512bw + ${ONNXRUNTIME_ROOT}/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512BW.S + ${ONNXRUNTIME_ROOT}/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Vnni.S + ) + set_source_files_properties(${mlas_platform_srcs_avx512bw} PROPERTIES COMPILE_FLAGS "-mavx512bw") + set(mlas_platform_srcs ${mlas_platform_srcs_sse2} ${mlas_platform_srcs_avx} ${mlas_platform_srcs_avx2} ${mlas_platform_srcs_avx512f} + ${mlas_platform_srcs_avx512bw} ) - endif() - endif() add_library(onnxruntime_mlas STATIC ${mlas_common_srcs} ${mlas_platform_srcs}) -target_include_directories(onnxruntime_mlas PRIVATE ${ONNXRUNTIME_ROOT}/core/mlas/inc ${ONNXRUNTIME_ROOT}/core/mlas/lib) +target_include_directories(onnxruntime_mlas PRIVATE ${ONNXRUNTIME_ROOT}/core/mlas/inc ${ONNXRUNTIME_ROOT}/core/mlas/lib ${eigen_INCLUDE_DIRS}) set_target_properties(onnxruntime_mlas PROPERTIES FOLDER "ONNXRuntime") diff --git a/cmake/onnxruntime_nuphar_extern.cmake b/cmake/onnxruntime_nuphar_extern.cmake new file mode 100644 index 0000000000000..9a34e82f204d6 --- /dev/null +++ b/cmake/onnxruntime_nuphar_extern.cmake @@ -0,0 +1,39 @@ +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. + +# this is for building extern functions in nuphar execution provider, using AVX2 +# the separation from onnxruntime_providers.cmake is to avoid unnecessary AVX2 codegen in providers +# functions built here would be dynamically switched based on if AVX2 is available from CPUID + +add_definitions(-DNUPHAR_USE_AVX2) + +set(extern_avx2_srcs + ${ONNXRUNTIME_ROOT}/core/providers/nuphar/extern/igemv_avx2.cc + ${ONNXRUNTIME_ROOT}/core/providers/nuphar/extern/igemv_avx2.h +) + +if (MSVC) + set_source_files_properties(${extern_avx2_srcs} PROPERTIES COMPILE_FLAGS "/arch:AVX2") +else() + set_source_files_properties(${extern_avx2_srcs} PROPERTIES COMPILE_FLAGS "-march=broadwell") +endif() + +set(nuphar_extern_srcs + ${extern_avx2_srcs} +) + +add_library(onnxruntime_nuphar_extern ${nuphar_extern_srcs}) + +if (onnxruntime_USE_MKLML) + add_definitions(-DNUPHAR_USE_MKL) + target_include_directories(onnxruntime_nuphar_extern PRIVATE ${ONNXRUNTIME_ROOT}/core/providers/nuphar/extern ${MKLML_INCLUDE_DIR}) + add_dependencies(onnxruntime_nuphar_extern project_mklml) +else() + target_include_directories(onnxruntime_nuphar_extern PRIVATE ${ONNXRUNTIME_ROOT}/core/providers/nuphar/extern) +endif() + +set_target_properties(onnxruntime_nuphar_extern PROPERTIES FOLDER "ONNXRuntime") + +list(APPEND onnxruntime_EXTERNAL_LIBRARIES onnxruntime_nuphar_extern) +list(APPEND onnxruntime_EXTERNAL_DEPENDENCIES onnxruntime_nuphar_extern) +link_directories(${CMAKE_CURRENT_BINARY_DIR}/${CMAKE_BUILD_TYPE}) diff --git a/cmake/onnxruntime_providers.cmake b/cmake/onnxruntime_providers.cmake index 0447e4814d37d..4c0abeb970aaa 100644 --- a/cmake/onnxruntime_providers.cmake +++ b/cmake/onnxruntime_providers.cmake @@ -25,6 +25,16 @@ file(GLOB_RECURSE onnxruntime_cuda_contrib_ops_cu_srcs CONFIGURE_DEPENDS "${ONNXRUNTIME_ROOT}/contrib_ops/cuda/*.cuh" ) 
+file(GLOB onnxruntime_cpu_automl_cc_srcs CONFIGURE_DEPENDS + "${ONNXRUNTIME_ROOT}/automl_ops/cpu_automl_kernels.h" + "${ONNXRUNTIME_ROOT}/automl_ops/cpu_automl_kernels.cc" + "${ONNXRUNTIME_ROOT}/automl_ops/automl_types.h" + "${ONNXRUNTIME_ROOT}/automl_ops/automl_types.cc" + "${ONNXRUNTIME_ROOT}/automl_ops/automl_featurizers.h" + "${ONNXRUNTIME_ROOT}/automl_ops/cpu/*.h" + "${ONNXRUNTIME_ROOT}/automl_ops/cpu/*.cc" +) + file(GLOB onnxruntime_providers_common_srcs CONFIGURE_DEPENDS "${ONNXRUNTIME_ROOT}/core/providers/*.h" "${ONNXRUNTIME_ROOT}/core/providers/*.cc" @@ -38,6 +48,10 @@ if(onnxruntime_USE_NGRAPH) set(PROVIDERS_NGRAPH onnxruntime_providers_ngraph) list(APPEND ONNXRUNTIME_PROVIDER_NAMES ngraph) endif() +if(onnxruntime_USE_NUPHAR) + set(PROVIDERS_NUPHAR onnxruntime_providers_nuphar) + list(APPEND ONNXRUNTIME_PROVIDER_NAMES nuphar) +endif() if(onnxruntime_USE_CUDA) set(PROVIDERS_CUDA onnxruntime_providers_cuda) list(APPEND ONNXRUNTIME_PROVIDER_NAMES cuda) @@ -55,17 +69,30 @@ if(onnxruntime_USE_NNAPI) list(APPEND ONNXRUNTIME_PROVIDER_NAMES nnapi) endif() source_group(TREE ${ONNXRUNTIME_ROOT}/core FILES ${onnxruntime_providers_common_srcs} ${onnxruntime_providers_srcs}) -# add using ONNXRUNTIME_ROOT so they show up under the 'contrib_ops' folder in Visual Studio -source_group(TREE ${ONNXRUNTIME_ROOT} FILES ${onnxruntime_cpu_contrib_ops_srcs}) + +set(onnxruntime_providers_src ${onnxruntime_providers_common_srcs} ${onnxruntime_providers_srcs}) # disable contrib ops conditionally -if(onnxruntime_DISABLE_CONTRIB_OPS) - add_library(onnxruntime_providers ${onnxruntime_providers_common_srcs} ${onnxruntime_providers_srcs}) -else() - add_library(onnxruntime_providers ${onnxruntime_providers_common_srcs} ${onnxruntime_providers_srcs} ${onnxruntime_cpu_contrib_ops_srcs}) +if(NOT onnxruntime_DISABLE_CONTRIB_OPS) + # add using ONNXRUNTIME_ROOT so they show up under the 'contrib_ops' folder in Visual Studio + source_group(TREE ${ONNXRUNTIME_ROOT} FILES ${onnxruntime_cpu_contrib_ops_srcs}) + list(APPEND onnxruntime_providers_src ${onnxruntime_cpu_contrib_ops_srcs}) +endif() + +if (onnxruntime_USE_AUTOML) + source_group(TREE ${ONNXRUNTIME_ROOT}/ FILES ${onnxruntime_cpu_automl_cc_srcs}) + list(APPEND onnxruntime_providers_src ${onnxruntime_cpu_automl_cc_srcs}) endif() +add_library(onnxruntime_providers ${onnxruntime_providers_src}) onnxruntime_add_include_to_target(onnxruntime_providers onnxruntime_common onnxruntime_framework gsl onnx onnx_proto protobuf::libprotobuf) + +if (onnxruntime_USE_AUTOML) + add_dependencies(onnxruntime_providers automl_featurizers) + onnxruntime_add_include_to_target(onnxruntime_providers automl_featurizers) + target_link_libraries(onnxruntime_providers automl_featurizers) +endif() + if(HAS_DEPRECATED_COPY) #temporarily ignore this warning #see: https://en.wikipedia.org/wiki/Rule_of_three_(C%2B%2B_programming) @@ -78,17 +105,6 @@ if(HAS_DEPRECATED_COPY) set_source_files_properties("${ONNXRUNTIME_ROOT}/core/providers/cpu/tensor/where_op.cc" PROPERTIES COMPILE_FLAGS -Wno-deprecated-copy) endif() -if(CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64" OR CMAKE_SYSTEM_PROCESSOR STREQUAL "AMD64" AND NOT MSVC) - # For x86 platforms it is important to pass this flag to compiler. Without this gemmlowp will use slow reference code. - # These optimizations are not enabled on MSVC so excluding it. 
- message("enabling optimizations for gemmlowp") - set_source_files_properties("${ONNXRUNTIME_ROOT}/core/providers/cpu/math/matmul_integer.cc" PROPERTIES COMPILE_FLAGS "-msse4.1") - set_source_files_properties("${ONNXRUNTIME_ROOT}/core/providers/cpu/math/quantize_linear_matmul.cc" PROPERTIES COMPILE_FLAGS "-msse4.1") - set_source_files_properties("${ONNXRUNTIME_ROOT}/core/providers/cpu/nn/qlinearconv.cc" PROPERTIES COMPILE_FLAGS "-msse4.1") - set_source_files_properties("${ONNXRUNTIME_ROOT}/core/providers/cpu/nn/conv_integer.cc" PROPERTIES COMPILE_FLAGS "-msse4.1") -endif() - -set(gemmlowp_src ${PROJECT_SOURCE_DIR}/external/gemmlowp) set(re2_src ${ONNXRUNTIME_ROOT}/../cmake/external/re2) target_include_directories(onnxruntime_providers PRIVATE ${ONNXRUNTIME_ROOT} ${eigen_INCLUDE_DIRS} ${gemmlowp_src} ${re2_src}) add_dependencies(onnxruntime_providers gsl onnx ${onnxruntime_EXTERNAL_DEPENDENCIES}) @@ -306,6 +322,43 @@ endif() file(COPY ${onnxruntime_providers_openvino_py_srcs} DESTINATION ${onnxruntime_BINARY_DIR}) endif() +if (onnxruntime_USE_NUPHAR) + add_definitions(-DUSE_NUPHAR=1) + + if (NOT onnxruntime_USE_TVM) + message(FATAL_ERROR "onnxruntime_USE_TVM required for onnxruntime_USE_NUPHAR") + endif() + + if (NOT onnxruntime_USE_LLVM) + message(FATAL_ERROR "onnxruntime_USE_LLVM required for onnxruntime_USE_NUPHAR") + endif() + + include(onnxruntime_nuphar_extern.cmake) + + file(GLOB_RECURSE onnxruntime_providers_nuphar_cc_srcs + "${ONNXRUNTIME_ROOT}/core/providers/nuphar/*.h" + "${ONNXRUNTIME_ROOT}/core/providers/nuphar/*.cc" + ) + + # following files required different build flag for AVX2 in separate onnxruntime_nuphar_extern.cmake file + list (REMOVE_ITEM onnxruntime_providers_nuphar_cc_srcs "${ONNXRUNTIME_ROOT}/core/providers/nuphar/extern/igemv_avx2.cc") + list (REMOVE_ITEM onnxruntime_providers_nuphar_cc_srcs "${ONNXRUNTIME_ROOT}/core/providers/nuphar/extern/igemv_avx2.h") + + if (onnxruntime_USE_MKLML) + add_definitions(-DNUPHAR_USE_MKL) + endif() + + source_group(TREE ${ONNXRUNTIME_ROOT}/core FILES ${onnxruntime_providers_nuphar_cc_srcs}) + add_library(onnxruntime_providers_nuphar ${onnxruntime_providers_nuphar_cc_srcs}) + onnxruntime_add_include_to_target(onnxruntime_providers_nuphar onnxruntime_common onnxruntime_framework gsl onnx onnx_proto protobuf::libprotobuf) + set_target_properties(onnxruntime_providers_nuphar PROPERTIES FOLDER "ONNXRuntime") + target_include_directories(onnxruntime_providers_nuphar PRIVATE ${ONNXRUNTIME_ROOT} ${TVM_INCLUDES} ${eigen_INCLUDE_DIRS}) + set_target_properties(onnxruntime_providers_nuphar PROPERTIES LINKER_LANGUAGE CXX) + target_compile_options(onnxruntime_providers_nuphar PRIVATE ${DISABLED_WARNINGS_FOR_TVM}) + add_dependencies(onnxruntime_providers_nuphar ${onnxruntime_EXTERNAL_DEPENDENCIES}) + install(DIRECTORY ${PROJECT_SOURCE_DIR}/../include/onnxruntime/core/providers/nuphar DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/onnxruntime/core/providers) +endif() + if (onnxruntime_USE_NNAPI) add_definitions(-DUSE_NNAPI=1) option(DNN_READ_ONNX "" ON) diff --git a/cmake/onnxruntime_python.cmake b/cmake/onnxruntime_python.cmake index c9fcb91ff359d..a01317df6eff4 100644 --- a/cmake/onnxruntime_python.cmake +++ b/cmake/onnxruntime_python.cmake @@ -73,6 +73,7 @@ set(onnxruntime_pybind11_state_libs ${PROVIDERS_TENSORRT} ${PROVIDERS_NGRAPH} ${PROVIDERS_OPENVINO} + ${PROVIDERS_NUPHAR} ${PROVIDERS_NNAPI} onnxruntime_optimizer onnxruntime_providers @@ -234,3 +235,15 @@ if (onnxruntime_USE_MKLML) $/onnxruntime/capi/ ) endif() + +if (onnxruntime_USE_NUPHAR) + 
file(GLOB onnxruntime_python_nuphar_test_srcs CONFIGURE_DEPENDS + "${ONNXRUNTIME_ROOT}/core/providers/nuphar/scripts/*.*" + ) + add_custom_command( + TARGET onnxruntime_pybind11_state POST_BUILD + COMMAND ${CMAKE_COMMAND} -E copy + ${onnxruntime_python_nuphar_test_srcs} + $ + ) +endif() \ No newline at end of file diff --git a/cmake/onnxruntime_unittests.cmake b/cmake/onnxruntime_unittests.cmake index 3223e263a21e1..c2cf520cd32ed 100644 --- a/cmake/onnxruntime_unittests.cmake +++ b/cmake/onnxruntime_unittests.cmake @@ -126,6 +126,12 @@ if(NOT onnxruntime_DISABLE_CONTRIB_OPS) "${TEST_SRC_DIR}/contrib_ops/*.cc") endif() +if(onnxruntime_USE_AUTOML) + list(APPEND onnxruntime_test_providers_src_patterns + "${TEST_SRC_DIR}/automl_ops/*.h" + "${TEST_SRC_DIR}/automl_ops/*.cc") +endif() + file(GLOB onnxruntime_test_providers_src CONFIGURE_DEPENDS ${onnxruntime_test_providers_src_patterns}) file(GLOB_RECURSE onnxruntime_test_providers_cpu_src CONFIGURE_DEPENDS @@ -209,6 +215,10 @@ if(onnxruntime_USE_NNAPI) list(APPEND onnxruntime_test_providers_dependencies onnxruntime_providers_nnapi) endif() +if(onnxruntime_USE_AUTOML) + list(APPEND onnxruntime_test_providers_dependencies automl_featurizers) +endif() + file(GLOB_RECURSE onnxruntime_test_tvm_src CONFIGURE_DEPENDS "${ONNXRUNTIME_ROOT}/test/tvm/*.h" "${ONNXRUNTIME_ROOT}/test/tvm/*.cc" @@ -219,6 +229,13 @@ file(GLOB_RECURSE onnxruntime_test_openvino_src "${ONNXRUNTIME_ROOT}/test/openvino/*.cc" ) +if(onnxruntime_USE_NUPHAR) + list(APPEND onnxruntime_test_framework_src_patterns ${TEST_SRC_DIR}/framework/nuphar/*) + list(APPEND onnxruntime_test_framework_libs onnxruntime_providers_nuphar) + list(APPEND onnxruntime_test_providers_dependencies onnxruntime_providers_nuphar) + list(APPEND onnxruntime_test_providers_libs onnxruntime_providers_nuphar) +endif() + if (onnxruntime_ENABLE_MICROSOFT_INTERNAL) include(onnxruntime_unittests_internal.cmake) endif() @@ -231,6 +248,7 @@ set(ONNXRUNTIME_TEST_LIBS ${PROVIDERS_TENSORRT} ${PROVIDERS_NGRAPH} ${PROVIDERS_OPENVINO} + ${PROVIDERS_NUPHAR} ${PROVIDERS_NNAPI} onnxruntime_optimizer onnxruntime_providers @@ -471,7 +489,12 @@ set(onnx_test_runner_common_srcs ${onnx_test_runner_src_dir}/TestCase.h ${onnx_test_runner_src_dir}/onnxruntime_event.h ${onnx_test_runner_src_dir}/sync_api.h - ${onnx_test_runner_src_dir}/sync_api.cc) + ${onnx_test_runner_src_dir}/sync_api.cc + ${onnx_test_runner_src_dir}/callback.h + ${onnx_test_runner_src_dir}/callback.cc + ${onnx_test_runner_src_dir}/mem_buffer.h + ${onnx_test_runner_src_dir}/tensorprotoutils.h + ${onnx_test_runner_src_dir}/tensorprotoutils.cc) if(WIN32) set(wide_get_opt_src_dir ${TEST_SRC_DIR}/win_getopt/wide) @@ -505,13 +528,19 @@ onnxruntime_add_include_to_target(onnx_test_runner gsl) target_include_directories(onnx_test_runner PRIVATE ${ONNXRUNTIME_ROOT}) set_target_properties(onnx_test_runner PROPERTIES FOLDER "ONNXRuntimeTest") +if (onnxruntime_USE_TVM) + if (WIN32) + target_link_options(onnx_test_runner PRIVATE "/STACK:4000000") + endif() +endif() + install(TARGETS onnx_test_runner ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR} LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR} RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}) if(onnxruntime_BUILD_BENCHMARKS) - add_executable(onnxruntime_benchmark ${TEST_SRC_DIR}/onnx/microbenchmark/main.cc ${TEST_SRC_DIR}/onnx/microbenchmark/modeltest.cc ${TEST_SRC_DIR}/onnx/microbenchmark/model_init.cc) + add_executable(onnxruntime_benchmark ${TEST_SRC_DIR}/onnx/microbenchmark/main.cc ${TEST_SRC_DIR}/onnx/microbenchmark/modeltest.cc) 
target_include_directories(onnxruntime_benchmark PRIVATE ${ONNXRUNTIME_ROOT} ${onnxruntime_graph_header} benchmark) onnxruntime_add_include_to_target(onnxruntime_benchmark gsl) if(WIN32) @@ -585,6 +614,12 @@ if (onnxruntime_ENABLE_LANGUAGE_INTEROP_OPS AND NOT onnxruntime_BUILD_SHARED_LIB target_link_libraries(onnxruntime_perf_test PRIVATE onnxruntime_language_interop onnxruntime_pyop) endif() +if (onnxruntime_USE_TVM) + if (WIN32) + target_link_options(onnxruntime_perf_test PRIVATE "/STACK:4000000") + endif() +endif() + # shared lib if (onnxruntime_BUILD_SHARED_LIB) add_library(onnxruntime_mocked_allocator ${ONNXRUNTIME_ROOT}/test/util/test_allocator.cc) @@ -606,7 +641,6 @@ if (onnxruntime_BUILD_SHARED_LIB) endif() if (NOT(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")) #for some reason, these tests are failing. Need investigation. - list(APPEND onnxruntime_shared_lib_test_SRC ${ONNXRUNTIME_SHARED_LIB_TEST_SRC_DIR}/test_tensor_loader.cc) if (onnxruntime_USE_FULL_PROTOBUF) list(APPEND onnxruntime_shared_lib_test_SRC ${ONNXRUNTIME_SHARED_LIB_TEST_SRC_DIR}/test_model_loading.cc) endif() diff --git a/cmake/onnxruntime_util.cmake b/cmake/onnxruntime_util.cmake index a8b611d7c99c0..feea9f90ee80f 100644 --- a/cmake/onnxruntime_util.cmake +++ b/cmake/onnxruntime_util.cmake @@ -8,8 +8,17 @@ file(GLOB_RECURSE onnxruntime_util_srcs CONFIGURE_DEPENDS source_group(TREE ${ONNXRUNTIME_ROOT}/core FILES ${onnxruntime_util_srcs}) +if(CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64" OR CMAKE_SYSTEM_PROCESSOR STREQUAL "AMD64" AND NOT MSVC) + # For x86 platforms it is important to pass this flag to compiler. Without this gemmlowp will use slow reference code. + # These optimizations are not enabled on MSVC so excluding it. + message("enabling optimizations for gemmlowp") + set_source_files_properties("${ONNXRUNTIME_ROOT}/core/util/gemmlowp_common.cc" PROPERTIES COMPILE_FLAGS "-msse4.1") +endif() + +set(gemmlowp_src ${PROJECT_SOURCE_DIR}/external/gemmlowp) + add_library(onnxruntime_util ${onnxruntime_util_srcs}) -target_include_directories(onnxruntime_util PRIVATE ${ONNXRUNTIME_ROOT} ${MKLML_INCLUDE_DIR} PUBLIC ${eigen_INCLUDE_DIRS}) +target_include_directories(onnxruntime_util PRIVATE ${ONNXRUNTIME_ROOT} ${MKLML_INCLUDE_DIR} ${gemmlowp_src} PUBLIC ${eigen_INCLUDE_DIRS}) onnxruntime_add_include_to_target(onnxruntime_util onnxruntime_common onnxruntime_framework gsl onnx onnx_proto protobuf::libprotobuf) if(UNIX) target_compile_options(onnxruntime_util PUBLIC "-Wno-error=comment") diff --git a/cmake/patches/ngraph/ngraph_fix_install_error.patch b/cmake/patches/ngraph/ngraph_fix_install_error.patch deleted file mode 100644 index ddabbb7d86527..0000000000000 --- a/cmake/patches/ngraph/ngraph_fix_install_error.patch +++ /dev/null @@ -1,127 +0,0 @@ -From 280fbc003ea2794adb24d6a81d42db838a793dd9 Mon Sep 17 00:00:00 2001 -From: Sang Ik Lee -Date: Mon, 15 Apr 2019 16:11:27 -0700 -Subject: [PATCH] CMAKE_CFG_INTDIR does not work at install time. Use - CMAKE_INSTALL_CONFIG_NAME on Windows. 
- ---- - CMakeLists.txt | 7 ++++++- - cmake/external_mkldnn.cmake | 22 +++++++++++----------- - cmake/external_tbb.cmake | 4 ++-- - cmake/external_tbb_prebuilt.cmake | 6 +++--- - 4 files changed, 22 insertions(+), 17 deletions(-) - -diff --git a/CMakeLists.txt b/CMakeLists.txt -index 2a21ed3a3..a695e217f 100755 ---- a/CMakeLists.txt -+++ b/CMakeLists.txt -@@ -390,12 +390,17 @@ endif() - - set(NGRAPH_BUILD_DIR ${CMAKE_CURRENT_BINARY_DIR}/src/ngraph) - set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${NGRAPH_BUILD_DIR}) --set(NGRAPH_LIBRARY_OUTPUT_DIRECTORY ${NGRAPH_BUILD_DIR}/${CMAKE_CFG_INTDIR}) - if(WIN32) -+ set(NGRAPH_LIBRARY_OUTPUT_DIRECTORY ${NGRAPH_BUILD_DIR}/${CMAKE_CFG_INTDIR}) -+ set(NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY ${NGRAPH_BUILD_DIR}/\${CMAKE_INSTALL_CONFIG_NAME}) - set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${NGRAPH_BUILD_DIR}) - set(NGRAPH_ARCHIVE_OUTPUT_DIRECTORY ${NGRAPH_BUILD_DIR}/${CMAKE_CFG_INTDIR}) -+ set(NGRAPH_ARCHIVE_INSTALLSRC_DIRECTORY ${NGRAPH_BUILD_DIR}/\${CMAKE_INSTALL_CONFIG_NAME}) - set(CMAKE_PDB_OUTPUT_DIRECTORY ${NGRAPH_BUILD_DIR}) - set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${NGRAPH_BUILD_DIR}) -+else() -+ set(NGRAPH_LIBRARY_OUTPUT_DIRECTORY ${NGRAPH_BUILD_DIR}) -+ set(NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY ${NGRAPH_BUILD_DIR}) - endif() - - set(EXTERNAL_INSTALL_DIR ${CMAKE_BINARY_DIR}/external) -diff --git a/cmake/external_mkldnn.cmake b/cmake/external_mkldnn.cmake -index 25445bf0b..7874aca76 100644 ---- a/cmake/external_mkldnn.cmake -+++ b/cmake/external_mkldnn.cmake -@@ -312,12 +312,12 @@ endif() - if(WIN32) - install( - FILES -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${MKLML_LIB} -- ${NGRAPH_ARCHIVE_OUTPUT_DIRECTORY}/${MKLML_IMPLIB} -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${OMP_LIB} -- ${NGRAPH_ARCHIVE_OUTPUT_DIRECTORY}/${OMP_IMPLIB} -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${MKLDNN_LIB} -- ${NGRAPH_ARCHIVE_OUTPUT_DIRECTORY}/${MKLDNN_IMPLIB} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${MKLML_LIB} -+ ${NGRAPH_ARCHIVE_INSTALLSRC_DIRECTORY}/${MKLML_IMPLIB} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${OMP_LIB} -+ ${NGRAPH_ARCHIVE_INSTALLSRC_DIRECTORY}/${OMP_IMPLIB} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${MKLDNN_LIB} -+ ${NGRAPH_ARCHIVE_INSTALLSRC_DIRECTORY}/${MKLDNN_IMPLIB} - DESTINATION - ${NGRAPH_INSTALL_LIB} - OPTIONAL -@@ -325,9 +325,9 @@ if(WIN32) - else() - install( - FILES -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${MKLML_LIB} -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${OMP_LIB} -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${MKLDNN_LIB} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${MKLML_LIB} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${OMP_LIB} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${MKLDNN_LIB} - DESTINATION - ${NGRAPH_INSTALL_LIB} - OPTIONAL -@@ -335,8 +335,8 @@ else() - if(NGRAPH_LIB_VERSIONING_ENABLE) - install( - FILES -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${MKLDNN_SHORT_LIB} -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${MKLDNN_FULL_LIB} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${MKLDNN_SHORT_LIB} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${MKLDNN_FULL_LIB} - DESTINATION - ${NGRAPH_INSTALL_LIB} - OPTIONAL -diff --git a/cmake/external_tbb.cmake b/cmake/external_tbb.cmake -index 761c5b3bd..6960ea929 100644 ---- a/cmake/external_tbb.cmake -+++ b/cmake/external_tbb.cmake -@@ -63,10 +63,10 @@ if(NGRAPH_TBB_ENABLE) - ${TBB_BUILD_DIR}/${TBB_LIB}.${TBB_SOVER} - DESTINATION ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}) - endif() -- install(FILES ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${TBB_LIB} -+ install(FILES ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${TBB_LIB} - DESTINATION ${NGRAPH_INSTALL_LIB}) - if(LINUX) -- 
install(FILES ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${TBB_LIB}.${TBB_SOVER} -+ install(FILES ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${TBB_LIB}.${TBB_SOVER} - DESTINATION ${NGRAPH_INSTALL_LIB}) - endif() - add_library(libtbb INTERFACE) -diff --git a/cmake/external_tbb_prebuilt.cmake b/cmake/external_tbb_prebuilt.cmake -index 3e1d0688f..a1cf1922a 100644 ---- a/cmake/external_tbb_prebuilt.cmake -+++ b/cmake/external_tbb_prebuilt.cmake -@@ -69,8 +69,8 @@ if (WIN32) - DEPENDEES download - ) - -- install(FILES ${NGRAPH_ARCHIVE_OUTPUT_DIRECTORY}/${TBB_LIB_NAME}${CMAKE_STATIC_LIBRARY_SUFFIX} -- ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${TBB_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX} -+ install(FILES ${NGRAPH_ARCHIVE_INSTALLSRC_DIRECTORY}/${TBB_LIB_NAME}${CMAKE_STATIC_LIBRARY_SUFFIX} -+ ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${TBB_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX} - DESTINATION ${NGRAPH_INSTALL_LIB}) - elseif(APPLE) - set(TBB_LINK_LIBS -@@ -82,7 +82,7 @@ elseif(APPLE) - COMMENT "Move tbb libraries to ngraph build directory" - ) - -- install(FILES ${NGRAPH_LIBRARY_OUTPUT_DIRECTORY}/${CMAKE_SHARED_LIBRARY_PREFIX}${TBB_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX} -+ install(FILES ${NGRAPH_LIBRARY_INSTALLSRC_DIRECTORY}/${CMAKE_SHARED_LIBRARY_PREFIX}${TBB_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX} - DESTINATION ${NGRAPH_INSTALL_LIB}) - endif() - --- -2.13.0.windows.1 - diff --git a/cmake/patches/ngraph/ngraph_fix_library_path.patch b/cmake/patches/ngraph/ngraph_fix_library_path.patch deleted file mode 100644 index aaa63e96e7d78..0000000000000 --- a/cmake/patches/ngraph/ngraph_fix_library_path.patch +++ /dev/null @@ -1,33 +0,0 @@ -From fcd51f874f4a96fb4ca77d762ed39ea1bf3f2c0d Mon Sep 17 00:00:00 2001 -From: Junfeng Dong -Date: Wed, 17 Apr 2019 13:42:42 -0700 -Subject: [PATCH] Fix dll library load path on Windows. 
- ---- - src/ngraph/runtime/backend_manager.cpp | 3 ++- - 1 file changed, 2 insertions(+), 1 deletion(-) - -diff --git a/src/ngraph/runtime/backend_manager.cpp b/src/ngraph/runtime/backend_manager.cpp -index eaa8fc26a..4d35c63ec 100644 ---- a/src/ngraph/runtime/backend_manager.cpp -+++ b/src/ngraph/runtime/backend_manager.cpp -@@ -123,7 +123,7 @@ unique_ptr runtime::BackendManager::create_backend(const std:: - static string find_my_file() - { - #ifdef _WIN32 -- HMODULE hModule = GetModuleHandleW(NULL); -+ HMODULE hModule = GetModuleHandleW(L"ngraph.dll"); - WCHAR wpath[MAX_PATH]; - GetModuleFileNameW(hModule, wpath, MAX_PATH); - wstring ws(wpath); -@@ -157,6 +157,7 @@ DL_HANDLE runtime::BackendManager::open_shared_library(string type) - string my_directory = file_util::get_directory(find_my_file()); - string library_path = file_util::path_join(my_directory, library_name); - #ifdef _WIN32 -+ SetDllDirectory((LPCSTR)my_directory.c_str()); - handle = LoadLibrary(library_path.c_str()); - #else - handle = dlopen(library_path.c_str(), RTLD_NOW | RTLD_GLOBAL); --- -2.13.0.windows.1 - diff --git a/cmake/patches/ngraph/ngraph_fix_mkldnn_missing_symbol.patch b/cmake/patches/ngraph/ngraph_fix_mkldnn_missing_symbol.patch new file mode 100644 index 0000000000000..96504c910003a --- /dev/null +++ b/cmake/patches/ngraph/ngraph_fix_mkldnn_missing_symbol.patch @@ -0,0 +1,64 @@ + cmake/external_mkldnn.cmake | 1 + + cmake/mkldnn_fix_missing_symbol.patch | 99 +++++++++++++++++++++++++++++++++++ + 2 files changed, 100 insertions(+) + create mode 100644 cmake/mkldnn_fix_missing_symbol.patch + +diff --git a/cmake/external_mkldnn.cmake b/cmake/external_mkldnn.cmake +index 7874aca76..bbae6d1a4 100644 +--- a/cmake/external_mkldnn.cmake ++++ b/cmake/external_mkldnn.cmake +@@ -194,7 +194,8 @@ if (WIN32) + CONFIGURE_COMMAND + PATCH_COMMAND ${MKLDNN_PATCH_REVERT_COMMAND} + COMMAND git apply --ignore-space-change --ignore-whitespace ${CMAKE_SOURCE_DIR}/cmake/${MKLDNN_PATCH_FILE} + COMMAND git apply --ignore-space-change --ignore-whitespace ${CMAKE_SOURCE_DIR}/cmake/mkldnn_fix_memory.patch ++ COMMAND git apply --ignore-space-change --ignore-whitespace ${CMAKE_SOURCE_DIR}/cmake/mkldnn_fix_missing_symbol.patch + CMAKE_GENERATOR ${CMAKE_GENERATOR} + CMAKE_GENERATOR_PLATFORM ${CMAKE_GENERATOR_PLATFORM} + CMAKE_GENERATOR_TOOLSET ${CMAKE_GENERATOR_TOOLSET} +diff --git a/cmake/mkldnn_fix_missing_symbol.patch b/cmake/mkldnn_fix_missing_symbol.patch +new file mode 100644 +index 000000000..ea1a3bd61 +--- /dev/null ++++ b/cmake/mkldnn_fix_missing_symbol.patch +@@ -0,0 +1,40 @@ ++commit d485a54ac2b07b7349dabd833961415315a18fea ++Author: Denis Samoilov ++Date: Sun Apr 14 20:11:33 2019 -0700 ++ ++ cpu: gemv: fix unresolved symbol ++ ++ Fixes #456 ++ ++diff --git a/src/cpu/gemm/gemm_driver.cpp b/src/cpu/gemm/gemm_driver.cpp ++index 0773b212..df7bc44d 100644 ++--- a/src/cpu/gemm/gemm_driver.cpp +++++ b/src/cpu/gemm/gemm_driver.cpp ++@@ -1304,10 +1304,8 @@ static mkldnn_status_t gemm_threading_driver( ++ (float *) arg->co); ++ } ++ ++- if (data_traits::data_type == data_type::s8) { ++- if (gemm_s8u8s32_jump_to_gemv_s8u8s32(arg)) { ++- return mkldnn_success; ++- } +++ if (gemm_s8u8s32_jump_to_gemv_s8u8s32(arg)) { +++ return mkldnn_success; ++ } ++ ++ int nthr = (mkldnn_in_parallel()) ? 
1 : mkldnn_get_max_threads(); ++diff --git a/src/cpu/gemm/s8x8s32/jit_avx512_core_gemv_s8u8s32.cpp b/src/cpu/gemm/s8x8s32/jit_avx512_core_gemv_s8u8s32.cpp ++index 73d50e40..81646a43 100644 ++--- a/src/cpu/gemm/s8x8s32/jit_avx512_core_gemv_s8u8s32.cpp +++++ b/src/cpu/gemm/s8x8s32/jit_avx512_core_gemv_s8u8s32.cpp ++@@ -29,6 +29,10 @@ namespace cpu { ++ template ++ int gemm_s8u8s32_jump_to_gemv_s8u8s32(T *arg); ++ +++template <> +++int gemm_s8u8s32_jump_to_gemv_s8u8s32( +++ gemm_info_t *arg) { return 0; } +++ ++ template <> ++ int gemm_s8u8s32_jump_to_gemv_s8u8s32( ++ gemm_info_t *arg) { diff --git a/csharp/OnnxRuntime.CSharp.proj b/csharp/OnnxRuntime.CSharp.proj index 696e019834ebe..c3bf76f2d087a 100644 --- a/csharp/OnnxRuntime.CSharp.proj +++ b/csharp/OnnxRuntime.CSharp.proj @@ -19,19 +19,19 @@ CMake creates a target to this project @@ -54,7 +54,7 @@ CMake creates a target to this project /> - + diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/OnnxRuntime.snk b/csharp/OnnxRuntime.snk similarity index 100% rename from csharp/src/Microsoft.ML.OnnxRuntime/OnnxRuntime.snk rename to csharp/OnnxRuntime.snk diff --git a/csharp/sample/Microsoft.ML.OnnxRuntime.InferenceSample/Program.cs b/csharp/sample/Microsoft.ML.OnnxRuntime.InferenceSample/Program.cs index 0025bb8429f4e..6fc9b63e1a163 100644 --- a/csharp/sample/Microsoft.ML.OnnxRuntime.InferenceSample/Program.cs +++ b/csharp/sample/Microsoft.ML.OnnxRuntime.InferenceSample/Program.cs @@ -6,7 +6,7 @@ using System.Text; using System.IO; using Microsoft.ML.OnnxRuntime; -using System.Numerics.Tensors; +using Microsoft.ML.OnnxRuntime.Tensors; namespace CSharpUsage { @@ -26,7 +26,7 @@ static void UseApi() // Optional : Create session options and set the graph optimization level for the session SessionOptions options = new SessionOptions(); - options.SetSessionGraphOptimizationLevel(2); + options.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED; using (var session = new InferenceSession(modelPath, options)) { diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/DisposableNamedOnnxValue.cs b/csharp/src/Microsoft.ML.OnnxRuntime/DisposableNamedOnnxValue.cs index 096501019771e..bb159ea29e577 100644 --- a/csharp/src/Microsoft.ML.OnnxRuntime/DisposableNamedOnnxValue.cs +++ b/csharp/src/Microsoft.ML.OnnxRuntime/DisposableNamedOnnxValue.cs @@ -3,7 +3,7 @@ using System; using System.Collections.Generic; -using System.Numerics.Tensors; +using Microsoft.ML.OnnxRuntime.Tensors; using System.Runtime.InteropServices; @@ -120,9 +120,15 @@ internal static DisposableNamedOnnxValue CreateTensorFromOnnxValue(string name, case TensorElementType.UInt8: result = DisposableNamedOnnxValueFromNativeTensor(name, nativeOnnxValue); break; + case TensorElementType.Int8: + result = DisposableNamedOnnxValueFromNativeTensor(name, nativeOnnxValue); + break; case TensorElementType.String: result = DisposableNamedOnnxValueFromNativeTensor(name, nativeOnnxValue); break; + case TensorElementType.Bool: + result = DisposableNamedOnnxValueFromNativeTensor(name, nativeOnnxValue); + break; default: throw new NotSupportedException("Tensor of element type: " + elemType + " is not supported"); @@ -134,9 +140,8 @@ internal static DisposableNamedOnnxValue CreateTensorFromOnnxValue(string name, internal static DisposableNamedOnnxValue CreateFromOnnxValue(string name, IntPtr nativeOnnxValue) { IntPtr allocator = IntPtr.Zero; - NativeApiStatus.VerifySuccess(NativeMethods.OrtCreateDefaultAllocator(out allocator)); + 
NativeApiStatus.VerifySuccess(NativeMethods.OrtGetAllocatorWithDefaultOptions(out allocator)); var ret = CreateFromOnnxValue(name, nativeOnnxValue, allocator); - NativeMethods.OrtReleaseAllocator(allocator); return (DisposableNamedOnnxValue)ret; } diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/InferenceSession.cs b/csharp/src/Microsoft.ML.OnnxRuntime/InferenceSession.cs index 5f89bad8bbe9b..79643029561da 100644 --- a/csharp/src/Microsoft.ML.OnnxRuntime/InferenceSession.cs +++ b/csharp/src/Microsoft.ML.OnnxRuntime/InferenceSession.cs @@ -19,6 +19,8 @@ public class InferenceSession : IDisposable { protected IntPtr _nativeHandle; protected Dictionary _inputMetadata, _outputMetadata; + private SessionOptions _builtInSessionOptions = null; + private RunOptions _builtInRunOptions = null; #region Public API @@ -28,10 +30,12 @@ public class InferenceSession : IDisposable /// /// public InferenceSession(string modelPath) - : this(modelPath, SessionOptions.Default) { + _builtInSessionOptions = new SessionOptions(); // need to be disposed + Init(modelPath, _builtInSessionOptions); } + /// /// Constructs an InferenceSession from a model file, with some additional session options /// @@ -39,52 +43,13 @@ public InferenceSession(string modelPath) /// public InferenceSession(string modelPath, SessionOptions options) { - var envHandle = OnnxRuntime.Handle; - - _nativeHandle = IntPtr.Zero; - try - { - if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) - NativeApiStatus.VerifySuccess(NativeMethods.OrtCreateSession(envHandle, System.Text.Encoding.Unicode.GetBytes(modelPath), options._nativePtr, out _nativeHandle)); - else - NativeApiStatus.VerifySuccess(NativeMethods.OrtCreateSession(envHandle, System.Text.Encoding.UTF8.GetBytes(modelPath), options._nativePtr, out _nativeHandle)); - - // Initialize input/output metadata - _inputMetadata = new Dictionary(); - _outputMetadata = new Dictionary(); - - // get input count - UIntPtr inputCount = UIntPtr.Zero; - NativeApiStatus.VerifySuccess(NativeMethods.OrtSessionGetInputCount(_nativeHandle, out inputCount)); - - // get all the output names - for (ulong i = 0; i < (ulong)inputCount; i++) - { - var iname = GetInputName(i); - _inputMetadata[iname] = GetInputMetadata(i); - } - // get output count - UIntPtr outputCount = UIntPtr.Zero; - NativeApiStatus.VerifySuccess(NativeMethods.OrtSessionGetOutputCount(_nativeHandle, out outputCount)); - - // get all the output names - for (ulong i = 0; i < (ulong)outputCount; i++) - { - _outputMetadata[GetOutputName(i)] = GetOutputMetadata(i); - } - - } - catch (OnnxRuntimeException e) - { - if (_nativeHandle != IntPtr.Zero) - { - NativeMethods.OrtReleaseSession(_nativeHandle); - _nativeHandle = IntPtr.Zero; - } - throw e; - } + Init(modelPath, options); } + + /// + /// Meta data regarding the input nodes, keyed by input names + /// public IReadOnlyDictionary InputMetadata { get @@ -93,6 +58,9 @@ public IReadOnlyDictionary InputMetadata } } + /// + /// Metadata regarding the output nodes, keyed by output names + /// public IReadOnlyDictionary OutputMetadata { get @@ -101,11 +69,12 @@ public IReadOnlyDictionary OutputMetadata } } + /// /// Runs the loaded model for the given inputs, and fetches all the outputs. /// /// - /// Output Tensors in a Collection of NamedOnnxValue + /// Output Tensors in a Collection of NamedOnnxValue. User must dispose the output. 
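A minimal usage sketch of the reworked managed Run API above, assuming a placeholder model path and input name and the existing NamedOnnxValue.CreateFromTensor helper; it shows the parameterless InferenceSession constructor (which now owns a built-in SessionOptions), the now-public Run overload taking RunOptions, and the caller disposing the returned collection as the new doc comment requires.

using System;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

static class RunSketch
{
    static void Main()
    {
        // "model.onnx" and the input name "data" are placeholders for illustration only.
        using (var session = new InferenceSession("model.onnx"))
        using (var runOptions = new RunOptions { LogTag = "sample-run" })
        {
            var input = new DenseTensor<float>(new[] { 1, 3 });
            var inputs = new[] { NamedOnnxValue.CreateFromTensor("data", input) };

            // The results own native memory, so the caller must dispose them.
            using (var results = session.Run(inputs, session.OutputMetadata.Keys.ToArray(), runOptions))
            {
                foreach (var r in results)
                    Console.WriteLine(r.Name);
            }
        }
    }
}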
public IDisposableReadOnlyCollection Run(IReadOnlyCollection inputs) { string[] outputNames = new string[_outputMetadata.Count]; @@ -118,21 +87,22 @@ public IDisposableReadOnlyCollection Run(IReadOnlyColl /// /// /// - /// Output Tensors in a Collection of NamedOnnxValue + /// Output Tensors in a Collection of NamedOnnxValue. User must dispose the output. public IDisposableReadOnlyCollection Run(IReadOnlyCollection inputs, IReadOnlyCollection outputNames) { - return Run(inputs, outputNames, RunOptions.Default); + IDisposableReadOnlyCollection result = null; + result = Run(inputs, outputNames, _builtInRunOptions); + return result; } /// - /// Runs the loaded model for the given inputs, and fetches the specified outputs in . + /// Runs the loaded model for the given inputs, and fetches the specified outputs in . /// /// /// /// - /// Output Tensors in a Collection of NamedOnnxValue - //TODO: kept internal until RunOptions is made public - internal IDisposableReadOnlyCollection Run(IReadOnlyCollection inputs, IReadOnlyCollection outputNames, RunOptions options) + /// Output Tensors in a Collection of NamedOnnxValue. User must dispose the output. + public IDisposableReadOnlyCollection Run(IReadOnlyCollection inputs, IReadOnlyCollection outputNames, RunOptions options) { var inputNames = new string[inputs.Count]; var inputTensors = new IntPtr[inputs.Count]; @@ -154,8 +124,7 @@ internal IDisposableReadOnlyCollection Run(IReadOnlyCo IntPtr status = NativeMethods.OrtRun( this._nativeHandle, - IntPtr.Zero, // TODO: use Run options when Run options creation API is available - // Passing null uses the default run options in the C-api + options.Handle, inputNames, inputTensors, (UIntPtr)(inputTensors.Length), @@ -192,7 +161,8 @@ internal IDisposableReadOnlyCollection Run(IReadOnlyCo // always unpin the input buffers, and delete the native Onnx value objects for (int i = 0; i < inputs.Count; i++) { - NativeMethods.OrtReleaseValue(inputTensors[i]); // this should not release the buffer, but should delete the native tensor object + NativeMethods.OrtReleaseValue(inputTensors[i]); // For elementary type Tensors, this should not release the buffer, but should delete the native tensor object. 
+ // For string tensors, this releases the native memory allocated for the tensor, including the buffer pinnedBufferHandles[i].Dispose(); } } @@ -211,6 +181,58 @@ internal ModelMetadata ModelMetadata #endregion #region private methods + + protected void Init(string modelPath, SessionOptions options) + { + var envHandle = OnnxRuntime.Handle; + + _nativeHandle = IntPtr.Zero; + try + { + if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) + NativeApiStatus.VerifySuccess(NativeMethods.OrtCreateSession(envHandle, System.Text.Encoding.Unicode.GetBytes(modelPath), options.Handle, out _nativeHandle)); + else + NativeApiStatus.VerifySuccess(NativeMethods.OrtCreateSession(envHandle, System.Text.Encoding.UTF8.GetBytes(modelPath), options.Handle, out _nativeHandle)); + + // Initialize input/output metadata + _inputMetadata = new Dictionary(); + _outputMetadata = new Dictionary(); + + // get input count + UIntPtr inputCount = UIntPtr.Zero; + NativeApiStatus.VerifySuccess(NativeMethods.OrtSessionGetInputCount(_nativeHandle, out inputCount)); + + // get all the output names + for (ulong i = 0; i < (ulong)inputCount; i++) + { + var iname = GetInputName(i); + _inputMetadata[iname] = GetInputMetadata(i); + } + // get output count + UIntPtr outputCount = UIntPtr.Zero; + NativeApiStatus.VerifySuccess(NativeMethods.OrtSessionGetOutputCount(_nativeHandle, out outputCount)); + + // get all the output names + for (ulong i = 0; i < (ulong)outputCount; i++) + { + _outputMetadata[GetOutputName(i)] = GetOutputMetadata(i); + } + + } + catch (OnnxRuntimeException e) + { + if (_nativeHandle != IntPtr.Zero) + { + NativeMethods.OrtReleaseSession(_nativeHandle); + _nativeHandle = IntPtr.Zero; + } + throw e; + } + + _builtInRunOptions = new RunOptions(); // create a default built-in run option, and avoid creating a new one every run() call + } + + private string GetOutputName(ulong index) { IntPtr nameHandle = IntPtr.Zero; @@ -358,6 +380,15 @@ protected virtual void Dispose(bool disposing) if (disposing) { // cleanup managed resources + if (_builtInSessionOptions != null) + { + _builtInSessionOptions.Dispose(); + } + + if (_builtInRunOptions != null) + { + _builtInRunOptions.Dispose(); + } } // cleanup unmanaged resources @@ -426,24 +457,5 @@ internal class ModelMetadata //TODO: placeholder for Model metadata. Currently C-API does not expose this. } - /// Sets various runtime options. - /// TODO: currently uses Default options only. kept internal until fully implemented - internal class RunOptions - { - protected static readonly Lazy _default = new Lazy(() => new RunOptions()); - - public static RunOptions Default - { - get - { - return _default.Value; - } - } - - private void RuntOptions() - { - - } - } } diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/Microsoft.ML.OnnxRuntime.csproj b/csharp/src/Microsoft.ML.OnnxRuntime/Microsoft.ML.OnnxRuntime.csproj index f7fbdcda281be..c966393f444bf 100644 --- a/csharp/src/Microsoft.ML.OnnxRuntime/Microsoft.ML.OnnxRuntime.csproj +++ b/csharp/src/Microsoft.ML.OnnxRuntime/Microsoft.ML.OnnxRuntime.csproj @@ -3,10 +3,11 @@ netstandard1.1 AnyCPU;x86 + 7.2 true true false - OnnxRuntime.snk + ..\..\OnnxRuntime.snk ..\.. 
@@ -24,12 +25,11 @@ LICENSE.txt https://go.microsoft.com/fwlink/?linkid=2049168 - Release Def: + Release Def: Branch: $(BUILD_SOURCEBRANCH) Commit: $(BUILD_SOURCEVERSION) Build: https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=$(BUILD_BUILDID) - true @@ -39,6 +39,7 @@ + + + + + + + + + + + + @@ -147,8 +211,8 @@ - + diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/NamedOnnxValue.cs b/csharp/src/Microsoft.ML.OnnxRuntime/NamedOnnxValue.cs index dfbd5e4899577..3ac360e67ebcd 100644 --- a/csharp/src/Microsoft.ML.OnnxRuntime/NamedOnnxValue.cs +++ b/csharp/src/Microsoft.ML.OnnxRuntime/NamedOnnxValue.cs @@ -4,7 +4,7 @@ using System; using System.Collections.Generic; using System.Text; -using System.Numerics.Tensors; +using Microsoft.ML.OnnxRuntime.Tensors; using System.Buffers; using System.Collections; using System.Diagnostics; @@ -162,6 +162,15 @@ out nativeElementType )) { } + else if (TryPinAsTensor(out pinnedMemoryHandle, + out dataBufferPointer, + out dataBufferLength, + out shape, + out rank, + out nativeElementType + )) + { + } else if (TryPinAsTensor(out pinnedMemoryHandle, out dataBufferPointer, out dataBufferLength, @@ -171,41 +180,93 @@ out nativeElementType )) { } - //TODO: add other types - else + // special case for string Tensor, data needs to be copied to the native buffer + else if (!(_value is Tensor)) { // nothing to cleanup here, since no memory has been pinned throw new NotSupportedException("The inference value " + nameof(_value) + " is not of a supported type"); } - Debug.Assert(dataBufferPointer != IntPtr.Zero, "dataBufferPointer must be non-null after obtaining the pinned buffer"); - - // copy to an ulong[] shape to match size_t[] - long[] longShape = new long[rank]; - for (int i = 0; i < rank; i++) + if (_value is Tensor) { - longShape[i] = shape[i]; - } + // calculate native tensor length (sum of string lengths in utf-8) + var tensorValue = _value as Tensor; + int totalLength = 0; + for (int i = 0; i < tensorValue.Length; i++) + { + totalLength += Encoding.UTF8.GetByteCount(tensorValue.GetValue(i)); + } - IntPtr status = NativeMethods.OrtCreateTensorWithDataAsOrtValue( - NativeMemoryAllocatorInfo.DefaultInstance.Handle, - dataBufferPointer, - (UIntPtr)(dataBufferLength), - longShape, - (UIntPtr)rank, - nativeElementType, - out onnxValue - ); - try - { - NativeApiStatus.VerifySuccess(status); + long[] longShape = new long[tensorValue.Dimensions.Length]; + for (int i = 0; i < tensorValue.Dimensions.Length; i++) + { + longShape[i] = tensorValue.Dimensions[i]; + } + + // allocate the native tensor + IntPtr nativeTensor = IntPtr.Zero; + try + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtCreateTensorAsOrtValue( + NativeMemoryAllocator.DefaultInstance.Handle, + longShape, + (UIntPtr)(longShape.Length), + TensorElementType.String, + out nativeTensor + )); + + // fill the native tensor, using GetValue(index) from the Tensor + string[] stringsInTensor = new string[tensorValue.Length]; + for (int i = 0; i < tensorValue.Length; i++) + { + stringsInTensor[i] = tensorValue.GetValue(i); + } + NativeApiStatus.VerifySuccess(NativeMethods.OrtFillStringTensor(nativeTensor, stringsInTensor, (UIntPtr)tensorValue.Length)); + } + catch (OnnxRuntimeException e) + { + if (nativeTensor != IntPtr.Zero) + { + NativeMethods.OrtReleaseValue(nativeTensor); + throw e; + } + } + + onnxValue = nativeTensor; // set the output + pinnedMemoryHandle = default; // dummy value for the output } - catch (OnnxRuntimeException e) + else { - pinnedMemoryHandle.Dispose(); - throw e; + 
Debug.Assert(dataBufferPointer != IntPtr.Zero, "dataBufferPointer must be non-null after obtaining the pinned buffer"); + + // copy to an ulong[] shape to match size_t[] + long[] longShape = new long[rank]; + for (int i = 0; i < rank; i++) + { + longShape[i] = shape[i]; + } + + IntPtr status = NativeMethods.OrtCreateTensorWithDataAsOrtValue( + NativeMemoryAllocatorInfo.DefaultInstance.Handle, + dataBufferPointer, + (UIntPtr)(dataBufferLength), + longShape, + (UIntPtr)rank, + nativeElementType, + out onnxValue + ); + try + { + NativeApiStatus.VerifySuccess(status); + } + catch (OnnxRuntimeException e) + { + pinnedMemoryHandle.Dispose(); + throw e; + } + } } @@ -224,7 +285,9 @@ out TensorElementType nativeElementType dataBufferLength = 0; shape = null; rank = 0; - pinnedMemoryHandle = default(MemoryHandle); + pinnedMemoryHandle = default; + + Debug.Assert(typeof(T) != typeof(string), "NamedOnnxValue.TryPinAsTensor() must not be called with a string Tensor value"); if (_value is Tensor) { @@ -299,15 +362,21 @@ out TensorElementType nativeElementType nativeElementType = TensorElementType.UInt8; dataBufferLength = dt.Buffer.Length * sizeof(byte); } + else if (typeof(T) == typeof(sbyte)) + { + nativeElementType = TensorElementType.Int8; + dataBufferLength = dt.Buffer.Length * sizeof(sbyte); + } else if (typeof(T) == typeof(string)) { nativeElementType = TensorElementType.String; dataBufferLength = dt.Buffer.Length * IntPtr.Size; } - //TODO: Not supporting boolean for now. bool is non-blittable, the interop needs some care, and possibly need to copy - //else if (typeof(T) == typeof(bool)) - //{ - //} + else if (typeof(T) == typeof(bool)) + { + nativeElementType = TensorElementType.Bool; + dataBufferLength = dt.Buffer.Length * sizeof(bool); // Assumes sizeof(BOOL) is always 1 byte in native + } else { //TODO: may extend the supported types @@ -397,10 +466,18 @@ public static void GetTypeAndWidth(TensorElementType elemType, out Type type, ou type = typeof(byte); width = sizeof(byte); break; + case TensorElementType.Int8: + type = typeof(sbyte); + width = sizeof(sbyte); + break; case TensorElementType.String: type = typeof(byte); width = sizeof(byte); break; + case TensorElementType.Bool: + type = typeof(bool); + width = sizeof(bool); + break; default: type = null; width = 0; diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/NativeMemoryAllocator.cs b/csharp/src/Microsoft.ML.OnnxRuntime/NativeMemoryAllocator.cs index a9b4e60f5ac58..50b961bf23eec 100644 --- a/csharp/src/Microsoft.ML.OnnxRuntime/NativeMemoryAllocator.cs +++ b/csharp/src/Microsoft.ML.OnnxRuntime/NativeMemoryAllocator.cs @@ -77,22 +77,18 @@ protected override bool ReleaseHandle() internal class NativeMemoryAllocator : SafeHandle { - protected static readonly Lazy _defaultInstance = new Lazy(CreateDefaultCpuAllocator); + protected static readonly Lazy _defaultInstance = new Lazy(GetDefaultCpuAllocator); - private static NativeMemoryAllocator CreateDefaultCpuAllocator() + private static NativeMemoryAllocator GetDefaultCpuAllocator() { IntPtr allocator = IntPtr.Zero; try { - IntPtr status = NativeMethods.OrtCreateDefaultAllocator(out allocator); + IntPtr status = NativeMethods.OrtGetAllocatorWithDefaultOptions(out allocator); NativeApiStatus.VerifySuccess(status); } catch (Exception e) { - if (allocator != IntPtr.Zero) - { - Delete(allocator); - } throw e; } @@ -124,7 +120,7 @@ public override bool IsInvalid } } - internal IntPtr Handle + internal IntPtr Handle { get { @@ -138,15 +134,8 @@ protected NativeMemoryAllocator(IntPtr allocator) 
this.handle = allocator; } - - protected static void Delete(IntPtr allocator) - { - NativeMethods.OrtReleaseAllocator(allocator); - } - protected override bool ReleaseHandle() { - Delete(this.handle); return true; } } diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/NativeMethods.cs b/csharp/src/Microsoft.ML.OnnxRuntime/NativeMethods.cs index 4c213ec66d58e..03650989e4604 100644 --- a/csharp/src/Microsoft.ML.OnnxRuntime/NativeMethods.cs +++ b/csharp/src/Microsoft.ML.OnnxRuntime/NativeMethods.cs @@ -130,6 +130,9 @@ IntPtr[] outputValues /* An array of output value pointers. Array must be alloca [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /*(OrtStatus*)*/ OrtDisableSequentialExecution(IntPtr /*(OrtSessionOptions*)*/ options); + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtSetOptimizedModelFilePath(IntPtr /* OrtSessionOptions* */ options, [MarshalAs(UnmanagedType.LPWStr)]string optimizedModelFilepath); + [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /*(OrtStatus*)*/ OrtEnableProfiling(IntPtr /* OrtSessionOptions* */ options, string profilePathPrefix); @@ -154,11 +157,14 @@ IntPtr[] outputValues /* An array of output value pointers. Array must be alloca [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /*(OrtStatus*)*/ OrtSetSessionLogVerbosityLevel(IntPtr /* OrtSessionOptions* */ options, LogLevel sessionLogVerbosityLevel); + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtSetSessionLogSeverityLevel(IntPtr /* OrtSessionOptions* */ options, LogLevel sessionLogSeverityLevel); + [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /*(OrtStatus*)*/ OrtSetSessionThreadPoolSize(IntPtr /* OrtSessionOptions* */ options, int sessionThreadPoolSize); [DllImport(nativeLib, CharSet = charSet)] - public static extern IntPtr /*(OrtStatus*)*/ OrtSetSessionGraphOptimizationLevel(IntPtr /* OrtSessionOptions* */ options, uint graphOptimizationLevel); + public static extern IntPtr /*(OrtStatus*)*/ OrtSetSessionGraphOptimizationLevel(IntPtr /* OrtSessionOptions* */ options, GraphOptimizationLevel graphOptimizationLevel); ///** @@ -175,12 +181,43 @@ IntPtr[] outputValues /* An array of output value pointers. 
Array must be alloca [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /*(OrtStatus*)*/ OrtSessionOptionsAppendExecutionProvider_CUDA(IntPtr /*(OrtSessionOptions*) */ options, int device_id); - //[DllImport(nativeLib, CharSet = charSet)] - //public static extern IntPtr /*(OrtStatus*)*/ OrtCreateNupharExecutionProviderFactory(int device_id, string target_str, out IntPtr /*(OrtProviderFactoryPtr**)*/ factory); + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtSessionOptionsAppendExecutionProvider_Nuphar(IntPtr /*(OrtSessionOptions*) */ options, int allow_unaligned_buffers, string settings); //[DllImport(nativeLib, CharSet = charSet)] //public static extern void OrtAddCustomOp(IntPtr /*(OrtSessionOptions*)*/ options, string custom_op_path); + #endregion + + #region RunOptions API + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtCreateRunOptions( out IntPtr /* OrtRunOptions** */ runOptions); + + [DllImport(nativeLib, CharSet = charSet)] + public static extern void OrtReleaseRunOptions(IntPtr /*(OrtRunOptions*)*/options); + + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtRunOptionsSetRunLogVerbosityLevel(IntPtr /* OrtRunOptions* */ options, LogLevel value); + + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtRunOptionsSetRunTag(IntPtr /* OrtRunOptions* */ options, string /* const char* */ runTag); + + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtRunOptionsGetRunLogVerbosityLevel(IntPtr /* OrtRunOptions* */ options, out LogLevel verbosityLevel); + + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtRunOptionsGetRunTag(IntPtr /* const OrtRunOptions* */options, out IntPtr /* const char** */ runtag); + + // Set a flag so that any running OrtRun* calls that are using this instance of OrtRunOptions + // will exit as soon as possible if the flag is true. 
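A sketch of how the terminate flag exposed here might be used through the managed RunOptions wrapper added later in this change; the worker-thread pattern and names are illustrative, and a terminated run is assumed to surface an error rather than return results.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime;

static class TerminateSketch
{
    // Starts a run on a worker thread, then requests early exit through the shared RunOptions.
    public static void RunThenCancel(InferenceSession session,
                                     IReadOnlyCollection<NamedOnnxValue> inputs,
                                     string[] outputNames)
    {
        using (var runOptions = new RunOptions())
        {
            var work = Task.Run(() =>
            {
                using (session.Run(inputs, outputNames, runOptions)) { }
            });

            runOptions.Terminate = true;   // any in-flight Run using this instance is asked to stop

            try { work.Wait(); }
            catch (AggregateException) { /* a terminated run is expected to fail with an error status */ }
        }
    }
}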
+ [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtRunOptionsEnableTerminate(IntPtr /* OrtRunOptions* */ options); + + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtRunOptionsDisableTerminate(IntPtr /* OrtRunOptions* */ options); + + + #endregion #region Allocator/AllocatorInfo API @@ -223,10 +260,7 @@ public enum MemoryType public static extern void OrtReleaseAllocatorInfo(IntPtr /*(OrtAllocatorInfo*)*/ allocatorInfo); [DllImport(nativeLib, CharSet = charSet)] - public static extern IntPtr /*(OrtStatus*)*/OrtCreateDefaultAllocator(out IntPtr /*(OrtAllocator**)*/ allocator); - - [DllImport(nativeLib, CharSet = charSet)] - public static extern void OrtReleaseAllocator(IntPtr /*(OrtAllocator*)*/ allocator); + public static extern IntPtr /*(OrtStatus*)*/OrtGetAllocatorWithDefaultOptions(out IntPtr /*(OrtAllocator**)*/ allocator); /// /// Release any object allocated by an allocator @@ -261,6 +295,14 @@ public enum MemoryType [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /*(OrtStatus*)*/ OrtGetTypeInfo(IntPtr /*(OrtValue*)*/ value, IntPtr /*(OrtValue**)*/ typeInfo); + [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtCreateTensorAsOrtValue( + IntPtr /*_Inout_ OrtAllocator* */ allocator, + long[] /*_In_ const int64_t* */ shape, + UIntPtr /*size_t*/ shape_len, + TensorElementType type, + out IntPtr /* OrtValue** */ outputValue); + [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /* OrtStatus */ OrtCreateTensorWithDataAsOrtValue( IntPtr /* (const OrtAllocatorInfo*) */ allocatorInfo, @@ -276,6 +318,15 @@ public enum MemoryType [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /*(OrtStatus*)*/ OrtGetTensorMutableData(IntPtr /*(OrtValue*)*/ value, out IntPtr /* (void**)*/ dataBufferHandle); + + /// \param value A tensor created from OrtCreateTensor... function. + /// \param len total data length, not including the trailing '\0' chars. 
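A short sketch of feeding a string tensor through the new path, assuming the DenseTensor indexer from the bundled Tensors namespace and a hypothetical input name; on Run() the managed wrapper now allocates a native string tensor and copies each element as UTF-8 instead of pinning managed memory.

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

static class StringTensorSketch
{
    // Builds a 1x2 string tensor input; the wrapper routes it through
    // OrtCreateTensorAsOrtValue + OrtFillStringTensor when the session runs.
    public static NamedOnnxValue MakeStringInput()
    {
        var labels = new DenseTensor<string>(new[] { 1, 2 });
        labels[0, 0] = "cat";
        labels[0, 1] = "dog";
        return NamedOnnxValue.CreateFromTensor("labels", labels);   // "labels" is a placeholder input name
    }
}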
+ [DllImport(nativeLib, CharSet = charSet)] + public static extern IntPtr /*(OrtStatus*)*/ OrtFillStringTensor( + IntPtr /* OrtValue */ value, + string[] /* const char* const* */s, + UIntPtr /* size_t */ s_len); + [DllImport(nativeLib, CharSet = charSet)] public static extern IntPtr /*(OrtStatus*)*/ OrtGetStringTensorContent( IntPtr /*(OrtValue*)*/ value, diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/OnnxRuntime.cs b/csharp/src/Microsoft.ML.OnnxRuntime/OnnxRuntime.cs index 62141e9a81362..3069b4cb71b4c 100644 --- a/csharp/src/Microsoft.ML.OnnxRuntime/OnnxRuntime.cs +++ b/csharp/src/Microsoft.ML.OnnxRuntime/OnnxRuntime.cs @@ -15,7 +15,7 @@ internal struct GlobalOptions //Options are currently not accessible to user public LogLevel LogLevel { get; set; } } - internal enum LogLevel + public enum LogLevel { Verbose = 0, Info = 1, @@ -51,6 +51,9 @@ public override bool IsInvalid private OnnxRuntime() //Problem: it is not possible to pass any option for a Singleton :base(IntPtr.Zero, true) { + // Check LibC version on Linux, before doing any onnxruntime initialization + CheckLibcVersionGreaterThanMinimum(); + handle = IntPtr.Zero; try { @@ -78,5 +81,32 @@ protected override bool ReleaseHandle() Delete(handle); return true; } + + [DllImport("libc", ExactSpelling = true, CallingConvention = CallingConvention.Cdecl)] + private static extern IntPtr gnu_get_libc_version(); + + private static void CheckLibcVersionGreaterThanMinimum() + { + // require libc version 2.23 or higher + var minVersion = new Version(2, 23); + var curVersion = new Version(0, 0); + if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux)) + { + try + { + curVersion = Version.Parse(Marshal.PtrToStringAnsi(gnu_get_libc_version())); + if (curVersion >= minVersion) + return; + } + catch (Exception) + { + // trap any obscure exception + } + throw new OnnxRuntimeException(ErrorCode.RuntimeException, + $"libc.so version={curVersion} does not meet the minimun of 2.23 required by OnnxRuntime. " + + "Linux distribution should be similar to Ubuntu 16.04 or higher"); + } + } + } } \ No newline at end of file diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/RunOptions.cs b/csharp/src/Microsoft.ML.OnnxRuntime/RunOptions.cs new file mode 100644 index 0000000000000..b40c795757397 --- /dev/null +++ b/csharp/src/Microsoft.ML.OnnxRuntime/RunOptions.cs @@ -0,0 +1,120 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. +using System; +using System.Runtime.InteropServices; + +namespace Microsoft.ML.OnnxRuntime +{ + /// Sets various runtime options. + public class RunOptions: IDisposable + { + private IntPtr _nativePtr; + internal IntPtr Handle + { + get + { + return _nativePtr; + } + } + + + public RunOptions() + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtCreateRunOptions(out _nativePtr)); + } + + + /// + /// LogVerbosityLevel for the Run + /// default == LogLevel.Verbose + /// + public LogLevel LogVerbosityLevel + { + get + { + LogLevel level; + NativeApiStatus.VerifySuccess(NativeMethods.OrtRunOptionsGetRunLogVerbosityLevel(_nativePtr, out level)); + return level; + } + set + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtRunOptionsSetRunLogVerbosityLevel(_nativePtr, value)); + } + } + + + /// + /// Log tag to be used during the run. 
default = "" + /// + public string LogTag + { + get + { + string tag = null; + IntPtr tagPtr = IntPtr.Zero; + NativeApiStatus.VerifySuccess(NativeMethods.OrtRunOptionsGetRunTag(_nativePtr, out tagPtr)); + tag = Marshal.PtrToStringAnsi(tagPtr); // assume ANSI string + // should not release the memory of the tagPtr, because it returns the c_str() of the std::string being used inside RunOptions C++ class + return tag; + } + set + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtRunOptionsSetRunTag(_nativePtr, value)); + } + } + + + /// + /// Sets a flag to terminate any other Run() call that is using the same RunOptions object + /// Default = false + /// + public bool Terminate + { + get + { + return _terminate; + } + set + { + if (!_terminate && value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtRunOptionsEnableTerminate(_nativePtr)); + _terminate = true; + } + else if (_terminate && !value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtRunOptionsDisableTerminate(_nativePtr)); + _terminate = false; + } + } + } + private bool _terminate = false; //value set to default value of the C++ RunOptions + + + #region destructors disposers + + ~RunOptions() + { + Dispose(false); + } + + + public void Dispose() + { + GC.SuppressFinalize(this); + Dispose(true); + } + + + protected virtual void Dispose(bool disposing) + { + if (disposing) + { + // cleanup managed resources + } + NativeMethods.OrtReleaseRunOptions(_nativePtr); + } + + #endregion + } +} \ No newline at end of file diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/SessionOptions.cs b/csharp/src/Microsoft.ML.OnnxRuntime/SessionOptions.cs index 4ce708687ef79..56c597f9e8411 100644 --- a/csharp/src/Microsoft.ML.OnnxRuntime/SessionOptions.cs +++ b/csharp/src/Microsoft.ML.OnnxRuntime/SessionOptions.cs @@ -4,117 +4,304 @@ using System; using System.Text; using System.Runtime.InteropServices; +using System.IO; namespace Microsoft.ML.OnnxRuntime { + /// + /// TODO Add documentation about which optimizations are enabled for each value. + /// + public enum GraphOptimizationLevel + { + ORT_DISABLE_ALL = 0, + ORT_ENABLE_BASIC = 1, + ORT_ENABLE_EXTENDED = 2, + ORT_ENABLE_ALL = 99 + } + /// /// Holds the options for creating an InferenceSession /// - public class SessionOptions:IDisposable + public class SessionOptions : IDisposable { - public IntPtr _nativePtr; - protected static readonly Lazy _default = new Lazy(MakeSessionOptionWithCpuProvider); + private IntPtr _nativePtr; private static string[] cudaDelayLoadedLibs = { "cublas64_100.dll", "cudnn64_7.dll" }; + #region Constructor and Factory methods + /// /// Constructs an empty SessionOptions /// public SessionOptions() { - NativeMethods.OrtCreateSessionOptions(out _nativePtr); + NativeApiStatus.VerifySuccess(NativeMethods.OrtCreateSessionOptions(out _nativePtr)); } + /// - /// Sets the graph optimization level for the session. Default is set to 1. 
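For reference, a minimal sketch of how the RunOptions wrapper introduced above might be used from application code (illustrative only, assuming the Microsoft.ML.OnnxRuntime namespace; each setter forwards to the matching OrtRunOptions* native call shown earlier in this patch):

    using Microsoft.ML.OnnxRuntime;

    // Per-Run options; disposing releases the native OrtRunOptions handle.
    using (var runOptions = new RunOptions())
    {
        runOptions.LogTag = "my-run";                  // OrtRunOptionsSetRunTag
        runOptions.LogVerbosityLevel = LogLevel.Info;  // OrtRunOptionsSetRunLogVerbosityLevel
        // Flipping Terminate to true from another thread asks any in-flight Run() call
        // sharing this RunOptions instance to stop early (OrtRunOptionsEnableTerminate).
    }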
+ /// A helper method to constuct a SessionOptions object for CUDA execution /// - /// optimization level for the session - /// Available options are : 0, 1, 2 - /// 0 -> Disable all optimizations - /// 1 -> Enable basic optimizations - /// 2 -> Enable all optimizations - public void SetSessionGraphOptimizationLevel(uint optimization_level) + /// A SessionsOptions() object configured for execution on deviceId=0 + public static SessionOptions MakeSessionOptionWithCudaProvider() { - NativeApiStatus.VerifySuccess(NativeMethods.OrtSetSessionGraphOptimizationLevel(_nativePtr, optimization_level)); + return MakeSessionOptionWithCudaProvider(0); } + /// - /// Enable Sequential Execution. By default, it is enabled. + /// A helper method to constuct a SessionOptions object for CUDA execution /// - /// - public void EnableSequentialExecution() + /// + /// A SessionsOptions() object configured for execution on deviceId + public static SessionOptions MakeSessionOptionWithCudaProvider(int deviceId = 0) { - NativeApiStatus.VerifySuccess(NativeMethods.OrtEnableSequentialExecution(_nativePtr)); + CheckCudaExecutionProviderDLLs(); + SessionOptions options = new SessionOptions(); + NativeMethods.OrtSessionOptionsAppendExecutionProvider_CUDA(options._nativePtr, deviceId); + NativeMethods.OrtSessionOptionsAppendExecutionProvider_CPU(options._nativePtr, 1); + return options; } /// - /// Disable Sequential Execution and enable Parallel Execution. + /// A helper method to construct a SessionOptions object for Nuphar execution /// - /// - public void DisableSequentialExecution() + /// settings string, comprises of comma separated key:value pairs. default is empty + /// A SessionsOptions() object configured for execution with Nuphar + public static SessionOptions MakeSessionOptionWithNupharProvider(String settings = "") + { + SessionOptions options = new SessionOptions(); + NativeMethods.OrtSessionOptionsAppendExecutionProvider_Nuphar(options._nativePtr, 1, settings); + return options; + } + + #endregion + + #region Public Properties + + internal IntPtr Handle { - NativeApiStatus.VerifySuccess(NativeMethods.OrtDisableSequentialExecution(_nativePtr)); + get + { + return _nativePtr; + } } + /// - /// Enable Mem Pattern. By default, it is enabled + /// Enable Sequential Execution. Default = true. /// /// - public void EnableMemPattern() + /// + public bool EnableSequentialExecution { - NativeApiStatus.VerifySuccess(NativeMethods.OrtEnableMemPattern(_nativePtr)); + get + { + return _enableSequentialExecution; + } + set + { + if (!_enableSequentialExecution && value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtEnableSequentialExecution(_nativePtr)); + _enableSequentialExecution = true; + } + else if (_enableSequentialExecution && !value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtDisableSequentialExecution(_nativePtr)); + _enableSequentialExecution = false; + } + } } + private bool _enableSequentialExecution = true; + /// - /// Disable Mem Pattern. + /// Enables the use of the memory allocation patterns in the first Run() call for subsequent runs. Default = true. 
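A sketch of how the factory helpers above are intended to be called (illustrative; device id 0 and the empty Nuphar settings string are placeholders, not recommendations):

    using Microsoft.ML.OnnxRuntime;

    // GPU session: appends the CUDA execution provider on device 0, with CPU as fallback.
    SessionOptions cudaOptions = SessionOptions.MakeSessionOptionWithCudaProvider(deviceId: 0);

    // Nuphar session: the settings string is a comma-separated list of key:value pairs.
    SessionOptions nupharOptions = SessionOptions.MakeSessionOptionWithNupharProvider("");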
/// - /// - public void DisableMemPattern() + public bool EnableMemoryPattern { - NativeApiStatus.VerifySuccess(NativeMethods.OrtDisableMemPattern(_nativePtr)); + get + { + return _enableMemoryPattern; + } + set + { + if (!_enableMemoryPattern && value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtEnableMemPattern(_nativePtr)); + _enableMemoryPattern = true; + } + else if (_enableMemoryPattern && !value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtDisableMemPattern(_nativePtr)); + _enableMemoryPattern = false; + } + } } + private bool _enableMemoryPattern = true; + /// - /// Default instance + /// Path prefix to use for output of profiling data /// - public static SessionOptions Default + public string ProfileOutputPathPrefix + { + get; set; + } = "onnxruntime_profile_"; // this is the same default in C++ implementation + + + + /// + /// Enables profiling of InferenceSession.Run() calls. Default is false + /// + public bool EnableProfiling { get { - return _default.Value; + return _enableProfiling; + } + set + { + if (!_enableProfiling && value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtEnableProfiling(_nativePtr, ProfileOutputPathPrefix)); + _enableProfiling = true; + } + else if (_enableProfiling && !value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtDisableProfiling(_nativePtr)); + _enableProfiling = false; + } } } + private bool _enableProfiling = false; - private static SessionOptions MakeSessionOptionWithCpuProvider() + /// + /// Set filepath to save optimized model after graph level transformations. Default is empty, which implies saving is disabled. + /// + public string OptimizedModelFilePath { - CheckLibcVersionGreaterThanMinimum(); - SessionOptions options = new SessionOptions(); - NativeMethods.OrtSessionOptionsAppendExecutionProvider_CPU(options._nativePtr, 1); - return options; + get + { + return _optimizedModelFilePath; + } + set + { + if (value != _optimizedModelFilePath) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtSetOptimizedModelFilePath(_nativePtr, value)); + _optimizedModelFilePath = value; + } + } } + private string _optimizedModelFilePath = ""; + + /// - /// A helper method to constuct a SessionOptions object for CUDA execution + /// Enables Arena allocator for the CPU memory allocations. Default is true. /// - /// A SessionsOptions() object configured for execution on deviceId=0 - public static SessionOptions MakeSessionOptionWithCudaProvider() + public bool EnableCpuMemArena { - return MakeSessionOptionWithCudaProvider(0); + get + { + return _enableCpuMemArena; + } + set + { + if (!_enableCpuMemArena && value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtEnableCpuMemArena(_nativePtr)); + _enableCpuMemArena = true; + } + else if (_enableCpuMemArena && !value) + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtDisableCpuMemArena(_nativePtr)); + _enableCpuMemArena = false; + } + } } + private bool _enableCpuMemArena = true; + /// - /// A helper method to constuct a SessionOptions object for CUDA execution + /// Log Id to be used for the session. Default is empty string. + /// TODO: Should it be named LogTag as in RunOptions? 
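With the setter/getter methods replaced by properties, configuration now reads as below (a sketch; the profile prefix and output path are placeholder values):

    using Microsoft.ML.OnnxRuntime;

    var options = new SessionOptions
    {
        EnableSequentialExecution = true,
        EnableMemoryPattern = true,
        EnableCpuMemArena = true,
        ProfileOutputPathPrefix = "my_profile_",          // read when EnableProfiling is switched on
        EnableProfiling = true,
        OptimizedModelFilePath = "optimized_model.onnx",  // empty string disables saving
        GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    };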
/// - /// - /// A SessionsOptions() object configured for execution on deviceId - public static SessionOptions MakeSessionOptionWithCudaProvider(int deviceId=0) + public string LogId { - CheckLibcVersionGreaterThanMinimum(); - CheckCudaExecutionProviderDLLs(); - SessionOptions options = new SessionOptions(); - NativeMethods.OrtSessionOptionsAppendExecutionProvider_CUDA(options._nativePtr, deviceId); - NativeMethods.OrtSessionOptionsAppendExecutionProvider_CPU(options._nativePtr, 1); - return options; + get + { + return _logId; + } + + set + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtSetSessionLogId(_nativePtr, value)); + _logId = value; + } + } + private string _logId = ""; + + + /// + /// Log Verbosity Level for the session logs. Default = LogLevel.Verbose + /// + public LogLevel LogVerbosityLevel + { + get + { + return _logVerbosityLevel; + } + set + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtSetSessionLogVerbosityLevel(_nativePtr, value)); + _logVerbosityLevel = value; + } + } + private LogLevel _logVerbosityLevel = LogLevel.Verbose; + + + /// + /// Threadpool size for the session.Run() calls. + /// Default = 0, meaning threadpool size is aumatically selected from number of available cores. + /// + public int ThreadPoolSize + { + get + { + return _threadPoolSize; + } + set + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtSetSessionThreadPoolSize(_nativePtr, value)); + _threadPoolSize = value; + } } + private int _threadPoolSize = 0; // set to what is set in C++ SessionOptions by default; + + + /// + /// Sets the graph optimization level for the session. Default is set to ORT_ENABLE_BASIC. + /// + public GraphOptimizationLevel GraphOptimizationLevel + { + get + { + return _graphOptimizationLevel; + } + set + { + NativeApiStatus.VerifySuccess(NativeMethods.OrtSetSessionGraphOptimizationLevel(_nativePtr, value)); + _graphOptimizationLevel = value; + } + } + private GraphOptimizationLevel _graphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_BASIC; + + #endregion + + #region Private Methods + // Declared, but called only if OS = Windows. [DllImport("kernel32.dll")] @@ -130,45 +317,21 @@ private static bool CheckCudaExecutionProviderDLLs() { IntPtr handle = LoadLibrary(dll); if (handle != IntPtr.Zero) - continue; + continue; var sysdir = new StringBuilder(String.Empty, 2048); GetSystemDirectory(sysdir, (uint)sysdir.Capacity); throw new OnnxRuntimeException( - ErrorCode.NoSuchFile, + ErrorCode.NoSuchFile, $"kernel32.LoadLibrary():'{dll}' not found. CUDA is required for GPU execution. " + $". Verify it is available in the system directory={sysdir}. Else copy it to the output folder." - ); + ); } - } + } return true; } - [DllImport("libc", ExactSpelling = true, CallingConvention = CallingConvention.Cdecl)] - private static extern IntPtr gnu_get_libc_version(); - - private static void CheckLibcVersionGreaterThanMinimum() - { - // require libc version 2.23 or higher - var minVersion = new Version(2, 23); - var curVersion = new Version(0, 0); - if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux)) - { - try - { - curVersion = Version.Parse(Marshal.PtrToStringAnsi(gnu_get_libc_version())); - if (curVersion >= minVersion) - return; - } - catch (Exception) - { - // trap any obscure exception - } - throw new OnnxRuntimeException(ErrorCode.RuntimeException, - $"libc.so version={curVersion} does not meet the minimun of 2.23 required by OnnxRuntime. 
" + - "Linux distribution should be similar to Ubuntu 16.04 or higher"); - } - } + #endregion #region destructors disposers ~SessionOptions() diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/ArrayTensorExtensions.cs b/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/ArrayTensorExtensions.cs new file mode 100644 index 0000000000000..5189ddf71e300 --- /dev/null +++ b/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/ArrayTensorExtensions.cs @@ -0,0 +1,66 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/src/System/Numerics/Tensors/ArrayTensorExtensions.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors +{ + public static class ArrayTensorExtensions + { + /// + /// Creates a copy of this single-dimensional array as a DenseTensor<T> + /// + /// Type contained in the array to copy to the DenseTensor<T>. + /// The array to create a DenseTensor<T> from. + /// A 1-dimensional DenseTensor<T> with the same length and content as . + public static DenseTensor ToTensor(this T[] array) + { + return new DenseTensor(array); + } + + /// + /// Creates a copy of this two-dimensional array as a DenseTensor<T> + /// + /// Type contained in the array to copy to the DenseTensor<T>. + /// The array to create a DenseTensor<T> from. + /// False (default) to indicate that the first dimension is most major (farthest apart) and the last dimension is most minor (closest together): row-major. True to indicate that the last dimension is most major (farthest apart) and the first dimension is most minor (closest together): column-major. + /// A 2-dimensional DenseTensor<T> with the same dimensions and content as . + public static DenseTensor ToTensor(this T[,] array, bool reverseStride = false) + { + return new DenseTensor(array, reverseStride); + } + + /// + /// Creates a copy of this three-dimensional array as a DenseTensor<T> + /// + /// Type contained in the array to copy to the DenseTensor<T>. + /// The array to create a DenseTensor<T> from. + /// False (default) to indicate that the first dimension is most major (farthest apart) and the last dimension is most minor (closest together): akin to row-major in a rank-2 tensor. True to indicate that the last dimension is most major (farthest apart) and the first dimension is most minor (closest together): akin to column-major in a rank-2 tensor. + /// A 3-dimensional DenseTensor<T> with the same dimensions and content as . + public static DenseTensor ToTensor(this T[,,] array, bool reverseStride = false) + { + return new DenseTensor(array, reverseStride); + } + + /// + /// Creates a copy of this n-dimensional array as a DenseTensor<T> + /// + /// Type contained in the array to copy to the DenseTensor<T>. + /// The array to create a DenseTensor<T> from. + /// False (default) to indicate that the first dimension is most major (farthest apart) and the last dimension is most minor (closest together): akin to row-major in a rank-2 tensor. 
True to indicate that the last dimension is most major (farthest apart) and the first dimension is most minor (closest together): akin to column-major in a rank-2 tensor. + /// A n-dimensional DenseTensor<T> with the same dimensions and content as . + public static DenseTensor ToTensor(this Array array, bool reverseStride = false) + { + return new DenseTensor(array, reverseStride); + } + } +} diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/ArrayUtilities.cs b/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/ArrayUtilities.cs new file mode 100644 index 0000000000000..2913799968930 --- /dev/null +++ b/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/ArrayUtilities.cs @@ -0,0 +1,227 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/src/System/Numerics/Tensors/ArrayUtilities.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System.Diagnostics; +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors +{ + internal static class ArrayUtilities + { + public const int StackallocMax = 16; + + public static long GetProduct(ReadOnlySpan dimensions, int startIndex = 0) + { + if (dimensions.Length == 0) + { + return 0; + } + + long product = 1; + for (int i = startIndex; i < dimensions.Length; i++) + { + if (dimensions[i] < 0) + { + throw new ArgumentOutOfRangeException($"{nameof(dimensions)}[{i}]"); + } + + // we use a long which should be much larger than is ever used here, + // but still force checked + checked + { + product *= dimensions[i]; + } + } + + return product; + } + + public static bool IsAscending(ReadOnlySpan values) + { + for (int i = 1; i < values.Length; i++) + { + if (values[i] < values[i - 1]) + { + return false; + } + } + + return true; + } + + public static bool IsDescending(ReadOnlySpan values) + { + for (int i = 1; i < values.Length; i++) + { + if (values[i] > values[i - 1]) + { + return false; + } + } + + return true; + } + + /// + /// Gets the set of strides that can be used to calculate the offset of n-dimensions in a 1-dimensional layout + /// + /// + /// + /// + public static int[] GetStrides(ReadOnlySpan dimensions, bool reverseStride = false) + { + int[] strides = new int[dimensions.Length]; + + int stride = 1; + if (reverseStride) + { + for (int i = 0; i < strides.Length; i++) + { + strides[i] = stride; + stride *= dimensions[i]; + } + } + else + { + for (int i = strides.Length - 1; i >= 0; i--) + { + strides[i] = stride; + stride *= dimensions[i]; + } + } + + return strides; + } + + public static void SplitStrides(int[] strides, int[] splitAxes, int[] newStrides, int stridesOffset, int[] splitStrides, int splitStridesOffset) + { + int newStrideIndex = 0; + for (int i = 0; i < strides.Length; i++) + { + int stride = strides[i]; + bool isSplit = false; + for (int j = 0; j < splitAxes.Length; j++) + { + if (splitAxes[j] == i) + { + splitStrides[splitStridesOffset + j] = stride; + isSplit = true; + break; + } + } + + if (!isSplit) + { + newStrides[stridesOffset + newStrideIndex++] = stride; + } + } + } + + /// + /// Calculates the 1-d index for n-d indices in layout specified by strides. 
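To make the stride convention concrete, here is a small worked example of the row-major case produced by GetStrides above and consumed by GetIndex, whose implementation follows. ArrayUtilities itself is internal, so the sketch goes through the public tensor API added later in this patch:

    using Microsoft.ML.OnnxRuntime.Tensors;

    // Dimensions {2, 3, 4} with reverseStride = false (row-major) give strides {12, 4, 1}:
    // the last dimension is contiguous, and each earlier stride is the product of the later dimensions.
    var t = new DenseTensor<int>(new[] { 2, 3, 4 });

    // Linearized index of element (1, 2, 3) = 1*12 + 2*4 + 3*1 = 23.
    int linear = 1 * t.Strides[0] + 2 * t.Strides[1] + 3 * t.Strides[2];
    t.SetValue(linear, 42);
    System.Diagnostics.Debug.Assert(t[1, 2, 3] == 42);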
+ /// + /// + /// + /// + /// + public static int GetIndex(int[] strides, ReadOnlySpan indices, int startFromDimension = 0) + { + Debug.Assert(strides.Length == indices.Length); + + int index = 0; + for (int i = startFromDimension; i < indices.Length; i++) + { + index += strides[i] * indices[i]; + } + + return index; + } + + /// + /// Calculates the n-d indices from the 1-d index in a layout specificed by strides + /// + /// + /// + /// + /// + /// + public static void GetIndices(ReadOnlySpan strides, bool reverseStride, int index, int[] indices, int startFromDimension = 0) + { + Debug.Assert(reverseStride ? IsAscending(strides) : IsDescending(strides), "Index decomposition requires ordered strides"); + Debug.Assert(strides.Length == indices.Length); + + int remainder = index; + for (int i = startFromDimension; i < strides.Length; i++) + { + // reverse the index for reverseStride so that we divide by largest stride first + var nIndex = reverseStride ? strides.Length - 1 - i : i; + + var stride = strides[nIndex]; + indices[nIndex] = remainder / stride; + remainder %= stride; + } + } + + /// + /// Calculates the n-d indices from the 1-d index in a layout specificed by strides + /// + /// + /// + /// + /// + /// + public static void GetIndices(ReadOnlySpan strides, bool reverseStride, int index, Span indices, int startFromDimension = 0) + { + Debug.Assert(reverseStride ? IsAscending(strides) : IsDescending(strides), "Index decomposition requires ordered strides"); + Debug.Assert(strides.Length == indices.Length); + + int remainder = index; + for (int i = startFromDimension; i < strides.Length; i++) + { + // reverse the index for reverseStride so that we divide by largest stride first + var nIndex = reverseStride ? strides.Length - 1 - i : i; + + var stride = strides[nIndex]; + indices[nIndex] = remainder / stride; + remainder %= stride; + } + } + + /// + /// Takes an 1-d index over n-d sourceStrides and recalculates it assuming same n-d coordinates over a different n-d strides + /// + public static int TransformIndexByStrides(int index, int[] sourceStrides, bool sourceReverseStride, int[] transformStrides) + { + Debug.Assert(index >= 0); + Debug.Assert(sourceReverseStride ? IsAscending(sourceStrides) : IsDescending(sourceStrides), "Index decomposition requires ordered strides"); + Debug.Assert(sourceStrides.Length == transformStrides.Length); + + int transformIndex = 0; + int remainder = index; + + for (int i = 0; i < sourceStrides.Length; i++) + { + // reverse the index for reverseStride so that we divide by largest stride first + var nIndex = sourceReverseStride ? sourceStrides.Length - 1 - i: i; + + var sourceStride = sourceStrides[nIndex]; + var transformStride = transformStrides[nIndex]; + + transformIndex += transformStride * (remainder / sourceStride); + remainder %= sourceStride; + } + + return transformIndex; + } + } +} diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/DenseTensor.cs b/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/DenseTensor.cs new file mode 100644 index 0000000000000..efa193a42b1af --- /dev/null +++ b/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/DenseTensor.cs @@ -0,0 +1,188 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
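The extension methods in ArrayTensorExtensions.cs above are the usual entry point for building tensors from managed arrays; a brief illustrative sketch:

    using Microsoft.ML.OnnxRuntime.Tensors;

    float[] flat = { 1f, 2f, 3f, 4f };
    DenseTensor<float> vector = flat.ToTensor();    // rank-1 tensor, Length == 4

    var grid = new float[2, 3];
    DenseTensor<float> rowMajor = grid.ToTensor();                     // strides {3, 1}
    DenseTensor<float> colMajor = grid.ToTensor(reverseStride: true);  // strides {1, 2}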
+ +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/src/System/Numerics/Tensors/DenseTensor.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System.Runtime.InteropServices; +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors +{ + /// + /// Represents a multi-dimensional collection of objects of type T that can be accessed by indices. DenseTensor stores values in a contiguous sequential block of memory where all values are represented. + /// + /// type contained within the Tensor. Typically a value type such as int, double, float, etc. + public class DenseTensor : Tensor + { + private readonly Memory memory; + + internal DenseTensor(Array fromArray, bool reverseStride = false) : base(fromArray, reverseStride) + { + // copy initial array + var backingArray = new T[fromArray.Length]; + + int index = 0; + if (reverseStride) + { + // Array is always row-major + var sourceStrides = ArrayUtilities.GetStrides(dimensions); + + foreach (var item in fromArray) + { + var destIndex = ArrayUtilities.TransformIndexByStrides(index++, sourceStrides, false, strides); + backingArray[destIndex] = (T)item; + } + } + else + { + foreach (var item in fromArray) + { + backingArray[index++] = (T)item; + } + } + memory = backingArray; + } + + /// + /// Initializes a rank-1 Tensor using the specified . + /// + /// Size of the 1-dimensional tensor + public DenseTensor(int length) : base(length) + { + memory = new T[length]; + } + + /// + /// Initializes a rank-n Tensor using the dimensions specified in . + /// + /// An span of integers that represent the size of each dimension of the DenseTensor to create. + /// False (default) to indicate that the first dimension is most major (farthest apart) and the last dimension is most minor (closest together): akin to row-major in a rank-2 tensor. True to indicate that the last dimension is most major (farthest apart) and the first dimension is most minor (closest together): akin to column-major in a rank-2 tensor. + public DenseTensor(ReadOnlySpan dimensions, bool reverseStride = false) : base(dimensions, reverseStride) + { + memory = new T[Length]; + } + + /// + /// Constructs a new DenseTensor of the specifed dimensions, wrapping existing backing memory for the contents. + /// + /// + /// An span of integers that represent the size of each dimension of the DenseTensor to create. + /// False (default) to indicate that the first dimension is most major (farthest apart) and the last dimension is most minor (closest together): akin to row-major in a rank-2 tensor. True to indicate that the last dimension is most major (farthest apart) and the first dimension is most minor (closest together): akin to column-major in a rank-2 tensor. + public DenseTensor(Memory memory, ReadOnlySpan dimensions, bool reverseStride = false) : base(dimensions, reverseStride) + { + this.memory = memory; + + if (Length != memory.Length) + { + throw new ArgumentException($"Length of {nameof(memory)} ({memory.Length}) must match product of {nameof(dimensions)} ({Length})."); + } + } + + /// + /// Memory storing backing values of this tensor. 
+ /// + public Memory Buffer => memory; + + /// + /// Gets the value at the specied index, where index is a linearized version of n-dimension indices using strides. + /// + /// An integer index computed as a dot-product of indices. + /// The value at the specified position in this Tensor. + public override T GetValue(int index) + { + return Buffer.Span[index]; + } + + /// + /// Sets the value at the specied index, where index is a linearized version of n-dimension indices using strides. + /// + /// An integer index computed as a dot-product of indices. + /// The new value to set at the specified position in this Tensor. + public override void SetValue(int index, T value) + { + Buffer.Span[index] = value; + } + + protected override void CopyTo(T[] array, int arrayIndex) + { + if (array == null) + { + throw new ArgumentNullException(nameof(array)); + } + if (array.Length < arrayIndex + Length) + { + throw new ArgumentException("The number of elements in the Tensor is greater than the available space from index to the end of the destination array.", nameof(array)); + } + + Buffer.Span.CopyTo(array.AsSpan(arrayIndex)); + } + + protected override int IndexOf(T item) + { + // TODO: use Span.IndexOf when/if it removes the IEquatable type constraint + if (MemoryMarshal.TryGetArray(Buffer, out var arraySegment)) + { + var result = Array.IndexOf(arraySegment.Array, item, arraySegment.Offset, arraySegment.Count); + if (result != -1) + { + result -= arraySegment.Offset; + } + return result; + } + else + { + return base.IndexOf(item); + } + } + + /// + /// Creates a shallow copy of this tensor, with new backing storage. + /// + /// A shallow copy of this tensor. + public override Tensor Clone() + { + return new DenseTensor(Buffer.ToArray(), dimensions, IsReversedStride); + } + + /// + /// Creates a new Tensor of a different type with the specified dimensions and the same layout as this tensor with elements initialized to their default value. + /// + /// Type contained in the returned Tensor. + /// An span of integers that represent the size of each dimension of the DenseTensor to create. + /// A new tensor with the same layout as this tensor but different type and dimensions. + public override Tensor CloneEmpty(ReadOnlySpan dimensions) + { + return new DenseTensor(dimensions, IsReversedStride); + } + + /// + /// Reshapes the current tensor to new dimensions, using the same backing storage. + /// + /// An span of integers that represent the size of each dimension of the DenseTensor to create. + /// A new tensor that reinterprets backing Buffer of this tensor with different dimensions. + public override Tensor Reshape(ReadOnlySpan dimensions) + { + if (dimensions.Length == 0) + { + throw new ArgumentException("Dimensions must contain elements.", nameof(dimensions)); + } + + var newSize = ArrayUtilities.GetProduct(dimensions); + + if (newSize != Length) + { + throw new ArgumentException($"Cannot reshape array due to mismatch in lengths, currently {Length} would become {newSize}.", nameof(dimensions)); + } + + return new DenseTensor(Buffer, dimensions, IsReversedStride); + } + } +} diff --git a/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/Tensor.cs b/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/Tensor.cs new file mode 100644 index 0000000000000..a8eb8b9a27c3e --- /dev/null +++ b/csharp/src/Microsoft.ML.OnnxRuntime/Tensors/Tensor.cs @@ -0,0 +1,1311 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
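A short sketch of the DenseTensor<T>-specific members defined above: wrapping existing memory, writing through the backing Buffer, and reshaping without copying (illustrative only):

    using System;
    using Microsoft.ML.OnnxRuntime.Tensors;

    float[] backing = new float[6];
    // Wraps the existing array; the memory length must equal the product of the dimensions.
    var t = new DenseTensor<float>(backing.AsMemory(), new[] { 2, 3 });

    t.Buffer.Span.Fill(1.0f);                  // bulk-write through the backing memory
    var reshaped = t.Reshape(new[] { 3, 2 });  // same storage, reinterpreted with new dimensions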
+ +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/src/System/Numerics/Tensors/Tensor.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System.Collections; +using System.Collections.Generic; +using System.Diagnostics; +using System.Text; +using System; +using System.Runtime.CompilerServices; + +// Making this assembly's internals visible to the internal Test assembly +[assembly: InternalsVisibleTo("Microsoft.ML.OnnxRuntime.Tests," + + "PublicKey=002400000480000094000000060200000024000052534131000400000100010059013e94e4bc70" + + "136ca4c35f33acd6b62974536b698f9c7a21cee18d805c7ad860ad9eebfdc47a96ba2f8d03f4cf" + + "1c36b9d30787e276c7b9833b5bf2a6eba7e919e6b90083078a352262aed1d842e5f70a3085cbcf" + + "4c56ae851b161137920961c23fcc246598d61d258ccc615c927b2441359eea666a99ce1c3c07dc" + + "a18fb0e1")] + + +namespace Microsoft.ML.OnnxRuntime.Tensors +{ + /// + /// Various methods for creating and manipulating Tensor<T> + /// + public static partial class Tensor + { + /// + /// Creates an identity tensor of the specified size. An identity tensor is a two dimensional tensor with 1s in the diagonal. + /// + /// type contained within the Tensor. Typically a value type such as int, double, float, etc. + /// Width and height of the identity tensor to create. + /// a by with 1s along the diagonal and zeros elsewhere. + public static Tensor CreateIdentity(int size) + { + return CreateIdentity(size, false, Tensor.One); + } + + /// + /// Creates an identity tensor of the specified size and layout (row vs column major). An identity tensor is a two dimensional tensor with 1s in the diagonal. + /// + /// type contained within the Tensor. Typically a value type such as int, double, float, etc. + /// Width and height of the identity tensor to create. + /// >False to indicate that the first dimension is most minor (closest) and the last dimension is most major (farthest): row-major. True to indicate that the last dimension is most minor (closest together) and the first dimension is most major (farthest apart): column-major. + /// a by with 1s along the diagonal and zeros elsewhere. + public static Tensor CreateIdentity(int size, bool columMajor) + { + return CreateIdentity(size, columMajor, Tensor.One); + } + + /// + /// Creates an identity tensor of the specified size and layout (row vs column major) using the specified one value. An identity tensor is a two dimensional tensor with 1s in the diagonal. This may be used in case T is a type that doesn't have a known 1 value. + /// + /// type contained within the Tensor. Typically a value type such as int, double, float, etc. + /// Width and height of the identity tensor to create. + /// >False to indicate that the first dimension is most minor (closest) and the last dimension is most major (farthest): row-major. True to indicate that the last dimension is most minor (closest together) and the first dimension is most major (farthest apart): column-major. + /// Value of that is used along the diagonal. + /// a by with 1s along the diagonal and zeros elsewhere. 
+ public static Tensor CreateIdentity(int size, bool columMajor, T oneValue) + { + unsafe + { + Span dimensions = stackalloc int[2]; + dimensions[0] = dimensions[1] = size; + + var result = new DenseTensor(dimensions, columMajor); + + for (int i = 0; i < size; i++) + { + result.SetValue(i * size + i, oneValue); + } + + return result; + } + } + + /// + /// Creates a n+1-rank tensor using the specified n-rank diagonal. Values not on the diagonal will be filled with zeros. + /// + /// type contained within the Tensor. Typically a value type such as int, double, float, etc. + /// Tensor representing the diagonal to build the new tensor from. + /// A new tensor of the same layout and order as of one higher rank, with the values of along the diagonal and zeros elsewhere. + public static Tensor CreateFromDiagonal(Tensor diagonal) + { + return CreateFromDiagonal(diagonal, 0); + } + + /// + /// Creates a n+1-dimension tensor using the specified n-dimension diagonal at the specified offset from the center. Values not on the diagonal will be filled with zeros. + /// + /// type contained within the Tensor. Typically a value type such as int, double, float, etc. + /// Tensor representing the diagonal to build the new tensor from. + /// Offset of diagonal to set in returned tensor. 0 for the main diagonal, less than zero for diagonals below, greater than zero from diagonals above. + /// A new tensor of the same layout and order as of one higher rank, with the values of along the specified diagonal and zeros elsewhere. + public static Tensor CreateFromDiagonal(Tensor diagonal, int offset) + { + if (diagonal.Rank < 1) + { + throw new ArgumentException($"Tensor {nameof(diagonal)} must have at least one dimension.", nameof(diagonal)); + } + + int diagonalLength = diagonal.dimensions[0]; + + // TODO: allow specification of axis1 and axis2? + var rank = diagonal.dimensions.Length + 1; + Span dimensions = rank < ArrayUtilities.StackallocMax ? stackalloc int[rank] : new int[rank]; + + // assume square + var axisLength = diagonalLength + Math.Abs(offset); + dimensions[0] = dimensions[1] = axisLength; + + for (int i = 1; i < diagonal.dimensions.Length; i++) + { + dimensions[i + 1] = diagonal.dimensions[i]; + } + + var result = diagonal.CloneEmpty(dimensions); + + var sizePerDiagonal = diagonal.Length / diagonalLength; + + var diagProjectionStride = diagonal.IsReversedStride && diagonal.Rank > 1 ? diagonal.strides[1] : 1; + var resultProjectionStride = result.IsReversedStride && result.Rank > 2 ? result.strides[2] : 1; + + for (int diagIndex = 0; diagIndex < diagonalLength; diagIndex++) + { + var resultIndex0 = offset < 0 ? diagIndex - offset : diagIndex; + var resultIndex1 = offset > 0 ? diagIndex + offset : diagIndex; + + var resultBase = resultIndex0 * result.strides[0] + resultIndex1 * result.strides[1]; + var diagBase = diagIndex * diagonal.strides[0]; + + for (int diagProjectionOffset = 0; diagProjectionOffset < sizePerDiagonal; diagProjectionOffset++) + { + result.SetValue(resultBase + diagProjectionOffset * resultProjectionStride, + diagonal.GetValue(diagBase + diagProjectionOffset * diagProjectionStride)); + } + } + + return result; + } + } + + /// + /// Represents a multi-dimensional collection of objects of type T that can be accessed by indices. + /// + /// type contained within the Tensor. Typically a value type such as int, double, float, etc. + [DebuggerDisplay("{GetArrayString(false)}")] + // When we cross-compile for frameworks that expose ICloneable this must implement ICloneable as well. 
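The static factory methods above can be exercised as sketched here (the explicit generic type arguments follow the corefx Tensor code this file is copied from; treat the exact signatures as an assumption):

    using Microsoft.ML.OnnxRuntime.Tensors;

    // 3x3 identity: ones on the diagonal, zeros elsewhere.
    Tensor<double> eye = Tensor.CreateIdentity<double>(3);

    // Lift a rank-1 diagonal {1, 2, 3} into a 3x3 matrix with those values on the main diagonal.
    Tensor<double> diag = Tensor.CreateFromDiagonal(new double[] { 1, 2, 3 }.ToTensor());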
+ public abstract class Tensor : IList, IList, IReadOnlyList, IStructuralComparable, IStructuralEquatable + { + internal static T Zero + { + get + { + if (typeof(T) == typeof(bool)) + { + return (T)(object)(false); + } + else if (typeof(T) == typeof(byte)) + { + return (T)(object)(byte)(0); + } + else if (typeof(T) == typeof(char)) + { + return (T)(object)(char)(0); + } + else if (typeof(T) == typeof(decimal)) + { + return (T)(object)(decimal)(0); + } + else if (typeof(T) == typeof(double)) + { + return (T)(object)(double)(0); + } + else if (typeof(T) == typeof(float)) + { + return (T)(object)(float)(0); + } + else if (typeof(T) == typeof(int)) + { + return (T)(object)(int)(0); + } + else if (typeof(T) == typeof(long)) + { + return (T)(object)(long)(0); + } + else if (typeof(T) == typeof(sbyte)) + { + return (T)(object)(sbyte)(0); + } + else if (typeof(T) == typeof(short)) + { + return (T)(object)(short)(0); + } + else if (typeof(T) == typeof(uint)) + { + return (T)(object)(uint)(0); + } + else if (typeof(T) == typeof(ulong)) + { + return (T)(object)(ulong)(0); + } + else if (typeof(T) == typeof(ushort)) + { + return (T)(object)(ushort)(0); + } + + throw new NotSupportedException(); + } + } + + internal static T One + { + get + { + if (typeof(T) == typeof(bool)) + { + return (T)(object)(true); + } + else if (typeof(T) == typeof(byte)) + { + return (T)(object)(byte)(1); + } + else if (typeof(T) == typeof(char)) + { + return (T)(object)(char)(1); + } + else if (typeof(T) == typeof(decimal)) + { + return (T)(object)(decimal)(1); + } + else if (typeof(T) == typeof(double)) + { + return (T)(object)(double)(1); + } + else if (typeof(T) == typeof(float)) + { + return (T)(object)(float)(1); + } + else if (typeof(T) == typeof(int)) + { + return (T)(object)(int)(1); + } + else if (typeof(T) == typeof(long)) + { + return (T)(object)(long)(1); + } + else if (typeof(T) == typeof(sbyte)) + { + return (T)(object)(sbyte)(1); + } + else if (typeof(T) == typeof(short)) + { + return (T)(object)(short)(1); + } + else if (typeof(T) == typeof(uint)) + { + return (T)(object)(uint)(1); + } + else if (typeof(T) == typeof(ulong)) + { + return (T)(object)(ulong)(1); + } + else if (typeof(T) == typeof(ushort)) + { + return (T)(object)(ushort)(1); + } + + throw new NotSupportedException(); + } + } + + internal readonly int[] dimensions; + internal readonly int[] strides; + private readonly bool isReversedStride; + + private readonly long length; + + /// + /// Initialize a 1-dimensional tensor of the specified length + /// + /// Size of the 1-dimensional tensor + protected Tensor(int length) + { + dimensions = new[] { length }; + strides = new[] { 1 }; + isReversedStride = false; + this.length = length; + } + + /// + /// Initialize an n-dimensional tensor with the specified dimensions and layout. ReverseStride=true gives a stride of 1-element witdth to the first dimension (0). ReverseStride=false gives a stride of 1-element width to the last dimension (n-1). + /// + /// An span of integers that represent the size of each dimension of the Tensor to create. + /// False (default) to indicate that the first dimension is most major (farthest apart) and the last dimension is most minor (closest together): akin to row-major in a rank-2 tensor. True to indicate that the last dimension is most major (farthest apart) and the first dimension is most minor (closest together): akin to column-major in a rank-2 tensor. 
+ protected Tensor(ReadOnlySpan dimensions, bool reverseStride) + { + if (dimensions.Length == 0) + { + throw new ArgumentException("Dimensions must contain elements.", nameof(dimensions)); + } + + this.dimensions = new int[dimensions.Length]; + long size = 1; + for (int i = 0; i < dimensions.Length; i++) + { + if (dimensions[i] < 1) + { + throw new ArgumentOutOfRangeException(nameof(dimensions), "Dimensions must be positive and non-zero"); + } + this.dimensions[i] = dimensions[i]; + size *= dimensions[i]; + } + + strides = ArrayUtilities.GetStrides(dimensions, reverseStride); + isReversedStride = reverseStride; + + length = size; + } + + /// + /// Initializes tensor with same dimensions as array, content of array is ignored. ReverseStride=true gives a stride of 1-element witdth to the first dimension (0). ReverseStride=false gives a stride of 1-element width to the last dimension (n-1). + /// + /// Array from which to derive dimensions. + /// False (default) to indicate that the first dimension is most major (farthest apart) and the last dimension is most minor (closest together): akin to row-major in a rank-2 tensor. True to indicate that the last dimension is most major (farthest apart) and the first dimension is most minor (closest together): akin to column-major in a rank-2 tensor. + protected Tensor(Array fromArray, bool reverseStride) + { + if (fromArray == null) + { + throw new ArgumentNullException(nameof(fromArray)); + } + + if (fromArray.Rank == 0) + { + throw new ArgumentException("Array must contain elements.", nameof(fromArray)); + } + + dimensions = new int[fromArray.Rank]; + long size = 1; + for (int i = 0; i < dimensions.Length; i++) + { + dimensions[i] = fromArray.GetLength(i); + size *= dimensions[i]; + } + + strides = ArrayUtilities.GetStrides(dimensions, reverseStride); + isReversedStride = reverseStride; + + length = size; + } + + /// + /// Total length of the Tensor. + /// + public long Length => length; + + /// + /// Rank of the tensor: number of dimensions. + /// + public int Rank => dimensions.Length; + + /// + /// True if strides are reversed (AKA Column-major) + /// + public bool IsReversedStride => isReversedStride; + + /// + /// Returns a readonly view of the dimensions of this tensor. + /// + public ReadOnlySpan Dimensions => dimensions; + + /// + /// Returns a readonly view of the strides of this tensor. + /// + public ReadOnlySpan Strides => strides; + + /// + /// Sets all elements in Tensor to . + /// + /// Value to fill + public virtual void Fill(T value) + { + for (int i = 0; i < Length; i++) + { + SetValue(i, value); + } + } + + /// + /// Creates a shallow copy of this tensor, with new backing storage. + /// + /// A shallow copy of this tensor. + public abstract Tensor Clone(); + + /// + /// Creates a new Tensor with the same layout and dimensions as this tensor with elements initialized to their default value. + /// + /// A new Tensor with the same layout and dimensions as this tensor with elements initialized to their default value. + public virtual Tensor CloneEmpty() + { + return CloneEmpty(dimensions); + } + + /// + /// Creates a new Tensor with the specified dimensions and the same layout as this tensor with elements initialized to their default value. + /// + /// An span of integers that represent the size of each dimension of the DenseTensor to create. + /// A new Tensor with the same layout as this tensor and specified with elements initialized to their default value. 
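For completeness, the shared base-class surface defined above behaves as in this small sketch:

    using Microsoft.ML.OnnxRuntime.Tensors;

    var t = new DenseTensor<int>(new[] { 2, 3 });
    // Rank == 2, Length == 6, Dimensions == {2, 3}, Strides == {3, 1}, IsReversedStride == false
    t.Fill(7);                           // every element becomes 7
    Tensor<int> copy = t.Clone();        // same shape and values, new backing storage
    Tensor<int> empty = t.CloneEmpty();  // same shape, elements default-initialized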
+ public virtual Tensor CloneEmpty(ReadOnlySpan dimensions) + { + return CloneEmpty(dimensions); + } + + /// + /// Creates a new Tensor of a different type with the same layout and size as this tensor with elements initialized to their default value. + /// + /// Type contained within the new Tensor. Typically a value type such as int, double, float, etc. + /// A new Tensor with the same layout and dimensions as this tensor with elements of type initialized to their default value. + public virtual Tensor CloneEmpty() + { + return CloneEmpty(dimensions); + } + + /// + /// Creates a new Tensor of a different type with the specified dimensions and the same layout as this tensor with elements initialized to their default value. + /// + /// Type contained within the new Tensor. Typically a value type such as int, double, float, etc. + /// An span of integers that represent the size of each dimension of the DenseTensor to create. + /// A new Tensor with the same layout as this tensor of specified with elements of type initialized to their default value. + public abstract Tensor CloneEmpty(ReadOnlySpan dimensions); + + /// + /// Gets the n-1 dimension diagonal from the n dimension tensor. + /// + /// An n-1 dimension tensor with the values from the main diagonal of this tensor. + public Tensor GetDiagonal() + { + return GetDiagonal(0); + } + + /// + /// Gets the n-1 dimension diagonal from the n dimension tensor at the specified offset from center. + /// + /// Offset of diagonal to set in returned tensor. 0 for the main diagonal, less than zero for diagonals below, greater than zero from diagonals above. + /// An n-1 dimension tensor with the values from the specified diagonal of this tensor. + public Tensor GetDiagonal(int offset) + { + // Get diagonal of first two dimensions for all remaining dimensions + + // diagnonal is as follows: + // { 1, 2, 4 } + // { 8, 3, 9 } + // { 0, 7, 5 } + // The diagonal at offset 0 is { 1, 3, 5 } + // The diagonal at offset 1 is { 2, 9 } + // The diagonal at offset -1 is { 8, 7 } + + if (Rank < 2) + { + throw new InvalidOperationException($"Cannot compute diagonal of {nameof(Tensor)} with Rank less than 2."); + } + + // TODO: allow specification of axis1 and axis2? + var axisLength0 = dimensions[0]; + var axisLength1 = dimensions[1]; + + // the diagonal will be the length of the smaller axis + // if offset it positive, the length will shift along the second axis + // if the offsett is negative, the length will shift along the first axis + // In that way the length of the diagonal will be + // Min(offset < 0 ? axisLength0 + offset : axisLength0, offset > 0 ? axisLength1 - offset : axisLength1) + // To illustrate, consider the following + // { 1, 2, 4, 3, 7 } + // { 8, 3, 9, 2, 6 } + // { 0, 7, 5, 2, 9 } + // The diagonal at offset 0 is { 1, 3, 5 }, Min(3, 5) = 3 + // The diagonal at offset 1 is { 2, 9, 2 }, Min(3, 5 - 1) = 3 + // The diagonal at offset 3 is { 3, 6 }, Min(3, 5 - 3) = 2 + // The diagonal at offset -1 is { 8, 7 }, Min(3 - 1, 5) = 2 + var offsetAxisLength0 = offset < 0 ? axisLength0 + offset : axisLength0; + var offsetAxisLength1 = offset > 0 ? axisLength1 - offset : axisLength1; + + var diagonalLength = Math.Min(offsetAxisLength0, offsetAxisLength1); + + if (diagonalLength <= 0) + { + throw new ArgumentException($"Cannot compute diagonal with offset {offset}", nameof(offset)); + } + + var newTensorRank = Rank - 1; + var newTensorDimensions = newTensorRank < ArrayUtilities.StackallocMax ? 
stackalloc int[newTensorRank] : new int[newTensorRank]; + newTensorDimensions[0] = diagonalLength; + + for (int i = 2; i < dimensions.Length; i++) + { + newTensorDimensions[i - 1] = dimensions[i]; + } + + var diagonalTensor = CloneEmpty(newTensorDimensions); + var sizePerDiagonal = diagonalTensor.Length / diagonalTensor.Dimensions[0]; + + var diagProjectionStride = diagonalTensor.IsReversedStride && diagonalTensor.Rank > 1 ? diagonalTensor.strides[1] : 1; + var sourceProjectionStride = IsReversedStride && Rank > 2 ? strides[2] : 1; + + for (int diagIndex = 0; diagIndex < diagonalLength; diagIndex++) + { + var sourceIndex0 = offset < 0 ? diagIndex - offset : diagIndex; + var sourceIndex1 = offset > 0 ? diagIndex + offset : diagIndex; + + var sourceBase = sourceIndex0 * strides[0] + sourceIndex1 * strides[1]; + var diagBase = diagIndex * diagonalTensor.strides[0]; + + for (int diagProjectionIndex = 0; diagProjectionIndex < sizePerDiagonal; diagProjectionIndex++) + { + diagonalTensor.SetValue(diagBase + diagProjectionIndex * diagProjectionStride, + GetValue(sourceBase + diagProjectionIndex * sourceProjectionStride)); + } + } + + return diagonalTensor; + } + + /// + /// Gets a tensor representing the elements below and including the diagonal, with the rest of the elements zero-ed. + /// + /// A tensor with the values from this tensor at and below the main diagonal and zeros elsewhere. + public Tensor GetTriangle() + { + return GetTriangle(0, upper: false); + } + + /// + /// Gets a tensor representing the elements below and including the specified diagonal, with the rest of the elements zero-ed. + /// + /// Offset of diagonal to set in returned tensor. 0 for the main diagonal, less than zero for diagonals below, greater than zero from diagonals above. + /// A tensor with the values from this tensor at and below the specified diagonal and zeros elsewhere. + public Tensor GetTriangle(int offset) + { + return GetTriangle(offset, upper: false); + } + + /// + /// Gets a tensor representing the elements above and including the diagonal, with the rest of the elements zero-ed. + /// + /// A tensor with the values from this tensor at and above the main diagonal and zeros elsewhere. + public Tensor GetUpperTriangle() + { + return GetTriangle(0, upper: true); + } + + /// + /// Gets a tensor representing the elements above and including the specified diagonal, with the rest of the elements zero-ed. + /// + /// Offset of diagonal to set in returned tensor. 0 for the main diagonal, less than zero for diagonals below, greater than zero from diagonals above. + /// A tensor with the values from this tensor at and above the specified diagonal and zeros elsewhere. + public Tensor GetUpperTriangle(int offset) + { + return GetTriangle(offset, upper: true); + } + + public Tensor GetTriangle(int offset, bool upper) + { + if (Rank < 2) + { + throw new InvalidOperationException($"Cannot compute triangle of {nameof(Tensor)} with Rank less than 2."); + } + + // Similar to get diagonal except it gets every element below and including the diagonal. + + // TODO: allow specification of axis1 and axis2? + var axisLength0 = dimensions[0]; + var axisLength1 = dimensions[1]; + var diagonalLength = Math.Max(axisLength0, axisLength1); + + var result = CloneEmpty(); + + var projectionSize = Length / (axisLength0 * axisLength1); + var projectionStride = IsReversedStride && Rank > 2 ? strides[2] : 1; + + for (int diagIndex = 0; diagIndex < diagonalLength; diagIndex++) + { + // starting point for the tri + var triIndex0 = offset > 0 ? 
diagIndex - offset : diagIndex; + var triIndex1 = offset > 0 ? diagIndex : diagIndex + offset; + + // for lower triangle, iterate index0 keeping same index1 + // for upper triangle, iterate index1 keeping same index0 + + if (triIndex0 < 0) + { + if (upper) + { + // out of bounds, ignore this diagIndex. + continue; + } + else + { + // set index to 0 so that we can iterate on the remaining index0 values. + triIndex0 = 0; + } + } + + if (triIndex1 < 0) + { + if (upper) + { + // set index to 0 so that we can iterate on the remaining index1 values. + triIndex1 = 0; + } + else + { + // out of bounds, ignore this diagIndex. + continue; + } + } + + while ((triIndex1 < axisLength1) && (triIndex0 < axisLength0)) + { + var baseIndex = triIndex0 * strides[0] + triIndex1 * result.strides[1]; + + for (int projectionIndex = 0; projectionIndex < projectionSize; projectionIndex++) + { + var index = baseIndex + projectionIndex * projectionStride; + + result.SetValue(index, GetValue(index)); + } + + if (upper) + { + triIndex1++; + } + else + { + triIndex0++; + } + } + } + + return result; + } + + /// + /// Reshapes the current tensor to new dimensions, using the same backing storage if possible. + /// + /// An span of integers that represent the size of each dimension of the Tensor to create. + /// A new tensor that reinterprets this tensor with different dimensions. + public abstract Tensor Reshape(ReadOnlySpan dimensions); + + /// + /// Obtains the value at the specified indices + /// + /// A one-dimensional array of integers that represent the indices specifying the position of the element to get. + /// The value at the specified position in this Tensor. + public virtual T this[params int[] indices] + { + get + { + if (indices == null) + { + throw new ArgumentNullException(nameof(indices)); + } + var span = new ReadOnlySpan(indices); + return this[span]; + } + + set + { + if (indices == null) + { + throw new ArgumentNullException(nameof(indices)); + } + var span = new ReadOnlySpan(indices); + this[span] = value; + } + } + + /// + /// Obtains the value at the specified indices + /// + /// A span integers that represent the indices specifying the position of the element to get. + /// The value at the specified position in this Tensor. + public virtual T this[ReadOnlySpan indices] + { + get + { + return GetValue(ArrayUtilities.GetIndex(strides, indices)); + } + + set + { + SetValue(ArrayUtilities.GetIndex(strides, indices), value); + } + } + + /// + /// Gets the value at the specied index, where index is a linearized version of n-dimension indices using strides. + /// + /// An integer index computed as a dot-product of indices. + /// The value at the specified position in this Tensor. + public abstract T GetValue(int index); + + /// + /// Sets the value at the specied index, where index is a linearized version of n-dimension indices using strides. + /// + /// An integer index computed as a dot-product of indices. + /// The new value to set at the specified position in this Tensor. + public abstract void SetValue(int index, T value); + + + #region statics + /// + /// Performs a value comparison of the content and shape of two tensors. Two tensors are equal if they have the same shape and same value at every set of indices. If not equal a tensor is greater or less than another tensor based on the first non-equal element when enumerating in linear order. 
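The diagonal/triangle helpers above can be checked against the worked values in their own comments (an illustrative sketch):

    using Microsoft.ML.OnnxRuntime.Tensors;

    var m = new[,] { { 1, 2, 4 },
                     { 8, 3, 9 },
                     { 0, 7, 5 } }.ToTensor();

    var mainDiag  = m.GetDiagonal();      // { 1, 3, 5 }
    var upperDiag = m.GetDiagonal(1);     // { 2, 9 }
    var lower = m.GetTriangle();          // elements above the main diagonal zeroed
    var upper = m.GetUpperTriangle();     // elements below the main diagonal zeroed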
+ /// + /// + /// + /// + public static int Compare(Tensor left, Tensor right) + { + return StructuralComparisons.StructuralComparer.Compare(left, right); + } + + /// + /// Performs a value equality comparison of the content of two tensors. Two tensors are equal if they have the same shape and same value at every set of indices. + /// + /// + /// + /// + public static bool Equals(Tensor left, Tensor right) + { + return StructuralComparisons.StructuralEqualityComparer.Equals(left, right); + } + #endregion + + #region IEnumerable members + IEnumerator IEnumerable.GetEnumerator() + { + return ((IEnumerable)this).GetEnumerator(); + } + #endregion + + #region ICollection members + int ICollection.Count => (int)Length; + + bool ICollection.IsSynchronized => false; + + object ICollection.SyncRoot => this; // backingArray.this? + + void ICollection.CopyTo(Array array, int index) + { + if (array is T[] destinationArray) + { + CopyTo(destinationArray, index); + } + else + { + if (array == null) + { + throw new ArgumentNullException(nameof(array)); + } + if (array.Rank != 1) + { + throw new ArgumentException("Only single dimensional arrays are supported for the requested action.", nameof(array)); + } + if (array.Length < index + Length) + { + throw new ArgumentException("The number of elements in the Tensor is greater than the available space from index to the end of the destination array.", nameof(array)); + } + + for (int i = 0; i < length; i++) + { + array.SetValue(GetValue(i), index + i); + } + } + } + #endregion + + #region IList members + object IList.this[int index] + { + get + { + return GetValue(index); + } + set + { + try + { + SetValue(index, (T)value); + } + catch (InvalidCastException) + { + throw new ArgumentException($"The value \"{value}\" is not of type \"{typeof(T)}\" and cannot be used in this generic collection."); + } + } + } + + public bool IsFixedSize => true; + + public bool IsReadOnly => false; + + int IList.Add(object value) + { + throw new InvalidOperationException(); + } + + void IList.Clear() + { + Fill(default(T)); + } + + bool IList.Contains(object value) + { + if (IsCompatibleObject(value)) + { + return Contains((T)value); + } + return false; + } + + int IList.IndexOf(object value) + { + if (IsCompatibleObject(value)) + { + return IndexOf((T)value); + } + return -1; + } + + void IList.Insert(int index, object value) + { + throw new InvalidOperationException(); + } + + void IList.Remove(object value) + { + throw new InvalidOperationException(); + } + + void IList.RemoveAt(int index) + { + throw new InvalidOperationException(); + } + #endregion + + #region IEnumerable members + IEnumerator IEnumerable.GetEnumerator() + { + for (int i = 0; i < Length; i++) + { + yield return GetValue(i); + } + } + #endregion + + #region ICollection members + int ICollection.Count => (int)Length; + + void ICollection.Add(T item) + { + throw new InvalidOperationException(); + } + + void ICollection.Clear() + { + Fill(default(T)); + } + + bool ICollection.Contains(T item) + { + return Contains(item); + } + + /// + /// Determines whether an element is in the Tensor<T>. + /// + /// + /// The object to locate in the Tensor<T>. The value can be null for reference types. + /// + /// + /// true if item is found in the Tensor<T>; otherwise, false. 
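Because most of the collection interfaces above are implemented explicitly, they are reached through a cast, while enumeration works directly (illustrative):

    using System.Collections.Generic;
    using Microsoft.ML.OnnxRuntime.Tensors;

    var t = new[] { 10, 20, 30 }.ToTensor();

    foreach (int v in t) { /* visits 10, 20, 30 in linear order */ }

    IList<int> view = t;              // Tensor<T> implements IList<T>
    int position = view.IndexOf(20);  // 1
    bool found = view.Contains(30);   // true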
+ /// + protected virtual bool Contains(T item) + { + return Length != 0 && IndexOf(item) != -1; + } + + void ICollection.CopyTo(T[] array, int arrayIndex) + { + CopyTo(array, arrayIndex); + } + + /// + /// Copies the elements of the Tensor<T> to an Array, starting at a particular Array index. + /// + /// + /// The one-dimensional Array that is the destination of the elements copied from Tensor<T>. The Array must have zero-based indexing. + /// + /// + /// The zero-based index in array at which copying begins. + /// + protected virtual void CopyTo(T[] array, int arrayIndex) + { + if (array == null) + { + throw new ArgumentNullException(nameof(array)); + } + if (array.Length < arrayIndex + Length) + { + throw new ArgumentException("The number of elements in the Tensor is greater than the available space from index to the end of the destination array.", nameof(array)); + } + + for (int i = 0; i < length; i++) + { + array[arrayIndex + i] = GetValue(i); + } + } + + bool ICollection.Remove(T item) + { + throw new InvalidOperationException(); + } + #endregion + + #region IReadOnlyCollection members + + int IReadOnlyCollection.Count => (int)Length; + + #endregion + + #region IList members + T IList.this[int index] + { + get { return GetValue(index); } + set { SetValue(index, value); } + } + + int IList.IndexOf(T item) + { + return IndexOf(item); + } + + /// + /// Determines the index of a specific item in the Tensor<T>. + /// + /// The object to locate in the Tensor<T>. + /// The index of item if found in the tensor; otherwise, -1. + protected virtual int IndexOf(T item) + { + for (int i = 0; i < Length; i++) + { + if (GetValue(i).Equals(item)) + { + return i; + } + } + + return -1; + } + + void IList.Insert(int index, T item) + { + throw new InvalidOperationException(); + } + + void IList.RemoveAt(int index) + { + throw new InvalidOperationException(); + } + #endregion + + #region IReadOnlyList members + + T IReadOnlyList.this[int index] => GetValue(index); + + #endregion + + #region IStructuralComparable members + int IStructuralComparable.CompareTo(object other, IComparer comparer) + { + if (other == null) + { + return 1; + } + + if (other is Tensor) + { + return CompareTo((Tensor)other, comparer); + } + + var otherArray = other as Array; + + if (otherArray != null) + { + return CompareTo(otherArray, comparer); + } + + throw new ArgumentException($"Cannot compare {nameof(Tensor)} to {other.GetType()}.", nameof(other)); + } + + private int CompareTo(Tensor other, IComparer comparer) + { + if (Rank != other.Rank) + { + throw new ArgumentException($"Cannot compare {nameof(Tensor)} with Rank {Rank} to {nameof(other)} with Rank {other.Rank}.", nameof(other)); + } + + for (int i = 0; i < dimensions.Length; i++) + { + if (dimensions[i] != other.dimensions[i]) + { + throw new ArgumentException($"Cannot compare {nameof(Tensor)}s with differning dimension {i}, {dimensions[i]} != {other.dimensions[i]}.", nameof(other)); + } + } + + int result = 0; + + if (IsReversedStride == other.IsReversedStride) + { + for (int i = 0; i < Length; i++) + { + result = comparer.Compare(GetValue(i), other.GetValue(i)); + if (result != 0) + { + break; + } + } + } + else + { + var indices = Rank < ArrayUtilities.StackallocMax ? 
stackalloc int[Rank] : new int[Rank]; + for (int i = 0; i < Length; i++) + { + ArrayUtilities.GetIndices(strides, IsReversedStride, i, indices); + result = comparer.Compare(this[indices], other[indices]); + if (result != 0) + { + break; + } + } + } + + return result; + } + + private int CompareTo(Array other, IComparer comparer) + { + if (Rank != other.Rank) + { + throw new ArgumentException($"Cannot compare {nameof(Tensor)} with Rank {Rank} to {nameof(Array)} with rank {other.Rank}.", nameof(other)); + } + + for (int i = 0; i < dimensions.Length; i++) + { + var otherDimension = other.GetLength(i); + if (dimensions[i] != otherDimension) + { + throw new ArgumentException($"Cannot compare {nameof(Tensor)} to {nameof(Array)} with differning dimension {i}, {dimensions[i]} != {otherDimension}.", nameof(other)); + } + } + + int result = 0; + var indices = new int[Rank]; + for (int i = 0; i < Length; i++) + { + ArrayUtilities.GetIndices(strides, IsReversedStride, i, indices); + + result = comparer.Compare(GetValue(i), other.GetValue(indices)); + + if (result != 0) + { + break; + } + } + + return result; + } + #endregion + + #region IStructuralEquatable members + bool IStructuralEquatable.Equals(object other, IEqualityComparer comparer) + { + if (other == null) + { + return false; + } + + if (other is Tensor) + { + return Equals((Tensor)other, comparer); + } + + var otherArray = other as Array; + + if (otherArray != null) + { + return Equals(otherArray, comparer); + } + + throw new ArgumentException($"Cannot compare {nameof(Tensor)} to {other.GetType()}.", nameof(other)); + } + + private bool Equals(Tensor other, IEqualityComparer comparer) + { + if (Rank != other.Rank) + { + throw new ArgumentException($"Cannot compare {nameof(Tensor)} with Rank {Rank} to {nameof(other)} with Rank {other.Rank}.", nameof(other)); + } + + for (int i = 0; i < dimensions.Length; i++) + { + if (dimensions[i] != other.dimensions[i]) + { + throw new ArgumentException($"Cannot compare {nameof(Tensor)}s with differning dimension {i}, {dimensions[i]} != {other.dimensions[i]}.", nameof(other)); + } + } + + if (IsReversedStride == other.IsReversedStride) + { + for (int i = 0; i < Length; i++) + { + if (!comparer.Equals(GetValue(i), other.GetValue(i))) + { + return false; + } + } + } + else + { + var indices = Rank < ArrayUtilities.StackallocMax ? 
stackalloc int[Rank] : new int[Rank]; + for (int i = 0; i < Length; i++) + { + ArrayUtilities.GetIndices(strides, IsReversedStride, i, indices); + + if (!comparer.Equals(this[indices], other[indices])) + { + return false; + } + } + } + + return true; + } + + private bool Equals(Array other, IEqualityComparer comparer) + { + if (Rank != other.Rank) + { + throw new ArgumentException($"Cannot compare {nameof(Tensor)} with Rank {Rank} to {nameof(Array)} with rank {other.Rank}.", nameof(other)); + } + + for (int i = 0; i < dimensions.Length; i++) + { + var otherDimension = other.GetLength(i); + if (dimensions[i] != otherDimension) + { + throw new ArgumentException($"Cannot compare {nameof(Tensor)} to {nameof(Array)} with differning dimension {i}, {dimensions[i]} != {otherDimension}.", nameof(other)); + } + } + + var indices = new int[Rank]; + for (int i = 0; i < Length; i++) + { + ArrayUtilities.GetIndices(strides, IsReversedStride, i, indices); + + if (!comparer.Equals(GetValue(i), other.GetValue(indices))) + { + return false; + } + } + + return true; + } + int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) + { + int hashCode = 0; + // this ignores shape, which is fine it just means we'll have hash collisions for things + // with the same content and different shape. + for (int i = 0; i < Length; i++) + { + hashCode ^= comparer.GetHashCode(GetValue(i)); + } + + return hashCode; + } + #endregion + + #region Translations + + /// + /// Creates a copy of this tensor as a DenseTensor<T>. If this tensor is already a DenseTensor<T> calling this method is equivalent to calling Clone(). + /// + /// + public virtual DenseTensor ToDenseTensor() + { + var denseTensor = new DenseTensor(Dimensions, IsReversedStride); + for (int i = 0; i < Length; i++) + { + denseTensor.SetValue(i, GetValue(i)); + } + return denseTensor; + } + + #endregion + + public string GetArrayString(bool includeWhitespace = true) + { + var builder = new StringBuilder(); + + var strides = ArrayUtilities.GetStrides(dimensions); + var indices = new int[Rank]; + var innerDimension = Rank - 1; + var innerLength = dimensions[innerDimension]; + var outerLength = Length / innerLength; + + int indent = 0; + for (int outerIndex = 0; outerIndex < Length; outerIndex += innerLength) + { + ArrayUtilities.GetIndices(strides, false, outerIndex, indices); + + while ((indent < innerDimension) && (indices[indent] == 0)) + { + // start up + if (includeWhitespace) + { + Indent(builder, indent); + } + indent++; + builder.Append('{'); + if (includeWhitespace) + { + builder.AppendLine(); + } + } + + for (int innerIndex = 0; innerIndex < innerLength; innerIndex++) + { + indices[innerDimension] = innerIndex; + + if ((innerIndex == 0)) + { + if (includeWhitespace) + { + Indent(builder, indent); + } + builder.Append('{'); + } + else + { + builder.Append(','); + } + builder.Append(this[indices]); + } + builder.Append('}'); + + for (int i = Rank - 2; i >= 0; i--) + { + var lastIndex = dimensions[i] - 1; + if (indices[i] == lastIndex) + { + // close out + --indent; + if (includeWhitespace) + { + builder.AppendLine(); + Indent(builder, indent); + } + builder.Append('}'); + } + else + { + builder.Append(','); + if (includeWhitespace) + { + builder.AppendLine(); + } + break; + } + } + } + + return builder.ToString(); + } + + private static void Indent(StringBuilder builder, int tabs, int spacesPerTab = 4) + { + for (int tab = 0; tab < tabs; tab++) + { + for (int space = 0; space < spacesPerTab; space++) + { + builder.Append(' '); + } + } + } + + private 
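For reference, a sketch of what GetArrayString above produces for a small tensor (assuming the DenseTensor<T> constructor used by the tests in this patch):

    var t = new DenseTensor<int>(new[] { 1, 2, 3, 4, 5, 6 }, new[] { 2, 3 });
    // t.GetArrayString() returns (includeWhitespace defaults to true):
    // {
    //     {1,2,3},
    //     {4,5,6}
    // }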
static bool IsCompatibleObject(object value) + { + // Non-null values are fine. Only accept nulls if T is a class or Nullable. + // Note that default(T) is not equal to null for value types except when T is Nullable. + return ((value is T) || (value == null && default(T) == null)); + } + } +} diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp b/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp index 34765ab133fa2..559d3690e9664 100644 --- a/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp +++ b/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/CXX_Api_Sample.cpp @@ -23,10 +23,11 @@ int main(int argc, char* argv[]) { // Sets graph optimization level // Available levels are - // 0 -> To disable all optimizations - // 1 -> To enable basic optimizations (Such as redundant node removals) - // 2 -> To enable all optimizations (Includes level 1 + more complex optimizations like node fusions) - session_options.SetGraphOptimizationLevel(1); + // ORT_DISABLE_ALL -> To disable all optimizations + // ORT_ENABLE_BASIC -> To enable basic optimizations (Such as redundant node removals) + // ORT_ENABLE_EXTENDED -> To enable extended optimizations (Includes level 1 + more complex optimizations like node fusions) + // ORT_ENABLE_ALL -> To Enable All possible opitmizations + session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED); //************************************************************************* // create session and load model into memory @@ -43,7 +44,7 @@ int main(int argc, char* argv[]) { //************************************************************************* // print model input layer (node names, types, shape etc.) - Ort::Allocator allocator = Ort::Allocator::CreateDefault(); + Ort::AllocatorWithDefaultOptions allocator; // print number of model input nodes size_t num_input_nodes = session.GetInputCount(); diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/C_Api_Sample.cpp b/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/C_Api_Sample.cpp index 11dae1ab52197..bdc413715281c 100644 --- a/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/C_Api_Sample.cpp +++ b/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/C_Api_Sample.cpp @@ -34,11 +34,7 @@ int main(int argc, char* argv[]) { OrtSetSessionThreadPoolSize(session_options, 1); // Sets graph optimization level - // Available levels are - // 0 -> To disable all optimizations - // 1 -> To enable basic optimizations (Such as redundant node removals) - // 2 -> To enable all optimizations (Includes level 1 + more complex optimizations like node fusions) - OrtSetSessionGraphOptimizationLevel(session_options, 1); + OrtSetSessionGraphOptimizationLevel(session_options, ORT_ENABLE_BASIC); // Optionally add more execution providers via session_options // E.g. for CUDA include cuda_provider_factory.h and uncomment the following line: @@ -63,7 +59,7 @@ int main(int argc, char* argv[]) { size_t num_input_nodes; OrtStatus* status; OrtAllocator* allocator; - CHECK_STATUS(OrtCreateDefaultAllocator(&allocator)); + CHECK_STATUS(OrtGetAllocatorWithDefaultOptions(&allocator)); // print number of model input nodes status = OrtSessionGetInputCount(session, &num_input_nodes); @@ -101,7 +97,6 @@ int main(int argc, char* argv[]) { OrtReleaseTypeInfo(typeinfo); } - OrtReleaseAllocator(allocator); // Results should be... 
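The C# binding exposes the same renamed optimization levels through SessionOptions.GraphOptimizationLevel, as exercised by the tests later in this patch; a minimal sketch (model path illustrative):

    using (var options = new SessionOptions())
    {
        // ORT_DISABLE_ALL, ORT_ENABLE_BASIC, ORT_ENABLE_EXTENDED or ORT_ENABLE_ALL
        options.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED;
        using (var session = new InferenceSession("squeezenet.onnx", options))
        {
            // run inference as usual
        }
    }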
// Number of inputs = 1 diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs index 88bf5f83d4c8f..02888e5a52647 100644 --- a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/InferenceTest.cs @@ -6,7 +6,7 @@ using System.Collections.Generic; using System.Linq; using System.Runtime.InteropServices; -using System.Numerics.Tensors; +using Microsoft.ML.OnnxRuntime.Tensors; using System.Threading.Tasks; using Xunit; @@ -17,6 +17,81 @@ public class InferenceTest private const string module = "onnxruntime.dll"; private const string propertiesFile = "Properties.txt"; + [Fact] + public void TestSessionOptions() + { + using (SessionOptions opt = new SessionOptions()) + { + Assert.NotNull(opt); + + // check default values of the properties + Assert.True(opt.EnableSequentialExecution); + Assert.True(opt.EnableMemoryPattern); + Assert.False(opt.EnableProfiling); + Assert.Equal("onnxruntime_profile_", opt.ProfileOutputPathPrefix); + Assert.True(opt.EnableCpuMemArena); + Assert.Equal("", opt.LogId); + Assert.Equal(LogLevel.Verbose, opt.LogVerbosityLevel); + Assert.Equal(0, opt.ThreadPoolSize); + Assert.Equal(GraphOptimizationLevel.ORT_ENABLE_BASIC, opt.GraphOptimizationLevel); + + // try setting options + opt.EnableSequentialExecution = false; + Assert.False(opt.EnableSequentialExecution); + + opt.EnableMemoryPattern = false; + Assert.False(opt.EnableMemoryPattern); + + opt.EnableProfiling = true; + Assert.True(opt.EnableProfiling); + Assert.Equal("onnxruntime_profile_", opt.ProfileOutputPathPrefix); + + opt.ProfileOutputPathPrefix = "Ort_P_"; + Assert.Equal("Ort_P_", opt.ProfileOutputPathPrefix); + + opt.EnableCpuMemArena = false; + Assert.False(opt.EnableCpuMemArena); + + opt.LogId = "MyLogId"; + Assert.Equal("MyLogId", opt.LogId); + + opt.LogVerbosityLevel = LogLevel.Error; + Assert.Equal(LogLevel.Error, opt.LogVerbosityLevel); + + opt.ThreadPoolSize = 4; + Assert.Equal(4, opt.ThreadPoolSize); + + opt.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED; + Assert.Equal(GraphOptimizationLevel.ORT_ENABLE_EXTENDED, opt.GraphOptimizationLevel); + + Assert.Throws(() => { opt.GraphOptimizationLevel = (GraphOptimizationLevel)10; }); + } + } + + [Fact] + public void TestRunOptions() + { + using (var opt = new RunOptions()) + { + Assert.NotNull(opt); + + //verify default options + Assert.False(opt.Terminate); + Assert.Equal(LogLevel.Verbose, opt.LogVerbosityLevel); + Assert.Equal("", opt.LogTag); + + // try setting options + opt.Terminate = true; + Assert.True(opt.Terminate); + + opt.LogVerbosityLevel = LogLevel.Error; + Assert.Equal(LogLevel.Error, opt.LogVerbosityLevel); + + opt.LogTag = "MyLogTag"; + Assert.Equal("MyLogTag", opt.LogTag); + } + } + [Fact] public void CanCreateAndDisposeSessionWithModelPath() { @@ -51,18 +126,18 @@ public void CanCreateAndDisposeSessionWithModelPath() } [Theory] - [InlineData(0, true)] - [InlineData(0, false)] - [InlineData(2, true)] - [InlineData(2, false)] - private void CanRunInferenceOnAModel(uint graphOptimizationLevel, bool disableSequentialExecution) + [InlineData(GraphOptimizationLevel.ORT_DISABLE_ALL, true)] + [InlineData(GraphOptimizationLevel.ORT_DISABLE_ALL, false)] + [InlineData(GraphOptimizationLevel.ORT_ENABLE_EXTENDED, true)] + [InlineData(GraphOptimizationLevel.ORT_ENABLE_EXTENDED, false)] + private void CanRunInferenceOnAModel(GraphOptimizationLevel graphOptimizationLevel, bool disableSequentialExecution) 
{ string modelPath = Path.Combine(Directory.GetCurrentDirectory(), "squeezenet.onnx"); // Set the graph optimization level for this session. SessionOptions options = new SessionOptions(); - options.SetSessionGraphOptimizationLevel(graphOptimizationLevel); - if (disableSequentialExecution) options.DisableSequentialExecution(); + options.GraphOptimizationLevel = graphOptimizationLevel; + if (disableSequentialExecution) options.EnableSequentialExecution = false; using (var session = new InferenceSession(modelPath, options)) { @@ -82,32 +157,51 @@ private void CanRunInferenceOnAModel(uint graphOptimizationLevel, bool disableSe // Run the inference using (var results = session.Run(container)) // results is an IReadOnlyList container { - Assert.Equal(1, results.Count); + validateRunResults(results); + } + + // Run Inference with RunOptions + using (var runOptions = new RunOptions()) + { + runOptions.LogTag = "CsharpTest"; + runOptions.Terminate = false; // TODO: Test terminate = true, it currently crashes + runOptions.LogVerbosityLevel = LogLevel.Error; + IReadOnlyCollection outputNames = session.OutputMetadata.Keys.ToList(); - float[] expectedOutput = LoadTensorFromFile(@"bench.expected_out"); - // validate the results - foreach (var r in results) + using (var results = session.Run(container, outputNames, runOptions)) // results is an IReadOnlyList container { - Assert.Equal("softmaxout_1", r.Name); + validateRunResults(results); + } + } + } + } - var resultTensor = r.AsTensor(); - int[] expectedDimensions = { 1, 1000, 1, 1 }; // hardcoded for now for the test data - Assert.Equal(expectedDimensions.Length, resultTensor.Rank); + private void validateRunResults(IDisposableReadOnlyCollection results) + { + float[] expectedOutput = LoadTensorFromFile(@"bench.expected_out"); + // validate the results + foreach (var r in results) + { + Assert.Equal(1, results.Count); + Assert.Equal("softmaxout_1", r.Name); - var resultDimensions = resultTensor.Dimensions; - for (int i = 0; i < expectedDimensions.Length; i++) - { - Assert.Equal(expectedDimensions[i], resultDimensions[i]); - } + var resultTensor = r.AsTensor(); + int[] expectedDimensions = { 1, 1000, 1, 1 }; // hardcoded for now for the test data + Assert.Equal(expectedDimensions.Length, resultTensor.Rank); - var resultArray = r.AsTensor().ToArray(); - Assert.Equal(expectedOutput.Length, resultArray.Length); - Assert.Equal(expectedOutput, resultArray, new floatComparer()); - } + var resultDimensions = resultTensor.Dimensions; + for (int i = 0; i < expectedDimensions.Length; i++) + { + Assert.Equal(expectedDimensions[i], resultDimensions[i]); } + + var resultArray = r.AsTensor().ToArray(); + Assert.Equal(expectedOutput.Length, resultArray.Length); + Assert.Equal(expectedOutput, resultArray, new floatComparer()); } } + [Fact] private void ThrowWrongInputName() { @@ -297,7 +391,7 @@ private void TestModelInputFloat() } } - [Fact(Skip = "Boolean tensor not supported yet")] + [Fact] private void TestModelInputBOOL() { // model takes 1x5 input of fixed type, echoes back @@ -355,15 +449,15 @@ private void TestModelInputDOUBLE() } - [Fact(Skip = "String tensor not supported yet")] + [Fact] private void TestModelInputSTRING() { // model takes 1x5 input of fixed type, echoes back - string modelPath = Path.Combine(Directory.GetCurrentDirectory(), "test_types_STRING.onnx"); + string modelPath = Path.Combine(Directory.GetCurrentDirectory(), "test_types_STRING.pb"); using (var session = new InferenceSession(modelPath)) { var container = new List(); - var tensorIn = 
new DenseTensor(new string[] { "a", "c", "d", "z", "f" }, new int[] { 1, 5 }); + var tensorIn = new DenseTensor(new string[] { "abc", "ced", "def", "", "frozen" }, new int[] { 1, 5 }); var nov = NamedOnnxValue.CreateFromTensor("input", tensorIn); container.Add(nov); using (var res = session.Run(container)) @@ -374,7 +468,7 @@ private void TestModelInputSTRING() } } - [Fact(Skip = "Int8 not supported yet")] + [Fact] private void TestModelInputINT8() { // model takes 1x5 input of fixed type, echoes back @@ -638,6 +732,20 @@ private void TestModelSequenceOfMapStringFloat() } } + [Fact(Skip="The Model Serialization Test fails on linux. Test skipped until fixed. Serialization API should not be used before fix.")] + private void TestModelSerialization() + { + string modelPath = Path.Combine(Directory.GetCurrentDirectory(), "squeezenet.onnx"); + string modelOutputPath = Path.Combine(Directory.GetCurrentDirectory(), "optimized-squeezenet.onnx"); + // Set the optimized model file path to assert that no exception are thrown. + SessionOptions options = new SessionOptions(); + options.OptimizedModelFilePath = modelOutputPath; + options.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_BASIC; + var session = new InferenceSession(modelPath, options); + Assert.NotNull(session); + Assert.True(File.Exists(modelOutputPath)); + } + [GpuFact] private void TestGpu() { @@ -658,6 +766,7 @@ private void TestGpu() } } + [DllImport("kernel32", SetLastError = true)] static extern IntPtr LoadLibrary(string lpFileName); @@ -671,15 +780,17 @@ private void VerifyNativeMethodsExist() if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) return; var entryPointNames = new[]{ - "OrtCreateEnv","OrtReleaseEnv","OrtGetErrorCode","OrtGetErrorMessage", - "OrtReleaseStatus","OrtCreateSession","OrtRun","OrtSessionGetInputCount", - "OrtSessionGetOutputCount","OrtSessionGetInputName","OrtSessionGetOutputName","OrtSessionGetInputTypeInfo", - "OrtSessionGetOutputTypeInfo","OrtReleaseSession","OrtCreateSessionOptions","OrtCloneSessionOptions", + "OrtCreateEnv","OrtReleaseEnv", + "OrtGetErrorCode","OrtGetErrorMessage", "OrtReleaseStatus", + "OrtCreateSession","OrtRun", + "OrtSessionGetInputCount", "OrtSessionGetOutputCount","OrtSessionGetInputName","OrtSessionGetOutputName", + "OrtSessionGetInputTypeInfo", "OrtSessionGetOutputTypeInfo","OrtReleaseSession", + "OrtCreateSessionOptions","OrtCloneSessionOptions", "OrtEnableSequentialExecution","OrtDisableSequentialExecution","OrtEnableProfiling","OrtDisableProfiling", "OrtEnableMemPattern","OrtDisableMemPattern","OrtEnableCpuMemArena","OrtDisableCpuMemArena", "OrtSetSessionLogId","OrtSetSessionLogVerbosityLevel","OrtSetSessionThreadPoolSize","OrtSetSessionGraphOptimizationLevel", - "OrtSessionOptionsAppendExecutionProvider_CPU","OrtCreateAllocatorInfo","OrtCreateCpuAllocatorInfo", - "OrtCreateDefaultAllocator","OrtAllocatorFree","OrtAllocatorGetInfo", + "OrtSetOptimizedModelFilePath", "OrtSessionOptionsAppendExecutionProvider_CPU","OrtCreateAllocatorInfo","OrtCreateCpuAllocatorInfo", + "OrtGetAllocatorWithDefaultOptions","OrtAllocatorFree","OrtAllocatorGetInfo", "OrtCreateTensorWithDataAsOrtValue","OrtGetTensorMutableData", "OrtReleaseAllocatorInfo", "OrtCastTypeInfoToTensorInfo","OrtGetTensorTypeAndShape","OrtGetTensorElementType","OrtGetDimensionsCount", "OrtGetDimensions","OrtGetTensorShapeElementCount","OrtReleaseValue"}; diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Microsoft.ML.OnnxRuntime.Tests.csproj 
b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Microsoft.ML.OnnxRuntime.Tests.csproj index 0f307ae7680f2..a271ff96d6b7a 100644 --- a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Microsoft.ML.OnnxRuntime.Tests.csproj +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Microsoft.ML.OnnxRuntime.Tests.csproj @@ -10,8 +10,40 @@ $(OnnxRuntimeBuildDirectory)\$(Configuration)\external\protobuf\cmake\$(Configuration) $(OnnxRuntimeCsharpRoot)\..\onnxruntime\core\protobuf $(OnnxRuntimeBuildDirectory)\$(Configuration)\$(Configuration) + + + 7.2 + True + true + false + ..\..\OnnxRuntime.snk + + + + True + True + Tensors\TensorArithmetic.tt + + + True + True + Tensors\TensorOperations.tt + + + + + TextTemplatingFileGenerator + Tensors\TensorArithmetic.cs + + + TextTemplatingFileGenerator + Tensors\TensorOperations.cs + + + + diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/NativeMemory.cs b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/NativeMemory.cs new file mode 100644 index 0000000000000..019c3d58cd663 --- /dev/null +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/NativeMemory.cs @@ -0,0 +1,119 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/tests/NativeMemory.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System.Buffers; +using System.Runtime.InteropServices; +using System.Runtime.CompilerServices; +using System.Threading; +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors.Tests +{ + public class NativeMemory : MemoryManager + { + private bool disposed = false; + private int refCount = 0; + private IntPtr memory; + private int length; + + public NativeMemory(IntPtr memory, int length) + { + this.memory = memory; + this.length = length; + } + + public unsafe NativeMemory(void* memory, int length) + { + this.memory = (IntPtr)memory; + this.length = length; + } + + ~NativeMemory() + { + Dispose(false); + } + + public static NativeMemory Allocate(int length) + { + // typically this would call into a native method appropriate for the platform + // or the constructors above would be used to wrap the native pointer + IntPtr memory = Marshal.AllocHGlobal(Marshal.SizeOf() * length); + return new NativeMemory(memory, length); + } + + public bool IsDisposed => disposed; + + public unsafe override Span GetSpan() => new Span((void*)memory, length); + + protected bool IsRetained => refCount > 0; + + public override MemoryHandle Pin(int elementIndex = 0) + { + unsafe + { + Retain(); + if ((uint)elementIndex > length) throw new ArgumentOutOfRangeException(nameof(elementIndex)); + void* pointer = Unsafe.Add((void*)memory, elementIndex); + return new MemoryHandle(pointer, default, this); + } + } + + public bool Release() + { + int newRefCount = Interlocked.Decrement(ref refCount); + + if (newRefCount < 0) + { + throw new InvalidOperationException("Unmatched Release/Retain"); + } + + return newRefCount != 0; + } + + public void Retain() + { + if (disposed) + { + throw new ObjectDisposedException(nameof(NativeMemory)); + } + + Interlocked.Increment(ref refCount); + } + + protected override void Dispose(bool disposing) + { 
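NativeMemory<T> wraps a native allocation as a MemoryManager<T>, so its Memory property can back a tensor without copying. A minimal usage sketch (the DenseTensor<T> constructor taking a Memory<T> plus dimensions is assumed from the Tensors sources included in this patch):

    var nativeMemory = NativeMemory<float>.Allocate(12);                  // unmanaged buffer for 12 floats
    var tensor = new DenseTensor<float>(nativeMemory.Memory, new[] { 3, 4 });

    tensor[0, 0] = 1.0f;                                                  // writes straight into the native buffer

    ((IDisposable)nativeMemory).Dispose();                                // frees the HGlobal allocation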
+ if (disposed) + { + return; + } + + // typically this would call into a native method appropriate for the platform + Marshal.FreeHGlobal(memory); + memory = IntPtr.Zero; + + disposed = true; + } + + protected override bool TryGetArray(out ArraySegment arraySegment) + { + // cannot expose managed array + arraySegment = default; + return false; + } + + public override void Unpin() + { + Release(); + } + } +} diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorArithmetic.cs b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorArithmetic.cs new file mode 100644 index 0000000000000..b1a476b27abe5 --- /dev/null +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorArithmetic.cs @@ -0,0 +1,16201 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/tests/TensorArithmetic.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors +{ + internal interface ITensorArithmetic + { + T One { get; } + T Zero { get; } + void Add(Tensor left, Tensor right, Tensor result); + void Add(Tensor tensor, T scalar, Tensor result); + void And(Tensor left, Tensor right, Tensor result); + void And(Tensor tensor, T scalar, Tensor result); + void Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result); + void Decrement(Tensor tensor, Tensor result); + void Divide(Tensor left, Tensor right, Tensor result); + void Divide(Tensor tensor, T scalar, Tensor result); + void Equals(Tensor left, Tensor right, Tensor result); + void GreaterThan(Tensor left, Tensor right, Tensor result); + void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result); + void Increment(Tensor tensor, Tensor result); + void LeftShift(Tensor tensor, int value, Tensor result); + void LessThan(Tensor left, Tensor right, Tensor result); + void LessThanOrEqual(Tensor left, Tensor right, Tensor result); + void Modulo(Tensor left, Tensor right, Tensor result); + void Modulo(Tensor tensor, T scalar, Tensor result); + void Multiply(Tensor left, Tensor right, Tensor result); + void Multiply(Tensor tensor, T scalar, Tensor result); + void NotEquals(Tensor left, Tensor right, Tensor result); + void Or(Tensor left, Tensor right, Tensor result); + void Or(Tensor tensor, T scalar, Tensor result); + void RightShift(Tensor tensor, int value, Tensor result); + void Subtract(Tensor left, Tensor right, Tensor result); + void Subtract(Tensor tensor, T scalar, Tensor result); + void UnaryMinus(Tensor tensor, Tensor result); + void UnaryPlus(Tensor tensor, Tensor result); + void Xor(Tensor left, Tensor right, Tensor result); + void Xor(Tensor tensor, T scalar, Tensor result); + } + + internal static class TensorArithmetic + { + public static ITensorArithmetic Instance => TensorArithmetic.GetArithmetic(); + } + + internal static class TensorArithmetic + { + public static ITensorArithmetic GetArithmetic() + { + if (typeof(T) == typeof(bool)) + { + return (ITensorArithmetic)new BoolArithmetic(); + } + else if (typeof(T) == typeof(byte)) + { + return (ITensorArithmetic)new ByteArithmetic(); + } + else if 
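The generated arithmetic classes below are resolved per element type through this dispatcher. Conceptually (a sketch only; these types are internal and are normally reached through the TensorOperations helpers generated alongside them, which is an assumption based on the .tt files added to the test project):

    ITensorArithmetic<byte> math = TensorArithmetic<byte>.Instance;       // resolves to ByteArithmetic
    var left   = new DenseTensor<byte>(new byte[] { 1, 2, 3, 4 }, new[] { 2, 2 });
    var right  = new DenseTensor<byte>(new byte[] { 5, 6, 7, 8 }, new[] { 2, 2 });
    var result = new DenseTensor<byte>(new[] { 2, 2 });

    math.Add(left, right, result);                                        // result now holds 6, 8, 10, 12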
(typeof(T) == typeof(char)) + { + return (ITensorArithmetic)new CharArithmetic(); + } + else if (typeof(T) == typeof(decimal)) + { + return (ITensorArithmetic)new DecimalArithmetic(); + } + else if (typeof(T) == typeof(double)) + { + return (ITensorArithmetic)new DoubleArithmetic(); + } + else if (typeof(T) == typeof(float)) + { + return (ITensorArithmetic)new FloatArithmetic(); + } + else if (typeof(T) == typeof(int)) + { + return (ITensorArithmetic)new IntArithmetic(); + } + else if (typeof(T) == typeof(long)) + { + return (ITensorArithmetic)new LongArithmetic(); + } + else if (typeof(T) == typeof(sbyte)) + { + return (ITensorArithmetic)new SByteArithmetic(); + } + else if (typeof(T) == typeof(short)) + { + return (ITensorArithmetic)new ShortArithmetic(); + } + else if (typeof(T) == typeof(uint)) + { + return (ITensorArithmetic)new UIntArithmetic(); + } + else if (typeof(T) == typeof(ulong)) + { + return (ITensorArithmetic)new ULongArithmetic(); + } + else if (typeof(T) == typeof(ushort)) + { + return (ITensorArithmetic)new UShortArithmetic(); + } + return null; + } + } + + internal class BoolArithmetic : ITensorArithmetic + { + public bool One => true; + public bool Zero => false; + + public void Add(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Add(Tensor tensor, bool scalar, Tensor result) + { + throw new NotSupportedException(); + } + public void And(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (bool)(left[indices] & right[indices]); + } + + } + public void And(Tensor tensor, bool scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (bool)(tensor[indices] & scalar); + } + + } + public void Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + throw new NotSupportedException(); + } + public void Decrement(Tensor tensor, Tensor result) + { + throw new NotSupportedException(); + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Divide(Tensor tensor, bool scalar, Tensor result) + { + throw new NotSupportedException(); + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Increment(Tensor tensor, Tensor result) + { + throw new NotSupportedException(); + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + throw new NotSupportedException(); + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + throw new 
NotSupportedException(); + } + public void Modulo(Tensor tensor, bool scalar, Tensor result) + { + throw new NotSupportedException(); + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Multiply(Tensor tensor, bool scalar, Tensor result) + { + throw new NotSupportedException(); + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (bool)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, bool scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (bool)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + throw new NotSupportedException(); + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Subtract(Tensor tensor, bool scalar, Tensor result) + { + throw new NotSupportedException(); + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + throw new NotSupportedException(); + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + throw new NotSupportedException(); + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (bool)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, bool scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (bool)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Add(DenseTensor tensor, bool scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (bool)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (bool)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, bool scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (bool)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (bool)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Divide(DenseTensor tensor, bool scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + throw new NotSupportedException(); + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + throw new NotSupportedException(); + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Modulo(DenseTensor tensor, bool scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Multiply(DenseTensor tensor, bool scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (bool)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (bool)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, bool scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (bool)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (bool)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Subtract(DenseTensor tensor, bool scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + throw new NotSupportedException(); + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (bool)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (bool)(leftSpan[op1Index] ^ rightSpan[op2Index]); + + } + } + } + public void Xor(DenseTensor tensor, bool scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (bool)(tensorSpan[i] ^ scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (bool)(tensorSpan[op1Index] ^ scalar); + + } + } + } + } + internal class ByteArithmetic : ITensorArithmetic + { + public byte One => 1; + public byte Zero => 0; + + public void Add(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(left[indices] + right[indices]); + } + + } + public void Add(Tensor tensor, byte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] + scalar); + } + + } + public void And(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(left[indices] & right[indices]); + } + + } + public void And(Tensor tensor, byte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] & scalar); + } + + } + public void Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + var leftIndices = new int[left.Rank]; + var rightIndices = new int[right.Rank]; + var resultIndices = new int[result.Rank]; + + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] 
leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + byte sum = (byte)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (byte)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, byte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor right, Tensor 
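Contract above computes a generalized tensor contraction: each result element is the sum, over the paired left/right axes, of the products of the corresponding elements. A worked sketch (contracting axis 1 of a 2x3 tensor with axis 0 of a 3x2 tensor is an ordinary matrix multiply):

    var left   = new DenseTensor<byte>(new byte[] { 1, 2, 3, 4, 5, 6 }, new[] { 2, 3 });
    var right  = new DenseTensor<byte>(new byte[] { 1, 0, 0, 1, 1, 1 }, new[] { 3, 2 });
    var result = new DenseTensor<byte>(new[] { 2, 2 });

    TensorArithmetic<byte>.Instance.Contract(left, right, new[] { 1 }, new[] { 0 }, result);
    // result[0,0] = 1*1 + 2*0 + 3*1 = 4      result[0,1] = 1*0 + 2*1 + 3*1 = 5
    // result[1,0] = 4*1 + 5*0 + 6*1 = 10     result[1,1] = 4*0 + 5*1 + 6*1 = 11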
result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, byte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, byte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, byte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, 
result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] >> value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, byte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, byte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (byte)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, byte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, byte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + byte sum = (byte)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (byte)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; 
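+                // Mixed-layout path (descriptive comments; an interpretation of the code below): the result
+                // and its operands do not all share the same element order (row-major vs. reversed/column-major
+                // strides), so one flat loop cannot index every buffer directly. The loop that follows walks
+                // the result linearly in row-major order and, on each iteration, converts that counter into the
+                // matching column-major offset via TransformIndexByStrides; the `ref int` locals alias whichever
+                // counter (rowMajorIndex or colMajorIndex) each tensor needs, so the loop body indexes all
+                // buffers without re-testing IsReversedStride on every element.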
+ + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, byte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, byte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, byte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, byte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, byte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)-tensorSpan[op1Index]; + + } + } + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(leftSpan[op1Index] ^ rightSpan[op2Index]); + + } + } + } + public void Xor(DenseTensor tensor, byte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (byte)(tensorSpan[i] ^ scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (byte)(tensorSpan[op1Index] ^ scalar); + + } + } + } + } + internal class CharArithmetic : ITensorArithmetic + { + public char One => (char)1; + public char Zero => (char)0; + + public void Add(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(left[indices] + right[indices]); + } + + } + public void Add(Tensor tensor, char scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] + scalar); + } + + } + public void And(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(left[indices] & right[indices]); + } + + } + public void And(Tensor tensor, char scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] & scalar); + } + + } + public void Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + var leftIndices = new int[left.Rank]; + var rightIndices = new int[right.Rank]; + var resultIndices = new int[result.Rank]; + + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. 
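+            // Illustration only (not part of the generated code): contracting a [M,K] left tensor with a
+            // [K,N] right tensor over leftAxes = {1} and rightAxes = {0} is an ordinary matrix multiply.
+            // In that case the left non-summing strides recover the M-index from a result index, the right
+            // non-summing strides (written at the offset computed below, i.e. after the left's surviving
+            // dimensions) recover the N-index, and the summing strides map the flat summing counter onto the
+            // shared K dimension of each operand.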
+ int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + char sum = (char)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (char)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, char scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + 
for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, char scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, char scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, char scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = 
(char)(tensor[indices] >> value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, char scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, char scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (char)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, char scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, char scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + char sum = (char)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (char)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; 
+ + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, char scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, char scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, char scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, char scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, char scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)-tensorSpan[op1Index]; + + } + } + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (char)(leftSpan[op1Index] ^ rightSpan[op2Index]); + + } + } + } + public void Xor(DenseTensor tensor, char scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (char)(tensorSpan[i] ^ scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides :
+                                         tensor.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (char)(tensorSpan[op1Index] ^ scalar);
+
+                }
+            }
+        }
+    }
+    internal class DecimalArithmetic : ITensorArithmetic<decimal>
+    {
+        public decimal One => 1;
+        public decimal Zero => 0;
+
+        public void Add(Tensor<decimal> left, Tensor<decimal> right, Tensor<decimal> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (decimal)(left[indices] + right[indices]);
+            }
+
+        }
+        public void Add(Tensor<decimal> tensor, decimal scalar, Tensor<decimal> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (decimal)(tensor[indices] + scalar);
+            }
+
+        }
+        public void And(Tensor<decimal> left, Tensor<decimal> right, Tensor<decimal> result)
+        {
+            throw new NotSupportedException();
+        }
+        public void And(Tensor<decimal> tensor, decimal scalar, Tensor<decimal> result)
+        {
+            throw new NotSupportedException();
+        }
+        public void Contract(Tensor<decimal> left, Tensor<decimal> right, int[] leftAxes, int[] rightAxes, Tensor<decimal> result)
+        {
+            var leftIndices = new int[left.Rank];
+            var rightIndices = new int[right.Rank];
+            var resultIndices = new int[result.Rank];
+
+            var summingDimensions = new int[leftAxes.Length];
+            for(int i = 0; i < leftAxes.Length; i++)
+            {
+                summingDimensions[i] = left.dimensions[leftAxes[i]];
+            }
+
+            var summingStrides = ArrayUtilities.GetStrides(summingDimensions);
+            int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions);
+
+            var resultStrides = result.strides;
+
+            // translates from result index to left non-summing dimensions' index portion
+            // since left non-summing dimensions are given precedence in result, the end is zero-padded
+            int[] leftNonSummingStrides = new int[result.Rank];
+
+            // translates from summing index to left summing dimensions' index portion
+            int[] leftSummingStrides = new int[leftAxes.Length];
+            ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0);
+
+            // translates from result index to right non-summing dimensions' index portion
+            int[] rightNonSummingStrides = new int[result.Rank];
+            // right non-summing dimensions appear after left non-summing dimensions.
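+            // The result layout is [left non-summing dims..., right non-summing dims...], so the
+            // right tensor's non-summing strides are split into the slots that follow the left
+            // tensor's, i.e. starting at offset left.Rank - leftAxes.Length.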
+ int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + decimal sum = (decimal)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (decimal)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, decimal scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new 
int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + throw new NotSupportedException(); + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, decimal scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, decimal scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Or(Tensor tensor, decimal scalar, Tensor result) + { + throw new NotSupportedException(); + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + throw new NotSupportedException(); + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, decimal scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new 
int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (decimal)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Xor(Tensor tensor, decimal scalar, Tensor result) + { + throw new NotSupportedException(); + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, decimal scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void And(DenseTensor tensor, decimal scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + decimal sum = (decimal)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (decimal)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == 
left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, decimal scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + throw new NotSupportedException(); + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, decimal scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, decimal scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Or(DenseTensor tensor, decimal scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, decimal scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (decimal)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (decimal)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex;
+
+                var rowMajorStrides = !result.IsReversedStride ? result.strides :
+                                      tensor.strides;
+                var columnMajorStrides = result.IsReversedStride ? result.strides :
+                                         tensor.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (decimal)-tensorSpan[op1Index];
+
+                }
+            }
+        }
+        public void UnaryPlus(DenseTensor<decimal> tensor, DenseTensor<decimal> result)
+        {
+
+            var resultSpan = result.Buffer.Span;
+            var tensorSpan = tensor.Buffer.Span;
+            if (result.IsReversedStride == tensor.IsReversedStride)
+            {
+                for(int i = 0; i < resultSpan.Length; i++)
+                {
+                    resultSpan[i] = (decimal)+tensorSpan[i];
+                }
+            }
+            else
+            {
+                int rowMajorIndex = 0;
+                int colMajorIndex = 0;
+
+                ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex;
+                ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex;
+
+                var rowMajorStrides = !result.IsReversedStride ? result.strides :
+                                      tensor.strides;
+                var columnMajorStrides = result.IsReversedStride ? result.strides :
+                                         tensor.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (decimal)+tensorSpan[op1Index];
+
+                }
+            }
+        }
+        public void Xor(DenseTensor<decimal> left, DenseTensor<decimal> right, DenseTensor<decimal> result)
+        {
+            throw new NotSupportedException();
+        }
+        public void Xor(DenseTensor<decimal> tensor, decimal scalar, DenseTensor<decimal> result)
+        {
+            throw new NotSupportedException();
+        }
+    }
+    internal class DoubleArithmetic : ITensorArithmetic<double>
+    {
+        public double One => 1.0;
+        public double Zero => 0;
+
+        public void Add(Tensor<double> left, Tensor<double> right, Tensor<double> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (double)(left[indices] + right[indices]);
+            }
+
+        }
+        public void Add(Tensor<double> tensor, double scalar, Tensor<double> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (double)(tensor[indices] + scalar);
+            }
+
+        }
+        public void And(Tensor<double> left, Tensor<double> right, Tensor<double> result)
+        {
+            throw new NotSupportedException();
+        }
+        public void And(Tensor<double> tensor, double scalar, Tensor<double> result)
+        {
+            throw new NotSupportedException();
+        }
+        public void Contract(Tensor<double> left, Tensor<double> right, int[] leftAxes, int[] rightAxes, Tensor<double> result)
+        {
+            var leftIndices = new int[left.Rank];
+            var rightIndices = new int[right.Rank];
+            var resultIndices = new int[result.Rank];
+
+            var summingDimensions = new int[leftAxes.Length];
+            for(int i = 0; i < leftAxes.Length; i++)
+            {
+                summingDimensions[i] = left.dimensions[leftAxes[i]];
+            }
+
+            var summingStrides = ArrayUtilities.GetStrides(summingDimensions);
+            int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions);
+
+            var resultStrides = result.strides;
+
+            // translates from result index to left non-summing dimensions' index portion
+            // since left non-summing dimensions are given precedence in result, the end is zero-padded
+            int[] leftNonSummingStrides = new int[result.Rank];
+
+            // translates from summing index to left summing dimensions' index portion
+            int[]
leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + double sum = (double)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (double)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, double scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor 
right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + throw new NotSupportedException(); + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, double scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, double scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Or(Tensor tensor, double scalar, Tensor result) + { + throw new NotSupportedException(); + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + throw new NotSupportedException(); + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, double scalar, Tensor 
result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (double)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Xor(Tensor tensor, double scalar, Tensor result) + { + throw new NotSupportedException(); + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, double scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void And(DenseTensor tensor, double scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + double sum = (double)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (double)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == 
left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, double scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + throw new NotSupportedException(); + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, double scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, double scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Or(DenseTensor tensor, double scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, double scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (double)-tensorSpan[op1Index]; + + } + } + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (double)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides :
+                                         tensor.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (double)+tensorSpan[op1Index];
+
+                }
+            }
+        }
+        public void Xor(DenseTensor<double> left, DenseTensor<double> right, DenseTensor<double> result)
+        {
+            throw new NotSupportedException();
+        }
+        public void Xor(DenseTensor<double> tensor, double scalar, DenseTensor<double> result)
+        {
+            throw new NotSupportedException();
+        }
+    }
+    internal class FloatArithmetic : ITensorArithmetic<float>
+    {
+        public float One => 1.0f;
+        public float Zero => 0;
+
+        public void Add(Tensor<float> left, Tensor<float> right, Tensor<float> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (float)(left[indices] + right[indices]);
+            }
+
+        }
+        public void Add(Tensor<float> tensor, float scalar, Tensor<float> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (float)(tensor[indices] + scalar);
+            }
+
+        }
+        public void And(Tensor<float> left, Tensor<float> right, Tensor<float> result)
+        {
+            throw new NotSupportedException();
+        }
+        public void And(Tensor<float> tensor, float scalar, Tensor<float> result)
+        {
+            throw new NotSupportedException();
+        }
+        public void Contract(Tensor<float> left, Tensor<float> right, int[] leftAxes, int[] rightAxes, Tensor<float> result)
+        {
+            var leftIndices = new int[left.Rank];
+            var rightIndices = new int[right.Rank];
+            var resultIndices = new int[result.Rank];
+
+            var summingDimensions = new int[leftAxes.Length];
+            for(int i = 0; i < leftAxes.Length; i++)
+            {
+                summingDimensions[i] = left.dimensions[leftAxes[i]];
+            }
+
+            var summingStrides = ArrayUtilities.GetStrides(summingDimensions);
+            int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions);
+
+            var resultStrides = result.strides;
+
+            // translates from result index to left non-summing dimensions' index portion
+            // since left non-summing dimensions are given precedence in result, the end is zero-padded
+            int[] leftNonSummingStrides = new int[result.Rank];
+
+            // translates from summing index to left summing dimensions' index portion
+            int[] leftSummingStrides = new int[leftAxes.Length];
+            ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0);
+
+            // translates from result index to right non-summing dimensions' index portion
+            int[] rightNonSummingStrides = new int[result.Rank];
+            // right non-summing dimensions appear after left non-summing dimensions.
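+            // For example, contracting a 2x3 'left' with a 3x4 'right' over leftAxes = {1} and rightAxes = {0}
+            // (an ordinary matrix multiply) gives summingDimensions = {3}, summingLength = 3 and a 2x4 result:
+            // the left non-summing dimension {2} becomes result dimension 0 and the right non-summing
+            // dimension {4} becomes result dimension 1, which is why SplitStrides is called with an offset
+            // for the right operand below.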
+            int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length);
+
+            // translates from summing index to right summing dimensions' index portion
+            int[] rightSummingStrides = new int[rightAxes.Length];
+            ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0);
+
+            for (int resultIndex = 0; resultIndex < result.Length; resultIndex++)
+            {
+                float sum = (float)0;
+
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices);
+
+                int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides);
+                int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides);
+
+                for (int summingIndex = 0; summingIndex < summingLength; summingIndex++)
+                {
+                    int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides);
+                    int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides);
+
+                    int leftIndex = leftIndexNonSumming + leftIndexSumming;
+                    int rightIndex = rightIndexNonSumming + rightIndexSumming;
+
+                    // todo, make this more efficient
+                    ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices);
+                    ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices);
+
+                    sum += (float)(left[leftIndices] * right[rightIndices]);
+                }
+
+                result[resultIndices] = sum;
+            }
+        }
+        public void Decrement(Tensor<float> tensor, Tensor<float> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices]--;
+            }
+
+        }
+        public void Divide(Tensor<float> left, Tensor<float> right, Tensor<float> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (float)(left[indices] / right[indices]);
+            }
+
+        }
+        public void Divide(Tensor<float> tensor, float scalar, Tensor<float> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (float)(tensor[indices] / scalar);
+            }
+
+        }
+        public void Equals(Tensor<float> left, Tensor<float> right, Tensor<bool> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = left[indices] == right[indices];
+            }
+
+        }
+        public void GreaterThan(Tensor<float> left, Tensor<float> right, Tensor<bool> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = left[indices] > right[indices];
+            }
+
+        }
+        public void GreaterThanOrEqual(Tensor<float> left, Tensor<float> right, Tensor<bool> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = left[indices] >= right[indices];
+            }
+
+        }
+        public void Increment(Tensor<float> tensor, Tensor<float> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
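+            // 'indices' is a scratch buffer reused on every iteration: GetIndices decomposes the linear
+            // index i into a multi-dimensional index using result.strides (honoring IsReversedStride for
+            // column-major layouts), so the indexer in the loop below updates the element that position i
+            // denotes in iteration order.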
+ for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + throw new NotSupportedException(); + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (float)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, float scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (float)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (float)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, float scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (float)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Or(Tensor tensor, float scalar, Tensor result) + { + throw new NotSupportedException(); + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + throw new NotSupportedException(); + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (float)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, float scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (float)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < 
result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (float)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (float)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + throw new NotSupportedException(); + } + public void Xor(Tensor tensor, float scalar, Tensor result) + { + throw new NotSupportedException(); + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, float scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void And(DenseTensor tensor, float scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + float sum = (float)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (float)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == 
left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, float scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + throw new NotSupportedException(); + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, float scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, float scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Or(DenseTensor tensor, float scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, float scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)-tensorSpan[op1Index]; + + } + } + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (float)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (float)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + throw new NotSupportedException(); + } + public void Xor(DenseTensor tensor, float scalar, DenseTensor result) + { + throw new NotSupportedException(); + } + } + internal class IntArithmetic : ITensorArithmetic + { + public int One => 1; + public int Zero => 0; + + public void Add(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(left[indices] + right[indices]); + } + + } + public void Add(Tensor tensor, int scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] + scalar); + } + + } + public void And(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(left[indices] & right[indices]); + } + + } + public void And(Tensor tensor, int scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] & scalar); + } + + } + public void Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + var leftIndices = new int[left.Rank]; + var rightIndices = new int[right.Rank]; + var resultIndices = new int[result.Rank]; + + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. 
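The stride bookkeeping in Contract generalizes the ordinary matrix product: each result position is split into a left non-summing part and a right non-summing part, and the contracted axes are accumulated in an inner loop. As a rough standalone illustration of the 2-D special case (plain arrays and a hypothetical helper name, not the library code):

    // Sketch only: the 2-D case of tensor contraction (a matrix product),
    // written with linear indices and row-major strides to mirror the
    // summing / non-summing split used by the generated Contract methods.
    static int[] ContractMatrices(int[] left, int[] right, int m, int k, int n)
    {
        // left is m x k, right is k x n, result is m x n, all row-major.
        var result = new int[m * n];
        for (int i = 0; i < m; i++)             // left non-summing axis
        {
            for (int j = 0; j < n; j++)         // right non-summing axis
            {
                int sum = 0;
                for (int s = 0; s < k; s++)     // summing (contracted) axis
                {
                    sum += left[i * k + s] * right[s * n + j];
                }
                result[i * n + j] = sum;
            }
        }
        return result;
    }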
+ int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + int sum = (int)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (int)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, int scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i 
= 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, int scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, int scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, int scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] >> 
value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, int scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, int scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (int)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, int scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, int scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + int sum = (int)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (int)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + 
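Every mixed-layout branch in this file uses the same trick: a single row-major counter drives the loop, a column-major index is re-derived from it on each iteration, and ref locals bind each operand's index to whichever of the two counters matches that operand's stride order. A simplified standalone sketch of the idea (the inline index arithmetic stands in for ArrayUtilities.TransformIndexByStrides and is not the library implementation):

    // Sketch: copy a rows x cols buffer stored row-major into one stored
    // column-major, using the same ref-local aliasing pattern as above.
    static void CopyRowMajorToColumnMajor(int[] src, int[] dst, int rows, int cols)
    {
        int rowMajorIndex = 0;
        int colMajorIndex = 0;

        // src is row-major and dst is column-major, so alias each index to
        // the matching counter; with three operands this is the role the
        // "ref int resultIndex / op1Index / op2Index" locals play above.
        ref int srcIndex = ref rowMajorIndex;
        ref int dstIndex = ref colMajorIndex;

        for (; rowMajorIndex < src.Length; rowMajorIndex++)
        {
            int r = rowMajorIndex / cols;       // coordinates of this element
            int c = rowMajorIndex % cols;
            colMajorIndex = c * rows + r;       // same element, column-major

            dst[dstIndex] = src[srcIndex];
        }
    }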
ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, int scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, int scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, int scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, int scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, int scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)-tensorSpan[op1Index]; + + } + } + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(leftSpan[op1Index] ^ rightSpan[op2Index]); + + } + } + } + public void Xor(DenseTensor tensor, int scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (int)(tensorSpan[i] ^ scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (int)(tensorSpan[op1Index] ^ scalar); + + } + } + } + } + internal class LongArithmetic : ITensorArithmetic + { + public long One => 1; + public long Zero => 0; + + public void Add(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(left[indices] + right[indices]); + } + + } + public void Add(Tensor tensor, long scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] + scalar); + } + + } + public void And(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(left[indices] & right[indices]); + } + + } + public void And(Tensor tensor, long scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] & scalar); + } + + } + public void Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + var leftIndices = new int[left.Rank]; + var rightIndices = new int[right.Rank]; + var resultIndices = new int[result.Rank]; + + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. 
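The Tensor-based overloads in this file address elements through ArrayUtilities.GetIndices, which turns the flat loop counter into per-dimension coordinates using the tensor's strides. A minimal row-major sketch of that stride and coordinate arithmetic (standalone helpers, not the library implementation):

    // Sketch: row-major strides and the linear-index -> coordinates mapping
    // that ArrayUtilities.GetStrides / GetIndices provide for these loops.
    static int[] ComputeStrides(int[] dimensions)
    {
        var strides = new int[dimensions.Length];
        int stride = 1;
        for (int i = dimensions.Length - 1; i >= 0; i--)
        {
            strides[i] = stride;      // elements to skip per step along axis i
            stride *= dimensions[i];
        }
        return strides;
    }

    static void ToCoordinates(int[] strides, int linearIndex, int[] indices)
    {
        for (int i = 0; i < strides.Length; i++)
        {
            indices[i] = linearIndex / strides[i];   // coordinate along axis i
            linearIndex %= strides[i];               // remainder for later axes
        }
    }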
+ int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + long sum = (long)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (long)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, long scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + 
for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, long scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, long scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, long scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = 
(long)(tensor[indices] >> value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, long scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, long scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (long)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, long scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, long scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + long sum = (long)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (long)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; 
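                // Note on the mixed-layout path that follows (an explanatory sketch, not generated
                // code): rowMajorIndex counts through the buffers stored row-major and colMajorIndex
                // through the "reversed stride" (column-major) ones. Each operand binds a ref local to
                // whichever counter matches its own layout, so a single loop advances rowMajorIndex
                // and recomputes colMajorIndex via TransformIndexByStrides instead of materialising a
                // multi-dimensional index per element. The ref-conditional trick in isolation:
                //
                //     int a = 0, b = 0;
                //     bool useB = true;                          // hypothetical flag
                //     ref int chosen = ref (useB ? ref b : ref a);
                //     chosen = 5;                                // writes b, leaves a untouched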
+ + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, long scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, long scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, long scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, long scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, long scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)-tensorSpan[op1Index]; + + } + } + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (long)(leftSpan[op1Index] ^ rightSpan[op2Index]); + + } + } + } + public void Xor(DenseTensor tensor, long scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (long)(tensorSpan[i] ^ scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides :
+                                                   tensor.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (long)(tensorSpan[op1Index] ^ scalar);
+
+                }
+            }
+        }
+    }
+    internal class SByteArithmetic : ITensorArithmetic<sbyte>
+    {
+        public sbyte One => 1;
+        public sbyte Zero => 0;
+
+        public void Add(Tensor<sbyte> left, Tensor<sbyte> right, Tensor<sbyte> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (sbyte)(left[indices] + right[indices]);
+            }
+
+        }
+        public void Add(Tensor<sbyte> tensor, sbyte scalar, Tensor<sbyte> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (sbyte)(tensor[indices] + scalar);
+            }
+
+        }
+        public void And(Tensor<sbyte> left, Tensor<sbyte> right, Tensor<sbyte> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (sbyte)(left[indices] & right[indices]);
+            }
+
+        }
+        public void And(Tensor<sbyte> tensor, sbyte scalar, Tensor<sbyte> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (sbyte)(tensor[indices] & scalar);
+            }
+
+        }
+        public void Contract(Tensor<sbyte> left, Tensor<sbyte> right, int[] leftAxes, int[] rightAxes, Tensor<sbyte> result)
+        {
+            var leftIndices = new int[left.Rank];
+            var rightIndices = new int[right.Rank];
+            var resultIndices = new int[result.Rank];
+
+            var summingDimensions = new int[leftAxes.Length];
+            for(int i = 0; i < leftAxes.Length; i++)
+            {
+                summingDimensions[i] = left.dimensions[leftAxes[i]];
+            }
+
+            var summingStrides = ArrayUtilities.GetStrides(summingDimensions);
+            int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions);
+
+            var resultStrides = result.strides;
+
+            // translates from result index to left non-summing dimensions' index portion
+            // since left non-summing dimensions are given precedence in result, the end is zero-padded
+            int[] leftNonSummingStrides = new int[result.Rank];
+
+            // translates from summing index to left summing dimensions' index portion
+            int[] leftSummingStrides = new int[leftAxes.Length];
+            ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0);
+
+            // translates from result index to right non-summing dimensions' index portion
+            int[] rightNonSummingStrides = new int[result.Rank];
+            // right non-summing dimensions appear after left non-summing dimensions.
+ int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + sbyte sum = (sbyte)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (sbyte)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, sbyte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); 
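            // Explanatory note on this general (non-dense) overload: it goes through the tensor
            // indexer rather than assuming a contiguous backing buffer, so each iteration below
            // converts the flat counter i back into a full multi-dimensional index with GetIndices.
            // The single Span<int> scratch buffer declared above is reused for every element, so the
            // loop performs no per-element allocation. The DenseTensor<T> overloads later in this
            // class operate directly on Buffer.Span instead, which is the fast path.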
+ for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, sbyte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, sbyte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, sbyte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = 
(sbyte)(tensor[indices] >> value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, sbyte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, sbyte scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (sbyte)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, sbyte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, sbyte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + sbyte sum = (sbyte)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (sbyte)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex 
= 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, sbyte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, sbyte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, sbyte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, sbyte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, sbyte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)-tensorSpan[op1Index]; + + } + } + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (sbyte)(leftSpan[op1Index] ^ rightSpan[op2Index]); + + } + } + } + public void Xor(DenseTensor tensor, sbyte scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (sbyte)(tensorSpan[i] ^ scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides :
+                                                       tensor.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (sbyte)(tensorSpan[op1Index] ^ scalar);
+
+                }
+            }
+        }
+    }
+    internal class ShortArithmetic : ITensorArithmetic<short>
+    {
+        public short One => 1;
+        public short Zero => 0;
+
+        public void Add(Tensor<short> left, Tensor<short> right, Tensor<short> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (short)(left[indices] + right[indices]);
+            }
+
+        }
+        public void Add(Tensor<short> tensor, short scalar, Tensor<short> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (short)(tensor[indices] + scalar);
+            }
+
+        }
+        public void And(Tensor<short> left, Tensor<short> right, Tensor<short> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (short)(left[indices] & right[indices]);
+            }
+
+        }
+        public void And(Tensor<short> tensor, short scalar, Tensor<short> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (short)(tensor[indices] & scalar);
+            }
+
+        }
+        public void Contract(Tensor<short> left, Tensor<short> right, int[] leftAxes, int[] rightAxes, Tensor<short> result)
+        {
+            var leftIndices = new int[left.Rank];
+            var rightIndices = new int[right.Rank];
+            var resultIndices = new int[result.Rank];
+
+            var summingDimensions = new int[leftAxes.Length];
+            for(int i = 0; i < leftAxes.Length; i++)
+            {
+                summingDimensions[i] = left.dimensions[leftAxes[i]];
+            }
+
+            var summingStrides = ArrayUtilities.GetStrides(summingDimensions);
+            int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions);
+
+            var resultStrides = result.strides;
+
+            // translates from result index to left non-summing dimensions' index portion
+            // since left non-summing dimensions are given precedence in result, the end is zero-padded
+            int[] leftNonSummingStrides = new int[result.Rank];
+
+            // translates from summing index to left summing dimensions' index portion
+            int[] leftSummingStrides = new int[leftAxes.Length];
+            ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0);
+
+            // translates from result index to right non-summing dimensions' index portion
+            int[] rightNonSummingStrides = new int[result.Rank];
+            // right non-summing dimensions appear after left non-summing dimensions.
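+            // Illustrative example (assumed values, not part of the original generated code): contracting a
+            // left tensor with dimensions [2,3,4] (row-major strides [12,4,1]) over leftAxes = [1] with a
+            // right tensor with dimensions [3,5] (strides [5,1]) over rightAxes = [0] produces a rank-3 result.
+            // SplitStrides above would yield leftNonSummingStrides = [12, 1, 0] and leftSummingStrides = [4];
+            // the offset computed next places right's lone non-summing stride at position 2, giving
+            // rightNonSummingStrides = [0, 0, 1] and rightSummingStrides = [5].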
+ int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + short sum = (short)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (short)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, short scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); 
+ for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, short scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, short scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, short scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = 
(short)(tensor[indices] >> value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, short scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)-tensor[indices]; + } + + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, short scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (short)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, short scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, short scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + short sum = (short)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (short)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex 
= 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, short scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, short scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, short scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, short scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, short scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)-tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)-tensorSpan[op1Index]; + + } + } + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(leftSpan[op1Index] ^ rightSpan[op2Index]); + + } + } + } + public void Xor(DenseTensor tensor, short scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (short)(tensorSpan[i] ^ scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (short)(tensorSpan[op1Index] ^ scalar); + + } + } + } + } + internal class UIntArithmetic : ITensorArithmetic + { + public uint One => 1; + public uint Zero => 0; + + public void Add(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(left[indices] + right[indices]); + } + + } + public void Add(Tensor tensor, uint scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(tensor[indices] + scalar); + } + + } + public void And(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(left[indices] & right[indices]); + } + + } + public void And(Tensor tensor, uint scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(tensor[indices] & scalar); + } + + } + public void Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + var leftIndices = new int[left.Rank]; + var rightIndices = new int[right.Rank]; + var resultIndices = new int[result.Rank]; + + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. 
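+            // SplitStrides distributes each operand's per-dimension strides into two arrays:
+            // entries for the contracted axes (leftAxes/rightAxes) go into the *Summing* array,
+            // indexed by the summing-space coordinate, while the remaining entries go into the
+            // *NonSumming* array, laid out to match the result's dimension order. Because the
+            // result's dimensions are left's non-summed dimensions followed by right's, right's
+            // non-summing entries are written starting at offset (left.Rank - leftAxes.Length).
+            // TransformIndexByStrides then uses these arrays to turn a result linear index into
+            // each operand's base offset, and a summing-space index into the offset of each term.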
+            int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length);
+
+            // translates from summing index to right summing dimensions' index portion
+            int[] rightSummingStrides = new int[rightAxes.Length];
+            ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0);
+
+            for (int resultIndex = 0; resultIndex < result.Length; resultIndex++)
+            {
+                uint sum = (uint)0;
+
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices);
+
+                int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides);
+                int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides);
+
+                for (int summingIndex = 0; summingIndex < summingLength; summingIndex++)
+                {
+                    int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides);
+                    int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides);
+
+                    int leftIndex = leftIndexNonSumming + leftIndexSumming;
+                    int rightIndex = rightIndexNonSumming + rightIndexSumming;
+
+                    // todo, make this more efficient
+                    ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices);
+                    ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices);
+
+                    sum += (uint)(left[leftIndices] * right[rightIndices]);
+                }
+
+                result[resultIndices] = sum;
+            }
+        }
+        public void Decrement(Tensor<uint> tensor, Tensor<uint> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices]--;
+            }
+
+        }
+        public void Divide(Tensor<uint> left, Tensor<uint> right, Tensor<uint> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (uint)(left[indices] / right[indices]);
+            }
+
+        }
+        public void Divide(Tensor<uint> tensor, uint scalar, Tensor<uint> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (uint)(tensor[indices] / scalar);
+            }
+
+        }
+        public void Equals(Tensor<uint> left, Tensor<uint> right, Tensor<bool> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = left[indices] == right[indices];
+            }
+
+        }
+        public void GreaterThan(Tensor<uint> left, Tensor<uint> right, Tensor<bool> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = left[indices] > right[indices];
+            }
+
+        }
+        public void GreaterThanOrEqual(Tensor<uint> left, Tensor<uint> right, Tensor<bool> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = left[indices] >= right[indices];
+            }
+
+        }
+        public void Increment(Tensor<uint> tensor, Tensor<uint> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+
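+            // GetIndices expands the linear index i into a full coordinate vector for this
+            // tensor's layout (honouring IsReversedStride), so each element can be accessed
+            // through the general Tensor<uint> indexer. The DenseTensor<uint> overloads further
+            // down avoid this per-element expansion by indexing the backing Buffer.Span directly.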
for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, uint scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, uint scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, uint scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = 
(uint)(tensor[indices] >> value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, uint scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + throw new NotSupportedException(); + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, uint scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (uint)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, uint scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, uint scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + uint sum = (uint)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (uint)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; 
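+                // This branch runs only when result, left and right do not all share the same
+                // IsReversedStride. The loop advances a row-major counter and derives the matching
+                // column-major counter via TransformIndexByStrides; each tensor is then addressed
+                // through a 'ref int' alias bound to whichever counter matches its own layout.
+                // Because the element-wise operands are assumed to have identical dimensions, any
+                // non-reversed tensor can supply the row-major strides and any reversed one the
+                // column-major strides. e.g. for dimensions {2,3}: row-major strides are {3,1},
+                // column-major strides are {1,2}; row-major index 4 -> coordinates (1,1) ->
+                // column-major index 3.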
+ + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, uint scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, uint scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, uint scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, uint scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, uint scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + throw new NotSupportedException(); + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (uint)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (uint)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
+                                     left.strides :
+                                     right.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (uint)(leftSpan[op1Index] ^ rightSpan[op2Index]);
+
+                }
+            }
+        }
+        public void Xor(DenseTensor<uint> tensor, uint scalar, DenseTensor<uint> result)
+        {
+
+            var resultSpan = result.Buffer.Span;
+            var tensorSpan = tensor.Buffer.Span;
+            if (result.IsReversedStride == tensor.IsReversedStride)
+            {
+                for(int i = 0; i < resultSpan.Length; i++)
+                {
+                    resultSpan[i] = (uint)(tensorSpan[i] ^ scalar);
+                }
+            }
+            else
+            {
+                int rowMajorIndex = 0;
+                int colMajorIndex = 0;
+
+                ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex;
+                ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex;
+
+                var rowMajorStrides = !result.IsReversedStride ? result.strides :
+                                      tensor.strides;
+                var columnMajorStrides = result.IsReversedStride ? result.strides :
+                                         tensor.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (uint)(tensorSpan[op1Index] ^ scalar);
+
+                }
+            }
+        }
+    }
+    internal class ULongArithmetic : ITensorArithmetic<ulong>
+    {
+        public ulong One => 1;
+        public ulong Zero => 0;
+
+        public void Add(Tensor<ulong> left, Tensor<ulong> right, Tensor<ulong> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (ulong)(left[indices] + right[indices]);
+            }
+
+        }
+        public void Add(Tensor<ulong> tensor, ulong scalar, Tensor<ulong> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (ulong)(tensor[indices] + scalar);
+            }
+
+        }
+        public void And(Tensor<ulong> left, Tensor<ulong> right, Tensor<ulong> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (ulong)(left[indices] & right[indices]);
+            }
+
+        }
+        public void And(Tensor<ulong> tensor, ulong scalar, Tensor<ulong> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (ulong)(tensor[indices] & scalar);
+            }
+
+        }
+        public void Contract(Tensor<ulong> left, Tensor<ulong> right, int[] leftAxes, int[] rightAxes, Tensor<ulong> result)
+        {
+            var leftIndices = new int[left.Rank];
+            var rightIndices = new int[right.Rank];
+            var resultIndices = new int[result.Rank];
+
+            var summingDimensions = new int[leftAxes.Length];
+            for(int i = 0; i < leftAxes.Length; i++)
+            {
+                summingDimensions[i] = left.dimensions[leftAxes[i]];
+            }
+
+            var summingStrides = ArrayUtilities.GetStrides(summingDimensions);
+            int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions);
+
+            var resultStrides = result.strides;
+
+            // translates from result index to left non-summing dimensions' index portion
+            // since left non-summing dimensions are given precedence in result, the end is zero-padded
+            int[] leftNonSummingStrides = new int[result.Rank];
+
+            // translates from summing index to left summing dimensions' index portion
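The loops in these generated arithmetic classes all lean on one piece of stride arithmetic: when operands disagree on layout, a flat row-major index is decomposed into per-dimension coordinates and re-encoded against the other operand's reversed (column-major style) strides. The snippet below is an editorial sketch of that remapping under the usual row-major stride convention; ComputeStrides and Remap are hypothetical stand-ins for the library's ArrayUtilities.GetStrides/TransformIndexByStrides, not the actual API, and it is not part of the patch.

    // Illustrative only; mirrors the index remapping used by the generated code above.
    using System;

    static class StrideSketch
    {
        // Row-major strides: stride[i] is the product of the dimensions to its right.
        static int[] ComputeStrides(int[] dims)
        {
            var strides = new int[dims.Length];
            int acc = 1;
            for (int i = dims.Length - 1; i >= 0; i--)
            {
                strides[i] = acc;
                acc *= dims[i];
            }
            return strides;
        }

        // Re-express a flat index taken against row-major 'fromStrides' (descending order)
        // as a flat index against 'toStrides' for the same shape.
        static int Remap(int index, int[] fromStrides, int[] toStrides)
        {
            int remapped = 0;
            int remainder = index;
            for (int i = 0; i < fromStrides.Length; i++)
            {
                int coordinate = remainder / fromStrides[i];
                remainder -= coordinate * fromStrides[i];
                remapped += coordinate * toStrides[i];
            }
            return remapped;
        }

        static void Main()
        {
            int[] dims = { 2, 3 };
            int[] rowMajor = ComputeStrides(dims);           // { 3, 1 }
            int[] colMajor = { 1, 2 };                       // reversed-stride layout of the same shape
            // Flat row-major index 4 is element (1, 1); in the reversed layout it lives at 1*1 + 1*2 = 3.
            Console.WriteLine(Remap(4, rowMajor, colMajor)); // prints 3
        }
    }

Remapping through explicit coordinates like this is exactly what the slow path pays for; it is why the fast path, taken when result and operands agree on IsReversedStride, can walk all the spans with a single index.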
+ int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + ulong sum = (ulong)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (ulong)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, ulong scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void GreaterThanOrEqual(Tensor left, Tensor 
right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, ulong scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, ulong scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, ulong scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + 
ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(tensor[indices] >> value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, ulong scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + throw new NotSupportedException(); + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, ulong scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ulong)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, ulong scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, ulong scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + ulong sum = (ulong)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (ulong)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex 
= 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, ulong scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, ulong scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, ulong scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, ulong scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, ulong scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + throw new NotSupportedException(); + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ulong)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ulong)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
+                                     left.strides :
+                                     right.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (ulong)(leftSpan[op1Index] ^ rightSpan[op2Index]);
+
+                }
+            }
+        }
+        public void Xor(DenseTensor<ulong> tensor, ulong scalar, DenseTensor<ulong> result)
+        {
+
+            var resultSpan = result.Buffer.Span;
+            var tensorSpan = tensor.Buffer.Span;
+            if (result.IsReversedStride == tensor.IsReversedStride)
+            {
+                for(int i = 0; i < resultSpan.Length; i++)
+                {
+                    resultSpan[i] = (ulong)(tensorSpan[i] ^ scalar);
+                }
+            }
+            else
+            {
+                int rowMajorIndex = 0;
+                int colMajorIndex = 0;
+
+                ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex;
+                ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex;
+
+                var rowMajorStrides = !result.IsReversedStride ? result.strides :
+                                      tensor.strides;
+                var columnMajorStrides = result.IsReversedStride ? result.strides :
+                                         tensor.strides;
+                for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++)
+                {
+                    colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides);
+
+                    resultSpan[resultIndex] = (ulong)(tensorSpan[op1Index] ^ scalar);
+
+                }
+            }
+        }
+    }
+    internal class UShortArithmetic : ITensorArithmetic<ushort>
+    {
+        public ushort One => 1;
+        public ushort Zero => 0;
+
+        public void Add(Tensor<ushort> left, Tensor<ushort> right, Tensor<ushort> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (ushort)(left[indices] + right[indices]);
+            }
+
+        }
+        public void Add(Tensor<ushort> tensor, ushort scalar, Tensor<ushort> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (ushort)(tensor[indices] + scalar);
+            }
+
+        }
+        public void And(Tensor<ushort> left, Tensor<ushort> right, Tensor<ushort> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (ushort)(left[indices] & right[indices]);
+            }
+
+        }
+        public void And(Tensor<ushort> tensor, ushort scalar, Tensor<ushort> result)
+        {
+
+            Span<int> indices = new Span<int>(new int[result.Rank]);
+            for(int i = 0; i < result.Length; i++)
+            {
+                ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices);
+                result[indices] = (ushort)(tensor[indices] & scalar);
+            }
+
+        }
+        public void Contract(Tensor<ushort> left, Tensor<ushort> right, int[] leftAxes, int[] rightAxes, Tensor<ushort> result)
+        {
+            var leftIndices = new int[left.Rank];
+            var rightIndices = new int[right.Rank];
+            var resultIndices = new int[result.Rank];
+
+            var summingDimensions = new int[leftAxes.Length];
+            for(int i = 0; i < leftAxes.Length; i++)
+            {
+                summingDimensions[i] = left.dimensions[leftAxes[i]];
+            }
+
+            var summingStrides = ArrayUtilities.GetStrides(summingDimensions);
+            int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions);
+
+            var resultStrides = result.strides;
+
+            // translates from result index to left non-summing dimensions' index portion
+            // since left non-summing dimensions are given precedence in result, the end is zero-padded
+            int[] leftNonSummingStrides = new int[result.Rank];
+
+            // translates from summing index to left summing dimensions'
index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + ushort sum = (ushort)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (ushort)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } + } + public void Decrement(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]--; + } + + } + public void Divide(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(left[indices] / right[indices]); + } + + } + public void Divide(Tensor tensor, ushort scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(tensor[indices] / scalar); + } + + } + public void Equals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] == right[indices]; + } + + } + public void GreaterThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] > right[indices]; + } + + } + public void 
GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] >= right[indices]; + } + + } + public void Increment(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices]++; + } + + } + public void LeftShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(tensor[indices] << value); + } + + } + public void LessThan(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] < right[indices]; + } + + } + public void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] <= right[indices]; + } + + } + public void Modulo(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(left[indices] % right[indices]); + } + + } + public void Modulo(Tensor tensor, ushort scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(tensor[indices] % scalar); + } + + } + public void Multiply(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(left[indices] * right[indices]); + } + + } + public void Multiply(Tensor tensor, ushort scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(tensor[indices] * scalar); + } + + } + public void NotEquals(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = left[indices] != right[indices]; + } + + } + public void Or(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(left[indices] | right[indices]); + } + + } + public void Or(Tensor tensor, ushort scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; 
i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(tensor[indices] | scalar); + } + + } + public void RightShift(Tensor tensor, int value, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(tensor[indices] >> value); + } + + } + public void Subtract(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(left[indices] - right[indices]); + } + + } + public void Subtract(Tensor tensor, ushort scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(tensor[indices] - scalar); + } + + } + public void UnaryMinus(Tensor tensor, Tensor result) + { + throw new NotSupportedException(); + } + public void UnaryPlus(Tensor tensor, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)+tensor[indices]; + } + + } + public void Xor(Tensor left, Tensor right, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(left[indices] ^ right[indices]); + } + + } + public void Xor(Tensor tensor, ushort scalar, Tensor result) + { + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < result.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + result[indices] = (ushort)(tensor[indices] ^ scalar); + } + + } + + public void Add(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(leftSpan[i] + rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(leftSpan[op1Index] + rightSpan[op2Index]); + + } + } + } + public void Add(DenseTensor tensor, ushort scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] + scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] + scalar); + + } + } + } + public void And(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(leftSpan[i] & rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(leftSpan[op1Index] & rightSpan[op2Index]); + + } + } + } + public void And(DenseTensor tensor, ushort scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] & scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] & scalar); + + } + } + } + public void Contract(DenseTensor left, DenseTensor right, int[] leftAxes, int[] rightAxes, DenseTensor result) + { + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. + int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + + for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++) + { + ushort sum = (ushort)0; + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + sum += (ushort)(leftSpan[leftIndex] * rightSpan[rightIndex]); + } + + resultSpan[resultIndex] = sum; + } + } + public void Decrement(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]--; + } + } + public void Divide(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(leftSpan[i] / rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int 
colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(leftSpan[op1Index] / rightSpan[op2Index]); + + } + } + } + public void Divide(DenseTensor tensor, ushort scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] / scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] / scalar); + + } + } + } + public void Equals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] == rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] == rightSpan[op2Index]; + + } + } + } + public void GreaterThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] > rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] > rightSpan[op2Index]; + + } + } + } + public void GreaterThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] >= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] >= rightSpan[op2Index]; + + } + } + } + public void Increment(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i]++; + } + } + public void LeftShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] << value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? 
ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] << value); + + } + } + } + public void LessThan(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] < rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] < rightSpan[op2Index]; + + } + } + } + public void LessThanOrEqual(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] <= rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] <= rightSpan[op2Index]; + + } + } + } + public void Modulo(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(leftSpan[i] % rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(leftSpan[op1Index] % rightSpan[op2Index]); + + } + } + } + public void Modulo(DenseTensor tensor, ushort scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] % scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] % scalar); + + } + } + } + public void Multiply(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(leftSpan[i] * rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? 
result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(leftSpan[op1Index] * rightSpan[op2Index]); + + } + } + } + public void Multiply(DenseTensor tensor, ushort scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] * scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] * scalar); + + } + } + } + public void NotEquals(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = leftSpan[i] != rightSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = leftSpan[op1Index] != rightSpan[op2Index]; + + } + } + } + public void Or(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(leftSpan[i] | rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? 
left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(leftSpan[op1Index] | rightSpan[op2Index]); + + } + } + } + public void Or(DenseTensor tensor, ushort scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] | scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] | scalar); + + } + } + } + public void RightShift(DenseTensor tensor, int value, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] >> value); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] >> value); + + } + } + } + public void Subtract(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(leftSpan[i] - rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(leftSpan[op1Index] - rightSpan[op2Index]); + + } + } + } + public void Subtract(DenseTensor tensor, ushort scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] - scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] - scalar); + + } + } + } + public void UnaryMinus(DenseTensor tensor, DenseTensor result) + { + throw new NotSupportedException(); + } + public void UnaryPlus(DenseTensor tensor, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)+tensorSpan[i]; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)+tensorSpan[op1Index]; + + } + } + } + public void Xor(DenseTensor left, DenseTensor right, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var leftSpan = left.Buffer.Span; + var rightSpan = right.Buffer.Span; + if ((result.IsReversedStride == left.IsReversedStride) && (result.IsReversedStride == right.IsReversedStride)) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(leftSpan[i] ^ rightSpan[i]); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref left.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + ref int op2Index = ref right.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + !left.IsReversedStride ? left.strides : + right.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + left.IsReversedStride ? 
left.strides : + right.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(leftSpan[op1Index] ^ rightSpan[op2Index]); + + } + } + } + public void Xor(DenseTensor tensor, ushort scalar, DenseTensor result) + { + + var resultSpan = result.Buffer.Span; + var tensorSpan = tensor.Buffer.Span; + if (result.IsReversedStride == tensor.IsReversedStride) + { + for(int i = 0; i < resultSpan.Length; i++) + { + resultSpan[i] = (ushort)(tensorSpan[i] ^ scalar); + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref result.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref tensor.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !result.IsReversedStride ? result.strides : + tensor.strides; + var columnMajorStrides = result.IsReversedStride ? result.strides : + tensor.strides; + for(;rowMajorIndex < resultSpan.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + resultSpan[resultIndex] = (ushort)(tensorSpan[op1Index] ^ scalar); + + } + } + } + } +} diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorArithmetic.tt b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorArithmetic.tt new file mode 100644 index 0000000000000..dc7741052f702 --- /dev/null +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorArithmetic.tt @@ -0,0 +1,249 @@ +<#@ template debug="false" hostspecific="false" language="C#" #> +<#@ assembly name="System.Core" #> +<#@ output extension=".cs" #> +<#@ include file="TensorTemplate.ttinclude" #>// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/tests/TensorArithmetic.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. 
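
The element-wise DenseTensor overloads above take a single fast loop when the result and all operands share the same stride order, and otherwise keep one row-major counter and map it to the equivalent column-major offset through the two stride sets (ArrayUtilities.TransformIndexByStrides). The standalone sketch below illustrates that mapping for a 2x3 tensor; RowMajorStrides, ColumnMajorStrides and TransformIndex are illustrative stand-ins written for this note, not the ArrayUtilities API itself.

using System;

static class StrideDemo
{
    // Row-major strides for the given dimensions: the last dimension is contiguous.
    static int[] RowMajorStrides(int[] dims)
    {
        var strides = new int[dims.Length];
        int stride = 1;
        for (int i = dims.Length - 1; i >= 0; i--)
        {
            strides[i] = stride;
            stride *= dims[i];
        }
        return strides;
    }

    // Column-major ("reversed stride") strides: the first dimension is contiguous.
    static int[] ColumnMajorStrides(int[] dims)
    {
        var strides = new int[dims.Length];
        int stride = 1;
        for (int i = 0; i < dims.Length; i++)
        {
            strides[i] = stride;
            stride *= dims[i];
        }
        return strides;
    }

    // Decompose a linear index with one stride set and recompose it with another.
    // Assumes fromStrides are non-increasing (row-major), which matches how the
    // generated code calls its transform: row-major counter in, column-major offset out.
    static int TransformIndex(int index, int[] fromStrides, int[] toStrides)
    {
        int remainder = index;
        int transformed = 0;
        for (int i = 0; i < fromStrides.Length; i++)
        {
            int coordinate = remainder / fromStrides[i];
            remainder -= coordinate * fromStrides[i];
            transformed += coordinate * toStrides[i];
        }
        return transformed;
    }

    static void Main()
    {
        var dims = new[] { 2, 3 };                 // a 2x3 tensor
        var rowMajor = RowMajorStrides(dims);      // { 3, 1 }
        var colMajor = ColumnMajorStrides(dims);   // { 1, 2 }

        int rowMajorIndex = 2;                     // element (0, 2) in a row-major buffer
        int colMajorIndex = TransformIndex(rowMajorIndex, rowMajor, colMajor);
        Console.WriteLine(colMajorIndex);          // 4: offset of (0, 2) in a column-major buffer
    }
}

Aliasing resultIndex, op1Index and op2Index onto either counter with ref locals, as the generated code does, means the mixed-layout path pays one such transformation per element instead of materializing an index array for every element.
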
+ +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors +{ + internal interface ITensorArithmetic + { + T One { get; } + T Zero { get; } +<# foreach (MethodConfiguration method in methodConfiguration) { #> + <#= method.GetResultMethodSignature("Tensor", "T")#>; +<# } #> + } + + internal static class TensorArithmetic + { + public static ITensorArithmetic Instance => TensorArithmetic.GetArithmetic(); + } + + internal static class TensorArithmetic + { + public static ITensorArithmetic GetArithmetic() + { +<# foreach (TypeConfiguration type in typeConfiguration) { #> + <#=GenerateIfStatementHeader(type)#> + { + return (ITensorArithmetic)new <#=type.ClassPrefix#>Arithmetic(); + } +<# } #> + return null; + } + } + +<# foreach (TypeConfiguration type in typeConfiguration) { #> + internal class <#=type.ClassPrefix#>Arithmetic : ITensorArithmetic<<#=type.TypeName#>> + { + public <#=type.TypeName#> One => <#=type.OneLiteral#>; + public <#=type.TypeName#> Zero => <#=type.ZeroLiteral#>; + +<# foreach (MethodConfiguration method in methodConfiguration) { #> + public <#= method.GetResultMethodSignature("Tensor", type.TypeName)#> + { +<# if ((method.IsNumeric && !type.SupportsNumeric) || (method.IsBitwise && !type.SupportsBitwise) || (type.UnsupportedMethods.Contains(method.MethodName))) { #> + throw new NotSupportedException(); +<# } else if (method.Operator != null) { #> + + Span indices = new Span(new int[result.Rank]); + for(int i = 0; i < <#= method.ResultName #>.Length; i++) + { + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, i, indices); + <#=method.GetElementOperation(type.TypeName, "[indices]")#>; + } + +<# } else if (method.MethodName == "Contract") {#> + var leftIndices = new int[left.Rank]; + var rightIndices = new int[right.Rank]; + var resultIndices = new int[result.Rank]; + + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. 
+ int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length); + + // translates from summing index to right summing dimensions' index portion + int[] rightSummingStrides = new int[rightAxes.Length]; + ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0); + + for (int resultIndex = 0; resultIndex < result.Length; resultIndex++) + { + <#=type.TypeName#> sum = (<#=type.TypeName#>)0; + + ArrayUtilities.GetIndices(result.strides, result.IsReversedStride, resultIndex, resultIndices); + + int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides); + int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides); + + for (int summingIndex = 0; summingIndex < summingLength; summingIndex++) + { + int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides); + int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides); + + int leftIndex = leftIndexNonSumming + leftIndexSumming; + int rightIndex = rightIndexNonSumming + rightIndexSumming; + + // todo, make this more efficient + ArrayUtilities.GetIndices(left.strides, left.IsReversedStride, leftIndex, leftIndices); + ArrayUtilities.GetIndices(right.strides, right.IsReversedStride, rightIndex, rightIndices); + + sum += (<#=type.TypeName#>)(left[leftIndices] * right[rightIndices]); + } + + result[resultIndices] = sum; + } +<# } #> + } +<# } #> + +<# foreach (MethodConfiguration method in methodConfiguration) { #> + public <#= method.GetResultMethodSignature("DenseTensor", type.TypeName)#> + { +<# if ((method.IsNumeric && !type.SupportsNumeric) || (method.IsBitwise && !type.SupportsBitwise) || (type.UnsupportedMethods.Contains(method.MethodName))) { #> + throw new NotSupportedException(); +<# } else if (method.Operator != null) { #> + +<# if (method.MethodType == MethodType.UnaryInPlace) { #> + var <#=method.ResultName #>Span = <#=method.ResultName #>.Buffer.Span; + var <#=method.Op1Name #>Span = <#=method.Op1Name #>.Buffer.Span; + for(int i = 0; i < <#=method.ResultName #>Span.Length; i++) + { + <#=method.GetElementOperation(type.TypeName, "Span[i]")#>; + } +<# } else {#> + var <#=method.ResultName #>Span = <#=method.ResultName #>.Buffer.Span; + var <#=method.Op1Name #>Span = <#=method.Op1Name #>.Buffer.Span; +<# if ((method.MethodType == MethodType.Binary) || (method.MethodType == MethodType.Comparison)) {#> + var <#=method.Op2Name #>Span = <#=method.Op2Name #>.Buffer.Span; +<# } #> + if <#= method.GetLinearOperationCheck() #> + { + for(int i = 0; i < <#= method.ResultName #>Span.Length; i++) + { + <#=method.GetElementOperation(type.TypeName, "Span[i]")#>; + } + } + else + { + int rowMajorIndex = 0; + int colMajorIndex = 0; + + ref int resultIndex = ref <#= method.ResultName #>.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + ref int op1Index = ref <#= method.Op1Name #>.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + +<# if ((method.MethodType == MethodType.Binary) || (method.MethodType == MethodType.Comparison)) {#> + ref int op2Index = ref <#= method.Op2Name #>.IsReversedStride ? ref colMajorIndex : ref rowMajorIndex; + + var rowMajorStrides = !<#= method.ResultName #>.IsReversedStride ? <#= method.ResultName #>.strides : + !<#= method.Op1Name #>.IsReversedStride ? 
<#= method.Op1Name #>.strides : + <#= method.Op2Name #>.strides; + var columnMajorStrides = <#= method.ResultName #>.IsReversedStride ? <#= method.ResultName #>.strides : + <#= method.Op1Name #>.IsReversedStride ? <#= method.Op1Name #>.strides : + <#= method.Op2Name #>.strides; +<# } else {#> + var rowMajorStrides = !<#= method.ResultName #>.IsReversedStride ? <#= method.ResultName #>.strides : + <#= method.Op1Name #>.strides; + var columnMajorStrides = <#= method.ResultName #>.IsReversedStride ? <#= method.ResultName #>.strides : + <#= method.Op1Name #>.strides; +<# } #> + for(;rowMajorIndex < <#= method.ResultName #>Span.Length; rowMajorIndex++) + { + colMajorIndex = ArrayUtilities.TransformIndexByStrides(rowMajorIndex, rowMajorStrides, false, columnMajorStrides); + + <#=method.GetElementOperation(type.TypeName, "Span[resultIndex]", "Span[op1Index]", "Span[op2Index]")#>; + + } + } +<# } #> +<# } else if (method.MethodName == "Contract") {#> + var summingDimensions = new int[leftAxes.Length]; + for(int i = 0; i < leftAxes.Length; i++) + { + summingDimensions[i] = left.dimensions[leftAxes[i]]; + } + + var summingStrides = ArrayUtilities.GetStrides(summingDimensions); + int summingLength = (int)ArrayUtilities.GetProduct(summingDimensions); + + var resultStrides = result.strides; + + // translates from result index to left non-summing dimensions' index portion + // since left non-summing dimensions are given precedence in result, the end is zero-padded + int[] leftNonSummingStrides = new int[result.Rank]; + + // translates from summing index to left summing dimensions' index portion + int[] leftSummingStrides = new int[leftAxes.Length]; + ArrayUtilities.SplitStrides(left.strides, leftAxes, leftNonSummingStrides, 0, leftSummingStrides, 0); + + // translates from result index to right non-summing dimensions' index portion + int[] rightNonSummingStrides = new int[result.Rank]; + // right non-summing dimensions appear after left non-summing dimensions. 
+            int rightNonSummingStridesOffset = (left.Rank - leftAxes.Length);
+
+            // translates from summing index to right summing dimensions' index portion
+            int[] rightSummingStrides = new int[rightAxes.Length];
+            ArrayUtilities.SplitStrides(right.strides, rightAxes, rightNonSummingStrides, rightNonSummingStridesOffset, rightSummingStrides, 0);
+
+            var resultSpan = result.Buffer.Span;
+            var leftSpan = left.Buffer.Span;
+            var rightSpan = right.Buffer.Span;
+
+            for (int resultIndex = 0; resultIndex < resultSpan.Length; resultIndex++)
+            {
+                <#=type.TypeName#> sum = (<#=type.TypeName#>)0;
+
+                int leftIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, leftNonSummingStrides);
+                int rightIndexNonSumming = ArrayUtilities.TransformIndexByStrides(resultIndex, resultStrides, result.IsReversedStride, rightNonSummingStrides);
+
+                for (int summingIndex = 0; summingIndex < summingLength; summingIndex++)
+                {
+                    int leftIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, leftSummingStrides);
+                    int rightIndexSumming = ArrayUtilities.TransformIndexByStrides(summingIndex, summingStrides, false, rightSummingStrides);
+
+                    int leftIndex = leftIndexNonSumming + leftIndexSumming;
+                    int rightIndex = rightIndexNonSumming + rightIndexSumming;
+
+                    sum += (<#=type.TypeName#>)(leftSpan[leftIndex] * rightSpan[rightIndex]);
+                }
+
+                resultSpan[resultIndex] = sum;
+            }
+<# } #>
+        }
+<# } #>
+    }
+<# } #>
+}
diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorExtensions.cs b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorExtensions.cs
new file mode 100644
index 0000000000000..ee9c2438428c0
--- /dev/null
+++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorExtensions.cs
@@ -0,0 +1,42 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+// This file is copied and adapted from the following git repository -
+// https://github.com/dotnet/corefx
+// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1
+// Path: /src/System.Numerics.Tensors/tests/TensorExtensions.cs
+// Original license statement below -
+
+// Licensed to the .NET Foundation under one or more agreements.
+// The .NET Foundation licenses this file to you under the MIT license.
+// See the LICENSE file in the project root for more information.
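
Contract pairs leftAxes[i] with rightAxes[i], splits each operand's strides into non-summing parts (addressed by the result index) and summing parts (addressed by a flattened summing index), and accumulates products. For two matrices contracted over left axis 1 and right axis 0 this reduces to the ordinary matrix product; the plain-array sketch below (values chosen arbitrarily, not the Tensor API) shows the semantics the strided implementation above reproduces.

using System;

static class ContractDemo
{
    static void Main()
    {
        // Contracting a 2x3 tensor with a 3x2 tensor over left axis 1 and right axis 0
        // (leftAxes = {1}, rightAxes = {0}) leaves the non-summing axes {2} and {2},
        // so the result is 2x2 -- the ordinary matrix product.
        int[,] left  = { { 1, 2, 3 }, { 4, 5, 6 } };
        int[,] right = { { 7, 8 }, { 9, 10 }, { 11, 12 } };

        var result = new int[2, 2];
        for (int i = 0; i < 2; i++)            // left non-summing axis
        {
            for (int j = 0; j < 2; j++)        // right non-summing axis
            {
                int sum = 0;
                for (int k = 0; k < 3; k++)    // the paired (summing) axis
                {
                    sum += left[i, k] * right[k, j];
                }
                result[i, j] = sum;
            }
        }

        Console.WriteLine(result[0, 0]);       // 1*7 + 2*9 + 3*11 = 58
        Console.WriteLine(result[1, 1]);       // 4*8 + 5*10 + 6*12 = 154
    }
}
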
+using System;
+
+namespace Microsoft.ML.OnnxRuntime.Tensors
+{
+    public static partial class TensorExtensions
+    {
+        private static int[] s_zeroArray = new[] { 0 };
+        private static int[] s_oneArray = new[] { 1 };
+
+        internal static Tensor<T> MatrixMultiply<T>(this Tensor<T> left, Tensor<T> right)
+        {
+            if (left.Rank != 2)
+            {
+                throw new InvalidOperationException($"{nameof(MatrixMultiply)} is only valid for a {nameof(Tensor<T>)} of {nameof(left.Rank)} 2.");
+            }
+
+            if (right.Rank != 2)
+            {
+                throw new ArgumentException($"{nameof(Tensor<T>)} {nameof(right)} must have {nameof(left.Rank)} 2.", nameof(right));
+            }
+
+            if (left.dimensions[1] != right.dimensions[0])
+            {
+                throw new ArgumentException($"{nameof(Tensor<T>)} {nameof(right)} must have first dimension of {left.dimensions[1]}.", nameof(right));
+            }
+
+            return TensorOperations.Contract(left, right, s_oneArray, s_zeroArray);
+        }
+    }
+}
diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorOperations.cs b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorOperations.cs
new file mode 100644
index 0000000000000..2efda4872edec
--- /dev/null
+++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorOperations.cs
@@ -0,0 +1,750 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+// This file is copied and adapted from the following git repository -
+// https://github.com/dotnet/corefx
+// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1
+// Path: /src/System.Numerics.Tensors/tests/TensorOperations.cs
+// Original license statement below -
+
+// Licensed to the .NET Foundation under one or more agreements.
+// The .NET Foundation licenses this file to you under the MIT license.
+// See the LICENSE file in the project root for more information.
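
MatrixMultiply is a thin wrapper over Contract with leftAxes = {1} and rightAxes = {0}. A possible usage sketch follows; it assumes the DenseTensor<T> constructor and params-int indexer carried over from the copied corefx sources, that int is among the generated arithmetic types, and that the caller sits inside this test assembly since the method is internal.

using System;
using Microsoft.ML.OnnxRuntime.Tensors;

static class MatrixMultiplyDemo
{
    static void Main()
    {
        // Same operands as the plain-array contraction sketch earlier, now through the
        // Tensor API. MatrixMultiply forwards to Contract(left, right, {1}, {0}).
        var left = new DenseTensor<int>(new[] { 2, 3 });
        var right = new DenseTensor<int>(new[] { 3, 2 });

        int v = 1;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 3; j++)
                left[i, j] = v++;              // fills 1..6

        v = 7;
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 2; j++)
                right[i, j] = v++;             // fills 7..12

        var product = left.MatrixMultiply(right);   // 2x2 result
        Console.WriteLine(product[0, 0]);           // 58
        Console.WriteLine(product[1, 1]);           // 154
    }
}

The result shape is 2x2, matching the non-summing dimensions of the two operands.
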
+ +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors +{ + public static partial class TensorOperations + { + internal static void ValidateBinaryArgs(Tensor left, Tensor right) + { + if (left.Rank != right.Rank || left.Length != right.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(left)); + } + + for (int i = 0; i < left.Rank; i++) + { + if (left.dimensions[i] != right.dimensions[i]) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + } + } + + internal static void ValidateBinaryArgs(Tensor left, Tensor right, Tensor result) + { + if (left.Rank != right.Rank || left.Length != right.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.Rank != result.Rank || left.Length != result.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(result)); + } + + if (left.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(left)); + } + + for (int i = 0; i < result.Rank; i++) + { + if (left.dimensions[i] != right.dimensions[i]) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.dimensions[i] != result.dimensions[i]) + { + throw new ArgumentException("Operands and result must have matching dimensions", nameof(result)); + } + } + } + + internal static void ValidateBinaryArgs(Tensor left, Tensor right, Tensor result) + { + if (left.Rank != right.Rank || left.Length != right.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.Rank != result.Rank || left.Length != result.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(result)); + } + + if (left.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(left)); + } + + for (int i = 0; i < result.Rank; i++) + { + if (left.dimensions[i] != right.dimensions[i]) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.dimensions[i] != result.dimensions[i]) + { + throw new ArgumentException("Operands and result must have matching dimensions", nameof(result)); + } + } + } + + internal static void ValidateArgs(Tensor tensor) + { + if (tensor.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(tensor)); + } + } + + internal static void ValidateArgs(Tensor tensor, Tensor result) + { + if (tensor.Rank != result.Rank || tensor.Length != result.Length) + { + throw new ArgumentException("Operands and result must have matching dimensions", nameof(result)); + } + + if (tensor.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(tensor)); + } + + for (int i = 0; i < result.Rank; i++) + { + if (tensor.dimensions[i] != result.dimensions[i]) + { + throw new ArgumentException("Operands and result must have matching dimensions", nameof(result)); + } + } + } + + internal static int[] ValidateContractArgs(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes) + { + if (leftAxes == null) + { + throw new ArgumentNullException(nameof(left)); + } + + if (rightAxes == null) + { + throw new 
ArgumentNullException(nameof(left)); + } + + if (leftAxes.Length != rightAxes.Length) + { + throw new ArgumentException($"{nameof(leftAxes)} and {nameof(rightAxes)} must have the same length, but were {leftAxes.Length} and {rightAxes.Length}, respectively."); + } + + for (int i = 0; i < leftAxes.Length; i++) + { + var leftAxis = leftAxes[i]; + + if (leftAxis >= left.Rank) + { + throw new ArgumentOutOfRangeException($"{nameof(leftAxes)}[{i}] was set to axis index {leftAxis} which exceeds the Rank of {left}."); + } + + var leftDimension = left.dimensions[leftAxis]; + + var rightAxis = rightAxes[i]; + + if (rightAxis >= right.Rank) + { + throw new ArgumentOutOfRangeException($"{nameof(rightAxes)}[{i}] was set to axis index {rightAxis} which exceeds the Rank of {right}."); + } + + var rightDimension = right.dimensions[rightAxis]; + + if (leftDimension != rightDimension) + { + throw new ArgumentOutOfRangeException($"Tensors may only be contracted on axes of the same length, but {nameof(leftAxes)} index {i} was length {leftDimension} and {nameof(rightAxes)} index {i} was length {rightDimension}."); + } + } + + var leftNonSummingDimensions = left.Rank - leftAxes.Length; + var rightNonSummingDimensions = right.Rank - rightAxes.Length; + var resultDimensions = new int[leftNonSummingDimensions + rightNonSummingDimensions]; + int dimensionsIndex = 0; + + Action, int[]> fillDimensions = (tensor, axes) => + { + for (int i = 0; i < tensor.Rank; i++) + { + var skip = false; + foreach (var contractionIndex in axes) + { + if (contractionIndex == i) + { + skip = true; + break; + } + } + + if (!skip) + { + resultDimensions[dimensionsIndex++] = tensor.dimensions[i]; + } + } + }; + + fillDimensions(left, leftAxes); + fillDimensions(right, rightAxes); + + return resultDimensions; + } + + internal static int[] ValidateContractArgs(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + var expectedDimensions = ValidateContractArgs(left, right, leftAxes, rightAxes); + + if (result.Rank != expectedDimensions.Length) + { + throw new ArgumentException($"{nameof(result)} should have {expectedDimensions.Length} dimensions but had {result.Rank}."); + } + + for (int i = 0; i < expectedDimensions.Length; i++) + { + if (result.dimensions[i] != expectedDimensions[i]) + { + throw new ArgumentException($"{nameof(result)} dimension {i} should be {expectedDimensions[i]} but was {result.dimensions[i]}."); + } + } + + return expectedDimensions; + } + + internal static void Add(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.Add(left, right, result); + } + + internal static Tensor Add(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.Add(left, right, result); + + return result; + } + + internal static void Add(Tensor tensor, T scalar, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Add(tensor, scalar, result); + } + + internal static Tensor Add(Tensor tensor, T scalar) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.Add(tensor, scalar, result); + + return result; + } + + internal static void And(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.And(left, right, result); + } + + internal static Tensor And(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + 
TensorArithmetic.Instance.And(left, right, result); + + return result; + } + + internal static void And(Tensor tensor, T scalar, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.And(tensor, scalar, result); + } + + internal static Tensor And(Tensor tensor, T scalar) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.And(tensor, scalar, result); + + return result; + } + + internal static void Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + var resultDimensions = ValidateContractArgs(left, right, leftAxes, rightAxes, result); + + TensorArithmetic.Instance.Contract(left, right, leftAxes, rightAxes, result); + } + + internal static Tensor Contract(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes) + { + var resultDimensions = ValidateContractArgs(left, right, leftAxes, rightAxes); + + var result = left.CloneEmpty(resultDimensions); + + TensorArithmetic.Instance.Contract(left, right, leftAxes, rightAxes, result); + + return result; + } + + internal static void Decrement(Tensor tensor, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Decrement(tensor, result); + } + + internal static Tensor Decrement(Tensor tensor) + { + ValidateArgs(tensor); + + var result = tensor.Clone(); + + TensorArithmetic.Instance.Decrement(tensor, result); + + return result; + } + + internal static void Divide(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.Divide(left, right, result); + } + + internal static Tensor Divide(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.Divide(left, right, result); + + return result; + } + + internal static void Divide(Tensor tensor, T scalar, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Divide(tensor, scalar, result); + } + + internal static Tensor Divide(Tensor tensor, T scalar) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.Divide(tensor, scalar, result); + + return result; + } + + internal static void Equals(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.Equals(left, right, result); + } + + internal static Tensor Equals(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.Equals(left, right, result); + + return result; + } + + internal static void GreaterThan(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.GreaterThan(left, right, result); + } + + internal static Tensor GreaterThan(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.GreaterThan(left, right, result); + + return result; + } + + internal static void GreaterThanOrEqual(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.GreaterThanOrEqual(left, right, result); + } + + internal static Tensor GreaterThanOrEqual(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.GreaterThanOrEqual(left, right, result); + + return result; + } + + internal static void Increment(Tensor tensor, Tensor result) + 
{ + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Increment(tensor, result); + } + + internal static Tensor Increment(Tensor tensor) + { + ValidateArgs(tensor); + + var result = tensor.Clone(); + + TensorArithmetic.Instance.Increment(tensor, result); + + return result; + } + + internal static void LeftShift(Tensor tensor, int value, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.LeftShift(tensor, value, result); + } + + internal static Tensor LeftShift(Tensor tensor, int value) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.LeftShift(tensor, value, result); + + return result; + } + + internal static void LessThan(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.LessThan(left, right, result); + } + + internal static Tensor LessThan(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.LessThan(left, right, result); + + return result; + } + + internal static void LessThanOrEqual(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.LessThanOrEqual(left, right, result); + } + + internal static Tensor LessThanOrEqual(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.LessThanOrEqual(left, right, result); + + return result; + } + + internal static void Modulo(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.Modulo(left, right, result); + } + + internal static Tensor Modulo(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.Modulo(left, right, result); + + return result; + } + + internal static void Modulo(Tensor tensor, T scalar, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Modulo(tensor, scalar, result); + } + + internal static Tensor Modulo(Tensor tensor, T scalar) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.Modulo(tensor, scalar, result); + + return result; + } + + internal static void Multiply(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.Multiply(left, right, result); + } + + internal static Tensor Multiply(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.Multiply(left, right, result); + + return result; + } + + internal static void Multiply(Tensor tensor, T scalar, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Multiply(tensor, scalar, result); + } + + internal static Tensor Multiply(Tensor tensor, T scalar) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.Multiply(tensor, scalar, result); + + return result; + } + + internal static void NotEquals(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.NotEquals(left, right, result); + } + + internal static Tensor NotEquals(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.NotEquals(left, right, result); + + return result; + } + + internal static 
void Or(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.Or(left, right, result); + } + + internal static Tensor Or(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.Or(left, right, result); + + return result; + } + + internal static void Or(Tensor tensor, T scalar, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Or(tensor, scalar, result); + } + + internal static Tensor Or(Tensor tensor, T scalar) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.Or(tensor, scalar, result); + + return result; + } + + internal static void RightShift(Tensor tensor, int value, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.RightShift(tensor, value, result); + } + + internal static Tensor RightShift(Tensor tensor, int value) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.RightShift(tensor, value, result); + + return result; + } + + internal static void Subtract(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.Subtract(left, right, result); + } + + internal static Tensor Subtract(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.Subtract(left, right, result); + + return result; + } + + internal static void Subtract(Tensor tensor, T scalar, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Subtract(tensor, scalar, result); + } + + internal static Tensor Subtract(Tensor tensor, T scalar) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.Subtract(tensor, scalar, result); + + return result; + } + + internal static void UnaryMinus(Tensor tensor, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.UnaryMinus(tensor, result); + } + + internal static Tensor UnaryMinus(Tensor tensor) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.UnaryMinus(tensor, result); + + return result; + } + + internal static void UnaryPlus(Tensor tensor, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.UnaryPlus(tensor, result); + } + + internal static Tensor UnaryPlus(Tensor tensor) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.UnaryPlus(tensor, result); + + return result; + } + + internal static void Xor(Tensor left, Tensor right, Tensor result) + { + ValidateBinaryArgs(left, right, result); + + TensorArithmetic.Instance.Xor(left, right, result); + } + + internal static Tensor Xor(Tensor left, Tensor right) + { + ValidateBinaryArgs(left, right); + + var result = left.CloneEmpty(); + + TensorArithmetic.Instance.Xor(left, right, result); + + return result; + } + + internal static void Xor(Tensor tensor, T scalar, Tensor result) + { + ValidateArgs(tensor, result); + + TensorArithmetic.Instance.Xor(tensor, scalar, result); + } + + internal static Tensor Xor(Tensor tensor, T scalar) + { + ValidateArgs(tensor); + + var result = tensor.CloneEmpty(); + + TensorArithmetic.Instance.Xor(tensor, scalar, result); + + return result; + } + + } +} diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorOperations.tt 
b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorOperations.tt new file mode 100644 index 0000000000000..627aa7625cbd2 --- /dev/null +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorOperations.tt @@ -0,0 +1,251 @@ +<#@ template debug="false" hostspecific="false" language="C#" #> +<#@ assembly name="System.Core" #> +<#@ output extension=".cs" #> +<#@ include file="TensorTemplate.ttinclude" #>// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/tests/TensorOperations.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors +{ + public static partial class TensorOperations + { + internal static void ValidateBinaryArgs(Tensor left, Tensor right) + { + if (left.Rank != right.Rank || left.Length != right.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(left)); + } + + for (int i = 0; i < left.Rank; i++) + { + if (left.dimensions[i] != right.dimensions[i]) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + } + } + + internal static void ValidateBinaryArgs(Tensor left, Tensor right, Tensor result) + { + if (left.Rank != right.Rank || left.Length != right.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.Rank != result.Rank || left.Length != result.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(result)); + } + + if (left.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(left)); + } + + for (int i = 0; i < result.Rank; i++) + { + if (left.dimensions[i] != right.dimensions[i]) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.dimensions[i] != result.dimensions[i]) + { + throw new ArgumentException("Operands and result must have matching dimensions", nameof(result)); + } + } + } + + internal static void ValidateBinaryArgs(Tensor left, Tensor right, Tensor result) + { + if (left.Rank != right.Rank || left.Length != right.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.Rank != result.Rank || left.Length != result.Length) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(result)); + } + + if (left.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(left)); + } + + for (int i = 0; i < result.Rank; i++) + { + if (left.dimensions[i] != right.dimensions[i]) + { + throw new ArgumentException("Operands must have matching dimensions", nameof(right)); + } + + if (left.dimensions[i] != result.dimensions[i]) + { + throw new ArgumentException("Operands and result must have matching dimensions", nameof(result)); + } + } + } + + internal static void ValidateArgs(Tensor tensor) + 
{ + if (tensor.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(tensor)); + } + } + + internal static void ValidateArgs(Tensor tensor, Tensor result) + { + if (tensor.Rank != result.Rank || tensor.Length != result.Length) + { + throw new ArgumentException("Operands and result must have matching dimensions", nameof(result)); + } + + if (tensor.Rank == 0) + { + throw new ArgumentException($"Cannot operate on Tensor with {nameof(Tensor.Rank)} of 0.", nameof(tensor)); + } + + for (int i = 0; i < result.Rank; i++) + { + if (tensor.dimensions[i] != result.dimensions[i]) + { + throw new ArgumentException("Operands and result must have matching dimensions", nameof(result)); + } + } + } + + internal static int[] ValidateContractArgs(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes) + { + if (leftAxes == null) + { + throw new ArgumentNullException(nameof(left)); + } + + if (rightAxes == null) + { + throw new ArgumentNullException(nameof(left)); + } + + if (leftAxes.Length != rightAxes.Length) + { + throw new ArgumentException($"{nameof(leftAxes)} and {nameof(rightAxes)} must have the same length, but were {leftAxes.Length} and {rightAxes.Length}, respectively."); + } + + for (int i = 0; i < leftAxes.Length; i++) + { + var leftAxis = leftAxes[i]; + + if (leftAxis >= left.Rank) + { + throw new ArgumentOutOfRangeException($"{nameof(leftAxes)}[{i}] was set to axis index {leftAxis} which exceeds the Rank of {left}."); + } + + var leftDimension = left.dimensions[leftAxis]; + + var rightAxis = rightAxes[i]; + + if (rightAxis >= right.Rank) + { + throw new ArgumentOutOfRangeException($"{nameof(rightAxes)}[{i}] was set to axis index {rightAxis} which exceeds the Rank of {right}."); + } + + var rightDimension = right.dimensions[rightAxis]; + + if (leftDimension != rightDimension) + { + throw new ArgumentOutOfRangeException($"Tensors may only be contracted on axes of the same length, but {nameof(leftAxes)} index {i} was length {leftDimension} and {nameof(rightAxes)} index {i} was length {rightDimension}."); + } + } + + var leftNonSummingDimensions = left.Rank - leftAxes.Length; + var rightNonSummingDimensions = right.Rank - rightAxes.Length; + var resultDimensions = new int[leftNonSummingDimensions + rightNonSummingDimensions]; + int dimensionsIndex = 0; + + Action, int[]> fillDimensions = (tensor, axes) => + { + for (int i = 0; i < tensor.Rank; i++) + { + var skip = false; + foreach (var contractionIndex in axes) + { + if (contractionIndex == i) + { + skip = true; + break; + } + } + + if (!skip) + { + resultDimensions[dimensionsIndex++] = tensor.dimensions[i]; + } + } + }; + + fillDimensions(left, leftAxes); + fillDimensions(right, rightAxes); + + return resultDimensions; + } + + internal static int[] ValidateContractArgs(Tensor left, Tensor right, int[] leftAxes, int[] rightAxes, Tensor result) + { + var expectedDimensions = ValidateContractArgs(left, right, leftAxes, rightAxes); + + if (result.Rank != expectedDimensions.Length) + { + throw new ArgumentException($"{nameof(result)} should have {expectedDimensions.Length} dimensions but had {result.Rank}."); + } + + for (int i = 0; i < expectedDimensions.Length; i++) + { + if (result.dimensions[i] != expectedDimensions[i]) + { + throw new ArgumentException($"{nameof(result)} dimension {i} should be {expectedDimensions[i]} but was {result.dimensions[i]}."); + } + } + + return expectedDimensions; + } + +<# foreach (MethodConfiguration method in methodConfiguration) { #> + internal static 
<#= method.GetGenericResultMethodSignature("Tensor", "T")#> + { + <#= method.GetValidationMethod(true) #> + + TensorArithmetic.Instance.<#=method.MethodName#>(<#=method.GetCallArguments()#>, <#= method.ResultName #>); + } + + internal static <#= method.GetGenericMethodSignature("Tensor", "T")#> + { + <#= method.GetValidationMethod(false) #> + + var <#= method.ResultName #> = <#=method.InitializeResult("T")#>; + + TensorArithmetic.Instance.<#=method.MethodName#>(<#=method.GetCallArguments()#>, <#= method.ResultName #>); + + return <#= method.ResultName #>; + } + +<# } #> + } +} diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTemplate.ttinclude b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTemplate.ttinclude new file mode 100644 index 0000000000000..9448791a5db6c --- /dev/null +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTemplate.ttinclude @@ -0,0 +1,328 @@ +<#@ import namespace="System.Linq" #> +<#@ import namespace="System.Text" #> +<#@ import namespace="System.Collections.Generic" #> +<#+ + public class TypeConfiguration + { + public TypeConfiguration(string typeName, string classPrefix = null, string oneLiteral = "1", string zeroLiteral = "0", bool supportsNumeric = true, bool supportsBitwise = true, IEnumerable unsupportedMethods = null) + { + TypeName = typeName; + ClassPrefix = classPrefix ?? char.ToUpper(typeName[0]) + typeName.Substring(1); + OneLiteral = oneLiteral; + ZeroLiteral = zeroLiteral; + SupportsNumeric = supportsNumeric; + SupportsBitwise = supportsBitwise; + UnsupportedMethods = new HashSet(unsupportedMethods ?? Enumerable.Empty()); + } + + public string TypeName { get; } + public string ClassPrefix { get; } + public string OneLiteral { get; } + public string ZeroLiteral { get; } + + public bool SupportsNumeric { get; } + public bool SupportsBitwise { get; } + public ISet UnsupportedMethods { get; } + } + + public string GenerateIfStatementHeader(TypeConfiguration type) + { + string keyword = (type == typeConfiguration[0]) ? 
"if" : "else if"; + return $"{keyword} (typeof(T) == typeof({type.TypeName}))"; + } + + public TypeConfiguration[] typeConfiguration = new [] + { + new TypeConfiguration("bool", oneLiteral:"true", zeroLiteral:"false", supportsNumeric: false, unsupportedMethods: new[] {"LeftShift", "RightShift"}), + new TypeConfiguration("byte"), + new TypeConfiguration("char", oneLiteral:"(char)1", zeroLiteral:"(char)0"), + new TypeConfiguration("decimal", supportsBitwise: false), + new TypeConfiguration("double", oneLiteral:"1.0", supportsBitwise: false), + new TypeConfiguration("float", oneLiteral:"1.0f", supportsBitwise: false), + new TypeConfiguration("int"), + new TypeConfiguration("long"), + new TypeConfiguration("sbyte", classPrefix:"SByte"), + new TypeConfiguration("short"), + new TypeConfiguration("uint", classPrefix:"UInt", unsupportedMethods: new[] {"UnaryMinus"}), + new TypeConfiguration("ulong", classPrefix:"ULong", unsupportedMethods: new[] {"UnaryMinus"}), + new TypeConfiguration("ushort", classPrefix:"UShort", unsupportedMethods: new[] {"UnaryMinus"}) + }; + + public enum MethodType + { + Unary, + UnaryInPlace, + BinaryScalar, + BinaryInt, + Binary, + Comparison, + Contraction + } + + public class MethodConfiguration + { + public MethodConfiguration(string methodName, MethodType methodType, string op = null, bool isNumeric = false, bool isBitwise = false) + { + MethodName = methodName; + MethodType = methodType; + Operator = op; + IsNumeric = isNumeric; + IsBitwise = isBitwise; + } + + public string ResultName => "result"; + + public string Op1Name + { + get + { + switch (MethodType) + { + case MethodType.Unary: + case MethodType.UnaryInPlace: + case MethodType.BinaryScalar: + case MethodType.BinaryInt: + return "tensor"; + case MethodType.Binary: + case MethodType.Comparison: + case MethodType.Contraction: + return "left"; + default: + throw new ArgumentException(); + }; + } + } + + public string Op2Name + { + get + { + switch (MethodType) + { + case MethodType.BinaryScalar: + return "scalar"; + case MethodType.BinaryInt: + return "value"; + case MethodType.Binary: + case MethodType.Comparison: + case MethodType.Contraction: + return "right"; + case MethodType.Unary: + case MethodType.UnaryInPlace: + default: + throw new ArgumentException(); + }; + } + } + + public string MethodName { get; } + public MethodType MethodType { get; } + public string Operator { get; } + + public string GetGenericMethodSignature(string tensorType, string genericType) + { + var resultType = GetResultType(tensorType, genericType); + var arguments = GetMethodArguments(tensorType, genericType); + + return $"{resultType} {MethodName}<{genericType}>({arguments})"; + } + + public string GetGenericResultMethodSignature(string tensorType, string genericType) + { + var resultType = GetResultType(tensorType, genericType); + var arguments = GetMethodArguments(tensorType, genericType); + + return $"void {MethodName}<{genericType}>({arguments}, {resultType} {ResultName})"; + } + + public string GetResultMethodSignature(string tensorType, string genericType) + { + var resultType = GetResultType(tensorType, genericType); + var arguments = GetMethodArguments(tensorType, genericType); + + return $"void {MethodName}({arguments}, {resultType} {ResultName})"; + } + + public string GetMethodArguments(string tensorType, string genericType) + { + switch (MethodType) + { + case MethodType.Unary: + case MethodType.UnaryInPlace: + return $"{tensorType}<{genericType}> {Op1Name}"; + case MethodType.BinaryScalar: + return 
$"{tensorType}<{genericType}> {Op1Name}, {genericType} {Op2Name}"; + case MethodType.BinaryInt: + return $"{tensorType}<{genericType}> {Op1Name}, int {Op2Name}"; + case MethodType.Binary: + case MethodType.Comparison: + return $"{tensorType}<{genericType}> {Op1Name}, {tensorType}<{genericType}> {Op2Name}"; + case MethodType.Contraction: + return $"{tensorType}<{genericType}> {Op1Name}, {tensorType}<{genericType}> {Op2Name}, int[] leftAxes, int[] rightAxes"; + default: + throw new ArgumentException(); + } + } + + public string GetCallArguments() + { + switch (MethodType) + { + case MethodType.Unary: + case MethodType.UnaryInPlace: + return $"{Op1Name}"; + case MethodType.BinaryScalar: + case MethodType.BinaryInt: + case MethodType.Binary: + case MethodType.Comparison: + return $"{Op1Name}, {Op2Name}"; + case MethodType.Contraction: + return "left, right, leftAxes, rightAxes"; + default: + throw new ArgumentException(); + } + } + + public string GetValidationMethod(bool includeResult) + { + var suffix = includeResult ? ", result" : ""; + switch (MethodType) + { + case MethodType.Unary: + case MethodType.UnaryInPlace: + case MethodType.BinaryScalar: + case MethodType.BinaryInt: + return $"ValidateArgs({Op1Name}{suffix});"; + case MethodType.Binary: + case MethodType.Comparison: + return $"ValidateBinaryArgs({Op1Name}, {Op2Name}{suffix});"; + case MethodType.Contraction: + return $"var resultDimensions = ValidateContractArgs({Op1Name}, {Op2Name}, leftAxes, rightAxes{suffix});"; + default: + throw new ArgumentException(); + } + } + + public string GetResultType(string tensorType, string typeName) + { + switch (MethodType) + { + case MethodType.Unary: + case MethodType.UnaryInPlace: + case MethodType.BinaryScalar: + case MethodType.BinaryInt: + case MethodType.Binary: + case MethodType.Contraction: + return $"{tensorType}<{typeName}>"; + case MethodType.Comparison: + return $"{tensorType}"; + default: + throw new ArgumentException(); + } + } + + public string GetLinearOperationCheck() + { + switch (MethodType) + { + case MethodType.Unary: + case MethodType.BinaryScalar: + case MethodType.BinaryInt: + return $"({ResultName}.IsReversedStride == {Op1Name}.IsReversedStride)"; + case MethodType.Binary: + case MethodType.Comparison: + return $"(({ResultName}.IsReversedStride == {Op1Name}.IsReversedStride) && ({ResultName}.IsReversedStride == {Op2Name}.IsReversedStride))"; + case MethodType.UnaryInPlace: + default: + throw new ArgumentException(); + } + } + + + public string GetElementOperation(string typeName, string access) + { + return GetElementOperation(typeName, access, access, access); + } + + public string GetElementOperation(string typeName, string resultAccess, string leftAccess, string rightAccess) + { + switch (MethodType) + { + case MethodType.Unary: + return $"{ResultName}{resultAccess} = ({typeName}){Operator}{Op1Name}{leftAccess}"; + case MethodType.UnaryInPlace: + return $"{ResultName}{resultAccess}{Operator}"; + case MethodType.BinaryScalar: + case MethodType.BinaryInt: + return $"{ResultName}{resultAccess} = ({typeName})({Op1Name}{leftAccess} {Operator} {Op2Name})"; + case MethodType.Binary: + return $"{ResultName}{resultAccess} = ({typeName})({Op1Name}{leftAccess} {Operator} {Op2Name}{rightAccess})"; + case MethodType.Comparison: + return $"{ResultName}{resultAccess} = {Op1Name}{leftAccess} {Operator} {Op2Name}{rightAccess}"; + default: + throw new ArgumentException(); + + } + } + + public string InitializeResult(string typeName) + { + switch (MethodType) + { + case 
MethodType.UnaryInPlace: + return $"{Op1Name}.Clone()"; + case MethodType.Unary: + case MethodType.BinaryScalar: + case MethodType.BinaryInt: + case MethodType.Binary: + return $"{Op1Name}.CloneEmpty()"; + case MethodType.Comparison: + return $"{Op1Name}.CloneEmpty()"; + case MethodType.Contraction: + return $"{Op1Name}.CloneEmpty(resultDimensions)"; + default: + throw new ArgumentException(); + } + } + + public bool IsNumeric { get; } + public bool IsBitwise { get; } + } + + + public MethodConfiguration[] methodConfiguration = new [] + { + new MethodConfiguration("Add", MethodType.Binary, "+", isNumeric:true), + new MethodConfiguration("Add", MethodType.BinaryScalar, "+", isNumeric:true), + new MethodConfiguration("UnaryPlus", MethodType.Unary, "+", isNumeric:true), + new MethodConfiguration("Subtract", MethodType.Binary, "-", isNumeric:true), + new MethodConfiguration("Subtract", MethodType.BinaryScalar, "-", isNumeric:true), + new MethodConfiguration("UnaryMinus", MethodType.Unary, "-", isNumeric:true), + new MethodConfiguration("Increment", MethodType.UnaryInPlace, "++", isNumeric:true), + new MethodConfiguration("Decrement", MethodType.UnaryInPlace, "--", isNumeric:true), + new MethodConfiguration("Multiply", MethodType.Binary, "*", isNumeric:true), // element-wise product, not matrix product + new MethodConfiguration("Multiply", MethodType.BinaryScalar, "*", isNumeric:true), + new MethodConfiguration("Divide", MethodType.Binary, "/", isNumeric:true), + new MethodConfiguration("Divide", MethodType.BinaryScalar, "/", isNumeric:true), + new MethodConfiguration("Modulo", MethodType.Binary, "%", isNumeric:true), + new MethodConfiguration("Modulo", MethodType.BinaryScalar, "%", isNumeric:true), + new MethodConfiguration("And", MethodType.Binary, "&", isBitwise: true), + new MethodConfiguration("And", MethodType.BinaryScalar, "&", isBitwise: true), + new MethodConfiguration("Or", MethodType.Binary, "|", isBitwise: true), + new MethodConfiguration("Or", MethodType.BinaryScalar, "|", isBitwise: true), + new MethodConfiguration("Xor", MethodType.Binary, "^", isBitwise: true), + new MethodConfiguration("Xor", MethodType.BinaryScalar, "^", isBitwise: true), + new MethodConfiguration("LeftShift", MethodType.BinaryInt, "<<", isBitwise: true), + new MethodConfiguration("RightShift", MethodType.BinaryInt, ">>", isBitwise: true), + + // Note all of these are element-wise operations not testing the operation on the entire Tensor + new MethodConfiguration("Equals", MethodType.Comparison, "=="), + new MethodConfiguration("NotEquals", MethodType.Comparison, "!="), + new MethodConfiguration("GreaterThanOrEqual", MethodType.Comparison, ">=", isNumeric:true), + new MethodConfiguration("LessThanOrEqual", MethodType.Comparison, "<=", isNumeric:true), + new MethodConfiguration("GreaterThan", MethodType.Comparison, ">", isNumeric:true), + new MethodConfiguration("LessThan", MethodType.Comparison, "<", isNumeric:true), + + new MethodConfiguration("Contract", MethodType.Contraction, isNumeric:true), + }.OrderBy(m => m.MethodName).ToArray(); +#> diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTests.cs b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTests.cs new file mode 100644 index 0000000000000..272484e1fb24f --- /dev/null +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTests.cs @@ -0,0 +1,2243 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
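+
+// The TensorOperations methods exercised by these tests are generated by TensorOperations.tt
+// from the MethodConfiguration table in TensorTemplate.ttinclude: each entry expands into a
+// result-taking overload plus an allocating overload, both of which validate their arguments
+// and then dispatch to TensorArithmetic<T>.Instance. As a rough sketch (generic arguments
+// written out here; the exact text comes from GetGenericResultMethodSignature,
+// GetValidationMethod and InitializeResult), the binary Add entry expands to approximately:
+//
+//     internal static void Add<T>(Tensor<T> left, Tensor<T> right, Tensor<T> result)
+//     {
+//         ValidateBinaryArgs(left, right, result);
+//         TensorArithmetic<T>.Instance.Add(left, right, result);
+//     }
+//
+//     internal static Tensor<T> Add<T>(Tensor<T> left, Tensor<T> right)
+//     {
+//         ValidateBinaryArgs(left, right);
+//         var result = left.CloneEmpty();
+//         TensorArithmetic<T>.Instance.Add(left, right, result);
+//         return result;
+//     }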
+ +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/tests/TensorTests.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System.Collections; +using System.Collections.Generic; +using System.Linq; +using Xunit; +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors.Tests +{ + public class TensorTests : TensorTestsBase + { + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void ConstructTensorFromArrayRank1(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray(new[] { 0, 1, 2 }); + + Assert.Equal(tensorConstructor.IsReversedStride, tensor.IsReversedStride); + Assert.Equal(0, tensor[0]); + Assert.Equal(1, tensor[1]); + Assert.Equal(2, tensor[2]); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void ConstructTensorFromArrayRank2(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray(new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + Assert.Equal(tensorConstructor.IsReversedStride, tensor.IsReversedStride); + Assert.Equal(0, tensor[0, 0]); + Assert.Equal(1, tensor[0, 1]); + Assert.Equal(2, tensor[0, 2]); + Assert.Equal(3, tensor[1, 0]); + Assert.Equal(4, tensor[1, 1]); + Assert.Equal(5, tensor[1, 2]); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void ConstructTensorFromArrayRank3(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray(new[, ,] + { + { + {0, 1, 2}, + {3, 4, 5} + }, + { + {6, 7 ,8 }, + {9, 10 ,11 }, + }, + { + {12, 13 ,14 }, + {15, 16 ,17 }, + }, + { + {18, 19 ,20 }, + {21, 22 ,23 }, + } + }); + + Assert.Equal(tensorConstructor.IsReversedStride, tensor.IsReversedStride); + + Assert.Equal(0, tensor[0, 0, 0]); + Assert.Equal(1, tensor[0, 0, 1]); + Assert.Equal(2, tensor[0, 0, 2]); + Assert.Equal(3, tensor[0, 1, 0]); + Assert.Equal(4, tensor[0, 1, 1]); + Assert.Equal(5, tensor[0, 1, 2]); + + Assert.Equal(6, tensor[1, 0, 0]); + Assert.Equal(7, tensor[1, 0, 1]); + Assert.Equal(8, tensor[1, 0, 2]); + Assert.Equal(9, tensor[1, 1, 0]); + Assert.Equal(10, tensor[1, 1, 1]); + Assert.Equal(11, tensor[1, 1, 2]); + + Assert.Equal(12, tensor[2, 0, 0]); + Assert.Equal(13, tensor[2, 0, 1]); + Assert.Equal(14, tensor[2, 0, 2]); + Assert.Equal(15, tensor[2, 1, 0]); + Assert.Equal(16, tensor[2, 1, 1]); + Assert.Equal(17, tensor[2, 1, 2]); + + Assert.Equal(18, tensor[3, 0, 0]); + Assert.Equal(19, tensor[3, 0, 1]); + Assert.Equal(20, tensor[3, 0, 2]); + Assert.Equal(21, tensor[3, 1, 0]); + Assert.Equal(22, tensor[3, 1, 1]); + Assert.Equal(23, tensor[3, 1, 2]); + } + + [Fact] + public void ConstructDenseTensorFromPointer() + { + using (var nativeMemory = NativeMemoryFromArray(Enumerable.Range(0, 24).ToArray())) + { + var dimensions = new[] { 4, 2, 3 }; + var tensor = new DenseTensor(nativeMemory.Memory, dimensions, false); + + Assert.Equal(0, tensor[0, 0, 0]); + Assert.Equal(1, tensor[0, 0, 1]); + Assert.Equal(2, tensor[0, 0, 2]); + Assert.Equal(3, tensor[0, 1, 0]); + Assert.Equal(4, tensor[0, 1, 1]); + Assert.Equal(5, tensor[0, 1, 2]); + + Assert.Equal(6, tensor[1, 0, 0]); + Assert.Equal(7, tensor[1, 0, 1]); + Assert.Equal(8, tensor[1, 0, 2]); + 
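+                // With dimensions { 4, 2, 3 } and the default row-major layout (reverseStride: false),
+                // the strides are { 6, 3, 1 }, so element [i, j, k] lives at offset i * 6 + j * 3 + k in
+                // the Enumerable.Range(0, 24) buffer; e.g. tensor[1, 0, 2] = 1 * 6 + 0 * 3 + 2 = 8.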
Assert.Equal(9, tensor[1, 1, 0]); + Assert.Equal(10, tensor[1, 1, 1]); + Assert.Equal(11, tensor[1, 1, 2]); + + Assert.Equal(12, tensor[2, 0, 0]); + Assert.Equal(13, tensor[2, 0, 1]); + Assert.Equal(14, tensor[2, 0, 2]); + Assert.Equal(15, tensor[2, 1, 0]); + Assert.Equal(16, tensor[2, 1, 1]); + Assert.Equal(17, tensor[2, 1, 2]); + + Assert.Equal(18, tensor[3, 0, 0]); + Assert.Equal(19, tensor[3, 0, 1]); + Assert.Equal(20, tensor[3, 0, 2]); + Assert.Equal(21, tensor[3, 1, 0]); + Assert.Equal(22, tensor[3, 1, 1]); + Assert.Equal(23, tensor[3, 1, 2]); + } + } + + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void ConstructFromDimensions(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromDimensions(new[] { 2, 3, 4 }); + Assert.Equal(3, tensor.Rank); + Assert.Equal(3, tensor.Dimensions.Length); + Assert.Equal(2, tensor.Dimensions[0]); + Assert.Equal(3, tensor.Dimensions[1]); + Assert.Equal(4, tensor.Dimensions[2]); + Assert.Equal(24, tensor.Length); + Assert.Equal(tensorConstructor.IsReversedStride, tensor.IsReversedStride); + + //Assert.Throws("dimensions", () => tensorConstructor.CreateFromDimensions(dimensions: null)); + Assert.Throws("dimensions", () => tensorConstructor.CreateFromDimensions(dimensions: new int[0])); + + Assert.Throws("dimensions", () => tensorConstructor.CreateFromDimensions(dimensions: new[] { 1, 0 })); + Assert.Throws("dimensions", () => tensorConstructor.CreateFromDimensions(dimensions: new[] { 1, -1 })); + + // ensure dimensions are immutable + var dimensions = new[] { 1, 2, 3 }; + tensor = tensorConstructor.CreateFromDimensions(dimensions: dimensions); + dimensions[0] = dimensions[1] = dimensions[2] = 0; + Assert.Equal(1, tensor.Dimensions[0]); + Assert.Equal(2, tensor.Dimensions[1]); + Assert.Equal(3, tensor.Dimensions[2]); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void ConstructTensorFromArrayRank3WithLowerBounds(TensorConstructor tensorConstructor) + { + var dimensions = new[] { 2, 3, 4 }; + var lowerBounds = new[] { 0, 5, 200 }; + var arrayWithLowerBounds = Array.CreateInstance(typeof(int), dimensions, lowerBounds); + + int value = 0; + for (int x = lowerBounds[0]; x < lowerBounds[0] + dimensions[0]; x++) + { + for (int y = lowerBounds[1]; y < lowerBounds[1] + dimensions[1]; y++) + { + for (int z = lowerBounds[2]; z < lowerBounds[2] + dimensions[2]; z++) + { + arrayWithLowerBounds.SetValue(value++, x, y, z); + } + } + } + + var tensor = tensorConstructor.CreateFromArray(arrayWithLowerBounds); + + var expected = tensorConstructor.CreateFromArray(new[, ,] + { + { + { 0, 1, 2, 3 }, + { 4, 5, 6, 7 }, + { 8, 9, 10, 11 } + }, + { + { 12, 13, 14, 15 }, + { 16, 17, 18, 19 }, + { 20, 21, 22, 23 } + } + } + ); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(expected, tensor)); + Assert.Equal(tensorConstructor.IsReversedStride, tensor.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void StructurallyEqualTensor(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var arr = new[, ,] + { + { + {0, 1, 2}, + {3, 4, 5} + }, + { + {6, 7 ,8 }, + {9, 10 ,11 }, + }, + { + {12, 13 ,14 }, + {15, 16 ,17 }, + }, + { + {18, 19 ,20 }, + {21, 22 ,23 }, + } + }; + var tensor = leftConstructor.CreateFromArray(arr); + var tensor2 = rightConstructor.CreateFromArray(arr); + + Assert.Equal(0, StructuralComparisons.StructuralComparer.Compare(tensor, tensor2)); + Assert.Equal(0, 
StructuralComparisons.StructuralComparer.Compare(tensor2, tensor)); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tensor, tensor2)); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tensor2, tensor)); + // Issue: should Tensors with different layout be structurally equal? + if (leftConstructor.IsReversedStride == leftConstructor.IsReversedStride) + { + Assert.Equal(StructuralComparisons.StructuralEqualityComparer.GetHashCode(tensor), StructuralComparisons.StructuralEqualityComparer.GetHashCode(tensor2)); + } + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void StructurallyEqualArray(TensorConstructor tensorConstructor) + { + var arr = new[, ,] + { + { + {0, 1, 2}, + {3, 4, 5} + }, + { + {6, 7 ,8 }, + {9, 10 ,11 }, + }, + { + {12, 13 ,14 }, + {15, 16 ,17 }, + }, + { + {18, 19 ,20 }, + {21, 22 ,23 }, + } + }; + var tensor = tensorConstructor.CreateFromArray(arr); + + Assert.Equal(0, StructuralComparisons.StructuralComparer.Compare(tensor, arr)); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tensor, arr)); + + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetDiagonalSquare(TensorConstructor tensorConstructor) + { + var arr = new[,] + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var diag = tensor.GetDiagonal(); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 1, 3, 5 })); + diag = tensor.GetDiagonal(1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 2, 9 })); + diag = tensor.GetDiagonal(2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 4 })); + Assert.Throws("offset", () => tensor.GetDiagonal(3)); + + diag = tensor.GetDiagonal(-1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 8, 7 })); + diag = tensor.GetDiagonal(-2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 1 })); + Assert.Throws("offset", () => tensor.GetDiagonal(-3)); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetDiagonalRectangle(TensorConstructor tensorConstructor) + { + var arr = new[,] + { + { 1, 2, 4, 3, 7 }, + { 8, 3, 9, 2, 6 }, + { 1, 7, 5, 2, 9 } + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var diag = tensor.GetDiagonal(); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 1, 3, 5 })); + diag = tensor.GetDiagonal(1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 2, 9, 2 })); + diag = tensor.GetDiagonal(2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 4, 2, 9 })); + diag = tensor.GetDiagonal(3); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 3, 6 })); + diag = tensor.GetDiagonal(4); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 7 })); + Assert.Throws("offset", () => tensor.GetDiagonal(5)); + + diag = tensor.GetDiagonal(-1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 8, 7 })); + diag = tensor.GetDiagonal(-2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, new[] { 1 })); + Assert.Throws("offset", () => tensor.GetDiagonal(-3)); + Assert.Throws("offset", () => tensor.GetDiagonal(-4)); + Assert.Throws("offset", () => 
tensor.GetDiagonal(-5)); + } + + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetDiagonalCube(TensorConstructor tensorConstructor) + { + var arr = new[, ,] + { + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }, + { + { 4, 5, 7 }, + { 1, 6, 2 }, + { 3, 0, 8 }, + }, + { + { 5, 6, 1 }, + { 2, 2, 3 }, + { 4, 9, 4 }, + }, + + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var diag = tensor.GetDiagonal(); + var expected = new[,] + { + { 1, 2, 4 }, + { 1, 6, 2 }, + { 4, 9, 4 } + }; + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(diag, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, diag.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetTriangleSquare(TensorConstructor tensorConstructor) + { + var arr = new[,] + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var tri = tensor.GetTriangle(0); + Assert.Equal(tensorConstructor.IsReversedStride, tri.IsReversedStride); + + var expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 0, 0 }, + { 8, 3, 0 }, + { 1, 7, 5 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(1); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 0 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(2); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetTriangle(3); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetTriangle(200); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetTriangle(-1); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0 }, + { 8, 0, 0 }, + { 1, 7, 0 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(-2); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 1, 0, 0 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + }); + tri = tensor.GetTriangle(-3); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + // same as -3, should it be an exception? 
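+            // GetTriangle(offset) keeps the entries whose column index exceeds the row index by at most
+            // `offset` (column - row <= offset), so for this 3x3 input any offset <= -3 zeroes everything
+            // and any offset >= 2 returns the full tensor; out-of-range offsets are clamped, not rejected.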
+ tri = tensor.GetTriangle(-4); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(-300); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetTriangleRectangle(TensorConstructor tensorConstructor) + { + var arr = new[,] + { + { 1, 2, 4, 3, 7 }, + { 8, 3, 9, 2, 6 }, + { 1, 7, 5, 2, 9 } + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var tri = tensor.GetTriangle(0); + var expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 0, 0, 0, 0 }, + { 8, 3, 0, 0, 0 }, + { 1, 7, 5, 0, 0 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, tri.IsReversedStride); + + tri = tensor.GetTriangle(1); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 0, 0, 0 }, + { 8, 3, 9, 0, 0 }, + { 1, 7, 5, 2, 0 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(2); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4, 0, 0 }, + { 8, 3, 9, 2, 0 }, + { 1, 7, 5, 2, 9 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(3); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4, 3, 0 }, + { 8, 3, 9, 2, 6 }, + { 1, 7, 5, 2, 9 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetTriangle(4); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4, 3, 7 }, + { 8, 3, 9, 2, 6 }, + { 1, 7, 5, 2, 9 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + // same as 4, should it be an exception? 
+ tri = tensor.GetTriangle(5); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(1000); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetTriangle(-1); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0, 0, 0 }, + { 8, 0, 0, 0, 0 }, + { 1, 7, 0, 0, 0 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0, 0, 0 }, + { 0, 0, 0, 0, 0 }, + { 1, 0, 0, 0, 0 } + }); + tri = tensor.GetTriangle(-2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0, 0, 0 }, + { 0, 0, 0, 0, 0 }, + { 0, 0, 0, 0, 0 } + }); + tri = tensor.GetTriangle(-3); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetTriangle(-4); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(-5); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetTriangle(-100); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetTriangleCube(TensorConstructor tensorConstructor) + { + var arr = new[, ,] + { + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }, + { + { 4, 5, 7 }, + { 1, 6, 2 }, + { 3, 0, 8 }, + }, + { + { 5, 6, 1 }, + { 2, 2, 3 }, + { 4, 9, 4 }, + }, + + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var tri = tensor.GetTriangle(0); + var expected = tensorConstructor.CreateFromArray(new[, ,] + { + { + { 1, 2, 4 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + }, + { + { 4, 5, 7 }, + { 1, 6, 2 }, + { 0, 0, 0 }, + }, + { + { 5, 6, 1 }, + { 2, 2, 3 }, + { 4, 9, 4 }, + }, + + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, tri.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetUpperTriangleSquare(TensorConstructor tensorConstructor) + { + var arr = new[,] + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var tri = tensor.GetUpperTriangle(0); + + var expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4 }, + { 0, 3, 9 }, + { 0, 0, 5 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, tri.IsReversedStride); + + tri = tensor.GetUpperTriangle(1); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 2, 4 }, + { 0, 0, 9 }, + { 0, 0, 0 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(2); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 4 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetUpperTriangle(3); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetUpperTriangle(4); + 
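+            // GetUpperTriangle(offset) is the complement: it keeps the entries with column - row >= offset,
+            // so on this 3x3 input any offset >= 3 yields the all-zero tensor expected here, with larger
+            // offsets clamped just like the lower-triangle case.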
Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(42); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetUpperTriangle(-1); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 0, 7, 5 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(-2); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetUpperTriangle(-3); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(-300); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetUpperTriangleRectangle(TensorConstructor tensorConstructor) + { + var arr = new[,] + { + { 1, 2, 4, 3, 7 }, + { 8, 3, 9, 2, 6 }, + { 1, 7, 5, 2, 9 } + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var tri = tensor.GetUpperTriangle(0); + var expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4, 3, 7 }, + { 0, 3, 9, 2, 6 }, + { 0, 0, 5, 2, 9 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, tri.IsReversedStride); + tri = tensor.GetUpperTriangle(1); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 2, 4, 3, 7 }, + { 0, 0, 9, 2, 6 }, + { 0, 0, 0, 2, 9 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(2); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 4, 3, 7 }, + { 0, 0, 0, 2, 6 }, + { 0, 0, 0, 0, 9 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(3); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0, 3, 7 }, + { 0, 0, 0, 0, 6 }, + { 0, 0, 0, 0, 0 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetUpperTriangle(4); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0, 0, 7 }, + { 0, 0, 0, 0, 0 }, + { 0, 0, 0, 0, 0 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + expected = tensorConstructor.CreateFromArray(new[,] + { + { 0, 0, 0, 0, 0 }, + { 0, 0, 0, 0, 0 }, + { 0, 0, 0, 0, 0 } + }); + tri = tensor.GetUpperTriangle(5); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(6); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(1000); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + tri = tensor.GetUpperTriangle(-1); + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4, 3, 7 }, + { 8, 3, 9, 2, 6 }, + { 0, 7, 5, 2, 9 } + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + + expected = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 4, 3, 7 }, + { 8, 3, 9, 2, 6 }, + { 1, 7, 5, 2, 9 } + }); + tri = tensor.GetUpperTriangle(-2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, 
expected)); + + tri = tensor.GetUpperTriangle(-3); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(-4); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + tri = tensor.GetUpperTriangle(-100); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetUpperTriangleCube(TensorConstructor tensorConstructor) + { + var arr = new[, ,] + { + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }, + { + { 4, 5, 7 }, + { 1, 6, 2 }, + { 3, 0, 8 }, + }, + { + { 5, 6, 1 }, + { 2, 2, 3 }, + { 4, 9, 4 }, + }, + + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var tri = tensor.GetUpperTriangle(0); + var expected = tensorConstructor.CreateFromArray(new[, ,] + { + { + { 1, 2, 4 }, + { 8, 3, 9 }, + { 1, 7, 5 }, + }, + { + { 0, 0, 0 }, + { 1, 6, 2 }, + { 3, 0, 8 }, + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 4, 9, 4 }, + }, + + }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tri, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, tri.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void Reshape(TensorConstructor tensorConstructor) + { + var arr = new[,] + { + { 1, 2, 3 }, + { 4, 5, 6 } + }; + + var tensor = tensorConstructor.CreateFromArray(arr); + var actual = tensor.Reshape(new[] { 3, 2 }); + + var expected = tensorConstructor.IsReversedStride ? + new[,] + { + { 1, 5 }, + { 4, 3 }, + { 2, 6 } + } : + new[,] + { + { 1, 2 }, + { 3, 4 }, + { 5, 6 } + }; + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Fact] + public void Identity() + { + var actual = Tensor.CreateIdentity(3); + + var expected = new[,] + { + {1.0, 0, 0 }, + {0, 1.0, 0 }, + {0, 0, 1.0 } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void CreateWithDiagonal(TensorConstructor tensorConstructor) + { + var diagonal = tensorConstructor.CreateFromArray(new[] { 1, 2, 3, 4, 5 }); + var actual = Tensor.CreateFromDiagonal(diagonal); + + var expected = new[,] + { + {1, 0, 0, 0, 0 }, + {0, 2, 0, 0, 0 }, + {0, 0, 3, 0, 0 }, + {0, 0, 0, 4, 0 }, + {0, 0, 0, 0, 5 } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void CreateWithDiagonal3D(TensorConstructor tensorConstructor) + { + var diagonal = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 3, 4, 5 }, + { 1, 2, 3, 4, 5 }, + { 1, 2, 3, 4, 5 } + }); + var actual = Tensor.CreateFromDiagonal(diagonal); + var expected = new[, ,] + { + { + {1, 2, 3, 4, 5 }, + {0, 0, 0, 0, 0 }, + {0, 0, 0, 0, 0 } + }, + { + {0, 0, 0, 0, 0 }, + {1, 2, 3, 4, 5 }, + {0, 0, 0, 0, 0 } + }, + { + {0, 0, 0, 0, 0 }, + {0, 0, 0, 0, 0 }, + {1, 2, 3, 4, 5 } + } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void CreateWithDiagonalAndOffset(TensorConstructor tensorConstructor) + { + var diagonal = tensorConstructor.CreateFromArray(new[] { 1, 2, 3, 4 }); + var actual = Tensor.CreateFromDiagonal(diagonal, 1); + + var 
expected = new[,] + { + {0, 1, 0, 0, 0 }, + {0, 0, 2, 0, 0 }, + {0, 0, 0, 3, 0 }, + {0, 0, 0, 0, 4 }, + {0, 0, 0, 0, 0 } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + + diagonal = tensorConstructor.CreateFromArray(new[] { 1, 2, 3, 4 }); + actual = Tensor.CreateFromDiagonal(diagonal, -1); + + expected = new[,] + { + {0, 0, 0, 0, 0 }, + {1, 0, 0, 0, 0 }, + {0, 2, 0, 0, 0 }, + {0, 0, 3, 0, 0 }, + {0, 0, 0, 4, 0 } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + + diagonal = tensorConstructor.CreateFromArray(new[] { 1 }); + actual = Tensor.CreateFromDiagonal(diagonal, -4); + expected = new[,] + { + {0, 0, 0, 0, 0 }, + {0, 0, 0, 0, 0 }, + {0, 0, 0, 0, 0 }, + {0, 0, 0, 0, 0 }, + {1, 0, 0, 0, 0 } + }; + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + + diagonal = tensorConstructor.CreateFromArray(new[] { 1 }); + actual = Tensor.CreateFromDiagonal(diagonal, 4); + expected = new[,] + { + {0, 0, 0, 0, 1 }, + {0, 0, 0, 0, 0 }, + {0, 0, 0, 0, 0 }, + {0, 0, 0, 0, 0 }, + {0, 0, 0, 0, 0 } + }; + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void CreateWithDiagonalAndOffset3D(TensorConstructor tensorConstructor) + { + var diagonal = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 3 }, + { 1, 2, 3 }, + { 1, 2, 3 } + }); + var actual = Tensor.CreateFromDiagonal(diagonal, 1); + + var expected = new[, ,] + { + { + { 0, 0, 0 }, + { 1, 2, 3 }, + { 0, 0, 0 }, + { 0, 0, 0 } + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 1, 2, 3 }, + { 0, 0, 0 } + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 1, 2, 3 } + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + + diagonal = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 3 }, + { 1, 2, 3 }, + { 1, 2, 3 } + }); + actual = Tensor.CreateFromDiagonal(diagonal, -1); + + expected = new[, ,] + { + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + }, + { + { 1, 2, 3 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + }, + { + { 0, 0, 0 }, + { 1, 2, 3 }, + { 0, 0, 0 }, + { 0, 0, 0 } + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 1, 2, 3 }, + { 0, 0, 0 } + } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + + diagonal = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 3 } + }); + actual = Tensor.CreateFromDiagonal(diagonal, 3); + + expected = new[, ,] + { + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 1, 2, 3 }, + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + + diagonal = tensorConstructor.CreateFromArray(new[,] + { + { 1, 2, 3 } + }); + actual = Tensor.CreateFromDiagonal(diagonal, -3); + + expected = new[, ,] + { + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + }, + { + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + }, + { + { 1, 2, 3 }, + { 0, 0, 0 }, + { 0, 0, 0 }, + { 0, 0, 0 } + } + }; + + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, 
expected)); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void Add(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + var right = rightConstructor.CreateFromArray( + new[,] + { + { 6, 7 ,8 }, + { 9, 10 ,11 }, + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + { 6, 8, 10 }, + { 12, 14, 16 }, + }); + + var actual = TensorOperations.Add(left, right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(leftConstructor.IsReversedStride, actual.IsReversedStride); + + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void AddScalar(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + { 1, 2, 3 }, + { 4, 5, 6 }, + }); + + var actual = TensorOperations.Add(tensor, 1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void UnaryPlus(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = tensor; + + var actual = TensorOperations.UnaryPlus(tensor); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.False(ReferenceEquals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void Subtract(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + var right = rightConstructor.CreateFromArray( + new[,] + { + { 6, 7 ,8 }, + { 9, 10 ,11 }, + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + { -6, -6, -6 }, + { -6, -6, -6}, + }); + + var actual = TensorOperations.Subtract(left, right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(leftConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void SubtractScalar(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + var expected = tensorConstructor.CreateFromArray( + new[,] + { + { -1, 0, 1 }, + { 2, 3, 4 }, + }); + + var actual = TensorOperations.Subtract(tensor, 1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void UnaryMinus(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {0, -1, -2}, + {-3, -4, -5} + }); + + var actual = TensorOperations.UnaryMinus(tensor); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.False(ReferenceEquals(actual, expected)); + 
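+            // TensorOperations.UnaryMinus allocates its result via CloneEmpty rather than negating in
+            // place, and CloneEmpty preserves the operand's stride layout, which the assertion below checks.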
Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void PrefixIncrement(TensorConstructor tensorConstructor) + { + Tensor tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expectedResult = tensorConstructor.CreateFromArray( + new[,] + { + {1, 2, 3}, + {4, 5, 6} + }); + + var expectedTensor = expectedResult; + + tensor = TensorOperations.Increment(tensor); + var actual = tensor; + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expectedResult)); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tensor, expectedTensor)); + Assert.True(ReferenceEquals(tensor, actual)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void PostfixIncrement(TensorConstructor tensorConstructor) + { + Tensor tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + // returns original value + var expectedResult = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + // increments operand + var expectedTensor = tensorConstructor.CreateFromArray( + new[,] + { + {1, 2, 3}, + {4, 5, 6} + }); ; + + var actual = tensor; + tensor = TensorOperations.Increment(tensor); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expectedResult)); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tensor, expectedTensor)); + Assert.False(ReferenceEquals(tensor, actual)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void PrefixDecrement(TensorConstructor tensorConstructor) + { + Tensor tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expectedResult = tensorConstructor.CreateFromArray( + new[,] + { + {-1, 0, 1}, + {2, 3, 4} + }); + + var expectedTensor = expectedResult; + + tensor = TensorOperations.Decrement(tensor); + var actual = tensor; + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expectedResult)); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tensor, expectedTensor)); + Assert.True(ReferenceEquals(tensor, actual)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void PostfixDecrement(TensorConstructor tensorConstructor) + { + Tensor tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + // returns original value + var expectedResult = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + // decrements operand + var expectedTensor = tensorConstructor.CreateFromArray( + new[,] + { + {-1, 0, 1}, + {2, 3, 4} + }); ; + + var actual = tensor; + tensor = TensorOperations.Decrement(tensor); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expectedResult)); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(tensor, expectedTensor)); + Assert.False(ReferenceEquals(tensor, actual)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void Multiply(TensorConstructor leftConstructor, 
TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + var right = rightConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 4}, + {9, 16, 25} + }); + + var actual = TensorOperations.Multiply(left, right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(leftConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void MultiplyScalar(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {0, 2, 4}, + {6, 8, 10} + }); + + var actual = TensorOperations.Multiply(tensor, 2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void Divide(TensorConstructor dividendConstructor, TensorConstructor divisorConstructor) + { + var dividend = dividendConstructor.CreateFromArray( + new[,] + { + {0, 1, 4}, + {9, 16, 25} + }); + + var divisor = divisorConstructor.CreateFromArray( + new[,] + { + {1, 1, 2}, + {3, 4, 5} + }); + + var expected = divisorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var actual = TensorOperations.Divide(dividend, divisor); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(dividendConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void DivideScalar(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 2, 4}, + {6, 8, 10} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var actual = TensorOperations.Divide(tensor, 2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void Modulo(TensorConstructor dividendConstructor, TensorConstructor divisorConstructor) + { + var dividend = dividendConstructor.CreateFromArray( + new[,] + { + {0, 3, 8}, + {11, 14, 17} + }); + + var divisor = divisorConstructor.CreateFromArray( + new[,] + { + {1, 2, 3}, + {4, 5, 6} + }); + + var expected = dividendConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var actual = TensorOperations.Modulo(dividend, divisor); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(dividendConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void ModuloScalar(TensorConstructor tensorConstructor) + { + var tensor = tensorConstructor.CreateFromArray( + new[,] + { + {0, 3, 4}, + {7, 8, 9} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 0}, + {1, 0, 1} + }); + + var actual = TensorOperations.Modulo(tensor, 2); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + 
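// the remainder is computed element-wise; the result keeps the operand's stride ordering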
Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void And(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 3}, + {7, 15, 31} + }); + + var right = rightConstructor.CreateFromArray( + new[,] + { + {1, 1, 3}, + {2, 4, 8} + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 3}, + {2, 4, 8} + }); + + var actual = TensorOperations.And(left, right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(leftConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void AndScalar(TensorConstructor tensorConstructor) + { + var left = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 3}, + {5, 15, 31} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {0, 0, 0}, + {4, 4, 20} + }); + + var actual = TensorOperations.And(left, 20); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void Or(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 3}, + {7, 14, 31} + }); + + var right = rightConstructor.CreateFromArray( + new[,] + { + {1, 2, 4}, + {2, 4, 8} + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + {1, 3, 7}, + {7, 14, 31} + }); + + var actual = TensorOperations.Or(left, right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(leftConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void OrScalar(TensorConstructor tensorConstructor) + { + var left = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {1, 1, 3}, + {3, 5, 5} + }); + + var actual = TensorOperations.Or(left, 1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void Xor(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 3}, + {7, 14, 31} + }); + + var right = rightConstructor.CreateFromArray( + new[,] + { + {1, 2, 4}, + {2, 4, 8} + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + {1, 3, 7}, + {5, 10, 23} + }); + + var actual = TensorOperations.Xor(left, right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(leftConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void XorScalar(TensorConstructor tensorConstructor) + { + var left = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {1, 0, 3}, + {2, 5, 4} + }); + + var actual = TensorOperations.Xor(left, 1); + 
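// XOR with 1 flips the lowest bit of each element: {0,1,2,3,4,5} ^ 1 == {1,0,3,2,5,4}, matching 'expected' above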
Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void LeftShift(TensorConstructor tensorConstructor) + { + var left = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {0, 2, 4}, + {6, 8, 10} + }); + + var actual = TensorOperations.LeftShift(left, 1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetSingleTensorConstructors))] + public void RightShift(TensorConstructor tensorConstructor) + { + var left = tensorConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var expected = tensorConstructor.CreateFromArray( + new[,] + { + {0, 0, 1}, + {1, 2, 2} + }); + + var actual = TensorOperations.RightShift(left, 1); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(tensorConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void ElementWiseEquals(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + var right = rightConstructor.CreateFromArray( + new[,] + { + {0, 1, -2}, + {2, 3, 5} + }); + + var expected = new[,] + { + {true, true, false }, + {false, false, true} + }.ToTensor(); + + var actual = TensorOperations.Equals(left, right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(leftConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory()] + [MemberData(nameof(GetDualTensorConstructors))] + public void ElementWiseNotEquals(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + var right = rightConstructor.CreateFromArray( + new[,] + { + {0, 1, -2}, + {2, 3, 5} + }); + + var expected = new[,] + { + {false, false, true}, + {true, true, false} + }.ToTensor(); + + var actual = TensorOperations.NotEquals(left, right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + Assert.Equal(leftConstructor.IsReversedStride, actual.IsReversedStride); + } + + [Theory] + [MemberData(nameof(GetDualTensorConstructors))] + public void MatrixMultiply(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {0, 1, 2}, + {3, 4, 5} + }); + + var right = rightConstructor.CreateFromArray( + new[,] + { + {0, 1, 2, 3, 4}, + {5, 6, 7, 8, 9}, + {10, 11, 12, 13, 14} + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + {0*0 + 1*5 + 2*10, 0*1 + 1*6 + 2*11, 0*2 + 1*7 + 2*12, 0*3 + 1*8 + 2*13, 0*4 + 1*9 + 2*14}, + {3*0 + 4*5 + 5*10, 3*1 + 4*6 + 5*11, 3*2 + 4*7 + 5*12, 3*3 + 4*8 + 5*13, 3*4 + 4*9 + 5*14} + }); + + var actual = left.MatrixMultiply(right); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + } + + + [Theory] + [MemberData(nameof(GetDualTensorConstructors))] + public void Contract(TensorConstructor leftConstructor, TensorConstructor 
rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[, ,] + { + { + {0, 1}, + {2, 3} + }, + { + {4, 5}, + {6, 7} + }, + { + {8, 9}, + {10, 11} + } + }); + + var right = rightConstructor.CreateFromArray( + new[, ,] + { + { + {0, 1}, + {2, 3}, + {4, 5} + }, + { + {6, 7}, + {8, 9}, + {10, 11} + }, + { + {12, 13}, + {14, 15}, + {16, 17} + }, + { + {18, 19}, + {20, 21}, + {22, 23} + } + }); + + // contract a 3*2*2 with a 4*3*2 tensor, summing on (3*2)*2 and 4*(3*2) to produce a 2*4 tensor + var expected = leftConstructor.CreateFromArray( + new[,] + { + {110, 290, 470, 650}, + {125, 341, 557, 773}, + }); + var actual = TensorOperations.Contract(left, right, new[] { 0, 1 }, new[] { 1, 2 }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + + // contract a 3*2*2 with a 4*3*2 tensor, summing on (3)*2*(2) and 4*(3*2) to produce a 2*4 tensor + expected = leftConstructor.CreateFromArray( + new[,] + { + {101, 263, 425, 587}, + {131, 365, 599, 833}, + }); + actual = TensorOperations.Contract(left, right, new[] { 0, 2 }, new[] { 1, 2 }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + } + + + [Theory] + [MemberData(nameof(GetDualTensorConstructors))] + public void ContractWithSingleLengthDimension(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[,] + { + {1, 2, 3}, + {4, 5, 6}, + }); + + var right = rightConstructor.CreateFromArray( + new[,] + { + { 1, 2 }, + { 3, 4 }, + { 5, 6 } + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + { 22, 28 }, + { 49, 64 } + }); + + // contract a 2*3 with a 3*2 tensor, summing on 2*(3) and (3)*2 to produce a 2*2 tensor + var actual = TensorOperations.Contract(left, right, new[] { 1 }, new[] { 0 }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + + // contract a 1*2*3*1 with a 3*2 tensor, summing on 1*2*(3)*1 and (3)*2 to produce a 1*2*1*2 tensor + var reshapedLeft = left.Reshape(new int[] { 1, 2, 3, 1 }); + var reshapedExpected = expected.Reshape(new int[] { 1, 2, 1, 2 }); + actual = TensorOperations.Contract(reshapedLeft, right, new[] { 2 }, new[] { 0 }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, reshapedExpected)); + + } + + [Theory] + [MemberData(nameof(GetDualTensorConstructors))] + public void ContractMismatchedDimensions(TensorConstructor leftConstructor, TensorConstructor rightConstructor) + { + var left = leftConstructor.CreateFromArray( + new[] { 0, 1, 2, 3 }); + + var right = rightConstructor.CreateFromArray( + new[,] + { + { 0 }, + { 1 }, + { 2 } + }); + + var expected = leftConstructor.CreateFromArray( + new[,] + { + {0,0,0}, + {0,1,2}, + {0,2,4}, + {0,3,6}, + }); + + Assert.Throws(() => TensorOperations.Contract(left, right, new int[] { }, new[] { 1 })); + + // reshape to include dimension of length 1. 
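+            // Contracting the two length-1 axes leaves a 4*3 result equal to the outer product of the 4-element left vector and the 3-element right column:
+            // result[i, j] = left[i] * right[j, 0], e.g. result[3, 2] = 3 * 2 = 6, matching 'expected' above.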
+ var leftReshaped = left.Reshape(new[] { 1, (int)left.Length }); + + var actual = TensorOperations.Contract(leftReshaped, right, new[] { 0 }, new[] { 1 }); + Assert.True(StructuralComparisons.StructuralEqualityComparer.Equals(actual, expected)); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void GetArrayString(TensorConstructor constructor) + { + var tensor = constructor.CreateFromArray( + new[, ,] + { + { + {0, 1}, + {2, 3}, + {4, 5} + }, + { + {6, 7}, + {8, 9}, + {10, 11} + }, + { + {12, 13}, + {14, 15}, + {16, 17} + }, + { + {18, 19}, + {20, 21}, + {22, 23} + } + }); + + var expected = +@"{ + { + {0,1}, + {2,3}, + {4,5} + }, + { + {6,7}, + {8,9}, + {10,11} + }, + { + {12,13}, + {14,15}, + {16,17} + }, + { + {18,19}, + {20,21}, + {22,23} + } +}"; + + Assert.Equal(expected, tensor.GetArrayString()); + + var expectedNoSpace = expected.Replace(Environment.NewLine, "").Replace(" ", ""); + Assert.Equal(expectedNoSpace, tensor.GetArrayString(false)); + } + + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void TestICollectionMembers(TensorConstructor constructor) + { + var arr = new[,] + { + { 1, 2, 3 }, + { 4, 5, 6 } + }; + + var tensor = constructor.CreateFromArray(arr); + ICollection tensorCollection = tensor; + + Assert.Equal(6, tensorCollection.Count); + + Assert.False(tensorCollection.IsSynchronized); + + Assert.True(ReferenceEquals(tensorCollection, tensorCollection.SyncRoot)); + + var actual = Array.CreateInstance(typeof(int), tensor.Length); + tensorCollection.CopyTo(actual, 0); + var expected = constructor.IsReversedStride ? + new[] { 1, 4, 2, 5, 3, 6 } : + new[] { 1, 2, 3, 4, 5, 6 }; + Assert.Equal(expected, actual); + + actual = Array.CreateInstance(typeof(int), tensor.Length + 2); + tensorCollection.CopyTo(actual, 2); + expected = constructor.IsReversedStride ? + new[] { 0, 0, 1, 4, 2, 5, 3, 6 } : + new[] { 0, 0, 1, 2, 3, 4, 5, 6 }; + Assert.Equal(expected, actual); + + Assert.Throws(() => tensorCollection.CopyTo(null, 0)); + Assert.Throws(() => tensorCollection.CopyTo(new int[3, 4], 0)); + Assert.Throws(() => tensorCollection.CopyTo(new int[5], 0)); + Assert.Throws(() => tensorCollection.CopyTo(new int[6], 1)); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void TestIListMembers(TensorConstructor constructor) + { + var arr = new[,] + { + { 1, 2, 3 }, + { 4, 5, 6 } + }; + + var tensor = constructor.CreateFromArray(arr); + IList tensorList = tensor; + + int expectedIndexValue = constructor.IsReversedStride ? 4 : 2; + Assert.Equal(expectedIndexValue, tensorList[1]); + + tensorList[1] = 7; + Assert.Equal(7, tensorList[1]); + var expected = constructor.IsReversedStride ? + new[] { 1, 7, 2, 5, 3, 6 } : + new[] { 1, 7, 3, 4, 5, 6 }; + Assert.Equal(expected, tensor); + + Assert.True(tensorList.IsFixedSize); + Assert.False(tensorList.IsReadOnly); + + Assert.Throws(() => (tensorList).Add(8)); + + Assert.True(tensorList.Contains(5)); + Assert.True(tensorList.Contains(6)); + Assert.False(tensorList.Contains(0)); + Assert.False(tensorList.Contains(42)); + Assert.False(tensorList.Contains("foo")); + + Assert.Equal(constructor.IsReversedStride ? 
3 : 4, tensorList.IndexOf(5)); + Assert.Equal(5, tensorList.IndexOf(6)); + Assert.Equal(-1, tensorList.IndexOf(0)); + Assert.Equal(-1, tensorList.IndexOf(42)); + + Assert.Throws(() => (tensorList).Insert(2, 5)); + Assert.Throws(() => (tensorList).Remove(1)); + Assert.Throws(() => (tensorList).RemoveAt(0)); + + tensorList.Clear(); + Assert.Equal(new[] { 0, 0, 0, 0, 0, 0 }, tensor); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void TestICollectionTMembers(TensorConstructor constructor) + { + var arr = new[,] + { + { 1, 2, 3 }, + { 4, 5, 6 } + }; + + var tensor = constructor.CreateFromArray(arr); + ICollection tensorCollection = tensor; + + Assert.Equal(6, tensorCollection.Count); + Assert.False(tensorCollection.IsReadOnly); + + Assert.Throws(() => tensorCollection.Add(8)); + Assert.Throws(() => tensorCollection.Remove(1)); + + Assert.True(tensorCollection.Contains(5)); + Assert.True(tensorCollection.Contains(6)); + Assert.False(tensorCollection.Contains(0)); + Assert.False(tensorCollection.Contains(42)); + + var actual = new int[tensor.Length]; + tensorCollection.CopyTo(actual, 0); + var expected = constructor.IsReversedStride ? + new[] { 1, 4, 2, 5, 3, 6 } : + new[] { 1, 2, 3, 4, 5, 6 }; + Assert.Equal(expected, actual); + + actual = new int[tensor.Length + 2]; + tensorCollection.CopyTo(actual, 2); + expected = constructor.IsReversedStride ? + new[] { 0, 0, 1, 4, 2, 5, 3, 6 } : + new[] { 0, 0, 1, 2, 3, 4, 5, 6 }; + Assert.Equal(expected, actual); + + Assert.Throws(() => tensorCollection.CopyTo(null, 0)); + Assert.Throws(() => tensorCollection.CopyTo(new int[5], 0)); + Assert.Throws(() => tensorCollection.CopyTo(new int[6], 1)); + + tensorCollection.Clear(); + Assert.Equal(new[] { 0, 0, 0, 0, 0, 0 }, tensor); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void TestIListTMembers(TensorConstructor constructor) + { + var arr = new[,] + { + { 1, 2, 3 }, + { 4, 5, 6 } + }; + + var tensor = constructor.CreateFromArray(arr); + IList tensorList = tensor; + + int expectedIndexValue = constructor.IsReversedStride ? 4 : 2; + Assert.Equal(expectedIndexValue, tensorList[1]); + + tensorList[1] = 7; + Assert.Equal(7, tensorList[1]); + var expected = constructor.IsReversedStride ? + new[] { 1, 7, 2, 5, 3, 6 } : + new[] { 1, 7, 3, 4, 5, 6 }; + Assert.Equal(expected, tensor); + + Assert.Equal(constructor.IsReversedStride ? 3 : 4, tensorList.IndexOf(5)); + Assert.Equal(5, tensorList.IndexOf(6)); + Assert.Equal(-1, tensorList.IndexOf(0)); + Assert.Equal(-1, tensorList.IndexOf(42)); + + Assert.Throws(() => (tensorList).Insert(2, 5)); + Assert.Throws(() => (tensorList).RemoveAt(0)); + } + + [Theory] + [MemberData(nameof(GetSingleTensorConstructors))] + public void TestIReadOnlyTMembers(TensorConstructor constructor) + { + var arr = new[,] + { + { 1, 2, 3 }, + { 4, 5, 6 } + }; + + var tensor = constructor.CreateFromArray(arr); + + IReadOnlyCollection tensorCollection = tensor; + Assert.Equal(6, tensorCollection.Count); + + IReadOnlyList tensorList = tensor; + int expectedIndexValue = constructor.IsReversedStride ? 4 : 2; + Assert.Equal(expectedIndexValue, tensorList[1]); + } + } +} diff --git a/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTestsBase.cs b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTestsBase.cs new file mode 100644 index 0000000000000..f7b2ac774e650 --- /dev/null +++ b/csharp/test/Microsoft.ML.OnnxRuntime.Tests/Tensors/TensorTestsBase.cs @@ -0,0 +1,164 @@ +// Copyright (c) Microsoft Corporation. 
All rights reserved. +// Licensed under the MIT License. + +// This file is copied and adapted from the following git repository - +// https://github.com/dotnet/corefx +// Commit ID: bdd0814360d4c3a58860919f292a306242f27da1 +// Path: /src/System.Numerics.Tensors/tests/TensorTestsBase.cs +// Original license statement below - + +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. +// See the LICENSE file in the project root for more information. + +using System.Collections.Generic; +using System; + +namespace Microsoft.ML.OnnxRuntime.Tensors.Tests +{ + public class TensorTestsBase + { + public enum TensorType + { + Dense + }; + + public class TensorConstructor + { + public TensorType TensorType { get; set; } + + public bool IsReversedStride { get; set; } + + public Tensor<int> CreateFromArray(Array array) + { + switch (TensorType) + { + case TensorType.Dense: + return array.ToTensor<int>(IsReversedStride); + } + + throw new ArgumentException(nameof(TensorType)); + } + public Tensor<int> CreateFromDimensions(ReadOnlySpan<int> dimensions) + { + switch (TensorType) + { + case TensorType.Dense: + return new DenseTensor<int>(dimensions, IsReversedStride); + } + + throw new ArgumentException(nameof(TensorType)); + } + + public override string ToString() + { + return $"{TensorType}, {nameof(IsReversedStride)} = {IsReversedStride}"; + } + } + + private static TensorType[] s_tensorTypes = new[] + { + TensorType.Dense + }; + + private static bool[] s_reverseStrideValues = new[] + { + false, + true + }; + + public static IEnumerable<object[]> GetSingleTensorConstructors() + { + foreach (TensorType tensorType in s_tensorTypes) + { + foreach (bool isReversedStride in s_reverseStrideValues) + { + yield return new[] + { + new TensorConstructor() + { + TensorType = tensorType, + IsReversedStride = isReversedStride + } + }; + } + } + } + + public static IEnumerable<object[]> GetDualTensorConstructors() + { + foreach (TensorType leftTensorType in s_tensorTypes) + { + foreach (TensorType rightTensorType in s_tensorTypes) + { + foreach (bool isLeftReversedStride in s_reverseStrideValues) + { + foreach (bool isRightReversedStride in s_reverseStrideValues) + { + yield return new[] + { + new TensorConstructor() + { + TensorType = leftTensorType, + IsReversedStride = isLeftReversedStride + }, + new TensorConstructor() + { + TensorType = rightTensorType, + IsReversedStride = isRightReversedStride + } + }; + } + } + } + } + } + + public static IEnumerable<object[]> GetTensorAndResultConstructor() + { + foreach (TensorType leftTensorType in s_tensorTypes) + { + foreach (TensorType rightTensorType in s_tensorTypes) + { + foreach (bool isReversedStride in s_reverseStrideValues) + { + yield return new[] + { + new TensorConstructor() + { + TensorType = leftTensorType, + IsReversedStride = isReversedStride + }, + new TensorConstructor() + { + TensorType = rightTensorType, + IsReversedStride = isReversedStride + } + }; + } + } + } + } + + public static NativeMemory<T> NativeMemoryFromArray<T>(T[] array) + { + return NativeMemoryFromArray<T>((Array)array); + } + + public static NativeMemory<T> NativeMemoryFromArray<T>(Array array) + { + // this silly method takes a managed array and copies it over to unmanaged memory, + // **only for test purposes** + + var memory = NativeMemory<T>.Allocate(array.Length); + var span = memory.GetSpan(); + int index = 0; + foreach (T item in array) + { + span[index++] = item; + } + + return memory; + } + } +} diff --git a/csharp/testdata/test_types_BOOL.pb
b/csharp/testdata/test_types_BOOL.pb index 005aa79303179a00829b60fa90054a93960f4d30..2c58b06d0aa6d1adcd80239aecec060807492e93 100644 GIT binary patch delta 76 zcmZ3^IGvG~gIS2((bF?8ttio|bs}$wnK&0~W?n&Qi4Y$b4+p0Z2Nx3uBM`GDNpP{{ NmzH3Wabgl+2LM?B4BY?# delta 92 zcmbQvxSWxfgIS2((bF?8ttioI>O|fUH+e4B%)Elq5+N}z5e`lv0WKyEMr53%z{Qqd MT7pM~6O#Zt03MGGi2wiq diff --git a/csharp/testdata/test_types_INT8.pb b/csharp/testdata/test_types_INT8.pb index b6947e990a46ae50b8450502eb3616ef50529bc5..72698a779578d667502629bfbdbb7836e2257a7c 100644 GIT binary patch delta 76 zcmZ3^IGvG~gIS2((bF?8ttio|bs}$wnK&0~W?n&Qi4Y$b4^WbWi;05~h*^^)xY+Vb MOR&f|F$u5(09UXK(EtDd delta 92 zcmbQvxSWxfgIS2((bF?8ttioI>O|fUH+e4B%)Elq5+N}z5ul_17ZV2~GEP$9V#_Zr L!K1>7Nq`*y9Ip+1 diff --git a/csharp/testdata/test_types_STRING.pb b/csharp/testdata/test_types_STRING.pb index 16927acbcc5e3604326067719f330c0c746753db..7c8b3e7e2eb82ada293453d760007b565f92c759 100644 GIT binary patch delta 76 zcmZ3^IGvG~gIS2((bF?8ttio|bs}$wnK&0~W?n&Qi4Y$b4+n=32Nx3uBM`GDNpP{{ NmzH3Wabgl+2LM>e4BG$z delta 92 zcmbQvxSWxfgIS2((bF?8ttioI>O|fUH+e4B%)Elq5+N}z5e^O^0WKyEMr53%z{Qqd MT7pM~6O#Zt03JULhX4Qo diff --git a/csharp/tools/Microsoft.ML.OnnxRuntime.PerfTool/Program.cs b/csharp/tools/Microsoft.ML.OnnxRuntime.PerfTool/Program.cs index 0e9915ad6696d..79b89aec17acb 100644 --- a/csharp/tools/Microsoft.ML.OnnxRuntime.PerfTool/Program.cs +++ b/csharp/tools/Microsoft.ML.OnnxRuntime.PerfTool/Program.cs @@ -3,7 +3,7 @@ using System; using System.Collections.Generic; -using System.Numerics.Tensors; +using Microsoft.ML.OnnxRuntime.Tensors; using System.Diagnostics; using CommandLine; @@ -33,7 +33,7 @@ class CommandOptions public bool ParallelExecution { get; set; } = false; [Option('o', "optimization_level", Required = false, HelpText = "Optimization Level. 
Default is 1, partial optimization.")] - public uint OptimizationLevel { get; set; } = 1; + public GraphOptimizationLevel OptimizationLevel { get; set; } = GraphOptimizationLevel.ORT_ENABLE_BASIC; } class Program @@ -42,7 +42,8 @@ public static void Main(string[] args) { var cmdOptions = Parser.Default.ParseArguments(args); cmdOptions.WithParsed( - options => { + options => + { Run(options); }); } @@ -52,7 +53,7 @@ public static void Run(CommandOptions options) string inputPath = options.InputFile; int iteration = options.IterationCount; bool parallelExecution = options.ParallelExecution; - uint optLevel = options.OptimizationLevel; + GraphOptimizationLevel optLevel = options.OptimizationLevel; Console.WriteLine("Running model {0} in OnnxRuntime:", modelPath); Console.WriteLine("input:{0}", inputPath); Console.WriteLine("iteration count:{0}", iteration); @@ -84,17 +85,17 @@ public static float[] LoadTensorFromFile(string filename) return tensorData.ToArray(); } - static void RunModelOnnxRuntime(string modelPath, string inputPath, int iteration, DateTime[] timestamps, bool parallelExecution, uint optLevel) + static void RunModelOnnxRuntime(string modelPath, string inputPath, int iteration, DateTime[] timestamps, bool parallelExecution, GraphOptimizationLevel optLevel) { if (timestamps.Length != (int)TimingPoint.TotalCount) { - throw new ArgumentException("Timestamps array must have "+(int)TimingPoint.TotalCount+" size"); + throw new ArgumentException("Timestamps array must have " + (int)TimingPoint.TotalCount + " size"); } timestamps[(int)TimingPoint.Start] = DateTime.Now; SessionOptions options = new SessionOptions(); - if (parallelExecution) options.DisableSequentialExecution(); - options.SetSessionGraphOptimizationLevel(optLevel); + if (parallelExecution) options.EnableSequentialExecution = false; + options.GraphOptimizationLevel = optLevel; using (var session = new InferenceSession(modelPath, options)) { timestamps[(int)TimingPoint.ModelLoaded] = DateTime.Now; @@ -108,12 +109,12 @@ static void RunModelOnnxRuntime(string modelPath, string inputPath, int iteratio container.Add(NamedOnnxValue.CreateFromTensor(name, tensor)); } - + timestamps[(int)TimingPoint.InputLoaded] = DateTime.Now; // Run the inference - for (int i=0; i < iteration; i++) + for (int i = 0; i < iteration; i++) { var results = session.Run(container); // results is an IReadOnlyList container Debug.Assert(results != null); @@ -132,7 +133,7 @@ static void RunModelOnnxRuntime(string modelPath, string inputPath, int iteratio static void PrintUsage() { Console.WriteLine("Usage:\n" - +"dotnet Microsoft.ML.OnnxRuntime.PerfTool " + + "dotnet Microsoft.ML.OnnxRuntime.PerfTool " ); } diff --git a/dockerfiles/Dockerfile.cuda b/dockerfiles/Dockerfile.cuda index 0a537b774873a..0358629b28e8f 100644 --- a/dockerfiles/Dockerfile.cuda +++ b/dockerfiles/Dockerfile.cuda @@ -18,11 +18,13 @@ WORKDIR /code ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:/code/cmake-3.14.3-Linux-x86_64/bin:/opt/miniconda/bin:${PATH} # Prepare onnxruntime repository & build onnxruntime with TensorRT -RUN git clone --single-branch --branch ${ONNXRUNTIME_SERVER_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\ - /bin/sh onnxruntime/dockerfiles/install_common_deps.sh &&\ +ADD scripts /tmp/scripts +RUN /bin/sh /tmp/scripts/install_common_deps.sh && \ + git clone --single-branch --branch ${ONNXRUNTIME_SERVER_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\ + cp onnxruntime/ThirdPartyNotices.txt /code/ThirdPartyNotices.txt &&\ cp 
onnxruntime/dockerfiles/LICENSE-IMAGE.txt /code/LICENSE-IMAGE.txt &&\ cd onnxruntime &&\ /bin/sh ./build.sh --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --use_cuda --config Release --build_wheel --update --build --cmake_extra_defines ONNXRUNTIME_VERSION=$(cat ./VERSION_NUMBER) &&\ pip install /code/onnxruntime/build/Linux/Release/dist/*.whl &&\ cd .. &&\ - rm -rf onnxruntime cmake-3.14.3-Linux-x86_64.tar.gz cmake-3.14.3-Linux-x86_64 + rm -rf onnxruntime cmake-3.14.3-Linux-x86_64 diff --git a/dockerfiles/Dockerfile.openvino b/dockerfiles/Dockerfile.openvino index a829e1651de60..3574725654b4c 100644 --- a/dockerfiles/Dockerfile.openvino +++ b/dockerfiles/Dockerfile.openvino @@ -6,7 +6,8 @@ FROM ubuntu:16.04 RUN apt update && \ - apt -y install python3.5 python3-pip zip x11-apps lsb-core wget cpio sudo libboost-python-dev libpng-dev zlib1g-dev git libnuma1 ocl-icd-libopencl1 clinfo libboost-filesystem1.58.0 libboost-thread1.58.0 protobuf-compiler libprotoc-dev libusb-1.0-0-dev && pip3 install numpy networkx opencv-python pytest && locale-gen en_US.UTF-8 && update-locale LANG=en_US.UTF-8 + apt -y install git sudo wget \ + zip x11-apps lsb-core cpio libboost-python-dev libpng-dev zlib1g-dev libnuma1 ocl-icd-libopencl1 clinfo libboost-filesystem1.58.0 libboost-thread1.58.0 protobuf-compiler libprotoc-dev libusb-1.0-0-dev ARG DEVICE=CPU_FP32 ARG ONNXRUNTIME_REPO=https://github.com/microsoft/onnxruntime @@ -39,7 +40,7 @@ ENV OpenCV_DIR=${INTEL_OPENVINO_DIR}/opencv/share/OpenCV ENV LD_LIBRARY_PATH=${INTEL_OPENVINO_DIR}/opencv/lib:${INTEL_OPENVINO_DIR}/opencv/share/OpenCV/3rdparty/lib:${LD_LIBRARY_PATH} ENV PATH=${INTEL_CVSDK_DIR}/deployment_tools/model_optimizer:$PATH ENV PYTHONPATH=${INTEL_CVSDK_DIR}/deployment_tools/model_optimizer:$PYTHONPATH -ENV PYTHONPATH=$INTEL_CVSDK_DIR/python/python3.5:${INTEL_CVSDK_DIR}/python/python3.5/ubuntu16:${PYTHONPATH} +# ENV PYTHONPATH=$INTEL_CVSDK_DIR/python/python3.5:${INTEL_CVSDK_DIR}/python/python3.5/ubuntu16:${PYTHONPATH} ENV HDDL_INSTALL_DIR=${INTEL_OPENVINO_DIR}/deployment_tools/inference_engine/external/hddl ENV LD_LIBRARY_PATH=${INTEL_OPENVINO_DIR}/deployment_tools/inference_engine/external/hddl/lib:$LD_LIBRARY_PATH @@ -51,16 +52,15 @@ RUN wget https://github.com/intel/compute-runtime/releases/download/19.15.12831/ RUN sudo dpkg -i *.deb && rm -rf *.deb - -RUN mkdir -p /opt/cmake/bin - -ENV PATH /opt/cmake/bin:$PATH ENV LANG en_US.UTF-8 -RUN wget https://github.com/Kitware/CMake/releases/download/v3.13.2/cmake-3.13.2-Linux-x86_64.tar.gz && \ - tar -xf cmake-3.13.2-Linux-x86_64.tar.gz --strip 1 -C /opt/cmake && rm -rf /cmake-3.13.2-Linux-x86_64.tar.gz +WORKDIR /code +ENV PATH /opt/miniconda/bin:/code/cmake-3.14.3-Linux-x86_64/bin:$PATH -RUN git clone --recursive -b $ONNXRUNTIME_BRANCH $ONNXRUNTIME_REPO /onnxruntime && \ +ADD scripts /tmp/scripts +RUN /bin/sh /tmp/scripts/install_common_deps.sh && \ + git clone --recursive -b $ONNXRUNTIME_BRANCH $ONNXRUNTIME_REPO /onnxruntime && \ cd /onnxruntime/cmake/external/onnx && python3 setup.py install && \ - cd /onnxruntime && ./build.sh --config RelWithDebInfo --update --build --parallel --use_openvino $DEVICE --build_wheel && pip3 install /onnxruntime/build/Linux/RelWithDebInfo/dist/*-linux_x86_64.whl && rm -rf /onnxruntime - - + cp /onnxruntime/dockerfiles/LICENSE-IMAGE.txt /code/LICENSE-IMAGE.txt && \ + cp /onnxruntime/ThirdPartyNotices.txt /code/ThirdPartyNotices.txt && \ + cd /onnxruntime && ./build.sh --config RelWithDebInfo --update --build --parallel --use_openvino $DEVICE 
--build_wheel && \ + pip install /onnxruntime/build/Linux/RelWithDebInfo/dist/*-linux_x86_64.whl && rm -rf /onnxruntime cmake-3.14.3-Linux-x86_64 diff --git a/dockerfiles/Dockerfile.server b/dockerfiles/Dockerfile.server index 3eebdf6db036b..29fff56eadc3b 100644 --- a/dockerfiles/Dockerfile.server +++ b/dockerfiles/Dockerfile.server @@ -32,9 +32,8 @@ RUN mkdir -p /onnxruntime/build && \ FROM minimal AS final WORKDIR /onnxruntime/server/ -ENV MODEL_ABSOLUTE_PATH /onnxruntime/model/model.onnx COPY --from=build /onnxruntime/build/Release/onnxruntime_server /onnxruntime/server/ COPY --from=build /onnxruntime/build/Release/libonnxruntime.so.* /lib/ RUN apt-get update \ && apt-get install -y libgomp1 -ENTRYPOINT /onnxruntime/server/onnxruntime_server --model_path $MODEL_ABSOLUTE_PATH +ENTRYPOINT ["/onnxruntime/server/onnxruntime_server"] diff --git a/dockerfiles/Dockerfile.source b/dockerfiles/Dockerfile.source index 1a0f0921136fb..a0880ee68d84e 100644 --- a/dockerfiles/Dockerfile.source +++ b/dockerfiles/Dockerfile.source @@ -17,11 +17,12 @@ RUN apt-get update &&\ WORKDIR /code ENV PATH /opt/miniconda/bin:/code/cmake-3.14.3-Linux-x86_64/bin:${PATH} +ADD scripts /tmp/scripts # Prepare onnxruntime repository & build onnxruntime -RUN git clone --single-branch --branch ${ONNXRUNTIME_SERVER_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\ - /bin/sh onnxruntime/dockerfiles/install_common_deps.sh &&\ +RUN /bin/sh /tmp/scripts/install_common_deps.sh &&\ + git clone --single-branch --branch ${ONNXRUNTIME_SERVER_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\ cd onnxruntime &&\ /bin/sh ./build.sh --config Release --build_wheel --update --build --cmake_extra_defines ONNXRUNTIME_VERSION=$(cat ./VERSION_NUMBER) &&\ pip install /code/onnxruntime/build/Linux/Release/dist/*.whl &&\ cd .. &&\ - rm -rf onnxruntime cmake-3.14.3-Linux-x86_64.tar.gz cmake-3.14.3-Linux-x86_64 + rm -rf onnxruntime cmake-3.14.3-Linux-x86_64 diff --git a/dockerfiles/Dockerfile.tensorrt b/dockerfiles/Dockerfile.tensorrt index 6f3df1fbbba81..6ffd3b19f3f1a 100644 --- a/dockerfiles/Dockerfile.tensorrt +++ b/dockerfiles/Dockerfile.tensorrt @@ -17,12 +17,14 @@ RUN apt-get update &&\ WORKDIR /code ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:/code/cmake-3.14.3-Linux-x86_64/bin:/opt/miniconda/bin:${PATH} +ADD scripts /tmp/scripts # Prepare onnxruntime repository & build onnxruntime with TensorRT -RUN git clone --single-branch --branch ${ONNXRUNTIME_SERVER_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\ - /bin/sh onnxruntime/dockerfiles/install_common_deps.sh &&\ +RUN /bin/sh /tmp/scripts/install_common_deps.sh && \ + git clone --single-branch --branch ${ONNXRUNTIME_SERVER_BRANCH} --recursive ${ONNXRUNTIME_REPO} onnxruntime &&\ cp onnxruntime/dockerfiles/LICENSE-IMAGE.txt /code/LICENSE-IMAGE.txt &&\ + cp onnxruntime/ThirdPartyNotices.txt /code/ThirdPartyNotices.txt &&\ cd onnxruntime &&\ /bin/sh ./build.sh --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu/ --use_tensorrt --tensorrt_home /workspace/tensorrt --config Release --build_wheel --update --build --cmake_extra_defines ONNXRUNTIME_VERSION=$(cat ./VERSION_NUMBER) &&\ pip install /code/onnxruntime/build/Linux/Release/dist/*.whl &&\ cd .. 
&&\ - rm -rf onnxruntime cmake-3.14.3-Linux-x86_64.tar.gz cmake-3.14.3-Linux-x86_64 + rm -rf onnxruntime cmake-3.14.3-Linux-x86_64 diff --git a/dockerfiles/README.md b/dockerfiles/README.md index f395acc2ef6b5..8fdc7959d0437 100644 --- a/dockerfiles/README.md +++ b/dockerfiles/README.md @@ -8,7 +8,7 @@ - [OpenVINO](Dockerfile.openvino) - [ONNX Runtime Server](Dockerfile.server) -## Build from Source Version (Preview) +## Build from Source #### Linux 16.04, CPU, Python Bindings 1. Build the docker image from the Dockerfile in this repository. @@ -26,7 +26,7 @@ docker run -it onnxruntime-source ``` -## CUDA Version (Preview) +## CUDA #### Linux 16.04, CUDA 10.0, CuDNN 7 1. Build the docker image from the Dockerfile in this repository. @@ -44,7 +44,7 @@ docker run -it onnxruntime-cuda ``` -## nGraph Version (Preview) +## nGraph (Public Preview) #### Linux 16.04, Python Bindings 1. Build the docker image from the Dockerfile in this repository. @@ -62,7 +62,7 @@ docker run -it onnxruntime-ngraph ``` -## TensorRT Version (Preview) +## TensorRT #### Linux 16.04, TensorRT 5.0.2 1. Build the docker image from the Dockerfile in this repository. @@ -80,7 +80,7 @@ docker run -it onnxruntime-trt ``` -## OpenVINO Version (Preview) +## OpenVINO (Public Preview) #### Linux 16.04, Python Bindings 1. Build the onnxruntime image for all the accelerators supported as below @@ -104,7 +104,7 @@ | MYRIAD_FP16 | Intel MovidiusTM USB sticks | | VAD-M_FP16 | Intel Vision Accelerator Design based on MovidiusTM MyriadX VPUs | -## CPU Version +## CPU 1. Retrieve your docker image in one of the following ways. @@ -122,7 +122,7 @@ docker run -it onnxruntime-cpu ``` -## GPU Version +## GPU 1. Retrieve your docker image in one of the following ways. - Build the docker image from the DockerFile in this repository. @@ -138,7 +138,7 @@ ``` docker run -it --device /dev/dri:/dev/dri onnxruntime-gpu:latest ``` -## Myriad VPU Accelerator Version +## Myriad VPU Accelerator 1. Retrieve your docker image in one of the following ways. - Build the docker image from the DockerFile in this repository. @@ -155,6 +155,7 @@ docker run -it --network host --privileged -v /dev:/dev onnxruntime-myriad:latest ``` +======= ## VAD-M Accelerator Version 1. Retrieve your docker image in one of the following ways. @@ -172,7 +173,7 @@ docker run -it --device --mount type=bind,source=/var/tmp,destination=/var/tmp --device /dev/ion:/dev/ion onnxruntime-hddl:latest ``` -## ONNX Runtime Server (Preview) +## ONNX Runtime Server (Public Preview) #### Linux 16.04 1. Build the docker image from the Dockerfile in this repository @@ -183,7 +184,7 @@ 2. Run the ONNXRuntime server with the image created in step 1 ``` - docker run -v {localModelAbsoluteFolder}:{dockerModelAbsoluteFolder} -e MODEL_ABSOLUTE_PATH={dockerModelAbsolutePath} -p {your_local_port}:8001 {imageName} + docker run -v {localModelAbsoluteFolder}:{dockerModelAbsoluteFolder} -p {your_local_port}:8001 {imageName} --model_path {dockerModelAbsolutePath} ``` 3. 
Send HTTP requests to the container running ONNX Runtime Server diff --git a/dockerfiles/install_common_deps.sh b/dockerfiles/scripts/install_common_deps.sh similarity index 81% rename from dockerfiles/install_common_deps.sh rename to dockerfiles/scripts/install_common_deps.sh index dab394cb33fe7..173734332b761 100644 --- a/dockerfiles/install_common_deps.sh +++ b/dockerfiles/scripts/install_common_deps.sh @@ -13,13 +13,13 @@ apt-get install -y --no-install-recommends \ # Dependencies: conda wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-4.5.11-Linux-x86_64.sh -O ~/miniconda.sh --no-check-certificate && /bin/bash ~/miniconda.sh -b -p /opt/miniconda rm ~/miniconda.sh -/opt/miniconda/bin/conda clean -tipsy -find / -type d -name __pycache__ -prune -exec rm -rf {}; +/opt/miniconda/bin/conda clean -ya -conda install -y python=3.6 numpy -conda clean -aqy +/opt/miniconda/bin/conda install -y numpy +/opt/miniconda/bin/conda clean -aqy rm -rf /opt/miniconda/pkgs # Dependencies: cmake sudo wget --quiet https://github.com/Kitware/CMake/releases/download/v3.14.3/cmake-3.14.3-Linux-x86_64.tar.gz tar zxf cmake-3.14.3-Linux-x86_64.tar.gz +rm -rf cmake-3.14.3-Linux-x86_64.tar.gz \ No newline at end of file diff --git a/docs/ONNX_Runtime_Perf_Tuning.md b/docs/ONNX_Runtime_Perf_Tuning.md index b0ac2e0b2c039..b76b82534d34f 100644 --- a/docs/ONNX_Runtime_Perf_Tuning.md +++ b/docs/ONNX_Runtime_Perf_Tuning.md @@ -72,7 +72,7 @@ sess_options.set_graph_optimization_level(2) ``` * sess_options.session_thread_pool_size=2 controls how many thread do you want to use to run your model * sess_options.enable_sequential_execution=True controls whether you want to run operators in your graph sequentially or in parallel. Usually when your model has many branches, set this option to false will give you better performance. -* sess_options.set_graph_optimization_level(2). There are three levels, 0 means disable optimization, 1 means enable optimizations before graph partition, 2 means enable all optimization. +* sess_options.set_graph_optimization_level(2). Default is 1. Please see [onnxruntime_c_api.h](../include/onnxruntime/core/session/onnxruntime_c_api.h#L241) (enum GraphOptimizationLevel) for the full list of all optimization levels. ### MKL_DNN/nGraph/MKL_ML Execution Provider MKL_DNN, MKL_ML and nGraph all depends on openmp for parallization. For those execution providers, we need to use openmp enviroment variable to tune the performance. diff --git a/docs/OperatorKernels.md b/docs/OperatorKernels.md new file mode 100644 index 0000000000000..2cad94aae0481 --- /dev/null +++ b/docs/OperatorKernels.md @@ -0,0 +1,470 @@ +## Supported Operators Data Types +*This file is automatically generated from the + [def files](/onnxruntime/core/providers/cpu/cpu_execution_provider.cc) via [this script](/tools/python/gen_opkernel_doc.py). 
+ Do not modify directly and instead edit operator definitions.* + + + +## Operators implemented by CPUExecutionProvider + +| Op Name | Parameters | OpSet Version | Types Supported | +|---------|------------|---------------|-----------------| +**Operator Domain:** *ai.onnx.ml* +|Abs|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(int32), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(int64), tensor(double)| +|Acos|(*in* input:**T**, *out* output:**T**)|7+|**T** = tensor(float)| +|Acosh|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(float)| +|Add|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|7+|**T** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +|Affine|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|And|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|7+|**T** = tensor(bool)| +| | ||**T1** = tensor(bool)| +|ArgMax|(*in* data:**T**, *out* reduced:**tensor(int64)**)|1+|**T** = tensor(int32), tensor(float)| +|ArgMin|(*in* data:**T**, *out* reduced:**tensor(int64)**)|1+|**T** = tensor(int32), tensor(float)| +|ArrayFeatureExtractor|(*in* X:**T**, *in* Y:**tensor(int64)**, *out* Z:**T**)|1+|**T** = tensor(string), tensor(int32), tensor(float), tensor(int64), tensor(double)| +|Asin|(*in* input:**T**, *out* output:**T**)|7+|**T** = tensor(float)| +|Asinh|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(float)| +|Atan|(*in* input:**T**, *out* output:**T**)|7+|**T** = tensor(float)| +|Atanh|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(float)| +|AveragePool|(*in* X:**T**, *out* Y:**T**)|10+|**T** = tensor(float)| +| | |[7, 9]|**T** = tensor(float)| +|BatchNormalization|(*in* X:**T**, *in* scale:**T**, *in* B:**T**, *in* mean:**T**, *in* var:**T**, *out* Y:**T**, *out* mean:**T**, *out* var:**T**, *out* saved_mean:**T**, *out* saved_var:**T**)|[7, 9]|**B** = tensor(float)| +| | ||**X** = tensor(float)| +| | ||**mean** = tensor(float)| +| | ||**scale** = tensor(float)| +| | ||**var** = tensor(float)| +|Binarizer|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|Cast|(*in* input:**T1**, *out* output:**T2**)|9+|**T1** = tensor(string)| +| | ||**T2** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | |[6, 9]|**T1** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T2** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|CastMap|(*in* X:**T1**, *out* Y:**T2**)|1+|**T1** = unknown| +| | ||**T2** = tensor(string), tensor(float), tensor(int64)| +|CategoryMapper|(*in* X:**T1**, *out* Y:**T2**)|1+|**T1** = tensor(string), tensor(int64)| +| | ||**T2** = tensor(string), tensor(int64)| +|Ceil|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|Clip|(*in* input:**T**, *out* output:**T**)|6+|**T** = tensor(float)| +|Compress|(*in* input:**T**, *in* condition:**T1**, *out* output:**T**)|9+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T1** = tensor(bool)| +|Concat|(*in* 
inputs:**T**, *out* concat_result:**T**)|4+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|ConstantOfShape|(*in* input:**T1**, *out* output:**T2**)|9+|**T1** = tensor(int64)| +| | ||**T2** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Conv|(*in* X:**T**, *in* W:**T**, *in* B:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|ConvInteger|(*in* x:**T1**, *in* w:**T2**, *in* x_zero_point:**T1**, *in* w_zero_point:**T2**, *out* y:**T3**)|10+|**T1** = tensor(uint8)| +| | ||**T2** = tensor(uint8)| +| | ||**T3** = tensor(int32)| +|ConvTranspose|(*in* X:**T**, *in* W:**T**, *in* B:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|Cos|(*in* input:**T**, *out* output:**T**)|7+|**T** = tensor(float)| +|Cosh|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(float)| +|Crop|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|DepthToSpace|(*in* input:**T**, *out* output:**T**)|[1, 4]|**T** = tensor(float)| +|DequantizeLinear|(*in* x:**T**, *in* x_scale:**tensor(float)**, *in* x_zero_point:**T**, *out* y:**tensor(float)**)|10+|**x** = tensor(uint8), unknown| +| | ||**x_scale** = tensor(float)| +| | ||**x_zero_point** = tensor(uint8), unknown| +| | ||**y** = tensor(float)| +|DictVectorizer|(*in* X:**T1**, *out* Y:**T2**)|1+|**T1** = unknown| +| | ||**T2** = tensor(string), tensor(float), tensor(int64), tensor(double)| +|Div|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|7+|**T** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +|Dropout|(*in* data:**T**, *out* output:**T**, *out* mask:**T**) or (*in* data:**T**, *out* output:**T**, *out* mask:**T1**)|10+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**T1** = tensor(bool)| +| | |[7, 9]|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**T1** = tensor(bool)| +|DynamicSlice|(*in* data:**T**, *in* starts:**Tind**, *in* ends:**Tind**, *in* axes:**Tind**, *out* output:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +|Elu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|Equal|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|11+|**T** = tensor(float)| +| | ||**T1** = tensor(bool)| +| | |7+|**T** = tensor(int32), tensor(bool), tensor(int64)| +| | ||**T1** = tensor(bool)| +|Erf|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(float)| +|Exp|(*in* input:**T**, *out* output:**T**)|6+|**T** = tensor(float), tensor(double)| +|Expand|(*in* input:**T**, *in* shape:**tensor(int64)**, *out* output:**T**)|8+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|EyeLike|(*in* input:**T1**, *out* output:**T2**)|9+|**T1** = tensor(uint64), tensor(int32), tensor(float), tensor(int64), tensor(double)| +| | ||**T2** = tensor(uint64), tensor(int32), tensor(float), tensor(int64), tensor(double)| +|FeatureVectorizer|(*in* X:**T1**, *out* Y:**tensor(float)**)|1+|**T1** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +|Flatten|(*in* 
input:**T**, *out* output:**T**)|9+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | |[1, 8]|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Floor|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|GRU|(*in* X:**T**, *in* W:**T**, *in* R:**T**, *in* B:**T**, *in* sequence_lens:**T1**, *in* initial_h:**T**, *out* Y:**T**, *out* Y_h:**T**)|7+|**T** = tensor(float), tensor(double)| +| | ||**T1** = tensor(int32)| +|Gather|(*in* data:**T**, *in* indices:**Tind**, *out* output:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +|Gemm|(*in* A:**T**, *in* B:**T**, *in* C:**T**, *out* Y:**T**)|[7, 9]|**T** = tensor(float)| +|GlobalAveragePool|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|GlobalLpPool|(*in* X:**T**, *out* Y:**T**)|2+|**T** = tensor(float)| +|GlobalMaxPool|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|Greater|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|9+|**T** = tensor(int32), tensor(int64)| +| | ||**T1** = tensor(bool)| +| | |[7, 9]|**T** = tensor(float)| +| | ||**T1** = tensor(bool)| +|HardSigmoid|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|Hardmax|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|Identity|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|If|(*in* cond:**B**, *out* outputs:**V**)|1+|**B** = tensor(bool)| +| | ||**V** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|ImageScaler|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|Imputer|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(int64)| +|InstanceNormalization|(*in* input:**T**, *in* scale:**T**, *in* B:**T**, *out* output:**T**)|6+|**T** = tensor(float)| +|IsInf|(*in* X:**T1**, *out* Y:**T2**)|10+|**T1** = tensor(float), tensor(double)| +| | ||**T2** = tensor(bool)| +|IsNaN|(*in* X:**T1**, *out* Y:**T2**)|9+|**T1** = tensor(float), tensor(MLFloat16)| +| | ||**T2** = tensor(bool)| +|LRN|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|LSTM|(*in* X:**T**, *in* W:**T**, *in* R:**T**, *in* B:**T**, *in* sequence_lens:**T1**, *in* initial_h:**T**, *in* initial_c:**T**, *in* P:**T**, *out* Y:**T**, *out* Y_h:**T**, *out* Y_c:**T**)|7+|**T** = tensor(float), tensor(double)| +| | ||**T1** = tensor(int32)| +|LabelEncoder|(*in* X:**T1**, *out* Y:**T2**)|2+|**T1** = tensor(string), tensor(float), tensor(int64)| +| | ||**T2** = tensor(string), tensor(float), tensor(int64)| +| | |[1, 1]|**T1** = tensor(string), tensor(int64)| +| | ||**T2** = tensor(string), tensor(int64)| +|LeakyRelu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|Less|(*in* A:**T**, *in* 
B:**T**, *out* C:**T1**)|9+|**T** = tensor(int32), tensor(int64)| +| | ||**T1** = tensor(bool)| +| | |[7, 9]|**T** = tensor(float)| +| | ||**T1** = tensor(bool)| +|LinearClassifier|(*in* X:**T1**, *out* Y:**T2**, *out* Z:**tensor(float)**)|1+|**T1** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +| | ||**T2** = tensor(string), tensor(int64)| +|LinearRegressor|(*in* X:**T**, *out* Y:**tensor(float)**)|1+|**T** = tensor(float)| +|Log|(*in* input:**T**, *out* output:**T**)|6+|**T** = tensor(float)| +|LogSoftmax|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|Loop|(*in* M:**I**, *in* cond:**B**, *in* v_initial:**V**, *out* v_final_and_scan_outputs:**V**)|1+|**B** = tensor(bool)| +| | ||**I** = tensor(int64)| +| | ||**V** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|LpNormalization|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|LpPool|(*in* X:**T**, *out* Y:**T**)|2+|**T** = tensor(float)| +|MatMul|(*in* A:**T**, *in* B:**T**, *out* Y:**T**)|[1, 9]|**T** = tensor(float), tensor(double)| +| | |[9, 9]|**T** = tensor(uint64), tensor(int32), tensor(int64), tensor(uint32)| +|MatMulInteger|(*in* A:**T1**, *in* B:**T2**, *in* a_zero_point:**T1**, *in* b_zero_point:**T2**, *out* Y:**T3**)|10+|**T1** = tensor(uint8)| +| | ||**T2** = tensor(uint8)| +| | ||**T3** = tensor(int32)| +|Max|(*in* data_0:**T**, *out* max:**T**)|8+|**T** = tensor(float), tensor(double)| +| | |[6, 7]|**T** = tensor(float)| +|MaxPool|(*in* X:**T**, *out* Y:**T**) or (*in* X:**T**, *out* Y:**T**, *out* Indices:**I**)|10+|**I** = tensor(int64)| +| | ||**T** = tensor(float)| +| | |[1, 7]|**T** = tensor(float)| +| | |[8, 9]|**I** = tensor(int64)| +| | ||**T** = tensor(float)| +|MaxRoiPool|(*in* X:**T**, *in* rois:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|MaxUnpool|(*in* X:**T1**, *in* I:**T2**, *in* output_shape:**T2**, *out* output:**T1**)|9+|**T1** = tensor(float)| +| | ||**T2** = tensor(int64)| +|Mean|(*in* data_0:**T**, *out* mean:**T**)|8+|**T** = tensor(float)| +| | |[6, 7]|**T** = tensor(float)| +|MeanVarianceNormalization|(*in* X:**T**, *out* Y:**T**) or (*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(float)| +| | |[1, 8]|**T** = tensor(float)| +|Min|(*in* data_0:**T**, *out* min:**T**)|8+|**T** = tensor(float)| +| | |[6, 7]|**T** = tensor(float)| +|Mod|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|10+|**T** = tensor(int32), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Mul|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|7+|**T** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +|Multinomial|(*in* input:**T1**, *out* output:**T2**)|7+|**T1** = tensor(float)| +| | ||**T2** = tensor(int32), tensor(int64)| +|Neg|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(int32), tensor(float), unknown| +|NonZero|(*in* X:**T**, *out* Y:**tensor(int64)**)|9+|**T** = tensor(int32), tensor(float), tensor(bool), tensor(int64)| +|Normalizer|(*in* X:**T**, *out* Y:**tensor(float)**)|1+|**T** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +|Not|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(bool)| +| | ||**T1** = tensor(bool)| +|OneHot|(*in* indices:**T1**, *in* depth:**T2**, *in* values:**T3**, *out* output:**T3**)|9+|**T1** = tensor(int32), tensor(float), tensor(int64)| +| | 
||**T2** = tensor(int32), tensor(float), tensor(int64)| +| | ||**T3** = tensor(string), tensor(int32), tensor(float), tensor(int64)| +|OneHotEncoder|(*in* X:**T**, *out* Y:**tensor(float)**)|1+|**T** = tensor(string), tensor(float), tensor(int64), tensor(double)| +|Or|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|7+|**T** = tensor(bool)| +| | ||**T1** = tensor(bool)| +|PRelu|(*in* X:**T**, *in* slope:**T**, *out* Y:**T**)|[7, 9]|**T** = tensor(float)| +|Pad|(*in* data:**T**, *out* output:**T**)|2+|**T** = tensor(float)| +|ParametricSoftplus|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|Pow|(*in* X:**T**, *in* Y:**T**, *out* Z:**T**)|7+|**T** = tensor(float), tensor(double)| +|QLinearConv|(*in* x:**T1**, *in* x_scale:**tensor(float)**, *in* x_zero_point:**T1**, *in* w:**T2**, *in* w_scale:**tensor(float)**, *in* w_zero_point:**T2**, *in* y_scale:**tensor(float)**, *in* y_zero_point:**T3**, *in* B:**T4**, *out* y:**T3**)|10+|**T1** = tensor(uint8)| +| | ||**T2** = tensor(uint8)| +| | ||**T3** = tensor(uint8)| +| | ||**T4** = tensor(int32)| +|QLinearMatMul|(*in* a:**T1**, *in* a_scale:**tensor(float)**, *in* a_zero_point:**T1**, *in* b:**T2**, *in* b_scale:**tensor(float)**, *in* b_zero_point:**T2**, *in* y_scale:**tensor(float)**, *in* y_zero_point:**T3**, *out* y:**T3**)|10+|**T1** = tensor(uint8)| +| | ||**T2** = tensor(uint8)| +| | ||**T3** = tensor(uint8)| +|QuantizeLinear|(*in* x:**T1**, *in* y_scale:**tensor(float)**, *in* y_zero_point:**T2**, *out* y:**T2**)|10+|**x** = tensor(float)| +| | ||**y** = tensor(uint8), unknown| +| | ||**y_zero_point** = tensor(uint8), unknown| +|RNN|(*in* X:**T**, *in* W:**T**, *in* R:**T**, *in* B:**T**, *in* sequence_lens:**T1**, *in* initial_h:**T**, *out* Y:**T**, *out* Y_h:**T**)|7+|**T** = tensor(float)| +| | ||**T1** = tensor(int32)| +|RandomNormal|(*out* output:**T**)|1+|**T** = tensor(float), tensor(double)| +|RandomNormalLike|(*in* input:**T1**, *out* output:**T2**)|1+|**T1** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T2** = tensor(float), tensor(double)| +|RandomUniform|(*out* output:**T**)|1+|**T** = tensor(float), tensor(double)| +|RandomUniformLike|(*in* input:**T1**, *out* output:**T2**)|1+|**T1** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T2** = tensor(float), tensor(double)| +|Reciprocal|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|ReduceL1|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float)| +|ReduceL2|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float)| +|ReduceLogSum|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float)| +|ReduceLogSumExp|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float)| +|ReduceMax|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float)| +|ReduceMean|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float)| +|ReduceMin|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float)| +|ReduceProd|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float)| +|ReduceSum|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float), 
tensor(double)| +|ReduceSumSquare|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(int32), tensor(float), tensor(double)| +|Relu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|Reshape|(*in* data:**T**, *in* shape:**tensor(int64)**, *out* reshaped:**T**) or (*in* data:**T**, *out* reshaped:**T**)|5+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**shape** = tensor(int64)| +|Reshape_1||[1, 4]|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Resize|(*in* X:**T**, *in* scales:**tensor(float)**, *out* Y:**T**)|10+|**T** = tensor(int32), tensor(float), tensor(uint8)| +|ReverseSequence|(*in* input:**T**, *in* sequence_lens:**tensor(int64)**, *out* Y:**T**)|10+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|RoiAlign|(*in* X:**T1**, *in* rois:**T1**, *in* batch_indices:**T2**, *out* Y:**T1**)|10+|**T** = tensor(float), tensor(double)| +| | ||**T2** = tensor(int64)| +|SVMClassifier|(*in* X:**T1**, *out* Y:**T2**, *out* Z:**tensor(float)**)|1+|**T1** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +| | ||**T2** = tensor(string), tensor(int64)| +|SVMRegressor|(*in* X:**T**, *out* Y:**tensor(float)**)|1+|**T** = tensor(float)| +|Scale|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|ScaledTanh|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|Scaler|(*in* X:**T**, *out* Y:**tensor(float)**)|1+|**T** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +|Scan|(*in* sequence_lens:**I**, *in* initial_state_and_scan_inputs:**V**, *out* final_state_and_scan_outputs:**V**) or (*in* initial_state_and_scan_inputs:**V**, *out* final_state_and_scan_outputs:**V**)|9+|**I** = tensor(int64)| +| | ||**V** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | |[8, 8]|**I** = tensor(int64)| +| | ||**V** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Scatter|(*in* data:**T**, *in* indices:**Tind**, *in* updates:**T**, *out* output:**T**)|9+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +|Selu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|Shape|(*in* data:**T**, *out* shape:**T1**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T1** = tensor(int64)| +|Shrink|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(int32), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, 
tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Sigmoid|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|Sign|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(int32), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Sin|(*in* input:**T**, *out* output:**T**)|7+|**T** = tensor(float), tensor(double)| +|Sinh|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(float)| +|Size|(*in* data:**T**, *out* size:**T1**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(int64), tensor(double)| +| | ||**T1** = tensor(int64)| +|Slice|(*in* data:**T**, *out* output:**T**) or (*in* data:**T**, *in* starts:**Tind**, *in* ends:**Tind**, *in* axes:**Tind**, *in* steps:**Tind**, *out* output:**T**)|10+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +| | |[1, 9]|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Softmax|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|Softplus|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|Softsign|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|SpaceToDepth|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|Split|(*in* input:**T**, *out* outputs:**T**) or (*in* input:**T**, *in* split:**T**, *out* outputs...:**T**)|2+|**T** = tensor(string), tensor(int32), tensor(float)| +|Sqrt|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(double)| +|Squeeze|(*in* data:**T**, *out* squeezed:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|StringNormalizer|(*in* X:**tensor(string)**, *out* Y:**tensor(string)**)|10+|**T** = tensor(string)| +|Sub|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|7+|**T** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +|Sum|(*in* data_0:**T**, *out* sum:**T**)|8+|**T** = tensor(float)| +| | |[6, 7]|**T** = tensor(float)| +|Tan|(*in* input:**T**, *out* output:**T**)|7+|**T** = tensor(float)| +|Tanh|(*in* input:**T**, *out* output:**T**)|6+|**T** = tensor(float)| +|TfIdfVectorizer|(*in* X:**T**, *out* Y:**T1**)|9+|**T** = tensor(string), tensor(int32), tensor(int64)| +| | ||**T1** = tensor(float)| +|ThresholdedRelu|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +| | |10+|**T** = tensor(float)| +|Tile|(*in* input:**T**, *in* tiles:**T**, *in* axis:**T**, *out* output:**T**) or (*in* input:**T**, *in* repeats:**T1**, *out* output:**T**)|6+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(int64), tensor(double)| +| | ||**T1** = tensor(int64)| +|TopK|(*in* X:**T**, *in* K:**tensor(int64)**, *out* Values:**T**, *out* Indices:**I**) or (*in* X:**T**, *out* Values:**T**, *out* Indices:**I**)|10+|**I** = tensor(int64)| +| 
| ||**T** = tensor(float)| +| | |[1, 9]|**I** = tensor(int64)| +| | ||**T** = tensor(float)| +|Transpose|(*in* data:**T**, *out* transposed:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|TreeEnsembleClassifier|(*in* X:**T1**, *out* Y:**T2**, *out* Z:**tensor(float)**)|1+|**T1** = tensor(int32), tensor(float), tensor(int64), tensor(double)| +| | ||**T2** = tensor(string), tensor(int64)| +|TreeEnsembleRegressor|(*in* X:**T**, *out* Y:**tensor(float)**)|1+|**T** = tensor(float)| +|Unsqueeze|(*in* data:**T**, *out* expanded:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Upsample|(*in* X:**T**, *out* Y:**T**) or (*in* X:**T**, *in* scales:**tensor(float)**, *out* Y:**T**)|[7, 9]|**T** = tensor(int32), tensor(float), tensor(uint8)| +|Where|(*in* condition:**B**, *in* X:**T**, *in* Y:**T**, *out* output:**T**)|9+|**T** = tensor(string), tensor(int32), tensor(float)| +|Xor|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|7+|**T** = tensor(bool)| +| | ||**T1** = tensor(bool)| +|ZipMap|(*in* X:**tensor(float)**, *out* Z:**T**)|1+|**T** = unknown| +| | +| | +**Operator Domain:** *com.microsoft* +|AttnLSTM|(*in* X:**T**, *in* W:**T**, *in* R:**T**, *in* B:**T**, *in* sequence_lens:**T1**, *in* initial_h:**T**, *in* initial_c:**T**, *in* P:**T**, *in* QW:**T**, *in* MW:**T**, *in* V:**T**, *in* M:**T**, *in* memory_seq_lens:**T1**, *in* AW:**T**, *out* Y:**T**, *out* Y_h:**T**, *out* Y_c:**T**)|1+|**T** = tensor(float), tensor(double)| +| | ||**T1** = tensor(int32)| +|ConvTransposeWithDynamicPads|(*in* X:**T**, *in* W:**T**, *in* Pads:**tensor(int64)**, *in* B:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|CropAndResize|(*in* X:**T1**, *in* rois:**T1**, *in* batch_indices:**T2**, *in* crop_size:**T2**, *out* Y:**T1**)|1+|**T** = tensor(float)| +| | ||**T2** = tensor(int32)| +|ExpandDims|(*in* X:**T**, *in* axis:**tensor(int32)**, *out* Y:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**axis** = tensor(int32)| +|FusedConv|(*in* X:**T**, *in* W:**T**, *in* B:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|FusedGemm|(*in* A:**T**, *in* B:**T**, *in* C:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|GatherND|(*in* data:**T**, *in* indices:**Tind**, *out* output:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(string), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +|MaxpoolWithMask|(*in* X:**T**, *in* M:**tensor(int32)**, *out* Y:**T**)|1+|**X** = tensor(float)| +|MurmurHash3|(*in* X:**T1**, *out* Y:**T2**)|1+|**T1** = tensor(string), tensor(int32), tensor(uint32)| +| | ||**T2** = tensor(int32), tensor(uint32)| +|Pad|(*in* data:**T**, *in* pads:**tensor(int64)**, *in* value:**T**, *out* output:**T**)|1+|**T** = tensor(float)| +|Range|(*in* start:**T**, *in* limit:**T**, *in* delta:**T**, *out* Y:**T**)|1+|**T** = tensor(int32), tensor(float), tensor(int64), tensor(int16), 
tensor(double)| +|SampleOp|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|Tokenizer|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(string)| +|Unique|(*in* x:**T**, *out* y:**T**, *out* idx:**tensor(int64)**, *out* counts:**tensor(int64)**)|1+|**T** = tensor(float)| +|WordConvEmbedding|(*in* Sequence:**T**, *in* W:**T1**, *in* B:**T1**, *in* C:**T1**, *out* Y:**T1**)|1+|**T** = tensor(int32)| +| | ||**T1** = tensor(float)| +| | +| | +**Operator Domain:** *com.microsoft.nchwc* +|AveragePool|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|Conv|(*in* X:**T**, *in* W:**T**, *in* B:**T**, *in* Sum:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|GlobalAveragePool|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|GlobalMaxPool|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|MaxPool|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|ReorderInput|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|ReorderOutput|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +| | +| | + + +## Operators implemented by CUDAExecutionProvider + +| Op Name | Parameters | OpSet Version | Types Supported | +|---------|------------|---------------|-----------------| +**Operator Domain:** *ai.onnx.ml* +|Abs|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(int32), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Add|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|7+|**T** = tensor(int32), tensor(uint32), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Affine|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|And|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|7+|**T** = tensor(bool)| +| | ||**T1** = tensor(bool)| +|ArgMax|(*in* data:**T**, *out* reduced:**tensor(int64)**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ArgMin|(*in* data:**T**, *out* reduced:**tensor(int64)**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|AveragePool|(*in* X:**T**, *out* Y:**T**)|10+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | |[7, 9]|**I** = tensor(int64)| +| | ||**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|BatchNormalization|(*in* X:**T**, *in* scale:**T**, *in* B:**T**, *in* mean:**T**, *in* var:**T**, *out* Y:**T**, *out* mean:**T**, *out* var:**T**, *out* saved_mean:**T**, *out* saved_var:**T**)|9+|**B** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**X** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**mean** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**scale** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**var** = tensor(float), tensor(MLFloat16), tensor(double)| +| | |[7, 8]|**B** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**X** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**mean** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**scale** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**var** = tensor(float), tensor(MLFloat16), tensor(double)| +|Cast|(*in* input:**T1**, *out* output:**T2**)|9+|**T1** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T2** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), 
tensor(MLFloat16), tensor(int64), tensor(double)| +| | |[6, 8]|**T1** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T2** = tensor(int32), tensor(bool), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Ceil|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Compress|(*in* input:**T**, *in* condition:**T1**, *out* output:**T**)|9+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T1** = tensor(bool)| +|Concat|(*in* inputs:**T**, *out* concat_result:**T**)|4+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|ConstantOfShape|(*in* input:**T1**, *out* output:**T2**)|9+|**T1** = tensor(int64)| +| | ||**T2** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Conv|(*in* X:**T**, *in* W:**T**, *in* B:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ConvTranspose|(*in* X:**T**, *in* W:**T**, *in* B:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Crop|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Div|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|7+|**T** = tensor(int32), tensor(uint32), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Dropout|(*in* data:**T**, *out* output:**T**, *out* mask:**T**) or (*in* data:**T**, *out* output:**T**, *out* mask:**T1**)|10+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**T1** = tensor(bool)| +| | |[7, 9]|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|DynamicSlice|(*in* data:**T**, *in* starts:**Tind**, *in* ends:**Tind**, *in* axes:**Tind**, *out* output:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +|Elu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Equal|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|7+|**T** = tensor(int32), tensor(bool), tensor(int64)| +|Erf|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Exp|(*in* input:**T**, *out* output:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Expand|(*in* input:**T**, *in* shape:**tensor(int64)**, *out* output:**T**)|8+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Flatten|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), 
tensor(MLFloat16), tensor(int64), tensor(double)| +| | |[1, 8]|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Floor|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|GRU|(*in* X:**T**, *in* W:**T**, *in* R:**T**, *in* B:**T**, *in* sequence_lens:**T1**, *in* initial_h:**T**, *out* Y:**T**, *out* Y_h:**T**)|7+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**T1** = tensor(int32)| +|Gather|(*in* data:**T**, *in* indices:**Tind**, *out* output:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +|Gemm|(*in* A:**T**, *in* B:**T**, *in* C:**T**, *out* Y:**T**)|9+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | |[7, 8]|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|GlobalAveragePool|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|GlobalMaxPool|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Greater|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|9+|**T** = tensor(int32), tensor(uint32), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T1** = tensor(bool)| +| | |[7, 8]|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|HardSigmoid|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Identity|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|ImageScaler|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|InstanceNormalization|(*in* input:**T**, *in* scale:**T**, *in* B:**T**, *out* output:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|LRN|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|LSTM|(*in* X:**T**, *in* W:**T**, *in* R:**T**, *in* B:**T**, *in* sequence_lens:**T1**, *in* initial_h:**T**, *in* initial_c:**T**, *in* P:**T**, *out* Y:**T**, *out* Y_h:**T**, *out* Y_c:**T**)|7+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**T1** = tensor(int32)| +|LeakyRelu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Log|(*in* input:**T**, *out* output:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|MatMul|(*in* A:**T**, *in* B:**T**, *out* Y:**T**)|9+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | |[1, 8]|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Max|(*in* data_0:**T**, *out* max:**T**)|8+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | |[6, 7]|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|MaxPool|(*in* X:**T**, *out* Y:**T**) or (*in* X:**T**, *out* Y:**T**, *out* Indices:**I**)|10+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | |[1, 7]|**I** = tensor(int64)| +| | ||**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | |[8, 9]|**I** = tensor(int64)| +| | ||**T** = tensor(float), 
tensor(MLFloat16), tensor(double)| +|MemcpyFromHost|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|MemcpyToHost|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Min|(*in* data_0:**T**, *out* min:**T**)|8+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | |[6, 7]|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Mul|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|7+|**T** = tensor(int32), tensor(uint32), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Neg|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(int32), tensor(int16), unknown, tensor(float), tensor(MLFloat16), tensor(int64), tensor(double)| +|Or|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|7+|**T** = tensor(bool)| +| | ||**T1** = tensor(bool)| +|PRelu|(*in* X:**T**, *in* slope:**T**, *out* Y:**T**)|7+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Pad|(*in* data:**T**, *out* output:**T**)|2+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ParametricSoftplus|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Pow|(*in* X:**T**, *in* Y:**T**, *out* Z:**T**)|7+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|RNN|(*in* X:**T**, *in* W:**T**, *in* R:**T**, *in* B:**T**, *in* sequence_lens:**T1**, *in* initial_h:**T**, *out* Y:**T**, *out* Y_h:**T**)|7+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**T1** = tensor(int32)| +|Reciprocal|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceL1|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceL2|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceLogSum|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceLogSumExp|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceMax|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceMean|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceMin|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceProd|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceSum|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ReduceSumSquare|(*in* data:**T**, *out* reduced:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Relu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Reshape|(*in* data:**T**, *in* shape:**tensor(int64)**, *out* reshaped:**T**) or (*in* data:**T**, *out* reshaped:**T**)|5+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**shape** = tensor(int64)| +|Reshape_1||[1, 
4]|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Resize|(*in* X:**T**, *in* scales:**tensor(float)**, *out* Y:**T**)|10+|**T** = tensor(int32), tensor(float), tensor(MLFloat16), tensor(uint8), tensor(double)| +|ScaledTanh|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Selu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Shape|(*in* data:**T**, *out* shape:**T1**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**T1** = tensor(int64)| +|Shrink|(*in* input:**T**, *out* output:**T**)|9+|**T** = tensor(int32), tensor(int16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Sigmoid|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Slice|(*in* data:**T**, *out* output:**T**) or (*in* data:**T**, *in* starts:**Tind**, *in* ends:**Tind**, *in* axes:**Tind**, *in* steps:**Tind**, *out* output:**T**)|10+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +| | |[1, 9]|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | ||**Tind** = tensor(int32), tensor(int64)| +|Softmax|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Softplus|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Softsign|(*in* input:**T**, *out* output:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Split|(*in* input:**T**, *out* outputs:**T**) or (*in* input:**T**, *in* split:**T**, *out* outputs...:**T**)|2+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Sqrt|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Squeeze|(*in* data:**T**, *out* squeezed:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Sub|(*in* A:**T**, *in* B:**T**, *out* C:**T**)|7+|**T** = tensor(int32), tensor(uint32), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Sum|(*in* data_0:**T**, *out* sum:**T**)|8+|**T** = tensor(int32), tensor(uint32), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +| | |[6, 7]|**T** = tensor(int32), tensor(uint32), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Tanh|(*in* input:**T**, *out* output:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|ThresholdedRelu|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float), 
tensor(MLFloat16), tensor(double)| +| | |10+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Tile|(*in* input:**T**, *in* tiles:**T**, *in* axis:**T**, *out* output:**T**) or (*in* input:**T**, *in* repeats:**T1**, *out* output:**T**)|6+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +| | ||**T1** = tensor(int64)| +|Transpose|(*in* data:**T**, *out* transposed:**T**)|1+|**T** = tensor(float), tensor(MLFloat16), tensor(double)| +|Unsqueeze|(*in* data:**T**, *out* expanded:**T**)|1+|**T** = tensor(int32), tensor(bool), tensor(int16), tensor(bfloat16), tensor(uint8), unknown, tensor(uint32), tensor(uint16), tensor(float), tensor(uint64), tensor(MLFloat16), tensor(int64), tensor(double)| +|Upsample|(*in* X:**T**, *out* Y:**T**) or (*in* X:**T**, *in* scales:**tensor(float)**, *out* Y:**T**)|[7, 9]|**T** = tensor(int32), tensor(float), tensor(MLFloat16), tensor(uint8), tensor(double)| +|Xor|(*in* A:**T**, *in* B:**T**, *out* C:**T1**)|7+|**T** = tensor(bool)| +| | ||**T1** = tensor(bool)| +| | +| | +**Operator Domain:** *com.microsoft* +|ConvTransposeWithDynamicPads|(*in* X:**T**, *in* W:**T**, *in* Pads:**tensor(int64)**, *in* B:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +| | +| | + + +## Operators implemented by MKLDNNExecutionProvider + +| Op Name | Parameters | OpSet Version | Types Supported | +|---------|------------|---------------|-----------------| +**Operator Domain:** *ai.onnx.ml* +|AveragePool|(*in* X:**T**, *out* Y:**T**)|[7, 8]|**T** = tensor(float)| +|BatchNormalization|(*in* X:**T**, *in* scale:**T**, *in* B:**T**, *in* mean:**T**, *in* var:**T**, *out* Y:**T**, *out* mean:**T**, *out* var:**T**, *out* saved_mean:**T**, *out* saved_var:**T**)|7+|**T** = tensor(float)| +|Conv|(*in* X:**T**, *in* W:**T**, *in* B:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|Gemm|(*in* A:**T**, *in* B:**T**, *in* C:**T**, *out* Y:**T**)|7+|**T** = tensor(float)| +|GlobalAveragePool|(*in* X:**T**, *out* Y:**T**)|[1, 8]|**T** = tensor(float)| +|GlobalMaxPool|(*in* X:**T**, *out* Y:**T**)|[1, 8]|**T** = tensor(float)| +|LRN|(*in* X:**T**, *out* Y:**T**)|1+|**T** = tensor(float)| +|MaxPool|(*in* X:**T**, *out* Y:**T**) or (*in* X:**T**, *out* Y:**T**, *out* Indices:**I**)|[1, 7]|**T** = tensor(float)| +| | |[8, 8]|**T** = tensor(float)| +|Relu|(*in* X:**T**, *out* Y:**T**)|6+|**T** = tensor(float)| +|Sum|(*in* data_0:**T**, *out* sum:**T**)|6+|**T** = tensor(float)| +| | +| | diff --git a/docs/Versioning.md b/docs/Versioning.md index d646d777d8335..18a43eb712b05 100644 --- a/docs/Versioning.md +++ b/docs/Versioning.md @@ -45,7 +45,7 @@ A variety of tools can be used to create ONNX models. Unless otherwise noted, pl |Tool|Recommended Version|Supported ONNX version(s)| |---|---|---| -|[PyTorch](https://pytorch.org/)|[Latest stable](https://pytorch.org/get-started/locally/)|1.2-1.5*
*may require [ONNX version converter](https://github.com/onnx/onnx/blob/master/docs/VersionConverter.md) to convert to desired opset #*| +|[PyTorch](https://pytorch.org/)|[Latest stable](https://pytorch.org/get-started/locally/)|1.2-1.5| |[ONNXMLTools](https://pypi.org/project/onnxmltools/)
CoreML, LightGBM, XGBoost, LibSVM|[Latest stable](https://github.com/onnx/onnxmltools/releases)|1.2-1.5| |[ONNXMLTools](https://pypi.org/project/onnxmltools/)
SparkML|[Latest stable](https://github.com/onnx/onnxmltools/releases)|1.4-1.5| |[SKLearn-ONNX](https://pypi.org/project/skl2onnx/)|[Latest stable](https://github.com/onnx/sklearn-onnx/releases)|1.2-1.5| diff --git a/docs/execution_providers/Nuphar-ExecutionProvider.md b/docs/execution_providers/Nuphar-ExecutionProvider.md new file mode 100644 index 0000000000000..a7c859a818e75 --- /dev/null +++ b/docs/execution_providers/Nuphar-ExecutionProvider.md @@ -0,0 +1,142 @@ +## Nuphar Execution Provider (preview) + +NUPHAR stands for Neural-network Unified Preprocessing Heterogeneous ARchitecture. As an execution provider in the ONNX Runtime, it is built on top of [TVM](https://github.com/dmlc/tvm) and [LLVM](https://llvm.org) to accelerate ONNX models by compiling nodes in subgraphs into optimized functions via JIT. It also provides JIT caching to save compilation time at runtime. + +This execution provider release is currently in preview. With the Nuphar execution provider, the ONNX Runtime delivers better inferencing performance on the same hardware compared to generic X64 CPU acceleration, especially for quantized recurrent neural networks. Various products at Microsoft have seen up to a 5x improvement in performance with no loss of accuracy, by running quantized LSTMs via the Nuphar execution provider in the ONNX Runtime. + +### Build Nuphar execution provider +Developers can now tap into the power of Nuphar through ONNX Runtime to accelerate inferencing of ONNX models. In addition, the Nuphar execution provider comes with a common ONNX-to-TVM lowering [library](../../onnxruntime/core/codegen) that can be reused by other execution providers to leverage TVM. Instructions to build the Nuphar execution provider from source are available [here](../../BUILD.md#nuphar). + +### Using the Nuphar execution provider +#### C/C++ +The Nuphar execution provider needs to be registered with ONNX Runtime to enable it in the inference session. The C API details are [here](../C_API.md#c-api). + +#### Python +You can use the Nuphar execution provider via the Python wheel from the ONNX Runtime build. The Nuphar execution provider will be automatically prioritized over the default CPU execution providers, so there is no need to register the execution provider separately. Python API details are [here](../python/api_summary.rst#api-summary). + +### Using onnxruntime_perf_test/onnx_test_runner for performance and accuracy test +You can test your ONNX model's performance with [onnxruntime_perf_test](../../onnxruntime/test/perftest/README.md), or test accuracy with [onnx_test_runner](../../onnxruntime/test/onnx/README.txt). To run these tools with the Nuphar execution provider, pass `-e nuphar` in the command line options. + +### Model conversion/quantization +You may use the Python script [model_editor.py](../../onnxruntime/core/providers/nuphar/scripts/model_editor.py) to convert LSTM/GRU/RNN ops to Scan ops for a given model, and then use [model_quantizer.py](../../onnxruntime/core/providers/nuphar/scripts/model_quantizer.py) to quantize MatMul ops into MatMulInteger ops. + +We use dynamic per-row quantization for the inputs of LSTM MatMul, so the MatMul becomes three parts: quantization, MatMulInteger and dequantization. Weights for MatMulInteger are statically quantized per-column to int8. We have observed good speed-up and no loss of accuracy with this quantization scheme inside Scan for various LSTM models.
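+
+To illustrate the scheme described above, here is a minimal numpy sketch (a hypothetical example for exposition only, not code from this repository; it assumes symmetric int8 quantization): the input is quantized per row at run time, the weight is quantized per column ahead of time, the product is computed with an integer GEMM (the MatMulInteger part), and the result is dequantized with both scales.
+
+```
+# Illustrative sketch only -- not the actual Nuphar/ONNX Runtime implementation.
+import numpy as np
+
+def quantized_matmul(X, W):
+    # Static per-column weight quantization to int8 (done once, offline).
+    w_scale = np.abs(W).max(axis=0) / 127.0 + 1e-12   # one scale per output column
+    Wq = np.round(W / w_scale).astype(np.int8)
+
+    # Dynamic per-row quantization of the input at run time.
+    x_scale = np.abs(X).max(axis=1, keepdims=True) / 127.0 + 1e-12   # one scale per row
+    Xq = np.round(X / x_scale).astype(np.int8)
+
+    # Integer matrix multiply, accumulated in int32 (the MatMulInteger part).
+    Yq = Xq.astype(np.int32) @ Wq.astype(np.int32)
+
+    # Dequantize back to float with the row and column scales.
+    return Yq.astype(np.float32) * x_scale * w_scale
+
+X = np.random.randn(4, 8).astype(np.float32)
+W = np.random.randn(8, 16).astype(np.float32)
+print(np.max(np.abs(quantized_matmul(X, W) - X @ W)))  # quantization error is typically small
+```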
+ +To convert models with LSTM/GRU/RNN ops to Scan ops: +``` +python model_editor.py --input /path/to/input/model --output /path/to/output/model --mode to_scan +``` + +To quantize MatMul ops to MatMulInteger ops (use the option --only_for_scan to quantize only the MatMuls inside Scan): +``` +python model_quantizer.py --input /path/to/input/model --output /path/to/output/model --only_for_scan +``` + +As an experiment, you may test conversion and quantization on [the BiDAF model](https://github.com/onnx/models/tree/master/bidaf) from the ONNX model zoo. This model has 5 bidirectional LSTM ops and long sequence lengths. Our test shows that the quantized model has comparable accuracy (F1 76.24, EM 68.08) vs. the floating point model (F1 76.20, EM 68.11). + +The speed-up on this model is ~20% on an Intel Xeon E5-1620v4 (note that AVX2 is required for Nuphar int8 GEMV performance), comparing the CPU execution provider running the floating point model with LSTM ops against the Nuphar execution provider running quantized MatMulInteger inside Scan ops. Profiling shows that most of the cost is in the input projection outside of Scan ops, which uses MKL SGEMM. It's worth noting that MKL int8 GEMM is about the same speed as SGEMM in this model, so quantization of the SGEMMs outside of Scan won't help performance. We are looking at ways to speed up int8 GEMM for better performance on quantized models. + +### JIT caching +You may cache JIT binaries to reduce the model loading time spent in JIT, using [create_shared.cmd](../../onnxruntime/core/providers/nuphar/scripts/create_shared.cmd) on Windows with Visual Studio 2017, or [create_shared.sh](../../onnxruntime/core/providers/nuphar/scripts/create_shared.sh) on Linux with gcc. + +Windows +``` +REM You need Visual Studio 2017 to compile and link. Optionally, you can save the model checksum to the output dll with the FCIV tool from https://support.microsoft.com/en-us/help/841290 +set NUPHAR_CACHE_PATH=\path\to\jit\cache +REM Then run Nuphar inference from either onnx_test_runner or onnxruntime_perf_test, or whatever inference using C++ or Python +REM JIT object files would be saved to \path\to\jit\cache\ +create_shared.cmd \path\to\jit\cache\NUPHAR_CACHE_VERSION [optional_model_file_for_checksum] [optional_output_dll_name] +REM If a checksum is embedded in the dll, set NUPHAR_CACHE_MODEL_CHECKSUM to the FCIV output for the model being inferenced to pass checksum verification at runtime +REM Checksum verification failure will cause Nuphar to fall back to JIT instead of loading the binary from the cache +REM Run Nuphar inference again with the cached JIT dll +``` + +Linux +``` +# You need GCC of the same version Nuphar is built with to compile and link.
Optionally, you can save the model checksum to jit.so with md5sum +export NUPHAR_CACHE_PATH=/path/to/jit/cache +# Then run Nuphar inference from either onnx_test_runner or onnxruntime_perf_test, or whatever inference using C++ or Python +# JIT object files would be saved to /path/to/jit/cache/ +create_shared.sh -c /path/to/jit/cache/NUPHAR_CACHE_VERSION [-m optional_model_file_for_checksum] [-o optional_output_so_name] +# If a checksum is embedded in the .so, set NUPHAR_CACHE_MODEL_CHECKSUM to the md5sum output for the model being inferenced to pass checksum verification at runtime +# Checksum verification failure will cause Nuphar to fall back to JIT instead of loading the binary from the cache +# Run Nuphar inference again with the cached JIT .so +``` + +### Debugging +There are several [environment variables](../../onnxruntime/core/codegen/common/settings.h) to dump debug information during code generation, plus [some more environment variables](../../onnxruntime/core/providers/nuphar/common/nuphar_settings.h) to dump/control the Nuphar execution provider. You can set these environment variables prior to inference to dump debug info to the console. Some of the most useful ones are: +* CODEGEN_DUMP_LOWER + + Dumps the lowered function from TVM. + + Set it to "verbose" to dump all nodes, or to a node op_type to dump specific nodes. You may use "concise" to dump just the op_type of nodes. + +* CODEGEN_DUMP_MODULE + + Dumps the compiled binary. + + Set it to "ll" to dump LLVM bitcode, or "asm" to dump assembly. + +* CODEGEN_DUMP_SCHEDULE + + Dumps the schedule used in TVM nodes, like compute_root/compute_inline/compute_at. + + Set it to "verbose" to dump all nodes, or to a node op_type to dump specific nodes. You may use "concise" to dump just the op_type of nodes. + +* NUPHAR_DUMP_PARTITION + + Dumps the nodes in each partition. + + Set it to "1" to dump partitions. + +### Settings +When environment variables would conflict across multiple processes running Nuphar, the user can specify a settings string when creating the Nuphar execution provider. The string consists of comma-separated key:value pairs. Keys should be the lower-cased environment variable names shown above, separated from the corresponding values with a colon. For example, the equivalent of setting the NUPHAR_CACHE_PATH/NUPHAR_CACHE_MODEL_CHECKSUM environment variables would be "nuphar_cache_path:<path_to_cache>, nuphar_cache_model_checksum:<model_checksum>". + +* Using in C/C++ + +The settings string can be specified when creating the execution provider, to set the JIT cache path as well as the model checksum: + +``` +OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_Nuphar(session_options, 1, "nuphar_cache_path:/path/to/cache, nuphar_cache_model_checksum:<model_checksum>"); +``` + +* Using in C# + +The settings string can be specified when creating session options: + +``` +SessionOptions.MakeSessionOptionWithNupharProvider("nuphar_cache_path:/path/to/cache, nuphar_cache_model_checksum:<model_checksum>") +``` + +* Using in Python + +The settings string should be passed in before the InferenceSession is created, as execution providers are not currently exposed in the Python API. Here's an example in Python that sets the cache path and model checksum: + +``` +nuphar_settings = 'nuphar_cache_path:{}, nuphar_cache_model_checksum:{}'.format(cache_dir, model_checksum) +onnxruntime.capi._pybind_state.set_nuphar_settings(nuphar_settings) +sess = onnxruntime.InferenceSession(model_path) +``` + +### Known issues +* ONNX shape inference dependency + + To save runtime JIT cost, Nuphar requires models to have shape inference information from ONNX after the model is loaded.
Some nodes in ONNX can generate dynamic output tensor shapes from input data values, e.g. ConstantOfShape, Tile, Slice in opset 10, Compress, etc. Those ops may block ONNX shape inference and make the part of the graph after such nodes not runnable in Nuphar. + + You may use the Python script [symbolic_shape_infer.py](../../onnxruntime/core/providers/nuphar/scripts/symbolic_shape_infer.py) to run symbolic shape inference on an ONNX model. This script adds output tensor shapes to the model's graph.value_info field, by doing symbolic dimension computation using sympy when there are Shape ops in the model. Running symbolic shape inference on an ONNX model also makes the graph more readable. Note that when using [model_editor.py](../../onnxruntime/core/providers/nuphar/scripts/model_editor.py) to convert models with LSTM/GRU/RNN to Scan, the resulting model may have incomplete shape inference. Running symbolic_shape_infer.py is needed to get the Scan ops in the model to run in Nuphar. Please note that quantization should be the last step, after verifying the accuracy and performance of the edited floating point model. + + In addition, you may manually add shapes to graph.value_info using [onnx.helper.make_tensor_value_info](https://github.com/onnx/onnx/blob/v1.5.0/onnx/helper.py#L290) with model-specific knowledge. For example, if a Hardmax output cast to bool is used as the Compress input condition, then the unknown dimension of the Compress output is actually 1. + +* Performance benchmark + + Nuphar's current speed-up for quantized RNNs is optimized for AVX2, single-threaded execution, and batch size 1. To help understand RNN performance in different configurations, please use the Python script [rnn_benchmark.py](../../onnxruntime/core/providers/nuphar/scripts/rnn_benchmark.py). For older X64 CPUs that do not support AVX2, quantized models may have worse performance than non-quantized ones. + +* Patches to TVM + + Some changes/bug fixes in TVM are needed for Nuphar to work properly. We are in the process of contributing them back to TVM, but for now patches are used in [our forked TVM](https://github.com/microsoft/onnxruntime-tvm). To build cleanly from scratch, please run the following commands before running build.bat or build.sh: +``` +git submodule sync +git submodule foreach --recursive git stash +git submodule foreach --recursive git clean -fd +git submodule update --init --recursive +``` \ No newline at end of file diff --git a/docs/execution_providers/TensorRT-ExecutionProvider.md b/docs/execution_providers/TensorRT-ExecutionProvider.md index 37c4c75ff58fa..a688d4c4cb813 100644 --- a/docs/execution_providers/TensorRT-ExecutionProvider.md +++ b/docs/execution_providers/TensorRT-ExecutionProvider.md @@ -1,11 +1,11 @@ -## TensortRT Execution Provider (preview) +## TensorRT Execution Provider -The TensorRT execution provider in the ONNX Runtime will make use of NVIDIA's [TensortRT](https://developer.nvidia.com/tensorrt) Deep Learning inferencing engine to accelerate ONNX model in their family of GPUs. Microsoft and NVIDIA worked closely to integrate the TensorRT execution provider with ONNX Runtime. +The TensorRT execution provider in the ONNX Runtime makes use of NVIDIA's [TensorRT](https://developer.nvidia.com/tensorrt) Deep Learning inferencing engine to accelerate ONNX models on their family of GPUs. Microsoft and NVIDIA worked closely to integrate the TensorRT execution provider with ONNX Runtime.
-This execution provider release is currently in preview but, we have validated support for all the ONNX Models in the model zoo. With the TensorRT execution provider, the ONNX Runtime delivers better inferencing performance on the same hardware compared to generic GPU acceleration. +With the TensorRT execution provider, the ONNX Runtime delivers better inferencing performance on the same hardware compared to generic GPU acceleration. ### Build TensorRT execution provider -Developers can now tap into the power of TensorRT through ONNX Runtime to accelerate inferencing of ONNX models. Instructions to build the TensorRT execution provider from source is available [here](https://github.com/Microsoft/onnxruntime/blob/master/BUILD.md#build). +Developers can now tap into the power of TensorRT through ONNX Runtime to accelerate inferencing of ONNX models. Instructions to build the TensorRT execution provider from source are available [here](https://github.com/Microsoft/onnxruntime/blob/master/BUILD.md#build). [Dockerfiles](https://github.com/microsoft/onnxruntime/tree/master/dockerfiles#tensorrt-version-preview) are available for convenience. ### Using the TensorRT execution provider #### C/C++ @@ -18,7 +18,23 @@ status = session_object.Load(model_file_name); The C API details are [here](https://github.com/Microsoft/onnxruntime/blob/master/docs/C_API.md#c-api). ### Python -When using the python wheel from the ONNX Runtime build with TensorRT execution provider, it will be automatically prioritized over the default GPU or CPU execution providers. There is no need to separately register the execution provider. Python APIs details are [here](https://github.com/Microsoft/onnxruntime/blob/master/docs/python/api_summary.rst#api-summary). +When using the Python wheel from the ONNX Runtime build with the TensorRT execution provider, it will be automatically prioritized over the default GPU or CPU execution providers. There is no need to separately register the execution provider. Python API details are [here](https://microsoft.github.io/onnxruntime/api_summary.html). + +### Performance Tuning +To test the performance of your ONNX Model with the TensorRT execution provider, use the flag `-e tensorrt` in [onnxruntime_perf_test](https://github.com/Microsoft/onnxruntime/tree/master/onnxruntime/test/perftest#onnxruntime-performance-test). + +### Sample +Please see [this Notebook](https://github.com/microsoft/onnxruntime/blob/master/docs/python/notebooks/onnx-inference-byoc-gpu-cpu-aks.ipynb) for an example of running a model on GPU using ONNX Runtime through Azure Machine Learning Services. ### Using onnxruntime_perf_test You can test the performance for your ONNX Model with the TensorRT execution provider. Use the flag `-e tensorrt` in [onnxruntime_perf_test](https://github.com/Microsoft/onnxruntime/tree/master/onnxruntime/test/perftest#onnxruntime-performance-test). + +### Configuring Engine Max Batch Size and Workspace Size +By default, the TensorRT execution provider builds an ICudaEngine with max batch size = 1 and max workspace size = 1 GB. +One can override these defaults by setting the environment variables ORT_TENSORRT_MAX_BATCH_SIZE and ORT_TENSORRT_MAX_WORKSPACE_SIZE. +e.g.
on Linux +#### override default batch size to 10 +export ORT_TENSORRT_MAX_BATCH_SIZE=10 +#### override default max workspace size to 2GB +export ORT_TENSORRT_MAX_WORKSPACE_SIZE=2147483648 + diff --git a/include/onnxruntime/core/common/callback.h b/include/onnxruntime/core/common/callback.h deleted file mode 100644 index b52288758d7e6..0000000000000 --- a/include/onnxruntime/core/common/callback.h +++ /dev/null @@ -1,17 +0,0 @@ -// Copyright (c) Microsoft Corporation. All rights reserved. -// Licensed under the MIT License. -#pragma once -#include "core/session/onnxruntime_c_api.h" - -#ifdef __cplusplus -extern "C" { -#endif - -typedef struct OrtCallback { - void(ORT_API_CALL* f)(void* param) NO_EXCEPTION; - void* param; -} OrtDeleter; - -#ifdef __cplusplus -} -#endif \ No newline at end of file diff --git a/include/onnxruntime/core/framework/allocator.h b/include/onnxruntime/core/framework/allocator.h index 8a37553ea976b..d25645123896d 100644 --- a/include/onnxruntime/core/framework/allocator.h +++ b/include/onnxruntime/core/framework/allocator.h @@ -72,6 +72,10 @@ struct OrtDevice { DeviceId device_id; }; +inline bool operator==(const OrtDevice& left, const OrtDevice& other) { + return left.Id() == other.Id() && left.MemType() == other.MemType() && left.Type() == other.Type(); +} + struct OrtAllocatorInfo { // use string for name, so we could have customized allocator in execution provider. const char* name; @@ -128,6 +132,8 @@ namespace onnxruntime { constexpr const char* CPU = "Cpu"; constexpr const char* CUDA = "Cuda"; constexpr const char* CUDA_PINNED = "CudaPinned"; +constexpr const char* TRT = "Tensorrt"; +constexpr const char* TRT_PINNED = "TensorrtPinned"; // forward declaration class SessionState; diff --git a/include/onnxruntime/core/framework/data_types.h b/include/onnxruntime/core/framework/data_types.h index ea90436c7c3a9..83248a2721d23 100644 --- a/include/onnxruntime/core/framework/data_types.h +++ b/include/onnxruntime/core/framework/data_types.h @@ -205,6 +205,9 @@ class DataTypeImpl { static const std::vector& AllTensorTypes(); static const std::vector& AllFixedSizeTensorTypes(); static const std::vector& AllNumericTensorTypes(); + static const std::vector& AllIEEEFloatTensorTypes(); + static const std::vector& AllFixedSizeTensorExceptHalfTypes(); + static const std::vector& AllIEEEFloatTensorExceptHalfTypes(); }; std::ostream& operator<<(std::ostream& out, MLDataType data_type); diff --git a/include/onnxruntime/core/framework/kernel_def_builder.h b/include/onnxruntime/core/framework/kernel_def_builder.h index 3c093f45401fb..5f783348365bf 100644 --- a/include/onnxruntime/core/framework/kernel_def_builder.h +++ b/include/onnxruntime/core/framework/kernel_def_builder.h @@ -42,6 +42,12 @@ class KernelDef { *end = op_since_version_end_; } +#ifdef onnxruntime_PYBIND_EXPORT_OPSCHEMA + const std::pair SinceVersion() const { + return std::pair(op_since_version_start_, op_since_version_end_); + } +#endif + onnxruntime::ProviderType Provider() const { return provider_type_; } diff --git a/include/onnxruntime/core/framework/kernel_registry.h b/include/onnxruntime/core/framework/kernel_registry.h index 3a0d35e298f98..95d9b1d415b92 100644 --- a/include/onnxruntime/core/framework/kernel_registry.h +++ b/include/onnxruntime/core/framework/kernel_registry.h @@ -39,6 +39,14 @@ class KernelRegistry { bool IsEmpty() const { return kernel_creator_fn_map_.empty(); } +#ifdef onnxruntime_PYBIND_EXPORT_OPSCHEMA +// This is used by the opkernel doc generator to enlist all registered operators 
for a given provider's opkernel + const KernelCreateMap& GetKernelCreateMap() const + { + return kernel_creator_fn_map_; + } +#endif + private: // Check whether the types of inputs/outputs of the given node match the extra // type-constraints of the given kernel. This serves two purposes: first, to diff --git a/include/onnxruntime/core/framework/op_kernel.h b/include/onnxruntime/core/framework/op_kernel.h index e02027a328fcf..6e98dbc20588b 100644 --- a/include/onnxruntime/core/framework/op_kernel.h +++ b/include/onnxruntime/core/framework/op_kernel.h @@ -210,7 +210,7 @@ struct KernelCreateInfo { : kernel_def(std::move(definition)), kernel_create_func(create_func) {} - KernelCreateInfo(KernelCreateInfo&& other) + KernelCreateInfo(KernelCreateInfo&& other) noexcept : kernel_def(std::move(other.kernel_def)), kernel_create_func(std::move(other.kernel_create_func)) {} }; @@ -231,6 +231,11 @@ template KernelCreateInfo BuildKernelCreateInfo(); } // namespace contrib +namespace automl { +template +KernelCreateInfo BuildKernelCreateInfo(); +} // namespace automl + namespace contrib { namespace cuda { template diff --git a/include/onnxruntime/core/framework/tensor.h b/include/onnxruntime/core/framework/tensor.h index 35eb359c714a3..31a43c7d905cb 100644 --- a/include/onnxruntime/core/framework/tensor.h +++ b/include/onnxruntime/core/framework/tensor.h @@ -78,9 +78,9 @@ class Tensor final { //Move is allowed ORT_DISALLOW_COPY_AND_ASSIGNMENT(Tensor); - Tensor(Tensor&& other); + Tensor(Tensor&& other) noexcept; - Tensor& operator=(Tensor&& other); + Tensor& operator=(Tensor&& other) noexcept; /** Returns the data type. diff --git a/include/onnxruntime/core/framework/tensor_shape.h b/include/onnxruntime/core/framework/tensor_shape.h index acf39638fe0db..c280f61eb1518 100644 --- a/include/onnxruntime/core/framework/tensor_shape.h +++ b/include/onnxruntime/core/framework/tensor_shape.h @@ -34,12 +34,13 @@ class TensorShape : private std::vector { TensorShape(TensorShape&& /*other*/) = default; TensorShape& operator=(TensorShape&& /*other*/) = default; - TensorShape(const int64_t* dimension_sizes, size_t dimension_count); + TensorShape(const std::vector& dims) : std::vector(dims) {} + + TensorShape(std::vector&& dims) : std::vector(std::move(dims)) {} - TensorShape(const std::vector& dims); - TensorShape(std::vector&& dims); + TensorShape(const std::initializer_list& dims) : std::vector(dims) {} - TensorShape(const std::initializer_list& dims); + TensorShape(const int64_t* dimension_sizes, size_t dimension_count); TensorShape(const std::vector& dims, size_t start, size_t end); diff --git a/include/onnxruntime/core/graph/constants.h b/include/onnxruntime/core/graph/constants.h index 5872228f383d2..6a960e82a3074 100644 --- a/include/onnxruntime/core/graph/constants.h +++ b/include/onnxruntime/core/graph/constants.h @@ -19,6 +19,7 @@ constexpr const char* kOnnxDomainAlias = "ai.onnx"; constexpr const char* kMLDomain = "ai.onnx.ml"; constexpr const char* kMSDomain = "com.microsoft"; constexpr const char* kMSNchwcDomain = "com.microsoft.nchwc"; +constexpr const char* kMSAutoMLDomain = "com.microsoft.automl"; constexpr const char* kNGraphDomain = "com.intel.ai"; constexpr const char* kCpuExecutionProvider = "CPUExecutionProvider"; constexpr const char* kCudaExecutionProvider = "CUDAExecutionProvider"; diff --git a/include/onnxruntime/core/graph/graph.h b/include/onnxruntime/core/graph/graph.h index b626a7541713f..1901822011f74 100644 --- a/include/onnxruntime/core/graph/graph.h +++ 
b/include/onnxruntime/core/graph/graph.h @@ -25,7 +25,6 @@ namespace onnxruntime { class Graph; struct IndexedSubGraph; -class Node; class OpSignature; /** diff --git a/include/onnxruntime/core/graph/graph_viewer.h b/include/onnxruntime/core/graph/graph_viewer.h index 7e2a0364ed0db..8d6530719dc20 100644 --- a/include/onnxruntime/core/graph/graph_viewer.h +++ b/include/onnxruntime/core/graph/graph_viewer.h @@ -38,6 +38,9 @@ class GraphViewer { */ bool GetInitializedTensor(const std::string& tensor_name, const ONNX_NAMESPACE::TensorProto*& value) const; + /** Returns true if an initializer value can be overridden by a graph input with the same name. */ + bool CanOverrideInitializer() const noexcept; + /** Gets the Graph inputs, excluding initializers. @returns Collection of NodeArg pointers for the graph inputs, excluding inputs that have matching initializers. @@ -102,9 +105,15 @@ class GraphViewer { return graph_->DomainToVersionMap(); } - /** Check if this is a Subgraph */ + /** Checks if this is a Subgraph */ bool IsSubgraph() const; + /** + returns true if 'name' is an initializer, and is constant and cannot be overridden at runtime. + @param check_outer_scope If true and the 'graph_' is a subgraph, check parent graph/s for 'name' if not found in 'graph_'. + */ + bool IsConstantInitializer(const std::string& name, bool check_outer_scope) const; + private: ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(GraphViewer); diff --git a/include/onnxruntime/core/platform/threadpool.h b/include/onnxruntime/core/platform/threadpool.h index 66952591ce470..3337583612065 100644 --- a/include/onnxruntime/core/platform/threadpool.h +++ b/include/onnxruntime/core/platform/threadpool.h @@ -7,12 +7,27 @@ #include #include +#if defined(__GNUC__) +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wunused-parameter" +#else +#pragma warning(push) +#pragma warning(disable : 4267) +#endif +#include +#if defined(__GNUC__) +#pragma GCC diagnostic pop +#else +#pragma warning(pop) +#endif + namespace onnxruntime { namespace concurrency { /** * Generic class for instantiating thread pools. + * Don't put any object of this type into a global variable in a Win32 DLL. */ class ThreadPool { public: @@ -43,14 +58,10 @@ class ThreadPool { int CurrentThreadId() const; - /* - Ensure that the pool has terminated and cleaned up all threads cleanly. - */ - ~ThreadPool(); + Eigen::ThreadPool& GetHandler() { return impl_; } private: - class Impl; - std::unique_ptr impl_; + Eigen::ThreadPool impl_; }; } // namespace concurrency diff --git a/include/onnxruntime/core/providers/nuphar/nuphar_provider_factory.h b/include/onnxruntime/core/providers/nuphar/nuphar_provider_factory.h new file mode 100644 index 0000000000000..58c82a0e1f251 --- /dev/null +++ b/include/onnxruntime/core/providers/nuphar/nuphar_provider_factory.h @@ -0,0 +1,17 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. +#pragma once +#include "core/session/onnxruntime_c_api.h" + +#ifdef __cplusplus +extern "C" { +#endif +/** + * \param device_id nuphar device id, starts from zero. + * \param target_str TVM target string. 
+ */ +ORT_API_STATUS(OrtSessionOptionsAppendExecutionProvider_Nuphar, _In_ OrtSessionOptions* options, int allow_unaligned_buffers, _In_ const char* settings_str); + +#ifdef __cplusplus +} +#endif diff --git a/include/onnxruntime/core/providers/tensorrt/tensorrt_provider_factory.h b/include/onnxruntime/core/providers/tensorrt/tensorrt_provider_factory.h index fb077fc5ff41d..f6f03f80465f4 100644 --- a/include/onnxruntime/core/providers/tensorrt/tensorrt_provider_factory.h +++ b/include/onnxruntime/core/providers/tensorrt/tensorrt_provider_factory.h @@ -7,7 +7,7 @@ extern "C" { #endif -ORT_API_STATUS(OrtSessionOptionsAppendExecutionProvider_Tensorrt, _In_ OrtSessionOptions* options); +ORT_API_STATUS(OrtSessionOptionsAppendExecutionProvider_Tensorrt, _In_ OrtSessionOptions* options, int device_id); #ifdef __cplusplus } diff --git a/include/onnxruntime/core/session/onnxruntime_c_api.h b/include/onnxruntime/core/session/onnxruntime_c_api.h index 6848fc31e453c..899d23181750a 100644 --- a/include/onnxruntime/core/session/onnxruntime_c_api.h +++ b/include/onnxruntime/core/session/onnxruntime_c_api.h @@ -23,6 +23,10 @@ extern "C" { #define _Inout_ #define _Inout_opt_ #define _Frees_ptr_opt_ +#define _Ret_maybenull_ +#define _Ret_notnull_ +#define _Check_return_ +#define _Success_(X) #define ORT_ALL_ARGS_NONNULL __attribute__((nonnull)) #else #include @@ -127,11 +131,11 @@ typedef enum OrtErrorCode { ORT_EXPORT RETURN_TYPE ORT_API_CALL NAME(__VA_ARGS__) NO_EXCEPTION #define ORT_API_STATUS(NAME, ...) \ - ORT_EXPORT OrtStatus* ORT_API_CALL NAME(__VA_ARGS__) NO_EXCEPTION ORT_MUST_USE_RESULT + ORT_EXPORT _Check_return_ _Success_(return == 0) _Ret_maybenull_ OrtStatus* ORT_API_CALL NAME(__VA_ARGS__) NO_EXCEPTION ORT_MUST_USE_RESULT // Used in *.cc files. Almost as same as ORT_API_STATUS, except without ORT_MUST_USE_RESULT #define ORT_API_STATUS_IMPL(NAME, ...) \ - ORT_EXPORT OrtStatus* ORT_API_CALL NAME(__VA_ARGS__) NO_EXCEPTION + ORT_EXPORT _Check_return_ _Success_(return == 0) _Ret_maybenull_ OrtStatus* ORT_API_CALL NAME(__VA_ARGS__) NO_EXCEPTION #define ORT_RUNTIME_CLASS(X) \ struct Ort##X; \ @@ -143,16 +147,13 @@ ORT_RUNTIME_CLASS(Env); ORT_RUNTIME_CLASS(Status); // nullptr for Status* indicates success ORT_RUNTIME_CLASS(Provider); ORT_RUNTIME_CLASS(AllocatorInfo); -ORT_RUNTIME_CLASS(Session); +ORT_RUNTIME_CLASS(Session); //Don't call OrtReleaseSession from Dllmain (because session owns a thread pool) ORT_RUNTIME_CLASS(Value); -ORT_RUNTIME_CLASS(ValueList); ORT_RUNTIME_CLASS(RunOptions); ORT_RUNTIME_CLASS(TypeInfo); ORT_RUNTIME_CLASS(TensorTypeAndShapeInfo); ORT_RUNTIME_CLASS(SessionOptions); -ORT_RUNTIME_CLASS(Callback); ORT_RUNTIME_CLASS(CustomOpDomain); -ORT_RUNTIME_CLASS(Allocator); // When passing in an allocator to any ORT function, be sure that the allocator object // is not destroyed until the last allocated object using it is freed. @@ -202,6 +203,9 @@ ORT_API_STATUS(OrtRun, _Inout_ OrtSession* sess, */ ORT_API_STATUS(OrtCreateSessionOptions, _Outptr_ OrtSessionOptions** options); +// Set filepath to save optimized model after graph level transformations. 
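// Hedged sketch (not part of the patch): attaching the TensorRT provider with
// the new device_id parameter. Device 0 is a placeholder; a null OrtStatus*
// indicates success, as noted on ORT_RUNTIME_CLASS(Status).
#include "core/session/onnxruntime_c_api.h"
#include "core/providers/tensorrt/tensorrt_provider_factory.h"

inline void AppendTensorRTProvider(OrtSessionOptions* session_options) {
  OrtStatus* status =
      OrtSessionOptionsAppendExecutionProvider_Tensorrt(session_options, /*device_id=*/0);
  if (status != nullptr) {
    // Inspect OrtGetErrorCode(status) if needed, then release the status object.
    OrtReleaseStatus(status);
  }
}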
+ORT_API_STATUS(OrtSetOptimizedModelFilePath, _Inout_ OrtSessionOptions* options, _In_ const ORTCHAR_T* optimized_model_filepath); + // create a copy of an existing OrtSessionOptions ORT_API_STATUS(OrtCloneSessionOptions, _In_ const OrtSessionOptions* in_options, _Outptr_ OrtSessionOptions** out_options); ORT_API_STATUS(OrtEnableSequentialExecution, _Inout_ OrtSessionOptions* options); @@ -230,15 +234,25 @@ ORT_API_STATUS(OrtSetSessionLogId, _Inout_ OrtSessionOptions* options, const cha // < applies to session load, initialization, etc ORT_API_STATUS(OrtSetSessionLogVerbosityLevel, _Inout_ OrtSessionOptions* options, int session_log_verbosity_level); +ORT_API_STATUS(OrtSetSessionLogSeverityLevel, _Inout_ OrtSessionOptions* options, int session_log_severity_level); // Set Graph optimization level. -// Available options are : 0, 1, 2. -// 0 -> Disable all optimizations -// 1 -> Enable basic optimizations -// 2 -> Enable all optimizations -ORT_API_STATUS(OrtSetSessionGraphOptimizationLevel, _Inout_ OrtSessionOptions* options, int graph_optimization_level); - -// How many threads in the session thread pool. +// TODO Add documentation about which optimizations are enabled for each value. +typedef enum GraphOptimizationLevel { + ORT_DISABLE_ALL = 0, + ORT_ENABLE_BASIC = 1, + ORT_ENABLE_EXTENDED = 2, + ORT_ENABLE_ALL = 99 +} GraphOptimizationLevel; +ORT_API_STATUS(OrtSetSessionGraphOptimizationLevel, _Inout_ OrtSessionOptions* options, + GraphOptimizationLevel graph_optimization_level); + +/** + * How many threads in the session thread pool. + * Set it to 0 to make onnxruntime run as single threaded. + * \param session_thread_pool_size <0, let the runtime choose a default. =0, Don't create extra threads. + * >0, create a thread pool with size of this value. + */ ORT_API_STATUS(OrtSetSessionThreadPoolSize, _Inout_ OrtSessionOptions* options, int session_thread_pool_size); /** @@ -279,9 +293,11 @@ ORT_API_STATUS(OrtSessionGetOutputName, _In_ const OrtSession* sess, size_t inde ORT_API_STATUS(OrtCreateRunOptions, _Outptr_ OrtRunOptions** out); ORT_API_STATUS(OrtRunOptionsSetRunLogVerbosityLevel, _Inout_ OrtRunOptions* options, int value); +ORT_API_STATUS(OrtRunOptionsSetRunLogSeverityLevel, _Inout_ OrtRunOptions* options, int value); ORT_API_STATUS(OrtRunOptionsSetRunTag, _In_ OrtRunOptions*, _In_ const char* run_tag); ORT_API_STATUS(OrtRunOptionsGetRunLogVerbosityLevel, _In_ const OrtRunOptions* options, _Out_ int* out); +ORT_API_STATUS(OrtRunOptionsGetRunLogSeverityLevel, _In_ const OrtRunOptions* options, _Out_ int* out); ORT_API_STATUS(OrtRunOptionsGetRunTag, _In_ const OrtRunOptions*, _Out_ const char** out); // Set a flag so that any running OrtRun* calls that are using this instance of OrtRunOptions @@ -336,35 +352,6 @@ ORT_API_STATUS(OrtGetStringTensorDataLength, _In_ const OrtValue* value, _Out_ s ORT_API_STATUS(OrtGetStringTensorContent, _In_ const OrtValue* value, _Out_ void* s, size_t s_len, _Out_ size_t* offsets, size_t offsets_len); -/** - * Create an OrtValue in CPU memory from a serialized TensorProto - * @param input serialized TensorProto object - * @param input_len length of 'input'. - * @param input_file_path A local file path of where the input was loaded from. Can be NULL if the tensor proto doesn't - * have any external data or it was loaded from current working dir. This path could be either a - * relative path or an absolute path. - * @param preallocated A preallocated buffer for the tensor. 
It should be allocated from CPU memory - * @param preallocated_size Length of the preallocated buffer in bytes, can be computed from - * the OrtGetTensorMemSizeInBytesFromTensorProto function. This function will return an error if the - * preallocated_size is not enough. - * @param out - * @return - */ -ORT_API_STATUS(OrtTensorProtoToOrtValue, _In_ const void* input, int input_len, - _In_opt_ const ORTCHAR_T* input_file_path, _Inout_ void* preallocated, size_t preallocated_size, - _Outptr_ OrtValue** out, _Outptr_ OrtCallback** deleter); - -/** - * f will be freed in this call - */ -ORT_API(void, OrtRunCallback, _Frees_ptr_opt_ OrtCallback* f); - -/** - * calculate the memory requirement for the OrtTensorProtoToOrtValue function - */ -ORT_API_STATUS(OrtGetTensorMemSizeInBytesFromTensorProto, _In_ const void* input, int input_len, size_t alignment, - _Out_ size_t* out); - /** * Don't free the 'out' value */ @@ -461,14 +448,16 @@ ORT_API_STATUS(OrtAllocatorAlloc, _Inout_ OrtAllocator* ptr, size_t size, _Outpt ORT_API_STATUS(OrtAllocatorFree, _Inout_ OrtAllocator* ptr, void* p); ORT_API_STATUS(OrtAllocatorGetInfo, _In_ const OrtAllocator* ptr, _Out_ const OrtAllocatorInfo** out); -ORT_API_STATUS(OrtCreateDefaultAllocator, _Outptr_ OrtAllocator** out); +// The returned pointer doesn't have to be freed. +// Always returns the same instance on every invocation. +ORT_API_STATUS(OrtGetAllocatorWithDefaultOptions, _Outptr_ OrtAllocator** out); ORT_API(const char*, OrtGetVersionString); /** * \param msg A null-terminated string. Its content will be copied into the newly created OrtStatus */ -ORT_API(OrtStatus*, OrtCreateStatus, OrtErrorCode code, _In_ const char* msg) -ORT_ALL_ARGS_NONNULL; +ORT_EXPORT _Check_return_ _Ret_notnull_ OrtStatus* ORT_API_CALL OrtCreateStatus(OrtErrorCode code, _In_ const char* msg) NO_EXCEPTION + ORT_ALL_ARGS_NONNULL; ORT_API(OrtErrorCode, OrtGetErrorCode, _In_ const OrtStatus* status) ORT_ALL_ARGS_NONNULL; diff --git a/include/onnxruntime/core/session/onnxruntime_cxx_api.h b/include/onnxruntime/core/session/onnxruntime_cxx_api.h index e21e87596781e..82263bb071a7a 100644 --- a/include/onnxruntime/core/session/onnxruntime_cxx_api.h +++ b/include/onnxruntime/core/session/onnxruntime_cxx_api.h @@ -43,7 +43,6 @@ struct Exception : std::exception { #define ORT_DEFINE_RELEASE(NAME) \ inline void OrtRelease(Ort##NAME* ptr) { OrtRelease##NAME(ptr); } -ORT_DEFINE_RELEASE(Allocator); ORT_DEFINE_RELEASE(AllocatorInfo); ORT_DEFINE_RELEASE(CustomOpDomain); ORT_DEFINE_RELEASE(Env); @@ -93,7 +92,7 @@ struct Unowned : T { ~Unowned() { this->p_ = nullptr; } }; -struct Allocator; +struct AllocatorWithDefaultOptions; struct AllocatorInfo; struct Env; struct TypeInfo; @@ -120,6 +119,9 @@ struct RunOptions : Base { RunOptions& SetRunLogVerbosityLevel(int); int GetRunLogVerbosityLevel() const; + RunOptions& SetRunLogSeverityLevel(int); + int GetRunLogSeverityLevel() const; + RunOptions& SetRunTag(const char* run_tag); const char* GetRunTag() const; @@ -135,11 +137,13 @@ struct SessionOptions : Base { SessionOptions Clone() const; SessionOptions& SetThreadPoolSize(int session_thread_pool_size); - SessionOptions& SetGraphOptimizationLevel(int graph_optimization_level); + SessionOptions& SetGraphOptimizationLevel(GraphOptimizationLevel graph_optimization_level); SessionOptions& EnableCpuMemArena(); SessionOptions& DisableCpuMemArena(); + SessionOptions& SetOptimizedModelFilePath(const ORTCHAR_T* optimized_model_file); + SessionOptions& EnableProfiling(const ORTCHAR_T* profile_file_prefix); 
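// Hedged end-to-end sketch (not part of the patch) for the session-option
// additions above: OrtSetOptimizedModelFilePath, the GraphOptimizationLevel
// enum, and the clarified OrtSetSessionThreadPoolSize contract. Assumes a
// non-Windows build where ORTCHAR_T is char; the output file name is a
// placeholder.
#include "core/session/onnxruntime_c_api.h"
#include <cstdio>

static void CheckOrtStatus(OrtStatus* status) {
  if (status != nullptr) {  // a null status means success
    std::fprintf(stderr, "ORT error code: %d\n", static_cast<int>(OrtGetErrorCode(status)));
    OrtReleaseStatus(status);
  }
}

int main() {
  OrtSessionOptions* session_options = nullptr;
  CheckOrtStatus(OrtCreateSessionOptions(&session_options));
  // Run basic + extended graph transformations and persist the optimized graph.
  CheckOrtStatus(OrtSetSessionGraphOptimizationLevel(session_options, ORT_ENABLE_EXTENDED));
  CheckOrtStatus(OrtSetOptimizedModelFilePath(session_options, "model.optimized.onnx"));
  // 0: do not create extra threads; a negative value lets the runtime choose.
  CheckOrtStatus(OrtSetSessionThreadPoolSize(session_options, 0));
  OrtReleaseSessionOptions(session_options);
  return 0;
}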
SessionOptions& DisableProfiling(); @@ -225,16 +229,19 @@ struct Value : Base { TensorTypeAndShapeInfo GetTensorTypeAndShapeInfo() const; }; -struct Allocator : Base { - static Allocator CreateDefault(); +struct AllocatorWithDefaultOptions { + AllocatorWithDefaultOptions(); - explicit Allocator(nullptr_t) {} - explicit Allocator(OrtAllocator* p) : Base{p} {} + operator OrtAllocator*() { return p_; } + operator const OrtAllocator*() const { return p_; } void* Alloc(size_t size); void Free(void* p); const OrtAllocatorInfo* GetInfo() const; + + private: + OrtAllocator* p_{}; }; struct AllocatorInfo : Base { diff --git a/include/onnxruntime/core/session/onnxruntime_cxx_inline.h b/include/onnxruntime/core/session/onnxruntime_cxx_inline.h index 0fbbbde445b16..3670fdfe71bc6 100644 --- a/include/onnxruntime/core/session/onnxruntime_cxx_inline.h +++ b/include/onnxruntime/core/session/onnxruntime_cxx_inline.h @@ -39,23 +39,21 @@ struct TypeToTensorType { static constexpr ONNXTensorElementDataType t template <> struct TypeToTensorType { static constexpr ONNXTensorElementDataType type = ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT64; }; -inline Allocator Allocator::CreateDefault() { - OrtAllocator* p; - ORT_THROW_ON_ERROR(OrtCreateDefaultAllocator(&p)); - return Allocator(p); +inline AllocatorWithDefaultOptions::AllocatorWithDefaultOptions() { + ORT_THROW_ON_ERROR(OrtGetAllocatorWithDefaultOptions(&p_)); } -inline void* Allocator::Alloc(size_t size) { +inline void* AllocatorWithDefaultOptions::Alloc(size_t size) { void* out; ORT_THROW_ON_ERROR(OrtAllocatorAlloc(p_, size, &out)); return out; } -inline void Allocator::Free(void* p) { +inline void AllocatorWithDefaultOptions::Free(void* p) { ORT_THROW_ON_ERROR(OrtAllocatorFree(p_, p)); } -inline const OrtAllocatorInfo* Allocator::GetInfo() const { +inline const OrtAllocatorInfo* AllocatorWithDefaultOptions::GetInfo() const { const OrtAllocatorInfo* out; ORT_THROW_ON_ERROR(OrtAllocatorGetInfo(p_, &out)); return out; @@ -96,6 +94,11 @@ inline RunOptions& RunOptions::SetRunLogVerbosityLevel(int level) { return *this; } +inline RunOptions& RunOptions::SetRunLogSeverityLevel(int level) { + ORT_THROW_ON_ERROR(OrtRunOptionsSetRunLogSeverityLevel(p_, level)); + return *this; +} + inline int RunOptions::GetRunLogVerbosityLevel() const { int out; ORT_THROW_ON_ERROR(OrtRunOptionsGetRunLogVerbosityLevel(p_, &out)); @@ -138,11 +141,16 @@ inline SessionOptions& SessionOptions::SetThreadPoolSize(int session_thread_pool return *this; } -inline SessionOptions& SessionOptions::SetGraphOptimizationLevel(int graph_optimization_level) { +inline SessionOptions& SessionOptions::SetGraphOptimizationLevel(GraphOptimizationLevel graph_optimization_level) { ORT_THROW_ON_ERROR(OrtSetSessionGraphOptimizationLevel(p_, graph_optimization_level)); return *this; } +inline SessionOptions& SessionOptions::SetOptimizedModelFilePath(const ORTCHAR_T* optimized_model_filepath) { + ORT_THROW_ON_ERROR(OrtSetOptimizedModelFilePath(p_, optimized_model_filepath)); + return *this; +} + inline SessionOptions& SessionOptions::EnableProfiling(const ORTCHAR_T* profile_file_prefix) { ORT_THROW_ON_ERROR(OrtEnableProfiling(p_, profile_file_prefix)); return *this; diff --git a/onnxruntime/__init__.py b/onnxruntime/__init__.py index 29e8f5fb33ebf..4d222e6945916 100644 --- a/onnxruntime/__init__.py +++ b/onnxruntime/__init__.py @@ -18,4 +18,4 @@ from onnxruntime.capi import onnxruntime_validation onnxruntime_validation.check_distro_info() from onnxruntime.capi.session import InferenceSession -from 
onnxruntime.capi._pybind_state import RunOptions, SessionOptions, set_default_logger_severity, get_device, NodeArg, ModelMetadata +from onnxruntime.capi._pybind_state import get_device, RunOptions, SessionOptions, NodeArg, ModelMetadata, GraphOptimizationLevel diff --git a/onnxruntime/automl_ops/automl_featurizers.h b/onnxruntime/automl_ops/automl_featurizers.h new file mode 100644 index 0000000000000..37e6e982d9a62 --- /dev/null +++ b/onnxruntime/automl_ops/automl_featurizers.h @@ -0,0 +1,8 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +// Cumulative header with automl featurizers includes exposed to +// ORT +#pragma once + +#include "core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.h" diff --git a/onnxruntime/automl_ops/automl_types.cc b/onnxruntime/automl_ops/automl_types.cc new file mode 100644 index 0000000000000..8f0cb77701606 --- /dev/null +++ b/onnxruntime/automl_ops/automl_types.cc @@ -0,0 +1,39 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/common/common.h" +#include "core/framework/data_types.h" +#include "core/framework/op_kernel.h" + +#include "automl_ops/automl_types.h" +#include "automl_ops/automl_featurizers.h" + +namespace dtf = Microsoft::Featurizer::DateTimeFeaturizer; + +namespace onnxruntime { + +// This temporary to register custom types so ORT is aware of it +// although it still can not serialize such a type. +// These character arrays must be extern so the resulting instantiated template +// is globally unique + +extern const char kMsAutoMLDomain[] = "com.microsoft.automl"; + +extern const char kTimepointName[] = "DateTimeFeaturizer_TimePoint"; +// This has to be under onnxruntime to properly specialize a function template +ORT_REGISTER_OPAQUE_TYPE(dtf::TimePoint, kMsAutoMLDomain, kTimepointName); + +namespace automl { + +#define REGISTER_CUSTOM_PROTO(TYPE, reg_fn) \ + { \ + MLDataType mltype = DataTypeImpl::GetType(); \ + reg_fn(mltype); \ + } + +void RegisterAutoMLTypes(const std::function& reg_fn) { + REGISTER_CUSTOM_PROTO(dtf::TimePoint, reg_fn); +} +#undef REGISTER_CUSTOM_PROTO +} // namespace automl +} // namespace onnxruntime diff --git a/onnxruntime/automl_ops/automl_types.h b/onnxruntime/automl_ops/automl_types.h new file mode 100644 index 0000000000000..798c6778966bb --- /dev/null +++ b/onnxruntime/automl_ops/automl_types.h @@ -0,0 +1,13 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include "core/framework/data_types.h" +#include + +namespace onnxruntime { +namespace automl { +void RegisterAutoMLTypes(const std::function& reg_fn); +} // namespace automl +} // namespace onnxruntime diff --git a/onnxruntime/automl_ops/cpu/datetime_transformer.cc b/onnxruntime/automl_ops/cpu/datetime_transformer.cc new file mode 100644 index 0000000000000..05a655f8d7453 --- /dev/null +++ b/onnxruntime/automl_ops/cpu/datetime_transformer.cc @@ -0,0 +1,42 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
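// Hedged sketch (not in the patch): consuming the RegisterAutoMLTypes hook
// declared above. The collector below is illustrative; a real caller would
// forward each MLDataType to the runtime's type registry. Assumes the callback
// signature std::function<void(MLDataType)> implied by the registration code.
#include "automl_ops/automl_types.h"
#include <vector>

namespace onnxruntime {
inline std::vector<MLDataType> CollectAutoMLTypes() {
  std::vector<MLDataType> types;
  automl::RegisterAutoMLTypes([&types](MLDataType t) { types.push_back(t); });
  return types;  // currently just the DateTimeFeaturizer TimePoint opaque type
}
}  // namespace onnxruntime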
+ +#include "core/common/common.h" +#include "core/framework/data_types.h" +#include "core/framework/op_kernel.h" + +#include "core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.h" + +namespace dtf = Microsoft::Featurizer::DateTimeFeaturizer; + +namespace onnxruntime { +namespace automl { + +class DateTimeTransformer final : public OpKernel { + public: + explicit DateTimeTransformer(const OpKernelInfo& info) : OpKernel(info) {} + Status Compute(OpKernelContext* context) const override; +}; + +Status DateTimeTransformer::Compute(OpKernelContext* ctx) const { + Status s; + auto input_tensor = ctx->Input(0); + dtf::TimePoint* output = ctx->Output(0); + + int64_t tp = *input_tensor->Data(); + std::chrono::system_clock::time_point sys_time{std::chrono::seconds(tp)}; + *output = dtf::SystemToDPTimePoint(sys_time); + return s; +} + +ONNX_OPERATOR_KERNEL_EX( + DateTimeTransformer, + kMSAutoMLDomain, + 1, + kCpuExecutionProvider, + KernelDefBuilder() + .TypeConstraint("T1", DataTypeImpl::GetTensorType()) + .TypeConstraint("T2", DataTypeImpl::GetType()), + DateTimeTransformer); +} // namespace automl +} // namespace onnxruntime diff --git a/onnxruntime/automl_ops/cpu_automl_kernels.cc b/onnxruntime/automl_ops/cpu_automl_kernels.cc new file mode 100644 index 0000000000000..23d5e2ad72e6a --- /dev/null +++ b/onnxruntime/automl_ops/cpu_automl_kernels.cc @@ -0,0 +1,25 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "automl_ops/cpu_automl_kernels.h" +#include "core/graph/constants.h" +#include "core/framework/data_types.h" + +namespace onnxruntime { +namespace automl { + +class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSAutoMLDomain, 1, DateTimeTransformer); + +void RegisterCpuAutoMLKernels(KernelRegistry& kernel_registry) { + static const BuildKernelCreateInfoFn function_table[] = { + // add more kernels here + BuildKernelCreateInfo + }; + + for (auto& function_table_entry : function_table) { + kernel_registry.Register(function_table_entry()); + } +} + +} // namespace automl +} // namespace onnxruntime diff --git a/onnxruntime/automl_ops/cpu_automl_kernels.h b/onnxruntime/automl_ops/cpu_automl_kernels.h new file mode 100644 index 0000000000000..f14a8983d5a39 --- /dev/null +++ b/onnxruntime/automl_ops/cpu_automl_kernels.h @@ -0,0 +1,13 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
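// Hedged sketch (not in the patch): wiring the new AutoML kernel registration
// into a kernel registry. The factory function is illustrative and assumes
// KernelRegistry is default-constructible, as elsewhere in the runtime.
#include "automl_ops/cpu_automl_kernels.h"
#include <memory>

namespace onnxruntime {
inline std::shared_ptr<KernelRegistry> CreateCpuAutoMLKernelRegistry() {
  auto registry = std::make_shared<KernelRegistry>();
  // Registers DateTimeTransformer (domain com.microsoft.automl, opset 1) and
  // any future entries added to the function table above.
  automl::RegisterCpuAutoMLKernels(*registry);
  return registry;
}
}  // namespace onnxruntime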
+ +#pragma once + +#include "core/framework/op_kernel.h" +#include "core/framework/kernel_registry.h" + +namespace onnxruntime { +namespace automl { +void RegisterCpuAutoMLKernels(KernelRegistry& kernel_registry); +} // namespace automl +} // namespace onnxruntime diff --git a/onnxruntime/contrib_ops/cpu/attnlstm/attention_wrapper.cc b/onnxruntime/contrib_ops/cpu/attnlstm/attention_wrapper.cc index 4555713a59fe1..8757ccb35f771 100644 --- a/onnxruntime/contrib_ops/cpu/attnlstm/attention_wrapper.cc +++ b/onnxruntime/contrib_ops/cpu/attnlstm/attention_wrapper.cc @@ -16,7 +16,7 @@ template AttentionWrapper::AttentionWrapper(AllocatorPtr alloc, const logging::Logger& logger, int batch_size, int attn_context_depth, int attn_layer_depth, int inner_cell_hidden_size, bool has_attn_layer, - const IAttentionMechanism& attention_mechanism) + const IAttentionMechanism& attention_mechanism, concurrency::ThreadPool* threadpool) : allocator_(alloc), logger_(logger), batch_size_(batch_size), @@ -24,7 +24,8 @@ AttentionWrapper::AttentionWrapper(AllocatorPtr alloc, const logging::Logger& attn_layer_depth_(attn_layer_depth), inner_cell_hidden_size_(inner_cell_hidden_size), has_attn_layer_(has_attn_layer), - attention_mechanism_(attention_mechanism) { + attention_mechanism_(attention_mechanism), + ttp_(threadpool) { auto mem_max_steps = attention_mechanism_.GetMaxMemorySteps(); prev_alignments_ = Allocate(allocator_, batch_size_ * mem_max_steps, prev_alignments_ptr_, true); alignments_ = Allocate(allocator_, batch_size_ * mem_max_steps, alignments_ptr_, true); @@ -37,11 +38,11 @@ template void AttentionWrapper::ProcessOutput(const gsl::span& rnn_cell_output) { if (has_attn_layer_) { // rnn_cell_output * cell_weights, (part of the attention layer above the attention mechanism). - math::GemmEx(CblasNoTrans, CblasNoTrans, - batch_size_, attn_layer_depth_, inner_cell_hidden_size_, T{1.0}, - rnn_cell_output.data(), inner_cell_hidden_size_, - attn_layer_cell_weights_.data(), attn_layer_depth_, T{0.0}, - attn_states_.data(), attn_layer_depth_, &CPUMathUtil::Instance()); + math::GemmEx(CblasNoTrans, CblasNoTrans, + batch_size_, attn_layer_depth_, inner_cell_hidden_size_, T{1.0}, + rnn_cell_output.data(), inner_cell_hidden_size_, + attn_layer_cell_weights_.data(), attn_layer_depth_, T{0.0}, + attn_states_.data(), attn_layer_depth_, ttp_); } // Get the context which is calculated within attention mechanism. @@ -54,11 +55,11 @@ void AttentionWrapper::ProcessOutput(const gsl::span& rnn_cell_outpu //concat([p_cell_output, context]) * stack([attn_layer_cell_weights_, attn_layer_attn_weights_]) = // p_cell_output * attn_layer_cell_weights_ + context * attn_layer_attn_weights_ // The first part is calulated above. Here just add the later. 
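// Hedged sketch (not in the patch): the thread-pool plumbing pattern this
// change applies to the attention-LSTM code. A kernel's Compute() obtains the
// session thread pool through the internal kernel context and passes it to
// math::GemmEx in place of &CPUMathUtil::Instance(); the helper name here is
// illustrative, while the cast and accessor match the kernels updated below.
#include "core/framework/op_kernel_context_internal.h"
#include "core/platform/threadpool.h"

namespace onnxruntime {
inline concurrency::ThreadPool* OperatorThreadPoolOf(OpKernelContext& context) {
  auto* ctx_internal = static_cast<OpKernelContextInternal*>(&context);
  return ctx_internal->GetOperatorThreadPool();  // may be null if no pool was created
}
}  // namespace onnxruntime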
- math::GemmEx(CblasNoTrans, CblasNoTrans, - batch_size_, attn_layer_depth_, attn_context_depth_, T{1.0}, - attn_context_.data(), attn_context_depth_, - attn_layer_attn_weights_.data(), attn_layer_depth_, T{1.0}, - attn_states_.data(), attn_layer_depth_, &CPUMathUtil::Instance()); + math::GemmEx(CblasNoTrans, CblasNoTrans, + batch_size_, attn_layer_depth_, attn_context_depth_, T{1.0}, + attn_context_.data(), attn_context_depth_, + attn_layer_attn_weights_.data(), attn_layer_depth_, T{1.0}, + attn_states_.data(), attn_layer_depth_, ttp_); } } diff --git a/onnxruntime/contrib_ops/cpu/attnlstm/attention_wrapper.h b/onnxruntime/contrib_ops/cpu/attnlstm/attention_wrapper.h index 2469a7b99a3fb..b6cc06c040e3a 100644 --- a/onnxruntime/contrib_ops/cpu/attnlstm/attention_wrapper.h +++ b/onnxruntime/contrib_ops/cpu/attnlstm/attention_wrapper.h @@ -8,6 +8,7 @@ #include "core/common/common.h" #include "core/common/logging/logging.h" #include "core/framework/allocator.h" +#include "core/platform/threadpool.h" namespace onnxruntime { namespace contrib { @@ -22,7 +23,7 @@ class AttentionWrapper { int attn_layer_depth, int inner_cell_hidden_size, bool has_attn_layer, - const IAttentionMechanism& attention_mechanism); + const IAttentionMechanism& attention_mechanism, concurrency::ThreadPool* threadpool); virtual ~AttentionWrapper() = default; @@ -69,6 +70,7 @@ class AttentionWrapper { bool has_attn_layer_; const IAttentionMechanism& attention_mechanism_; + concurrency::ThreadPool* ttp_; }; } // namespace contrib diff --git a/onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.cc b/onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.cc index 932ac263f8e22..74ad84b5af839 100644 --- a/onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.cc +++ b/onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.cc @@ -15,8 +15,8 @@ namespace contrib { template BahdanauAttention::BahdanauAttention(AllocatorPtr allocator, const logging::Logger& logger, int batch_size, int max_memory_step, int memory_depth, - int query_depth, int attn_depth, bool normalize) - : allocator_(allocator), logger_(logger), batch_size_(batch_size), max_memory_steps_(max_memory_step), memory_depth_(memory_depth), query_depth_(query_depth), attn_depth_(attn_depth), normalize_(normalize) { + int query_depth, int attn_depth, bool normalize, concurrency::ThreadPool* threadpool) + : allocator_(allocator), logger_(logger), batch_size_(batch_size), max_memory_steps_(max_memory_step), memory_depth_(memory_depth), query_depth_(query_depth), attn_depth_(attn_depth), normalize_(normalize), ttp_(threadpool) { values_ = Allocate(allocator_, batch_size_ * max_memory_steps_ * memory_depth_, values_ptr_, true); keys_ = Allocate(allocator_, batch_size_ * max_memory_steps_ * attn_depth_, keys_ptr_, true); processed_query_ = Allocate(allocator_, batch_size_ * attn_depth_, processed_query_ptr_, true); @@ -72,11 +72,11 @@ void BahdanauAttention::PrepareMemory( "Real memory steps ", mem_steps, " is not in (0, ", max_memory_steps_, "]"); } - math::GemmEx(CblasNoTrans, CblasNoTrans, - batch_size_ * max_memory_steps_, attn_depth_, memory_depth_, T{1.0}, - memory.data(), memory_depth_, - memory_layer_weights_.data(), attn_depth_, T{0.0}, - keys_.data(), attn_depth_, &CPUMathUtil::Instance()); + math::GemmEx(CblasNoTrans, CblasNoTrans, + batch_size_ * max_memory_steps_, attn_depth_, memory_depth_, T{1.0}, + memory.data(), memory_depth_, + memory_layer_weights_.data(), attn_depth_, T{0.0}, + keys_.data(), attn_depth_, ttp_); } template @@ -115,11 +115,11 @@ void 
BahdanauAttention::Compute( const gsl::span& output, const gsl::span& aligns) const { //process query in dense query layer without bias - math::GemmEx(CblasNoTrans, CblasNoTrans, - batch_size_, attn_depth_, query_depth_, T{1.0}, - queries.data(), query_depth_, - query_layer_weights_.data(), attn_depth_, T{0.0}, - processed_query_.data(), attn_depth_, &CPUMathUtil::Instance()); + math::GemmEx(CblasNoTrans, CblasNoTrans, + batch_size_, attn_depth_, query_depth_, T{1.0}, + queries.data(), query_depth_, + query_layer_weights_.data(), attn_depth_, T{0.0}, + processed_query_.data(), attn_depth_, ttp_); std::fill(aligns.begin(), aligns.end(), T{}); @@ -146,11 +146,11 @@ void BahdanauAttention::Compute( // Calculate the context auto outspan = output.subspan(b * memory_depth_); auto values = values_.subspan(b * max_memory_steps_ * memory_depth_); - math::GemmEx(CblasNoTrans, CblasNoTrans, - 1, memory_depth_, max_memory_steps_, T{1.0}, - alignments, max_memory_steps_, - values.data(), memory_depth_, T{0.0}, - outspan.data(), memory_depth_, &CPUMathUtil::Instance()); + math::GemmEx(CblasNoTrans, CblasNoTrans, + 1, memory_depth_, max_memory_steps_, T{1.0}, + alignments, max_memory_steps_, + values.data(), memory_depth_, T{0.0}, + outspan.data(), memory_depth_, ttp_); } } diff --git a/onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.h b/onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.h index 755af6ba6d5c3..c2bfee15c5bcc 100644 --- a/onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.h +++ b/onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.h @@ -23,7 +23,7 @@ class BahdanauAttention : public IAttentionMechanism { int memory_depth, int query_depth, int attn_depth, - bool normalize); + bool normalize, concurrency::ThreadPool* threadpool); void SetWeights( const gsl::span& attn_weights, @@ -77,6 +77,7 @@ class BahdanauAttention : public IAttentionMechanism { gsl::span mem_seq_lengths_; bool normalize_; + concurrency::ThreadPool* ttp_; }; } // namespace contrib diff --git a/onnxruntime/contrib_ops/cpu/attnlstm/deep_cpu_attn_lstm.cc b/onnxruntime/contrib_ops/cpu/attnlstm/deep_cpu_attn_lstm.cc index 7f7102475c620..50e98f834260b 100644 --- a/onnxruntime/contrib_ops/cpu/attnlstm/deep_cpu_attn_lstm.cc +++ b/onnxruntime/contrib_ops/cpu/attnlstm/deep_cpu_attn_lstm.cc @@ -8,7 +8,9 @@ #include "core/common/common.h" #include "core/common/logging/logging.h" +#include "core/platform/threadpool.h" #include "core/framework/allocator.h" +#include "core/framework/op_kernel_context_internal.h" namespace onnxruntime { namespace contrib { @@ -70,6 +72,9 @@ static gsl::span SecondHalfSpan(const gsl::span& dspan) { template Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { + auto ctx_internal = static_cast(&context); + concurrency::ThreadPool* thread_pool = ctx_internal->GetOperatorThreadPool(); + auto& logger = context.Logger(); // original lstm processing @@ -236,7 +241,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { memory_depth, query_depth, am_attn_size, - false); + false, thread_pool); fam.SetWeights( FirstHalfSpan(am_v_weights.DataAsSpan()), @@ -252,7 +257,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { attn_layer_depth, hidden_size_, has_attention_layer, - fam); + fam, thread_pool); faw.SetWeights(FirstHalfSpan(attn_layer_weights_span)); UniDirectionalAttnLstm fw( @@ -263,7 +268,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { activation_funcs_.Entries()[0], 
activation_funcs_.Entries()[1], activation_funcs_.Entries()[2], - clip_, ttp_); + clip_, thread_pool); BahdanauAttention bam( alloc, @@ -273,7 +278,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { memory_depth, query_depth, am_attn_size, - false); + false, thread_pool); bam.SetWeights( SecondHalfSpan(am_v_weights.DataAsSpan()), SecondHalfSpan(am_query_layer_weights.DataAsSpan()), @@ -288,7 +293,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { attn_layer_depth, hidden_size_, has_attention_layer, - bam); + bam, thread_pool); baw.SetWeights(SecondHalfSpan(attn_layer_weights_span)); UniDirectionalAttnLstm bw( @@ -299,7 +304,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { activation_funcs_.Entries()[3], activation_funcs_.Entries()[4], activation_funcs_.Entries()[5], - clip_, ttp_); + clip_, thread_pool); fw.Compute(input, sequence_lens_span, num_directions_, input_weights_1, recurrent_weights_1, output_1, hidden_output_1, last_cell_1); bw.Compute(input, sequence_lens_span, num_directions_, input_weights_2, hidden_weights_2, output_2, hidden_output_2, last_cell_2); @@ -313,7 +318,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { memory_depth, query_depth, am_attn_size, - false); + false, thread_pool); fam.SetWeights( am_v_weights.DataAsSpan(), @@ -329,7 +334,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { attn_layer_depth, hidden_size_, has_attention_layer, - fam); + fam, thread_pool); faw.SetWeights(attn_layer_weights_span); @@ -341,7 +346,7 @@ Status DeepCpuAttnLstmOp::ComputeImpl(OpKernelContext& context) const { activation_funcs_.Entries()[0], activation_funcs_.Entries()[1], activation_funcs_.Entries()[2], - clip_, ttp_); + clip_, thread_pool); fw.Compute(input, sequence_lens_span, num_directions_, input_weights_1, recurrent_weights_1, output_1, hidden_output_1, last_cell_1); } diff --git a/onnxruntime/contrib_ops/cpu/attnlstm/uni_dir_attn_lstm.cc b/onnxruntime/contrib_ops/cpu/attnlstm/uni_dir_attn_lstm.cc index caa05f9d5ceff..4183b6e2d6de4 100644 --- a/onnxruntime/contrib_ops/cpu/attnlstm/uni_dir_attn_lstm.cc +++ b/onnxruntime/contrib_ops/cpu/attnlstm/uni_dir_attn_lstm.cc @@ -45,7 +45,7 @@ UniDirectionalAttnLstm::UniDirectionalAttnLstm(AllocatorPtr allocator, const ActivationFuncs::Entry& activation_func_g, const ActivationFuncs::Entry& activation_func_h, const float clip, - onnxruntime::concurrency::ThreadPool& ttp) + onnxruntime::concurrency::ThreadPool* ttp) : allocator_(allocator), logger_(logger), seq_length_(seq_length), @@ -254,7 +254,7 @@ void UniDirectionalAttnLstm::Compute(const gsl::span& inputs_arg, input_weights.cbegin(), input_weights.cend(), // W[iofc]^T input_size_ + attention_size_, T{0.0}, output_iofc_.begin(), output_iofc_.end(), - hidden_size_x4); + hidden_size_x4, ttp_); DumpMatrix("Xt*(W[iofc]^T)", output_iofc_.data(), total_rows, hidden_size_x4); @@ -296,7 +296,7 @@ void UniDirectionalAttnLstm::Compute(const gsl::span& inputs_arg, input_weights.cbegin() + input_size_, input_weights.cend(), // WA[iofc] input_size_ + attention_size_, T{1.0}, step_out_IOFC, output_iofc_.end(), // input contains Xt*(W[iofc]^T) - hidden_size_x4); + hidden_size_x4, ttp_); // calculate Xt*(W[iofc]^T) + Ht-1*R[iofc] ComputeGemm(batch_size_, hidden_size_x4, hidden_size_, T{1.0}, @@ -305,7 +305,7 @@ void UniDirectionalAttnLstm::Compute(const gsl::span& inputs_arg, recurrent_weights.cbegin(), recurrent_weights.cend(), // R[iofc] hidden_size_, T{1.0}, 
step_out_IOFC, output_iofc_.end(), // input contains Xt*(W[iofc]^T) - hidden_size_x4); + hidden_size_x4, ttp_); span_T_iter batched_output, batched_output_end; if (output_sequence) { diff --git a/onnxruntime/contrib_ops/cpu/attnlstm/uni_dir_attn_lstm.h b/onnxruntime/contrib_ops/cpu/attnlstm/uni_dir_attn_lstm.h index 5a8e4e3224a25..2d3a6f20fe1e9 100644 --- a/onnxruntime/contrib_ops/cpu/attnlstm/uni_dir_attn_lstm.h +++ b/onnxruntime/contrib_ops/cpu/attnlstm/uni_dir_attn_lstm.h @@ -51,7 +51,7 @@ class UniDirectionalAttnLstm { const ActivationFuncs::Entry& activation_func_g, const ActivationFuncs::Entry& activation_func_h, const float clip, - onnxruntime::concurrency::ThreadPool& ttp); + onnxruntime::concurrency::ThreadPool* ttp); void Compute(const gsl::span& inputs, const gsl::span& sequence_lengths, @@ -151,7 +151,7 @@ class UniDirectionalAttnLstm { AttentionWrapper& attention_wrapper_; - onnxruntime::concurrency::ThreadPool& ttp_; + onnxruntime::concurrency::ThreadPool* ttp_; }; } // namespace detail diff --git a/onnxruntime/contrib_ops/cpu/matmul_integer16.cc b/onnxruntime/contrib_ops/cpu/matmul_integer16.cc new file mode 100644 index 0000000000000..7378cd56510d5 --- /dev/null +++ b/onnxruntime/contrib_ops/cpu/matmul_integer16.cc @@ -0,0 +1,45 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "contrib_ops/cpu/matmul_integer16.h" +#include "core/providers/cpu/math/matmul_helper.h" + +namespace onnxruntime { +namespace contrib { + +ONNX_OPERATOR_KERNEL_EX( + MatMulInteger16, + kMSDomain, + 1, + kCpuExecutionProvider, + KernelDefBuilder() + .TypeConstraint("T1", DataTypeImpl::GetTensorType()) + .TypeConstraint("T2", DataTypeImpl::GetTensorType()) + .TypeConstraint("T3", DataTypeImpl::GetTensorType()), + MatMulInteger16); + +template <> +Status MatMulInteger16::Compute(OpKernelContext* ctx) const { + auto A = ctx->Input(0); + auto B = ctx->Input(1); + ORT_ENFORCE(A != nullptr && B != nullptr); + + MatMulComputeHelper helper; + ORT_RETURN_IF_ERROR(helper.Compute(A->Shape(), B->Shape())); + Tensor* Y = ctx->Output(0, helper.OutputShape()); + + for (int i = 0; i < static_cast(helper.OutputOffsets().size()); i++) { + EigenCastGEMM( + A->template Data() + helper.LeftOffsets()[i], + B->template Data() + helper.RightOffsets()[i], + Y->template MutableData() + helper.OutputOffsets()[i], + static_cast(helper.M()), + static_cast(helper.N()), + static_cast(helper.K())); + } + + return Status::OK(); +} + +} // namespace contrib +} // namespace onnxruntime diff --git a/onnxruntime/contrib_ops/cpu/matmul_integer16.h b/onnxruntime/contrib_ops/cpu/matmul_integer16.h new file mode 100644 index 0000000000000..633e8eee52b6a --- /dev/null +++ b/onnxruntime/contrib_ops/cpu/matmul_integer16.h @@ -0,0 +1,22 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
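// Hedged sketch (not in the patch): a minimal unit test for the MatMulInteger16
// contrib kernel above, written against the repository's existing OpTester
// helper; expected values are a plain 2x2 matrix product.
#include "gtest/gtest.h"
#include "test/providers/provider_test_utils.h"

namespace onnxruntime {
namespace test {

TEST(MatMulInteger16OpTest, Simple2x2) {
  OpTester test("MatMulInteger16", 1, kMSDomain);
  // int16 inputs, int32 accumulation/output, per the type constraints above.
  test.AddInput<int16_t>("A", {2, 2}, {1, 2, 3, 4});
  test.AddInput<int16_t>("B", {2, 2}, {5, 6, 7, 8});
  test.AddOutput<int32_t>("Y", {2, 2}, {19, 22, 43, 50});
  test.Run();
}

}  // namespace test
}  // namespace onnxruntime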
+ +#pragma once + +#include "core/common/common.h" +#include "core/framework/op_kernel.h" +#include "core/util/math_cpuonly.h" + +namespace onnxruntime { +namespace contrib { + +template +class MatMulInteger16 final : public OpKernel { + public: + MatMulInteger16(const OpKernelInfo& info) : OpKernel(info) { + } + + Status Compute(OpKernelContext* context) const override; +}; +} // namespace contrib +} // namespace onnxruntime diff --git a/onnxruntime/contrib_ops/cpu/nchwc_ops.cc b/onnxruntime/contrib_ops/cpu/nchwc_ops.cc index b5625551ad104..3b14b21a79533 100644 --- a/onnxruntime/contrib_ops/cpu/nchwc_ops.cc +++ b/onnxruntime/contrib_ops/cpu/nchwc_ops.cc @@ -170,9 +170,6 @@ Status NchwcPoolBase::NchwcPool(OpKernelContext* context, MLAS_POOLING_KIND kind ORT_ENFORCE(X_shape.NumDimensions() == 4); ORT_ENFORCE((X_shape[1] % MlasNchwcGetBlockSize()) == 0); - if (!global_pooling_) { - ORT_RETURN_IF_NOT(kernel_shape_.size() == 2, "kernel_shape num_dims is not compatible with X num_dims."); - } std::vector pads = pads_; std::vector output_dims = PoolBase::SetOutputSize(X_shape, X_shape[1], &pads, dilations_, ceil_mode_); diff --git a/onnxruntime/contrib_ops/cpu/nchwc_ops.h b/onnxruntime/contrib_ops/cpu/nchwc_ops.h index 65045cd0eeb85..b9f8993114094 100644 --- a/onnxruntime/contrib_ops/cpu/nchwc_ops.h +++ b/onnxruntime/contrib_ops/cpu/nchwc_ops.h @@ -50,6 +50,8 @@ class NchwcConv : public OpKernel, public ConvBase { class NchwcPoolBase : public PoolBase { public: NchwcPoolBase(const OpKernelInfo& info) : PoolBase(info) { + if (!global_pooling_) + ORT_ENFORCE(kernel_shape_.size() == 2, "kernel_shape num_dims is not compatible with X num_dims."); } Status NchwcPool(OpKernelContext* context, MLAS_POOLING_KIND kind) const; diff --git a/onnxruntime/contrib_ops/cpu/word_conv_embedding.cc b/onnxruntime/contrib_ops/cpu/word_conv_embedding.cc index 7d7f577d5e3a1..3213ff4fc1db3 100644 --- a/onnxruntime/contrib_ops/cpu/word_conv_embedding.cc +++ b/onnxruntime/contrib_ops/cpu/word_conv_embedding.cc @@ -6,6 +6,7 @@ #include "core/util/math.h" #include "core/util/math_cpuonly.h" #include "core/mlas/inc/mlas.h" +#include "core/framework/op_kernel_context_internal.h" namespace onnxruntime { namespace contrib { @@ -45,7 +46,7 @@ void WordConvEmbedding::ComputeConvMaxPoolWithActivation( int64_t char_embedding_size, int64_t filter_width, int64_t num_filters, - float* output) const { + float* output, concurrency::ThreadPool* tp) const { int64_t input_word_size = word_len * char_embedding_size; int64_t unfolded_width = word_len - filter_width + 1; int64_t unfolded_kernal_size = filter_width * char_embedding_size; @@ -83,12 +84,12 @@ void WordConvEmbedding::ComputeConvMaxPoolWithActivation( tmp_word_inx++; } - math::GemmEx( + math::GemmEx( CblasNoTrans, CblasTrans, static_cast(words_unfolded_width), static_cast(num_filters), static_cast(unfolded_kernal_size), 1.0f, unfolded_buffer_p.get(), static_cast(unfolded_kernal_size), weights, static_cast(unfolded_kernal_size), 0.0f, - conv_buf_p, static_cast(num_filters), &CPUMathUtil::Instance()); + conv_buf_p, static_cast(num_filters), tp); for (int64_t unfolded_inx = 0; unfolded_inx < words_unfolded_width; unfolded_inx++) for (int64_t filter_inx = 0; filter_inx < num_filters; filter_inx++) { @@ -160,6 +161,9 @@ Status WordConvEmbedding::ValidateInputShape(const TensorShape& w_conv_shape, co } Status WordConvEmbedding::Compute(OpKernelContext* ctx) const { + auto ctx_internal = static_cast(ctx); + concurrency::ThreadPool* tp = ctx_internal->GetOperatorThreadPool(); + // 
original lstm processing const Tensor& sequence = *(ctx->Input(0)); // sequence: [sequence_length, word_length] const Tensor& w_conv = *(ctx->Input(1)); // conv weight: [M, C/group, kH, kW] @@ -216,7 +220,7 @@ Status WordConvEmbedding::Compute(OpKernelContext* ctx) const { char_embedding_size, filter_width, filter_size, - Y->MutableData()); + Y->MutableData(), tp); return Status::OK(); } diff --git a/onnxruntime/contrib_ops/cpu/word_conv_embedding.h b/onnxruntime/contrib_ops/cpu/word_conv_embedding.h index e74afab169fd8..5ee4127e3bfb9 100644 --- a/onnxruntime/contrib_ops/cpu/word_conv_embedding.h +++ b/onnxruntime/contrib_ops/cpu/word_conv_embedding.h @@ -8,6 +8,9 @@ #include "core/framework/tensor.h" namespace onnxruntime { +namespace concurrency { +class ThreadPool; +} namespace contrib { class WordConvEmbedding final : public OpKernel { @@ -38,7 +41,7 @@ class WordConvEmbedding final : public OpKernel { int64_t char_embedding_size, int64_t filter_width, int64_t num_filters, - float* output) const; + float* output, onnxruntime::concurrency::ThreadPool* tp) const; void CalculateLengthOfEachWordInSequence( const int* seq_ptr, int* words_len_ptr, diff --git a/onnxruntime/contrib_ops/cpu_contrib_kernels.cc b/onnxruntime/contrib_ops/cpu_contrib_kernels.cc index 8446a35bd8947..7124360dc6408 100644 --- a/onnxruntime/contrib_ops/cpu_contrib_kernels.cc +++ b/onnxruntime/contrib_ops/cpu_contrib_kernels.cc @@ -17,6 +17,7 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Range); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, WordConvEmbedding); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, GatherND); +class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MatMulInteger16); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MurmurHash3); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, MaxpoolWithMask); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Pad); @@ -87,6 +88,7 @@ void RegisterCpuContribKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizer.h b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizer.h new file mode 100644 index 0000000000000..54b737b645da9 --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizer.h @@ -0,0 +1,163 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License +// ---------------------------------------------------------------------- +#pragma once + +#include +#include + +namespace Microsoft { +namespace Featurizer { + +///////////////////////////////////////////////////////////////////////// +/// \class Transformer +/// \brief Transforms a single "value" and output the result. +/// A value can be anything from an integer to a collection +/// of integers. 
+/// +template +class Transformer { +public: + // ---------------------------------------------------------------------- + // | Public Types + using return_type = ReturnT; + using arg_type = ArgT; + using transformer_type = Transformer; + + // ---------------------------------------------------------------------- + // | Public Methods + Transformer(void) = default; + virtual ~Transformer(void) = default; + + Transformer(Transformer const &) = delete; + Transformer & operator =(Transformer const &) = delete; + + Transformer(Transformer &&) = default; + Transformer & operator =(Transformer &&) = delete; + + virtual return_type transform(arg_type const &arg) const = 0; + +private: + // ---------------------------------------------------------------------- + // | Private Methods + template + void serialize(ArchiveT &, unsigned int const /*version*/); +}; + +///////////////////////////////////////////////////////////////////////// +/// \class Estimator +/// \brief Collects state over a collection of data, then produces +/// a `Transformer` that is able to operate on that collected +/// state. +/// +template +class Estimator { +public: + // ---------------------------------------------------------------------- + // | Public Types + using transformer_type = Transformer; + using TransformerUniquePtr = std::unique_ptr; + + using estimator_type = Estimator; + + using apache_arrow = unsigned long; // TODO: Temp type as we figure out what will eventually be here + + // ---------------------------------------------------------------------- + // | Public Methods + Estimator(void) = default; + virtual ~Estimator(void) = default; + + Estimator(Estimator const &) = delete; + Estimator & operator =(Estimator const &) = delete; + + Estimator(Estimator &&) = default; + Estimator & operator =(Estimator &&) = delete; + + // This method can be called repeatedly in the support of streaming scenarios + Estimator & fit(apache_arrow const &data); + + // Calls to `commit` are destructive - all previously generated state should + // be reset. `Estimator` objects that want to share state prior to calls to commit + // should implement a `copy` method. 
+ TransformerUniquePtr commit(void); + +private: + // ---------------------------------------------------------------------- + // | Private Data + bool _committed = false; + + // ---------------------------------------------------------------------- + // | Private Methods + template + void serialize(ArchiveT &, unsigned int const /*version*/); + + virtual Estimator & fit_impl(apache_arrow const &data) = 0; + virtual TransformerUniquePtr commit_impl(void) = 0; +}; + +template +typename EstimatorT::TransformerUniquePtr fit_and_commit(typename EstimatorT::apache_arrow const &data, EstimatorConstructorArgsT &&...args); + +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +// | +// | Implementation +// | +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- + +// ---------------------------------------------------------------------- +// | +// | Transformer +// | +// ---------------------------------------------------------------------- +template +template +void Transformer::serialize(ArchiveT & /*ar*/, unsigned int const /*version*/) { +} + +// ---------------------------------------------------------------------- +// | +// | Estimator +// | +// ---------------------------------------------------------------------- +template +Estimator & Estimator::fit(apache_arrow const &data) { + if(_committed) + throw std::runtime_error("This instance has already been committed"); + + return fit_impl(data); +} + +template +typename Estimator::TransformerUniquePtr Estimator::commit(void) { + if(_committed) + throw std::runtime_error("This instance has already been committed"); + + TransformerUniquePtr result(commit_impl()); + + if(!result) + throw std::runtime_error("Invalid result"); + + _committed = true; + return result; +} + +template +template +void Estimator::serialize(ArchiveT & /*ar*/, unsigned int const /*version*/) { +} + +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +template +typename EstimatorT::TransformerUniquePtr fit_and_commit(typename EstimatorT::apache_arrow const &data, EstimatorConstructorArgsT &&...args) { + return EstimatorT(std::forward(args)...).fit(data).commit(); +} + +} // namespace Featurizer +} // namespace Microsoft diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.cpp b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.cpp new file mode 100644 index 0000000000000..56fc238d86aee --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.cpp @@ -0,0 +1,56 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License +// ---------------------------------------------------------------------- +#include "DateTimeFeaturizer.h" + +#ifdef _MSC_VER +inline struct tm *gmtime_r(time_t const* const timer, struct tm* const result) { + return gmtime_s(result, timer) == 0 ? 
result : nullptr; +} + +#endif + +namespace Microsoft { +namespace Featurizer { + +namespace DateTimeFeaturizer { + + TimePoint::TimePoint(const std::chrono::system_clock::time_point& sysTime) { + // Get to a tm to get what we need. + // Eventually C++202x will have expanded chrono support that might + // have what we need, but not yet! + std::tm tmt; + time_t tt = std::chrono::system_clock::to_time_t(sysTime); + std::tm* res = gmtime_r(&tt, &tmt); + if (res) { + year = static_cast(tmt.tm_year) + 1900; + month = static_cast(tmt.tm_mon) + 1; + day = static_cast(tmt.tm_mday); + hour = static_cast(tmt.tm_hour); + minute = static_cast(tmt.tm_min); + second = static_cast(tmt.tm_sec); + dayOfWeek = static_cast(tmt.tm_wday); + dayOfYear = static_cast(tmt.tm_yday); + quarterOfYear = (month + 2) / 3; + weekOfMonth = (day - 1) / 7; + } + else + { + if (tt < 0) { + throw std::invalid_argument("Dates prior to 1970 are not supported."); + } + else { + throw std::invalid_argument("Unknown error converting input date."); + } + } + } + + Transformer::return_type Transformer::transform(arg_type const &arg) const /*override*/ { + return Microsoft::Featurizer::DateTimeFeaturizer::TimePoint(arg); + } + + +} // namespace DateTimeFeaturizer +} // namespace Featurizer +} // namespace Microsoft diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.h b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.h new file mode 100644 index 0000000000000..e1f98351db0b4 --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/DateTimeFeaturizer.h @@ -0,0 +1,101 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License +// ---------------------------------------------------------------------- +#pragma once + +#include "../Featurizer.h" +#include +#include +#include +#include + +namespace Microsoft { +namespace Featurizer { + +///////////////////////////////////////////////////////////////////////// +/// \namespace DateTimeTransformer +/// \brief A Transformer that takes a chrono::system_clock::time_point and +/// returns a struct with all the data split out. 
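// Hedged worked example (not in the patch) of the derived calendar fields
// computed above, using the 1976-11-17 case that the unit tests added later in
// this change also exercise.
#include <cassert>

int main() {
  const int month = 11, day = 17;
  const int quarterOfYear = (month + 2) / 3;  // November -> 4th quarter
  const int weekOfMonth = (day - 1) / 7;      // the 17th -> zero-based week 2
  assert(quarterOfYear == 4);
  assert(weekOfMonth == 2);
  return 0;
}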
+/// +namespace DateTimeFeaturizer { + + ///////////////////////////////////////////////////////////////////////// + /// \struct TimePoint + /// \brief Struct to hold various components of DateTime information + /// + struct TimePoint { + std::int32_t year = 0; + std::uint8_t month = 0; /* 1-12 */ + std::uint8_t day = 0; /* 1-31 */ + std::uint8_t hour = 0; /* 0-23 */ + std::uint8_t minute = 0; /* 0-59 */ + std::uint8_t second = 0; /* 0-59 */ + std::uint8_t dayOfWeek = 0; /* 0-6 */ + std::uint16_t dayOfYear = 0; /* 0-365 */ + std::uint8_t quarterOfYear = 0; /* 1-4 */ + std::uint8_t weekOfMonth = 0; /* 0-4 */ + + // Need default __ctor to satisfy ORT type system + TimePoint() = default; + TimePoint(const std::chrono::system_clock::time_point& sysTime); + + TimePoint(TimePoint&&) = default; + TimePoint& operator=(TimePoint&&) = default; + + TimePoint(const TimePoint&) = delete; + TimePoint& operator=(const TimePoint&) = delete; + + bool operator==(const TimePoint& o) const { + return year == o.year && + month == o.month && + day == o.day && + hour == o.hour && + minute == o.minute && + second == o.second && + dayOfWeek == o.dayOfWeek && + dayOfYear == o.dayOfYear && + quarterOfYear == o.quarterOfYear && + weekOfMonth == o.weekOfMonth; + } + + enum { + JANUARY = 1, FEBRUARY, MARCH, APRIL, MAY, JUNE, + JULY, AUGUST, SEPTEMBER, OCTOBER, NOVEMBER, DECEMBER + }; + enum { + SUNDAY = 0, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY + }; + }; + + inline TimePoint SystemToDPTimePoint(const std::chrono::system_clock::time_point& sysTime) { + return TimePoint (sysTime); + } + + ///////////////////////////////////////////////////////////////////////// + /// \class DateTimeTransformer + /// \brief Transformer + /// + class Transformer : public Microsoft::Featurizer::Transformer { + public: + Transformer(void) = default; + ~Transformer(void) override = default; + + Transformer(Transformer const &) = delete; + Transformer & operator =(Transformer const &) = delete; + + Transformer(Transformer &&) = default; + Transformer & operator =(Transformer &&) = delete; + + return_type transform(arg_type const &arg) const override; + + private: + // ---------------------------------------------------------------------- + // | Private Methods + template + void serialize(ArchiveT &ar, unsigned int const version); + }; + +} // Namespace DateTimeFeaturizer +} // Namespace Featurizer +} // Namespace Microsoft diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/SampleAdd.cpp b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/SampleAdd.cpp new file mode 100644 index 0000000000000..b474ce3bd8a62 --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/SampleAdd.cpp @@ -0,0 +1,40 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. 
+// Licensed under the MIT License +// ---------------------------------------------------------------------- +#include "SampleAdd.h" + +namespace Microsoft { +namespace Featurizer { +namespace SampleAdd { + +// ---------------------------------------------------------------------- +// | +// | Transformer +// | +// ---------------------------------------------------------------------- +Transformer::Transformer(std::uint16_t delta) : + _delta(delta) { +} + +Transformer::return_type Transformer::transform(arg_type const &arg) const /*override*/ { + return _delta + arg; +} + +// ---------------------------------------------------------------------- +// | +// | Estimator +// | +// ---------------------------------------------------------------------- +Estimator & Estimator::fit_impl(apache_arrow const &data) /*override*/ { + _accumulated_delta += static_cast(data); + return *this; +} + +Estimator::TransformerUniquePtr Estimator::commit_impl(void) /*override*/ { + return std::make_unique(static_cast(_accumulated_delta)); +} + +} // namespace SampleAdd +} // namespace Featurizer +} // namespace Microsoft diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/SampleAdd.h b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/SampleAdd.h new file mode 100644 index 0000000000000..f4ca7601e5dd0 --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/SampleAdd.h @@ -0,0 +1,95 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License +// ---------------------------------------------------------------------- +#pragma once + +#include "../Featurizer.h" + +namespace Microsoft { +namespace Featurizer { + +///////////////////////////////////////////////////////////////////////// +/// \namespace SampleAdd +/// \brief A Transformer and Estimator that add values. This is a +/// sample intended to demonstrate patterns within the +/// implementation of these types. +/// +namespace SampleAdd { + +///////////////////////////////////////////////////////////////////////// +/// \class Transformer +/// \brief Transformer that adds an integer value to a saved delta +/// and returns the result. +/// +class Transformer : public Microsoft::Featurizer::Transformer { +public: + // ---------------------------------------------------------------------- + // | Public Methods + Transformer(std::uint16_t delta=0); + ~Transformer(void) override = default; + + Transformer(Transformer const &) = delete; + Transformer & operator =(Transformer const &) = delete; + + Transformer(Transformer &&) = default; + Transformer & operator =(Transformer &&) = delete; + + return_type transform(arg_type const &arg) const override; + +private: + // ---------------------------------------------------------------------- + // | Private Data + std::uint32_t const _delta; + + // ---------------------------------------------------------------------- + // | Private Methods + template + void serialize(ArchiveT &ar, unsigned int const version); +}; + +///////////////////////////////////////////////////////////////////////// +/// \class Estimator +/// \brief Estimator that accumulates a delta value and then +/// creates a Transformer with than value when requested. 
+/// +class Estimator : public Microsoft::Featurizer::Estimator { +public: + // ---------------------------------------------------------------------- + // | Public Methods + Estimator(void) = default; + ~Estimator(void) override = default; + + Estimator(Estimator const &) = delete; + Estimator & operator =(Estimator const &) = delete; + + Estimator(Estimator &&) = default; + Estimator & operator =(Estimator &&) = delete; + +private: + // ---------------------------------------------------------------------- + // | Private Data + std::uint32_t _accumulated_delta = 0; + + // ---------------------------------------------------------------------- + // | Private Methods + template + void serialize(ArchiveT &ar, unsigned int const version); + + Estimator & fit_impl(apache_arrow const &data) override; + TransformerUniquePtr commit_impl(void) override; +}; + +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +// | +// | Implementation +// | +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +// ---------------------------------------------------------------------- +} // namespace SampleAdd + +} // namespace Featurizer +} // namespace Microsoft diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/CMakeLists.txt b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/CMakeLists.txt new file mode 100644 index 0000000000000..acbc320062979 --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/CMakeLists.txt @@ -0,0 +1,48 @@ +# ---------------------------------------------------------------------- +# Copyright (c) Microsoft Corporation. All rights reserved. 
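// Hedged sketch (not in the patch): the fit/commit life cycle the Featurizer
// Estimator interface prescribes, using the SampleAdd estimator above as the
// concrete type. The include path is illustrative, and the uint16/uint32
// argument and result types follow the SampleAdd implementation.
#include "SampleAdd.h"
#include <cassert>

int main() {
  Microsoft::Featurizer::SampleAdd::Estimator estimator;
  estimator.fit(10).fit(5);                 // fit() may be called repeatedly (streaming)
  auto transformer = estimator.commit();    // commit() finalizes; later fit() calls throw
  assert(transformer->transform(1) == 16);  // accumulated delta 15 + argument 1
  return 0;
}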
+# Licensed under the MIT License +# ---------------------------------------------------------------------- +cmake_minimum_required(VERSION 3.5.0) + +project(Featurizer_UnitTests LANGUAGES CXX) + +set(CMAKE_MODULE_PATH "$ENV{DEVELOPMENT_ENVIRONMENT_CMAKE_MODULE_PATH}") + +if(NOT WIN32) + string(REPLACE ":" ";" CMAKE_MODULE_PATH "${CMAKE_MODULE_PATH}") + string(REPLACE ":" ";" _includes "$ENV{INCLUDE}") + string(REPLACE ":" ";" _libs "$ENV{LIB}") +endif() + +set(CppCommon_STATIC_CRT ON CACHE BOOL "" FORCE) +set(BoostCommon_HEADER_ONLY ON CACHE BOOL "" FORCE) + +include(CppCommon) +include(BoostCommon) + +set(CMAKE_CXX_STANDARD 14) +set(CMAKE_CXX_STANDARD_REQUIRED ON) +set(CMAKE_CXX_EXTENSIONS OFF) + +add_library(libFeaturizers STATIC + ../SampleAdd.h + ../SampleAdd.cpp + ../DateTimeFeaturizer.h + ../DateTimeFeaturizer.cpp +) + +enable_testing() + +foreach(_test_name IN ITEMS + SampleAdd_UnitTest + DateTimeFeaturizer_UnitTests +) + add_executable(${_test_name} ${_test_name}.cpp) + + target_include_directories(${_test_name} PRIVATE ${_includes}) + target_link_directories(${_test_name} PRIVATE ${_libs}) + + target_link_libraries(${_test_name} PRIVATE ${Boost_LIBRARIES} libFeaturizers) + + add_test(NAME ${_test_name} COMMAND ${_test_name} --success) +endforeach() diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/DateTimeFeaturizer_UnitTests.cpp b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/DateTimeFeaturizer_UnitTests.cpp new file mode 100644 index 0000000000000..d81bb22964dbe --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/DateTimeFeaturizer_UnitTests.cpp @@ -0,0 +1,125 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. 
+// Licensed under the MIT License +// ---------------------------------------------------------------------- + +#define CATCH_CONFIG_MAIN +#include +#include "gtest/gtest.h" + +#include "../DateTimeFeaturizer.h" + + +namespace Microsoft { +namespace Featurizer { +namespace DateTimeFeaturizer { + +using SysClock = std::chrono::system_clock; + +TEST(DateTimeFeaturizer_DateTime, Past_1976_Nov_17__12_27_04) { + const time_t date = 217081624; + SysClock::time_point stp = SysClock::from_time_t(date); + + // Constructor + TimePoint tp(stp); + ASSERT_EQ(tp.year, 1976); + ASSERT_EQ(tp.month, TimePoint::NOVEMBER); + ASSERT_EQ(tp.day, 17); + ASSERT_EQ(tp.hour, 12); + ASSERT_EQ(tp.minute, 27); + ASSERT_EQ(tp.second, 4); + ASSERT_EQ(tp.dayOfWeek, TimePoint::WEDNESDAY); + ASSERT_EQ(tp.dayOfYear, 321); + ASSERT_EQ(tp.quarterOfYear, 4); + ASSERT_EQ(tp.weekOfMonth, 2); + + // assignment + TimePoint tp1 = stp; + ASSERT_EQ(tp1.year, 1976); + ASSERT_EQ(tp1.month, TimePoint::NOVEMBER); + ASSERT_EQ(tp1.day, 17); + + // function + TimePoint tp2 = SystemToDPTimePoint(stp); + ASSERT_EQ(tp2.year, 1976); + ASSERT_EQ(tp2.month, TimePoint::NOVEMBER); + ASSERT_EQ(tp2.day, 17); +} + +TEST(DateTimeFeaturizer_Transformer , Past_1976_Nov_17__12_27_05) { + const time_t date = 217081625; + SysClock::time_point stp = SysClock::from_time_t(date); + + Transformer dt; + TimePoint tp = dt.transform(stp); + ASSERT_EQ(tp.year, 1976); + ASSERT_EQ(tp.month, TimePoint::NOVEMBER); + ASSERT_EQ(tp.day, 17); + ASSERT_EQ(tp.hour, 12); + ASSERT_EQ(tp.minute, 27); + ASSERT_EQ(tp.second, 5); + ASSERT_EQ(tp.dayOfWeek, TimePoint::WEDNESDAY); + ASSERT_EQ(tp.dayOfYear, 321); + ASSERT_EQ(tp.quarterOfYear, 4); + ASSERT_EQ(tp.weekOfMonth, 2); + +} + +TEST(DateTimeFeaturizer_Transformer , Future_2025_June_30) { + const time_t date = 1751241600; + SysClock::time_point stp = SysClock::from_time_t(date); + + Transformer dt; + TimePoint tp = dt.transform(stp); + ASSERT_EQ(tp.year, 2025); + ASSERT_EQ(tp.month, TimePoint::JUNE); + ASSERT_EQ(tp.day, 30); + ASSERT_EQ(tp.hour, 0); + ASSERT_EQ(tp.minute, 0); + ASSERT_EQ(tp.second, 0); + ASSERT_EQ(tp.dayOfWeek, TimePoint::MONDAY); + ASSERT_EQ(tp.dayOfYear, 180); + ASSERT_EQ(tp.quarterOfYear, 2); + ASSERT_EQ(tp.weekOfMonth, 4); +} + +#ifdef _MSC_VER +// others define system_clock::time_point as nanoseconds (64-bit), +// which rolls over somewhere around 2260. Still a couple hundred years! 
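A quick back-of-the-envelope check of the rollover estimate in the comment above, written as a standalone constant expression:

// A signed 64-bit nanosecond count covers 2^63 - 1 ns, which is roughly
// 9.22e18 / (1e9 * 60 * 60 * 24 * 365.25) ~= 292 years, so a 1970 epoch
// runs out a little after the year 2260, consistent with the comment.
constexpr double kSecondsPerYear = 60.0 * 60.0 * 24.0 * 365.25;
constexpr double kYearsOfInt64Nanoseconds = 9223372036854775807.0 / (1e9 * kSecondsPerYear);
static_assert(kYearsOfInt64Nanoseconds > 292.0 && kYearsOfInt64Nanoseconds < 293.0,
              "int64 nanoseconds span roughly 292 years");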
+TEST(DateTimeFeaturizer_Transformer , Far_Future__2998_March_2__14_03_02) { + const time_t date = 32445842582; + SysClock::time_point stp = SysClock::from_time_t(date); + + Transformer dt; + TimePoint tp = dt.transform(stp); + ASSERT_EQ(tp.year, 2998); + ASSERT_EQ(tp.month, TimePoint::MARCH); + ASSERT_EQ(tp.day, 2); + ASSERT_EQ(tp.hour, 14); + ASSERT_EQ(tp.minute, 3); + ASSERT_EQ(tp.second, 2); + ASSERT_EQ(tp.dayOfWeek, TimePoint::FRIDAY); + ASSERT_EQ(tp.dayOfYear, 60); + ASSERT_EQ(tp.quarterOfYear, 1); + ASSERT_EQ(tp.weekOfMonth, 0); +} + +#else + +// msvcrt doesn't support negative time_t, so nothing before 1970 +TEST(DateTimeFeaturizer_Transformer, Pre_Epoch__1776_July_4) { + + const time_t date = -6106060800; + SysClock::time_point stp = SysClock::from_time_t(date); + + // Constructor + Transformer dt; + TimePoint tp = dt.transform(stp); + ASSERT_EQ(tp.year, 1776); + ASSERT_EQ(tp.month, TimePoint::JULY); + ASSERT_EQ(tp.day, 4); +} +#endif /* _MSC_VER */ +} // namespace DateTimeFeaturizer +} // namespace Featurizer +} // namespace Microsoft diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/SampleAdd_UnitTest.cpp b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/SampleAdd_UnitTest.cpp new file mode 100644 index 0000000000000..b3796ec3c4d62 --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/SampleAdd_UnitTest.cpp @@ -0,0 +1,22 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License +// ---------------------------------------------------------------------- + +#define CATCH_CONFIG_MAIN +#include "gtest/gtest.h" + +#include "../SampleAdd.h" + +TEST(SampleAddTests, Transformer) { + ASSERT_EQ(Microsoft::Featurizer::SampleAdd::Transformer(10).transform(20), 30U); + ASSERT_EQ(Microsoft::Featurizer::SampleAdd::Transformer(20).transform(1), 21U); +} + +TEST(SampleAddTests, Estimator) { + ASSERT_EQ(Microsoft::Featurizer::SampleAdd::Estimator().fit(10).commit()->transform(20), 30U); + ASSERT_EQ(Microsoft::Featurizer::SampleAdd::Estimator().fit(20).commit()->transform(1), 21U); + + ASSERT_EQ(Microsoft::Featurizer::SampleAdd::Estimator().fit(10).fit(20).commit()->transform(20), 50U); + ASSERT_EQ(Microsoft::Featurizer::SampleAdd::Estimator().fit(10).fit(20).fit(30).commit()->transform(20), 80U); +} diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/code_coverage.yaml b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/code_coverage.yaml new file mode 100644 index 0000000000000..e3f068978a9bd --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Featurizers/UnitTests/code_coverage.yaml @@ -0,0 +1,5 @@ +filter: + includes: + - Microsoft::Featurizer::* + excludes: + - std::* diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Traits.h b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Traits.h new file mode 100644 index 0000000000000..37a70a059d14a --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/Traits.h @@ -0,0 +1,218 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. 
+// Licensed under the MIT License
+// ----------------------------------------------------------------------
+
+#pragma once
+#include 
+#include 
+#include 
+#include 
+#include 
+
+namespace Microsoft {
+namespace Featurizer {
+namespace Traits {
+
+// XXX: Define the type
+template <typename T>
+struct Nullable {};
+
+/////////////////////////////////////////////////////////////////////////
+///  \namespace     Traits
+///  \brief         We have a range of types we are dealing with. Many types
+///                 have different ways to represent what a `NULL` value is
+///                 (float has NAN for example) as well as different ways to
+///                 convert the value to a string representation. By using
+///                 templates combined with partial template specialization
+///                 we can handle scenarios like these that vary based on the data type.
+///
+///                 Example: This allows us to do things like `Traits::IsNull()`
+///                 and `Traits::IsNull()` and let the trait itself deal with the
+///                 actual implementation and allows us as developers to not worry about that.
+///
+///                 This benefit is magnified because we are also using templates for our
+///                 transformers. When we declare that a transformer has type T = std::int8_t,
+///                 we can then also use `Traits::IsNull()` and the compiler will know that
+///                 `T` is a `std::int8_t` and call the appropriate template specialization.
+///
+template <typename T>
+struct Traits {};
+
+/////////////////////////////////////////////////////////////////////////
+///  \namespace     Traits
+///  \brief         When using partial template specialization, if the compiler
+///                 cannot find a more specific implementation of the template
+///                 it will fall back to the base template and use whatever is
+///                 defined there. If you have methods defined in that base template,
+///                 it makes it very difficult to debug what is going on. By
+///                 putting no implementation in the `Traits<>` template and
+///                 having the real base struct be `TraitsImpl<>`, if you try and
+///                 specify a trait that doesn't have a specialization, the compiler
+///                 can detect that and throw an error during compilation.
+///
+///                 Example: There is no template `Traits`. If you try and use it
+///                 the compiler will fall back to the `Traits<>` struct which has no methods
+///                 defined. Trying to then use `Traits` will cause a compile-time error
+///                 letting you know something isn't correct.
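A standalone sketch of the pattern described above, using illustrative names rather than the Featurizer ones:

#include <cmath>

// Empty primary template: a type without a specialization provides no
// members, so any attempt to use it fails at compile time.
template <typename T>
struct ExampleTraits {};

// Specialization: for float, "null" is represented by NaN.
template <>
struct ExampleTraits<float> {
    static bool IsNull(float value) { return std::isnan(value); }
};

int main() {
    bool is_null = ExampleTraits<float>::IsNull(0.0f);   // resolves to the float specialization
    // ExampleTraits<char>::IsNull('a');                 // would not compile: no IsNull in the primary template
    return is_null ? 1 : 0;
}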
+/// +template +struct TraitsImpl { + using nullable_type = Nullable; + static bool IsNull(nullable_type const& value) { + return !value.is_initialized(); + } +}; + +template <> +struct Traits : public TraitsImpl { + using nullable_type = float; + static bool IsNull(nullable_type const& value) { + return std::isnan(value); + } + + // static std::string ToString(nullable_type const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + using nullable_type = double; + static bool IsNull(nullable_type const& value) { + return std::isnan(value); + } + + // static std::string ToString(nullable_type const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(std::int8_t const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(std::int16_t const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(std::int32_t const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(std::int64_t const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(std::uint8_t const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + using nullable_type = Nullable; + // static std::string ToString(std::uint16_t const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(std::uint32_t const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(std::uint64_t const& value) { + // return std::to_string(value); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(std::string const& value) { + // value; + // } +}; + +template +struct Traits> : public TraitsImpl> { + // static std::string ToString(std::array const& value) { + // // Decide what to return here + // throw std::logic_error("Function not yet implemented"); + // } +}; + +template <> +struct Traits : public TraitsImpl { + // static std::string ToString(bool const& value) { + // // Decide what to return here + // throw std::logic_error("Function not yet implemented"); + // } +}; + +template +struct Traits> : public TraitsImpl> { + // static std::string ToString(std::map const& value) { + // // Decide what to return here + // throw std::logic_error("Function not yet implemented"); + // } +}; + +template +struct Traits> : public TraitsImpl> { + // static std::string ToString(std::vector const& value) { + // // Decide what to return here + // throw std::logic_error("Function not yet implemented"); + // } +}; + +template +struct Traits> : public TraitsImpl> { + // static std::string ToString(std::function const& value) { + // // Decide what to return here + // throw std::logic_error("Function not yet implemented"); + // } +}; + +template +struct Traits> : public TraitsImpl> { + using nullable_type = Nullable; + + // static std::string ToString(nullable_type const& value) { + // if (value) { + // return Traits::ToString(value.get()); + // } + + // return "NULL"; + // } +}; + +template +struct Traits> : public 
TraitsImpl> { + // static std::string ToString(std::tuple const& value) { + // // Decide what to return here + // throw std::logic_error("Function not yet implemented"); + // } +}; + +} // namespace Traits +} // namespace Featurizer +} // namespace Microsoft diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/CMakeLists.txt b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/CMakeLists.txt new file mode 100644 index 0000000000000..024c76f3443a7 --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/CMakeLists.txt @@ -0,0 +1,41 @@ +# ---------------------------------------------------------------------- +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License +# ---------------------------------------------------------------------- +cmake_minimum_required(VERSION 3.5.0) + +project(Featurizer_UnitTests LANGUAGES CXX) + +set(CMAKE_MODULE_PATH "$ENV{DEVELOPMENT_ENVIRONMENT_CMAKE_MODULE_PATH}") + +if(NOT WIN32) + string(REPLACE ":" ";" CMAKE_MODULE_PATH "${CMAKE_MODULE_PATH}") + string(REPLACE ":" ";" _includes "$ENV{INCLUDE}") + string(REPLACE ":" ";" _libs "$ENV{LIB}") +endif() + +set(CppCommon_STATIC_CRT ON CACHE BOOL "" FORCE) +set(BoostCommon_HEADER_ONLY ON CACHE BOOL "" FORCE) + +include(CppCommon) +include(BoostCommon) + +set(CMAKE_CXX_STANDARD 14) +set(CMAKE_CXX_STANDARD_REQUIRED ON) +set(CMAKE_CXX_EXTENSIONS OFF) + +enable_testing() + +foreach(_test_name IN ITEMS + Featurizer_UnitTest + Traits_UnitTests +) + add_executable(${_test_name} ${_test_name}.cpp) + + target_include_directories(${_test_name} PRIVATE ${_includes}) + target_link_directories(${_test_name} PRIVATE ${_libs}) + + target_link_libraries(${_test_name} PRIVATE ${Boost_LIBRARIES}) + + add_test(NAME ${_test_name} COMMAND ${_test_name} --success) +endforeach() diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Featurizer_UnitTest.cpp b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Featurizer_UnitTest.cpp new file mode 100644 index 0000000000000..c0340e738c1c4 --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Featurizer_UnitTest.cpp @@ -0,0 +1,104 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License +// ---------------------------------------------------------------------- + +#define CATCH_CONFIG_MAIN +#include "gtest/gtest.h" +#include "../Featurizer.h" + +class MyTransformer : public Microsoft::Featurizer::Transformer { +public: + // ---------------------------------------------------------------------- + // | Public Methods + MyTransformer(bool true_on_odd=false) : + _true_on_odd(true_on_odd) { + } + + ~MyTransformer(void) override = default; + + MyTransformer(MyTransformer const &) = delete; + MyTransformer & operator =(MyTransformer const &) = delete; + + MyTransformer(MyTransformer &&) = default; + MyTransformer & operator =(MyTransformer &&) = delete; + + return_type transform(arg_type const &arg) const override { + bool const is_odd(arg & 1); + + return _true_on_odd ? 
is_odd : !is_odd; + } + +private: + // ---------------------------------------------------------------------- + // | Private Data + bool const _true_on_odd; +}; + +class MyEstimator : public Microsoft::Featurizer::Estimator { +public: + // ---------------------------------------------------------------------- + // | Public Methods + MyEstimator(bool return_invalid_transformer=false) : + _return_invalid_transformer(return_invalid_transformer) { + } + + ~MyEstimator(void) override = default; + + MyEstimator(MyEstimator const &) = delete; + MyEstimator & operator =(MyEstimator const &) = delete; + + MyEstimator(MyEstimator &&) = default; + MyEstimator & operator =(MyEstimator &&) = delete; + +private: + // ---------------------------------------------------------------------- + // | Private Data + bool const _return_invalid_transformer; + bool _true_on_odd_state; + + // ---------------------------------------------------------------------- + // | Private Methods + MyEstimator & fit_impl(apache_arrow const &data) override { + _true_on_odd_state = static_cast(data); + return *this; + } + + TransformerUniquePtr commit_impl(void) override { + if(_return_invalid_transformer) + return TransformerUniquePtr(); + + return std::make_unique(_true_on_odd_state); + } +}; + +TEST(FeaturizerTests, TransformerFunctionality) { + ASSERT_TRUE(MyTransformer(true).transform(1)); + ASSERT_FALSE(MyTransformer(false).transform(1)); + ASSERT_FALSE(MyTransformer(true).transform(2)); + ASSERT_TRUE(MyTransformer(false).transform(2)); +} + +TEST(FeaturizerTests, EstimatorFunctionality) { + ASSERT_TRUE(MyEstimator().fit(1).commit()->transform(1)); + ASSERT_FALSE(MyEstimator().fit(0).commit()->transform(1)); + ASSERT_FALSE(MyEstimator().fit(1).commit()->transform(2)); + ASSERT_TRUE(MyEstimator().fit(0).commit()->transform(2)); +} + +TEST(FeaturizerTests, EstimatorErrors) { + MyEstimator e; + + ASSERT_NE(e.commit(), nullptr); + //CHECK_THROWS_WITH(e.fit(1), Catch::Contains("has already been committed")); + //CHECK_THROWS_WITH(e.commit(), Catch::Contains("has already been committed")); + + //CHECK_THROWS_WITH(MyEstimator(true).commit(), Catch::Matches("Invalid result")); +} + +TEST(FeaturizerTests, EstimatorFitAndCommit) { + ASSERT_TRUE(Microsoft::Featurizer::fit_and_commit(1, false)->transform(1)); + ASSERT_FALSE(Microsoft::Featurizer::fit_and_commit(0, false)->transform(1)); + ASSERT_FALSE(Microsoft::Featurizer::fit_and_commit(1, false)->transform(2)); + ASSERT_TRUE(Microsoft::Featurizer::fit_and_commit(0, false)->transform(2)); +} diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Traits_UnitTests.cpp b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Traits_UnitTests.cpp new file mode 100644 index 0000000000000..66589a5c9decc --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/Traits_UnitTests.cpp @@ -0,0 +1,40 @@ +// ---------------------------------------------------------------------- +// Copyright (c) Microsoft Corporation. All rights reserved. 
+// Licensed under the MIT License +// ---------------------------------------------------------------------- +#define CATCH_CONFIG_MAIN +#include +#include "gtest/gtest.h" + +#include "../Traits.h" + +using namespace Microsoft::Featurizer::Traits; + +// Floating point values +static_assert(std::is_same::nullable_type, float>::value, "Incorrect nullable type for float"); +static_assert(std::is_same::nullable_type, double>::value, "Incorrect nullable type for double"); + +// Int values +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::int8_t"); +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::int16_t"); +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::int32_t"); +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::int64_t"); +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::uint8_t"); +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::uint16_t"); +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::uint32_t"); +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::uint64_t"); + +// Others +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::string"); +static_assert(std::is_same>::nullable_type, Nullable>>::value, "Incorrect nullable type for std::array"); +static_assert(std::is_same::nullable_type, Nullable>::value, "Incorrect nullable type for std::string"); +static_assert(std::is_same>::nullable_type, Nullable>>::value, "Incorrect nullable type for std::string"); +static_assert(std::is_same>::nullable_type, Nullable>>::value, "Incorrect nullable type for std::string"); +static_assert(std::is_same>::nullable_type, Nullable>>::value, "Incorrect nullable type for std::string"); +static_assert(std::is_same>::nullable_type, Nullable>::value, "Incorrect nullable type for std::string"); +static_assert(std::is_same>::nullable_type, Nullable>>::value, "Incorrect nullable type for std::string"); + +// Dummy test so it will compile. Replace this with actual tests. +TEST(TraitsTests, Dummy) { + ASSERT_TRUE(true); +} diff --git a/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/test_main.cpp b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/test_main.cpp new file mode 100644 index 0000000000000..b6a004002b83c --- /dev/null +++ b/onnxruntime/core/automl/featurizers/src/FeaturizerPrep/UnitTests/test_main.cpp @@ -0,0 +1,18 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
+ +#include "gtest/gtest.h" + +GTEST_API_ int main(int argc, char** argv) { + int status = 0; + + testing::InitGoogleTest(&argc, argv); + try { + status = RUN_ALL_TESTS(); + } catch (const std::exception& ex) { + std::cerr << ex.what(); + status = -1; + } + + return status; +} diff --git a/onnxruntime/core/codegen/common/common.cc b/onnxruntime/core/codegen/common/common.cc index 757c1677dd2e5..f7a774609669c 100644 --- a/onnxruntime/core/codegen/common/common.cc +++ b/onnxruntime/core/codegen/common/common.cc @@ -120,6 +120,11 @@ std::unique_ptr ToCapacity(const onnxruntime::GraphViewer& gr meta_def->name += "_With" + std::to_string(subgraph->nodes.size()) + "Nodes_"; meta_def->name += end_node.OpType() + std::to_string(end_node_index); + std::unordered_set real_output_names; + for (const auto* def : graph.GetOutputs()) { + real_output_names.insert(def->Name()); + } + for (const auto& node_index : subgraph->nodes) { const auto& node = *graph.GetNode(node_index); // handle current graph's inputs @@ -140,6 +145,7 @@ std::unique_ptr ToCapacity(const onnxruntime::GraphViewer& gr // 1. Output NodeArg is not used by any Node // 2. Output NodeArg is used by at least one Node out of this subgraph. // Note a NodeArg can be used by Nodes in and out of the subgraph at the same time. + // 3. Output NodeArg is one of real outputs of an Ort graph. auto InsertOutputToSubgraph = [&meta_def](const NodeArg* def) { if (std::find(meta_def->outputs.begin(), meta_def->outputs.end(), def->Name()) == @@ -169,11 +175,12 @@ std::unique_ptr ToCapacity(const onnxruntime::GraphViewer& gr } } - // handle case 1 + // handle case 1 and 3 node.ForEachWithIndex( node.OutputDefs(), [&](const onnxruntime::NodeArg& def, size_t) { - if (input_names_from_the_output_node.count(def.Name()) == 0) { + if (input_names_from_the_output_node.count(def.Name()) == 0 || + real_output_names.count(def.Name()) > 0) { InsertOutputToSubgraph(&def); } return Status::OK(); diff --git a/onnxruntime/core/codegen/common/creator.h b/onnxruntime/core/codegen/common/creator.h index d15e86b5a481f..b31a12db4875b 100644 --- a/onnxruntime/core/codegen/common/creator.h +++ b/onnxruntime/core/codegen/common/creator.h @@ -25,7 +25,7 @@ class CreatorBase { CreatorBase(const std::string& name) : name_(name) {} - ~CreatorBase() = default; + virtual ~CreatorBase() = default; virtual RETURN_TYPE Evaluate(INPUT_TYPE, NODE_TYPE, diff --git a/onnxruntime/core/codegen/common/dispatcher.h b/onnxruntime/core/codegen/common/dispatcher.h index b4313cecad3a8..80a854a06977c 100644 --- a/onnxruntime/core/codegen/common/dispatcher.h +++ b/onnxruntime/core/codegen/common/dispatcher.h @@ -16,6 +16,7 @@ namespace codegen { // 2) dump corresponding name // DispatcherBase may or may not keep ownership, // depending on the template parameter, CONTENT_TYPE. 
+// Note DispatcherBase has a protected destructor template class DispatcherBase { @@ -68,6 +69,7 @@ class DispatcherBase { std::string name_; std::unordered_map contents_; ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(DispatcherBase); + ~DispatcherBase() = default; }; } // namespace codegen diff --git a/onnxruntime/core/codegen/common/profile.h b/onnxruntime/core/codegen/common/profile.h index 642ae83db723b..d9e5a9725e9e4 100644 --- a/onnxruntime/core/codegen/common/profile.h +++ b/onnxruntime/core/codegen/common/profile.h @@ -28,7 +28,7 @@ class ProfilerEvent { } // namespace onnxruntime -#define CODEGEN_PROFILER_EVENT(name) onnxruntime::ProfilerEvent name##_profiler_event(#name) +#define CODEGEN_PROFILER_EVENT(name) onnxruntime::ProfilerEvent profiler_event(name) #else diff --git a/onnxruntime/core/codegen/common/registry.h b/onnxruntime/core/codegen/common/registry.h index 1ec06d4d8d96c..c1642e76e2120 100644 --- a/onnxruntime/core/codegen/common/registry.h +++ b/onnxruntime/core/codegen/common/registry.h @@ -21,6 +21,8 @@ class RegistryBase { public: RegistryBase() = default; + virtual ~RegistryBase() = default; + bool Contains(const std::string& name) const { return contents_.count(name) > 0; } diff --git a/onnxruntime/core/codegen/common/settings.cc b/onnxruntime/core/codegen/common/settings.cc index c046f2892088d..529cb654f922c 100644 --- a/onnxruntime/core/codegen/common/settings.cc +++ b/onnxruntime/core/codegen/common/settings.cc @@ -70,5 +70,9 @@ bool CodeGenSettings::OptionMatches(const std::string& key, const std::string& v #endif } +void CodeGenSettings::Clear() { + options_.clear(); +} + } // namespace codegen } // namespace onnxruntime diff --git a/onnxruntime/core/codegen/common/settings.h b/onnxruntime/core/codegen/common/settings.h index 95a2282ccb1ff..4bce9a614b7e1 100644 --- a/onnxruntime/core/codegen/common/settings.h +++ b/onnxruntime/core/codegen/common/settings.h @@ -26,6 +26,7 @@ class CodeGenSettings { std::string GetOptionValue(const std::string& key) const; bool HasOption(const std::string& key) const; bool OptionMatches(const std::string& key, const std::string& value) const; + void Clear(); static CodeGenSettings& Instance(); private: diff --git a/onnxruntime/core/codegen/mti/math/gemm.cc b/onnxruntime/core/codegen/mti/math/gemm.cc index b5e5da5301775..7a79513ccaa97 100644 --- a/onnxruntime/core/codegen/mti/math/gemm.cc +++ b/onnxruntime/core/codegen/mti/math/gemm.cc @@ -17,10 +17,12 @@ tvm::Tensor Gemm(const tvm::Tensor& A, const tvm::Tensor& B, const tvm::Tensor& bool trans_A, bool trans_B, float alpha, float beta, const std::string& name) { auto A_dot_B = MatMul2D(A, B, trans_A, trans_B, name + "_matmul2d"); + tvm::Expr alphaExpr = tvm::make_const(A->dtype, alpha); if (beta != 0) { - return Rename(alpha * A_dot_B + (beta * C), name); + tvm::Expr betaExpr = tvm::make_const(A->dtype, beta); + return Rename(alphaExpr * A_dot_B + (betaExpr * C), name); } else { - return Rename(alpha * A_dot_B, name); + return Rename(alphaExpr * A_dot_B, name); } } diff --git a/onnxruntime/core/codegen/mti/math/matmul_ops.cc b/onnxruntime/core/codegen/mti/math/matmul_ops.cc index 672aa3a6cf8db..46f2fb75b6e24 100644 --- a/onnxruntime/core/codegen/mti/math/matmul_ops.cc +++ b/onnxruntime/core/codegen/mti/math/matmul_ops.cc @@ -117,22 +117,31 @@ tvm::Tensor MatMul(const tvm::Tensor& A, const tvm::Tensor& B, const std::string return tvm::sum(A(a_indices) * B(b_indices), {k}); }; - tvm::Array output_shape; - int64_t output_rank = std::max(a_rank, b_rank); - 
MTI_ASSERT(tvm::ir::Equal(A_shape[a_rank - 1], B_shape[b_rank - 2])); - for (int64_t i = 0; i < output_rank - 2; i++) { - tvm::Expr broadcasted_dim = tvm::make_const(HalideIR::Int(32), 1); - bool broadcasted = - BroadcastDim(A_shape, i, output_rank, broadcasted_dim) && - BroadcastDim(B_shape, i, output_rank, broadcasted_dim); - MTI_ASSERT(broadcasted); - output_shape.push_back(broadcasted_dim); - } - output_shape.push_back(A_shape[a_rank - 2]); - output_shape.push_back(B_shape[b_rank - 1]); - return tvm::compute(output_shape, l, name); + return tvm::compute(ComputeMatMulShape(A_shape, B_shape), l, name); } } +tvm::Array +ComputeMatMulShape( + const tvm::Array& A_shape, + const tvm::Array& B_shape) { + auto a_rank = A_shape.size(); + auto b_rank = B_shape.size(); + tvm::Array output_shape; + int64_t output_rank = std::max(a_rank, b_rank); + MTI_ASSERT(tvm::ir::Equal(A_shape[a_rank - 1], B_shape[b_rank - 2])); + for (int64_t i = 0; i < output_rank - 2; i++) { + tvm::Expr broadcasted_dim = tvm::make_const(HalideIR::Int(32), 1); + bool broadcasted = + BroadcastDim(A_shape, i, output_rank, broadcasted_dim) && + BroadcastDim(B_shape, i, output_rank, broadcasted_dim); + MTI_ASSERT(broadcasted); + output_shape.push_back(broadcasted_dim); + } + output_shape.push_back(A_shape[a_rank - 2]); + output_shape.push_back(B_shape[b_rank - 1]); + return output_shape; +} + } // namespace tvm_codegen } // namespace onnxruntime diff --git a/onnxruntime/core/codegen/mti/math/matmul_ops.h b/onnxruntime/core/codegen/mti/math/matmul_ops.h index c149486a87fab..7180b4f6d81e5 100644 --- a/onnxruntime/core/codegen/mti/math/matmul_ops.h +++ b/onnxruntime/core/codegen/mti/math/matmul_ops.h @@ -8,6 +8,11 @@ namespace onnxruntime { namespace tvm_codegen { +tvm::Array +ComputeMatMulShape( + const tvm::Array& A_shape, + const tvm::Array& B_shape); + tvm::Tensor MatMul2D(const tvm::Tensor& A, const tvm::Tensor& B, bool trans_a = false, bool trans_b = false, const std::string& name = "matmul2d"); tvm::Tensor MatMul(const tvm::Tensor& A, const tvm::Tensor& B, const std::string& name = "matmul"); diff --git a/onnxruntime/core/codegen/mti/math/unary_ops.cc b/onnxruntime/core/codegen/mti/math/unary_ops.cc index 7f45a9115fb0b..a9b18072988b2 100644 --- a/onnxruntime/core/codegen/mti/math/unary_ops.cc +++ b/onnxruntime/core/codegen/mti/math/unary_ops.cc @@ -21,7 +21,9 @@ tvm::Tensor Abs(const tvm::Tensor& X, const std::string& name) { } tvm::Tensor Affine(const tvm::Tensor& X, float alpha, float beta, const std::string& name) { - return Rename(alpha * X + beta, name); + tvm::Expr alphaExpr = tvm::make_const(X->dtype, alpha); + tvm::Expr betaExpr = tvm::make_const(X->dtype, beta); + return Rename(alphaExpr * X + betaExpr, name); } tvm::Tensor Ceil(const tvm::Tensor& X, const std::string& name) { @@ -39,7 +41,8 @@ tvm::Tensor Clip(const tvm::Tensor& X, float min_value, float max_value, const s } tvm::Tensor Elu(const tvm::Tensor& X, float alpha, const std::string& name) { - return Rename(Relu(X) - alpha * Relu(1 - Exp(X)), name); + tvm::Expr alphaExpr = tvm::make_const(X->dtype, alpha); + return Rename(Relu(X) - alphaExpr * Relu(1 - Exp(X)), name); } tvm::Tensor Exp(const tvm::Tensor& X, const std::string& name) { @@ -56,11 +59,14 @@ tvm::Tensor Floor(const tvm::Tensor& X, const std::string& name) { } tvm::Tensor HardSigmoid(const tvm::Tensor& X, float alpha, float beta, const std::string& name) { - return maximum(0, minimum(1, alpha * X + beta), name); + tvm::Expr alphaExpr = tvm::make_const(X->dtype, alpha); + tvm::Expr betaExpr = 
tvm::make_const(X->dtype, beta); + return maximum(0, minimum(1, alphaExpr * X + betaExpr), name); } tvm::Tensor LeakyRelu(const tvm::Tensor& X, float alpha, const std::string& name) { - return Rename(Relu(X) - alpha * Relu(0 - X), name); + tvm::Expr alphaExpr = tvm::make_const(X->dtype, alpha); + return Rename(Relu(X) - alphaExpr * Relu(0 - X), name); } tvm::Tensor Log(const tvm::Tensor& X, const std::string& name) { @@ -77,7 +83,9 @@ tvm::Tensor Neg(const tvm::Tensor& X, const std::string& name) { } tvm::Tensor ParametricSoftplus(const tvm::Tensor& X, float alpha, float beta, const std::string& name) { - return Rename(alpha * Softplus(beta * X), name); + tvm::Expr alphaExpr = tvm::make_const(X->dtype, alpha); + tvm::Expr betaExpr = tvm::make_const(X->dtype, beta); + return Rename(alphaExpr * Softplus(betaExpr * X), name); } tvm::Tensor Reciprocal(const tvm::Tensor& X, const std::string& name) { @@ -89,11 +97,15 @@ tvm::Tensor Relu(const tvm::Tensor& X, const std::string& name) { } tvm::Tensor ScaledTanh(const tvm::Tensor& X, float alpha, float beta, const std::string& name) { - return Rename(alpha * Tanh(beta * X), name); + tvm::Expr alphaExpr = tvm::make_const(X->dtype, alpha); + tvm::Expr betaExpr = tvm::make_const(X->dtype, beta); + return Rename(alphaExpr * Tanh(betaExpr * X), name); } tvm::Tensor Selu(const tvm::Tensor& X, float alpha, float gamma, const std::string& name) { - return Rename(gamma * (-alpha * Relu(1 - Exp(X)) + Relu(X)), name); + tvm::Expr alphaExpr = tvm::make_const(X->dtype, alpha); + tvm::Expr gammaExpr = tvm::make_const(X->dtype, gamma); + return Rename(gammaExpr * (-alphaExpr * Relu(1 - Exp(X)) + Relu(X)), name); } tvm::Tensor Sigmoid(const tvm::Tensor& X, const std::string& name) { @@ -135,7 +147,8 @@ tvm::Tensor Tanh(const tvm::Tensor& X, const std::string& name) { } tvm::Tensor ThresholdedRelu(const tvm::Tensor& X, float alpha, const std::string& name) { - return topi::where(greater(X, alpha), X, topi::full_like(X, tvm::make_zero(X->dtype)), name); + tvm::Expr alphaExpr = tvm::make_const(X->dtype, alpha); + return topi::where(greater(X, alphaExpr), X, topi::full_like(X, tvm::make_zero(X->dtype)), name); } } // namespace tvm_codegen diff --git a/onnxruntime/core/codegen/mti/mti_tvm_utils.cc b/onnxruntime/core/codegen/mti/mti_tvm_utils.cc index e905a34432a6e..3696deea22b3c 100644 --- a/onnxruntime/core/codegen/mti/mti_tvm_utils.cc +++ b/onnxruntime/core/codegen/mti/mti_tvm_utils.cc @@ -4,6 +4,7 @@ #include "core/codegen/mti/mti_tvm_utils.h" #include "core/codegen/common/settings.h" +#include "core/codegen/mti/tensor/reshape_ops.h" #include #include @@ -158,5 +159,38 @@ bool BroadcastDim(const tvm::Array& shape, size_t i, size_t output_ra return true; } +tvm::Array MakeInputsForExtern(const tvm::Array& inputs, const std::string& name) { + // note that currently TVM StorageFlatten creates strides like max(symbolic_dim, 1) + // which is not zero when checking symbolic_dim - max(symbolic_dim, 1) + // then triggers error like: Trying to bind compact buffer to strided one + // here's a workaround to reshape inputs to avoid that + tvm::Array fixed_inputs; + for (size_t idx_input = 0; idx_input < inputs.size(); ++idx_input) { + const auto& input = inputs[idx_input]; + tvm::Array fixed_shape; + if (input->shape.size() > 0) { + // stride compute does not use dim 0, so directly push to fixed_shape + fixed_shape.push_back(input->shape[0]); + bool need_fix = false; + for (size_t idx_dim = 1; idx_dim < input->shape.size(); ++idx_dim) { + const auto& dim = 
input->shape[idx_dim]; + if (tvm::as_const_int(dim) == nullptr) { + fixed_shape.push_back(tvm::max(dim, tvm::make_const(HalideIR::Int(32), 1))); + need_fix = true; + } else { + fixed_shape.push_back(dim); + } + } + if (need_fix) { + fixed_inputs.push_back(tvm_codegen::Reshape(input, fixed_shape, name + "_" + std::to_string(idx_input))); + continue; + } + } + // no fix needed + fixed_inputs.push_back(input); + } + return fixed_inputs; +} + } // namespace tvm_codegen } // namespace onnxruntime diff --git a/onnxruntime/core/codegen/mti/mti_tvm_utils.h b/onnxruntime/core/codegen/mti/mti_tvm_utils.h index 3f65658554f2c..034a4fe28b23a 100644 --- a/onnxruntime/core/codegen/mti/mti_tvm_utils.h +++ b/onnxruntime/core/codegen/mti/mti_tvm_utils.h @@ -60,5 +60,8 @@ inline int64_t HandleNegativeAxis(int64_t axis, int64_t rank) { return axis = axis < 0 ? (axis + rank) : axis; } +// Helper function to workaround tvm ExternOp issue when input has symbolic dimensions +tvm::Array MakeInputsForExtern(const tvm::Array& inputs, const std::string& name = "make_inputs_for_extern"); + } // namespace tvm_codegen } // namespace onnxruntime diff --git a/onnxruntime/core/codegen/passes/op_ir_creator/math/quantize/matmul_integer.cc b/onnxruntime/core/codegen/passes/op_ir_creator/math/quantize/matmul_integer.cc index 60841d049e734..6f66b1f1a2afb 100644 --- a/onnxruntime/core/codegen/passes/op_ir_creator/math/quantize/matmul_integer.cc +++ b/onnxruntime/core/codegen/passes/op_ir_creator/math/quantize/matmul_integer.cc @@ -16,19 +16,19 @@ Status GENERIC_OP_IR_CREATOR_CLASS(MatMulInteger)::Evaluate( const Node& node, CodeGenContext& ctx_codegen, tvm::Array& outputs) { - const auto& lhs_tensor = inputs[0]; - const auto& rhs_tensor = inputs[1]; + const auto& A = inputs[0]; + const auto& B = inputs[1]; auto& name = node.Name(); // A generic path, cast to int32 // Support skipped trailing inputs - auto lhs = (node.InputDefs().size() >= 3 && node.InputDefs()[2]->Exists()) - ? Sub(Cast(lhs_tensor, HalideIR::Int(32)), Cast(inputs[2], HalideIR::Int(32))) - : Cast(lhs_tensor, HalideIR::Int(32)); - auto rhs = (node.InputDefs().size() >= 4 && node.InputDefs()[3]->Exists()) - ? Sub(Cast(rhs_tensor, HalideIR::Int(32)), Cast(inputs[3], HalideIR::Int(32))) - : Cast(rhs_tensor, HalideIR::Int(32)); - tvm::Tensor Y = MatMul(lhs, rhs, name + "_MatMulInteger"); + auto A_Int32 = (node.InputDefs().size() >= 3 && node.InputDefs()[2]->Exists()) + ? Sub(Cast(A, HalideIR::Int(32)), Cast(inputs[2], HalideIR::Int(32))) + : Cast(A, HalideIR::Int(32)); + auto B_Int32 = (node.InputDefs().size() >= 4 && node.InputDefs()[3]->Exists()) + ? 
Sub(Cast(B, HalideIR::Int(32)), Cast(inputs[3], HalideIR::Int(32))) + : Cast(B, HalideIR::Int(32)); + tvm::Tensor Y = MatMul(A_Int32, B_Int32, name + "_MatMulInteger"); outputs.push_back(Y); return Status::OK(); } diff --git a/onnxruntime/core/codegen/passes/op_ir_creator/tensor/crop.cc b/onnxruntime/core/codegen/passes/op_ir_creator/tensor/crop.cc index 46adb7e984f2d..3b6a9a76f0723 100644 --- a/onnxruntime/core/codegen/passes/op_ir_creator/tensor/crop.cc +++ b/onnxruntime/core/codegen/passes/op_ir_creator/tensor/crop.cc @@ -29,7 +29,8 @@ Status GENERIC_OP_IR_CREATOR_CLASS(Crop)::Evaluate( ORT_ENFORCE(attrs.GetAttrs("border", border).IsOK()); // scale is optional and status is false when omit - attrs.GetAttrs("scale", scale); + bool is_ok = attrs.GetAttrs("scale", scale).IsOK(); + ORT_UNUSED_PARAMETER(is_ok); if (border.size() != 4) { return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT, diff --git a/onnxruntime/core/codegen/passes/op_ir_creator/tensor/transpose.cc b/onnxruntime/core/codegen/passes/op_ir_creator/tensor/transpose.cc index f4d7bb1da5e97..43999ebd1f465 100644 --- a/onnxruntime/core/codegen/passes/op_ir_creator/tensor/transpose.cc +++ b/onnxruntime/core/codegen/passes/op_ir_creator/tensor/transpose.cc @@ -21,20 +21,22 @@ Status GENERIC_OP_IR_CREATOR_CLASS(Transpose)::Evaluate( size_t input_0_shape_rank = inputs[0]->shape.size(); std::vector permute; - attrs.GetAttrs("perm", permute); + bool is_ok = attrs.GetAttrs("perm", permute).IsOK(); if (permute.size() != 0 && permute.size() != input_0_shape_rank) return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Transpose: Incorrect permute size"); std::vector default_permute; const std::vector* perm; - if (permute.size() > 0) { - perm = &permute; - } else { + // either we don't have perm attribute or the perm attribute is empty + bool use_default_perm = !is_ok || permute.size() == 0; + if (use_default_perm) { default_permute.resize(input_0_shape_rank); for (size_t i = 0; i < input_0_shape_rank; ++i) { default_permute[i] = gsl::narrow(input_0_shape_rank - 1 - i); } perm = &default_permute; + } else { + perm = &permute; } tvm::Tensor Y = Transpose(inputs[0], ToTvmArrayInt(*perm), node.Name() + "_Transpose"); diff --git a/onnxruntime/core/codegen/passes/op_ir_creator/tvm_op_creator.h b/onnxruntime/core/codegen/passes/op_ir_creator/tvm_op_creator.h index fe2648462e4f5..e29c4a9f20767 100644 --- a/onnxruntime/core/codegen/passes/op_ir_creator/tvm_op_creator.h +++ b/onnxruntime/core/codegen/passes/op_ir_creator/tvm_op_creator.h @@ -29,7 +29,7 @@ class OpIRDispatcher : public codegen::DispatcherBase { OpIRDispatcher(const std::string& name) : DispatcherBase(name) {} - ~OpIRDispatcher() = default; + virtual ~OpIRDispatcher() = default; virtual OpIRCreator* Find(const Node&) = 0; diff --git a/onnxruntime/core/codegen/passes/scheduler/tvm_scheduler.h b/onnxruntime/core/codegen/passes/scheduler/tvm_scheduler.h index 413e0fb504e89..d022497c77f7e 100644 --- a/onnxruntime/core/codegen/passes/scheduler/tvm_scheduler.h +++ b/onnxruntime/core/codegen/passes/scheduler/tvm_scheduler.h @@ -58,7 +58,7 @@ class TVMScheduleDispatcher : public codegen::DispatcherBase { TVMScheduleDispatcher(const std::string& name) : DispatcherBase(name) {} - ~TVMScheduleDispatcher() = default; + virtual ~TVMScheduleDispatcher() = default; virtual Scheduler* Find(const tvm::Tensor&, const Node*, diff --git a/onnxruntime/core/codegen/passes/utils/ort_tvm_utils.cc b/onnxruntime/core/codegen/passes/utils/ort_tvm_utils.cc index f7906b71e1189..670a540404c94 100644 --- 
a/onnxruntime/core/codegen/passes/utils/ort_tvm_utils.cc +++ b/onnxruntime/core/codegen/passes/utils/ort_tvm_utils.cc @@ -100,7 +100,7 @@ tvm::Expr ShapeDimToTvmDim(const ONNX_NAMESPACE::TensorShapeProto_Dimension& dim #ifdef CODEGEN_ENABLE_PROFILER struct event_in_bracket_and_id { bool in_bracket; - int id; + size_t id; }; std::unordered_map g_codegen_profiler_event_ids; std::vector> g_codegen_profiler_events(1024); @@ -109,7 +109,7 @@ TVM_REGISTER_GLOBAL("tvm.contrib.onnxruntime.profile_event") .set_body([](tvm::TVMArgs args, tvm::TVMRetValue* ret) { DLTensor* X = args[0]; DLTensor* Y = args[1]; - int event_id = args[2]; + size_t event_id = args[2]; bool is_begin = args[3]; if (!is_begin) { DCHECK(event_id < g_codegen_profiler_event_ids.size()); @@ -120,7 +120,7 @@ TVM_REGISTER_GLOBAL("tvm.contrib.onnxruntime.profile_event") } { - CODEGEN_PROFILER_EVENT(profile_stub); + CODEGEN_PROFILER_EVENT("profile_stub"); int64_t elem_count = 1; for (int i = 0; i < X->ndim; ++i) { elem_count *= X->shape[i]; @@ -141,7 +141,7 @@ TVM_REGISTER_GLOBAL("tvm.contrib.onnxruntime.profile_event") }); tvm::Tensor ProfileBegin(tvm::Tensor X, const std::string& event_name) { - int event_id; + size_t event_id; if (g_codegen_profiler_event_ids.count(event_name) == 0) { event_id = g_codegen_profiler_event_ids.size(); ORT_ENFORCE(event_id < g_codegen_profiler_events.size()); @@ -157,7 +157,7 @@ tvm::Tensor ProfileBegin(tvm::Tensor X, const std::string& event_name) { return topi::detail::call_packed({tvm::Expr("tvm.contrib.onnxruntime.profile_event"), topi::detail::pack_buffer(ins[0]), topi::detail::pack_buffer(outs[0]), - event_id, + gsl::narrow(event_id), true}); }, event_name + "_begin", "", {})[0]; @@ -166,7 +166,7 @@ tvm::Tensor ProfileBegin(tvm::Tensor X, const std::string& event_name) { tvm::Tensor ProfileEnd(tvm::Tensor X, const std::string& event_name) { ORT_ENFORCE(g_codegen_profiler_event_ids.at(event_name).in_bracket); g_codegen_profiler_event_ids.at(event_name).in_bracket = false; - int event_id = g_codegen_profiler_event_ids.at(event_name).id; + size_t event_id = g_codegen_profiler_event_ids.at(event_name).id; ORT_ENFORCE(event_id < g_codegen_profiler_events.size()); ORT_ENFORCE(g_codegen_profiler_events[event_id].first == event_name); return topi::detail::make_extern( @@ -175,7 +175,7 @@ tvm::Tensor ProfileEnd(tvm::Tensor X, const std::string& event_name) { return topi::detail::call_packed({tvm::Expr("tvm.contrib.onnxruntime.profile_event"), topi::detail::pack_buffer(ins[0]), topi::detail::pack_buffer(outs[0]), - event_id, + gsl::narrow(event_id), false}); }, event_name + "_end", "", {})[0]; diff --git a/onnxruntime/core/codegen/passes/weight_layout/weight_layout.h b/onnxruntime/core/codegen/passes/weight_layout/weight_layout.h index bcd9b229b5a3d..af61641a74937 100644 --- a/onnxruntime/core/codegen/passes/weight_layout/weight_layout.h +++ b/onnxruntime/core/codegen/passes/weight_layout/weight_layout.h @@ -30,7 +30,7 @@ class WeightLayout { int input_dim, float pad_zero); - ~WeightLayout() = default; + virtual ~WeightLayout() = default; // Return a CoordTransFunc from actual (transformed) coordinate to normial (original) coordinate virtual CoordTransFunc ToNominal(const tvm::Tensor& X) const = 0; diff --git a/onnxruntime/core/common/logging/capture.cc b/onnxruntime/core/common/logging/capture.cc index 016ddb9fc06be..6223d2ca70ec2 100644 --- a/onnxruntime/core/common/logging/capture.cc +++ b/onnxruntime/core/common/logging/capture.cc @@ -27,16 +27,26 @@ void Capture::ProcessPrintf(msvc_printf_check const 
char* format, va_list args) char message_buffer[kMaxMessageSize]; const auto message = gsl::make_span(message_buffer); + bool error = false; + bool truncated = false; + #if (defined(WIN32) || defined(_WIN32) || defined(__WIN32__) && !defined(__GNUC__)) + errno = 0; const int nbrcharacters = vsnprintf_s(message.data(), message.size(), _TRUNCATE, format, args); + if (nbrcharacters < 0) { + error = errno != 0; + truncated = !error; + } #else const int nbrcharacters = vsnprintf(message.data(), message.size(), format, args); + error = nbrcharacters < 0; + truncated = nbrcharacters > message.size(); #endif - if (nbrcharacters <= 0) { + if (error) { stream_ << "\n\tERROR LOG MSG NOTIFICATION: Failure to successfully parse the message"; stream_ << '"' << format << '"' << std::endl; - } else if (nbrcharacters > message.size()) { + } else if (truncated) { stream_ << message.data() << kTruncatedWarningText; } else { stream_ << message.data(); diff --git a/onnxruntime/core/common/profiler.cc b/onnxruntime/core/common/profiler.cc index d8eb1b2354027..1fa0577a676b9 100644 --- a/onnxruntime/core/common/profiler.cc +++ b/onnxruntime/core/common/profiler.cc @@ -7,6 +7,16 @@ namespace onnxruntime { namespace profiling { using namespace std::chrono; +#ifdef ENABLE_STATIC_PROFILER_INSTANCE +Profiler* Profiler::instance_ = nullptr; + +profiling::Profiler::~Profiler() { + instance_ = nullptr; +} +#else +profiling::Profiler::~Profiler() {} +#endif + ::onnxruntime::TimePoint profiling::Profiler::StartTime() const { return std::chrono::high_resolution_clock::now(); } @@ -14,6 +24,14 @@ ::onnxruntime::TimePoint profiling::Profiler::StartTime() const { void Profiler::Initialize(const logging::Logger* session_logger) { ORT_ENFORCE(session_logger != nullptr); session_logger_ = session_logger; +#ifdef ENABLE_STATIC_PROFILER_INSTANCE + // In current design, profiler instance goes with inference session. Since it's possible to have + // multiple inference sessions, profiler by definition is not singleton. However, in performance + // debugging, it would be helpful to access profiler in code that have no access to inference session, + // which is why we have this pseudo-singleton implementation here for debugging in single inference session. + ORT_ENFORCE(instance_ == nullptr, "Static profiler instance only works with single session"); + instance_ = this; +#endif } void Profiler::StartProfiling(const logging::Logger* custom_logger) { diff --git a/onnxruntime/core/common/profiler.h b/onnxruntime/core/common/profiler.h index 48ecf5747467a..815695a4fa4ed 100644 --- a/onnxruntime/core/common/profiler.h +++ b/onnxruntime/core/common/profiler.h @@ -13,6 +13,10 @@ namespace onnxruntime { namespace profiling { +// uncomment the macro below, or use -DENABLE_STATIC_PROFILER_INSTANCE for debugging +// note that static profiler instance only works with single session +//#define ENABLE_STATIC_PROFILER_INSTANCE + /** * Main class for profiling. It continues to accumulate events and produce * a corresponding "complete event (X)" in "chrome tracing" format. 
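For reference, the "chrome tracing" format mentioned above encodes each complete event as one JSON object whose ph field is "X", with microsecond ts/dur fields. A representative record (field values are illustrative) wrapped in a small C++ snippet:

#include <iostream>

int main() {
    // Illustrative only: the general shape of one "complete event (X)" record
    // as understood by the chrome://tracing viewer.
    const char* sample_event =
        R"({"cat":"Node","pid":1234,"tid":5678,"ts":100,"dur":42,)"
        R"("ph":"X","name":"example_kernel_time","args":{}})";
    std::cout << sample_event << std::endl;
    return 0;
}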
@@ -23,6 +27,8 @@ class Profiler { /// Even this function is marked as noexcept, the code inside it may throw exceptions Profiler() noexcept {}; //NOLINT + ~Profiler(); + /* Initializes Profiler with the session logger to log framework specific messages */ @@ -67,6 +73,15 @@ class Profiler { */ std::string EndProfiling(); + static Profiler& Instance() { +#ifdef ENABLE_STATIC_PROFILER_INSTANCE + ORT_ENFORCE(instance_ != nullptr); + return *instance_; +#else + ORT_THROW("Static profiler instance is not enabled, please compile with -DENABLE_STATIC_PROFILER_INSTANCE"); +#endif + } + private: ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(Profiler); @@ -82,6 +97,10 @@ class Profiler { bool max_events_reached{false}; static constexpr size_t max_num_events_ = 1000000; bool profile_with_logger_{false}; + +#ifdef ENABLE_STATIC_PROFILER_INSTANCE + static Profiler* instance_; +#endif }; } // namespace profiling diff --git a/onnxruntime/core/common/task_thread_pool.h b/onnxruntime/core/common/task_thread_pool.h deleted file mode 100644 index 1cc0d64ecfd6b..0000000000000 --- a/onnxruntime/core/common/task_thread_pool.h +++ /dev/null @@ -1,213 +0,0 @@ -/** - * Copyright (c) 2016-present, Facebook, Inc. - * - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/* -Changed to use std::packaged_task instead of std::function so exceptions can be propagated. - -This also allows the task threadpool to be shared across multiple operators as the caller -can keep a container of the packaged_task futures to check when they have completed. Calling -WaitWorkComplete in that use case is invalid as there may be other concurrent usage of the -threadpool. - -Example of that usage: - - std::vector> task_results{}; - - for (...) { - std::packaged_task task{std::bind(lambda, i)}; - task_results.push_back(task.get_future()); - task_thread_pool.RunTask(std::move(task)); - } - - try { - // wait for all and propagate any exceptions - for (auto& future : task_results) - future.get(); - } catch (const std::exception& ex) { - ... - throw; - } - -*/ - -#pragma once - -#include -#include -#include -#include -#include -#include -#include - -#include "core/common/common.h" -#include "core/common/logging/logging.h" -#include "core/platform/ort_mutex.h" - -namespace onnxruntime { - -class TaskThreadPool { - private: - struct task_element_t { - bool run_with_id; - std::packaged_task no_id; - std::packaged_task with_id; - - task_element_t(task_element_t&& other) noexcept { - run_with_id = other.run_with_id; - no_id = std::move(other.no_id); - with_id = std::move(other.with_id); - } - - explicit task_element_t(std::packaged_task&& f) - : run_with_id(false), no_id(std::move(f)) {} - - explicit task_element_t(std::packaged_task&& f) - : run_with_id(true), with_id(std::move(f)) {} - }; - - std::queue tasks_; - std::vector threads_; - OrtMutex mutex_; - OrtCondVar condition_; - OrtCondVar completed_; - bool running_; - bool complete_; - std::size_t available_; - std::size_t total_; - - public: - /// @brief Constructor. 
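The removed TaskThreadPool's header comment above relies on std::packaged_task carrying exceptions back through the stored futures; a self-contained sketch of that mechanism, with no thread pool involved:

#include <future>
#include <iostream>
#include <stdexcept>

int main() {
    std::packaged_task<void()> task([] { throw std::runtime_error("boom"); });
    std::future<void> result = task.get_future();

    task();              // the exception is stored in the shared state, not thrown here

    try {
        result.get();    // rethrows the stored exception to whoever waits on the future
    } catch (const std::exception& ex) {
        std::cout << "propagated: " << ex.what() << std::endl;
    }
    return 0;
}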
- explicit TaskThreadPool(std::size_t pool_size) - : threads_(pool_size), running_(true), complete_(true), available_(pool_size), total_(pool_size) { - for (std::size_t i = 0; i < pool_size; ++i) { - threads_[i] = std::thread(std::bind(&TaskThreadPool::MainLoop, this, i)); - } - } - - /// @brief Destructor. - ~TaskThreadPool() { - // Set running flag to false then notify all threads. - { - std::unique_lock lock(mutex_); - running_ = false; - condition_.notify_all(); - } - - try { - for (auto& t : threads_) { - t.join(); - } - } - // Suppress all exceptions. - catch (const std::exception& ex) { - LOGS_DEFAULT(ERROR) << "Exception joining threads in TaskThreadPool: " << ex.what(); - } - } - - int NumThreads() const { - return (int)threads_.size(); - } - - // This thread pool does not support ids - int CurrentThreadId() const { - return -1; - } - - void RunTask(std::packaged_task&& task) { - std::unique_lock lock(mutex_); - - // Set task and signal condition variable so that a worker thread will - // wake up and use the task. - tasks_.push(task_element_t(std::move(task))); - complete_ = false; - condition_.notify_one(); - } - - void RunTaskWithID(std::packaged_task&& task) { - std::unique_lock lock(mutex_); - - // Set task and signal condition variable so that a worker thread will - // wake up and use the task. - tasks_.push(task_element_t(std::move(task))); - complete_ = false; - condition_.notify_one(); - } - - /// @brief Wait for queue to be empty - void WaitWorkComplete() { - std::unique_lock lock(mutex_); - while (!complete_) - completed_.wait(lock); - } - - private: - ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(TaskThreadPool); - - /// @brief Entry point for pool threads. - void MainLoop(std::size_t index) { - while (running_) { - // Wait on condition variable while the task is empty and - // the pool is still running. - std::unique_lock lock(mutex_); - while (tasks_.empty() && running_) { - condition_.wait(lock); - } - - // If pool is no longer running, break out of loop. - if (!running_) break; - - // Copy task locally and remove from the queue. This is - // done within its own scope so that the task object is - // destructed immediately after running the task. This is - // useful in the event that the function contains - // shared_ptr arguments bound via bind. - { - auto task = std::move(tasks_.front()); - tasks_.pop(); - // Decrement count, indicating thread is no longer available. - --available_; - - lock.unlock(); - - // Run the task. - try { - if (task.run_with_id) { - task.with_id(index); - } else { - task.no_id(); - } - } catch (const std::exception& /*ex*/) { - // LOGS_DEFAULT(ERROR) << "Exception running TaskThreadPool task: " << ex.what(); - throw; - } - - // Update status of empty, maybe - // Need to recover the lock first - lock.lock(); - - // Increment count, indicating thread is available. 
- ++available_; - if (tasks_.empty() && available_ == total_) { - complete_ = true; - completed_.notify_one(); - } - } - } // while running_ - } -}; - -} // namespace onnxruntime diff --git a/onnxruntime/core/common/threadpool.cc b/onnxruntime/core/common/threadpool.cc index 07305a41d0645..6cdcb3add7cf0 100644 --- a/onnxruntime/core/common/threadpool.cc +++ b/onnxruntime/core/common/threadpool.cc @@ -6,174 +6,31 @@ #include -#ifdef USE_EIGEN_THREADPOOL -#if defined(_MSC_VER) -#pragma warning(disable : 4267) -#endif - #if defined(__GNUC__) #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wunused-parameter" +#else +#pragma warning(push) +#pragma warning(disable : 4267) #endif -#include +#include #if defined(__GNUC__) #pragma GCC diagnostic pop -#endif #else -#include "task_thread_pool.h" +#pragma warning(pop) #endif +using Eigen::Barrier; + namespace onnxruntime { namespace concurrency { - -// TODO: This is temporarily taken from Eigen until we upgrade its version. -// Barrier is an object that allows one or more threads to wait until -// Notify has been called a specified number of times. -class Barrier { - public: - Barrier(unsigned int count) : state_(count << 1), notified_(false) { - assert(((count << 1) >> 1) == count); - } - ~Barrier() { - assert((state_ >> 1) == 0); - } - - void Notify() { - unsigned int v = state_.fetch_sub(2, std::memory_order_acq_rel) - 2; - if (v != 1) { - assert(((v + 2) & ~1) != 0); - return; // either count has not dropped to 0, or waiter is not waiting - } - std::unique_lock l(mu_); - assert(!notified_); - notified_ = true; - cv_.notify_all(); - } - - void Wait() { - unsigned int v = state_.fetch_or(1, std::memory_order_acq_rel); - if ((v >> 1) == 0) return; - std::unique_lock l(mu_); - while (!notified_) { - cv_.wait(l); - } - } - - private: - std::mutex mu_; - std::condition_variable cv_; - std::atomic state_; // low bit is waiter flag - bool notified_; -}; - -#ifdef USE_EIGEN_THREADPOOL -class ThreadPool::Impl : public Eigen::ThreadPool { - public: - Impl(const std::string& name, int num_threads) - : Eigen::ThreadPool(num_threads) { - ORT_UNUSED_PARAMETER(name); - } - - void ParallelFor(int32_t total, std::function fn) { - // TODO: Eigen supports a more efficient ThreadPoolDevice mechanism - // We will simply rely on the work queue and stealing in the short term. - Barrier barrier(static_cast(total - 1)); - std::function handle_iteration = [&barrier, &fn](int iteration) { - fn(iteration); - barrier.Notify(); - }; - - for (int32_t id = 1; id < total; ++id) { - Schedule([=, &handle_iteration]() { handle_iteration(id); }); - } - - fn(0); - barrier.Wait(); - } - - void ParallelForRange(int64_t first, int64_t last, std::function fn) { - // TODO: Eigen supports a more efficient ThreadPoolDevice mechanism - // We will simply rely on the work queue and stealing in the short term. 
- Barrier barrier(static_cast(last - first)); - std::function handle_range = [&barrier, &fn](int64_t first, int64_t last) { - fn(first, last); - barrier.Notify(); - }; - - for (int64_t id = first + 1; id <= last; ++id) { - Schedule([=, &handle_range]() { handle_range(id, id + 1); }); - } - - fn(first, first + 1); - barrier.Wait(); - } -}; -#else -class ThreadPool::Impl : public TaskThreadPool { - public: - Impl(const std::string& name, int num_threads) - : TaskThreadPool(num_threads) { - ORT_UNUSED_PARAMETER(name); - } - - void Schedule(std::function fn) { - std::packaged_task task(fn); - RunTask(std::move(task)); - } - - void ParallelFor(int32_t total, std::function fn) { -#ifdef USE_OPENMP -#pragma omp parallel for - for (int32_t id = 0; id < total; ++id) { - fn(id); - } -#else - Barrier barrier(static_cast(total - 1)); - std::function handle_iteration = [&barrier, &fn](int iteration) { - fn(iteration); - barrier.Notify(); - }; - for (int32_t id = 1; id < total; ++id) { - std::packaged_task task(std::bind(handle_iteration, id)); - RunTask(std::move(task)); - } - fn(0); - barrier.Wait(); -#endif - } - - void ParallelForRange(int64_t first, int64_t last, std::function fn) { -#ifdef USE_OPENMP -#pragma omp parallel for - for (int64_t id = first; id < last; ++id) { - fn(id, id + 1); - } -#else - Barrier barrier(static_cast(last - first)); - std::function handle_iteration = [&barrier, &fn](int64_t first, int64_t last) { - fn(first, last); - barrier.Notify(); - }; - for (int64_t id = first + 1; id < last; ++id) { - std::packaged_task task(std::bind(handle_iteration, id, id + 1)); - RunTask(std::move(task)); - } - fn(first, first + 1); - barrier.Wait(); -#endif - } -}; -#endif - // // ThreadPool // -ThreadPool::ThreadPool(const std::string& name, int num_threads) - : impl_(std::make_unique(name, num_threads)) { -} +ThreadPool::ThreadPool(const std::string&, int num_threads) : impl_(num_threads) {} -void ThreadPool::Schedule(std::function fn) { impl_->Schedule(fn); } +void ThreadPool::Schedule(std::function fn) { impl_.Schedule(fn); } void ThreadPool::ParallelFor(int32_t total, std::function fn) { if (total <= 0) return; @@ -183,7 +40,20 @@ void ThreadPool::ParallelFor(int32_t total, std::function fn) { return; } - impl_->ParallelFor(total, fn); + // TODO: Eigen supports a more efficient ThreadPoolDevice mechanism + // We will simply rely on the work queue and stealing in the short term. + Barrier barrier(static_cast(total - 1)); + std::function handle_iteration = [&barrier, &fn](int iteration) { + fn(iteration); + barrier.Notify(); + }; + + for (int32_t id = 1; id < total; ++id) { + Schedule([=, &handle_iteration]() { handle_iteration(id); }); + } + + fn(0); + barrier.Wait(); } void ThreadPool::ParallelForRange(int64_t first, int64_t last, std::function fn) { @@ -193,18 +63,28 @@ void ThreadPool::ParallelForRange(int64_t first, int64_t last, std::functionParallelForRange(first, last, fn); + // TODO: Eigen supports a more efficient ThreadPoolDevice mechanism + // We will simply rely on the work queue and stealing in the short term. 
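[Editorial aside, not part of the patch] The new ParallelFor above and the ParallelForRange that follows both use the same fork/join shape: schedule iterations 1..N-1 onto the pool, run iteration 0 on the calling thread, then block on a barrier that each scheduled iteration notifies. Below is a minimal, self-contained sketch of that pattern using plain std primitives; SimpleBarrier and ParallelForSketch are illustrative names standing in for Eigen::Barrier and ThreadPool::Schedule, and std::thread stands in for the work queue.

```cpp
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Minimal barrier: Wait() returns once Notify() has been called `count` times.
class SimpleBarrier {
 public:
  explicit SimpleBarrier(unsigned count) : remaining_(count) {}
  void Notify() {
    std::lock_guard<std::mutex> l(mu_);
    if (--remaining_ == 0) cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return remaining_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  unsigned remaining_;
};

// ParallelFor in the style of the patch: hand off iterations 1..total-1,
// run iteration 0 inline, then wait for the handed-off work to finish.
void ParallelForSketch(int total, const std::function<void(int)>& fn) {
  if (total <= 0) return;
  if (total == 1) { fn(0); return; }

  SimpleBarrier barrier(static_cast<unsigned>(total - 1));
  std::vector<std::thread> workers;  // stands in for ThreadPool::Schedule
  for (int id = 1; id < total; ++id) {
    workers.emplace_back([&, id] { fn(id); barrier.Notify(); });
  }

  fn(0);           // calling thread does one unit of work itself
  barrier.Wait();  // block until every scheduled iteration has notified
  for (auto& t : workers) t.join();
}

int main() {
  ParallelForSketch(4, [](int i) { std::cout << "iteration " << i << "\n"; });
}
```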
+ Barrier barrier(static_cast(last - first)); + std::function handle_range = [&barrier, &fn](int64_t first, int64_t last) { + fn(first, last); + barrier.Notify(); + }; + + for (int64_t id = first + 1; id <= last; ++id) { + Schedule([=, &handle_range]() { handle_range(id, id + 1); }); + } + + fn(first, first + 1); + barrier.Wait(); } // void ThreadPool::SetStealPartitions(const std::vector>& partitions) { // impl_->SetStealPartitions(partitions); // } -int ThreadPool::NumThreads() const { return impl_->NumThreads(); } - -int ThreadPool::CurrentThreadId() const { return impl_->CurrentThreadId(); } - -ThreadPool::~ThreadPool() {} +int ThreadPool::NumThreads() const { return impl_.NumThreads(); } +int ThreadPool::CurrentThreadId() const { return impl_.CurrentThreadId(); } } // namespace concurrency } // namespace onnxruntime diff --git a/onnxruntime/core/framework/allocation_planner.cc b/onnxruntime/core/framework/allocation_planner.cc index 552d702b80e7c..5046b6e7b5fd6 100644 --- a/onnxruntime/core/framework/allocation_planner.cc +++ b/onnxruntime/core/framework/allocation_planner.cc @@ -338,12 +338,19 @@ class PlannerImpl { // Initialize execution plan: plan_.execution_plan.reserve(num_graph_nodes); + // Initialize node_has_fence. + plan_.node_has_fence.resize(graph_viewer_.MaxNodeIndex()); + // Initialize allocation plan: plan_.allocation_plan.resize(num_ml_values); } Status ComputeUseCounts() { // Note: for every ml-value, its definition must appear before all its uses in a topological sort of a valid model + std::unordered_set graph_inputs; + for (auto& graph_input : graph_viewer_.GetInputsIncludingInitializers()) { + graph_inputs.insert(graph_input->Name()); + } for (auto graph_input : graph_viewer_.GetInputs()) { OrtValueIndex index = Index(graph_input->Name()); @@ -368,15 +375,7 @@ class PlannerImpl { for (SequentialExecutionPlan::NodeExecutionPlan& step : plan_.execution_plan) { auto pnode = graph_viewer_.GetNode(step.node_index); if (pnode == nullptr) return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Can not find the node ", step.node_index); - for (auto node_input : pnode->InputDefs()) { - if (node_input->Exists()) - UseCount(node_input->Name())++; - } - for (auto node_input : pnode->ImplicitInputDefs()) { - if (node_input->Exists()) - UseCount(node_input->Name())++; - } // Identify where each output of this node should be allocated. // This is determined by the opkernel bound to the node. const KernelCreateInfo* kernel_create_info = nullptr; @@ -391,31 +390,45 @@ class PlannerImpl { if (!pnode->Name().empty()) errormsg << " (node " << pnode->Name() << ")"; return Status(ONNXRUNTIME, FAIL, errormsg.str()); } - auto exec_provider = execution_providers_.Get(*pnode); if (exec_provider == nullptr) { return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Can not find the execution provider ", pnode->GetExecutionProviderType()); } - auto& default_allocator_info = exec_provider->GetAllocator(0, OrtMemTypeDefault)->Info(); + // increment UseCount and add location information if applicable for the provided input def + auto process_input = [&graph_inputs, &exec_provider, &p_kernelDef, this](const NodeArg& input, size_t arg_idx) { + const auto& name = input.Name(); + UseCount(name)++; + + // If it's a graph input or outer scope node arg, set its plan. + // NOTE: Copy nodes should have already been added if a graph input is fed as input + // to nodes assigned to different providers. 
+ if (graph_inputs.find(name) != graph_inputs.cend() || + std::find_if(outer_scope_node_args_.cbegin(), outer_scope_node_args_.cend(), + [&name](const NodeArg* value) { + return value && value->Name() == name; + }) != outer_scope_node_args_.cend()) { + OrtValueIndex index = Index(name); + plan_.SetLocation(static_cast(index), + exec_provider->GetAllocator(0, p_kernelDef->InputMemoryType(arg_idx))->Info()); + } + + return Status::OK(); + }; + + ORT_RETURN_IF_ERROR(Node::ForEachWithIndex(pnode->InputDefs(), process_input)); + ORT_RETURN_IF_ERROR(Node::ForEachWithIndex(pnode->ImplicitInputDefs(), process_input)); + auto outputs = pnode->OutputDefs(); auto num_outputs = outputs.size(); - for (size_t i = 0; i < num_outputs; ++i) { auto* node_output = outputs[i]; if (!node_output->Exists()) continue; OrtValueIndex index = Index(node_output->Name()); ProcessDef(index, node_output); ++UseCount(index); - if (strcmp(default_allocator_info.name, CPU) != 0) { - // By default, outputs of this node are allocated on the default device allocator, - // except for outputs marked for allocation in MemoryType: - auto memory_type = p_kernelDef->OutputMemoryType(i); - plan_.SetLocation(static_cast(index), memory_type == OrtMemTypeDefault - ? default_allocator_info - : exec_provider->GetAllocator(0, memory_type)->Info()); - } + plan_.SetLocation(static_cast(index), exec_provider->GetAllocator(0, p_kernelDef->OutputMemoryType(i))->Info()); } // if sync is needed, mark allocation plan as create_fence_if_async=true // note that the input arg may come from an execution provider (i.e. CPU) that does not support async, @@ -585,6 +598,51 @@ class PlannerImpl { return Status::OK(); } + // Whether a given NodeArg has fence or not. + // If the buffer is reused, need to check whether original OrtValue has fence or not. + bool HasFence(const onnxruntime::NodeArg* arg) { + bool has_fence = false; + if (arg && arg->Exists()) { + OrtValueIndex index = Index(arg->Name()); + AllocPlanPerValue& value_plan = AllocPlan(index); + + has_fence = value_plan.create_fence_if_async; + if (value_plan.alloc_kind == AllocKind::kReuse) + { + // Buffer reused, check original buffer to see if fence is shared. + has_fence = has_fence || AllocPlan(value_plan.reused_buffer).create_fence_if_async; + } + } + + return has_fence; + } + + // Compute fence check. Set has_fence flag if either one of inputs, implicit inputs or outputs of a given node has fence. + Status ComputeFenceCheck() { + + for (SequentialExecutionPlan::NodeExecutionPlan& step : plan_.execution_plan) { + auto pnode = graph_viewer_.GetNode(step.node_index); + if (pnode == nullptr) return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Can not find the node ", step.node_index); + + bool has_fence = false; + for (auto node_input : pnode->InputDefs()) { + has_fence = has_fence || HasFence(node_input); + } + + for (auto node_input : pnode->ImplicitInputDefs()) { + has_fence = has_fence || HasFence(node_input); + } + + for (auto node_output : pnode->OutputDefs()) { + has_fence = has_fence || HasFence(node_output); + } + + plan_.node_has_fence[step.node_index] = has_fence; + } + + return Status::OK(); + } + // Convert information in a freelist (about which ml-value becomes free when) into // a deallocation plan in the format required in an ExecutionPlan void GenerateDeallocationPlan() { @@ -642,6 +700,9 @@ Status PlannerImpl::CreatePlan() { // determine sharing/reuse among ml-values ORT_RETURN_IF_ERROR(ComputeReusePlan()); + // Determine nodes that need fence check. 
This needs to be done after ComputeUseCounts and ComputeReusePlan. + ORT_RETURN_IF_ERROR(ComputeFenceCheck()); + // convert information in the freelist_ into a deallocation plan in required format GenerateDeallocationPlan(); diff --git a/onnxruntime/core/framework/allocator.cc b/onnxruntime/core/framework/allocator.cc index b8847a00801c3..800a2b898526c 100644 --- a/onnxruntime/core/framework/allocator.cc +++ b/onnxruntime/core/framework/allocator.cc @@ -3,86 +3,67 @@ #include "core/framework/allocator.h" #include "core/framework/allocatormgr.h" -#include "core/mlas/inc/mlas.h" +#include "core/framework/utils.h" #include #include namespace onnxruntime { void* CPUAllocator::Alloc(size_t size) { - if (size <= 0) - return nullptr; - void* p; - size_t alignment = MlasGetPreferredBufferAlignment(); -#if _MSC_VER - p = _aligned_malloc(size, alignment); - if (p == nullptr) throw std::bad_alloc(); -#elif defined(_LIBCPP_SGX_CONFIG) - p = memalign(alignment, size); - if (p == nullptr) throw std::bad_alloc(); -#else - int ret = posix_memalign(&p, alignment, size); - if (ret != 0) throw std::bad_alloc(); -#endif - return p; + return utils::DefaultAlloc(size); } void CPUAllocator::Free(void* p) { -#if _MSC_VER - _aligned_free(p); -#else - free(p); -#endif + utils::DefaultFree(p); } -const OrtAllocatorInfo& CPUAllocator::Info() const { - return *allocator_info_; -} +const OrtAllocatorInfo& CPUAllocator::Info() const { return *allocator_info_; } } // namespace onnxruntime -std::ostream& operator<<(std::ostream& out, const OrtAllocatorInfo& info) { - return (out << info.ToString()); -} +std::ostream& operator<<(std::ostream& out, const OrtAllocatorInfo& info) { return (out << info.ToString()); } ORT_API_STATUS_IMPL(OrtCreateAllocatorInfo, _In_ const char* name1, OrtAllocatorType type, int id1, OrtMemType mem_type1, _Out_ OrtAllocatorInfo** out) { if (strcmp(name1, onnxruntime::CPU) == 0) { *out = new OrtAllocatorInfo(name1, type, OrtDevice(), id1, mem_type1); } else if (strcmp(name1, onnxruntime::CUDA) == 0) { - *out = new OrtAllocatorInfo(name1, type, OrtDevice(OrtDevice::GPU, OrtDevice::MemType::DEFAULT, static_cast(id1)), id1, mem_type1); + *out = new OrtAllocatorInfo( + name1, type, OrtDevice(OrtDevice::GPU, OrtDevice::MemType::DEFAULT, static_cast(id1)), id1, + mem_type1); } else if (strcmp(name1, onnxruntime::CUDA_PINNED) == 0) { - *out = new OrtAllocatorInfo(name1, type, OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, static_cast(id1)), id1, mem_type1); + *out = new OrtAllocatorInfo( + name1, type, OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, static_cast(id1)), + id1, mem_type1); } else { return OrtCreateStatus(ORT_INVALID_ARGUMENT, "Specified device is not supported."); } return nullptr; } -ORT_API(void, OrtReleaseAllocatorInfo, _Frees_ptr_opt_ OrtAllocatorInfo* p) { - delete p; -} +ORT_API(void, OrtReleaseAllocatorInfo, _Frees_ptr_opt_ OrtAllocatorInfo* p) { delete p; } -ORT_API_STATUS_IMPL(OrtAllocatorInfoGetName, _In_ OrtAllocatorInfo* ptr, _Out_ const char** out) { +ORT_API_STATUS_IMPL(OrtAllocatorInfoGetName, _In_ const OrtAllocatorInfo* ptr, _Out_ const char** out) { *out = ptr->name; return nullptr; } -ORT_API_STATUS_IMPL(OrtAllocatorInfoGetId, _In_ OrtAllocatorInfo* ptr, _Out_ int* out) { +ORT_API_STATUS_IMPL(OrtAllocatorInfoGetId, _In_ const OrtAllocatorInfo* ptr, _Out_ int* out) { *out = ptr->id; return nullptr; } -ORT_API_STATUS_IMPL(OrtAllocatorInfoGetMemType, _In_ OrtAllocatorInfo* ptr, _Out_ OrtMemType* out) { +ORT_API_STATUS_IMPL(OrtAllocatorInfoGetMemType, 
_In_ const OrtAllocatorInfo* ptr, _Out_ OrtMemType* out) { *out = ptr->mem_type; return nullptr; } -ORT_API_STATUS_IMPL(OrtAllocatorInfoGetType, _In_ OrtAllocatorInfo* ptr, _Out_ OrtAllocatorType* out) { +ORT_API_STATUS_IMPL(OrtAllocatorInfoGetType, _In_ const OrtAllocatorInfo* ptr, _Out_ OrtAllocatorType* out) { *out = ptr->type; return nullptr; } -ORT_API_STATUS_IMPL(OrtCompareAllocatorInfo, _In_ const OrtAllocatorInfo* info1, _In_ const OrtAllocatorInfo* info2, _Out_ int* out) { +ORT_API_STATUS_IMPL(OrtCompareAllocatorInfo, _In_ const OrtAllocatorInfo* info1, _In_ const OrtAllocatorInfo* info2, + _Out_ int* out) { *out = (*info1 == *info2) ? 0 : -1; return nullptr; } diff --git a/onnxruntime/core/framework/bfc_arena.h b/onnxruntime/core/framework/bfc_arena.h index 664f6fa72a04b..bdc6496c63205 100644 --- a/onnxruntime/core/framework/bfc_arena.h +++ b/onnxruntime/core/framework/bfc_arena.h @@ -244,7 +244,7 @@ class BFCArena : public IArenaAllocator { ~AllocationRegion() { delete[] handles_; } - AllocationRegion(AllocationRegion&& other) { Swap(other); } + AllocationRegion(AllocationRegion&& other) noexcept { Swap(other); } AllocationRegion& operator=(AllocationRegion&& other) { Swap(other); diff --git a/onnxruntime/core/framework/callback.cc b/onnxruntime/core/framework/callback.cc index 414b7ad0d2dc8..deb4d1e277d47 100644 --- a/onnxruntime/core/framework/callback.cc +++ b/onnxruntime/core/framework/callback.cc @@ -1,12 +1,14 @@ // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. -#include "core/common/callback.h" +#include "core/framework/callback.h" -ORT_API(void, OrtRunCallback, _Frees_ptr_opt_ OrtCallback* f){ - if(f == nullptr) return; - if(f->f != nullptr) { +namespace onnxruntime { +void OrtRunCallback(OrtCallback* f) noexcept { + if (f == nullptr) return; + if (f->f != nullptr) { f->f(f->param); delete f; } } +} // namespace onnxruntime diff --git a/onnxruntime/core/framework/callback.h b/onnxruntime/core/framework/callback.h new file mode 100644 index 0000000000000..63cb3b6fcf586 --- /dev/null +++ b/onnxruntime/core/framework/callback.h @@ -0,0 +1,15 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
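[Editorial aside, not part of the patch] The callback.cc hunk above moves OrtRunCallback into the onnxruntime namespace, and the header whose addition starts here declares the OrtCallback it consumes: a plain function pointer plus an opaque parameter, where running the callback also frees the OrtCallback object itself. A minimal usage sketch follows; the struct and OrtRunCallback mirror the patched files so the example stands alone, while the malloc/free payload is purely illustrative.

```cpp
#include <cstdlib>

namespace onnxruntime {
// Re-declared here so the sketch compiles on its own; in the tree these live in
// core/framework/callback.h and callback.cc.
struct OrtCallback {
  void (*f)(void* param) noexcept;
  void* param;
};

void OrtRunCallback(OrtCallback* f) noexcept {
  if (f == nullptr) return;
  if (f->f != nullptr) {
    f->f(f->param);
    delete f;  // the callee owns and frees the OrtCallback
  }
}
}  // namespace onnxruntime

int main() {
  void* buffer = std::malloc(64);
  // Package "free this buffer" as a callback; whoever runs it takes ownership.
  auto* deleter = new onnxruntime::OrtCallback{
      [](void* p) noexcept { std::free(p); }, buffer};
  onnxruntime::OrtRunCallback(deleter);  // frees buffer, then deletes the callback
}
```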
+#pragma once + +namespace onnxruntime { +struct OrtCallback { + void (*f)(void* param) noexcept; + void* param; +}; + +/** + * f will be freed in this call + */ +void OrtRunCallback(OrtCallback* f) noexcept; +} // namespace onnxruntime \ No newline at end of file diff --git a/onnxruntime/core/framework/data_types.cc b/onnxruntime/core/framework/data_types.cc index b41a59518cfaf..a372e52058036 100644 --- a/onnxruntime/core/framework/data_types.cc +++ b/onnxruntime/core/framework/data_types.cc @@ -6,6 +6,10 @@ #include "core/framework/sparse_tensor.h" #include "core/graph/onnx_protobuf.h" +#ifdef MICROSOFT_AUTOML +#include "automl_ops/automl_types.h" +#endif + #ifdef __GNUC__ #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wignored-qualifiers" @@ -285,6 +289,9 @@ class DataTypeRegistry { DataTypeRegistry() { RegisterAllProtos([this](MLDataType mltype) { RegisterDataType(mltype); }); +#ifdef MICROSOFT_AUTOML + automl::RegisterAutoMLTypes([this](MLDataType mltype) { RegisterDataType(mltype); }); +#endif } ~DataTypeRegistry() = default; @@ -887,6 +894,40 @@ ORT_REGISTER_NON_ONNX_TYPE(uint64_t); ORT_REGISTER_NON_ONNX_TYPE(MLFloat16); ORT_REGISTER_NON_ONNX_TYPE(BFloat16); +const std::vector& DataTypeImpl::AllFixedSizeTensorExceptHalfTypes() { + static std::vector all_fixed_size_tensor_types = + {DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType()}; + + return all_fixed_size_tensor_types; +} + +const std::vector& DataTypeImpl::AllIEEEFloatTensorExceptHalfTypes() { + static std::vector all_IEEE_float_tensor_except_half_types = + {DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType()}; + + return all_IEEE_float_tensor_except_half_types; +} + +const std::vector& DataTypeImpl::AllIEEEFloatTensorTypes() { + static std::vector all_IEEE_float_tensor_types = + {DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType(), + DataTypeImpl::GetTensorType()}; + + return all_IEEE_float_tensor_types; +} + const std::vector& DataTypeImpl::AllFixedSizeTensorTypes() { static std::vector all_fixed_size_tensor_types = {DataTypeImpl::GetTensorType(), diff --git a/onnxruntime/core/framework/error_code.cc b/onnxruntime/core/framework/error_code.cc index 2cf11f4e1de8e..c727b7464f3ac 100644 --- a/onnxruntime/core/framework/error_code.cc +++ b/onnxruntime/core/framework/error_code.cc @@ -12,11 +12,12 @@ struct OrtStatus { char msg[1]; // a null-terminated string }; -ORT_API(OrtStatus*, OrtCreateStatus, OrtErrorCode code, _In_ const char* msg) { +//Even we say it may not return NULL, indeed it may. +ORT_EXPORT _Check_return_ _Ret_notnull_ OrtStatus* ORT_API_CALL OrtCreateStatus(OrtErrorCode code, _In_ const char* msg) NO_EXCEPTION { assert(!(code == 0 && msg != nullptr)); size_t clen = strlen(msg); OrtStatus* p = reinterpret_cast(::malloc(sizeof(OrtStatus) + clen)); - if (p == nullptr) return nullptr; // OOM + if (p == nullptr) return nullptr; // OOM. What we can do here? abort()? 
p->code = code; memcpy(p->msg, msg, clen); p->msg[clen] = '\0'; diff --git a/onnxruntime/core/framework/execution_frame.cc b/onnxruntime/core/framework/execution_frame.cc index c44bb3e0497a3..59a025a61711f 100644 --- a/onnxruntime/core/framework/execution_frame.cc +++ b/onnxruntime/core/framework/execution_frame.cc @@ -22,11 +22,15 @@ IExecutionFrame::IExecutionFrame(const std::vector& feed_mlvalue_idxs, cons const std::unordered_map& initializers, const std::vector& fetch_mlvalue_idxs, const std::vector& fetches, const OrtValueNameIdxMap& ort_value_idx_map, const NodeIndexInfo& node_index_info) - : node_index_info_{node_index_info}, fetch_mlvalue_idxs_{fetch_mlvalue_idxs} { + : node_index_info_{node_index_info}, + all_values_size_{static_cast(ort_value_idx_map.MaxIdx()) + 1}, + fetch_mlvalue_idxs_{fetch_mlvalue_idxs} { ORT_ENFORCE(feeds.size() == feed_mlvalue_idxs.size()); ORT_ENFORCE(fetches.empty() || fetches.size() == fetch_mlvalue_idxs_.size()); + ORT_ENFORCE(node_index_info_.GetMaxMLValueIdx() == ort_value_idx_map.MaxIdx(), + "node_index_info and ort_value_idx_map are out of sync and cannot be used"); - Init(feed_mlvalue_idxs, feeds, initializers, fetches, ort_value_idx_map); + Init(feed_mlvalue_idxs, feeds, initializers, fetches); } IExecutionFrame::~IExecutionFrame() = default; @@ -79,7 +83,7 @@ AllocatorPtr IExecutionFrame::GetAllocator(const OrtAllocatorInfo& info) const { Status IExecutionFrame::ReleaseMLValue(int ort_value_idx) { return ReleaseMLValueImpl(ort_value_idx); } Status IExecutionFrame::ReleaseMLValueImpl(int ort_value_idx) { - if (ort_value_idx == NodeIndexInfo::kInvalidEntry || static_cast(ort_value_idx) >= all_values_.size()) { + if (ort_value_idx == NodeIndexInfo::kInvalidEntry || static_cast(ort_value_idx) >= all_values_size_) { return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT, "invalid index ", ort_value_idx); } @@ -95,19 +99,16 @@ Status IExecutionFrame::ReleaseMLValueImpl(int ort_value_idx) { } int IExecutionFrame::GetNodeIdxToMLValueIdx(int index) const { + // the validity of index is checked by GetMLValueIndex int ort_value_idx = node_index_info_.GetMLValueIndex(index); - ORT_ENFORCE(ort_value_idx == NodeIndexInfo::kInvalidEntry || - (ort_value_idx >= 0 && static_cast(ort_value_idx) < all_values_.size())); - return ort_value_idx; } void IExecutionFrame::Init(const std::vector& feed_mlvalue_idxs, const std::vector& feeds, const std::unordered_map& initializers, - const std::vector& fetches, - const OrtValueNameIdxMap& ort_value_idx_map) { + const std::vector& fetches) { // 1. resize the all_value_ vector - all_values_.resize(ort_value_idx_map.MaxIdx() + 1); + all_values_.resize(all_values_size_); // 2. 
Handle non-empty output vector if (!fetches.empty()) { @@ -402,54 +403,54 @@ Status ExecutionFrame::AllocateAsPerAllocationPlan(OrtValue& ort_value, int ort_ const auto& alloc_info = per_alloc_plan.location; const auto* ml_type = per_alloc_plan.value_type; - if (ml_type == nullptr) + if (ml_type == nullptr) { return Status( ONNXRUNTIME, INVALID_ARGUMENT, "Tried to allocate without valid type information, ort_value index=" + std::to_string(ort_value_index)); - - if (ml_type->IsSparseTensorType()) { - return AllocateSparseTensor(ort_value, *ml_type, GetAllocator(alloc_info), - *shape, nnz, per_alloc_plan.create_fence_if_async, session_state_); - } - if (!ml_type->IsTensorType()) { - return AllocateTraditionalMLValue(ort_value, *static_cast(ml_type)); } - ORT_ENFORCE(shape, "Allocation of tensor types requires a shape."); + if (ml_type->IsTensorType()) { + ORT_ENFORCE(shape, "Allocation of tensor types requires a shape."); - // tensors - const auto* ml_data_type = static_cast(ml_type)->GetElementType(); + // tensors + const auto* ml_data_type = static_cast(ml_type)->GetElementType(); - AllocKind alloc_kind = per_alloc_plan.alloc_kind; - switch (alloc_kind) { - // Right now for kAllocate and kAllocateOutput we are using same approach. - // In the future we may want to have different way to handle it. - case AllocKind::kAllocateOutput: - case AllocKind::kAllocate: { - ORT_RETURN_IF_ERROR(AllocateMLValueTensorSelfOwnBuffer(ort_value, ort_value_index, ml_data_type, alloc_info, - *shape, per_alloc_plan.create_fence_if_async)); - break; - } - case AllocKind::kReuse: { - int reuse_mlvalue_index = per_alloc_plan.reused_buffer; - ORT_RETURN_IF_ERROR(AllocateMLValueTensorPreAllocateBuffer( - ort_value, reuse_mlvalue_index, ml_data_type, alloc_info, *shape, per_alloc_plan.create_fence_if_async)); - break; - } - case AllocKind::kShare: { - int reuse_mlvalue_index = per_alloc_plan.reused_buffer; - // copy at the OrtValue level so the shared_ptr for the data is shared between the two OrtValue instances - ort_value = GetMutableMLValue(reuse_mlvalue_index); - break; - } - default: { - std::ostringstream ostr; - ostr << "Invalid allocation kind: " << static_cast::type>(alloc_kind); - return Status(ONNXRUNTIME, FAIL, ostr.str()); + AllocKind alloc_kind = per_alloc_plan.alloc_kind; + switch (alloc_kind) { + // Right now for kAllocate and kAllocateOutput we are using same approach. + // In the future we may want to have different way to handle it. 
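[Editorial aside, not part of the patch] The reordered AllocateAsPerAllocationPlan above checks the tensor path first because it is the common case, then dispatches on the planned allocation kind: kAllocate/kAllocateOutput own a fresh buffer, kReuse binds to another value's pre-allocated buffer, and kShare copies the OrtValue so the underlying data is shared. The sketch below condenses that dispatch; ValuePlan and DescribeAllocation are simplified stand-ins, not ONNX Runtime types.

```cpp
#include <iostream>
#include <string>

enum class AllocKind { kAllocate, kAllocateOutput, kReuse, kShare };

struct ValuePlan {
  AllocKind alloc_kind;
  int reused_buffer = -1;  // meaningful only for kReuse / kShare
};

std::string DescribeAllocation(const ValuePlan& plan, int value_index) {
  switch (plan.alloc_kind) {
    // kAllocate and kAllocateOutput currently follow the same path.
    case AllocKind::kAllocate:
    case AllocKind::kAllocateOutput:
      return "value " + std::to_string(value_index) + ": allocate a buffer it owns";
    case AllocKind::kReuse:
      return "value " + std::to_string(value_index) + ": reuse the buffer of value " +
             std::to_string(plan.reused_buffer);
    case AllocKind::kShare:
      return "value " + std::to_string(value_index) + ": share the OrtValue of value " +
             std::to_string(plan.reused_buffer);
    default:
      return "invalid allocation kind";
  }
}

int main() {
  std::cout << DescribeAllocation({AllocKind::kAllocate}, 0) << "\n"
            << DescribeAllocation({AllocKind::kReuse, 0}, 3) << "\n";
}
```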
+ case AllocKind::kAllocateOutput: + case AllocKind::kAllocate: { + ORT_RETURN_IF_ERROR(AllocateMLValueTensorSelfOwnBuffer(ort_value, ort_value_index, ml_data_type, alloc_info, + *shape, per_alloc_plan.create_fence_if_async)); + break; + } + case AllocKind::kReuse: { + int reuse_mlvalue_index = per_alloc_plan.reused_buffer; + ORT_RETURN_IF_ERROR(AllocateMLValueTensorPreAllocateBuffer( + ort_value, reuse_mlvalue_index, ml_data_type, alloc_info, *shape, per_alloc_plan.create_fence_if_async)); + break; + } + case AllocKind::kShare: { + int reuse_mlvalue_index = per_alloc_plan.reused_buffer; + // copy at the OrtValue level so the shared_ptr for the data is shared between the two OrtValue instances + ort_value = GetMutableMLValue(reuse_mlvalue_index); + break; + } + default: { + std::ostringstream ostr; + ostr << "Invalid allocation kind: " << static_cast::type>(alloc_kind); + return Status(ONNXRUNTIME, FAIL, ostr.str()); + } } - } - return Status::OK(); + return Status::OK(); + } else if (ml_type->IsSparseTensorType()) { + return AllocateSparseTensor(ort_value, *ml_type, GetAllocator(alloc_info), + *shape, nnz, per_alloc_plan.create_fence_if_async, session_state_); + } else { + return AllocateTraditionalMLValue(ort_value, *static_cast(ml_type)); + } } AllocatorPtr ExecutionFrame::GetAllocatorImpl(const OrtAllocatorInfo& info) const { diff --git a/onnxruntime/core/framework/execution_frame.h b/onnxruntime/core/framework/execution_frame.h index c99979edb7eba..06d042de3bd20 100644 --- a/onnxruntime/core/framework/execution_frame.h +++ b/onnxruntime/core/framework/execution_frame.h @@ -74,10 +74,10 @@ class IExecutionFrame { void Init(const std::vector& feed_mlvalue_idxs, const std::vector& feeds, const std::unordered_map& initializers, - const std::vector& fetches, const OrtValueNameIdxMap& ort_value_idx_map); + const std::vector& fetches); const OrtValue& GetMLValue(int ort_value_index) const { - ORT_ENFORCE(ort_value_index >= 0 && static_cast(ort_value_index) < all_values_.size()); + ORT_ENFORCE(ort_value_index >= 0 && static_cast(ort_value_index) < all_values_size_); return all_values_[ort_value_index]; } @@ -91,6 +91,9 @@ class IExecutionFrame { // Input and Output values are passed in by executors std::vector all_values_; + // perf optimization to avoid calling all_values_.size() repeatedly as the size is fixed once constructed + const size_t all_values_size_; + const std::vector fetch_mlvalue_idxs_; }; diff --git a/onnxruntime/core/framework/feeds_fetches_manager.h b/onnxruntime/core/framework/feeds_fetches_manager.h index 000eaa504176f..d646c82ab23d4 100644 --- a/onnxruntime/core/framework/feeds_fetches_manager.h +++ b/onnxruntime/core/framework/feeds_fetches_manager.h @@ -48,9 +48,8 @@ struct FeedsFetchesInfo { class FeedsFetchesManager { public: struct MLValueCopyInfo { - int allocation_device_id = 0; + OrtDevice target_device; const IExecutionProvider* allocation_provider = nullptr; - const IExecutionProvider* copy_provider = nullptr; }; static Status Create(const std::vector& feed_names, const std::vector& output_names, diff --git a/onnxruntime/core/framework/graph_partitioner.cc b/onnxruntime/core/framework/graph_partitioner.cc index 5b0cba6c3b0d8..fe53971656932 100644 --- a/onnxruntime/core/framework/graph_partitioner.cc +++ b/onnxruntime/core/framework/graph_partitioner.cc @@ -2,7 +2,6 @@ // Licensed under the MIT License. 
#include "core/framework/graph_partitioner.h" - #include "core/framework/kernel_registry_manager.h" #include "core/graph/function.h" #include "core/graph/graph_viewer.h" @@ -176,10 +175,6 @@ Status GraphPartitioner::Partition(Graph& graph, bool export_dll, FuncManager& f //prepare the func kernel KernelDefBuilder builder; BuildFusedKernelDef(builder, *node); - if (node->GetExecutionProviderType() == onnxruntime::kTensorrtExecutionProvider || node->GetExecutionProviderType() == onnxruntime::kNGraphExecutionProvider || node->GetExecutionProviderType() == onnxruntime::kNnapiExecutionProvider) { - builder.SetDefaultInputsMemoryType(OrtMemTypeCPUInput); - builder.SetDefaultOutputMemoryType(OrtMemTypeCPUOutput); - } ORT_RETURN_IF_ERROR(fused_kernel_registry->Register( builder, static_cast([](const OpKernelInfo& info) -> OpKernel* { return new FunctionKernel(info); }))); } diff --git a/onnxruntime/core/framework/kernel_registry_manager.cc b/onnxruntime/core/framework/kernel_registry_manager.cc index 5fe803b368022..203bc7c21e45f 100644 --- a/onnxruntime/core/framework/kernel_registry_manager.cc +++ b/onnxruntime/core/framework/kernel_registry_manager.cc @@ -14,11 +14,21 @@ Status KernelRegistryManager::CreateKernel(const onnxruntime::Node& node, const IExecutionProvider& execution_provider, const SessionState& session_state, /*out*/ std::unique_ptr& op_kernel) const { + auto create_error_message = [&node](const std::string& error) { + std::ostringstream errormsg; + errormsg << error << node.OpType(); + if (node.Op() != nullptr) errormsg << "(" << node.Op()->since_version() << ")"; + if (!node.Name().empty()) errormsg << " (node " << node.Name() << ")"; + return errormsg.str(); + }; + const std::string& ptype = node.GetExecutionProviderType(); if (ptype.empty()) { return Status(ONNXRUNTIME, FAIL, - "The node is not placed on any Execution Provider, therefore, can't find a suitable kernel for it"); + create_error_message("The node is not placed on any Execution Provider, " + "therefore, can't find a suitable kernel for ")); } + Status status; { for (auto& registry : custom_kernel_registries_) { @@ -41,11 +51,7 @@ Status KernelRegistryManager::CreateKernel(const onnxruntime::Node& node, } } - std::ostringstream errormsg; - errormsg << "Failed to find kernel for " << node.OpType(); - if (node.Op() != nullptr) errormsg << "(" << node.Op()->since_version() << ")"; - if (!node.Name().empty()) errormsg << " (node " << node.Name() << ")"; - return Status(ONNXRUNTIME, FAIL, errormsg.str()); + return Status(ONNXRUNTIME, FAIL, create_error_message("Failed to find kernel for ")); } Status KernelRegistryManager::RegisterKernels(const ExecutionProviders& execution_providers) { diff --git a/onnxruntime/core/framework/mem_pattern.h b/onnxruntime/core/framework/mem_pattern.h index 57d9e99360b13..2aa1e3cad32eb 100644 --- a/onnxruntime/core/framework/mem_pattern.h +++ b/onnxruntime/core/framework/mem_pattern.h @@ -20,11 +20,11 @@ class MemoryPattern { public: MemoryPattern() = default; - MemoryPattern(MemoryPattern&& rhs) + MemoryPattern(MemoryPattern&& rhs) noexcept : patterns_{std::move(rhs.patterns_)}, peak_size_{std::move(rhs.peak_size_)} {} - MemoryPattern& operator=(MemoryPattern&& rhs) { + MemoryPattern& operator=(MemoryPattern&& rhs) noexcept { patterns_ = std::move(rhs.patterns_); peak_size_ = std::move(rhs.peak_size_); return *this; diff --git a/onnxruntime/core/framework/node_index_info.cc b/onnxruntime/core/framework/node_index_info.cc index 7931825e7fd7c..d77a72cabc909 100644 --- 
a/onnxruntime/core/framework/node_index_info.cc +++ b/onnxruntime/core/framework/node_index_info.cc @@ -69,6 +69,10 @@ void NodeIndexInfo::Init(const TValidNodes& nodes, NodeIndex max_node_index, // init all to kInvalidEntry node_offsets_.resize(GetNodeOffsetsIndex(max_node_index), kInvalidEntry); node_values_.resize(total_def_count, kInvalidEntry); + + node_offsets_size_ = node_offsets_.size(); + node_values_size_ = node_values_.size(); + int cur_idx = 0; for (auto& node : nodes) { diff --git a/onnxruntime/core/framework/node_index_info.h b/onnxruntime/core/framework/node_index_info.h index afd74a1874900..19b4a202f578f 100644 --- a/onnxruntime/core/framework/node_index_info.h +++ b/onnxruntime/core/framework/node_index_info.h @@ -31,14 +31,14 @@ class NodeIndexInfo final { // Returns kInvalidEntry if the Node with the given node_index did not exist when the NodeIndexInfo was created. int GetNodeOffset(NodeIndex node_index) const { auto node_offsets_index = GetNodeOffsetsIndex(node_index); - ORT_ENFORCE(node_offsets_index < node_offsets_.size()); + ORT_ENFORCE(node_offsets_index < node_offsets_size_); return node_offsets_[node_offsets_index]; } // Get the ort_value index value. // Returns kInvalidEntry for optional inputs/outputs that do not exist in this graph. int GetMLValueIndex(int offset) const { - ORT_ENFORCE(offset >= 0 && static_cast(offset) < node_values_.size()); + ORT_ENFORCE(offset >= 0 && static_cast(offset) < node_values_size_); return node_values_[offset]; } @@ -63,5 +63,9 @@ class NodeIndexInfo final { std::vector node_offsets_; const int max_mlvalue_idx_; + + // perf optimization to avoid calls to size() on node_values_ and node_offsets_ as they don't change + size_t node_values_size_; + size_t node_offsets_size_; }; } // namespace onnxruntime diff --git a/onnxruntime/core/framework/op_kernel_context_internal.h b/onnxruntime/core/framework/op_kernel_context_internal.h index 02515ba39a160..b837356504d36 100644 --- a/onnxruntime/core/framework/op_kernel_context_internal.h +++ b/onnxruntime/core/framework/op_kernel_context_internal.h @@ -5,6 +5,7 @@ #include "core/framework/op_kernel.h" #include "core/framework/session_state.h" +#include "core/session/onnxruntime_c_api.h" // onnxruntime internal OpKernelContext derived class to provide additional // APIs that aren't desirable to add to the public OpKernelContext API @@ -57,7 +58,8 @@ class OpKernelContextInternal : public OpKernelContext { const bool& GetTerminateFlag() const noexcept { return terminate_flag_; } - const onnxruntime::concurrency::ThreadPool* GetOperatorThreadPool() const { return session_state_.GetThreadPool(); } + _Ret_maybenull_ const onnxruntime::concurrency::ThreadPool* GetOperatorThreadPool() const { return session_state_.GetThreadPool(); } + _Ret_maybenull_ onnxruntime::concurrency::ThreadPool* GetOperatorThreadPool() { return session_state_.GetThreadPool(); } private: const SessionState& session_state_; diff --git a/onnxruntime/core/framework/parallel_executor.cc b/onnxruntime/core/framework/parallel_executor.cc index 72ee80cd421ee..ff33f93eab6c4 100644 --- a/onnxruntime/core/framework/parallel_executor.cc +++ b/onnxruntime/core/framework/parallel_executor.cc @@ -122,6 +122,7 @@ Status ParallelExecutor::RunNodeAsync(size_t p_node_index, TimePoint sync_time_begin; TimePoint kernel_begin_time; const bool f_profiler_enabled = session_state.Profiler().IsEnabled(); + const SequentialExecutionPlan& exec_plan = *session_state.GetExecutionPlan(); // Avoid context switching if possible. 
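[Editorial aside, not part of the patch] The executor hunks that follow (parallel_executor.cc here, sequential_executor.cc further down) wrap the per-input/per-output fence loops in a single NodeHasFence check, using the node_has_fence flags that ComputeFenceCheck filled in at plan-creation time in allocation_planner.cc. The condensed sketch below shows that fast path with simplified stand-in types; it is not the actual executor code.

```cpp
#include <cstddef>
#include <vector>

struct Fence {
  void BeforeUsingAsInput() {}
  void AfterUsedAsInput() {}
};

struct NodeContext {
  std::vector<Fence*> input_fences;  // nullptr when a value carries no fence
};

struct ExecutionPlanSketch {
  std::vector<bool> node_has_fence;  // filled once, at plan-creation time
  bool NodeHasFence(std::size_t node_index) const { return node_has_fence[node_index]; }
};

void RunNodeSketch(const ExecutionPlanSketch& plan, std::size_t node_index, NodeContext& ctx) {
  // Fast path: most nodes never touch a fence, so one flag check replaces
  // iterating every input and output both before and after compute.
  if (plan.NodeHasFence(node_index)) {
    for (Fence* f : ctx.input_fences)
      if (f) f->BeforeUsingAsInput();
  }

  // ... kernel compute would happen here ...

  if (plan.NodeHasFence(node_index)) {
    for (Fence* f : ctx.input_fences)
      if (f) f->AfterUsedAsInput();
  }
}

int main() {
  ExecutionPlanSketch plan{{false, true}};
  Fence fence;
  NodeContext ctx{{&fence, nullptr}};
  RunNodeSketch(plan, 0, ctx);  // node 0 has no fence: both loops are skipped
  RunNodeSketch(plan, 1, ctx);  // node 1 syncs its fenced input before and after compute
}
```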
while (keep_running) { @@ -149,33 +150,34 @@ Status ParallelExecutor::RunNodeAsync(size_t p_node_index, } // sync before compute int queue_id = p_op_kernel->KernelDef().ExecQueueId(); - - for (int input_index = 0; input_index < op_kernel_context.InputCount(); ++input_index) { - Fence_t fence = op_kernel_context.InputFence(input_index); - if (fence) { - auto execution_provider_type = p_op_kernel->Node().GetExecutionProviderType(); - if (OrtMemTypeCPUInput == p_op_kernel->KernelDef().InputMemoryType(input_index)) { - execution_provider_type = kCpuExecutionProvider; + if (exec_plan.NodeHasFence(node_index)) { + for (int input_index = 0; input_index < op_kernel_context.InputCount(); ++input_index) { + Fence_t fence = op_kernel_context.InputFence(input_index); + if (fence) { + auto execution_provider_type = p_op_kernel->Node().GetExecutionProviderType(); + if (OrtMemTypeCPUInput == p_op_kernel->KernelDef().InputMemoryType(input_index)) { + execution_provider_type = kCpuExecutionProvider; + } + fence->BeforeUsingAsInput(execution_provider_type, queue_id); } - fence->BeforeUsingAsInput(execution_provider_type, queue_id); } - } - for (int input_index = 0; input_index < op_kernel_context.ImplicitInputCount(); ++input_index) { - Fence_t fence = op_kernel_context.ImplicitInputFence(input_index); - if (fence) { - auto execution_provider_type = p_op_kernel->Node().GetExecutionProviderType(); - if (OrtMemTypeCPUInput == p_op_kernel->KernelDef().InputMemoryType(input_index)) { - execution_provider_type = kCpuExecutionProvider; + for (int input_index = 0; input_index < op_kernel_context.ImplicitInputCount(); ++input_index) { + Fence_t fence = op_kernel_context.ImplicitInputFence(input_index); + if (fence) { + auto execution_provider_type = p_op_kernel->Node().GetExecutionProviderType(); + if (OrtMemTypeCPUInput == p_op_kernel->KernelDef().InputMemoryType(input_index)) { + execution_provider_type = kCpuExecutionProvider; + } + fence->BeforeUsingAsInput(execution_provider_type, queue_id); } - fence->BeforeUsingAsInput(execution_provider_type, queue_id); } - } - for (int output_index = 0; output_index < op_kernel_context.OutputCount(); ++output_index) { - Fence_t fence = op_kernel_context.OutputFence(output_index); - if (fence) { - fence->BeforeUsingAsOutput(p_op_kernel->Node().GetExecutionProviderType(), queue_id); + for (int output_index = 0; output_index < op_kernel_context.OutputCount(); ++output_index) { + Fence_t fence = op_kernel_context.OutputFence(output_index); + if (fence) { + fence->BeforeUsingAsOutput(p_op_kernel->Node().GetExecutionProviderType(), queue_id); + } } } @@ -209,32 +211,36 @@ Status ParallelExecutor::RunNodeAsync(size_t p_node_index, sync_time_begin = session_state.Profiler().StartTime(); } // sync after compute for outputs - for (int input_index = 0; input_index < op_kernel_context.InputCount(); ++input_index) { - Fence_t fence = op_kernel_context.InputFence(input_index); - if (fence) { - fence->AfterUsedAsInput(queue_id); + if (exec_plan.NodeHasFence(node_index)) { + for (int input_index = 0; input_index < op_kernel_context.InputCount(); ++input_index) { + Fence_t fence = op_kernel_context.InputFence(input_index); + if (fence) { + fence->AfterUsedAsInput(queue_id); + } } - } - for (int input_index = 0; input_index < op_kernel_context.ImplicitInputCount(); ++input_index) { - Fence_t fence = op_kernel_context.ImplicitInputFence(input_index); - if (fence) { - fence->AfterUsedAsInput(queue_id); + for (int input_index = 0; input_index < op_kernel_context.ImplicitInputCount(); 
++input_index) { + Fence_t fence = op_kernel_context.ImplicitInputFence(input_index); + if (fence) { + fence->AfterUsedAsInput(queue_id); + } } - } - for (int output_index = 0; output_index < op_kernel_context.OutputCount(); ++output_index) { - Fence_t fence = op_kernel_context.OutputFence(output_index); - if (fence) { - fence->AfterUsedAsOutput(queue_id); + for (int output_index = 0; output_index < op_kernel_context.OutputCount(); ++output_index) { + Fence_t fence = op_kernel_context.OutputFence(output_index); + if (fence) { + fence->AfterUsedAsOutput(queue_id); + } } } + if (f_profiler_enabled) { session_state.Profiler().EndTimeAndRecordEvent(profiling::NODE_EVENT, p_op_kernel->Node().Name() + "_fence_after", sync_time_begin, {{"op_name", p_op_kernel->KernelDef().OpName()}}); } + //std::cout << "Run async node finish: " << p_node_index << std::endl; keep_running = false; diff --git a/onnxruntime/core/framework/parallel_executor.h b/onnxruntime/core/framework/parallel_executor.h index 5f34309937bac..74d3fbce3d8d4 100644 --- a/onnxruntime/core/framework/parallel_executor.h +++ b/onnxruntime/core/framework/parallel_executor.h @@ -21,7 +21,6 @@ class ExecutionFrame; class ParallelExecutor : public IExecutor { public: - ParallelExecutor(const bool& terminate_flag = false) : terminate_flag_{terminate_flag} {} ParallelExecutor(const SessionState& session_state, const bool& terminate_flag = false); common::Status Execute(const SessionState& session_state, const std::vector& feed_mlvalue_idxs, diff --git a/onnxruntime/core/framework/run_options.cc b/onnxruntime/core/framework/run_options.cc index 079be56fc5ae4..640c610841774 100644 --- a/onnxruntime/core/framework/run_options.cc +++ b/onnxruntime/core/framework/run_options.cc @@ -17,6 +17,11 @@ ORT_API_STATUS_IMPL(OrtRunOptionsSetRunLogVerbosityLevel, _In_ OrtRunOptions* op return nullptr; } +ORT_API_STATUS_IMPL(OrtRunOptionsSetRunLogSeverityLevel, _In_ OrtRunOptions* options, int value) { + options->run_log_severity_level = value; + return nullptr; +} + ORT_API_STATUS_IMPL(OrtRunOptionsSetRunTag, _In_ OrtRunOptions* options, _In_ const char* run_tag) { if (run_tag) options->run_tag = run_tag; @@ -28,6 +33,11 @@ ORT_API_STATUS_IMPL(OrtRunOptionsGetRunLogVerbosityLevel, _In_ const OrtRunOptio return nullptr; } +ORT_API_STATUS_IMPL(OrtRunOptionsGetRunLogSeverityLevel, _In_ const OrtRunOptions* options, int* out) { + *out = options->run_log_severity_level; + return nullptr; +} + ORT_API_STATUS_IMPL(OrtRunOptionsGetRunTag, _In_ const OrtRunOptions* options, const char** out) { *out = options->run_tag.c_str(); return nullptr; diff --git a/onnxruntime/core/framework/sequential_execution_plan.h b/onnxruntime/core/framework/sequential_execution_plan.h index 24ed345965cbc..5c6827966dd41 100644 --- a/onnxruntime/core/framework/sequential_execution_plan.h +++ b/onnxruntime/core/framework/sequential_execution_plan.h @@ -66,6 +66,9 @@ struct SequentialExecutionPlan : public ExecutionPlanBase { // Execution_plan: represents the nodes in the sequential order to be executed std::vector execution_plan; + // Records whether a given node has fence on its input or output, key is node index. + std::vector node_has_fence; + // to_be_freed: vector elements represent indices of ml-values to be freed (as described above) std::vector to_be_freed; @@ -84,6 +87,12 @@ struct SequentialExecutionPlan : public ExecutionPlanBase { } return locations; } + + // Whether a given node needs fence check or not. 
+ bool NodeHasFence(onnxruntime::NodeIndex node_index) const { + return node_has_fence[node_index]; + } + }; // Output details of an execution plan: diff --git a/onnxruntime/core/framework/sequential_executor.cc b/onnxruntime/core/framework/sequential_executor.cc index bd45bbfdc0b01..0f08e8613cc1a 100644 --- a/onnxruntime/core/framework/sequential_executor.cc +++ b/onnxruntime/core/framework/sequential_executor.cc @@ -71,32 +71,34 @@ Status SequentialExecutor::Execute(const SessionState& session_state, const std: // sync before compute int queue_id = p_op_kernel->KernelDef().ExecQueueId(); - for (int input_index = 0; input_index < op_kernel_context.InputCount(); ++input_index) { - Fence_t fence = op_kernel_context.InputFence(input_index); - if (fence) { - auto execution_provider_type = p_op_kernel->Node().GetExecutionProviderType(); - if (OrtMemTypeCPUInput == p_op_kernel->KernelDef().InputMemoryType(input_index)) { - execution_provider_type = kCpuExecutionProvider; + if (seq_exec_plan.NodeHasFence(node_index)) { + for (int input_index = 0; input_index < op_kernel_context.InputCount(); ++input_index) { + Fence_t fence = op_kernel_context.InputFence(input_index); + if (fence) { + auto execution_provider_type = p_op_kernel->Node().GetExecutionProviderType(); + if (OrtMemTypeCPUInput == p_op_kernel->KernelDef().InputMemoryType(input_index)) { + execution_provider_type = kCpuExecutionProvider; + } + fence->BeforeUsingAsInput(execution_provider_type, queue_id); } - fence->BeforeUsingAsInput(execution_provider_type, queue_id); } - } - for (int input_index = 0; input_index < op_kernel_context.ImplicitInputCount(); ++input_index) { - Fence_t fence = op_kernel_context.ImplicitInputFence(input_index); - if (fence) { - auto execution_provider_type = p_op_kernel->Node().GetExecutionProviderType(); - if (OrtMemTypeCPUInput == p_op_kernel->KernelDef().InputMemoryType(input_index)) { - execution_provider_type = kCpuExecutionProvider; + for (int input_index = 0; input_index < op_kernel_context.ImplicitInputCount(); ++input_index) { + Fence_t fence = op_kernel_context.ImplicitInputFence(input_index); + if (fence) { + auto execution_provider_type = p_op_kernel->Node().GetExecutionProviderType(); + if (OrtMemTypeCPUInput == p_op_kernel->KernelDef().InputMemoryType(input_index)) { + execution_provider_type = kCpuExecutionProvider; + } + fence->BeforeUsingAsInput(execution_provider_type, queue_id); } - fence->BeforeUsingAsInput(execution_provider_type, queue_id); } - } - for (int output_index = 0; output_index < op_kernel_context.OutputCount(); ++output_index) { - Fence_t fence = op_kernel_context.OutputFence(output_index); - if (fence) { - fence->BeforeUsingAsOutput(p_op_kernel->Node().GetExecutionProviderType(), queue_id); + for (int output_index = 0; output_index < op_kernel_context.OutputCount(); ++output_index) { + Fence_t fence = op_kernel_context.OutputFence(output_index); + if (fence) { + fence->BeforeUsingAsOutput(p_op_kernel->Node().GetExecutionProviderType(), queue_id); + } } } @@ -138,24 +140,26 @@ Status SequentialExecutor::Execute(const SessionState& session_state, const std: } // sync after compute for outputs - for (int input_index = 0; input_index < op_kernel_context.InputCount(); ++input_index) { - Fence_t fence = op_kernel_context.InputFence(input_index); - if (fence) { - fence->AfterUsedAsInput(queue_id); + if (seq_exec_plan.NodeHasFence(node_index)) { + for (int input_index = 0; input_index < op_kernel_context.InputCount(); ++input_index) { + Fence_t fence = 
op_kernel_context.InputFence(input_index); + if (fence) { + fence->AfterUsedAsInput(queue_id); + } } - } - for (int input_index = 0; input_index < op_kernel_context.ImplicitInputCount(); ++input_index) { - Fence_t fence = op_kernel_context.ImplicitInputFence(input_index); - if (fence) { - fence->AfterUsedAsInput(queue_id); + for (int input_index = 0; input_index < op_kernel_context.ImplicitInputCount(); ++input_index) { + Fence_t fence = op_kernel_context.ImplicitInputFence(input_index); + if (fence) { + fence->AfterUsedAsInput(queue_id); + } } - } - for (int output_index = 0; output_index < op_kernel_context.OutputCount(); ++output_index) { - Fence_t fence = op_kernel_context.OutputFence(output_index); - if (fence) { - fence->AfterUsedAsOutput(queue_id); + for (int output_index = 0; output_index < op_kernel_context.OutputCount(); ++output_index) { + Fence_t fence = op_kernel_context.OutputFence(output_index); + if (fence) { + fence->AfterUsedAsOutput(queue_id); + } } } diff --git a/onnxruntime/core/framework/session_state.cc b/onnxruntime/core/framework/session_state.cc index a6fe46be955ed..fbf0f50d37253 100644 --- a/onnxruntime/core/framework/session_state.cc +++ b/onnxruntime/core/framework/session_state.cc @@ -11,23 +11,96 @@ #include "core/framework/utils.h" using namespace ::onnxruntime::common; -namespace onnxruntime { -void SessionState::SetGraphViewer(std::unique_ptr graph_viewer) { - ORT_ENFORCE(nullptr != graph_viewer); - graph_viewer_ = std::move(graph_viewer); -} +namespace onnxruntime { const GraphViewer* SessionState::GetGraphViewer() const { return graph_viewer_.get(); } +Status SessionState::SetGraph(const Graph& graph) { + graph_viewer_ = std::make_unique(graph); + auto& logger = Logger(); + // use graph_viewer_ to initialize ort_value_name_idx_map_ + LOGS(logger, INFO) << "SaveMLValueNameIndexMapping"; + int idx = 0; + + // we keep all graph inputs (including initializers), even if they are unused, so make sure they all have an entry + for (const auto* input_def : graph_viewer_->GetInputsIncludingInitializers()) { + idx = ort_value_name_idx_map_.Add(input_def->Name()); + VLOGS(logger, 1) << "Added graph_viewer_ input with name: " << input_def->Name() + << " to OrtValueIndex with index: " << idx; + } + + for (auto& node : graph_viewer_->Nodes()) { + // build the OrtValue->index map + for (const auto* input_def : node.InputDefs()) { + if (input_def->Exists()) { + idx = ort_value_name_idx_map_.Add(input_def->Name()); + VLOGS(logger, 1) << "Added input argument with name: " << input_def->Name() + << " to OrtValueIndex with index: " << idx; + } + } + + for (const auto* input_def : node.ImplicitInputDefs()) { + if (input_def->Exists()) { + idx = ort_value_name_idx_map_.Add(input_def->Name()); + VLOGS(logger, 1) << "Added implicit input argument with name: " << input_def->Name() + << " to OrtValueIndex with index: " << idx; + } + } + + for (const auto* output_def : node.OutputDefs()) { + if (output_def->Exists()) { + ort_value_name_idx_map_.Add(output_def->Name()); + VLOGS(logger, 1) << "Added output argument with name: " << output_def->Name() + << " to OrtValueIndex with index: " << idx; + } + } + } + + // allocate OrtValue for graph outputs when coming from initializers + for (const auto& output : graph_viewer_->GetOutputs()) { + if (output->Exists()) { + idx = ort_value_name_idx_map_.Add(output->Name()); + VLOGS(logger, 1) << "Added graph output with name: " << output->Name() << " to OrtValueIndex with index: " << idx; + } + } -const OpKernel* 
SessionState::GetKernel(NodeIndex node_id) const { - auto kernel = session_kernels_.find(node_id); - return (kernel != session_kernels_.cend()) ? kernel->second.get() : nullptr; + LOGS(logger, INFO) << "Done saving OrtValue mappings."; + return Status::OK(); } -void SessionState::AddKernel(onnxruntime::NodeIndex node_id, std::unique_ptr p_kernel) { - // assumes vector is already resize()'ed to the number of nodes in the graph - session_kernels_[node_id] = std::move(p_kernel); +Status SessionState::CreateKernels(const KernelRegistryManager& custom_registry_manager) { + const GraphNodes& nodes = graph_viewer_->Nodes(); + if (!nodes.empty()) { + size_t max_nodeid = 0; + for (auto& node : graph_viewer_->Nodes()) { + max_nodeid = std::max(max_nodeid, node.Index()); + } + session_kernels_.clear(); + session_kernels_.resize(max_nodeid + 1, nullptr); + for (auto& node : graph_viewer_->Nodes()) { + // construct and save the kernels + std::unique_ptr op_kernel; + onnxruntime::ProviderType exec_provider_name = node.GetExecutionProviderType(); + + const IExecutionProvider* exec_provider = nullptr; + if (exec_provider_name.empty() || (exec_provider = execution_providers_.Get(exec_provider_name)) == nullptr) { + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Could not create kernel for node: ", node.Name(), + " as there's no execution provider allocated."); + } + + common::Status status = custom_registry_manager.CreateKernel(node, *exec_provider, *this, op_kernel); + if (!status.IsOK()) { + return common::Status( + status.Category(), status.Code(), + MakeString("Kernel creation failed for node: ", node.Name(), " with error: ", status.ErrorMessage())); + } + assert(session_kernels_[node.Index()] == nullptr); + // assumes vector is already resize()'ed to the number of nodes in the graph + session_kernels_[node.Index()] = op_kernel.release(); + } + } + node_index_info_ = std::make_unique(*graph_viewer_, ort_value_name_idx_map_); + return Status::OK(); } void SessionState::SetExecutionPlan(std::unique_ptr p_seq_exec_plan) { @@ -38,7 +111,6 @@ const SequentialExecutionPlan* SessionState::GetExecutionPlan() const { return p Status SessionState::AddInitializedTensor(int ort_value_index, const OrtValue& ort_value, const OrtCallback* d, bool constant) { - ORT_ENFORCE(ort_value_index >= 0 && ort_value_index <= ort_value_name_idx_map_.MaxIdx()); auto p = initialized_tensors_.insert({ort_value_index, ort_value}); if (!p.second) return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT, "duplicated ort_value index:", ort_value_index, @@ -55,9 +127,7 @@ Status SessionState::AddInitializedTensor(int ort_value_index, const OrtValue& o return Status::OK(); } -const std::unordered_map& SessionState::GetInitializedTensors() const { - return initialized_tensors_; -} +const std::unordered_map& SessionState::GetInitializedTensors() const { return initialized_tensors_; } const std::unordered_map& SessionState::GetConstantInitializedTensors() const { return constant_initialized_tensors_; @@ -86,7 +156,8 @@ static int64_t CalculateMemoryPatternsKey(const std::vector>& input_shapes) const { +const MemoryPatternGroup* SessionState::GetMemoryPatternGroup( + const std::vector>& input_shapes) const { int64_t key = CalculateMemoryPatternsKey(input_shapes); std::lock_guard lock(mem_patterns_lock_); @@ -96,8 +167,9 @@ const MemoryPatternGroup* SessionState::GetMemoryPatternGroup(const std::vector< return it->second.get(); } -Status SessionState::UpdateMemoryPatternGroupCache(const std::vector>& input_shapes, - std::unique_ptr mem_patterns) 
const { +Status SessionState::UpdateMemoryPatternGroupCache( + const std::vector>& input_shapes, + std::unique_ptr mem_patterns) const { int64_t key = CalculateMemoryPatternsKey(input_shapes); std::lock_guard lock(mem_patterns_lock_); @@ -109,9 +181,7 @@ Status SessionState::UpdateMemoryPatternGroupCache(const std::vectorName(), " (", current_provider, - ") and node ", node_info.p_node->Name(), " (", new_provider, ")."); + return ORT_MAKE_STATUS( + ONNXRUNTIME, NOT_IMPLEMENTED, + "Using an input in multiple nodes on different devices is not supported currently. Input:", input_name, + " is used by node ", existing_entry.p_node->Name(), " (", current_device->ToString(), ") and node ", + node_info.p_node->Name(), " (", new_device->ToString(), ")."); } } } @@ -178,16 +249,15 @@ const SessionState::NameNodeInfoMapType& SessionState::GetOutputNodeInfoMap() co return output_names_to_nodeinfo_mapping_; } -void SessionState::AddSubgraphSessionState(onnxruntime::NodeIndex index, - const std::string& attribute_name, +void SessionState::AddSubgraphSessionState(onnxruntime::NodeIndex index, const std::string& attribute_name, std::unique_ptr session_state) { auto entry = subgraph_session_states_.find(index); // make sure this is new. internal logic error if it is not so using ORT_ENFORCE. if (entry != subgraph_session_states_.cend()) { const auto& existing_entries = entry->second; - ORT_ENFORCE(existing_entries.find(attribute_name) == existing_entries.cend(), - "Entry exists in node ", index, " for attribute ", attribute_name); + ORT_ENFORCE(existing_entries.find(attribute_name) == existing_entries.cend(), "Entry exists in node ", index, + " for attribute ", attribute_name); } subgraph_session_states_[index].insert(std::make_pair(attribute_name, std::move(session_state))); @@ -215,19 +285,8 @@ const SessionState* SessionState::GetSubgraphSessionState(onnxruntime::NodeIndex return const_cast(this)->GetMutableSubgraphSessionState(index, attribute_name); } -void SessionState::CalculateNodeIndexInfo() { - ORT_ENFORCE(graph_viewer_); - node_index_info_ = std::make_unique(*graph_viewer_, ort_value_name_idx_map_); - - for (auto& node_to_map_pair : subgraph_session_states_) { - for (auto& attr_name_to_subgraph : node_to_map_pair.second) { - attr_name_to_subgraph.second->CalculateNodeIndexInfo(); - } - } -} - const NodeIndexInfo& SessionState::GetNodeIndexInfo() const { - ORT_ENFORCE(node_index_info_, "CalculateNodeIndexInfo must be called prior to GetExecutionInfo."); + ORT_ENFORCE(node_index_info_, "SetGraphAndCreateKernels must be called prior to GetExecutionInfo."); return *node_index_info_; } } // namespace onnxruntime diff --git a/onnxruntime/core/framework/session_state.h b/onnxruntime/core/framework/session_state.h index dfec27108257a..b9e4c08900ddc 100644 --- a/onnxruntime/core/framework/session_state.h +++ b/onnxruntime/core/framework/session_state.h @@ -20,7 +20,7 @@ #include "core/framework/kernel_registry_manager.h" #include "core/framework/mem_pattern.h" #include "core/framework/ml_value.h" -#include "core/common/callback.h" +#include "core/framework/callback.h" #include "core/framework/ort_value_name_idx_map.h" #include "core/framework/node_index_info.h" #include "core/graph/graph_viewer.h" @@ -40,33 +40,41 @@ struct MemoryPatternGroup; * SessionState should be modified by the inference session class only. * It is supposed to be passed by const-ref only to all the executors. * This class owns all the initializers. + * Brief usage: + * SessionState s(...); + * for(...) 
s.AddInitializedTensor(...); + * s.SetGraphAndCreateKernels(...); + * Then you can use: + * s.GetKernel(...); */ class SessionState { public: - SessionState(const ExecutionProviders& execution_providers, bool enable_mem_pattern) - : execution_providers_{execution_providers}, enable_mem_pattern_(enable_mem_pattern) {} + SessionState(const ExecutionProviders& execution_providers, bool enable_mem_pattern, + concurrency::ThreadPool* thread_pool) + : execution_providers_{execution_providers}, enable_mem_pattern_(enable_mem_pattern), thread_pool_(thread_pool) {} ~SessionState() { + for (auto* p : session_kernels_) { + delete p; + } for (auto& kvp : deleter_for_initialized_tensors_) { kvp.second.f(kvp.second.param); } } // Graph viewer. - void SetGraphViewer(std::unique_ptr graph_viewer); const GraphViewer* GetGraphViewer() const; // kernels // Get kernel for specified node. // It should called right before graph execution only. - const OpKernel* GetKernel(NodeIndex node_id) const; - - void AddKernel(NodeIndex node_id, std::unique_ptr p_kernel); + const OpKernel* GetKernel(size_t node_id) const { + return (node_id < session_kernels_.size()) ? session_kernels_[node_id] : nullptr; + } const ExecutionProviders& GetExecutionProviders() const noexcept { return execution_providers_; } const OrtValueNameIdxMap& GetOrtValueNameIdxMap() const noexcept { return ort_value_name_idx_map_; } - OrtValueNameIdxMap& GetOrtValueNameIdxMap() noexcept { return ort_value_name_idx_map_; } // initialized tensors /** @@ -77,6 +85,12 @@ class SessionState { */ Status AddInitializedTensor(int ort_value_index, const OrtValue& ort_value, const OrtCallback* d, bool constant); + Status SetGraph(const Graph& graph); + Status CreateKernels(const KernelRegistryManager& custom_registry_manager); + Status SetGraphAndCreateKernels(const Graph& graph, const KernelRegistryManager& custom_registry_manager) { + ORT_RETURN_IF_ERROR(SetGraph(graph)); + return CreateKernels(custom_registry_manager); + } /** * Gets the map of ort_value_index to initialized tensors (weights) so that it can be used by the * execution frame to setup the appropriate OrtValue vectors. @@ -85,8 +99,8 @@ class SessionState { const std::unordered_map& GetInitializedTensors() const; /** - * Gets the map of ort_value_index to initialized tensors (e.g. weights) that are constant - * and cannot be overridden at runtime. + * Gets the map of ort_value_index to initialized tensors (e.g. weights) that are constant + * and cannot be overridden at runtime. * The lifetime of returned OrtValues are limited by this SessionState object. */ const std::unordered_map& GetConstantInitializedTensors() const; @@ -96,12 +110,12 @@ class SessionState { const SequentialExecutionPlan* GetExecutionPlan() const; /** - Set the logger to use for this session. + Set the logger to use for this session. */ SessionState& SetLogger(const logging::Logger& logger); /** - Get the logger for this session. + Get the logger for this session. Falls back to returning Logging::LoggingManager::DefaultLogger if SetLogger has not been called. */ const logging::Logger& Logger() const; @@ -120,10 +134,11 @@ class SessionState { /** Get cached memory pattern based on input shapes */ - const MemoryPatternGroup* GetMemoryPatternGroup(const std::vector>& input_shapes) const; + const MemoryPatternGroup* GetMemoryPatternGroup( + const std::vector>& input_shapes) const; /** - Set generated memory pattern with a given input shapes. + Set generated memory pattern with a given input shapes. 
Const as it's an internal cache update only. */ Status UpdateMemoryPatternGroupCache(const std::vector>& input_shape, @@ -141,17 +156,15 @@ class SessionState { * \param p_node0 Nullable * \param kci0 Nullable */ - NodeInfo(size_t index0, const onnxruntime::Node* p_node0, const KernelCreateInfo* kci0) - : index(index0), - p_node(p_node0), - kci(kci0) { - } + NodeInfo(size_t index0, const onnxruntime::Node* p_node0, const KernelCreateInfo* kci0, const OrtDevice& device0) + : index(index0), p_node(p_node0), kci(kci0), device(&device0) {} size_t index; // Nullable const onnxruntime::Node* p_node = nullptr; // Nullable const KernelCreateInfo* kci = nullptr; + const OrtDevice* device = nullptr; }; using NameNodeInfoMapType = std::unordered_map>; @@ -174,8 +187,7 @@ class SessionState { SessionState* GetMutableSubgraphSessionState(onnxruntime::NodeIndex index, const std::string& attribute_name); - onnxruntime::concurrency::ThreadPool* GetThreadPool() const { return thread_pool_; } - void SetThreadPool(onnxruntime::concurrency::ThreadPool* p_pool) { thread_pool_ = p_pool; } + concurrency::ThreadPool* GetThreadPool() const { return thread_pool_; } bool ExportDll() const { return export_fused_dll_; } void SetExportDllFlag(bool flag) { export_fused_dll_ = flag; } @@ -187,7 +199,6 @@ class SessionState { void SetDataTransferMgr(const DataTransferManager* data_transfer_mgr) { data_transfer_mgr_ = data_transfer_mgr; } std::vector& GetMutableWeightsBuffers() { return weights_buffers_; } - void CalculateNodeIndexInfo(); const NodeIndexInfo& GetNodeIndexInfo() const; private: @@ -195,7 +206,7 @@ class SessionState { // cache of the constructed kernels to avoid spending construction // time per executor - std::unordered_map> session_kernels_; + std::vector session_kernels_; std::unique_ptr graph_viewer_; const ExecutionProviders& execution_providers_; // owned by InferenceSession @@ -231,7 +242,8 @@ class SessionState { std::unordered_map>>; SubgraphSessionStateMap subgraph_session_states_; - onnxruntime::concurrency::ThreadPool* thread_pool_ = nullptr; + // It could be NULL + concurrency::ThreadPool* const thread_pool_; bool export_fused_dll_ = false; FuncManager fused_funcs_mgr_; diff --git a/onnxruntime/core/framework/session_state_initializer.cc b/onnxruntime/core/framework/session_state_initializer.cc index 3f4777d8608d0..18589de82679a 100644 --- a/onnxruntime/core/framework/session_state_initializer.cc +++ b/onnxruntime/core/framework/session_state_initializer.cc @@ -27,9 +27,6 @@ namespace onnxruntime { -static common::Status SaveMLValueNameIndexMapping(const GraphViewer& graph_viewer, - OrtValueNameIdxMap& ort_value_name_idx_map, - const logging::Logger& logger); // T should have signature of '(int idx, const OrtValue& value, const OrtCallback& d) -> Status' template @@ -40,11 +37,6 @@ static common::Status SaveInitializedTensors(const Env& env, const std::basic_st const logging::Logger& logger, const DataTransferManager& data_transfer_mgr); -static common::Status SaveKernels(const ExecutionProviders& execution_providers, - SessionState& session_state, - const KernelRegistryManager& custom_registry_manager, - const logging::Logger& logger); - static common::Status SaveInputOutputNamesToNodeMapping( const onnxruntime::Graph& graph, const KernelRegistryManager& custom_registry_manager, @@ -68,11 +60,11 @@ common::Status SessionStateInitializer::CreatePlan( const Node* parent_node, const ConstPointerContainer>* outer_scope_node_args, bool enable_sequential_execution) { - auto graph_viewer = 
std::make_unique(graph_); + session_state_.SetGraph(graph_); + const GraphViewer* graph_viewer = session_state_.GetGraphViewer(); // populate the SessionState OrtValueNameIdxMap - auto& ort_value_name_idx_map = session_state_.GetOrtValueNameIdxMap(); - ORT_RETURN_IF_ERROR(SaveMLValueNameIndexMapping(*graph_viewer, ort_value_name_idx_map, logger_)); + const auto& ort_value_name_idx_map = session_state_.GetOrtValueNameIdxMap(); // ignore any outer scope args we don't know about. this can happen if a node contains multiple subgraphs. std::vector valid_outer_scope_node_args; @@ -92,17 +84,10 @@ common::Status SessionStateInitializer::CreatePlan( execution_providers_, kernel_registry_manager_, ort_value_name_idx_map, context, exec_plan)); session_state_.SetExecutionPlan(std::move(exec_plan)); - session_state_.SetGraphViewer(std::move(graph_viewer)); - return Status::OK(); -} - -common::Status SessionStateInitializer::InitializeAndSave( - const ConstPointerContainer>* implicit_inputs) { const auto* exec_plan_ptr = session_state_.GetExecutionPlan(); ORT_ENFORCE(exec_plan_ptr, "Execution plan was not found in SessionState. CreatePlan must be called first."); - const auto& ort_value_name_idx_map{session_state_.GetOrtValueNameIdxMap()}; std::unique_ptr tensor_allocator_(ITensorAllocator::Create( enable_mem_pattern_, *exec_plan_ptr, execution_providers_, session_state_.GetMutableWeightsBuffers())); @@ -119,64 +104,12 @@ common::Status SessionStateInitializer::InitializeAndSave( // TODO: make it better graph_.CleanAllInitializedTensors(); - ORT_RETURN_IF_ERROR(SaveKernels(execution_providers_, session_state_, kernel_registry_manager_, logger_)); - ORT_RETURN_IF_ERROR(SaveInputOutputNamesToNodeMapping(graph_, kernel_registry_manager_, session_state_, - implicit_inputs)); - + ORT_RETURN_IF_ERROR(session_state_.CreateKernels(kernel_registry_manager_)); + ORT_RETURN_IF_ERROR( + SaveInputOutputNamesToNodeMapping(graph_, kernel_registry_manager_, session_state_, outer_scope_node_args)); return Status::OK(); } -// Build the OrtValue name->idx mapping -common::Status SaveMLValueNameIndexMapping(const GraphViewer& graph_viewer, OrtValueNameIdxMap& ort_value_name_idx_map, - const logging::Logger& logger) { - LOGS(logger, INFO) << "SaveMLValueNameIndexMapping"; - int idx = 0; - - // we keep all graph inputs (including initializers), even if they are unused, so make sure they all have an entry - for (const auto* input_def : graph_viewer.GetInputsIncludingInitializers()) { - idx = ort_value_name_idx_map.Add(input_def->Name()); - VLOGS(logger, 1) << "Added graph_viewer input with name: " << input_def->Name() - << " to OrtValueIndex with index: " << idx; - } - - for (auto& node : graph_viewer.Nodes()) { - // build the OrtValue->index map - for (const auto* input_def : node.InputDefs()) { - if (input_def->Exists()) { - idx = ort_value_name_idx_map.Add(input_def->Name()); - VLOGS(logger, 1) << "Added input argument with name: " << input_def->Name() - << " to OrtValueIndex with index: " << idx; - } - } - - for (const auto* input_def : node.ImplicitInputDefs()) { - if (input_def->Exists()) { - idx = ort_value_name_idx_map.Add(input_def->Name()); - VLOGS(logger, 1) << "Added implicit input argument with name: " << input_def->Name() - << " to OrtValueIndex with index: " << idx; - } - } - - for (const auto* output_def : node.OutputDefs()) { - if (output_def->Exists()) { - ort_value_name_idx_map.Add(output_def->Name()); - VLOGS(logger, 1) << "Added output argument with name: " << output_def->Name() - << " to OrtValueIndex 
with index: " << idx; - } - } - } - - // allocate OrtValue for graph outputs when coming from initializers - for (const auto& output : graph_viewer.GetOutputs()) { - if (output->Exists()) { - idx = ort_value_name_idx_map.Add(output->Name()); - VLOGS(logger, 1) << "Added graph output with name: " << output->Name() << " to OrtValueIndex with index: " << idx; - } - } - - LOGS(logger, INFO) << "Done saving OrtValue mappings."; - return Status::OK(); -} static common::Status DeserializeTensorProto(const Env& env, const std::basic_string& proto_path, const ONNX_NAMESPACE::TensorProto& tensor_proto, const MemBuffer& m, @@ -292,46 +225,6 @@ common::Status SaveInitializedTensors(const Env& env, const std::basic_string& op_kernel) { - onnxruntime::ProviderType exec_provider_name = node.GetExecutionProviderType(); - - const IExecutionProvider* exec_provider = nullptr; - if (exec_provider_name.empty() || (exec_provider = execution_providers.Get(exec_provider_name)) == nullptr) { - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Could not create kernel for node: ", node.Name(), - " as there's no execution provider allocated."); - } - - common::Status status = custom_registry_manager.CreateKernel(node, *exec_provider, session_state, op_kernel); - if (!status.IsOK()) { - return common::Status( - status.Category(), status.Code(), - MakeString("Kernel creation failed for node: ", node.Name(), " with error: ", status.ErrorMessage())); - } - - return status; -} - -common::Status SaveKernels(const ExecutionProviders& execution_providers, - SessionState& session_state, - const KernelRegistryManager& custom_registry_manager, - const logging::Logger& logger) { - LOGS(logger, INFO) << "Saving kernels."; - - for (auto& node : session_state.GetGraphViewer()->Nodes()) { - // construct and save the kernels - std::unique_ptr op_kernel; - ORT_RETURN_IF_ERROR(CreateOpKernel(node, execution_providers, session_state, custom_registry_manager, op_kernel)); - session_state.AddKernel(node.Index(), std::move(op_kernel)); - } - - LOGS(logger, INFO) << "Done saving kernels."; - - return Status::OK(); -} - template // T is container of const NodeArg* or NodeArg* static bool IsArgNameInInputsOutputs(const std::string& name, const T& graph_args) { @@ -351,6 +244,8 @@ common::Status SaveInputOutputNamesToNodeMapping(const onnxruntime::Graph& graph if (implicit_inputs && implicit_inputs->empty()) { implicit_inputs = nullptr; } + const auto* exec_plan = session_state.GetExecutionPlan(); + const auto& name_to_id = session_state.GetOrtValueNameIdxMap(); for (auto& node : graph.Nodes()) { // note that KernelCreateInfo may not exist for custom kernel @@ -365,7 +260,11 @@ common::Status SaveInputOutputNamesToNodeMapping(const onnxruntime::Graph& graph return Status::OK(); } - SessionState::NodeInfo node_info(index, &node, kci); + int arg_index; + ORT_RETURN_IF_ERROR(name_to_id.GetIdx(arg.Name(), arg_index)); + const auto& device = exec_plan->GetLocation(arg_index).device; + + SessionState::NodeInfo node_info(index, &node, kci, device); if (IsArgNameInInputsOutputs(arg.Name(), graph_inputs)) { ORT_RETURN_IF_ERROR(session_state.AddInputNameToNodeInfoMapping(arg.Name(), node_info)); @@ -397,8 +296,13 @@ common::Status SaveInputOutputNamesToNodeMapping(const onnxruntime::Graph& graph // copy to/from CPU to go through the control flow nodes where possible/applicable. 
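A note for reviewers on the device plumbing in this hunk: each NodeInfo now records the OrtDevice that the allocation plan assigned to the argument, resolved through the OrtValueNameIdxMap. Below is a minimal, self-contained C++ sketch of that lookup chain; the types are simplified stand-ins carrying the same names as the ORT classes, not the real declarations.

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for OrtDevice, SequentialExecutionPlan and OrtValueNameIdxMap;
// the real classes live in core/framework, these exist only to make the sketch compile.
struct Device { int type = 0; int id = 0; };
struct ExecutionPlan { std::vector<Device> locations; };   // indexed by ort_value index
struct NameIdxMap {
  std::unordered_map<std::string, int> map;
  int GetIdx(const std::string& name) const { return map.at(name); }
};

struct NodeInfo {            // mirrors SessionState::NodeInfo after this change
  size_t index;
  const Device* device;      // device the allocation plan assigned to this argument
};

// name -> ort_value index -> allocation-plan location -> device, the same chain
// SaveInputOutputNamesToNodeMapping now follows for inputs, implicit inputs and
// unused graph inputs.
NodeInfo MakeNodeInfo(size_t node_index, const std::string& input_name,
                      const NameIdxMap& name_to_id, const ExecutionPlan& plan) {
  const int arg_index = name_to_id.GetIdx(input_name);
  return NodeInfo{node_index, &plan.locations.at(arg_index)};
}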
// the processing for the subgraph where the implicit input is consumed will do the real check on whether any // copy to a different device is required - SessionState::NodeInfo node_info(std::numeric_limits::max(), &node, kci); for (const auto& input_def : node_implicit_inputs) { + int arg_index; + //Question: the implicit input may not be found in this session state name to id map, but in parent session state name to id map. + //@Scott + ORT_RETURN_IF_ERROR(name_to_id.GetIdx(input_def->Name(), arg_index)); + auto& device = exec_plan->GetLocation(arg_index).device; + SessionState::NodeInfo node_info(std::numeric_limits::max(), &node, kci, device); ORT_RETURN_IF_ERROR(session_state.AddInputNameToNodeInfoMapping(input_def->Name(), node_info)); } } @@ -413,7 +317,6 @@ common::Status SaveInputOutputNamesToNodeMapping(const onnxruntime::Graph& graph auto& input_map = session_state.GetInputNodeInfoMap(); auto end_map = input_map.cend(); - SessionState::NodeInfo empty_node_info(std::numeric_limits::max(), nullptr, nullptr); for (const auto& graph_input : graph_inputs) { const auto& name = graph_input->Name(); @@ -422,6 +325,10 @@ common::Status SaveInputOutputNamesToNodeMapping(const onnxruntime::Graph& graph // utils::CopyOneInputAcrossDevices will use the input OrtValue as is given we don't believe it's used anywhere. LOGS(session_state.Logger(), INFO) << (graph.IsSubgraph() ? "Subgraph" : "Graph") << " input with name " << name << " is not used by any node."; + int arg_index; + ORT_RETURN_IF_ERROR(name_to_id.GetIdx(name, arg_index)); + auto& device = exec_plan->GetLocation(arg_index).device; + SessionState::NodeInfo empty_node_info(std::numeric_limits::max(), nullptr, nullptr, device); ORT_RETURN_IF_ERROR(session_state.AddInputNameToNodeInfoMapping(name, empty_node_info)); } } diff --git a/onnxruntime/core/framework/session_state_initializer.h b/onnxruntime/core/framework/session_state_initializer.h index 3634704de5e2a..8c969571c558a 100644 --- a/onnxruntime/core/framework/session_state_initializer.h +++ b/onnxruntime/core/framework/session_state_initializer.h @@ -36,14 +36,11 @@ class SessionStateInitializer { KernelRegistryManager& kernel_registry_manager); // First perform any transformations and create the execution plan - common::Status CreatePlan(const Node* parent_node, - const ConstPointerContainer>* outer_scope_node_args, + // Then initialize tensors, and save. save kernels and input/output node mappings + common::Status CreatePlan(_In_opt_ const Node* parent_node, + _In_opt_ const ConstPointerContainer>* outer_scope_node_args, bool enable_sequential_execution); - // initialize tensors, and save. 
save kernels and input/output node mappings - // \param implicit_inputs could be NULL - common::Status InitializeAndSave(const ConstPointerContainer>* implicit_inputs); - private: const std::basic_string& graph_loc_; onnxruntime::Graph& graph_; diff --git a/onnxruntime/core/framework/tensor.cc b/onnxruntime/core/framework/tensor.cc index d0085c0fe6c1a..692232a6a8abc 100644 --- a/onnxruntime/core/framework/tensor.cc +++ b/onnxruntime/core/framework/tensor.cc @@ -47,7 +47,7 @@ void Tensor::Init(MLDataType p_type, const TensorShape& shape, void* p_raw_data, byte_offset_ = offset; } -Tensor::Tensor(Tensor&& other) +Tensor::Tensor(Tensor&& other) noexcept : p_data_(other.p_data_), buffer_deleter_(other.buffer_deleter_), shape_(other.shape_), @@ -61,7 +61,7 @@ Tensor::Tensor(Tensor&& other) other.byte_offset_ = 0; } -Tensor& Tensor::operator=(Tensor&& other) { +Tensor& Tensor::operator=(Tensor&& other) noexcept { if (this != &other) { ReleaseBuffer(); diff --git a/onnxruntime/core/framework/tensor_shape.cc b/onnxruntime/core/framework/tensor_shape.cc index b37c2e9499c8a..72acfd7921975 100644 --- a/onnxruntime/core/framework/tensor_shape.cc +++ b/onnxruntime/core/framework/tensor_shape.cc @@ -8,16 +8,8 @@ namespace onnxruntime { -TensorShape::TensorShape(const std::vector& dims) : std::vector(dims) { -} - -TensorShape::TensorShape(std::vector&& dims) : std::vector(std::move(dims)) { -} - -TensorShape::TensorShape(const std::initializer_list& dims) : std::vector(dims) { -} - -TensorShape::TensorShape(const int64_t* dimension_sizes, size_t dimension_count) : std::vector(dimension_count) { +TensorShape::TensorShape(const int64_t* dimension_sizes, size_t dimension_count) + : std::vector(dimension_count) { for (size_t i = 0; i < dimension_count; ++i) { (*this)[i] = dimension_sizes[i]; } diff --git a/onnxruntime/core/framework/tensorprotoutils.cc b/onnxruntime/core/framework/tensorprotoutils.cc index ce7fc4e91d286..768045179b5be 100644 --- a/onnxruntime/core/framework/tensorprotoutils.cc +++ b/onnxruntime/core/framework/tensorprotoutils.cc @@ -14,7 +14,7 @@ #include "core/framework/tensor.h" #include "core/framework/ort_value_pattern_planner.h" #include "core/framework/allocator.h" -#include "core/common/callback.h" +#include "core/framework/callback.h" #include "core/framework/data_types.h" #include "core/framework/path_lib.h" @@ -304,7 +304,7 @@ ORT_API_STATUS(OrtInitializeBufferForTensor, _In_opt_ void* input, size_t input_ */ ORT_API(void, OrtUninitializeBuffer, _In_opt_ void* input, size_t input_len, enum ONNXTensorElementDataType type); -static void ORT_API_CALL UnInitTensor(void* param) noexcept { +static void UnInitTensor(void* param) noexcept { UnInitializeParam* p = reinterpret_cast(param); OrtUninitializeBuffer(p->preallocated, p->preallocated_size, p->ele_type); delete p; diff --git a/onnxruntime/core/framework/utils.cc b/onnxruntime/core/framework/utils.cc index b0171f25be843..cb126236d15e9 100644 --- a/onnxruntime/core/framework/utils.cc +++ b/onnxruntime/core/framework/utils.cc @@ -16,16 +16,51 @@ #include "core/framework/parallel_executor.h" #include "core/framework/session_state.h" #include "core/framework/sequential_executor.h" +#include "core/mlas/inc/mlas.h" namespace onnxruntime { namespace utils { +void* DefaultAlloc(size_t size) { + if (size <= 0) return nullptr; + void* p; + size_t alignment = MlasGetPreferredBufferAlignment(); +#if _MSC_VER + p = _aligned_malloc(size, alignment); + if (p == nullptr) throw std::bad_alloc(); +#elif defined(_LIBCPP_SGX_CONFIG) + p = 
memalign(alignment, size); + if (p == nullptr) throw std::bad_alloc(); +#else + int ret = posix_memalign(&p, alignment, size); + if (ret != 0) throw std::bad_alloc(); +#endif + return p; +} + +void DefaultFree(void* p) { +#if _MSC_VER + _aligned_free(p); +#else + free(p); +#endif +} + AllocatorPtr GetAllocator(const SessionState& session_state, const OrtAllocatorInfo& allocator_info) { return session_state.GetExecutionProviders().GetAllocator(allocator_info); } -common::Status AllocateHelper(const IExecutionProvider& execution_provider, int device_id, const Tensor& fetched_tensor, +bool ProviderIsCpuBased(const std::string& provider_type) { + return provider_type == onnxruntime::kCpuExecutionProvider || + provider_type == onnxruntime::kMklDnnExecutionProvider || + provider_type == onnxruntime::kNGraphExecutionProvider || + provider_type == onnxruntime::kNupharExecutionProvider || + provider_type == onnxruntime::kOpenVINOExecutionProvider || + provider_type == onnxruntime::kNnapiExecutionProvider; +} + +common::Status AllocateHelper(const IExecutionProvider& execution_provider, const OrtDevice& device, const Tensor& fetched_tensor, OrtValue& output_mlvalue) { - auto allocator = execution_provider.GetAllocator(device_id, OrtMemTypeDefault); + auto allocator = execution_provider.GetAllocator(device.Id(), OrtMemTypeDefault); if (!allocator) { return Status(common::ONNXRUNTIME, common::FAIL, "invalid allocator"); } @@ -62,20 +97,20 @@ static Status CopyMLValue(const DataTransferManager& data_transfer_mgr, const FeedsFetchesManager::MLValueCopyInfo& copy_info, const OrtValue& source_mlvalue, OrtValue& target_mlvalue) { - if (copy_info.copy_provider == nullptr) { + if (copy_info.allocation_provider == nullptr) { target_mlvalue = source_mlvalue; - } else { - auto& source_tensor = source_mlvalue.Get(); + return Status::OK(); + } - if (!target_mlvalue.IsAllocated()) { - ORT_RETURN_IF_ERROR(utils::AllocateHelper(*copy_info.allocation_provider, copy_info.allocation_device_id, - source_tensor, target_mlvalue)); - } + auto& source_tensor = source_mlvalue.Get(); + if (!target_mlvalue.IsAllocated()) { + ORT_RETURN_IF_ERROR(utils::AllocateHelper(*copy_info.allocation_provider, copy_info.target_device, + source_tensor, target_mlvalue)); + } - Tensor* p_output_tensor = target_mlvalue.GetMutable(); + Tensor* p_output_tensor = target_mlvalue.GetMutable(); - ORT_RETURN_IF_ERROR(data_transfer_mgr.CopyTensor(source_tensor, *p_output_tensor)); - } + ORT_RETURN_IF_ERROR(data_transfer_mgr.CopyTensor(source_tensor, *p_output_tensor)); return Status::OK(); } @@ -86,8 +121,6 @@ common::Status CopyOneInputAcrossDevices(const SessionState& session_state, cons FeedsFetchesManager::MLValueCopyInfo& copy_info) { needed_copy = false; - //TODO: make it configurable - const int target_device_id = 0; std::vector node_info_vec; ORT_RETURN_IF_ERROR(session_state.GetInputNodeInfo(input_name, node_info_vec)); @@ -111,51 +144,23 @@ common::Status CopyOneInputAcrossDevices(const SessionState& session_state, cons break; } - auto& required_provider_type = GetNodeInputProviderType(node_info); - auto& input_tensor = orig_mlvalue.Get(); - auto& input_tensor_loc = input_tensor.Location(); - - auto* p_input_provider = exec_providers.Get(input_tensor_loc); - if (!p_input_provider) { - p_input_provider = exec_providers.Get(onnxruntime::kCpuExecutionProvider); - ORT_ENFORCE(p_input_provider); - } - - //no copy for TRT and nGraph - if (required_provider_type == onnxruntime::kTensorrtExecutionProvider || required_provider_type == 
onnxruntime::kNGraphExecutionProvider) { - new_mlvalue = orig_mlvalue; - break; - } - - auto input_provider_type = p_input_provider->Type(); - if (input_provider_type == required_provider_type && input_tensor_loc.mem_type == OrtMemTypeDefault) { - new_mlvalue = orig_mlvalue; - break; - } - - // If a node requires input on cpu and input tensor is allocated with pinned memory allocator, don't do copy - if (required_provider_type == onnxruntime::kCpuExecutionProvider && - input_tensor_loc.mem_type == OrtMemTypeCPU) { + auto& required_device = *node_info.device; + auto& input_tensor_device = orig_mlvalue.Get().Location().device; + if (required_device == input_tensor_device) { + // No copy needed for same device. new_mlvalue = orig_mlvalue; break; } + auto& required_provider_type = GetNodeInputProviderType(node_info); auto* required_provider = exec_providers.Get(required_provider_type); - ORT_ENFORCE(required_provider); - - auto* p_copy_provider = (required_provider_type != onnxruntime::kCpuExecutionProvider) - ? required_provider - : p_input_provider; - - copy_info.allocation_device_id = target_device_id; + copy_info.target_device = required_device; copy_info.allocation_provider = required_provider; - copy_info.copy_provider = p_copy_provider; ORT_RETURN_IF_ERROR(CopyMLValue(session_state.GetDataTransferMgr(), copy_info, orig_mlvalue, new_mlvalue)); needed_copy = true; - // } loop of node_info_vec } while (false); return Status::OK(); @@ -223,18 +228,16 @@ static common::Status CachedCopyInputsAcrossDevices( // Setup fetches for execution. Use any provided fetches directly if the provider matches. // If the provider doesn't match, we don't know what device the execution output may be on, so can't assume the output // can be returned to the user directly. -// TODO: We should be able to use the allocation plan to know which device an output will be on. static common::Status SetupFetchesForExecute(const SessionState& session_state, const std::vector& output_names, std::vector& fetches, std::vector& new_fetches, std::vector* copy_to_new_fetches_cached_values) { ORT_ENFORCE(new_fetches.empty()); - - const auto& execution_providers = session_state.GetExecutionProviders(); auto num_outputs = output_names.size(); - new_fetches.resize(num_outputs); + const auto& name_to_id = session_state.GetOrtValueNameIdxMap(); + const auto* exec_plan = session_state.GetExecutionPlan(); // track which fetches can be copied to new_fetches and used directly in the execution. 
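The rewritten CopyOneInputAcrossDevices above reduces the copy decision to a single OrtDevice comparison instead of matching provider types and memory types. A small compilable sketch of that predicate follows; Device and InputCopyInfo are stand-ins for OrtDevice and the relevant MLValueCopyInfo fields, used here only to keep the sketch self-contained.

#include <string>

struct Device {                 // stand-in for OrtDevice, which compares by value in the hunk above
  int type = 0;                 // e.g. CPU vs GPU
  int id = 0;
  bool operator==(const Device& other) const { return type == other.type && id == other.id; }
};

struct InputCopyInfo {          // the two fields the hunk now fills in on MLValueCopyInfo
  Device target_device;
  std::string allocation_provider_type;
};

// Mirrors the new CopyOneInputAcrossDevices decision: copy only when the device the
// consuming node was planned on (NodeInfo::device) differs from where the tensor lives.
bool PlanInputCopy(const Device& required_device, const std::string& required_provider_type,
                   const Device& input_tensor_device, InputCopyInfo& copy_info) {
  if (required_device == input_tensor_device) return false;  // same device: reuse the OrtValue
  copy_info.target_device = required_device;
  copy_info.allocation_provider_type = required_provider_type;
  return true;                                               // caller performs the actual CopyTensor
}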
std::vector local_can_copy_flags(num_outputs, false); @@ -275,16 +278,12 @@ static common::Status SetupFetchesForExecute(const SessionState& session_state, continue; } - const auto& node_provider_type = node.GetExecutionProviderType(); - const auto& provided_tensor = provided_mlvalue.Get(); - const auto& provided_tensor_loc = provided_tensor.Location(); - const auto* tensor_provider = execution_providers.Get(provided_tensor_loc); - if (!tensor_provider) { - tensor_provider = execution_providers.Get(onnxruntime::kCpuExecutionProvider); - } + int arg_index; + ORT_RETURN_IF_ERROR(name_to_id.GetIdx(arg->Name(), arg_index)); + const auto& planned_device = exec_plan->GetLocation(arg_index).device; + const auto& provided_tensor_device = provided_mlvalue.Get().Location().device; - auto tensor_provider_type = tensor_provider->Type(); - if (node_provider_type == tensor_provider_type) { + if (planned_device == provided_tensor_device) { new_fetches[idx] = fetches[idx]; local_can_copy_flags[idx] = true; continue; @@ -344,43 +343,26 @@ static common::Status CopyOutputsAcrossDevices(const SessionState& session_state continue; } - auto& fetched_tensor = fetched_mlvalue.Get(); - auto& fetched_tensor_location = fetched_tensor.Location(); - auto* p_fetched_provider = execution_providers.Get(fetched_tensor_location); - if (!p_fetched_provider) { - p_fetched_provider = cpu_execution_provider; - } - - auto fetched_provider_type = p_fetched_provider->Type(); - auto& output_mlvalue = user_fetches[idx]; - const IExecutionProvider* p_output_provider = nullptr; - + auto target_device = OrtDevice(); + auto& output_mlvalue = user_fetches[idx]; if (output_mlvalue.IsAllocated()) { Tensor* p_output_tensor = output_mlvalue.GetMutable(); + target_device = p_output_tensor->Location().device; p_output_provider = execution_providers.Get(p_output_tensor->Location()); } + auto fetch_result_device = fetched_mlvalue.Get().Location().device; + if (target_device == fetch_result_device) { + user_fetches[idx] = fetched_mlvalue; + continue; + } if (!p_output_provider) { p_output_provider = cpu_execution_provider; } - auto output_provider_type = p_output_provider->Type(); - - if (fetched_provider_type == output_provider_type || - (p_output_provider == cpu_execution_provider && fetched_tensor_location.mem_type == OrtMemTypeCPUOutput)) { - user_fetches[idx] = fetched_mlvalue; - continue; - } - needed_copy = true; - - auto* p_copy_provider = (fetched_provider_type != onnxruntime::kCpuExecutionProvider) - ? p_fetched_provider - : p_output_provider; - - const int device_id = 0; // TODO: As per comment in the copy input code, make this configurable. 
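The output path gets the same treatment: MLValueCopyInfo now carries a target OrtDevice plus the allocation provider, and the separate copy provider and hard-coded device id 0 are gone. Below is a sketch of the decision CopyOutputsAcrossDevices makes, under the assumption taken from the hunk that an unallocated user output falls back to a default-constructed OrtDevice; the types are stand-ins, not the real ORT declarations.

#include <optional>

struct Device {                                // stand-in for OrtDevice; the default-constructed
  int type = 0;                                // value plays the role OrtDevice() plays in the hunk
  int id = 0;
  bool operator==(const Device& other) const { return type == other.type && id == other.id; }
};

struct CopyInfo {                              // shape of FeedsFetchesManager::MLValueCopyInfo after this change
  Device target_device;                        // a device, no longer a bare device id
  const void* allocation_provider = nullptr;   // opaque stand-in for IExecutionProvider*
};

// Returns the copy to perform, or nothing when the fetched value can be returned as-is.
// user_output_device is empty when the caller did not pre-allocate the output.
std::optional<CopyInfo> PlanOutputCopy(const std::optional<Device>& user_output_device,
                                       const Device& fetch_result_device,
                                       const void* output_provider) {
  const Device target = user_output_device.value_or(Device{});
  if (target == fetch_result_device) return std::nullopt;     // devices match: hand the fetch back
  return CopyInfo{target, output_provider};
}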
- FeedsFetchesManager::MLValueCopyInfo copy_info{device_id, p_output_provider, p_copy_provider}; + FeedsFetchesManager::MLValueCopyInfo copy_info{target_device, p_output_provider}; ORT_RETURN_IF_ERROR(CopyMLValue(session_state.GetDataTransferMgr(), copy_info, fetched_mlvalue, output_mlvalue)); if (copiers) { @@ -410,11 +392,7 @@ static common::Status CachedCopyOutputsAcrossDevices( static DeviceCopyCheck CheckExecutionProviders(const ExecutionProviders& execution_providers) { for (const auto& execution_provider : execution_providers) { - if (execution_provider->Type() != onnxruntime::kCpuExecutionProvider && - execution_provider->Type() != onnxruntime::kMklDnnExecutionProvider && - execution_provider->Type() != onnxruntime::kNGraphExecutionProvider && - execution_provider->Type() != onnxruntime::kNupharExecutionProvider && - execution_provider->Type() != onnxruntime::kOpenVINOExecutionProvider) { + if (!ProviderIsCpuBased(execution_provider->Type())) { return DeviceCopyCheck::Unknown; } } diff --git a/onnxruntime/core/framework/utils.h b/onnxruntime/core/framework/utils.h index b096f1ecbaf8b..881762da85a2d 100644 --- a/onnxruntime/core/framework/utils.h +++ b/onnxruntime/core/framework/utils.h @@ -25,6 +25,8 @@ class Logger; } namespace utils { +void* DefaultAlloc(size_t size); +void DefaultFree(void* p); AllocatorPtr GetAllocator(const SessionState& session_state, const OrtAllocatorInfo& allocator_info); diff --git a/onnxruntime/core/graph/automl_ops/automl_defs.cc b/onnxruntime/core/graph/automl_ops/automl_defs.cc new file mode 100644 index 0000000000000..dc4dd653f37c0 --- /dev/null +++ b/onnxruntime/core/graph/automl_ops/automl_defs.cc @@ -0,0 +1,46 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/graph/constants.h" +#include "core/graph/automl_ops/automl_defs.h" +#include "core/graph/op.h" +#include "onnx/defs/schema.h" +#include "onnx/defs/shape_inference.h" + +namespace onnxruntime { +namespace automl { +using ONNX_NAMESPACE::AttributeProto; +using ONNX_NAMESPACE::OpSchema; +using ONNX_NAMESPACE::OPTIONAL; + +void RegisterAutoMLSchemas() { + + static const char* DateTimeTransformer_ver1_doc = R"DOC( + DateTimeTransformer accepts a single scalar int64 tensor, constructs + an instance of std::chrono::system_clock::time_point and passes it as an argument + to Microsoft::DateTimeFeaturizer which is a part of a shared library. + It returns an instance of TimePoint class. + )DOC"; + + MS_AUTOML_OPERATOR_SCHEMA(DateTimeTransformer) + .SinceVersion(1) + .SetDomain(kMSAutoMLDomain) + .SetDoc(DateTimeTransformer_ver1_doc) + .Input(0, "X", + "The input represents a number of seconds passed since the epoch, suitable to properly construct" + "an instance of std::chrono::system_clock::time_point", + "T1") + .Output(0, "Y", "The output which is a Microsoft::DateTimeFeaturizer::TimePoint structure", "T2") + .TypeConstraint( + "T1", + {"tensor(int64)"}, + "Constrain input type to int64 scalar tensor.") + .TypeConstraint( + "T2", + {"opaque(com.microsoft.automl,DateTimeFeaturizer_TimePoint)"}, + "Constrain output type to an AutoML specific Microsoft::Featurizers::TimePoint type" + "currently not part of ONNX standard. 
When it becomes a part of the standard we will adjust this" + "kernel definition and move it to ONNX repo"); +} +} // namespace automl +} // namespace onnxruntime diff --git a/onnxruntime/core/graph/automl_ops/automl_defs.h b/onnxruntime/core/graph/automl_ops/automl_defs.h new file mode 100644 index 0000000000000..b1a37366c396d --- /dev/null +++ b/onnxruntime/core/graph/automl_ops/automl_defs.h @@ -0,0 +1,30 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include "core/graph/onnx_protobuf.h" + +namespace onnxruntime { +namespace automl { +#define MS_AUTOML_OPERATOR_SCHEMA(name) \ + MS_AUTOML_OPERATOR_SCHEMA_UNIQ_HELPER(__COUNTER__, name) +#define MS_AUTOML_OPERATOR_SCHEMA_UNIQ_HELPER(Counter, name) \ + MS_AUTOML_OPERATOR_SCHEMA_UNIQ(Counter, name) +#define MS_AUTOML_OPERATOR_SCHEMA_UNIQ(Counter, name) \ + static ONNX_NAMESPACE::OpSchemaRegistry::OpSchemaRegisterOnce( \ + op_schema_register_once##name##Counter) ONNX_UNUSED = \ + ONNX_NAMESPACE::OpSchema(#name, __FILE__, __LINE__) + +#define MS_AUTOML_OPERATOR_SCHEMA_ELSEWHERE(name, schema_func) \ + MS_AUTOML_OPERATOR_SCHEMA_UNIQ_HELPER_ELSEWHERE(__COUNTER__, name, schema_func) +#define MS_AUTOML_OPERATOR_SCHEMA_UNIQ_HELPER_ELSEWHERE(Counter, name, schema_func) \ + MS_AUTOML_OPERATOR_SCHEMA_UNIQ_ELSEWHERE(Counter, name, schema_func) +#define MS_AUTOML_OPERATOR_SCHEMA_UNIQ_ELSEWHERE(Counter, name, schema_func) \ + static ONNX_NAMESPACE::OpSchemaRegistry::OpSchemaRegisterOnce( \ + op_schema_register_once##name##Counter) ONNX_UNUSED = \ + schema_func(ONNX_NAMESPACE::OpSchema(#name, __FILE__, __LINE__)) + +void RegisterAutoMLSchemas(); +} // namespace automl +} // namespace onnxruntime diff --git a/onnxruntime/core/graph/contrib_ops/contrib_defs.cc b/onnxruntime/core/graph/contrib_ops/contrib_defs.cc index 49d8e3309a989..66d85474461cd 100644 --- a/onnxruntime/core/graph/contrib_ops/contrib_defs.cc +++ b/onnxruntime/core/graph/contrib_ops/contrib_defs.cc @@ -21,6 +21,10 @@ void convPoolShapeInference( int input1Idx, int input2Idx); void globalPoolTypeShapeInference(ONNX_NAMESPACE::InferenceContext& ctx); +void matmulShapeInference( + ONNX_NAMESPACE::InferenceContext& ctx, + int input1Idx, + int input2Idx); } // namespace ONNX_NAMESPACE namespace onnxruntime { @@ -1158,6 +1162,39 @@ of [N, 0] then [N, 0]. updateOutputShape(ctx, 0, output_shape); }); + ONNX_CONTRIB_OPERATOR_SCHEMA(MatMulInteger16) + .SetDomain(kMSDomain) + .SinceVersion(1) + .SetDoc(R"DOC( +Matrix product that behaves like numpy.matmul: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.matmul.html. + The production MUST never overflow. The accumulation may overflow if and only if in 32 bits.)DOC") + .Input(0, "A", "N-dimensional matrix A", "T1") + .Input(1, "B", "N-dimensional matrix B", "T2") + .Output(0, "Y", "Matrix multiply results from A * B", "T3") + .TypeConstraint("T1", {"tensor(int16)", "tensor(uint16)"}, "Constrain input A data types as 16-bit integer tensor") + .TypeConstraint("T2", {"tensor(int16)", "tensor(uint16)"}, "Constrain input B data types as 16-bit integer tensor") + .TypeConstraint("T3", + {"tensor(int32)", "tensor(uint32)"}, + "Constrain output Y data types as 32-bit integer tensor." 
+ "T3 must be tensor(uint32) when both T1 and T2 are tensor(uint16)," + "or must be tensor(int32) when either T1 or T2 is tensor(int16).") + .TypeAndShapeInferenceFunction([](ONNX_NAMESPACE::InferenceContext& ctx) { + auto a_type = ctx.getInputType(0); + auto b_type = ctx.getInputType(1); + auto y_type = ctx.getOutputType(0); + if (nullptr == a_type || nullptr == b_type || nullptr == y_type || + a_type->value_case() != ONNX_NAMESPACE::TypeProto::kTensorType || + b_type->value_case() != ONNX_NAMESPACE::TypeProto::kTensorType) { + fail_type_inference( + "inputs are expected to have tensor type and output type should not be null."); + } + + // Right now we only support int32 + y_type->mutable_tensor_type()->set_elem_type(ONNX_NAMESPACE::TensorProto::INT32); + + matmulShapeInference(ctx, 0, 1); + }); + ONNX_CONTRIB_OPERATOR_SCHEMA(ReduceSumInteger) .SetDomain(kMSDomain) .SinceVersion(1) @@ -1599,4 +1636,4 @@ Example 4: #endif } } // namespace contrib -} // namespace onnxruntime +} // namespace onnxruntime \ No newline at end of file diff --git a/onnxruntime/core/graph/graph_viewer.cc b/onnxruntime/core/graph/graph_viewer.cc index 262a2591ddb08..53ee4047b4994 100644 --- a/onnxruntime/core/graph/graph_viewer.cc +++ b/onnxruntime/core/graph/graph_viewer.cc @@ -8,6 +8,8 @@ #include "core/graph/graph_viewer.h" +#include "core/graph/graph_utils.h" + namespace onnxruntime { struct NodeCompare { @@ -25,12 +27,13 @@ GraphViewer::GraphViewer(const Graph& graph) { leaf_nodes.push_back(&node); } } - graph.ReverseDFSFrom(leaf_nodes, - nullptr, - [this](const Node* n) { - nodes_in_topological_order_.push_back(n->Index()); - }, - NodeCompare()); + graph.ReverseDFSFrom( + leaf_nodes, + nullptr, + [this](const Node* n) { + nodes_in_topological_order_.push_back(n->Index()); + }, + NodeCompare()); for (auto& node : graph_->Nodes()) { if (node.InputEdgesBegin() == node.InputEdgesEnd()) { @@ -52,6 +55,10 @@ bool GraphViewer::GetInitializedTensor(const std::string& tensor_name, const ONN return graph_->GetInitializedTensor(tensor_name, value); } +bool GraphViewer::CanOverrideInitializer() const noexcept { + return graph_->CanOverrideInitializer(); +} + // Graph inputs excluding initializers. const std::vector& GraphViewer::GetInputs() const noexcept { return graph_->GetInputs(); @@ -109,4 +116,8 @@ bool GraphViewer::IsSubgraph() const { return graph_->IsSubgraph(); } +bool GraphViewer::IsConstantInitializer(const std::string& name, bool check_outer_scope) const { + return graph_utils::IsConstantInitializer(*graph_, name, check_outer_scope); +} + } // namespace onnxruntime diff --git a/onnxruntime/core/graph/model.cc b/onnxruntime/core/graph/model.cc index 3d03b2b4efd5d..d2da1752292ba 100644 --- a/onnxruntime/core/graph/model.cc +++ b/onnxruntime/core/graph/model.cc @@ -108,18 +108,25 @@ Model::Model(std::unique_ptr model_proto, const IOnnxRuntimeOpSchema const auto& domain = opSet.domain(); const auto version = opSet.version(); // empty domain and 'ai.onnx' are equivalent - if ((domain.empty() || domain == "ai.onnx") && version < 7) { + if ((domain.empty() || domain == kOnnxDomainAlias) && version < 7) { // TODO: Check if we can upgrade all the current opset 6 models that are being tested // in CI to opset 7 or above LOGS_DEFAULT(WARNING) << "ONNX Runtime only *guarantees* support for models stamped " "with opset version 7 or above for opset domain 'ai.onnx'. " "Please upgrade your model to opset 7 or higher. 
" "For now, this opset " - << version + << version << " model may run depending upon legacy support " "of some older opset version operators."; } - domain_to_version[domain] = gsl::narrow_cast(version); + // We need to overwrite the domain here with ("") or else the loop below will try to find ("") + // in the map and if not found (when domain == kOnnxDomainAlias), adds an entry for ("", 11). + // This effectively ignores the opset version specified by the model for the onnx domain. + if (domain == kOnnxDomainAlias) { + domain_to_version[kOnnxDomain] = gsl::narrow_cast(version); + } else { + domain_to_version[domain] = gsl::narrow_cast(version); + } } auto domain_map = schema_registry->GetLatestOpsetVersions(false); diff --git a/onnxruntime/core/mlas/inc/mlas.h b/onnxruntime/core/mlas/inc/mlas.h index b1e08f09b567c..884d97042b4bc 100644 --- a/onnxruntime/core/mlas/inc/mlas.h +++ b/onnxruntime/core/mlas/inc/mlas.h @@ -129,6 +129,27 @@ MlasSgemm( MLAS_THREADPOOL* ThreadPool ); +// +// Quantized integer matrix/matrix multiply routine. +// + +void +MLASCALL +MlasQgemm( + size_t M, + size_t N, + size_t K, + const uint8_t* A, + size_t lda, + uint8_t offa, + const uint8_t* B, + size_t ldb, + uint8_t offb, + int32_t* C, + size_t ldc, + MLAS_THREADPOOL* ThreadPool + ); + // // Convolution routines. // diff --git a/onnxruntime/core/mlas/lib/aarch64/sgemma.s b/onnxruntime/core/mlas/lib/aarch64/SgemmKernelNeon.S similarity index 95% rename from onnxruntime/core/mlas/lib/aarch64/sgemma.s rename to onnxruntime/core/mlas/lib/aarch64/SgemmKernelNeon.S index 545465a5a86e8..c69fadc36893b 100644 --- a/onnxruntime/core/mlas/lib/aarch64/sgemma.s +++ b/onnxruntime/core/mlas/lib/aarch64/SgemmKernelNeon.S @@ -6,7 +6,7 @@ Licensed under the MIT License. Module Name: - sgemma.s + SgemmKernelNeon.s Abstract: @@ -88,7 +88,7 @@ Abstract: .endm - +// // MultiplyAccumulateRow // // Generates the code to multiply and accumulate a single row of the output @@ -137,11 +137,11 @@ Abstract: ClearBlockAccumulators \Columns\(),\Rows\() .if \Rows\() >= 2 - add x10,x0,x6,uxtw 2 // compute matrix A plus 1 row + add x10,x0,x6,lsl #2 // compute matrix A plus 1 row .endif .if \Rows\() >= 4 - add x11,x10,x6,uxtw 2 // compute matrix A plus 2 rows - add x12,x11,x6,uxtw 2 // compute matrix A plus 3 rows + add x11,x10,x6,lsl #2 // compute matrix A plus 2 rows + add x12,x11,x6,lsl #2 // compute matrix A plus 3 rows .endif sub x9,x3,#4 // decrement block count to process @@ -183,7 +183,7 @@ Abstract: ldp q6,q7,[x1,#-8*4] .endif MultiplyAccumulateBlock \Columns\(),\Rows\(),0 - sub x9,x9,1 + sub x9,x9,#1 cbnz x9,.L\Mode\().Compute\Columns\().x\Rows\().BlockBy1Loop .L\Mode\().Output\Columns\().x\Rows\().Block: @@ -430,12 +430,12 @@ Return Value: .type MlasSgemmKernel\Mode\(),%function MlasSgemmKernel\Mode\(): - stp d8,d9,[sp,-32]! - stp d10,d11,[sp,16] + stp d8,d9,[sp,#-32]! 
+ stp d10,d11,[sp,#16] - add x13,x2,x7,uxtw 2 // compute matrix C plus 1 row - add x14,x13,x7,uxtw 2 // compute matrix C plus 2 rows - add x15,x14,x7,uxtw 2 // compute matrix C plus 3 rows + add x13,x2,x7,lsl #2 // compute matrix C plus 1 row + add x14,x13,x7,lsl #2 // compute matrix C plus 2 rows + add x15,x14,x7,lsl #2 // compute matrix C plus 3 rows mov x8,x0 // save matrix A // @@ -452,8 +452,8 @@ MlasSgemmKernel\Mode\(): .L\Mode\().ExitKernel: mov x0,x4 - ldp d10,d11,[sp,16] - ldp d8,d9,[sp],32 + ldp d10,d11,[sp,#16] + ldp d8,d9,[sp],#32 ret // diff --git a/onnxruntime/core/mlas/lib/amd64/AssembleAvx512Vnni.inc b/onnxruntime/core/mlas/lib/amd64/AssembleAvx512Vnni.inc new file mode 100644 index 0000000000000..02f7d92256017 --- /dev/null +++ b/onnxruntime/core/mlas/lib/amd64/AssembleAvx512Vnni.inc @@ -0,0 +1,232 @@ +;++ +; +; Copyright (c) Microsoft Corporation. All rights reserved. +; +; Licensed under the MIT License. +; +; Module Name: +; +; AssembleAvx512Vnni.inc +; +; Abstract: +; +; This module contains macros to build VNNI instructions for toolchains that +; do not natively support this newer instruction set extension. +; +;-- + +; +; Map friendly register names to the encoded register index. +; + +ZmmIndex_zmm0 EQU 0 +ZmmIndex_zmm1 EQU 1 +ZmmIndex_zmm2 EQU 2 +ZmmIndex_zmm3 EQU 3 +ZmmIndex_zmm4 EQU 4 +ZmmIndex_zmm5 EQU 5 +ZmmIndex_zmm6 EQU 6 +ZmmIndex_zmm7 EQU 7 +ZmmIndex_zmm8 EQU 8 +ZmmIndex_zmm9 EQU 9 +ZmmIndex_zmm10 EQU 10 +ZmmIndex_zmm11 EQU 11 +ZmmIndex_zmm12 EQU 12 +ZmmIndex_zmm13 EQU 13 +ZmmIndex_zmm14 EQU 14 +ZmmIndex_zmm15 EQU 15 +ZmmIndex_zmm16 EQU 16 +ZmmIndex_zmm17 EQU 17 +ZmmIndex_zmm18 EQU 18 +ZmmIndex_zmm19 EQU 19 +ZmmIndex_zmm20 EQU 20 +ZmmIndex_zmm21 EQU 21 +ZmmIndex_zmm22 EQU 22 +ZmmIndex_zmm23 EQU 23 +ZmmIndex_zmm24 EQU 24 +ZmmIndex_zmm25 EQU 25 +ZmmIndex_zmm26 EQU 26 +ZmmIndex_zmm27 EQU 27 +ZmmIndex_zmm28 EQU 28 +ZmmIndex_zmm29 EQU 29 +ZmmIndex_zmm30 EQU 30 +ZmmIndex_zmm31 EQU 31 + +GprIndex_rax EQU 0 +GprIndex_rcx EQU 1 +GprIndex_rdx EQU 2 +GprIndex_rbx EQU 3 +GprIndex_rbp EQU 5 +GprIndex_rsi EQU 6 +GprIndex_rdi EQU 7 +GprIndex_r8 EQU 8 +GprIndex_r9 EQU 9 +GprIndex_r10 EQU 10 +GprIndex_r11 EQU 11 +GprIndex_r12 EQU 12 +GprIndex_r13 EQU 13 +GprIndex_r14 EQU 14 +GprIndex_r15 EQU 15 + +; +; Macro Description: +; +; This macro builds a VNNI instruction of the form: +; +; instr zmm1,zmm2,zmm3 +; +; Arguments: +; +; Opcode - Specifies the opcode for the VNNI instruction. +; +; DestReg - Specifies the destination register. +; +; Src1Reg - Specifies the first source register. +; +; Src2Reg - Specifies the second source register. 
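Before the encoding macros below, a quick semantic reference: these macros only hand-assemble the AVX512 VNNI encodings for toolchains without native support, they do not change what the instructions compute. As I understand the VNNI semantics, a per-lane scalar model looks like the following (plain C++, independent of MLAS; the saturating Vpdpbusds/Vpdpwssds variants differ only in saturating the accumulate).

#include <cstdint>

// One 32-bit lane of VPDPBUSD: dot product of four unsigned bytes from the first
// source with four signed bytes from the second, added to the dword accumulator.
int32_t VpdpbusdLane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
  for (int i = 0; i < 4; ++i) acc += int32_t(a[i]) * int32_t(b[i]);
  return acc;
}

// One 32-bit lane of VPDPWSSD: dot product of two signed 16-bit pairs added to the
// dword accumulator, i.e. the vpmaddwd + vpaddd sequence used by the AVX2 kernels
// in this change collapsed into a single instruction.
int32_t VpdpwssdLane(int32_t acc, const int16_t a[2], const int16_t b[2]) {
  return acc + int32_t(a[0]) * int32_t(b[0]) + int32_t(a[1]) * int32_t(b[1]);
}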
+; + +VnniZmmZmmZmm MACRO Opcode, DestReg, Src1Reg, Src2Reg + + LOCAL Payload0, Payload1, Payload2, ModRMByte + + Payload0 = 002h ; "0F 38" prefix + Payload0 = Payload0 + ((((ZmmIndex_&DestReg& SHR 3) AND 1) XOR 1) SHL 7) + Payload0 = Payload0 + ((((ZmmIndex_&Src2Reg& SHR 4) AND 1) XOR 1) SHL 6) + Payload0 = Payload0 + ((((ZmmIndex_&Src2Reg& SHR 3) AND 1) XOR 1) SHL 5) + Payload0 = Payload0 + ((((ZmmIndex_&DestReg& SHR 4) AND 1) XOR 1) SHL 4) + + Payload1 = 005h ; "66" prefix + Payload1 = Payload1 + (((ZmmIndex_&Src1Reg& AND 15) XOR 15) SHL 3) + + Payload2 = 040h ; 512-bit vector length + Payload2 = Payload2 + ((((ZmmIndex_&Src1Reg& SHR 4) AND 1) XOR 1) SHL 3) + + ModRMByte = 0C0h ; register form + ModRMByte = ModRMByte + ((ZmmIndex_&DestReg& AND 7) SHL 3) + ModRMByte = ModRMByte + (ZmmIndex_&Src2Reg& AND 7) + + db 062h, Payload0, Payload1, Payload2, Opcode, ModRMByte + + ENDM + +VpdpbusdZmmZmmZmm MACRO DestReg, Src1Reg, Src2Reg + + VnniZmmZmmZmm 050h, DestReg, Src1Reg, Src2Reg + + ENDM + +VpdpbusdsZmmZmmZmm MACRO DestReg, Src1Reg, Src2Reg + + VnniZmmZmmZmm 051h, DestReg, Src1Reg, Src2Reg + + ENDM + +VpdpwssdZmmZmmZmm MACRO DestReg, Src1Reg, Src2Reg + + VnniZmmZmmZmm 052h, DestReg, Src1Reg, Src2Reg + + ENDM + +VpdpwssdsZmmZmmZmm MACRO DestReg, Src1Reg, Src2Reg + + VnniZmmZmmZmm 053h, DestReg, Src1Reg, Src2Reg + + ENDM + +; +; Macro Description: +; +; This macro builds a VNNI instruction of the form: +; +; instr zmm1,zmm2,DWORD BCST [BaseReg+IndexReg*Scale] +; +; Arguments: +; +; Opcode - Specifies the opcode for the VNNI instruction. +; +; DestReg - Specifies the destination register. +; +; Src1Reg - Specifies the first source register. +; +; BaseReg - Specifies the base register of the broadcast operand. +; +; IndexReg - Specifies the optional index register of the broadcast operand. +; +; Scale - Specifies the scaling factor of the optional index register. 
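For readability, here is the arithmetic of the register-form VnniZmmZmmZmm macro above transcribed into C++. The shifts and masks are copied from the macro as written; I have not re-derived them against the EVEX specification, so treat this as documentation of the macro rather than of the encoding itself.

#include <array>
#include <cstdint>

// Byte-for-byte transcription of VnniZmmZmmZmm (register form): emits
// 62h, Payload0, Payload1, Payload2, Opcode, ModRM for "instr zmmDest, zmmSrc1, zmmSrc2".
std::array<uint8_t, 6> EncodeVnniZmmZmmZmm(uint8_t opcode, unsigned dest, unsigned src1, unsigned src2) {
  uint8_t payload0 = 0x02;                                  // "0F 38" prefix
  payload0 |= (((dest >> 3) & 1) ^ 1) << 7;
  payload0 |= (((src2 >> 4) & 1) ^ 1) << 6;
  payload0 |= (((src2 >> 3) & 1) ^ 1) << 5;
  payload0 |= (((dest >> 4) & 1) ^ 1) << 4;

  uint8_t payload1 = 0x05;                                  // "66" prefix
  payload1 |= ((src1 & 15) ^ 15) << 3;

  uint8_t payload2 = 0x40;                                  // 512-bit vector length
  payload2 |= (((src1 >> 4) & 1) ^ 1) << 3;

  uint8_t modrm = 0xC0;                                     // register form
  modrm |= (dest & 7) << 3;
  modrm |= src2 & 7;

  return {0x62, payload0, payload1, payload2, opcode, modrm};
}

// e.g. EncodeVnniZmmZmmZmm(0x50, 4, 5, 6) corresponds to VpdpbusdZmmZmmZmm zmm4,zmm5,zmm6.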
+; + +VnniZmmZmmBroadcast MACRO Opcode, DestReg, Src1Reg, BaseReg, IndexReg, Scale + + LOCAL Payload0, Payload1, Payload2, ModRMByte, SibByte + + Payload0 = 002h ; "0F 38" prefix + Payload0 = Payload0 + ((((ZmmIndex_&DestReg& SHR 3) AND 1) XOR 1) SHL 7) +IFNB + Payload0 = Payload0 + ((((GprIndex_&IndexReg& SHR 3) AND 1) XOR 1) SHL 6) +ELSE + Payload0 = Payload0 + 040h ; zero logical index register +ENDIF + Payload0 = Payload0 + ((((GprIndex_&BaseReg& SHR 3) AND 1) XOR 1) SHL 5) + Payload0 = Payload0 + ((((ZmmIndex_&DestReg& SHR 4) AND 1) XOR 1) SHL 4) + + Payload1 = 005h ; "66" prefix + Payload1 = Payload1 + (((ZmmIndex_&Src1Reg& AND 15) XOR 15) SHL 3) + + Payload2 = 050h ; 512-bit vector length, broadcast + Payload2 = Payload2 + ((((ZmmIndex_&Src1Reg& SHR 4) AND 1) XOR 1) SHL 3) + + ModRMByte = 000h ; memory form + ModRMByte = ModRMByte + ((ZmmIndex_&DestReg& AND 7) SHL 3) +IFNB + ModRMByte = ModRMByte + 004h ; indicate SIB byte needed +ELSE + ModRMByte = ModRMByte + (GprIndex_&BaseReg& AND 7) +ENDIF + +IFNB + SibByte = 0 +IF Scale EQ 2 + SibByte = SibByte + (1 SHL 6) +ELSEIF Scale EQ 4 + SibByte = SibByte + (2 SHL 6) +ELSEIF Scale EQ 8 + SibByte = SibByte + (3 SHL 6) +ELSEIF Scale NE 1 + .err +ENDIF + SibByte = SibByte + ((GprIndex_&IndexReg& AND 7) SHL 3) + SibByte = SibByte + (GprIndex_&BaseReg& AND 7) +ENDIF + +IFNB + db 062h, Payload0, Payload1, Payload2, Opcode, ModRMByte, SibByte +ELSE + db 062h, Payload0, Payload1, Payload2, Opcode, ModRMByte +ENDIF + + ENDM + +VpdpbusdZmmZmmBroadcast MACRO DestReg, Src1Reg, BaseReg, IndexReg, Scale + + VnniZmmZmmBroadcast 050h, DestReg, Src1Reg, BaseReg, IndexReg, Scale + + ENDM + +VpdpbusdsZmmZmmBroadcast MACRO DestReg, Src1Reg, BaseReg, IndexReg, Scale + + VnniZmmZmmBroadcast 051h, DestReg, Src1Reg, BaseReg, IndexReg, Scale + + ENDM + +VpdpwssdZmmZmmBroadcast MACRO DestReg, Src1Reg, BaseReg, IndexReg, Scale + + VnniZmmZmmBroadcast 052h, DestReg, Src1Reg, BaseReg, IndexReg, Scale + + ENDM + +VpdpwssdsZmmZmmBroadcast MACRO DestReg, Src1Reg, BaseReg, IndexReg, Scale + + VnniZmmZmmBroadcast 053h, DestReg, Src1Reg, BaseReg, IndexReg, Scale + + ENDM diff --git a/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx2.asm b/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx2.asm new file mode 100644 index 0000000000000..365348a14db1f --- /dev/null +++ b/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx2.asm @@ -0,0 +1,1241 @@ +;++ +; +; Copyright (c) Microsoft Corporation. All rights reserved. +; +; Licensed under the MIT License. +; +; Module Name: +; +; QgemmU8U8KernelAvx2.asm +; +; Abstract: +; +; This module implements the kernels for the quantized integer matrix/matrix +; multiply operation (QGEMM). +; +; This implementation uses AVX2 instructions. +; +;-- + + .xlist +INCLUDE mlasi.inc + .list + + EXTERN MlasMaskMoveAvx:NEAR + +; +; Stack frame layout for the U8U8 CopyPackA routine. +; + +GemmU8U8CopyPackAFrame STRUCT + + PaddedMatrixAData OWORD 4 DUP (?) + SavedXmm6 OWORD ? + SavedXmm7 OWORD ? + SavedXmm8 OWORD ? + SavedXmm9 OWORD ? + Padding QWORD ? + SavedR13 QWORD ? + SavedR12 QWORD ? + SavedRdi QWORD ? + SavedRsi QWORD ? + SavedRbx QWORD ? + SavedRbp QWORD ? + ReturnAddress QWORD ? + PreviousP1Home QWORD ? + PreviousP2Home QWORD ? + PreviousP3Home QWORD ? + PreviousP4Home QWORD ? + CountK QWORD ? + RowSumVector QWORD ? + offb QWORD ? + +GemmU8U8CopyPackAFrame ENDS + +; +; Stack frame layout for the U8U8 CopyPackB routine. +; + +GemmU8U8CopyPackBFrame STRUCT + + PaddedMatrixBData OWORD 2 DUP (?) + SavedRsi QWORD ? + SavedRbx QWORD ? 
+ SavedRbp QWORD ? + ReturnAddress QWORD ? + PreviousP1Home QWORD ? + PreviousP2Home QWORD ? + PreviousP3Home QWORD ? + PreviousP4Home QWORD ? + CountK QWORD ? + ColumnSumVector QWORD ? + offa QWORD ? + +GemmU8U8CopyPackBFrame ENDS + +; +; Stack frame layout for the U8U8 kernel. +; + +GemmU8U8KernelFrame STRUCT + + SavedXmm6 OWORD ? + SavedXmm7 OWORD ? + SavedXmm8 OWORD ? + SavedXmm9 OWORD ? + SavedXmm10 OWORD ? + SavedXmm11 OWORD ? + SavedXmm12 OWORD ? + SavedXmm13 OWORD ? + SavedXmm14 OWORD ? + SavedXmm15 OWORD ? + SavedR14 QWORD ? + SavedR13 QWORD ? + SavedR12 QWORD ? + SavedRdi QWORD ? + SavedRsi QWORD ? + SavedRbx QWORD ? + SavedRbp QWORD ? + ReturnAddress QWORD ? + PreviousP1Home QWORD ? + PreviousP2Home QWORD ? + PreviousP3Home QWORD ? + PreviousP4Home QWORD ? + CountM QWORD ? + CountN QWORD ? + ldc QWORD ? + RowSumVector QWORD ? + ColumnSumVector QWORD ? + DepthValue QWORD ? + ZeroMode QWORD ? + +GemmU8U8KernelFrame ENDS + +;++ +; +; Routine Description: +; +; This routine copies elements from the source matrix to the destination +; packed buffer. +; +; The kernel expects that elements from matrix A have been zero extended to +; 16-bits and padded to a multiple of 32-bits (two pairs of 16-bit values). +; The kernel can then efficiently broadcast 32-bits from the packed buffer +; and avoid expensive shuffling inside the kernel. +; +; Arguments: +; +; D (rcx) - Supplies the address of the destination packed buffer. +; +; A (rdx) - Supplies the address of the source matrix. +; +; lda (r8) - Supplies the number of elements per row of the source matrix. +; +; CountM (r9) - Supplies the number of rows of the source matrix to copy. +; +; CountK - Supplies the number of columns of the source matrix to copy. +; +; RowSumVector - Supplies the address of the buffer to receive the sums of +; the elements from each of the rows. Each sum has also been multiplied +; by the zero point offset. +; +; offb - Supplies the zero point offset for the other source matrix of the +; matrix multiplication. +; +; Return Value: +; +; None. +; +;-- + + NESTED_ENTRY MlasGemmU8U8CopyPackAAvx2, _TEXT + + rex_push_reg rbp + push_reg rbx + push_reg rsi + push_reg rdi + push_reg r12 + push_reg r13 + alloc_stack (GemmU8U8CopyPackAFrame.SavedR13) + save_xmm128_avx xmm6,GemmU8U8CopyPackAFrame.SavedXmm6 + save_xmm128_avx xmm7,GemmU8U8CopyPackAFrame.SavedXmm7 + save_xmm128_avx xmm8,GemmU8U8CopyPackAFrame.SavedXmm8 + save_xmm128_avx xmm9,GemmU8U8CopyPackAFrame.SavedXmm9 + + END_PROLOGUE + + mov rdi,rcx + mov rsi,rdx + mov r10,GemmU8U8CopyPackAFrame.CountK[rsp] + lea r11,[r10+1] + and r11,NOT 1 ; align CountK up to pair count + mov r12,GemmU8U8CopyPackAFrame.RowSumVector[rsp] + vpbroadcastw xmm8,WORD PTR GemmU8U8CopyPackAFrame.offb[rsp] + +; +; Compute the conditional load/store mask for an unaligned CountK. +; + + mov eax,r10d + and eax,15 ; isolate unaligned count + inc eax + shr eax,1 ; align unaligned count to pair count + mov DWORD PTR GemmU8U8CopyPackAFrame.CountK[rsp],eax + vpbroadcastd ymm9,DWORD PTR GemmU8U8CopyPackAFrame.CountK[rsp] + vpcmpgtd ymm9,ymm9,YMMWORD PTR [MlasMaskMoveAvx] + +; +; Zero initialize the padded stack buffers. +; + + vpxor xmm0,xmm0,xmm0 + vmovdqu YMMWORD PTR GemmU8U8CopyPackAFrame.PaddedMatrixAData[rsp],ymm0 + vmovdqu YMMWORD PTR GemmU8U8CopyPackAFrame.PaddedMatrixAData[rsp+32],ymm0 + +; +; Process 4 rows of matrix A in a loop. +; +; For each row, zero extend the source bytes to 16-bits and write to the packed +; buffer. 
The packed buffer has the same data ordering as the source bytes, but +; the stride is CountK aligned up to an even number of 16-bit values. +; +; These 16-bit values are also accumulated into an intermediate per-row +; accumulator. CountK cannot be greater than 256 to avoid overflowing these +; 16-bit accumulators. +; + + sub r9,4 + jb ProcessRemainingRows + +ProcessNextRowM4: + vpxor xmm0,xmm0,xmm0 ; clear row accumulators + vpxor xmm1,xmm1,xmm1 + vpxor xmm2,xmm2,xmm2 + vpxor xmm3,xmm3,xmm3 + mov rdx,rsi + mov rcx,rdi + lea rsi,[rsi+r8*4] ; advance next matrix A by 4 rows + lea rdi,[rdi+r11*(2*4)] ; advance next matrix D by 4 rows + mov rbx,r10 ; reload columns remaining + sub rbx,16 + jb ProcessRemainingColumnsM4 + +ProcessNextColumnLoopM4: + lea rax,[rdx+r8*2] ; compute matrix A plus two rows + vpmovzxbw ymm4,XMMWORD PTR [rdx] + vpmovzxbw ymm5,XMMWORD PTR [rdx+r8] + vpmovzxbw ymm6,XMMWORD PTR [rax] + vpmovzxbw ymm7,XMMWORD PTR [rax+r8] + lea rax,[rcx+r11*4] ; compute matrix D plus two rows + vmovdqu YMMWORD PTR [rcx],ymm4 + vmovdqu YMMWORD PTR [rcx+r11*2],ymm5 + vmovdqu YMMWORD PTR [rax],ymm6 + vmovdqu YMMWORD PTR [rax+r11*2],ymm7 + vpaddw ymm0,ymm0,ymm4 ; accumulate per row along columns + vpaddw ymm1,ymm1,ymm5 + vpaddw ymm2,ymm2,ymm6 + vpaddw ymm3,ymm3,ymm7 + add rdx,16 ; advance matrix A by 16 bytes + add rcx,16*2 ; advance matrix D by 16 words + sub rbx,16 ; subtract columns remaining + jae ProcessNextColumnLoopM4 + +ProcessRemainingColumnsM4: + add rbx,16 ; correct for over-subtract above + jz ReduceRowSumVectorM4 + +; +; Copy the unaligned CountK columns to a zero padded stack buffer. +; + +.errnz GemmU8U8CopyPackAFrame.PaddedMatrixAData + mov rbp,rsp ; GemmU8U8CopyPackAFrame.PaddedMatrixAData + test bl,8 ; (CountK & 8) != 0? + jz CopyRemainingCountKLessThan8M4 + lea r13,[rdx+r8*2] ; compute matrix A plus two rows + mov rax,QWORD PTR [rdx] + mov QWORD PTR [rbp],rax + mov rax,QWORD PTR [rdx+r8] + mov QWORD PTR [rbp+16],rax + mov rax,QWORD PTR [r13] + mov QWORD PTR [rbp+32],rax + mov rax,QWORD PTR [r13+r8] + mov QWORD PTR [rbp+48],rax + add rdx,8 + add rbp,8 ; advance padded buffer destination + +CopyRemainingCountKLessThan8M4: + test bl,4 ; (CountK & 4) != 0? + jz CopyRemainingCountKLessThan4M4 + lea r13,[rdx+r8*2] ; compute matrix A plus two rows + mov eax,DWORD PTR [rdx] + mov DWORD PTR [rbp],eax + mov eax,DWORD PTR [rdx+r8] + mov DWORD PTR [rbp+16],eax + mov eax,DWORD PTR [r13] + mov DWORD PTR [rbp+32],eax + mov eax,DWORD PTR [r13+r8] + mov DWORD PTR [rbp+48],eax + add rdx,4 + add rbp,4 ; advance padded buffer destination + +CopyRemainingCountKLessThan4M4: + test bl,2 ; (CountK & 2) != 0? + jz CopyRemainingCountKLessThan2M4 + lea r13,[rdx+r8*2] ; compute matrix A plus two rows + movzx eax,WORD PTR [rdx] + mov WORD PTR [rbp],ax + movzx eax,WORD PTR [rdx+r8] + mov WORD PTR [rbp+16],ax + movzx eax,WORD PTR [r13] + mov WORD PTR [rbp+32],ax + movzx eax,WORD PTR [r13+r8] + mov WORD PTR [rbp+48],ax + add rdx,2 + add rbp,2 ; advance padded buffer destination + +CopyRemainingCountKLessThan2M4: + test bl,1 ; (CountK & 1) != 0? + jz ProcessPaddedMatrixADataM4 + lea r13,[rdx+r8*2] ; compute matrix A plus two rows + movzx eax,BYTE PTR [rdx] + mov BYTE PTR [rbp],al + movzx eax,BYTE PTR [rdx+r8] + mov BYTE PTR [rbp+16],al + movzx eax,BYTE PTR [r13] + mov BYTE PTR [rbp+32],al + movzx eax,BYTE PTR [r13+r8] + mov BYTE PTR [rbp+48],al + +; +; Process the remaining CountK columns using the zero padded stack buffer. 
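For reviewers more comfortable with C than MASM, this is my reading of the packed-A layout and row sums described above, written as a scalar reference. ReferenceCopyPackA is a hypothetical helper produced for this note; it models the data contract only, not the AVX2 tiling or the stack-buffer handling of unaligned tails.

#include <cstdint>
#include <cstddef>

// Scalar reference for the U8U8 CopyPackA layout described above: zero-extend each
// byte of A to 16 bits, pad each row to an even number of 16-bit values, and emit
// RowSumVector[m] = offb * sum(row m). CountK must not exceed 256 so the 16-bit row
// accumulators used by the real kernel cannot overflow.
void ReferenceCopyPackA(uint16_t* D, const uint8_t* A, size_t lda,
                        size_t CountM, size_t CountK,
                        int32_t* RowSumVector, uint8_t offb) {
  const size_t AlignedK = (CountK + 1) & ~size_t(1);        // pair-align the packed stride
  for (size_t m = 0; m < CountM; ++m) {
    int32_t row_sum = 0;
    for (size_t k = 0; k < CountK; ++k) {
      const uint16_t v = A[m * lda + k];
      D[m * AlignedK + k] = v;                              // same ordering as the source row
      row_sum += v;
    }
    for (size_t k = CountK; k < AlignedK; ++k) D[m * AlignedK + k] = 0;  // zero padding
    RowSumVector[m] = row_sum * int32_t(offb);              // pre-multiplied by B's zero point
  }
}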
+; + +ProcessPaddedMatrixADataM4: + vpmovzxbw ymm4,XMMWORD PTR GemmU8U8CopyPackAFrame.PaddedMatrixAData[rsp] + vpmovzxbw ymm5,XMMWORD PTR GemmU8U8CopyPackAFrame.PaddedMatrixAData[rsp+16] + vpmovzxbw ymm6,XMMWORD PTR GemmU8U8CopyPackAFrame.PaddedMatrixAData[rsp+32] + vpmovzxbw ymm7,XMMWORD PTR GemmU8U8CopyPackAFrame.PaddedMatrixAData[rsp+48] + lea rax,[rcx+r11*4] ; compute matrix D plus two rows + vpmaskmovd YMMWORD PTR [rcx],ymm9,ymm4 + vpmaskmovd YMMWORD PTR [rcx+r11*2],ymm9,ymm5 + vpmaskmovd YMMWORD PTR [rax],ymm9,ymm6 + vpmaskmovd YMMWORD PTR [rax+r11*2],ymm9,ymm7 + vpaddw ymm0,ymm0,ymm4 ; accumulate per row along columns + vpaddw ymm1,ymm1,ymm5 + vpaddw ymm2,ymm2,ymm6 + vpaddw ymm3,ymm3,ymm7 + +; +; Reduce the sums for the four rows of output. Transpose the intermediate +; accumulators by treating the registers as 32-bit elements containing a pair +; of 16-bit sums. Continue reducing the transposed accumulators to produce the +; final 32-bit vector output. +; + +ReduceRowSumVectorM4: + vpunpckldq ymm4,ymm0,ymm1 ; [A5 B5 A4 B4 A1 B1 A0 B0] + vpunpckhdq ymm5,ymm0,ymm1 ; [A7 B7 A6 B6 A3 B3 A2 B2] + vpunpckldq ymm6,ymm2,ymm3 ; [C5 D5 C4 D4 C1 D1 C0 D0] + vpunpckhdq ymm7,ymm2,ymm3 ; [C7 D7 C6 D6 C3 D3 C2 D2] + vpunpcklqdq ymm0,ymm4,ymm6 ; [A4 B4 C4 D4 A0 B0 C0 D0] + vpunpckhqdq ymm1,ymm4,ymm6 ; [A5 B5 C5 D5 A1 B1 C1 D1] + vpunpcklqdq ymm2,ymm5,ymm7 ; [A6 B6 C6 D6 A2 B2 C2 D2] + vpunpckhqdq ymm3,ymm5,ymm7 ; [A7 B7 C7 D7 A3 B3 C3 D3] + vpaddw ymm0,ymm0,ymm1 ; reduction + vpaddw ymm0,ymm0,ymm2 + vpaddw ymm0,ymm0,ymm3 + vextracti128 xmm1,ymm0,1 ; extract high pairs + vpaddw xmm0,xmm0,xmm1 ; reduction + vpmaddwd xmm0,xmm0,xmm8 ; multiply by offset and reduce + vmovdqu XMMWORD PTR [r12],xmm0 + add r12,4*4 ; advance row sum vector by 4 dwords + sub r9,4 ; subtract rows remaining + jae ProcessNextRowM4 + +ProcessRemainingRows: + add r9,4 ; correct for over-subtract above + jz ExitRoutine + +; +; Process a single row of matrix A in a loop. +; + +ProcessNextRowM1: + vpxor xmm0,xmm0,xmm0 ; clear row accumulator + mov rdx,rsi + mov rcx,rdi + add rsi,r8 + lea rdi,[rdi+r11*2] + mov rbx,r10 ; reload columns remaining + sub rbx,16 + jb ProcessRemainingColumnsM1 + +ProcessNextColumnLoopM1: + vpmovzxbw ymm4,XMMWORD PTR [rdx] + vmovdqu YMMWORD PTR [rcx],ymm4 + vpaddw ymm0,ymm0,ymm4 ; accumulate per row along columns + add rdx,16 ; advance matrix A by 16 bytes + add rcx,16*2 ; advance matrix D by 16 words + sub rbx,16 ; subtract columns remaining + jae ProcessNextColumnLoopM1 + +ProcessRemainingColumnsM1: + add rbx,16 ; correct for over-subtract above + jz ReduceRowSumVectorM1 + +; +; Copy the unaligned CountK columns to a zero padded stack buffer. +; + +.errnz GemmU8U8CopyPackAFrame.PaddedMatrixAData + mov rbp,rsp ; GemmU8U8CopyPackAFrame.PaddedMatrixAData + test bl,8 ; (CountK & 8) != 0? + jz CopyRemainingCountKLessThan8M1 + mov rax,QWORD PTR [rdx] + mov QWORD PTR [rbp],rax + add rdx,8 + add rbp,8 ; advance padded buffer destination + +CopyRemainingCountKLessThan8M1: + test bl,4 ; (CountK & 4) != 0? + jz CopyRemainingCountKLessThan4M1 + mov eax,DWORD PTR [rdx] + mov DWORD PTR [rbp],eax + add rdx,4 + add rbp,4 ; advance padded buffer destination + +CopyRemainingCountKLessThan4M1: + test bl,2 ; (CountK & 2) != 0? + jz CopyRemainingCountKLessThan2M1 + movzx eax,WORD PTR [rdx] + mov WORD PTR [rbp],ax + add rdx,2 + add rbp,2 ; advance padded buffer destination + +CopyRemainingCountKLessThan2M1: + test bl,1 ; (CountK & 1) != 0? 
+ jz ProcessPaddedMatrixADataM1 + movzx eax,BYTE PTR [rdx] + mov BYTE PTR [rbp],al + +; +; Process the remaining CountK columns using the zero padded stack buffer. +; + +ProcessPaddedMatrixADataM1: + vpmovzxbw ymm4,XMMWORD PTR GemmU8U8CopyPackAFrame.PaddedMatrixAData[rsp] + vpmaskmovd YMMWORD PTR [rcx],ymm9,ymm4 + vpaddw ymm0,ymm0,ymm4 ; accumulate per row along columns + +; +; Reduce the sum for the single row of output. +; + +ReduceRowSumVectorM1: + vextracti128 xmm1,ymm0,1 ; extract high pairs + vpaddw xmm0,xmm0,xmm1 ; reduction + vphaddw xmm0,xmm0,xmm0 + vphaddw xmm0,xmm0,xmm0 + vpmaddwd xmm0,xmm0,xmm8 ; multiply by offset and reduce + vmovd DWORD PTR [r12],xmm0 + add r12,4 ; advance row sum vector by 1 DWORD + dec r9 ; decrement rows remaining + jnz ProcessNextRowM1 + +; +; Restore non-volatile registers and return. +; + +ExitRoutine: + vzeroupper + vmovaps xmm6,GemmU8U8CopyPackAFrame.SavedXmm6[rsp] + vmovaps xmm7,GemmU8U8CopyPackAFrame.SavedXmm7[rsp] + vmovaps xmm8,GemmU8U8CopyPackAFrame.SavedXmm8[rsp] + vmovaps xmm9,GemmU8U8CopyPackAFrame.SavedXmm9[rsp] + add rsp,(GemmU8U8CopyPackAFrame.SavedR13) + + BEGIN_EPILOGUE + + pop r13 + pop r12 + pop rdi + pop rsi + pop rbx + pop rbp + ret + + NESTED_END MlasGemmU8U8CopyPackAAvx2, _TEXT + +;++ +; +; Routine Description: +; +; This routine copies elements from the source matrix to the destination +; packed buffer. +; +; Arguments: +; +; D (rcx) - Supplies the address of the destination packed buffer. +; +; B (rdx) - Supplies the address of the source matrix. +; +; ldb (r8) - Supplies the number of elements per row of the source matrix. +; +; CountN (r9) - Supplies the number of columns of the source matrix to copy. +; +; CountK - Supplies the number of rows of the source matrix to copy. +; +; ColumnSumVector - Supplies the address of the buffer to receive the sums of +; the elements from each of the columns. Each sum has also been multiplied +; by the zero point offset. +; +; offa - Supplies the zero point offset for the other source matrix of the +; matrix multiplication. +; +; Return Value: +; +; None. +; +;-- + + NESTED_ENTRY MlasGemmU8U8CopyPackBAvx2, _TEXT + + rex_push_reg rbp + push_reg rbx + push_reg rsi + alloc_stack (GemmU8U8CopyPackBFrame.SavedRsi) + + END_PROLOGUE + + mov rsi,rdx + mov r10,GemmU8U8CopyPackBFrame.CountK[rsp] + mov r11,GemmU8U8CopyPackBFrame.ColumnSumVector[rsp] + vpbroadcastw ymm5,WORD PTR GemmU8U8CopyPackBFrame.offa[rsp] + +; +; Zero initialize the padded stack buffers. +; + + vpxor xmm0,xmm0,xmm0 + vmovdqu YMMWORD PTR GemmU8U8CopyPackBFrame.PaddedMatrixBData[rsp],ymm0 + +; +; Process 16 columns of matrix B in a loop. 
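Same idea for the B-side packing: rows are interleaved in pairs along K (the vpunpcklbw/vpunpckhbw sequence below), and ColumnSumVector receives per-column sums pre-multiplied by offa. The scalar sketch below covers only the column-sum contract; ReferenceCopyPackBColumnSums is a hypothetical helper written for this note, not part of MLAS.

#include <cstdint>
#include <cstddef>

// Scalar reference for the per-column sums produced by the U8U8 CopyPackB routine:
// ColumnSumVector[n] = offa * sum over k of B[k][n]. The packed data itself stores
// each 16-column panel with rows interleaved in pairs (B[k][n], B[k+1][n]) so the
// kernel can read one 16-bit pair per column per step; an odd trailing row is
// paired with zero.
void ReferenceCopyPackBColumnSums(const uint8_t* B, size_t ldb,
                                  size_t CountK, size_t CountN,
                                  int32_t* ColumnSumVector, uint8_t offa) {
  for (size_t n = 0; n < CountN; ++n) {
    int32_t col_sum = 0;
    for (size_t k = 0; k < CountK; ++k) col_sum += B[k * ldb + n];
    ColumnSumVector[n] = col_sum * int32_t(offa);
  }
}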
+; + + sub r9,16 + jb ProcessRemainingColumns + +ProcessNextColumnN16: + vpxor xmm0,xmm0,xmm0 ; clear column accumulators + vpxor xmm1,xmm1,xmm1 + mov rdx,rsi + add rsi,16 ; advance next matrix B by 16 columns + mov rbx,r10 ; reload rows remaining + sub rbx,2 + jb ProcessRemainingRowsN16 + +ProcessNextRowLoopN16: + vmovdqu xmm2,XMMWORD PTR [rdx] ; load two rows + vmovdqu xmm3,XMMWORD PTR [rdx+r8] + lea rdx,[rdx+r8*2] ; advance matrix B by two rows + vpunpcklbw xmm4,xmm2,xmm3 ; interleave row data + vpunpckhbw xmm3,xmm2,xmm3 + vmovdqu XMMWORD PTR [rcx],xmm4 ; store interleaved rows + vmovdqu XMMWORD PTR [rcx+16],xmm3 + vpmovzxbw ymm4,xmm4 + vpmovzxbw ymm3,xmm3 + add rcx,32 ; advance matrix D by 32 bytes + vpaddw ymm0,ymm0,ymm4 ; accumulate per column + vpaddw ymm1,ymm1,ymm3 + sub rbx,2 ; subtract columns remaining + jae ProcessNextRowLoopN16 + +ProcessRemainingRowsN16: + add rbx,2 ; correct for over-subtract above + jz ReduceColumnSumVectorN16 + vpmovzxbw ymm4,XMMWORD PTR [rdx] + vmovdqu YMMWORD PTR [rcx],ymm4 ; store interleaved rows + vextracti128 xmm3,ymm4,1 + vpmovzxbw ymm4,xmm4 + vpmovzxbw ymm3,xmm3 + vpaddw ymm0,ymm0,ymm4 ; accumulate per column + vpaddw ymm1,ymm1,ymm3 + add rcx,32 ; advance matrix D by 32 bytes + +ReduceColumnSumVectorN16: + vpmaddwd ymm0,ymm0,ymm5 ; multiply by offset and reduce + vpmaddwd ymm1,ymm1,ymm5 ; multiply by offset and reduce + vmovdqu YMMWORD PTR [r11],ymm0 + vmovdqu YMMWORD PTR [r11+32],ymm1 + add r11,64 ; advance column sum vector by 16 dwords + sub r9,16 ; subtract columns remaining + jae ProcessNextColumnN16 + +ProcessRemainingColumns: + add r9,16 ; correct for over-subtract above + jnz ProcessColumnNUnaligned + +; +; Restore non-volatile registers and return. +; + +ExitRoutine: + vzeroupper + add rsp,(GemmU8U8CopyPackBFrame.SavedRsi) + + BEGIN_EPILOGUE + + pop rsi + pop rbx + pop rbp + ret + +; +; Process the remaining columns of matrix B. +; + +ProcessColumnNUnaligned: + vpxor xmm0,xmm0,xmm0 ; clear column accumulators + vpxor xmm1,xmm1,xmm1 + sub r10,2 + jb ProcessRemainingRowsNUnaligned + +ProcessNextRowLoopNUnaligned: + mov rdx,rsi +.errnz GemmU8U8CopyPackBFrame.PaddedMatrixBData + mov rbp,rsp ; GemmU8U8CopyPackBFrame.PaddedMatrixBData + test r9b,8 ; (CountN & 8) != 0? + jz CopyRemainingCountNLessThan8K2 + mov rax,QWORD PTR [rdx] + mov QWORD PTR [rbp],rax + mov rax,QWORD PTR [rdx+r8] + mov QWORD PTR [rbp+16],rax + add rdx,8 ; advance matrix B + add rbp,8 ; advance padded buffer destination + +CopyRemainingCountNLessThan8K2: + test r9b,4 ; (CountN & 4) != 0? + jz CopyRemainingCountNLessThan4K2 + mov eax,DWORD PTR [rdx] + mov DWORD PTR [rbp],eax + mov eax,DWORD PTR [rdx+r8] + mov DWORD PTR [rbp+16],eax + add rdx,4 ; advance matrix B + add rbp,4 ; advance padded buffer destination + +CopyRemainingCountNLessThan4K2: + test r9b,2 ; (CountN & 2) != 0? + jz CopyRemainingCountNLessThan2K2 + movzx eax,WORD PTR [rdx] + mov WORD PTR [rbp],ax + movzx eax,WORD PTR [rdx+r8] + mov WORD PTR [rbp+16],ax + add rdx,2 ; advance matrix B + add rbp,2 ; advance padded buffer destination + +CopyRemainingCountNLessThan2K2: + test r9b,1 ; (CountN & 1) != 0? 
+ jz ProcessPaddedMatrixBDataK2
+ movzx eax,BYTE PTR [rdx]
+ mov BYTE PTR [rbp],al
+ movzx eax,BYTE PTR [rdx+r8]
+ mov BYTE PTR [rbp+16],al
+
+ProcessPaddedMatrixBDataK2:
+ vmovdqu xmm2,XMMWORD PTR GemmU8U8CopyPackBFrame.PaddedMatrixBData[rsp]
+ vmovdqu xmm3,XMMWORD PTR GemmU8U8CopyPackBFrame.PaddedMatrixBData[rsp+16]
+ vpunpcklbw xmm4,xmm2,xmm3 ; interleave row data
+ vpunpckhbw xmm3,xmm2,xmm3
+ vmovdqu XMMWORD PTR [rcx],xmm4 ; store interleaved rows
+ vmovdqu XMMWORD PTR [rcx+16],xmm3
+ vpmovzxbw ymm4,xmm4
+ vpmovzxbw ymm3,xmm3
+ vpaddw ymm0,ymm0,ymm4 ; accumulate per column
+ vpaddw ymm1,ymm1,ymm3
+ lea rsi,[rsi+r8*2] ; advance next matrix B by two rows
+ add rcx,32 ; advance matrix D by 32 bytes
+ sub r10,2 ; subtract rows remaining
+ jae ProcessNextRowLoopNUnaligned
+
+ProcessRemainingRowsNUnaligned:
+ add r10,2
+ jz ReduceColumnSumVectorNUnaligned
+ mov rdx,rsi
+.errnz GemmU8U8CopyPackBFrame.PaddedMatrixBData
+ mov rbp,rsp ; GemmU8U8CopyPackBFrame.PaddedMatrixBData
+ test r9b,8 ; (CountN & 8) != 0?
+ jz CopyRemainingCountNLessThan8K1
+ mov rax,QWORD PTR [rdx]
+ mov QWORD PTR [rbp],rax
+ add rdx,8 ; advance matrix B
+ add rbp,8 ; advance padded buffer destination
+
+CopyRemainingCountNLessThan8K1:
+ test r9b,4 ; (CountN & 4) != 0?
+ jz CopyRemainingCountNLessThan4K1
+ mov eax,DWORD PTR [rdx]
+ mov DWORD PTR [rbp],eax
+ add rdx,4 ; advance matrix B
+ add rbp,4 ; advance padded buffer destination
+
+CopyRemainingCountNLessThan4K1:
+ test r9b,2 ; (CountN & 2) != 0?
+ jz CopyRemainingCountNLessThan2K1
+ movzx eax,WORD PTR [rdx]
+ mov WORD PTR [rbp],ax
+ add rdx,2 ; advance matrix B
+ add rbp,2 ; advance padded buffer destination
+
+CopyRemainingCountNLessThan2K1:
+ test r9b,1 ; (CountN & 1) != 0?
+ jz ProcessPaddedMatrixBDataK1
+ movzx eax,BYTE PTR [rdx]
+ mov BYTE PTR [rbp],al
+
+ProcessPaddedMatrixBDataK1:
+ vpmovzxbw ymm4,XMMWORD PTR GemmU8U8CopyPackBFrame.PaddedMatrixBData[rsp]
+ vmovdqu YMMWORD PTR [rcx],ymm4 ; store interleaved rows
+ vextracti128 xmm3,ymm4,1
+ vpmovzxbw ymm4,xmm4
+ vpmovzxbw ymm3,xmm3
+ vpaddw ymm0,ymm0,ymm4 ; accumulate per column
+ vpaddw ymm1,ymm1,ymm3
+
+ReduceColumnSumVectorNUnaligned:
+ vpmaddwd ymm0,ymm0,ymm5 ; multiply by offset and reduce
+ vpmaddwd ymm1,ymm1,ymm5 ; multiply by offset and reduce
+ vmovdqu YMMWORD PTR [r11],ymm0
+ vmovdqu YMMWORD PTR [r11+32],ymm1
+ jmp ExitRoutine
+
+ NESTED_END MlasGemmU8U8CopyPackBAvx2, _TEXT
+
+;
+; Macro Description:
+;
+; This macro generates code to multiply and accumulate a single row of the
+; output block.
+;
+; Arguments:
+;
+; ColumnCount - Supplies the number of columns to produce.
+;
+; Vec1Reg - Supplies the high block accumulator register (when ColumnCount
+; is 16).
+;
+; Vec2Reg - Supplies the low block accumulator register.
+;
+; Implicit Arguments:
+;
+; ymm0 - Supplies the first vector loaded from matrix B.
+;
+; ymm1 - Supplies the second vector loaded from matrix B (when ColumnCount
+; is 16).
+;
+; ymm2 - Supplies the broadcast value loaded from matrix A.
+;
+
+MultiplyAccumulateRow MACRO ColumnCount, Vec1Reg, Vec2Reg
+
+IF ColumnCount EQ 16
+ vpmaddwd ymm3,ymm2,ymm0
+ vpaddd Vec1Reg,Vec1Reg,ymm3
+ vpmaddwd ymm2,ymm2,ymm1
+ vpaddd Vec2Reg,Vec2Reg,ymm2
+ELSE
+ vpmaddwd ymm3,ymm2,ymm0
+ vpaddd Vec2Reg,Vec2Reg,ymm3
+ENDIF
+
+ ENDM
+
+;
+; Macro Description:
+;
+; This macro generates code to multiply and accumulate each row of the output
+; block.
+;
+; Arguments:
+;
+; ColumnCount - Supplies the number of columns to produce.
+; +; RowCount - Supplies the number of rows to produce. +; +; VectorOffset - Supplies the byte offset from matrix B to fetch elements. +; +; BroadcastOffset - Supplies the byte offset from matrix A to fetch elements. +; +; Implicit Arguments: +; +; rbx - Supplies the address into the matrix A data plus 3 rows. +; +; rcx - Supplies the address into the matrix A data. +; +; rdx - Supplies the address into the matrix B data. +; +; r10 - Supplies the length in bytes of a row from matrix A. +; +; ymm4-ymm15 - Supplies the block accumulators. +; + +ComputeBlock MACRO ColumnCount, RowCount, VectorOffset, BroadcastOffset + + vpmovzxbw ymm0,XMMWORD PTR [rdx+VectorOffset] + EmitIfCountGE ColumnCount, 16, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + EmitIfCountGE RowCount, 6, + + ENDM + +; +; Macro Description: +; +; This macro generates code to produce an output block for a set of columns +; and rows. +; +; Arguments: +; +; ColumnCount - Supplies the number of columns to produce. +; +; RowCount - Supplies the number of rows to produce. +; +; Implicit Arguments: +; +; rax - Supplies the length in bytes of a row from matrix C. +; +; rcx - Supplies the address into the matrix A data. +; +; rdx - Supplies the address into the matrix B data. +; +; r9 - Supplies the number of paired columns from matrix A and the number of +; paired rows from matrix B to iterate over. +; +; r10 - Supplies the length in bytes of a row from matrix A. +; +; r12 - Supplies the address of the row sum vector. +; +; r13 - Supplies the address of the column sum vector. +; + +ProduceOutputBlock MACRO ColumnCount, RowCount + + LOCAL ComputeBlockLoop + LOCAL ProcessRemainingBlocks + LOCAL ComputeBlockLoopExit + +; +; Initialize the accumulators with the sum of the global depth value constant, +; the column sums, and the row sums. +; + + vpbroadcastd ymm1,DWORD PTR GemmU8U8KernelFrame.DepthValue[rsp] +IF ColumnCount EQ 16 + vpaddd ymm0,ymm1,YMMWORD PTR [r13] + vpaddd ymm1,ymm1,YMMWORD PTR [r13+32] + add r13,16*4 ; advance ColumnSumVector by 16 columns +ELSE + vpaddd ymm1,ymm1,YMMWORD PTR [r13] +ENDIF + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + EmitIfCount2GE RowCount, 1, ColumnCount, 16, + EmitIfCountGE RowCount, 1, + EmitIfCount2GE RowCount, 2, ColumnCount, 16, + EmitIfCountGE RowCount, 2, + EmitIfCount2GE RowCount, 3, ColumnCount, 16, + EmitIfCountGE RowCount, 3, + EmitIfCount2GE RowCount, 4, ColumnCount, 16, + EmitIfCountGE RowCount, 4, + EmitIfCount2GE RowCount, 5, ColumnCount, 16, + EmitIfCountGE RowCount, 5, + EmitIfCount2GE RowCount, 6, ColumnCount, 16, + EmitIfCountGE RowCount, 6, + +; +; Iterate over PairedCountK elements from matrix A and matrix B. +; +; Unrolling the loop to do two iterations improves performance slightly at the +; cost of larger code size. Balance this by only unrolling for the common case +; of computing 16 columns for an even number of rows. 
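+;
+; Each "pair" covers two adjacent values along the K dimension: packed matrix A
+; stores them as one 32-bit dword per row and packed matrix B stores them as 32
+; interleaved bytes, which is why one iteration advances matrix A by 4 bytes and
+; matrix B by 32 bytes.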
+; + + mov rsi,r9 ; reload PairedCountK +IF RowCount GT 3 + lea rbx,[r10*2+r10] + add rbx,rcx ; compute matrix A plus 3 rows +ENDIF + +IF (ColumnCount EQ 16) AND ((RowCount AND 1) EQ 0) + sub rsi,2 + jb ProcessRemainingBlocks + +ComputeBlockLoop: + ComputeBlock ColumnCount, RowCount, 0, 0 + ComputeBlock ColumnCount, RowCount, 32, 4 + add rcx,2*4 ; advance matrix A by 2 pairs +IF RowCount GT 3 + add rbx,2*4 ; advance matrix A plus 3 rows by 2 pairs +ENDIF + add rdx,2*32 ; advance matrix B by 64 columns + sub rsi,2 ; subtract pairs remaining + jae ComputeBlockLoop + +ProcessRemainingBlocks: + add rsi,2 ; correct for over-subtract above + jz ComputeBlockLoopExit + ComputeBlock ColumnCount, RowCount, 0, 0 + add rdx,32 ; advance matrix B by 32 columns +ELSE +ComputeBlockLoop: + ComputeBlock ColumnCount, RowCount, 0, 0 + add rcx,4 ; advance matrix A by 1 pair +IF RowCount GT 3 + add rbx,4 ; advance matrix A plus 3 rows by 1 pair +ENDIF + add rdx,32 + dec rsi ; decrement pairs remaining + jnz ComputeBlockLoop +ENDIF + +ComputeBlockLoopExit: +IF RowCount GT 3 + lea rbx,[r8+rax*2] ; compute matrix C plus 3 rows + add rbx,rax +ENDIF + + ENDM + +; +; Macro Description: +; +; This macro generates code to compute matrix multiplication for a fixed set +; of rows. +; +; Arguments: +; +; RowCount - Supplies the number of rows to process. +; +; Fallthrough - Supplies a non-blank value if the macro may fall through to +; the ExitKernel label. +; +; Implicit Arguments: +; +; rax - Supplies the length in bytes of a row from matrix C. +; +; rcx - Supplies the address of matrix A. +; +; rdx - Supplies the address of matrix B. +; +; r8 - Supplies the address of matrix C. +; +; rdi - Supplies the address of matrix A. +; +; rbp - Supplies the number of columns from matrix B and matrix C to iterate +; over. +; +; r9 - Supplies the number of paired columns from matrix A and the number of +; paired rows from matrix B to iterate over. +; +; r10 - Supplies the length in bytes of a row from matrix A. +; +; r12 - Supplies the address of the row sum vector. +; +; r13 - Supplies the address of the column sum vector. +; +; r14b - Supplies the zero mode flag. +; + +ProcessCountM MACRO RowCount, Fallthrough + + LOCAL ProcessNextColumnLoop16xN + LOCAL SkipAccumulateOutput16xNBlock + LOCAL OutputMasked16xNBlock + LOCAL ProcessRemainingCountN + LOCAL SkipAccumulateOutput8xNBlock + LOCAL SkipAccumulateOutputMasked16xNBlock + LOCAL OutputMasked8xNBlock + LOCAL SkipAccumulateOutputMasked8xNBlock + + cmp rbp,8 + jbe ProcessRemainingCountN + +ProcessNextColumnLoop16xN: + ProduceOutputBlock 16, RowCount + sub rbp,16 + jb OutputMasked16xNBlock + test r14b,r14b ; ZeroMode? 
+ jnz SkipAccumulateOutput16xNBlock + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + EmitIfCountGE RowCount, 6, + +SkipAccumulateOutput16xNBlock: + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + EmitIfCountGE RowCount, 6, + add r8,16*4 ; advance matrix C by 16 columns + mov rcx,rdi ; reload matrix A + cmp rbp,8 + ja ProcessNextColumnLoop16xN + test rbp,rbp + jz ExitKernel + +ProcessRemainingCountN: + ProduceOutputBlock 8, RowCount + cmp rbp,8 + jb OutputMasked8xNBlock + test r14b,r14b ; ZeroMode? + jnz SkipAccumulateOutput8xNBlock + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + +SkipAccumulateOutput8xNBlock: + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + jmp ExitKernel + +OutputMasked16xNBlock: + test r14b,r14b ; ZeroMode? + jnz SkipAccumulateOutputMasked16xNBlock + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + +SkipAccumulateOutputMasked16xNBlock: + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + add r8,8*4 ; advance matrix C by 8 columns +IF RowCount GT 3 + add rbx,8*4 ; advance matrix C plus 3 rows by 8 columns +ENDIF + add rbp,8 ; correct for over-subtract above + +OutputMasked8xNBlock: + mov DWORD PTR GemmU8U8KernelFrame.CountN[rsp],ebp + vpbroadcastd ymm0,DWORD PTR GemmU8U8KernelFrame.CountN[rsp] + vpcmpgtd ymm0,ymm0,YMMWORD PTR [MlasMaskMoveAvx] + test r14b,r14b ; ZeroMode? + jnz SkipAccumulateOutputMasked8xNBlock + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + +SkipAccumulateOutputMasked8xNBlock: + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, +IFB + jmp ExitKernel +ENDIF + + ENDM + +;++ +; +; Routine Description: +; +; This routine is an inner kernel to compute matrix multiplication for a +; set of rows. +; +; Arguments: +; +; A (rcx) - Supplies the address of matrix A. The matrix data has been packed +; using MlasGemmU8U8CopyPackAAvx2. +; +; B (rdx) - Supplies the address of matrix B. The matrix data has been packed +; using MlasGemmU8U8CopyPackBAvx2. +; +; C (r8) - Supplies the address of matrix C. 
+; +; PairedCountK (r9) - Supplies the number of paired columns from matrix A and +; the number of paired rows from matrix B to iterate over. +; +; CountM - Supplies the maximum number of rows that can be processed for +; matrix A and matrix C. The actual number of rows handled for this +; invocation depends on the kernel implementation. +; +; CountN - Supplies the number of columns from matrix B and matrix C to iterate +; over. +; +; ldc - Supplies the first dimension of matrix C. +; +; RowSumVector - Supplies the sum of each row from matrix A multiplied by the +; zero point offset of matrix B. These values are accumulated into every +; row of matrix C. +; +; ColumnSumVector - Supplies the sum of each column from matrix B multiplied +; by the zero point offset of matrix A. These values are accumulated into +; every column of matrix C. +; +; DepthValue - Supplies the value CountK multiplied by the zero point offset +; of matrixA multplied by the zero point offset of matrix B. This value is +; accumulated into every element of matrix C. +; +; ZeroMode - Supplies true if the output matrix must be zero initialized, +; else false if the output matrix is accumulated into. +; +; Return Value: +; +; Returns the number of rows handled. +; +;-- + + NESTED_ENTRY MlasGemmU8U8KernelAvx2, _TEXT + + rex_push_reg rbp + push_reg rbx + push_reg rsi + push_reg rdi + push_reg r12 + push_reg r13 + push_reg r14 + alloc_stack (GemmU8U8KernelFrame.SavedR14) + save_xmm128_avx xmm6,GemmU8U8KernelFrame.SavedXmm6 + save_xmm128_avx xmm7,GemmU8U8KernelFrame.SavedXmm7 + save_xmm128_avx xmm8,GemmU8U8KernelFrame.SavedXmm8 + save_xmm128_avx xmm9,GemmU8U8KernelFrame.SavedXmm9 + save_xmm128_avx xmm10,GemmU8U8KernelFrame.SavedXmm10 + save_xmm128_avx xmm11,GemmU8U8KernelFrame.SavedXmm11 + save_xmm128_avx xmm12,GemmU8U8KernelFrame.SavedXmm12 + save_xmm128_avx xmm13,GemmU8U8KernelFrame.SavedXmm13 + save_xmm128_avx xmm14,GemmU8U8KernelFrame.SavedXmm14 + save_xmm128_avx xmm15,GemmU8U8KernelFrame.SavedXmm15 + + END_PROLOGUE + + mov rdi,rcx + mov rbp,GemmU8U8KernelFrame.CountN[rsp] + mov rax,GemmU8U8KernelFrame.ldc[rsp] + shl rax,2 ; convert ldc to bytes + lea r10,[r9*4] + mov r11,GemmU8U8KernelFrame.CountM[rsp] + mov r12,GemmU8U8KernelFrame.RowSumVector[rsp] + mov r13,GemmU8U8KernelFrame.ColumnSumVector[rsp] + movzx r14,BYTE PTR GemmU8U8KernelFrame.ZeroMode[rsp] + +; +; Process CountM rows of the matrices. +; + + cmp r11,5 + ja ProcessCountM6 + je ProcessCountM5 + cmp r11,3 + ja ProcessCountM4 + je ProcessCountM3 + cmp r11,1 + je ProcessCountM1 + +ProcessCountM2: + ProcessCountM 2 + +ProcessCountM4: + ProcessCountM 4 + +ProcessCountM6: + mov r11d,6 ; return 6 rows handled + ProcessCountM 6, Fallthrough + +; +; Restore non-volatile registers and return. 
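+;
+; ExitKernel is shared by every ProcessCountM expansion above; eax returns the
+; number of rows handled from r11d (forced to 6 for the fall-through case, else
+; the caller supplied CountM).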
+; + +ExitKernel: + mov eax,r11d + vzeroupper + vmovaps xmm6,GemmU8U8KernelFrame.SavedXmm6[rsp] + vmovaps xmm7,GemmU8U8KernelFrame.SavedXmm7[rsp] + vmovaps xmm8,GemmU8U8KernelFrame.SavedXmm8[rsp] + vmovaps xmm9,GemmU8U8KernelFrame.SavedXmm9[rsp] + vmovaps xmm10,GemmU8U8KernelFrame.SavedXmm10[rsp] + vmovaps xmm11,GemmU8U8KernelFrame.SavedXmm11[rsp] + vmovaps xmm12,GemmU8U8KernelFrame.SavedXmm12[rsp] + vmovaps xmm13,GemmU8U8KernelFrame.SavedXmm13[rsp] + vmovaps xmm14,GemmU8U8KernelFrame.SavedXmm14[rsp] + vmovaps xmm15,GemmU8U8KernelFrame.SavedXmm15[rsp] + add rsp,(GemmU8U8KernelFrame.SavedR14) + + BEGIN_EPILOGUE + + pop r14 + pop r13 + pop r12 + pop rdi + pop rsi + pop rbx + pop rbp + ret + +ProcessCountM1: + ProcessCountM 1 + +ProcessCountM3: + ProcessCountM 3 + +ProcessCountM5: + ProcessCountM 5 + + NESTED_END MlasGemmU8U8KernelAvx2, _TEXT + + END diff --git a/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512BW.asm b/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512BW.asm new file mode 100644 index 0000000000000..8f4d0fa47f7e2 --- /dev/null +++ b/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512BW.asm @@ -0,0 +1,114 @@ +;++ +; +; Copyright (c) Microsoft Corporation. All rights reserved. +; +; Licensed under the MIT License. +; +; Module Name: +; +; QgemmU8U8KernelAvx512BW.asm +; +; Abstract: +; +; This module implements the kernels for the quantized integer matrix/matrix +; multiply operation (QGEMM). +; +; This implementation uses AVX512BW instructions. +; +;-- + + .xlist +INCLUDE mlasi.inc +INCLUDE QgemmU8U8KernelAvx512Common.inc + .list + +; +; Macro Description: +; +; This macro generates code to multiply and accumulator a single row of the +; output block. +; +; Arguments: +; +; ColumnCount - Supplies the number of columns to produce. +; +; Vec1Reg - Supplies the high block accumulator register (when ColumnCount +; is 32). +; +; Vec2Reg - Supplies the low block accumulator register. +; +; Implicit Arguments: +; +; zmm28 - Supplies the first vector loaded from matrix B. +; +; zmm29 - Supplies the second vector loaded from matrix B (when ColumnCount +; is 32). +; +; zmm30 - Supplies the broadcast value loaded from matrix A. +; + +MultiplyAccumulateRow MACRO ColumnCount, Vec1Reg, Vec2Reg + +IF ColumnCount EQ 32 + vpmaddwd zmm31,zmm30,zmm28 + vpaddd Vec1Reg,Vec1Reg,zmm31 + vpmaddwd zmm30,zmm30,zmm29 + vpaddd Vec2Reg,Vec2Reg,zmm30 +ELSE + vpmaddwd zmm31,zmm30,zmm28 + vpaddd Vec2Reg,Vec2Reg,zmm31 +ENDIF + + ENDM + +; +; Macro Description: +; +; This macro generates code to multiply and accumulate each row of the output +; block. +; +; Arguments: +; +; ColumnCount - Supplies the number of columns to produce. +; +; RowCount - Supplies the number of rows to produce. +; +; Implicit Arguments: +; +; rbx - Supplies the address into the matrix A data plus 3 rows. +; +; rcx - Supplies the address into the matrix A data. +; +; rdx - Supplies the address into the matrix B data. +; +; r10 - Supplies the length in bytes of a row from matrix A. +; +; zmm16-zmm27 - Supplies the block accumulators. 
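+; (two accumulators per row when ColumnCount is 32, so six rows span zmm16
+; through zmm27, leaving zmm28-zmm31 as scratch for the matrix B vectors, the
+; matrix A broadcast, and the intermediate products)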
+; + +ComputeBlock MACRO ColumnCount, RowCount + + vpmovzxbw zmm28,YMMWORD PTR [rdx] + EmitIfCountGE ColumnCount, 32, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + EmitIfCountGE RowCount, 6, + + ENDM + +; +; Generate the GEMM kernel. +; + +GemmU8U8KernelAvx512Function Avx512BW + + END diff --git a/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Common.inc b/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Common.inc new file mode 100644 index 0000000000000..1cd5cdc732b12 --- /dev/null +++ b/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Common.inc @@ -0,0 +1,385 @@ +;++ +; +; Copyright (c) Microsoft Corporation. All rights reserved. +; +; Licensed under the MIT License. +; +; Module Name: +; +; QgemmU8U8KernelAvx512Common.inc +; +; Abstract: +; +; This module contains common kernel macros and structures for the quantized +; integer matrix/matrix multiply operation (QGEMM) for the AVX512BW and +; AVX512VNNI kernels. +; +;-- + +; +; Stack frame layout for the U8U8 kernel. +; + +GemmU8U8KernelFrame STRUCT + + SavedR14 QWORD ? + SavedR13 QWORD ? + SavedR12 QWORD ? + SavedRdi QWORD ? + SavedRsi QWORD ? + SavedRbx QWORD ? + SavedRbp QWORD ? + ReturnAddress QWORD ? + PreviousP1Home QWORD ? + PreviousP2Home QWORD ? + PreviousP3Home QWORD ? + PreviousP4Home QWORD ? + CountM QWORD ? + CountN QWORD ? + ldc QWORD ? + RowSumVector QWORD ? + ColumnSumVector QWORD ? + DepthValue QWORD ? + ZeroMode QWORD ? + +GemmU8U8KernelFrame ENDS + +; +; Macro Description: +; +; This macro generates code to produce an output block for a set of columns +; and rows. +; +; Arguments: +; +; ColumnCount - Supplies the number of columns to produce. +; +; RowCount - Supplies the number of rows to produce. +; +; Implicit Arguments: +; +; rcx - Supplies the address into the matrix A data. +; +; rdx - Supplies the address into the matrix B data. +; +; r9 - Supplies the number of paired columns from matrix A and the number of +; paired rows from matrix B to iterate over. +; +; r10 - Supplies the length in bytes of a row from matrix A. +; +; r12 - Supplies the address of the row sum vector. +; +; r13 - Supplies the address of the column sum vector. +; + +ProduceOutputBlock MACRO ColumnCount, RowCount + + LOCAL ComputeBlockLoop + +; +; Initialize the accumulators with the sum of the global depth value constant, +; the column sums, and the row sums. +; + + vpbroadcastd zmm31,DWORD PTR GemmU8U8KernelFrame.DepthValue[rsp] +IF ColumnCount EQ 32 + vpaddd zmm30,zmm31,ZMMWORD PTR [r13] + vpaddd zmm31,zmm31,ZMMWORD PTR [r13+64] + add r13,32*4 ; advance ColumnSumVector by 32 columns +ELSE + vpaddd zmm31,zmm31,ZMMWORD PTR [r13] +ENDIF + EmitIfCount2GE RowCount, 1, ColumnCount, 32, + EmitIfCountGE RowCount, 1, + EmitIfCount2GE RowCount, 2, ColumnCount, 32, + EmitIfCountGE RowCount, 2, + EmitIfCount2GE RowCount, 3, ColumnCount, 32, + EmitIfCountGE RowCount, 3, + EmitIfCount2GE RowCount, 4, ColumnCount, 32, + EmitIfCountGE RowCount, 4, + EmitIfCount2GE RowCount, 5, ColumnCount, 32, + EmitIfCountGE RowCount, 5, + EmitIfCount2GE RowCount, 6, ColumnCount, 32, + EmitIfCountGE RowCount, 6, + +; +; Iterate over PairedCountK elements from matrix A and matrix B. 
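+;
+; Unlike the AVX2 kernel, this loop is not unrolled: ComputeBlock consumes one
+; pair per iteration, advancing matrix A by 4 bytes and matrix B by 32 bytes.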
+; + + mov rsi,r9 ; reload PairedCountK +IF RowCount GT 3 + lea rbx,[r10*2+r10] + add rbx,rcx ; compute matrix A plus 3 rows +ENDIF + +ComputeBlockLoop: + ComputeBlock ColumnCount, RowCount + add rcx,4 ; advance matrix A by 1 pair +IF RowCount GT 3 + add rbx,4 ; advance matrix A plus 3 rows by 1 pair +ENDIF + add rdx,32 + dec rsi ; decrement pairs remaining + jnz ComputeBlockLoop + +IF RowCount GT 3 + lea rbx,[r8+rax*2] ; compute matrix C plus 3 rows + add rbx,rax +ENDIF + + ENDM + +; +; Macro Description: +; +; This macro generates code to compute matrix multiplication for a fixed set +; of rows. +; +; Arguments: +; +; RowCount - Supplies the number of rows to process. +; +; Implicit Arguments: +; +; rax - Supplies the length in bytes of a row from matrix C. +; +; rcx - Supplies the address of matrix A. +; +; rdx - Supplies the address of matrix B. +; +; r8 - Supplies the address of matrix C. +; +; rdi - Supplies the address of matrix A. +; +; rbp - Supplies the number of columns from matrix B and matrix C to iterate +; over. +; +; r9 - Supplies the number of paired columns from matrix A and the number of +; paired rows from matrix B to iterate over. +; +; r10 - Supplies the length in bytes of a row from matrix A. +; +; r12 - Supplies the address of the row sum vector. +; +; r13 - Supplies the address of the column sum vector. +; +; r14b - Supplies the zero mode flag. +; + +ProcessCountM MACRO RowCount + + LOCAL ProcessNextColumnLoop32xN + LOCAL SkipAccumulateOutput32xNBlock + LOCAL Output16xNBlock + LOCAL Output16xNBlockWithMask + LOCAL SkipAccumulateOutput16xNBlockWithMask + LOCAL ProcessRemainingCountN + + cmp rbp,16 + jbe ProcessRemainingCountN + +ProcessNextColumnLoop32xN: + ProduceOutputBlock 32, RowCount + lea rdx,[rdx+r10*8] ; advance matrix B by 8*PairedCountK + test r14b,r14b ; ZeroMode? + jnz SkipAccumulateOutput32xNBlock + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + +SkipAccumulateOutput32xNBlock: + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + add r8,16*4 ; advance matrix C by 16 columns +IF RowCount GT 3 + add rbx,16*4 ; advance matrix C plus 3 rows by 16 columns +ENDIF + sub rbp,16 + +Output16xNBlock: + sub rbp,16 + jae Output16xNBlockWithMask + lea ecx,[ebp+16] ; correct for over-subtract above + mov esi,1 + shl esi,cl + dec esi + kmovw k1,esi ; update mask for remaining columns + xor ebp,ebp ; no more columns remaining + +Output16xNBlockWithMask: + test r14b,r14b ; ZeroMode? + jnz SkipAccumulateOutput16xNBlockWithMask + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + +SkipAccumulateOutput16xNBlockWithMask: + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + add r8,16*4 ; advance matrix C by 16 columns + mov rcx,rdi ; reload matrix A + cmp rbp,16 + ja ProcessNextColumnLoop32xN + test rbp,rbp + jz ExitKernel + +ProcessRemainingCountN: + ProduceOutputBlock 16, RowCount + jmp Output16xNBlock + + ENDM + +; +; Macro Description: +; +; This macro generates the common AVX512 code for the inner kernel to compute +; matrix multiplication. 
+; +; Arguments: +; +; Isa - Supplies the instruction set architecture string for function tags. +; + +GemmU8U8KernelAvx512Function MACRO Isa + +;++ +; +; Routine Description: +; +; This routine is an inner kernel to compute matrix multiplication for a +; set of rows. +; +; Arguments: +; +; A (rcx) - Supplies the address of matrix A. The matrix data has been packed +; using MlasGemmU8U8CopyPackAAvx2. +; +; B (rdx) - Supplies the address of matrix B. The matrix data has been packed +; using MlasGemmU8U8CopyPackBAvx2. +; +; C (r8) - Supplies the address of matrix C. +; +; PairedCountK (r9) - Supplies the number of paired columns from matrix A and +; the number of paired rows from matrix B to iterate over. +; +; CountM - Supplies the maximum number of rows that can be processed for +; matrix A and matrix C. The actual number of rows handled for this +; invocation depends on the kernel implementation. +; +; CountN - Supplies the number of columns from matrix B and matrix C to iterate +; over. +; +; ldc - Supplies the first dimension of matrix C. +; +; RowSumVector - Supplies the sum of each row from matrix A multiplied by the +; zero point offset of matrix B. These values are accumulated into every +; row of matrix C. +; +; ColumnSumVector - Supplies the sum of each column from matrix B multiplied +; by the zero point offset of matrix A. These values are accumulated into +; every column of matrix C. +; +; DepthValue - Supplies the value CountK multiplied by the zero point offset +; of matrixA multplied by the zero point offset of matrix B. This value is +; accumulated into every element of matrix C. +; +; ZeroMode - Supplies true if the output matrix must be zero initialized, +; else false if the output matrix is accumulated into. +; +; Return Value: +; +; Returns the number of rows handled. +; +;-- + + NESTED_ENTRY MlasGemmU8U8Kernel&Isa&, _TEXT + + rex_push_reg rbp + push_reg rbx + push_reg rsi + push_reg rdi + push_reg r12 + push_reg r13 + push_reg r14 + + END_PROLOGUE + + mov rdi,rcx + mov rbp,GemmU8U8KernelFrame.CountN[rsp] + mov rax,GemmU8U8KernelFrame.ldc[rsp] + shl rax,2 ; convert ldc to bytes + lea r10,[r9*4] + mov r11,GemmU8U8KernelFrame.CountM[rsp] + mov r12,GemmU8U8KernelFrame.RowSumVector[rsp] + mov r13,GemmU8U8KernelFrame.ColumnSumVector[rsp] + movzx r14,BYTE PTR GemmU8U8KernelFrame.ZeroMode[rsp] + mov esi,-1 + kmovw k1,esi ; update mask to write all columns + +; +; Process CountM rows of the matrices. +; + + cmp r11,5 + ja ProcessCountM6 + je ProcessCountM5 + cmp r11,3 + ja ProcessCountM4 + je ProcessCountM3 + cmp r11,1 + je ProcessCountM1 + +ProcessCountM2: + ProcessCountM 2 + +ProcessCountM4: + ProcessCountM 4 + +ProcessCountM6: + mov r11d,6 ; return 6 rows handled + ProcessCountM 6 + +; +; Restore non-volatile registers and return. +; + +ExitKernel: + mov eax,r11d + + BEGIN_EPILOGUE + + pop r14 + pop r13 + pop r12 + pop rdi + pop rsi + pop rbx + pop rbp + ret + +ProcessCountM1: + ProcessCountM 1 + +ProcessCountM3: + ProcessCountM 3 + +ProcessCountM5: + ProcessCountM 5 + + NESTED_END MlasGemmU8U8Kernel&Isa&, _TEXT + + ENDM diff --git a/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Vnni.asm b/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Vnni.asm new file mode 100644 index 0000000000000..d2b6b696327b9 --- /dev/null +++ b/onnxruntime/core/mlas/lib/amd64/QgemmU8U8KernelAvx512Vnni.asm @@ -0,0 +1,91 @@ +;++ +; +; Copyright (c) Microsoft Corporation. All rights reserved. +; +; Licensed under the MIT License. 
+; +; Module Name: +; +; QgemmU8U8KernelAvx512Vnni.asm +; +; Abstract: +; +; This module implements the kernels for the quantized integer matrix/matrix +; multiply operation (QGEMM). +; +; This implementation uses AVX512VNNI instructions. +; +;-- + + .xlist +INCLUDE mlasi.inc +INCLUDE QgemmU8U8KernelAvx512Common.inc +INCLUDE AssembleAvx512Vnni.inc + .list + +; +; Macro Description: +; +; This macro generates code to multiply and accumulate each row of the output +; block. +; +; Arguments: +; +; ColumnCount - Supplies the number of columns to produce. +; +; RowCount - Supplies the number of rows to produce. +; +; Implicit Arguments: +; +; rbx - Supplies the address into the matrix A data plus 3 rows. +; +; rcx - Supplies the address into the matrix A data. +; +; rdx - Supplies the address into the matrix B data. +; +; r10 - Supplies the length in bytes of a row from matrix A. +; +; zmm16-zmm27 - Supplies the block accumulators. +; + +ComputeBlock MACRO ColumnCount, RowCount + + vpmovzxbw zmm28,YMMWORD PTR [rdx] +IF ColumnCount EQ 32 + vpmovzxbw zmm29,YMMWORD PTR [rdx+r10*8] + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, + EmitIfCountGE RowCount, 6, + EmitIfCountGE RowCount, 6, +ELSE + EmitIfCountGE RowCount, 1, + EmitIfCountGE RowCount, 2, + EmitIfCountGE RowCount, 3, + EmitIfCountGE RowCount, 4, + EmitIfCountGE RowCount, 5, + EmitIfCountGE RowCount, 6, +ENDIF + + ENDM + +; +; Generate the GEMM kernel. +; + +GemmU8U8KernelAvx512Function Avx512Vnni + + END diff --git a/onnxruntime/core/mlas/lib/arm64/sgemma.asm b/onnxruntime/core/mlas/lib/arm64/SgemmKernelNeon.asm similarity index 91% rename from onnxruntime/core/mlas/lib/arm64/sgemma.asm rename to onnxruntime/core/mlas/lib/arm64/SgemmKernelNeon.asm index 0b6eb11fa2d78..3675689db6cd5 100644 --- a/onnxruntime/core/mlas/lib/arm64/sgemma.asm +++ b/onnxruntime/core/mlas/lib/arm64/SgemmKernelNeon.asm @@ -6,7 +6,7 @@ ; ; Module Name: ; -; sgemma.asm +; SgemmKernelNeon.asm ; ; Abstract: ; @@ -19,31 +19,6 @@ TEXTAREA -; -; ComputeEffectiveAddress -; -; Generates the code to compute the effective address of a matrix element using -; the instruction template: -; -; add $DestReg,$BaseReg,$IndexReg lsl #2 -; -; For native ARM64, the macro generates a 64-bit address calculation. For CHPE -; targets, the macro generates a 32-bit address calculation to stay within the -; WOW64 sandbox. 
-; - - - MACRO - ComputeEffectiveAddress $DestReg, $BaseReg, $IndexReg - -#if defined(_CHPE_X86_ARM64_) - DCD 0x0B000800:OR:(:RCONST:$DestReg):OR:((:RCONST:$BaseReg):SHL:5):OR:((:RCONST:$IndexReg):SHL:16) -#else - DCD 0x8B000800:OR:(:RCONST:$DestReg):OR:((:RCONST:$BaseReg):SHL:5):OR:((:RCONST:$IndexReg):SHL:16) -#endif - - MEND - ; ; ClearRowAccumulators ; @@ -171,11 +146,11 @@ ClearBlockAccumulators $Columns, $Rows IF $Rows >= 2 - ComputeEffectiveAddress x10,x0,x6 ; compute matrix A plus 1 row + add x10,x0,x6 lsl #2 ; compute matrix A plus 1 row ENDIF IF $Rows >= 4 - ComputeEffectiveAddress x11,x10,x6 ; compute matrix A plus 2 rows - ComputeEffectiveAddress x12,x11,x6 ; compute matrix A plus 3 rows + add x11,x10,x6 lsl #2 ; compute matrix A plus 2 rows + add x12,x11,x6 lsl #2 ; compute matrix A plus 3 rows ENDIF sub x9,x3,#4 ; decrement block count to process @@ -217,7 +192,7 @@ $Mode.Compute$Columns.x$Rows.BlockBy1Loop ldp v6,v7,[x1,#-8*4] ENDIF MultiplyAccumulateBlock $Columns,$Rows,0 - sub x9,x9,1 + sub x9,x9,#1 cbnz x9,$Mode.Compute$Columns.x$Rows.BlockBy1Loop $Mode.Output$Columns.x$Rows.Block @@ -476,9 +451,9 @@ $Mode.OutputRemaining1x$Rows.Block PROLOG_SAVE_REG_PAIR d8,d9,#-32! PROLOG_SAVE_REG_PAIR d10,d11,#16 - ComputeEffectiveAddress x13,x2,x7 ; compute matrix C plus 1 row - ComputeEffectiveAddress x14,x13,x7 ; compute matrix C plus 2 rows - ComputeEffectiveAddress x15,x14,x7 ; compute matrix C plus 3 rows + add x13,x2,x7 lsl #2 ; compute matrix C plus 1 row + add x14,x13,x7 lsl #2 ; compute matrix C plus 2 rows + add x15,x14,x7 lsl #2 ; compute matrix C plus 3 rows mov x8,x0 ; save matrix A ; diff --git a/onnxruntime/core/mlas/lib/erf.cpp b/onnxruntime/core/mlas/lib/erf.cpp index 12fd0a368d8ae..c1f4a7e6c2821 100644 --- a/onnxruntime/core/mlas/lib/erf.cpp +++ b/onnxruntime/core/mlas/lib/erf.cpp @@ -29,7 +29,7 @@ Module Name: // Bundles the constants for use by kernels written in assembly. // -extern "C" const struct { +MLAS_INTERNAL_DATA const struct { float ErfUpperAbsRange; float ErfSplitBoundary; float ErfSMALL_P0; diff --git a/onnxruntime/core/mlas/lib/logistic.cpp b/onnxruntime/core/mlas/lib/logistic.cpp index 03061bb6bafbd..9e657f1892cc4 100644 --- a/onnxruntime/core/mlas/lib/logistic.cpp +++ b/onnxruntime/core/mlas/lib/logistic.cpp @@ -26,7 +26,7 @@ Module Name: // Bundles the floating point constants for use by kernels written in assembly. // -extern "C" const struct { +MLAS_INTERNAL_DATA const struct { float LowerRange; float UpperRange; float alpha_9; diff --git a/onnxruntime/core/mlas/lib/mlasi.h b/onnxruntime/core/mlas/lib/mlasi.h index b191c155928d9..1ae227d1f9ac1 100644 --- a/onnxruntime/core/mlas/lib/mlasi.h +++ b/onnxruntime/core/mlas/lib/mlasi.h @@ -16,7 +16,6 @@ Module Name: --*/ #pragma once -// clang-format off #include #include @@ -56,6 +55,18 @@ Module Name: #define MLAS_FORCEINLINE __attribute__ ((always_inline)) inline #endif +// +// Macro to tag globals as internal data shared with kernels written in +// assembly. These globals are marked with having hidden visibility to avoid +// needing to access the data through the global object table. +// + +#if defined(_MSC_VER) +#define MLAS_INTERNAL_DATA extern "C" +#else +#define MLAS_INTERNAL_DATA extern "C" __attribute ((visibility("hidden"))) +#endif + // // Macro to suppress unreferenced parameter warnings. 
// @@ -69,7 +80,7 @@ Module Name: #if defined(_M_AMD64) || defined(__x86_64__) #define MLAS_TARGET_AMD64 #endif -#if (defined(_M_IX86) && !defined(_M_HYBRID_X86_ARM64)) || defined(__i386__) +#if defined(_M_IX86) || defined(__i386__) #define MLAS_TARGET_IX86 #endif #if defined(MLAS_TARGET_AMD64) || defined(MLAS_TARGET_IX86) @@ -92,8 +103,6 @@ Module Name: #if defined(_OPENMP) #include -#elif defined(_WIN32) -#define MLAS_USE_WIN32_THREADPOOL #endif // @@ -164,6 +173,52 @@ void typedef MLAS_SGEMM_TRANSPOSE_PACKB_BLOCK_ROUTINE* PMLAS_SGEMM_TRANSPOSE_PACKB_BLOCK_ROUTINE; +typedef +void +(MLASCALL MLAS_GEMM_U8U8_COPY_PACKA_ROUTINE)( + int16_t* D, + const uint8_t* A, + size_t lda, + size_t CountM, + size_t CountK, + int32_t* RowSumVector, + int16_t offb + ); + +typedef MLAS_GEMM_U8U8_COPY_PACKA_ROUTINE* PMLAS_GEMM_U8U8_COPY_PACKA_ROUTINE; + +typedef +void +(MLASCALL MLAS_GEMM_U8U8_COPY_PACKB_ROUTINE)( + uint8_t* D, + const uint8_t* B, + size_t ldb, + size_t CountN, + size_t CountK, + int32_t* ColumnSumVector, + int16_t offa + ); + +typedef MLAS_GEMM_U8U8_COPY_PACKB_ROUTINE* PMLAS_GEMM_U8U8_COPY_PACKB_ROUTINE; + +typedef +size_t +(MLASCALL MLAS_GEMM_U8U8_KERNEL)( + const int16_t* A, + const uint8_t* B, + int32_t* C, + size_t PairedCountK, + size_t CountM, + size_t CountN, + size_t ldc, + const int32_t* RowSumVector, + const int32_t* ColumnSumVector, + int32_t DepthValue, + bool ZeroMode + ); + +typedef MLAS_GEMM_U8U8_KERNEL* PMLAS_GEMM_U8U8_KERNEL; + typedef void (MLASCALL MLAS_CONV_FLOAT_KERNEL)( @@ -291,6 +346,19 @@ extern "C" { MLAS_SGEMM_TRANSPOSE_PACKB_BLOCK_ROUTINE MlasSgemmTransposePackB16x4Avx; #endif +#if defined(MLAS_TARGET_AMD64_IX86) + MLAS_GEMM_U8U8_COPY_PACKA_ROUTINE MlasGemmU8U8CopyPackASse; + MLAS_GEMM_U8U8_COPY_PACKB_ROUTINE MlasGemmU8U8CopyPackBSse; + MLAS_GEMM_U8U8_KERNEL MlasGemmU8U8KernelSse; +#if defined(MLAS_TARGET_AMD64) + MLAS_GEMM_U8U8_COPY_PACKA_ROUTINE MlasGemmU8U8CopyPackAAvx2; + MLAS_GEMM_U8U8_COPY_PACKB_ROUTINE MlasGemmU8U8CopyPackBAvx2; + MLAS_GEMM_U8U8_KERNEL MlasGemmU8U8KernelAvx2; + MLAS_GEMM_U8U8_KERNEL MlasGemmU8U8KernelAvx512BW; + MLAS_GEMM_U8U8_KERNEL MlasGemmU8U8KernelAvx512Vnni; +#endif +#endif + #if defined(MLAS_TARGET_AMD64) MLAS_CONV_FLOAT_KERNEL MlasConvNchwFloatKernelSse; MLAS_CONV_FLOAT_KERNEL MlasConvNchwcFloatKernelSse; @@ -406,6 +474,9 @@ struct MLAS_PLATFORM { #if defined(MLAS_TARGET_AMD64_IX86) PMLAS_SGEMM_KERNEL_ROUTINE KernelZeroRoutine; PMLAS_SGEMM_KERNEL_ROUTINE KernelAddRoutine; + PMLAS_GEMM_U8U8_COPY_PACKA_ROUTINE GemmU8U8CopyPackARoutine; + PMLAS_GEMM_U8U8_COPY_PACKB_ROUTINE GemmU8U8CopyPackBRoutine; + PMLAS_GEMM_U8U8_KERNEL GemmU8U8Kernel; #endif #if defined(MLAS_TARGET_AMD64) @@ -423,10 +494,6 @@ struct MLAS_PLATFORM { uint32_t NchwcBlockSize; uint32_t PreferredBufferAlignment; #endif - -#if defined(MLAS_USE_WIN32_THREADPOOL) - int32_t MaximumThreadCount; -#endif }; extern MLAS_PLATFORM MlasPlatform; @@ -462,13 +529,11 @@ MlasGetMaximumThreadCount( MLAS_UNREFERENCED_PARAMETER(ThreadPool); #else if (ThreadPool != nullptr) { - return ThreadPool->NumThreads(); + return ThreadPool->NumThreads() + 1; } #endif -#if defined(MLAS_USE_WIN32_THREADPOOL) - return MlasPlatform.MaximumThreadCount; -#elif _OPENMP +#if defined(_OPENMP) return (omp_get_num_threads() == 1) ? 
omp_get_max_threads() : 1; #else return 1; @@ -495,7 +560,7 @@ MlasGetMaximumThreadCount( #if defined(MLAS_TARGET_ARM) #define MLAS_NEON_INTRINSICS #define MLAS_NEON32_INTRINSICS -#elif defined(MLAS_TARGET_ARM64) || defined(_M_HYBRID_X86_ARM64) +#elif defined(MLAS_TARGET_ARM64) #define MLAS_NEON_INTRINSICS #define MLAS_NEON64_INTRINSICS #elif defined(MLAS_TARGET_AMD64_IX86) diff --git a/onnxruntime/core/mlas/lib/platform.cpp b/onnxruntime/core/mlas/lib/platform.cpp index 4f99d50fb27b0..1d0fdacae19d5 100644 --- a/onnxruntime/core/mlas/lib/platform.cpp +++ b/onnxruntime/core/mlas/lib/platform.cpp @@ -86,6 +86,9 @@ Return Value: this->KernelZeroRoutine = MlasSgemmKernelZeroSse; this->KernelAddRoutine = MlasSgemmKernelAddSse; + this->GemmU8U8CopyPackARoutine = MlasGemmU8U8CopyPackASse; + this->GemmU8U8CopyPackBRoutine = MlasGemmU8U8CopyPackBSse; + this->GemmU8U8Kernel = MlasGemmU8U8KernelSse; #if defined(MLAS_TARGET_AMD64) this->TransposePackB16x4Routine = MlasSgemmTransposePackB16x4Sse; this->ConvNchwFloatKernel = MlasConvNchwFloatKernelSse; @@ -157,6 +160,10 @@ Return Value: if (((Cpuid1[2] & 0x1000) != 0) && ((Cpuid7[1] & 0x20) != 0)) { + this->GemmU8U8CopyPackARoutine = MlasGemmU8U8CopyPackAAvx2; + this->GemmU8U8CopyPackBRoutine = MlasGemmU8U8CopyPackBAvx2; + this->GemmU8U8Kernel = MlasGemmU8U8KernelAvx2; + if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) { this->KernelZeroRoutine = MlasSgemmKernelZeroAvx512F; @@ -171,6 +178,23 @@ Return Value: this->NchwcBlockSize = 16; this->PreferredBufferAlignment = 64; + // + // Check if the processor supports AVX512BW. + // + + if ((Cpuid7[1] & 0x40000000) != 0) { + + this->GemmU8U8Kernel = MlasGemmU8U8KernelAvx512BW; + + // + // Check if the processor supports AVX512VNNI. + // + + if ((Cpuid7[2] & 0x800) != 0) { + this->GemmU8U8Kernel = MlasGemmU8U8KernelAvx512Vnni; + } + } + } else { this->KernelZeroRoutine = MlasSgemmKernelZeroFma3; @@ -192,25 +216,6 @@ Return Value: } #endif - -#if defined(MLAS_USE_WIN32_THREADPOOL) - - // - // Retrieve the number of processors in the system. - // - - SYSTEM_INFO SystemInfo; - - GetSystemInfo(&SystemInfo); - - if (SystemInfo.dwNumberOfProcessors <= MLAS_MAXIMUM_THREAD_COUNT) { - this->MaximumThreadCount = int32_t(SystemInfo.dwNumberOfProcessors); - } else { - this->MaximumThreadCount = MLAS_MAXIMUM_THREAD_COUNT; - } - -#endif - } size_t @@ -223,7 +228,7 @@ MlasGetPreferredBufferAlignment( Routine Description: This routine returns the preferred byte alignment for buffers that are used - with this library. Buffers that are not bytes aligned to this value will + with this library. Buffers that are not byte aligned to this value will function, but will not achieve best performance. Arguments: diff --git a/onnxruntime/core/mlas/lib/qgemm.cpp b/onnxruntime/core/mlas/lib/qgemm.cpp new file mode 100644 index 0000000000000..c5d07da984769 --- /dev/null +++ b/onnxruntime/core/mlas/lib/qgemm.cpp @@ -0,0 +1,599 @@ +/*++ + +Copyright (c) Microsoft Corporation. All rights reserved. + +Licensed under the MIT License. + +Module Name: + + qgemm.cpp + +Abstract: + + This module implements the quantized integer matrix/matrix multiply + operation (QGEMM). + +--*/ + +#include "mlasi.h" + +// +// Define the default strides to step through slices of the input matrices. 
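+// The strides size the on-stack packed panels (PanelA and PanelB in MlasQgemm
+// below); a K stride of 128 also stays within the CountK limit of 256 imposed
+// by the 16-bit accumulators in the copy routines.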
+// + +#define MLAS_GEMM_U8U8_STRIDEM 12 +#define MLAS_GEMM_U8U8_STRIDEN 128 +#define MLAS_GEMM_U8U8_STRIDEK 128 + +#ifdef MLAS_TARGET_AMD64_IX86 + +void +MLASCALL +MlasGemmU8U8CopyPackASse( + int16_t* D, + const uint8_t* A, + size_t lda, + size_t CountM, + size_t CountK, + int32_t* RowSumVector, + int16_t offb + ) +/*++ + +Routine Description: + + This routine copies elements from the source matrix to the destination + packed buffer. + +Arguments: + + D - Supplies the address of the destination packed buffer. + + A - Supplies the address of the source matrix. + + lda - Supplies the number of elements per row of the source matrix. + + CountM - Supplies the number of rows of the source matrix to copy. + + CountK - Supplies the number of columns of the source matrix to copy. + + RowSumVector - Supplies the address of the buffer to receive the sums of + the elements from each of the rows. Each sum has also been multiplied + by the zero point offset. + + offb - Supplies the zero point offset for the other source matrix of the + matrix multiplication. + +Return Value: + + None. + +--*/ +{ + const __m128i ZeroVector = _mm_setzero_si128(); + const __m128i OffsetBroadcast = _mm_set1_epi16(offb); + uint8_t PaddedMatrixAData[8] = { 0 }; + + // + // Process a single row of matrix A in a loop. + // + + while (CountM > 0) { + + const uint8_t* a = A; + size_t k = CountK; + __m128i RowSum = ZeroVector; + + // + // Zero extend the source bytes to 16-bits and write to the packed + // buffer. The packed buffer has the same data ordering as the source + // bytes, but the stride is CountK aligned up to a multiple of 8 + // values. + // + // These 16-bit values are also accumulated into an intermediate per-row + // accumulator. CountK cannot be greater than 256 to avoid overflowing + // these 16-bit accumulators. + // + + while (k >= 8) { + + __m128i Bytes = _mm_loadl_epi64((__m128i*)&a[0]); + __m128i Words = _mm_unpacklo_epi8(Bytes, ZeroVector); + + RowSum = _mm_add_epi16(RowSum, Words); + + _mm_storeu_si128((__m128i*)&D[0], Words); + + D += 8; + a += 8; + k -= 8; + } + + if (k > 0) { + + // + // Copy the remaining bytes to the zero padded stack buffer. + // + + uint8_t* padded = PaddedMatrixAData; + uint8_t* padded_end = padded + k; + + do { + padded[0] = a[0]; + padded++; + a++; + } while (padded < padded_end); + + __m128i Bytes = _mm_loadl_epi64((__m128i*)PaddedMatrixAData); + __m128i Words = _mm_unpacklo_epi8(Bytes, ZeroVector); + + RowSum = _mm_add_epi16(RowSum, Words); + + // + // Copy the 16-bit pairs from the vector to the destination packed + // buffer. Rotate the vector at each iteration. + // + + for (size_t pairs = (k + 1) / 2; pairs > 0; pairs--) { + *((int32_t*)D) = _mm_cvtsi128_si32(Words); + D += 2; + Words = _mm_shuffle_epi32(Words, _MM_SHUFFLE(0, 3, 2, 1)); + } + } + + // + // Reduce the sum for the single row of output. + // + + RowSum = _mm_madd_epi16(RowSum, OffsetBroadcast); + RowSum = _mm_add_epi32(RowSum, _mm_shuffle_epi32(RowSum, _MM_SHUFFLE(3, 2, 3, 2))); + RowSum = _mm_add_epi32(RowSum, _mm_shuffle_epi32(RowSum, _MM_SHUFFLE(0, 1, 0, 1))); + + *RowSumVector++ = _mm_cvtsi128_si32(RowSum); + + A += lda; + CountM -= 1; + } +} + +void +MLASCALL +MlasGemmU8U8CopyPackBSse( + uint8_t* D, + const uint8_t* B, + size_t ldb, + size_t CountN, + size_t CountK, + int32_t* ColumnSumVector, + int16_t offa + ) +/*++ + +Routine Description: + + This routine copies elements from the source matrix to the destination + packed buffer. 
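+ Pairs of source rows are interleaved in the packed buffer so the compute
+ kernel can zero extend the byte pairs and feed them to _mm_madd_epi16.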
+ +Arguments: + + D (rcx) - Supplies the address of the destination packed buffer. + + B (rdx) - Supplies the address of the source matrix. + + ldb (r8) - Supplies the number of elements per row of the source matrix. + + CountN (r9) - Supplies the number of columns of the source matrix to copy. + + CountK - Supplies the number of rows of the source matrix to copy. + + ColumnSumVector - Supplies the address of the buffer to receive the sums of + the elements from each of the columns. Each sum has also been multiplied + by the zero point offset. + + offa - Supplies the zero point offset for the other source matrix of the + matrix multiplication. + +Return Value: + + None. + +--*/ +{ + const __m128i ZeroVector = _mm_setzero_si128(); + const __m128i OffsetBroadcast = _mm_set1_epi16(offa); + uint8_t PaddedMatrixBData[16] = { 0 }; + + // + // Process 8 columns of matrix B in a loop. + // + + while (CountN >= 8) { + + const uint8_t* b = B; + size_t k = CountK; + __m128i ColumnSum0 = ZeroVector; + __m128i ColumnSum1 = ZeroVector; + + while (k >= 2) { + + __m128i BytesRow0 = _mm_loadl_epi64((__m128i*)&b[0]); + __m128i BytesRow1 = _mm_loadl_epi64((__m128i*)&b[ldb]); + __m128i BytesInterleaved = _mm_unpacklo_epi8(BytesRow0, BytesRow1); + + _mm_storeu_si128((__m128i*)&D[0], BytesInterleaved); + + ColumnSum0 = _mm_add_epi16(ColumnSum0, _mm_unpacklo_epi8(BytesInterleaved, ZeroVector)); + ColumnSum1 = _mm_add_epi16(ColumnSum1, _mm_unpackhi_epi8(BytesInterleaved, ZeroVector)); + + b += ldb * 2; + D += 16; + k -= 2; + } + + if (k > 0) { + + __m128i BytesRow0 = _mm_loadl_epi64((__m128i*)&b[0]); + __m128i BytesInterleaved = _mm_unpacklo_epi8(BytesRow0, ZeroVector); + + _mm_storeu_si128((__m128i*)&D[0], BytesInterleaved); + + ColumnSum0 = _mm_add_epi16(ColumnSum0, _mm_unpacklo_epi8(BytesInterleaved, ZeroVector)); + ColumnSum1 = _mm_add_epi16(ColumnSum1, _mm_unpackhi_epi8(BytesInterleaved, ZeroVector)); + + b += ldb * 2; + D += 16; + } + + ColumnSum0 = _mm_madd_epi16(ColumnSum0, OffsetBroadcast); + ColumnSum1 = _mm_madd_epi16(ColumnSum1, OffsetBroadcast); + + _mm_storeu_si128((__m128i*)&ColumnSumVector[0], ColumnSum0); + _mm_storeu_si128((__m128i*)&ColumnSumVector[4], ColumnSum1); + + ColumnSumVector += 8; + + B += 8; + CountN -= 8; + } + + // + // Process the remaining columns of matrix B. + // + + if (CountN > 0) { + + const uint8_t* b = B; + size_t k = CountK; + __m128i ColumnSum0 = ZeroVector; + __m128i ColumnSum1 = ZeroVector; + + while (k >= 2) { + + // + // Copy the remaining columns to the zero padded stack buffer. + // + + const uint8_t* bcopy = b; + uint8_t* padded = PaddedMatrixBData; + uint8_t* padded_end = padded + CountN; + + do { + padded[0] = bcopy[0]; + padded[8] = bcopy[ldb]; + padded++; + bcopy++; + } while (padded < padded_end); + + __m128i BytesRow0 = _mm_loadl_epi64((__m128i*)&PaddedMatrixBData[0]); + __m128i BytesRow1 = _mm_loadl_epi64((__m128i*)&PaddedMatrixBData[8]); + __m128i BytesInterleaved = _mm_unpacklo_epi8(BytesRow0, BytesRow1); + + _mm_storeu_si128((__m128i*)&D[0], BytesInterleaved); + + ColumnSum0 = _mm_add_epi16(ColumnSum0, _mm_unpacklo_epi8(BytesInterleaved, ZeroVector)); + ColumnSum1 = _mm_add_epi16(ColumnSum1, _mm_unpackhi_epi8(BytesInterleaved, ZeroVector)); + + b += ldb * 2; + D += 16; + k -= 2; + } + + if (k > 0) { + + // + // Copy the remaining columns to the zero padded stack buffer. 
+ // + + const uint8_t* bcopy = b; + uint8_t* padded = PaddedMatrixBData; + uint8_t* padded_end = padded + CountN; + + do { + padded[0] = bcopy[0]; + padded++; + bcopy++; + } while (padded < padded_end); + + __m128i BytesRow0 = _mm_loadl_epi64((__m128i*)&PaddedMatrixBData[0]); + __m128i BytesInterleaved = _mm_unpacklo_epi8(BytesRow0, ZeroVector); + + _mm_storeu_si128((__m128i*)&D[0], BytesInterleaved); + + ColumnSum0 = _mm_add_epi16(ColumnSum0, _mm_unpacklo_epi8(BytesInterleaved, ZeroVector)); + ColumnSum1 = _mm_add_epi16(ColumnSum1, _mm_unpackhi_epi8(BytesInterleaved, ZeroVector)); + } + + ColumnSum0 = _mm_madd_epi16(ColumnSum0, OffsetBroadcast); + ColumnSum1 = _mm_madd_epi16(ColumnSum1, OffsetBroadcast); + + _mm_storeu_si128((__m128i*)&ColumnSumVector[0], ColumnSum0); + _mm_storeu_si128((__m128i*)&ColumnSumVector[4], ColumnSum1); + } +} + +size_t +MLASCALL +MlasGemmU8U8KernelSse( + const int16_t* A, + const uint8_t* B, + int32_t* C, + size_t PairedCountK, + size_t CountM, + size_t CountN, + size_t ldc, + const int32_t* RowSumVector, + const int32_t* ColumnSumVector, + int32_t DepthValue, + bool ZeroMode + ) +/*++ + +Routine Description: + + This routine is an inner kernel to compute matrix multiplication for a + set of rows. + +Arguments: + + A - Supplies the address of matrix A. The matrix data has been packed + using MlasGemmU8U8CopyPackASse. + + B - Supplies the address of matrix B. The matrix data has been packed + using MlasGemmU8U8CopyPackBSse. + + C - Supplies the address of matrix C. + + PairedCountK - Supplies the number of paired columns from matrix A and + the number of paired rows from matrix B to iterate over. + + CountM - Supplies the maximum number of rows that can be processed for + matrix A and matrix C. The actual number of rows handled for this + invocation depends on the kernel implementation. + + CountN - Supplies the number of columns from matrix B and matrix C to iterate + over. + + ldc - Supplies the first dimension of matrix C. + + RowSumVector - Supplies the sum of each row from matrix A multiplied by the + zero point offset of matrix B. These values are accumulated into every + row of matrix C. + + ColumnSumVector - Supplies the sum of each column from matrix B multiplied + by the zero point offset of matrix A. These values are accumulated into + every column of matrix C. + + DepthValue - Supplies the value CountK multiplied by the zero point offset + of matrixA multplied by the zero point offset of matrix B. This value is + accumulated into every element of matrix C. + + ZeroMode - Supplies true if the output matrix must be zero initialized, + else false if the output matrix is accumulated into. + +Return Value: + + Returns the number of rows handled. + +--*/ +{ + const __m128i ZeroVector = _mm_setzero_si128(); + + MLAS_UNREFERENCED_PARAMETER(CountM); + MLAS_UNREFERENCED_PARAMETER(ldc); + + while (CountN > 0) { + + // + // Initialize the accumulators with the sum of the global depth value + // constant, the column sums, and the row sums. 
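+ //
+ // For reference, with zero points offa (matrix A) and offb (matrix B):
+ //
+ //   sum((A[m][k] - offa) * (B[k][n] - offb))
+ //     = sum(A[m][k] * B[k][n])
+ //       + RowSumVector[m]      (row sum of A times -offb)
+ //       + ColumnSumVector[n]   (column sum of B times -offa)
+ //       + DepthValue           (CountK * offa * offb)
+ //
+ // so only the plain u8 by u8 dot products remain for the loop below.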
+ // + + __m128i Accumulator0 = _mm_set1_epi32(DepthValue); + Accumulator0 = _mm_add_epi32(Accumulator0, _mm_set1_epi32(RowSumVector[0])); + __m128i Accumulator1 = Accumulator0; + Accumulator0 = _mm_add_epi32(Accumulator0, _mm_loadu_si128((__m128i*)&ColumnSumVector[0])); + Accumulator1 = _mm_add_epi32(Accumulator1, _mm_loadu_si128((__m128i*)&ColumnSumVector[4])); + ColumnSumVector += 8; + + // + // Broadcast each pair of 16-bit values from the matrix A and multiply + // with the zero-extended pair of 16-bit values from matrix B, and add + // the 32-bit intermediate into the accumulator registers. + // + + const int16_t* a = A; + size_t k = PairedCountK; + + while (k > 0) { + + __m128i AElements0 = _mm_set1_epi32(*((int32_t*)a)); + __m128i BElements0 = _mm_loadu_si128((__m128i*)&B[0]); + + __m128i Intermediate0 = _mm_unpacklo_epi8(BElements0, ZeroVector); + __m128i Intermediate1 = _mm_unpackhi_epi8(BElements0, ZeroVector); + + Intermediate0 = _mm_madd_epi16(Intermediate0, AElements0); + Intermediate1 = _mm_madd_epi16(Intermediate1, AElements0); + + Accumulator0 = _mm_add_epi32(Accumulator0, Intermediate0); + Accumulator1 = _mm_add_epi32(Accumulator1, Intermediate1); + + a += 2; + B += 16; + k -= 1; + } + + // + // Output the accumulator block after optionally accumulating the values + // from matrix C. + // + + if (CountN >= 8) { + + if (!ZeroMode) { + Accumulator0 = _mm_add_epi32(Accumulator0, _mm_loadu_si128((__m128i*)&C[0])); + Accumulator1 = _mm_add_epi32(Accumulator1, _mm_loadu_si128((__m128i*)&C[4])); + } + + _mm_storeu_si128((__m128i*)&C[0], Accumulator0); + _mm_storeu_si128((__m128i*)&C[4], Accumulator1); + + C += 8; + CountN -= 8; + + } else { + + // + // Output the remaining partial output block. + // + + if ((CountN & 4) != 0) { + + if (!ZeroMode) { + Accumulator0 = _mm_add_epi32(Accumulator0, _mm_loadu_si128((__m128i*)&C[0])); + } + + _mm_storeu_si128((__m128i*)&C[0], Accumulator0); + C += 4; + + Accumulator0 = Accumulator1; + } + + if ((CountN & 2) != 0) { + + if (!ZeroMode) { + Accumulator0 = _mm_add_epi32(Accumulator0, _mm_loadl_epi64((__m128i*)&C[0])); + } + + _mm_storel_epi64((__m128i*)&C[0], Accumulator0); + C += 2; + + Accumulator0 = _mm_shuffle_epi32(Accumulator0, _MM_SHUFFLE(1, 0, 3, 2)); + } + + if ((CountN & 1) != 0) { + + int32_t AccumulatorValue = _mm_cvtsi128_si32(Accumulator0); + + if (!ZeroMode) { + AccumulatorValue += C[0]; + } + + C[0] = AccumulatorValue; + } + + break; + } + } + + return 1; +} + +void +MLASCALL +MlasQgemm( + size_t M, + size_t N, + size_t K, + const uint8_t* A, + size_t lda, + uint8_t offa, + const uint8_t* B, + size_t ldb, + uint8_t offb, + int32_t* C, + size_t ldc, + MLAS_THREADPOOL* ThreadPool + ) +{ + MLAS_DECLSPEC_ALIGN(int16_t PanelA[MLAS_GEMM_U8U8_STRIDEM * MLAS_GEMM_U8U8_STRIDEK], 64); + MLAS_DECLSPEC_ALIGN(uint8_t PanelB[MLAS_GEMM_U8U8_STRIDEN * MLAS_GEMM_U8U8_STRIDEK], 64); + + MLAS_DECLSPEC_ALIGN(int32_t RowSumVector[MLAS_GEMM_U8U8_STRIDEM], 16); + MLAS_DECLSPEC_ALIGN(int32_t ColumnSumVector[MLAS_GEMM_U8U8_STRIDEN], 16); + + size_t StrideM = MLAS_GEMM_U8U8_STRIDEM; + size_t StrideN = MLAS_GEMM_U8U8_STRIDEN; + size_t StrideK = MLAS_GEMM_U8U8_STRIDEK; + + MLAS_UNREFERENCED_PARAMETER(ThreadPool); + + size_t CountK; + + for (size_t k = 0; k < K; k += CountK) { + + CountK = StrideK; + + if (CountK > (K - k)) { + CountK = K - k; + } + + size_t CountN; + + for (size_t n = 0; n < N; n += CountN) { + + CountN = StrideN; + + if (CountN > (N - n)) { + CountN = N - n; + } + + MlasPlatform.GemmU8U8CopyPackBRoutine(PanelB, B + n + k * ldb, ldb, 
CountN, CountK, ColumnSumVector, -int16_t(offa)); + + size_t CountM; + + for (size_t m = 0; m < M; m += CountM) { + + CountM = StrideM; + + if (CountM > (M - m)) { + CountM = M - m; + } + + MlasPlatform.GemmU8U8CopyPackARoutine(PanelA, A + k + m * lda, lda, CountM, CountK, RowSumVector, -int16_t(offb)); + + int16_t* pa = PanelA; + int32_t* c = C + n + m * ldc; + + int32_t* RowSums = RowSumVector; + + size_t RowsRemaining = CountM; + size_t RowsHandled; + + size_t PairedCountK = (CountK + 1) / 2; + + while (RowsRemaining > 0) { + + RowsHandled = MlasPlatform.GemmU8U8Kernel(pa, PanelB, c, PairedCountK, RowsRemaining, CountN, ldc, RowSums, ColumnSumVector, int32_t(CountK) * offa * offb, k == 0); + + RowsRemaining -= RowsHandled; + c += ldc * RowsHandled; + pa += 2 * PairedCountK * RowsHandled; + RowSums += RowsHandled; + } + } + } + } +} + +#endif diff --git a/onnxruntime/core/mlas/lib/sgemm.cpp b/onnxruntime/core/mlas/lib/sgemm.cpp index f436f17e6cd9c..5250c84487b93 100644 --- a/onnxruntime/core/mlas/lib/sgemm.cpp +++ b/onnxruntime/core/mlas/lib/sgemm.cpp @@ -55,7 +55,7 @@ struct MLAS_SGEMM_WORK_BLOCK { // Stores a vector to build a conditional load/store mask for vmaskmovps. // -extern "C" MLAS_DECLSPEC_ALIGN(const uint32_t MlasMaskMoveAvx[8], 8 * sizeof(float)) = { 0, 1, 2, 3, 4, 5, 6, 7 }; +MLAS_INTERNAL_DATA MLAS_DECLSPEC_ALIGN(const uint32_t MlasMaskMoveAvx[8], 8 * sizeof(float)) = { 0, 1, 2, 3, 4, 5, 6, 7 }; #endif diff --git a/onnxruntime/core/mlas/lib/tanh.cpp b/onnxruntime/core/mlas/lib/tanh.cpp index 430afdf60e225..2fbeaef3d9815 100644 --- a/onnxruntime/core/mlas/lib/tanh.cpp +++ b/onnxruntime/core/mlas/lib/tanh.cpp @@ -26,7 +26,7 @@ Module Name: // Bundles the floating point constants for use by kernels written in assembly. // -extern "C" const struct { +MLAS_INTERNAL_DATA const struct { float LowerRange; float UpperRange; float alpha_13; diff --git a/onnxruntime/core/mlas/lib/threading.cpp b/onnxruntime/core/mlas/lib/threading.cpp index 858b72722e8bc..ef30de9499bb2 100644 --- a/onnxruntime/core/mlas/lib/threading.cpp +++ b/onnxruntime/core/mlas/lib/threading.cpp @@ -16,59 +16,6 @@ Module Name: #include "mlasi.h" -#if defined(MLAS_USE_WIN32_THREADPOOL) - -// -// Define the parameters to execute threaded work using the Windows thread pool -// library. -// - -struct MLAS_THREADED_WORK_BLOCK { - volatile LONG Counter; - PMLAS_THREADED_ROUTINE ThreadedRoutine; - void* Context; -}; - -void -CALLBACK -MlasThreadedWorkCallback( - PTP_CALLBACK_INSTANCE Instance, - void* Context, - PTP_WORK WorkObject - ) -/*++ - -Routine Description: - - This routine is invoked from a worker thread to execute one iteration of a - batch of threaded work. - -Arguments: - - Instance - Supplies the callback instance object. - - Context - Supplies the pointer to the parameters for the operation. - - WorkObject - Supplies the threadpool work object. - -Return Value: - - None. - ---*/ -{ - MLAS_UNREFERENCED_PARAMETER(Instance); - MLAS_UNREFERENCED_PARAMETER(WorkObject); - - MLAS_THREADED_WORK_BLOCK* WorkBlock = (MLAS_THREADED_WORK_BLOCK*)Context; - - LONG Index = InterlockedIncrement(&WorkBlock->Counter) - 1; - - WorkBlock->ThreadedRoutine(WorkBlock->Context, Index); -} - -#endif - void MlasExecuteThreaded( MLAS_THREADED_ROUTINE ThreadedRoutine, @@ -99,48 +46,11 @@ MlasExecuteThreaded( } #endif -#if defined(MLAS_USE_WIN32_THREADPOOL) // - // Schedule the threaded iterations using a work object. + // Fallback to OpenMP or a serialized implementation. 
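[Editor's note] For context on the "Fallback to OpenMP or a serialized implementation" comment above: with the Win32 thread pool path removed, the dispatch reduces to something like the hedged sketch below. The OpenMP guard and exact structure are assumptions, not taken from this patch; it only illustrates the fallback behavior the comment names.

    #include <cstdint>

    typedef void (THREADED_ROUTINE_SKETCH)(void* Context, int32_t Index);

    // Run every iteration either via an OpenMP parallel loop or serially on the
    // calling thread.
    void ExecuteThreadedSketch(THREADED_ROUTINE_SKETCH* ThreadedRoutine,
                               void* Context, int32_t Iterations)
    {
    #if defined(_OPENMP)
        #pragma omp parallel for
        for (int32_t tid = 0; tid < Iterations; tid++) {
            ThreadedRoutine(Context, tid);
        }
    #else
        for (int32_t tid = 0; tid < Iterations; tid++) {
            ThreadedRoutine(Context, tid);
        }
    #endif
    }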
// - MLAS_THREADED_WORK_BLOCK WorkBlock; - - PTP_WORK WorkObject = CreateThreadpoolWork(MlasThreadedWorkCallback, &WorkBlock, nullptr); - - if (WorkObject != nullptr) { - - WorkBlock.Counter = 0; - WorkBlock.ThreadedRoutine = ThreadedRoutine; - WorkBlock.Context = Context; - - for (int32_t tid = 1; tid < Iterations; tid++) { - SubmitThreadpoolWork(WorkObject); - } - - // - // Execute the remaining iteration on this thread. - // - - ThreadedRoutine(Context, Iterations - 1); - - // - // Wait for the work object callbacks to complete. - // - - WaitForThreadpoolWorkCallbacks(WorkObject, FALSE); - CloseThreadpoolWork(WorkObject); - - return; - } - - // - // Fallback to a serialized implementation. - // - -#endif - // // Execute the routine for the specified number of iterations. // diff --git a/onnxruntime/core/mlas/lib/x86_64/AssembleAvx512Vnni.h b/onnxruntime/core/mlas/lib/x86_64/AssembleAvx512Vnni.h new file mode 100644 index 0000000000000..bd3112bd9ccd9 --- /dev/null +++ b/onnxruntime/core/mlas/lib/x86_64/AssembleAvx512Vnni.h @@ -0,0 +1,238 @@ +/*++ + +Copyright (c) Microsoft Corporation. All rights reserved. + +Licensed under the MIT License. + +Module Name: + + AssembleAvx512Vnni.h + +Abstract: + + This module contains macros to build VNNI instructions for toolchains that + do not natively support this newer instruction set extension. + +--*/ + +// +// Map friendly register names to the encoded register index. +// + + .equ .LZmmIndex_zmm0, 0 + .equ .LZmmIndex_zmm1, 1 + .equ .LZmmIndex_zmm2, 2 + .equ .LZmmIndex_zmm3, 3 + .equ .LZmmIndex_zmm4, 4 + .equ .LZmmIndex_zmm5, 5 + .equ .LZmmIndex_zmm6, 6 + .equ .LZmmIndex_zmm7, 7 + .equ .LZmmIndex_zmm8, 8 + .equ .LZmmIndex_zmm9, 9 + .equ .LZmmIndex_zmm10, 10 + .equ .LZmmIndex_zmm11, 11 + .equ .LZmmIndex_zmm12, 12 + .equ .LZmmIndex_zmm13, 13 + .equ .LZmmIndex_zmm14, 14 + .equ .LZmmIndex_zmm15, 15 + .equ .LZmmIndex_zmm16, 16 + .equ .LZmmIndex_zmm17, 17 + .equ .LZmmIndex_zmm18, 18 + .equ .LZmmIndex_zmm19, 19 + .equ .LZmmIndex_zmm20, 20 + .equ .LZmmIndex_zmm21, 21 + .equ .LZmmIndex_zmm22, 22 + .equ .LZmmIndex_zmm23, 23 + .equ .LZmmIndex_zmm24, 24 + .equ .LZmmIndex_zmm25, 25 + .equ .LZmmIndex_zmm26, 26 + .equ .LZmmIndex_zmm27, 27 + .equ .LZmmIndex_zmm28, 28 + .equ .LZmmIndex_zmm29, 29 + .equ .LZmmIndex_zmm30, 30 + .equ .LZmmIndex_zmm31, 31 + + .equ .LGprIndex_rax, 0 + .equ .LGprIndex_rcx, 1 + .equ .LGprIndex_rdx, 2 + .equ .LGprIndex_rbx, 3 + .equ .LGprIndex_rbp, 5 + .equ .LGprIndex_rsi, 6 + .equ .LGprIndex_rdi, 7 + .equ .LGprIndex_r8, 8 + .equ .LGprIndex_r9, 9 + .equ .LGprIndex_r10, 10 + .equ .LGprIndex_r11, 11 + .equ .LGprIndex_r12, 12 + .equ .LGprIndex_r13, 13 + .equ .LGprIndex_r14, 14 + .equ .LGprIndex_r15, 15 + +/*++ + +Macro Description: + + This macro builds a VNNI instruction of the form: + + instr zmm1,zmm2,zmm3 + +Arguments: + + Opcode - Specifies the opcode for the VNNI instruction. + + DestReg - Specifies the destination register. + + Src1Reg - Specifies the first source register. + + Src2Reg - Specifies the second source register. 
+ +--*/ + + .macro VnniZmmZmmZmm Opcode, DestReg, Src1Reg, Src2Reg + + .set Payload0, 0x02 # "0F 38" prefix + .set Payload0, Payload0 + ((((.LZmmIndex_\DestReg\() >> 3) & 1) ^ 1) << 7) + .set Payload0, Payload0 + ((((.LZmmIndex_\Src2Reg\() >> 4) & 1) ^ 1) << 6) + .set Payload0, Payload0 + ((((.LZmmIndex_\Src2Reg\() >> 3) & 1) ^ 1) << 5) + .set Payload0, Payload0 + ((((.LZmmIndex_\DestReg\() >> 4) & 1) ^ 1) << 4) + + .set Payload1, 0x05 # "66" prefix + .set Payload1, Payload1 + (((.LZmmIndex_\Src1Reg\() & 15) ^ 15) << 3) + + .set Payload2, 0x40 # 512-bit vector length + .set Payload2, Payload2 + ((((.LZmmIndex_\Src1Reg\() >> 4) & 1) ^ 1) << 3) + + .set ModRMByte, 0xC0 # register form + .set ModRMByte, ModRMByte + ((.LZmmIndex_\DestReg\() & 7) << 3) + .set ModRMByte, ModRMByte + (.LZmmIndex_\Src2Reg\() & 7) + + .byte 0x62, Payload0, Payload1, Payload2, \Opcode\(), ModRMByte + + .endm + + .macro VpdpbusdZmmZmmZmm DestReg, Src1Reg, Src2Reg + + VnniZmmZmmZmm 0x50, \DestReg\(), \Src1Reg\(), \Src2Reg\() + + .endm + + .macro VpdpbusdsZmmZmmZmm DestReg, Src1Reg, Src2Reg + + VnniZmmZmmZmm 0x51, \DestReg\(), \Src1Reg\(), \Src2Reg\() + + .endm + + .macro VpdpwssdZmmZmmZmm DestReg, Src1Reg, Src2Reg + + VnniZmmZmmZmm 0x52, \DestReg\(), \Src1Reg\(), \Src2Reg\() + + .endm + + .macro VpdpwssdsZmmZmmZmm DestReg, Src1Reg, Src2Reg + + VnniZmmZmmZmm 0x53, \DestReg\(), \Src1Reg\(), \Src2Reg\() + + .endm + +/*++ + +Macro Description: + + This macro builds a VNNI instruction of the form: + + instr zmm1,zmm2,DWORD PTR [BaseReg+IndexReg*Scale]{1to16} + +Arguments: + + Opcode - Specifies the opcode for the VNNI instruction. + + DestReg - Specifies the destination register. + + Src1Reg - Specifies the first source register. + + BaseReg - Specifies the base register of the broadcast operand. + + IndexReg - Specifies the optional index register of the broadcast operand. + + Scale - Specifies the scaling factor of the optional index register. 
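[Editor's note] The bit fiddling in VnniZmmZmmZmm above is hand-assembled EVEX encoding. The small C++ mirror below (written only to illustrate the macro's arithmetic, not part of MLAS) reproduces the byte stream; for example, VpdpbusdZmmZmmZmm zmm0,zmm1,zmm2 expands to the bytes 62 F2 75 48 50 C2, which is the standard encoding of vpdpbusd zmm0,zmm1,zmm2.

    #include <cstdint>

    // Byte 0 is the fixed EVEX escape 0x62, followed by the three payload bytes,
    // the opcode, and the register-form ModRM byte.
    struct EvexBytes {
        uint8_t Bytes[6];
    };

    constexpr EvexBytes EncodeVnniRegReg(uint8_t Opcode, unsigned Dest,
                                         unsigned Src1, unsigned Src2)
    {
        uint8_t Payload0 = uint8_t(0x02 |                        // map "0F 38"
                                   ((((Dest >> 3) & 1) ^ 1) << 7) |
                                   ((((Src2 >> 4) & 1) ^ 1) << 6) |
                                   ((((Src2 >> 3) & 1) ^ 1) << 5) |
                                   ((((Dest >> 4) & 1) ^ 1) << 4));
        uint8_t Payload1 = uint8_t(0x05 | (((Src1 & 15) ^ 15) << 3));      // "66" prefix
        uint8_t Payload2 = uint8_t(0x40 | ((((Src1 >> 4) & 1) ^ 1) << 3)); // 512-bit length
        uint8_t ModRM = uint8_t(0xC0 | ((Dest & 7) << 3) | (Src2 & 7));
        return EvexBytes{{0x62, Payload0, Payload1, Payload2, Opcode, ModRM}};
    }

    // Example: vpdpbusd zmm0,zmm1,zmm2 -> 62 F2 75 48 50 C2.
    constexpr EvexBytes VpdpbusdZmm0Zmm1Zmm2 = EncodeVnniRegReg(0x50, 0, 1, 2);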
+ +--*/ + + .macro VnniZmmZmmBroadcast Opcode, DestReg, Src1Reg, BaseReg, IndexReg, Scale + + .set Payload0, 0x02 # "0F 38" prefix + .set Payload0, Payload0 + ((((.LZmmIndex_\DestReg\() >> 3) & 1) ^ 1) << 7) +.ifnes "\IndexReg\()", "" + .set Payload0, Payload0 + ((((.LGprIndex_\IndexReg\() >> 3) & 1) ^ 1) << 6) +.else + .set Payload0, Payload0 + 0x40 # zero logical index register +.endif + .set Payload0, Payload0 + ((((.LGprIndex_\BaseReg\() >> 3) & 1) ^ 1) << 5) + .set Payload0, Payload0 + ((((.LZmmIndex_\DestReg\() >> 4) & 1) ^ 1) << 4) + + .set Payload1, 0x05 # "66" prefix + .set Payload1, Payload1 + (((.LZmmIndex_\Src1Reg\() & 15) ^ 15) << 3) + + .set Payload2, 0x50 # 512-bit vector length, broadcast + .set Payload2, Payload2 + ((((.LZmmIndex_\Src1Reg\() >> 4) & 1) ^ 1) << 3) + + .set ModRMByte, 0x00 # memory form + .set ModRMByte, ModRMByte + ((.LZmmIndex_\DestReg\() & 7) << 3) +.ifnes "\IndexReg\()", "" + .set ModRMByte, ModRMByte + 0x04 # indicate SIB byte needed +.else + .set ModRMByte, ModRMByte + (.LGprIndex_\BaseReg\() & 7) +.endif + +.ifnes "\IndexReg\()", "" + .set SibByte, 0 +.ifeqs "\Scale\()", "2" + .set SibByte, SibByte + (1 << 6) +.else +.ifeqs "\Scale\()", "4" + .set SibByte, SibByte + (2 << 6) +.else +.ifeqs "\Scale\()", "8" + .set SibByte, SibByte + (3 << 6) +.else +.ifnes "\Scale\()", "1" + .err +.endif +.endif +.endif +.endif + .set SibByte, SibByte + ((.LGprIndex_\IndexReg\() & 7) << 3) + .set SibByte, SibByte + (.LGprIndex_\BaseReg\() & 7) +.endif + +.ifnes "\IndexReg\()", "" + .byte 0x62, Payload0, Payload1, Payload2, \Opcode\(), ModRMByte, SibByte +.else + .byte 0x62, Payload0, Payload1, Payload2, \Opcode\(), ModRMByte +.endif + + .endm + + .macro VpdpbusdZmmZmmBroadcast DestReg, Src1Reg, BaseReg, IndexReg, Scale + + VnniZmmZmmBroadcast 0x50, \DestReg\(), \Src1Reg\(), \BaseReg\(), \IndexReg\(), \Scale\() + + .endm + + .macro VpdpbusdsZmmZmmBroadcast DestReg, Src1Reg, BaseReg, IndexReg, Scale + + VnniZmmZmmBroadcast 0x51, \DestReg\(), \Src1Reg\(), \BaseReg\(), \IndexReg\(), \Scale\() + + .endm + + .macro VpdpwssdZmmZmmBroadcast DestReg, Src1Reg, BaseReg, IndexReg, Scale + + VnniZmmZmmBroadcast 0x52, \DestReg\(), \Src1Reg\(), \BaseReg\(), \IndexReg\(), \Scale\() + + .endm + + .macro VpdpwssdsZmmZmmBroadcast DestReg, Src1Reg, BaseReg, IndexReg, Scale + + VnniZmmZmmBroadcast 0x53, \DestReg\(), \Src1Reg\(), \BaseReg\(), \IndexReg\(), \Scale\() + + .endm diff --git a/onnxruntime/core/mlas/lib/x86_64/ErfKernelFma3.S b/onnxruntime/core/mlas/lib/x86_64/ErfKernelFma3.S index 29518fb91119a..92b7976d7db79 100644 --- a/onnxruntime/core/mlas/lib/x86_64/ErfKernelFma3.S +++ b/onnxruntime/core/mlas/lib/x86_64/ErfKernelFma3.S @@ -26,6 +26,7 @@ Abstract: // // Structure layout for the erf constants block. 
// + .equ ErfUpperAbsRange, 0 .equ ErfSplitBoundary, 4 .equ ErfSMALL_P0, 8 @@ -68,7 +69,7 @@ Abstract: .equ ErfBuffer1, 128 .equ ErfKernelFrame_CountN, 256 .equ ErfKernelFrame_ReturnAddress, 256+8 - + /*++ Routine Description: @@ -92,7 +93,7 @@ Return Value: .globl C_UNDERSCORE(MlasErfKernelFma3) C_UNDERSCORE(MlasErfKernelFma3): sub rsp,ErfKernelFrame_ReturnAddress - mov rax,C_UNDERSCORE(MlasErfConstants)@GOTPCREL[rip] + lea rax,C_UNDERSCORE(MlasErfConstants)[rip] sub rdx,8*4 jb .LErfProcessRemainingCount @@ -376,10 +377,9 @@ C_UNDERSCORE(MlasErfKernelFma3): .LErfProcess1x8: mov DWORD PTR ErfKernelFrame_CountN[rsp],edx - mov rcx,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vbroadcastss ymm3,DWORD PTR ErfKernelFrame_CountN[rsp] - vpcmpgtd ymm3,ymm3,YMMWORD PTR [rcx] + vpcmpgtd ymm3,ymm3,YMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] vbroadcastss ymm15,ErfNegZero[rax] vmaskmovps ymm0,ymm3,YMMWORD PTR [rdi] # original input vx0 diff --git a/onnxruntime/core/mlas/lib/x86_64/LogisticKernelFma3.S b/onnxruntime/core/mlas/lib/x86_64/LogisticKernelFma3.S index 8b7b27dcbb0ed..243b355398eb6 100644 --- a/onnxruntime/core/mlas/lib/x86_64/LogisticKernelFma3.S +++ b/onnxruntime/core/mlas/lib/x86_64/LogisticKernelFma3.S @@ -72,7 +72,7 @@ Return Value: .globl C_UNDERSCORE(MlasLogisticKernelFma3) C_UNDERSCORE(MlasLogisticKernelFma3): - mov rax,C_UNDERSCORE(MlasLogisticConstants)@GOTPCREL[rip] + lea rax,C_UNDERSCORE(MlasLogisticConstants)[rip] vbroadcastss ymm4,LogisticConstants_LowerRange[rax] vbroadcastss ymm5,LogisticConstants_UpperRange[rax] vbroadcastss ymm6,LogisticConstants_alpha_9[rax] @@ -120,9 +120,8 @@ C_UNDERSCORE(MlasLogisticKernelFma3): add rdx,8 # correct for over-subtract above jz .LExitKernel mov DWORD PTR LogisticKernelFrame_CountN[rsp],edx - mov rcx,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vbroadcastss ymm2,DWORD PTR LogisticKernelFrame_CountN[rsp] - vpcmpgtd ymm2,ymm2,YMMWORD PTR [rcx] + vpcmpgtd ymm2,ymm2,YMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] vmaskmovps ymm0,ymm2,YMMWORD PTR [rdi] vmaxps ymm0,ymm4,ymm0 # clamp lower bound vminps ymm0,ymm5,ymm0 # clamp upper bound diff --git a/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx2.S b/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx2.S new file mode 100644 index 0000000000000..8837be62c5e2e --- /dev/null +++ b/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx2.S @@ -0,0 +1,1121 @@ +/*++ + +Copyright (c) Microsoft Corporation. All rights reserved. + +Licensed under the MIT License. + +Module Name: + + QgemmU8U8KernelAvx2.s + +Abstract: + + This module implements the kernels for the quantized integer matrix/matrix + multiply operation (QGEMM). + + This implementation uses AVX2 instructions. + +--*/ + +#include "asmmacro.h" + + .intel_syntax noprefix + + .text + +// +// Stack frame layout for the U8U8 CopyPackA routine. +// + + .equ .LGemmU8U8CopyPackAFrame_PaddedMatrixAData, -72 + .equ .LGemmU8U8CopyPackAFrame_mask, -8 + .equ .LGemmU8U8CopyPackAFrame_SavedR13, 0 + .equ .LGemmU8U8CopyPackAFrame_SavedR12, 8 + .equ .LGemmU8U8CopyPackAFrame_SavedRbx, 16 + .equ .LGemmU8U8CopyPackAFrame_SavedRbp, 24 + .equ .LGemmU8U8CopyPackAFrame_ReturnAddress, 32 + .equ .LGemmU8U8CopyPackAFrame_offb, 40 + +// +// Stack frame layout for the U8U8 CopyPackB routine. 
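[Editor's note] The @GOTPCREL loads dropped in the hunks above become unnecessary once the constant tables are no longer exported symbols: a direct RIP-relative lea (or memory operand) resolves at link time without a GOT indirection. The actual definition of MLAS_INTERNAL_DATA is not shown in this patch; the sketch below is only an assumption about the idea behind it.

    #include <cstdint>

    // Hypothetical sketch: give the shared constant table internal (hidden)
    // visibility so position-independent code can address it RIP-relatively.
    #if defined(__GNUC__)
    #define INTERNAL_DATA_SKETCH extern "C" __attribute__((visibility("hidden")))
    #else
    #define INTERNAL_DATA_SKETCH extern "C"
    #endif

    INTERNAL_DATA_SKETCH const uint32_t MaskMoveTableSketch[8] = {0, 1, 2, 3, 4, 5, 6, 7};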
+// + + .equ .LGemmU8U8CopyPackBFrame_PaddedMatrixBData, -40 + .equ .LGemmU8U8CopyPackBFrame_Padding, -8 + .equ .LGemmU8U8CopyPackBFrame_SavedRbx, 0 + .equ .LGemmU8U8CopyPackBFrame_SavedRbp, 8 + .equ .LGemmU8U8CopyPackBFrame_ReturnAddress, 16 + .equ .LGemmU8U8CopyPackBFrame_offa, 24 + +// +// Stack frame layout for the U8U8 kernel. +// + + .equ .LGemmU8U8KernelFrame_mask, -8 + .equ .LGemmU8U8KernelFrame_SavedR14, 0 + .equ .LGemmU8U8KernelFrame_SavedR13, 8 + .equ .LGemmU8U8KernelFrame_SavedR12, 16 + .equ .LGemmU8U8KernelFrame_SavedRbx, 24 + .equ .LGemmU8U8KernelFrame_SavedRbp, 32 + .equ .LGemmU8U8KernelFrame_ReturnAddress, 40 + .equ .LGemmU8U8KernelFrame_ldc, 48 + .equ .LGemmU8U8KernelFrame_RowSumVector, 56 + .equ .LGemmU8U8KernelFrame_ColumnSumVector, 64 + .equ .LGemmU8U8KernelFrame_DepthValue, 72 + .equ .LGemmU8U8KernelFrame_ZeroMode, 80 + +/*++ + +Routine Description: + + This routine copies elements from the source matrix to the destination + packed buffer. + + The kernel expects that elements from matrix A have been zero extended to + 16-bits and padded to a multiple of 32-bits (two pairs of 16-bit values). + The kernel can then efficiently broadcast 32-bits from the packed buffer + and avoid expensive shuffling inside the kernel. + +Arguments: + + D (rdi) - Supplies the address of the destination packed buffer. + + A (rsi) - Supplies the address of the source matrix. + + lda (rdx) - Supplies the number of elements per row of the source matrix. + + CountM (rcx) - Supplies the number of rows of the source matrix to copy. + + CountK (r8) - Supplies the number of columns of the source matrix to copy. + + RowSumVector (r9) - Supplies the address of the buffer to receive the sums + of the elements from each of the rows. Each sum has also been multiplied + by the zero point offset. + + offb - Supplies the zero point offset for the other source matrix of the + matrix multiplication. + +Return Value: + + None. + +--*/ + + .globl C_UNDERSCORE(MlasGemmU8U8CopyPackAAvx2) +C_UNDERSCORE(MlasGemmU8U8CopyPackAAvx2): + + push rbp + push rbx + push r12 + push r13 + + mov r10,rdx + mov r11,rcx + lea r12,[r8+1] + and r12,NOT 1 # align CountK up to pair count + vpbroadcastw xmm8,WORD PTR .LGemmU8U8CopyPackAFrame_offb[rsp] + +// +// Compute the conditional load/store mask for an unaligned CountK. +// + + mov eax,r8d + and eax,15 # isolate unaligned count + inc eax + shr eax,1 # align unaligned count to pair count + mov DWORD PTR .LGemmU8U8CopyPackAFrame_mask[rsp],eax + vpbroadcastd ymm9,DWORD PTR .LGemmU8U8CopyPackAFrame_mask[rsp] + vpcmpgtd ymm9,ymm9,YMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] + +// +// Zero initialize the padded stack buffers. +// + + vpxor xmm0,xmm0,xmm0 + vmovdqu YMMWORD PTR .LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp],ymm0 + vmovdqu YMMWORD PTR .LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp+32],ymm0 + +// +// Process 4 rows of matrix A in a loop. +// +// For each row, zero extend the source bytes to 16-bits and write to the packed +// buffer. The packed buffer has the same data ordering as the source bytes, but +// the stride is CountK aligned up to an even number of 16-bit values. +// +// These 16-bit values are also accumulated into an intermediate per-row +// accumulator. CountK cannot be greater than 256 to avoid overflowing these +// 16-bit accumulators. 
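[Editor's note] The 256-element limit mentioned above is a simple worst-case bound: every packed byte is at most 255, so a block of 256 of them sums to at most 255 * 256 = 65280, which still fits in an unsigned 16-bit lane. A one-line check of that bound (illustrative only):

    #include <cstdint>

    // Worst case for the per-row 16-bit accumulators with CountK <= 256.
    static_assert(255u * 256u <= UINT16_MAX,
                  "a block of 256 unsigned bytes cannot overflow a 16-bit sum");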
+// + + sub r11,4 + jb .LCopyPackA.ProcessRemainingRows + +.LCopyPackA.ProcessNextRowM4: + vpxor xmm0,xmm0,xmm0 # clear row accumulators + vpxor xmm1,xmm1,xmm1 + vpxor xmm2,xmm2,xmm2 + vpxor xmm3,xmm3,xmm3 + mov rdx,rsi + mov rcx,rdi + lea rsi,[rsi+r10*4] # advance next matrix A by 4 rows + lea rdi,[rdi+r12*(2*4)] # advance next matrix D by 4 rows + mov rbx,r8 # reload columns remaining + sub rbx,16 + jb .LCopyPackA.ProcessRemainingColumnsM4 + +.LCopyPackA.ProcessNextColumnLoopM4: + lea rax,[rdx+r10*2] # compute matrix A plus two rows + vpmovzxbw ymm4,XMMWORD PTR [rdx] + vpmovzxbw ymm5,XMMWORD PTR [rdx+r10] + vpmovzxbw ymm6,XMMWORD PTR [rax] + vpmovzxbw ymm7,XMMWORD PTR [rax+r10] + lea rax,[rcx+r12*4] # compute matrix D plus two rows + vmovdqu YMMWORD PTR [rcx],ymm4 + vmovdqu YMMWORD PTR [rcx+r12*2],ymm5 + vmovdqu YMMWORD PTR [rax],ymm6 + vmovdqu YMMWORD PTR [rax+r12*2],ymm7 + vpaddw ymm0,ymm0,ymm4 # accumulate per row along columns + vpaddw ymm1,ymm1,ymm5 + vpaddw ymm2,ymm2,ymm6 + vpaddw ymm3,ymm3,ymm7 + add rdx,16 # advance matrix A by 16 bytes + add rcx,16*2 # advance matrix D by 16 words + sub rbx,16 # subtract columns remaining + jae .LCopyPackA.ProcessNextColumnLoopM4 + +.LCopyPackA.ProcessRemainingColumnsM4: + add rbx,16 # correct for over-subtract above + jz .LCopyPackA.ReduceRowSumVectorM4 + +// +// Copy the unaligned CountK columns to a zero padded stack buffer. +// + + lea rbp,.LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp] + test bl,8 # (CountK & 8) != 0? + jz .LCopyPackA.CopyRemainingCountKLessThan8M4 + lea r13,[rdx+r10*2] # compute matrix A plus two rows + mov rax,QWORD PTR [rdx] + mov QWORD PTR [rbp],rax + mov rax,QWORD PTR [rdx+r10] + mov QWORD PTR [rbp+16],rax + mov rax,QWORD PTR [r13] + mov QWORD PTR [rbp+32],rax + mov rax,QWORD PTR [r13+r10] + mov QWORD PTR [rbp+48],rax + add rdx,8 + add rbp,8 # advance padded buffer destination + +.LCopyPackA.CopyRemainingCountKLessThan8M4: + test bl,4 # (CountK & 4) != 0? + jz .LCopyPackA.CopyRemainingCountKLessThan4M4 + lea r13,[rdx+r10*2] # compute matrix A plus two rows + mov eax,DWORD PTR [rdx] + mov DWORD PTR [rbp],eax + mov eax,DWORD PTR [rdx+r10] + mov DWORD PTR [rbp+16],eax + mov eax,DWORD PTR [r13] + mov DWORD PTR [rbp+32],eax + mov eax,DWORD PTR [r13+r10] + mov DWORD PTR [rbp+48],eax + add rdx,4 + add rbp,4 # advance padded buffer destination + +.LCopyPackA.CopyRemainingCountKLessThan4M4: + test bl,2 # (CountK & 2) != 0? + jz .LCopyPackA.CopyRemainingCountKLessThan2M4 + lea r13,[rdx+r10*2] # compute matrix A plus two rows + movzx eax,WORD PTR [rdx] + mov WORD PTR [rbp],ax + movzx eax,WORD PTR [rdx+r10] + mov WORD PTR [rbp+16],ax + movzx eax,WORD PTR [r13] + mov WORD PTR [rbp+32],ax + movzx eax,WORD PTR [r13+r10] + mov WORD PTR [rbp+48],ax + add rdx,2 + add rbp,2 # advance padded buffer destination + +.LCopyPackA.CopyRemainingCountKLessThan2M4: + test bl,1 # (CountK & 1) != 0? + jz .LCopyPackA.ProcessPaddedMatrixADataM4 + lea r13,[rdx+r10*2] # compute matrix A plus two rows + movzx eax,BYTE PTR [rdx] + mov BYTE PTR [rbp],al + movzx eax,BYTE PTR [rdx+r10] + mov BYTE PTR [rbp+16],al + movzx eax,BYTE PTR [r13] + mov BYTE PTR [rbp+32],al + movzx eax,BYTE PTR [r13+r10] + mov BYTE PTR [rbp+48],al + +// +// Process the remaining CountK columns using the zero padded stack buffer. 
+// + +.LCopyPackA.ProcessPaddedMatrixADataM4: + vpmovzxbw ymm4,XMMWORD PTR .LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp] + vpmovzxbw ymm5,XMMWORD PTR .LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp+16] + vpmovzxbw ymm6,XMMWORD PTR .LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp+32] + vpmovzxbw ymm7,XMMWORD PTR .LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp+48] + lea rax,[rcx+r12*4] # compute matrix D plus two rows + vpmaskmovd YMMWORD PTR [rcx],ymm9,ymm4 + vpmaskmovd YMMWORD PTR [rcx+r12*2],ymm9,ymm5 + vpmaskmovd YMMWORD PTR [rax],ymm9,ymm6 + vpmaskmovd YMMWORD PTR [rax+r12*2],ymm9,ymm7 + vpaddw ymm0,ymm0,ymm4 # accumulate per row along columns + vpaddw ymm1,ymm1,ymm5 + vpaddw ymm2,ymm2,ymm6 + vpaddw ymm3,ymm3,ymm7 + +// +// Reduce the sums for the four rows of output. Transpose the intermediate +// accumulators by treating the registers as 32-bit elements containing a pair +// of 16-bit sums. Continue reducing the transposed accumulators to produce the +// final 32-bit vector output. +// + +.LCopyPackA.ReduceRowSumVectorM4: + vpunpckldq ymm4,ymm0,ymm1 # [A5 B5 A4 B4 A1 B1 A0 B0] + vpunpckhdq ymm5,ymm0,ymm1 # [A7 B7 A6 B6 A3 B3 A2 B2] + vpunpckldq ymm6,ymm2,ymm3 # [C5 D5 C4 D4 C1 D1 C0 D0] + vpunpckhdq ymm7,ymm2,ymm3 # [C7 D7 C6 D6 C3 D3 C2 D2] + vpunpcklqdq ymm0,ymm4,ymm6 # [A4 B4 C4 D4 A0 B0 C0 D0] + vpunpckhqdq ymm1,ymm4,ymm6 # [A5 B5 C5 D5 A1 B1 C1 D1] + vpunpcklqdq ymm2,ymm5,ymm7 # [A6 B6 C6 D6 A2 B2 C2 D2] + vpunpckhqdq ymm3,ymm5,ymm7 # [A7 B7 C7 D7 A3 B3 C3 D3] + vpaddw ymm0,ymm0,ymm1 # reduction + vpaddw ymm0,ymm0,ymm2 + vpaddw ymm0,ymm0,ymm3 + vextracti128 xmm1,ymm0,1 # extract high pairs + vpaddw xmm0,xmm0,xmm1 # reduction + vpmaddwd xmm0,xmm0,xmm8 # multiply by offset and reduce + vmovdqu XMMWORD PTR [r9],xmm0 + add r9,4*4 # advance row sum vector by 4 dwords + sub r11,4 # subtract rows remaining + jae .LCopyPackA.ProcessNextRowM4 + +.LCopyPackA.ProcessRemainingRows: + add r11,4 # correct for over-subtract above + jz .LCopyPackA.ExitRoutine + +// +// Process a single row of matrix A in a loop. +// + +.LCopyPackA.ProcessNextRowM1: + vpxor xmm0,xmm0,xmm0 # clear row accumulator + mov rdx,rsi + mov rcx,rdi + add rsi,r10 + lea rdi,[rdi+r12*2] + mov rbx,r8 # reload columns remaining + sub rbx,16 + jb .LCopyPackA.ProcessRemainingColumnsM1 + +.LCopyPackA.ProcessNextColumnLoopM1: + vpmovzxbw ymm4,XMMWORD PTR [rdx] + vmovdqu YMMWORD PTR [rcx],ymm4 + vpaddw ymm0,ymm0,ymm4 # accumulate per row along columns + add rdx,16 # advance matrix A by 16 bytes + add rcx,16*2 # advance matrix D by 16 words + sub rbx,16 # subtract columns remaining + jae .LCopyPackA.ProcessNextColumnLoopM1 + +.LCopyPackA.ProcessRemainingColumnsM1: + add rbx,16 # correct for over-subtract above + jz .LCopyPackA.ReduceRowSumVectorM1 + +// +// Copy the unaligned CountK columns to a zero padded stack buffer. +// + + lea rbp,.LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp] + test bl,8 # (CountK & 8) != 0? + jz .LCopyPackA.CopyRemainingCountKLessThan8M1 + mov rax,QWORD PTR [rdx] + mov QWORD PTR [rbp],rax + add rdx,8 + add rbp,8 # advance padded buffer destination + +.LCopyPackA.CopyRemainingCountKLessThan8M1: + test bl,4 # (CountK & 4) != 0? + jz .LCopyPackA.CopyRemainingCountKLessThan4M1 + mov eax,DWORD PTR [rdx] + mov DWORD PTR [rbp],eax + add rdx,4 + add rbp,4 # advance padded buffer destination + +.LCopyPackA.CopyRemainingCountKLessThan4M1: + test bl,2 # (CountK & 2) != 0? 
+ jz .LCopyPackA.CopyRemainingCountKLessThan2M1 + movzx eax,WORD PTR [rdx] + mov WORD PTR [rbp],ax + add rdx,2 + add rbp,2 # advance padded buffer destination + +.LCopyPackA.CopyRemainingCountKLessThan2M1: + test bl,1 # (CountK & 1) != 0? + jz .LCopyPackA.ProcessPaddedMatrixADataM1 + movzx eax,BYTE PTR [rdx] + mov BYTE PTR [rbp],al + +// +// Process the remaining CountK columns using the zero padded stack buffer. +// + +.LCopyPackA.ProcessPaddedMatrixADataM1: + vpmovzxbw ymm4,XMMWORD PTR .LGemmU8U8CopyPackAFrame_PaddedMatrixAData[rsp] + vpmaskmovd YMMWORD PTR [rcx],ymm9,ymm4 + vpaddw ymm0,ymm0,ymm4 # accumulate per row along columns + +// +// Reduce the sum for the single row of output. +// + +.LCopyPackA.ReduceRowSumVectorM1: + vextracti128 xmm1,ymm0,1 # extract high pairs + vpaddw xmm0,xmm0,xmm1 # reduction + vphaddw xmm0,xmm0,xmm0 + vphaddw xmm0,xmm0,xmm0 + vpmaddwd xmm0,xmm0,xmm8 # multiply by offset and reduce + vmovd DWORD PTR [r9],xmm0 + add r9,4 # advance row sum vector by 1 DWORD + dec r11 # decrement rows remaining + jnz .LCopyPackA.ProcessNextRowM1 + +// +// Restore non-volatile registers and return. +// + +.LCopyPackA.ExitRoutine: + vzeroupper + + pop r13 + pop r12 + pop rbx + pop rbp + ret + +/*++ + +Routine Description: + + This routine copies elements from the source matrix to the destination + packed buffer. + +Arguments: + + D (rdi) - Supplies the address of the destination packed buffer. + + B (rsi) - Supplies the address of the source matrix. + + ldb (rdx) - Supplies the number of elements per row of the source matrix. + + CountN (rcx) - Supplies the number of columns of the source matrix to copy. + + CountK (r8) - Supplies the number of rows of the source matrix to copy. + + ColumnSumVector (r9) - Supplies the address of the buffer to receive the sums + of the elements from each of the columns. Each sum has also been + multiplied by the zero point offset. + + offa - Supplies the zero point offset for the other source matrix of the + matrix multiplication. + +Return Value: + + None. + +--*/ + + .globl C_UNDERSCORE(MlasGemmU8U8CopyPackBAvx2) +C_UNDERSCORE(MlasGemmU8U8CopyPackBAvx2): + + push rbp + push rbx + + mov r10,rdx + mov r11,rcx + vpbroadcastw ymm5,WORD PTR .LGemmU8U8CopyPackBFrame_offa[rsp] + +// +// Zero initialize the padded stack buffers. +// + + vpxor xmm0,xmm0,xmm0 + vmovdqu YMMWORD PTR .LGemmU8U8CopyPackBFrame_PaddedMatrixBData[rsp],ymm0 + +// +// Process 16 columns of matrix B in a loop. 
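[Editor's note] The copy routine that follows interleaves pairs of rows so that the two bytes sharing a column land next to each other; the kernel can then zero-extend a (k, k+1) byte pair directly into the 16-bit pair that vpmaddwd consumes. The scalar sketch below packs one block of up to 16 columns; it is a hypothetical helper, and unlike the real routine it uses a CountN-wide stride instead of a fixed, zero-padded 16-column stride.

    #include <cstddef>
    #include <cstdint>

    // D receives an interleaved byte pair per column per row pair;
    // ColumnSum[j] receives -offa * sum_k B[k][j].
    void PackBBlockReference(const uint8_t* B, size_t ldb, size_t CountN,
                             size_t CountK, uint8_t* D, int32_t* ColumnSum,
                             uint8_t offa)
    {
        size_t PairedCountK = (CountK + 1) / 2;
        for (size_t j = 0; j < CountN; j++) {
            ColumnSum[j] = 0;
        }
        for (size_t kk = 0; kk < PairedCountK; kk++) {
            for (size_t j = 0; j < CountN; j++) {
                uint8_t b0 = B[(2 * kk) * ldb + j];
                uint8_t b1 = (2 * kk + 1 < CountK) ? B[(2 * kk + 1) * ldb + j] : 0;
                D[(kk * CountN + j) * 2 + 0] = b0;
                D[(kk * CountN + j) * 2 + 1] = b1;
                ColumnSum[j] += int32_t(b0) + int32_t(b1);
            }
        }
        for (size_t j = 0; j < CountN; j++) {
            ColumnSum[j] *= -int32_t(offa);
        }
    }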
+// + + sub r11,16 + jb .LCopyPackB.ProcessRemainingColumns + +.LCopyPackB.ProcessNextColumnN16: + vpxor xmm0,xmm0,xmm0 # clear column accumulators + vpxor xmm1,xmm1,xmm1 + mov rdx,rsi + add rsi,16 # advance next matrix B by 16 columns + mov rbx,r8 # reload rows remaining + sub rbx,2 + jb .LCopyPackB.ProcessRemainingRowsN16 + +.LCopyPackB.ProcessNextRowLoopN16: + vmovdqu xmm2,XMMWORD PTR [rdx] # load two rows + vmovdqu xmm3,XMMWORD PTR [rdx+r10] + lea rdx,[rdx+r10*2] # advance matrix B by two rows + vpunpcklbw xmm4,xmm2,xmm3 # interleave row data + vpunpckhbw xmm3,xmm2,xmm3 + vmovdqu XMMWORD PTR [rdi],xmm4 # store interleaved rows + vmovdqu XMMWORD PTR [rdi+16],xmm3 + vpmovzxbw ymm4,xmm4 + vpmovzxbw ymm3,xmm3 + add rdi,32 # advance matrix D by 32 bytes + vpaddw ymm0,ymm0,ymm4 # accumulate per column + vpaddw ymm1,ymm1,ymm3 + sub rbx,2 # subtract columns remaining + jae .LCopyPackB.ProcessNextRowLoopN16 + +.LCopyPackB.ProcessRemainingRowsN16: + add rbx,2 # correct for over-subtract above + jz .LCopyPackB.ReduceColumnSumVectorN16 + vpmovzxbw ymm4,XMMWORD PTR [rdx] + vmovdqu YMMWORD PTR [rdi],ymm4 # store interleaved rows + vextracti128 xmm3,ymm4,1 + vpmovzxbw ymm4,xmm4 + vpmovzxbw ymm3,xmm3 + vpaddw ymm0,ymm0,ymm4 # accumulate per column + vpaddw ymm1,ymm1,ymm3 + add rdi,32 # advance matrix D by 32 bytes + +.LCopyPackB.ReduceColumnSumVectorN16: + vpmaddwd ymm0,ymm0,ymm5 # multiply by offset and reduce + vpmaddwd ymm1,ymm1,ymm5 # multiply by offset and reduce + vmovdqu YMMWORD PTR [r9],ymm0 + vmovdqu YMMWORD PTR [r9+32],ymm1 + add r9,64 # advance column sum vector by 16 dwords + sub r11,16 # subtract columns remaining + jae .LCopyPackB.ProcessNextColumnN16 + +.LCopyPackB.ProcessRemainingColumns: + add r11,16 # correct for over-subtract above + jnz .LCopyPackB.ProcessColumnNUnaligned + +// +// Restore non-volatile registers and return. +// + +.LCopyPackB.ExitRoutine: + vzeroupper + + pop rbx + pop rbp + ret + +// +// Process the remaining columns of matrix B. +// + +.LCopyPackB.ProcessColumnNUnaligned: + vpxor xmm0,xmm0,xmm0 # clear column accumulators + vpxor xmm1,xmm1,xmm1 + sub r8,2 + jb .LCopyPackB.ProcessRemainingRowsNUnaligned + +.LCopyPackB.ProcessNextRowLoopNUnaligned: + mov rdx,rsi + lea rbp,.LGemmU8U8CopyPackBFrame_PaddedMatrixBData[rsp] + test r11b,8 # (CountN & 8) != 0? + jz .LCopyPackB.CopyRemainingCountNLessThan8K2 + mov rax,QWORD PTR [rdx] + mov QWORD PTR [rbp],rax + mov rax,QWORD PTR [rdx+r10] + mov QWORD PTR [rbp+16],rax + add rdx,8 # advance matrix B + add rbp,8 # advance padded buffer destination + +.LCopyPackB.CopyRemainingCountNLessThan8K2: + test r11b,4 # (CountN & 4) != 0? + jz .LCopyPackB.CopyRemainingCountNLessThan4K2 + mov eax,DWORD PTR [rdx] + mov DWORD PTR [rbp],eax + mov eax,DWORD PTR [rdx+r10] + mov DWORD PTR [rbp+16],eax + add rdx,4 # advance matrix B + add rbp,4 # advance padded buffer destination + +.LCopyPackB.CopyRemainingCountNLessThan4K2: + test r11b,2 # (CountN & 2) != 0? + jz .LCopyPackB.CopyRemainingCountNLessThan2K2 + movzx eax,WORD PTR [rdx] + mov WORD PTR [rbp],ax + movzx eax,WORD PTR [rdx+r10] + mov WORD PTR [rbp+16],ax + add rdx,2 # advance matrix B + add rbp,2 # advance padded buffer destination + +.LCopyPackB.CopyRemainingCountNLessThan2K2: + test r11b,1 # (CountN & 1) != 0? 
+ jz .LCopyPackB.ProcessPaddedMatrixBDataK2 + movzx eax,BYTE PTR [rdx] + mov BYTE PTR [rbp],al + movzx eax,BYTE PTR [rdx+r10] + mov BYTE PTR [rbp+16],al + +.LCopyPackB.ProcessPaddedMatrixBDataK2: + vmovdqu xmm2,XMMWORD PTR .LGemmU8U8CopyPackBFrame_PaddedMatrixBData[rsp] + vmovdqu xmm3,XMMWORD PTR .LGemmU8U8CopyPackBFrame_PaddedMatrixBData[rsp+16] + vpunpcklbw xmm4,xmm2,xmm3 # interleave row data + vpunpckhbw xmm3,xmm2,xmm3 + vmovdqu XMMWORD PTR [rdi],xmm4 # store interleaved rows + vmovdqu XMMWORD PTR [rdi+16],xmm3 + vpmovzxbw ymm4,xmm4 + vpmovzxbw ymm3,xmm3 + vpaddw ymm0,ymm0,ymm4 # accumulate per column + vpaddw ymm1,ymm1,ymm3 + lea rsi,[rsi+r10*2] # advance next matrix B by two rows + add rdi,32 # advance matrix D by 32 bytes + sub r8,2 # subtract columns remaining + jae .LCopyPackB.ProcessNextRowLoopNUnaligned + +.LCopyPackB.ProcessRemainingRowsNUnaligned: + add r8,2 + jz .LCopyPackB.ReduceColumnSumVectorNUnaligned + mov rdx,rsi + lea rbp,.LGemmU8U8CopyPackBFrame_PaddedMatrixBData[rsp] + test r11b,8 # (CountN & 8) != 0? + jz .LCopyPackB.CopyRemainingCountNLessThan8K1 + mov rax,QWORD PTR [rdx] + mov QWORD PTR [rbp],rax + add rdx,8 # advance matrix B + add rbp,8 # advance padded buffer destination + +.LCopyPackB.CopyRemainingCountNLessThan8K1: + test r11b,4 # (CountN & 4) != 0? + jz .LCopyPackB.CopyRemainingCountNLessThan4K1 + mov eax,DWORD PTR [rdx] + mov DWORD PTR [rbp],eax + add rdx,4 # advance matrix B + add rbp,4 # advance padded buffer destination + +.LCopyPackB.CopyRemainingCountNLessThan4K1: + test r11b,2 # (CountN & 2) != 0? + jz .LCopyPackB.CopyRemainingCountNLessThan2K1 + movzx eax,WORD PTR [rdx] + mov WORD PTR [rbp],ax + add rdx,2 # advance matrix B + add rbp,2 # advance padded buffer destination + +.LCopyPackB.CopyRemainingCountNLessThan2K1: + test r11b,1 # (CountN & 1) != 0? + jz .LCopyPackB.ProcessPaddedMatrixBDataK1 + movzx eax,BYTE PTR [rdx] + mov BYTE PTR [rbp],al + +.LCopyPackB.ProcessPaddedMatrixBDataK1: + vpmovzxbw ymm4,XMMWORD PTR .LGemmU8U8CopyPackBFrame_PaddedMatrixBData[rsp] + vmovdqu YMMWORD PTR [rdi],ymm4 # store interleaved rows + vextracti128 xmm3,ymm4,1 + vpmovzxbw ymm4,xmm4 + vpmovzxbw ymm3,xmm3 + vpaddw ymm0,ymm0,ymm4 # accumulate per column + vpaddw ymm1,ymm1,ymm3 + +.LCopyPackB.ReduceColumnSumVectorNUnaligned: + vpmaddwd ymm0,ymm0,ymm5 # multiply by offset and reduce + vpmaddwd ymm1,ymm1,ymm5 # multiply by offset and reduce + vmovdqu YMMWORD PTR [r9],ymm0 + vmovdqu YMMWORD PTR [r9+32],ymm1 + jmp .LCopyPackB.ExitRoutine + +/*++ + +Macro Description: + + This macro generates code to multiply and accumulator a single row of the + output block. + +Arguments: + + ColumnCount - Supplies the number of columns to produce. + + Vec1Reg - Supplies the high block accumulator register (when ColumnCount + is 16). + + Vec2Reg - Supplies the low block accumulator register. + +Implicit Arguments: + + ymm0 - Supplies the first vector loaded from matrix B. + + ymm1 - Supplies the second vector loaded from matrix B (when ColumnCount + is 16). + + ymm2 - Supplies the broadcast value loaded from matrix A. + +--*/ + + .macro MultiplyAccumulateRow ColumnCount, Vec1Reg, Vec2Reg + +.if \ColumnCount\() == 16 + vpmaddwd ymm3,ymm2,ymm0 + vpaddd \Vec1Reg\(),\Vec1Reg\(),ymm3 + vpmaddwd ymm2,ymm2,ymm1 + vpaddd \Vec2Reg\(),\Vec2Reg\(),ymm2 +.else + vpmaddwd ymm3,ymm2,ymm0 + vpaddd \Vec2Reg\(),\Vec2Reg\(),ymm3 +.endif + + .endm + +/*++ + +Macro Description: + + This macro generates code to multiply and accumulate each row of the output + block. 
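[Editor's note] In intrinsics terms, the 8-column arm of MultiplyAccumulateRow above is one vpmaddwd plus one vpaddd: each 32-bit lane picks up a0*b(k,j) + a1*b(k+1,j). An equivalent helper, assuming AVX2 and shown only for illustration:

    #include <immintrin.h>

    // Accumulator += madd(broadcast A pair, zero-extended B pairs).
    static inline __m256i MultiplyAccumulateRow8(__m256i Accumulator,
                                                 __m256i BPairWords,
                                                 __m256i ABroadcastPair)
    {
        return _mm256_add_epi32(Accumulator,
                                _mm256_madd_epi16(ABroadcastPair, BPairWords));
    }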
+ +Arguments: + + ColumnCount - Supplies the number of columns to produce. + + RowCount - Supplies the number of rows to produce. + + VectorOffset - Supplies the byte offset from matrix B to fetch elements. + + BroadcastOffset - Supplies the byte offset from matrix A to fetch elements. + +Implicit Arguments: + + rdi - Supplies the address into the matrix A data. + + rbx - Supplies the address into the matrix A data plus 3 rows. + + rsi - Supplies the address into the matrix B data. + + r10 - Supplies the length in bytes of a row from matrix A. + + ymm4-ymm15 - Supplies the block accumulators. + +--*/ + + .macro ComputeBlock ColumnCount, RowCount, VectorOffset, BroadcastOffset + + vpmovzxbw ymm0,XMMWORD PTR [rsi+\VectorOffset\()] + EmitIfCountGE \ColumnCount\(), 16, "vpmovzxbw ymm1,XMMWORD PTR [rsi+\VectorOffset\()+16]" + EmitIfCountGE \RowCount\(), 1, "vpbroadcastd ymm2,DWORD PTR [rdi+\BroadcastOffset\()]" + EmitIfCountGE \RowCount\(), 1, "MultiplyAccumulateRow \ColumnCount\(), ymm4, ymm5" + EmitIfCountGE \RowCount\(), 2, "vpbroadcastd ymm2,DWORD PTR [rdi+r10+\BroadcastOffset\()]" + EmitIfCountGE \RowCount\(), 2, "MultiplyAccumulateRow \ColumnCount\(), ymm6, ymm7" + EmitIfCountGE \RowCount\(), 3, "vpbroadcastd ymm2,DWORD PTR [rdi+r10*2+\BroadcastOffset\()]" + EmitIfCountGE \RowCount\(), 3, "MultiplyAccumulateRow \ColumnCount\(), ymm8, ymm9" + EmitIfCountGE \RowCount\(), 4, "vpbroadcastd ymm2,DWORD PTR [rbx+\BroadcastOffset\()]" + EmitIfCountGE \RowCount\(), 4, "MultiplyAccumulateRow \ColumnCount\(), ymm10, ymm11" + EmitIfCountGE \RowCount\(), 5, "vpbroadcastd ymm2,DWORD PTR [rbx+r10+\BroadcastOffset\()]" + EmitIfCountGE \RowCount\(), 5, "MultiplyAccumulateRow \ColumnCount\(), ymm12, ymm13" + EmitIfCountGE \RowCount\(), 6, "vpbroadcastd ymm2,DWORD PTR [rbx+r10*2+\BroadcastOffset\()]" + EmitIfCountGE \RowCount\(), 6, "MultiplyAccumulateRow \ColumnCount\(), ymm14, ymm15" + + .endm + +/*++ + +Macro Description: + + This macro generates code to produce an output block for a set of columns + and rows. + +Arguments: + + ColumnCount - Supplies the number of columns to produce. + + RowCount - Supplies the number of rows to produce. + +Implicit Arguments: + + rax - Supplies the length in bytes of a row from matrix C. + + rdi - Supplies the address into the matrix A data. + + rsi - Supplies the address into the matrix B data. + + rcx - Supplies the number of paired columns from matrix A and the number of + paired rows from matrix B to iterate over. + + r10 - Supplies the length in bytes of a row from matrix A. + + r12 - Supplies the address of the row sum vector. + + r13 - Supplies the address of the column sum vector. + +--*/ + + .macro ProduceOutputBlock ColumnCount, RowCount + +// +// Initialize the accumulators with the sum of the global depth value constant, +// the column sums, and the row sums. 
+// + + vpbroadcastd ymm1,DWORD PTR .LGemmU8U8KernelFrame_DepthValue[rsp] +.if \ColumnCount\() == 16 + vpaddd ymm0,ymm1,YMMWORD PTR [r13] + vpaddd ymm1,ymm1,YMMWORD PTR [r13+32] + add r13,16*4 # advance ColumnSumVector by 16 columns +.else + vpaddd ymm1,ymm1,YMMWORD PTR [r13] +.endif + EmitIfCountGE \RowCount\(), 1, "vpbroadcastd ymm5,DWORD PTR [r12]" + EmitIfCountGE \RowCount\(), 2, "vpbroadcastd ymm7,DWORD PTR [r12+4]" + EmitIfCountGE \RowCount\(), 3, "vpbroadcastd ymm9,DWORD PTR [r12+8]" + EmitIfCountGE \RowCount\(), 4, "vpbroadcastd ymm11,DWORD PTR [r12+12]" + EmitIfCountGE \RowCount\(), 5, "vpbroadcastd ymm13,DWORD PTR [r12+16]" + EmitIfCountGE \RowCount\(), 6, "vpbroadcastd ymm15,DWORD PTR [r12+20]" + EmitIfCount2GE \RowCount\(), 1, \ColumnCount\(), 16, "vpaddd ymm4,ymm5,ymm0" + EmitIfCountGE \RowCount\(), 1, "vpaddd ymm5,ymm5,ymm1" + EmitIfCount2GE \RowCount\(), 2, \ColumnCount\(), 16, "vpaddd ymm6,ymm7,ymm0" + EmitIfCountGE \RowCount\(), 2, "vpaddd ymm7,ymm7,ymm1" + EmitIfCount2GE \RowCount\(), 3, \ColumnCount\(), 16, "vpaddd ymm8,ymm9,ymm0" + EmitIfCountGE \RowCount\(), 3, "vpaddd ymm9,ymm9,ymm1" + EmitIfCount2GE \RowCount\(), 4, \ColumnCount\(), 16, "vpaddd ymm10,ymm11,ymm0" + EmitIfCountGE \RowCount\(), 4, "vpaddd ymm11,ymm11,ymm1" + EmitIfCount2GE \RowCount\(), 5, \ColumnCount\(), 16, "vpaddd ymm12,ymm13,ymm0" + EmitIfCountGE \RowCount\(), 5, "vpaddd ymm13,ymm13,ymm1" + EmitIfCount2GE \RowCount\(), 6, \ColumnCount\(), 16, "vpaddd ymm14,ymm15,ymm0" + EmitIfCountGE \RowCount\(), 6, "vpaddd ymm15,ymm15,ymm1" + +// +// Iterate over PairedCountK elements from matrix A and matrix B. +// +// Unrolling the loop to do two iterations improves performance slightly at the +// cost of larger code size. Balance this by only unrolling for the common case +// of computing 16 columns for an even number of rows. +// + + mov rbp,rcx # reload PairedCountK +.if \RowCount\() > 3 + lea rbx,[r10*2+r10] + add rbx,rdi # compute matrix A plus 3 rows +.endif + +.if (\ColumnCount\() == 16) && ((\RowCount\() & 1) == 0) + sub rbp,2 + jb .LProcessRemainingBlocks.\ColumnCount\().\RowCount\() + +.LComputeBlockLoop.\ColumnCount\().\RowCount\(): + ComputeBlock \ColumnCount\(), \RowCount\(), 0, 0 + ComputeBlock \ColumnCount\(), \RowCount\(), 32, 4 + add rdi,2*4 # advance matrix A by 2 pairs +.if \RowCount\() > 3 + add rbx,2*4 # advance matrix A plus 3 rows by 2 pairs +.endif + add rsi,2*32 # advance matrix B by 64 columns + sub rbp,2 # subtract pairs remaining + jae .LComputeBlockLoop.\ColumnCount\().\RowCount\() + +.LProcessRemainingBlocks.\ColumnCount\().\RowCount\(): + add rbp,2 # correct for over-subtract above + jz .LComputeBlockLoopExit.\ColumnCount\().\RowCount\() + ComputeBlock \ColumnCount\(), \RowCount\(), 0, 0 + add rsi,32 # advance matrix B by 32 columns +.else +.LComputeBlockLoop.\ColumnCount\().\RowCount\(): + ComputeBlock \ColumnCount\(), \RowCount\(), 0, 0 + add rdi,4 # advance matrix A by 1 pair +.if \RowCount\() > 3 + add rbx,4 # advance matrix A plus 3 rows by 1 pair +.endif + add rsi,32 + dec rbp # decrement pairs remaining + jnz .LComputeBlockLoop.\ColumnCount\().\RowCount\() +.endif + +.LComputeBlockLoopExit.\ColumnCount\().\RowCount\(): +.if \RowCount\() > 3 + lea rbx,[rdx+rax*2] # compute matrix C plus 3 rows + add rbx,rax +.endif + + .endm + +/*++ + +Macro Description: + + This macro generates code to compute matrix multiplication for a fixed set + of rows. + +Arguments: + + RowCount - Supplies the number of rows to process. 
+ + Fallthrough - Supplies a non-blank value if the macro may fall through to + the ExitKernel label. + +Implicit Arguments: + + rax - Supplies the length in bytes of a row from matrix C. + + rdi - Supplies the address of matrix A. + + rsi - Supplies the address of matrix B. + + rdx - Supplies the address of matrix C. + + r11 - Supplies the address of matrix A. + + r9 - Supplies the number of columns from matrix B and matrix C to iterate + over. + + rcx - Supplies the number of paired columns from matrix A and the number of + paired rows from matrix B to iterate over. + + r10 - Supplies the length in bytes of a row from matrix A. + + r12 - Supplies the address of the row sum vector. + + r13 - Supplies the address of the column sum vector. + + r14b - Supplies the zero mode flag. + +--*/ + + .macro ProcessCountM RowCount, Fallthrough + + cmp r9,8 + jbe .LProcessRemainingCountN.\RowCount\() + +.LProcessNextColumnLoop16xN.\RowCount\(): + ProduceOutputBlock 16, \RowCount\() + sub r9,16 + jb .LOutputMasked16xNBlock.\RowCount\() + test r14b,r14b # ZeroMode? + jnz .LSkipAccumulateOutput16xNBlock.\RowCount\() + EmitIfCountGE \RowCount\(), 1, "vpaddd ymm4,ymm4,YMMWORD PTR [rdx]" + EmitIfCountGE \RowCount\(), 1, "vpaddd ymm5,ymm5,YMMWORD PTR [rdx+32]" + EmitIfCountGE \RowCount\(), 2, "vpaddd ymm6,ymm6,YMMWORD PTR [rdx+rax]" + EmitIfCountGE \RowCount\(), 2, "vpaddd ymm7,ymm7,YMMWORD PTR [rdx+rax+32]" + EmitIfCountGE \RowCount\(), 3, "vpaddd ymm8,ymm8,YMMWORD PTR [rdx+rax*2]" + EmitIfCountGE \RowCount\(), 3, "vpaddd ymm9,ymm9,YMMWORD PTR [rdx+rax*2+32]" + EmitIfCountGE \RowCount\(), 4, "vpaddd ymm10,ymm10,YMMWORD PTR [rbx]" + EmitIfCountGE \RowCount\(), 4, "vpaddd ymm11,ymm11,YMMWORD PTR [rbx+32]" + EmitIfCountGE \RowCount\(), 5, "vpaddd ymm12,ymm12,YMMWORD PTR [rbx+rax]" + EmitIfCountGE \RowCount\(), 5, "vpaddd ymm13,ymm13,YMMWORD PTR [rbx+rax+32]" + EmitIfCountGE \RowCount\(), 6, "vpaddd ymm14,ymm14,YMMWORD PTR [rbx+rax*2]" + EmitIfCountGE \RowCount\(), 6, "vpaddd ymm15,ymm15,YMMWORD PTR [rbx+rax*2+32]" + +.LSkipAccumulateOutput16xNBlock.\RowCount\(): + EmitIfCountGE \RowCount\(), 1, "vmovdqu YMMWORD PTR [rdx],ymm4" + EmitIfCountGE \RowCount\(), 1, "vmovdqu YMMWORD PTR [rdx+32],ymm5" + EmitIfCountGE \RowCount\(), 2, "vmovdqu YMMWORD PTR [rdx+rax],ymm6" + EmitIfCountGE \RowCount\(), 2, "vmovdqu YMMWORD PTR [rdx+rax+32],ymm7" + EmitIfCountGE \RowCount\(), 3, "vmovdqu YMMWORD PTR [rdx+rax*2],ymm8" + EmitIfCountGE \RowCount\(), 3, "vmovdqu YMMWORD PTR [rdx+rax*2+32],ymm9" + EmitIfCountGE \RowCount\(), 4, "vmovdqu YMMWORD PTR [rbx],ymm10" + EmitIfCountGE \RowCount\(), 4, "vmovdqu YMMWORD PTR [rbx+32],ymm11" + EmitIfCountGE \RowCount\(), 5, "vmovdqu YMMWORD PTR [rbx+rax],ymm12" + EmitIfCountGE \RowCount\(), 5, "vmovdqu YMMWORD PTR [rbx+rax+32],ymm13" + EmitIfCountGE \RowCount\(), 6, "vmovdqu YMMWORD PTR [rbx+rax*2],ymm14" + EmitIfCountGE \RowCount\(), 6, "vmovdqu YMMWORD PTR [rbx+rax*2+32],ymm15" + add rdx,16*4 # advance matrix C by 16 columns + mov rdi,r11 # reload matrix A + cmp r9,8 + ja .LProcessNextColumnLoop16xN.\RowCount\() + test r9,r9 + jz .LExitKernel + +.LProcessRemainingCountN.\RowCount\(): + ProduceOutputBlock 8, \RowCount\() + cmp r9,8 + jb .LOutputMasked8xNBlock.\RowCount\() + test r14b,r14b # ZeroMode? 
+ jnz .LSkipAccumulateOutput8xNBlock.\RowCount\() + EmitIfCountGE \RowCount\(), 1, "vpaddd ymm5,ymm5,YMMWORD PTR [rdx]" + EmitIfCountGE \RowCount\(), 2, "vpaddd ymm7,ymm7,YMMWORD PTR [rdx+rax]" + EmitIfCountGE \RowCount\(), 3, "vpaddd ymm9,ymm9,YMMWORD PTR [rdx+rax*2]" + EmitIfCountGE \RowCount\(), 4, "vpaddd ymm11,ymm11,YMMWORD PTR [rbx]" + EmitIfCountGE \RowCount\(), 5, "vpaddd ymm13,ymm13,YMMWORD PTR [rbx+rax]" + EmitIfCountGE \RowCount\(), 6, "vpaddd ymm15,ymm15,YMMWORD PTR [rbx+rax*2]" + +.LSkipAccumulateOutput8xNBlock.\RowCount\(): + EmitIfCountGE \RowCount\(), 1, "vmovdqu YMMWORD PTR [rdx],ymm5" + EmitIfCountGE \RowCount\(), 2, "vmovdqu YMMWORD PTR [rdx+rax],ymm7" + EmitIfCountGE \RowCount\(), 3, "vmovdqu YMMWORD PTR [rdx+rax*2],ymm9" + EmitIfCountGE \RowCount\(), 4, "vmovdqu YMMWORD PTR [rbx],ymm11" + EmitIfCountGE \RowCount\(), 5, "vmovdqu YMMWORD PTR [rbx+rax],ymm13" + EmitIfCountGE \RowCount\(), 6, "vmovdqu YMMWORD PTR [rbx+rax*2],ymm15" + jmp .LExitKernel + +.LOutputMasked16xNBlock.\RowCount\(): + test r14b,r14b # ZeroMode? + jnz .LSkipAccumulateOutputMasked16xNBlock.\RowCount\() + EmitIfCountGE \RowCount\(), 1, "vpaddd ymm4,ymm4,YMMWORD PTR [rdx]" + EmitIfCountGE \RowCount\(), 2, "vpaddd ymm6,ymm6,YMMWORD PTR [rdx+rax]" + EmitIfCountGE \RowCount\(), 3, "vpaddd ymm8,ymm8,YMMWORD PTR [rdx+rax*2]" + EmitIfCountGE \RowCount\(), 4, "vpaddd ymm10,ymm10,YMMWORD PTR [rbx]" + EmitIfCountGE \RowCount\(), 5, "vpaddd ymm12,ymm12,YMMWORD PTR [rbx+rax]" + EmitIfCountGE \RowCount\(), 6, "vpaddd ymm14,ymm14,YMMWORD PTR [rbx+rax*2]" + +.LSkipAccumulateOutputMasked16xNBlock.\RowCount\(): + EmitIfCountGE \RowCount\(), 1, "vmovdqu YMMWORD PTR [rdx],ymm4" + EmitIfCountGE \RowCount\(), 2, "vmovdqu YMMWORD PTR [rdx+rax],ymm6" + EmitIfCountGE \RowCount\(), 3, "vmovdqu YMMWORD PTR [rdx+rax*2],ymm8" + EmitIfCountGE \RowCount\(), 4, "vmovdqu YMMWORD PTR [rbx],ymm10" + EmitIfCountGE \RowCount\(), 5, "vmovdqu YMMWORD PTR [rbx+rax],ymm12" + EmitIfCountGE \RowCount\(), 6, "vmovdqu YMMWORD PTR [rbx+rax*2],ymm14" + add rdx,8*4 # advance matrix C by 8 columns +.if \RowCount\() > 3 + add rbx,8*4 # advance matrix C plus 3 rows by 8 columns +.endif + add r9,8 # correct for over-subtract above + +.LOutputMasked8xNBlock.\RowCount\(): + mov DWORD PTR .LGemmU8U8KernelFrame_mask[rsp],r9d + vpbroadcastd ymm0,DWORD PTR .LGemmU8U8KernelFrame_mask[rsp] + vpcmpgtd ymm0,ymm0,YMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] + test r14b,r14b # ZeroMode? 
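[Editor's note] The vpcmpgtd just above is the standard MlasMaskMoveAvx trick: comparing a broadcast of the remaining column count against {0,1,...,7} produces an all-ones lane exactly for the lanes still in range, which then drives the masked vpmaskmovd loads and stores. The same idea in intrinsics (illustrative only):

    #include <immintrin.h>

    // Lanes 0..RemainingColumns-1 become 0xFFFFFFFF, the rest become zero.
    static inline __m256i BuildTailMask(int32_t RemainingColumns)
    {
        const __m256i LaneIndices = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm256_cmpgt_epi32(_mm256_set1_epi32(RemainingColumns), LaneIndices);
    }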
+ jnz .LSkipAccumulateOutputMasked8xNBlock.\RowCount\() + EmitIfCountGE \RowCount\(), 1, "vpmaskmovd ymm4,ymm0,YMMWORD PTR [rdx]" + EmitIfCountGE \RowCount\(), 2, "vpmaskmovd ymm6,ymm0,YMMWORD PTR [rdx+rax]" + EmitIfCountGE \RowCount\(), 3, "vpmaskmovd ymm8,ymm0,YMMWORD PTR [rdx+rax*2]" + EmitIfCountGE \RowCount\(), 4, "vpmaskmovd ymm10,ymm0,YMMWORD PTR [rbx]" + EmitIfCountGE \RowCount\(), 5, "vpmaskmovd ymm12,ymm0,YMMWORD PTR [rbx+rax]" + EmitIfCountGE \RowCount\(), 6, "vpmaskmovd ymm14,ymm0,YMMWORD PTR [rbx+rax*2]" + EmitIfCountGE \RowCount\(), 1, "vpaddd ymm5,ymm5,ymm4" + EmitIfCountGE \RowCount\(), 2, "vpaddd ymm7,ymm7,ymm6" + EmitIfCountGE \RowCount\(), 3, "vpaddd ymm9,ymm9,ymm8" + EmitIfCountGE \RowCount\(), 4, "vpaddd ymm11,ymm11,ymm10" + EmitIfCountGE \RowCount\(), 5, "vpaddd ymm13,ymm13,ymm12" + EmitIfCountGE \RowCount\(), 6, "vpaddd ymm15,ymm15,ymm14" + +.LSkipAccumulateOutputMasked8xNBlock.\RowCount\(): + EmitIfCountGE \RowCount\(), 1, "vpmaskmovd YMMWORD PTR [rdx],ymm0,ymm5" + EmitIfCountGE \RowCount\(), 2, "vpmaskmovd YMMWORD PTR [rdx+rax],ymm0,ymm7" + EmitIfCountGE \RowCount\(), 3, "vpmaskmovd YMMWORD PTR [rdx+rax*2],ymm0,ymm9" + EmitIfCountGE \RowCount\(), 4, "vpmaskmovd YMMWORD PTR [rbx],ymm0,ymm11" + EmitIfCountGE \RowCount\(), 5, "vpmaskmovd YMMWORD PTR [rbx+rax],ymm0,ymm13" + EmitIfCountGE \RowCount\(), 6, "vpmaskmovd YMMWORD PTR [rbx+rax*2],ymm0,ymm15" +.ifb \Fallthrough\() + jmp .LExitKernel +.endif + + .endm + +/*++ + +Routine Description: + + This routine is an inner kernel to compute matrix multiplication for a + set of rows. + +Arguments: + + A (rdi) - Supplies the address of matrix A. The matrix data has been packed + using MlasGemmU8U8CopyPackAAvx2. + + B (rsi) - Supplies the address of matrix B. The matrix data has been packed + using MlasGemmU8U8CopyPackBAvx2. + + C (rdx) - Supplies the address of matrix C. + + PairedCountK (rcx) - Supplies the number of paired columns from matrix A and + the number of paired rows from matrix B to iterate over. + + CountM (r8) - Supplies the maximum number of rows that can be processed for + matrix A and matrix C. The actual number of rows handled for this + invocation depends on the kernel implementation. + + CountN (r9) - Supplies the number of columns from matrix B and matrix C to + iterate over. + + ldc - Supplies the first dimension of matrix C. + + RowSumVector - Supplies the sum of each row from matrix A multiplied by the + zero point offset of matrix B. These values are accumulated into every + row of matrix C. + + ColumnSumVector - Supplies the sum of each column from matrix B multiplied + by the zero point offset of matrix A. These values are accumulated into + every column of matrix C. + + DepthValue - Supplies the value CountK multiplied by the zero point offset + of matrixA multplied by the zero point offset of matrix B. This value is + accumulated into every element of matrix C. + + ZeroMode - Supplies true if the output matrix must be zero initialized, + else false if the output matrix is accumulated into. + +Return Value: + + Returns the number of rows handled. 
+ +--*/ + + .globl C_UNDERSCORE(MlasGemmU8U8KernelAvx2) +C_UNDERSCORE(MlasGemmU8U8KernelAvx2): + + push rbp + push rbx + push r12 + push r13 + push r14 + + mov rax,.LGemmU8U8KernelFrame_ldc[rsp] + shl rax,2 # convert ldc to bytes + lea r10,[rcx*4] + mov r11,rdi + mov r12,.LGemmU8U8KernelFrame_RowSumVector[rsp] + mov r13,.LGemmU8U8KernelFrame_ColumnSumVector[rsp] + movzx r14,BYTE PTR .LGemmU8U8KernelFrame_ZeroMode[rsp] + +// +// Process CountM rows of the matrices. +// + + cmp r8,5 + ja .LProcessCountM6 + je .LProcessCountM5 + cmp r8,3 + ja .LProcessCountM4 + je .LProcessCountM3 + cmp r8,1 + je .LProcessCountM1 + +.LProcessCountM2: + ProcessCountM 2 + +.LProcessCountM4: + ProcessCountM 4 + +.LProcessCountM6: + mov r8d,6 # return 6 rows handled + ProcessCountM 6, Fallthrough + +// +// Restore non-volatile registers and return. +// + +.LExitKernel: + mov eax,r8d + vzeroupper + + pop r14 + pop r13 + pop r12 + pop rbx + pop rbp + ret + +.LProcessCountM1: + ProcessCountM 1 + +.LProcessCountM3: + ProcessCountM 3 + +.LProcessCountM5: + ProcessCountM 5 + + .end diff --git a/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512BW.S b/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512BW.S new file mode 100644 index 0000000000000..bacb29a9a138c --- /dev/null +++ b/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512BW.S @@ -0,0 +1,120 @@ +/*++ + +Copyright (c) Microsoft Corporation. All rights reserved. + +Licensed under the MIT License. + +Module Name: + + QgemmU8U8KernelAvx512BW.s + +Abstract: + + This module implements the kernels for the quantized integer matrix/matrix + multiply operation (QGEMM). + + This implementation uses AVX512BW instructions. + +--*/ + +#include "asmmacro.h" +#include "QgemmU8U8KernelAvx512Common.h" + + .intel_syntax noprefix + + .text + +/*++ + +Macro Description: + + This macro generates code to multiply and accumulator a single row of the + output block. + +Arguments: + + ColumnCount - Supplies the number of columns to produce. + + Vec1Reg - Supplies the high block accumulator register (when ColumnCount + is 32). + + Vec2Reg - Supplies the low block accumulator register. + +Implicit Arguments: + + zmm28 - Supplies the first vector loaded from matrix B. + + zmm29 - Supplies the second vector loaded from matrix B (when ColumnCount + is 32). + + zmm30 - Supplies the broadcast value loaded from matrix A. + +--*/ + + .macro MultiplyAccumulateRow ColumnCount, Vec1Reg, Vec2Reg + +.if \ColumnCount\() == 32 + vpmaddwd zmm31,zmm30,zmm28 + vpaddd \Vec1Reg\(),\Vec1Reg\(),zmm31 + vpmaddwd zmm30,zmm30,zmm29 + vpaddd \Vec2Reg\(),\Vec2Reg\(),zmm30 +.else + vpmaddwd zmm31,zmm30,zmm28 + vpaddd \Vec2Reg\(),\Vec2Reg\(),zmm31 +.endif + + .endm + +/*++ + +Macro Description: + + This macro generates code to multiply and accumulate each row of the output + block. + +Arguments: + + ColumnCount - Supplies the number of columns to produce. + + RowCount - Supplies the number of rows to produce. + +Implicit Arguments: + + rdi - Supplies the address into the matrix A data. + + rbx - Supplies the address into the matrix A data plus 3 rows. + + rsi - Supplies the address into the matrix B data. + + r10 - Supplies the length in bytes of a row from matrix A. + + zmm16-zmm27 - Supplies the block accumulators. 
+ +--*/ + + .macro ComputeBlock ColumnCount, RowCount + + vpmovzxbw zmm28,YMMWORD PTR [rsi] + EmitIfCountGE \ColumnCount\(), 32, "vpmovzxbw zmm29,YMMWORD PTR [rsi+r10*8]" + EmitIfCountGE \RowCount\(), 1, "vpbroadcastd zmm30,DWORD PTR [rdi]" + EmitIfCountGE \RowCount\(), 1, "MultiplyAccumulateRow \ColumnCount\(), zmm16, zmm17" + EmitIfCountGE \RowCount\(), 2, "vpbroadcastd zmm30,DWORD PTR [rdi+r10]" + EmitIfCountGE \RowCount\(), 2, "MultiplyAccumulateRow \ColumnCount\(), zmm18, zmm19" + EmitIfCountGE \RowCount\(), 3, "vpbroadcastd zmm30,DWORD PTR [rdi+r10*2]" + EmitIfCountGE \RowCount\(), 3, "MultiplyAccumulateRow \ColumnCount\(), zmm20, zmm21" + EmitIfCountGE \RowCount\(), 4, "vpbroadcastd zmm30,DWORD PTR [rbx]" + EmitIfCountGE \RowCount\(), 4, "MultiplyAccumulateRow \ColumnCount\(), zmm22, zmm23" + EmitIfCountGE \RowCount\(), 5, "vpbroadcastd zmm30,DWORD PTR [rbx+r10]" + EmitIfCountGE \RowCount\(), 5, "MultiplyAccumulateRow \ColumnCount\(), zmm24, zmm25" + EmitIfCountGE \RowCount\(), 6, "vpbroadcastd zmm30,DWORD PTR [rbx+r10*2]" + EmitIfCountGE \RowCount\(), 6, "MultiplyAccumulateRow \ColumnCount\(), zmm26, zmm27" + + .endm + +// +// Generate the GEMM kernel. +// + +GemmU8U8KernelAvx512Function Avx512BW + + .end diff --git a/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Common.h b/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Common.h new file mode 100644 index 0000000000000..3abd87b7ce986 --- /dev/null +++ b/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Common.h @@ -0,0 +1,361 @@ +/*++ + +Copyright (c) Microsoft Corporation. All rights reserved. + +Licensed under the MIT License. + +Module Name: + + QgemmU8U8KernelAvx512Common.h + +Abstract: + + This module contains common kernel macros and structures for the quantized + integer matrix/matrix multiply operation (QGEMM) for the AVX512BW and + AVX512VNNI kernels. + +--*/ + +// +// Stack frame layout for the U8U8 kernel. +// + + .equ .LGemmU8U8KernelFrame_SavedR14, 0 + .equ .LGemmU8U8KernelFrame_SavedR13, 8 + .equ .LGemmU8U8KernelFrame_SavedR12, 16 + .equ .LGemmU8U8KernelFrame_SavedRbx, 24 + .equ .LGemmU8U8KernelFrame_SavedRbp, 32 + .equ .LGemmU8U8KernelFrame_ReturnAddress, 40 + .equ .LGemmU8U8KernelFrame_ldc, 48 + .equ .LGemmU8U8KernelFrame_RowSumVector, 56 + .equ .LGemmU8U8KernelFrame_ColumnSumVector, 64 + .equ .LGemmU8U8KernelFrame_DepthValue, 72 + .equ .LGemmU8U8KernelFrame_ZeroMode, 80 + +/*++ + +Macro Description: + + This macro generates code to produce an output block for a set of columns + and rows. + +Arguments: + + ColumnCount - Supplies the number of columns to produce. + + RowCount - Supplies the number of rows to produce. + +Implicit Arguments: + + rax - Supplies the length in bytes of a row from matrix C. + + rdi - Supplies the address into the matrix A data. + + rsi - Supplies the address into the matrix B data. + + rcx - Supplies the number of paired columns from matrix A and the number of + paired rows from matrix B to iterate over. + + r10 - Supplies the length in bytes of a row from matrix A. + + r12 - Supplies the address of the row sum vector. + + r13 - Supplies the address of the column sum vector. + +--*/ + + .macro ProduceOutputBlock ColumnCount, RowCount + +// +// Initialize the accumulators with the sum of the global depth value constant, +// the column sums, and the row sums. 
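[Editor's note] The AVX-512 path seeds its accumulators the same way as the AVX2 code, but leans on embedded broadcasts ({1to16}) so the per-row sum never needs a register of its own. Per 16-column accumulator this amounts to the following, shown with AVX-512F intrinsics for illustration only:

    #include <immintrin.h>
    #include <cstdint>

    // One 16-column accumulator starts at DepthValue + RowSum + ColumnSum[j..j+15].
    static inline __m512i SeedAccumulator16(int32_t DepthValue, int32_t RowSum,
                                            const int32_t* ColumnSum)
    {
        __m512i Seed = _mm512_set1_epi32(DepthValue + RowSum);
        return _mm512_add_epi32(Seed, _mm512_loadu_si512(ColumnSum));
    }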
+// + + vpbroadcastd zmm31,DWORD PTR .LGemmU8U8KernelFrame_DepthValue[rsp] +.if \ColumnCount\() == 32 + vpaddd zmm30,zmm31,ZMMWORD PTR [r13] + vpaddd zmm31,zmm31,ZMMWORD PTR [r13+64] + add r13,32*4 # advance ColumnSumVector by 32 columns +.else + vpaddd zmm31,zmm31,ZMMWORD PTR [r13] +.endif + EmitIfCount2GE \RowCount\(), 1, \ColumnCount\(), 32, "vpaddd zmm16,zmm30,DWORD PTR [r12]{1to16}" + EmitIfCountGE \RowCount\(), 1, "vpaddd zmm17,zmm31,DWORD PTR [r12]{1to16}" + EmitIfCount2GE \RowCount\(), 2, \ColumnCount\(), 32, "vpaddd zmm18,zmm30,DWORD PTR [r12+4]{1to16}" + EmitIfCountGE \RowCount\(), 2, "vpaddd zmm19,zmm31,DWORD PTR [r12+4]{1to16}" + EmitIfCount2GE \RowCount\(), 3, \ColumnCount\(), 32, "vpaddd zmm20,zmm30,DWORD PTR [r12+8]{1to16}" + EmitIfCountGE \RowCount\(), 3, "vpaddd zmm21,zmm31,DWORD PTR [r12+8]{1to16}" + EmitIfCount2GE \RowCount\(), 4, \ColumnCount\(), 32, "vpaddd zmm22,zmm30,DWORD PTR [r12+12]{1to16}" + EmitIfCountGE \RowCount\(), 4, "vpaddd zmm23,zmm31,DWORD PTR [r12+12]{1to16}" + EmitIfCount2GE \RowCount\(), 5, \ColumnCount\(), 32, "vpaddd zmm24,zmm30,DWORD PTR [r12+16]{1to16}" + EmitIfCountGE \RowCount\(), 5, "vpaddd zmm25,zmm31,DWORD PTR [r12+16]{1to16}" + EmitIfCount2GE \RowCount\(), 6, \ColumnCount\(), 32, "vpaddd zmm26,zmm30,DWORD PTR [r12+20]{1to16}" + EmitIfCountGE \RowCount\(), 6, "vpaddd zmm27,zmm31,DWORD PTR [r12+20]{1to16}" + +// +// Iterate over PairedCountK elements from matrix A and matrix B. +// + + mov rbp,rcx # reload PairedCountK +.if \RowCount\() > 3 + lea rbx,[r10*2+r10] + add rbx,rdi # compute matrix A plus 3 rows +.endif + +.LComputeBlockLoop.\ColumnCount\().\RowCount\(): + ComputeBlock \ColumnCount\(), \RowCount\() + add rdi,4 # advance matrix A by 1 pair +.if \RowCount\() > 3 + add rbx,4 # advance matrix A plus 3 rows by 1 pair +.endif + add rsi,32 + dec rbp # decrement pairs remaining + jnz .LComputeBlockLoop.\ColumnCount\().\RowCount\() + +.if \RowCount\() > 3 + lea rbx,[rdx+rax*2] # compute matrix C plus 3 rows + add rbx,rax +.endif + + .endm + +/*++ + +Macro Description: + + This macro generates code to compute matrix multiplication for a fixed set + of rows. + +Arguments: + + RowCount - Supplies the number of rows to process. + +Implicit Arguments: + + rax - Supplies the length in bytes of a row from matrix C. + + rdi - Supplies the address of matrix A. + + rsi - Supplies the address of matrix B. + + rdx - Supplies the address of matrix C. + + r11 - Supplies the address of matrix A. + + r9 - Supplies the number of columns from matrix B and matrix C to iterate + over. + + rcx - Supplies the number of paired columns from matrix A and the number of + paired rows from matrix B to iterate over. + + r10 - Supplies the length in bytes of a row from matrix A. + + r12 - Supplies the address of the row sum vector. + + r13 - Supplies the address of the column sum vector. + + r14b - Supplies the zero mode flag. + +--*/ + + .macro ProcessCountM RowCount + + cmp r9,16 + jbe .LProcessRemainingCountN.\RowCount\() + +.LProcessNextColumnLoop32xN.\RowCount\(): + ProduceOutputBlock 32, \RowCount\() + lea rsi,[rsi+r10*8] # advance matrix B by 8*PairedCountK + test r14b,r14b # ZeroMode? 
+ jnz .LSkipAccumulateOutput32xNBlock.\RowCount\() + EmitIfCountGE \RowCount\(), 1, "vpaddd zmm16,zmm16,ZMMWORD PTR [rdx]" + EmitIfCountGE \RowCount\(), 2, "vpaddd zmm18,zmm18,ZMMWORD PTR [rdx+rax]" + EmitIfCountGE \RowCount\(), 3, "vpaddd zmm20,zmm20,ZMMWORD PTR [rdx+rax*2]" + EmitIfCountGE \RowCount\(), 4, "vpaddd zmm22,zmm22,ZMMWORD PTR [rbx]" + EmitIfCountGE \RowCount\(), 5, "vpaddd zmm24,zmm24,ZMMWORD PTR [rbx+rax]" + EmitIfCountGE \RowCount\(), 6, "vpaddd zmm26,zmm26,ZMMWORD PTR [rbx+rax*2]" + +.LSkipAccumulateOutput32xNBlock.\RowCount\(): + EmitIfCountGE \RowCount\(), 1, "vmovdqu32 ZMMWORD PTR [rdx],zmm16" + EmitIfCountGE \RowCount\(), 2, "vmovdqu32 ZMMWORD PTR [rdx+rax],zmm18" + EmitIfCountGE \RowCount\(), 3, "vmovdqu32 ZMMWORD PTR [rdx+rax*2],zmm20" + EmitIfCountGE \RowCount\(), 4, "vmovdqu32 ZMMWORD PTR [rbx],zmm22" + EmitIfCountGE \RowCount\(), 5, "vmovdqu32 ZMMWORD PTR [rbx+rax],zmm24" + EmitIfCountGE \RowCount\(), 6, "vmovdqu32 ZMMWORD PTR [rbx+rax*2],zmm26" + add rdx,16*4 # advance matrix C by 16 columns +.if \RowCount\() > 3 + add rbx,16*4 # advance matrix C plus 3 rows by 16 columns +.endif + sub r9,16 + +.LOutput16xNBlock.\RowCount\(): + sub r9,16 + jae .LOutput16xNBlockWithMask.\RowCount\() + lea rcx,[r9+16] # correct for over-subtract above + mov ebp,1 + shl ebp,cl + dec ebp + kmovw k1,ebp # update mask for remaining columns + xor r9,r9 # no more columns remaining + +.LOutput16xNBlockWithMask.\RowCount\(): + test r14b,r14b # ZeroMode? + jnz .LSkipAccumulateOutput16xNBlockWithMask.\RowCount\() + EmitIfCountGE \RowCount\(), 1, "vpaddd zmm17{k1},zmm17,ZMMWORD PTR [rdx]" + EmitIfCountGE \RowCount\(), 2, "vpaddd zmm19{k1},zmm19,ZMMWORD PTR [rdx+rax]" + EmitIfCountGE \RowCount\(), 3, "vpaddd zmm21{k1},zmm21,ZMMWORD PTR [rdx+rax*2]" + EmitIfCountGE \RowCount\(), 4, "vpaddd zmm23{k1},zmm23,ZMMWORD PTR [rbx]" + EmitIfCountGE \RowCount\(), 5, "vpaddd zmm25{k1},zmm25,ZMMWORD PTR [rbx+rax]" + EmitIfCountGE \RowCount\(), 6, "vpaddd zmm27{k1},zmm27,ZMMWORD PTR [rbx+rax*2]" + +.LSkipAccumulateOutput16xNBlockWithMask.\RowCount\(): + EmitIfCountGE \RowCount\(), 1, "vmovdqu32 ZMMWORD PTR [rdx]{k1},zmm17" + EmitIfCountGE \RowCount\(), 2, "vmovdqu32 ZMMWORD PTR [rdx+rax]{k1},zmm19" + EmitIfCountGE \RowCount\(), 3, "vmovdqu32 ZMMWORD PTR [rdx+rax*2]{k1},zmm21" + EmitIfCountGE \RowCount\(), 4, "vmovdqu32 ZMMWORD PTR [rbx]{k1},zmm23" + EmitIfCountGE \RowCount\(), 5, "vmovdqu32 ZMMWORD PTR [rbx+rax]{k1},zmm25" + EmitIfCountGE \RowCount\(), 6, "vmovdqu32 ZMMWORD PTR [rbx+rax*2]{k1},zmm27" + add rdx,16*4 # advance matrix C by 16 columns + mov rdi,r11 # reload matrix A + cmp r9,16 + ja .LProcessNextColumnLoop32xN.\RowCount\() + test r9,r9 + jz .LExitKernel + +.LProcessRemainingCountN.\RowCount\(): + ProduceOutputBlock 16, \RowCount\() + jmp .LOutput16xNBlock.\RowCount\() + + .endm + +/*++ + +Macro Description: + + This macro generates the common AVX512 code for the inner kernel to compute + matrix multiplication. + +Arguments: + + Isa - Supplies the instruction set architecture string for function tags. + +--*/ + + .macro GemmU8U8KernelAvx512Function Isa + +/*++ + +Routine Description: + + This routine is an inner kernel to compute matrix multiplication for a + set of rows. + +Arguments: + + A (rdi) - Supplies the address of matrix A. The matrix data has been packed + using MlasGemmU8U8CopyPackAAvx2. + + B (rsi) - Supplies the address of matrix B. The matrix data has been packed + using MlasGemmU8U8CopyPackBAvx2. + + C (rdx) - Supplies the address of matrix C. 
+ + PairedCountK (rcx) - Supplies the number of paired columns from matrix A and + the number of paired rows from matrix B to iterate over. + + CountM (r8) - Supplies the maximum number of rows that can be processed for + matrix A and matrix C. The actual number of rows handled for this + invocation depends on the kernel implementation. + + CountN (r9) - Supplies the number of columns from matrix B and matrix C to + iterate over. + + ldc - Supplies the first dimension of matrix C. + + RowSumVector - Supplies the sum of each row from matrix A multiplied by the + zero point offset of matrix B. These values are accumulated into every + row of matrix C. + + ColumnSumVector - Supplies the sum of each column from matrix B multiplied + by the zero point offset of matrix A. These values are accumulated into + every column of matrix C. + + DepthValue - Supplies the value CountK multiplied by the zero point offset + of matrix A multiplied by the zero point offset of matrix B. This value is + accumulated into every element of matrix C. + + ZeroMode - Supplies true if the output matrix must be zero initialized, + else false if the output matrix is accumulated into. + +Return Value: + + Returns the number of rows handled. + +--*/ + + .globl C_UNDERSCORE(MlasGemmU8U8Kernel\Isa\()) +C_UNDERSCORE(MlasGemmU8U8Kernel\Isa\()): + + push rbp + push rbx + push r12 + push r13 + push r14 + + mov rax,.LGemmU8U8KernelFrame_ldc[rsp] + shl rax,2 # convert ldc to bytes + lea r10,[rcx*4] + mov r11,rdi + mov r12,.LGemmU8U8KernelFrame_RowSumVector[rsp] + mov r13,.LGemmU8U8KernelFrame_ColumnSumVector[rsp] + movzx r14,BYTE PTR .LGemmU8U8KernelFrame_ZeroMode[rsp] + mov ebp,-1 + kmovw k1,ebp # update mask to write all columns + +// +// Process CountM rows of the matrices. +// + + cmp r8,5 + ja .LProcessCountM6 + je .LProcessCountM5 + cmp r8,3 + ja .LProcessCountM4 + je .LProcessCountM3 + cmp r8,1 + je .LProcessCountM1 + +.LProcessCountM2: + ProcessCountM 2 + +.LProcessCountM4: + ProcessCountM 4 + +.LProcessCountM6: + mov r8d,6 # return 6 rows handled + ProcessCountM 6 + +// +// Restore non-volatile registers and return. +// + +.LExitKernel: + mov eax,r8d + + pop r14 + pop r13 + pop r12 + pop rbx + pop rbp + ret + +.LProcessCountM1: + ProcessCountM 1 + +.LProcessCountM3: + ProcessCountM 3 + +.LProcessCountM5: + ProcessCountM 5 + + .endm diff --git a/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Vnni.S b/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Vnni.S new file mode 100644 index 0000000000000..76a85427d5689 --- /dev/null +++ b/onnxruntime/core/mlas/lib/x86_64/QgemmU8U8KernelAvx512Vnni.S @@ -0,0 +1,95 @@ +/*++ + +Copyright (c) Microsoft Corporation. All rights reserved. + +Licensed under the MIT License. + +Module Name: + + QgemmU8U8KernelAvx512Vnni.s + +Abstract: + + This module implements the kernels for the quantized integer matrix/matrix + multiply operation (QGEMM). + + This implementation uses AVX512VNNI instructions. + +--*/ + +#include "asmmacro.h" +#include "QgemmU8U8KernelAvx512Common.h" +#include "AssembleAvx512Vnni.h" + + .intel_syntax noprefix + + .text + +/*++ + +Macro Description: + + This macro generates code to multiply and accumulate each row of the output + block. + +Arguments: + + ColumnCount - Supplies the number of columns to produce. + + RowCount - Supplies the number of rows to produce. + +Implicit Arguments: + + rdi - Supplies the address into the matrix A data. + + rbx - Supplies the address into the matrix A data plus 3 rows. + + rsi - Supplies the address into the matrix B data.
+ + r10 - Supplies the length in bytes of a row from matrix A. + + zmm16-zmm27 - Supplies the block accumulators. + +--*/ + + .macro ComputeBlock ColumnCount, RowCount + + vpmovzxbw zmm28,YMMWORD PTR [rsi] +.if \ColumnCount\() == 32 + vpmovzxbw zmm29,YMMWORD PTR [rsi+r10*8] + EmitIfCountGE \RowCount\(), 1, "vpbroadcastd zmm30,DWORD PTR [rdi]" + EmitIfCountGE \RowCount\(), 1, "VpdpwssdZmmZmmZmm zmm16,zmm28,zmm30" + EmitIfCountGE \RowCount\(), 1, "VpdpwssdZmmZmmZmm zmm17,zmm29,zmm30" + EmitIfCountGE \RowCount\(), 2, "vpbroadcastd zmm30,DWORD PTR [rdi+r10]" + EmitIfCountGE \RowCount\(), 2, "VpdpwssdZmmZmmZmm zmm18,zmm28,zmm30" + EmitIfCountGE \RowCount\(), 2, "VpdpwssdZmmZmmZmm zmm19,zmm29,zmm30" + EmitIfCountGE \RowCount\(), 3, "vpbroadcastd zmm30,DWORD PTR [rdi+r10*2]" + EmitIfCountGE \RowCount\(), 3, "VpdpwssdZmmZmmZmm zmm20,zmm28,zmm30" + EmitIfCountGE \RowCount\(), 3, "VpdpwssdZmmZmmZmm zmm21,zmm29,zmm30" + EmitIfCountGE \RowCount\(), 4, "vpbroadcastd zmm30,DWORD PTR [rbx]" + EmitIfCountGE \RowCount\(), 4, "VpdpwssdZmmZmmZmm zmm22,zmm28,zmm30" + EmitIfCountGE \RowCount\(), 4, "VpdpwssdZmmZmmZmm zmm23,zmm29,zmm30" + EmitIfCountGE \RowCount\(), 5, "vpbroadcastd zmm30,DWORD PTR [rbx+r10]" + EmitIfCountGE \RowCount\(), 5, "VpdpwssdZmmZmmZmm zmm24,zmm28,zmm30" + EmitIfCountGE \RowCount\(), 5, "VpdpwssdZmmZmmZmm zmm25,zmm29,zmm30" + EmitIfCountGE \RowCount\(), 6, "vpbroadcastd zmm30,DWORD PTR [rbx+r10*2]" + EmitIfCountGE \RowCount\(), 6, "VpdpwssdZmmZmmZmm zmm26,zmm28,zmm30" + EmitIfCountGE \RowCount\(), 6, "VpdpwssdZmmZmmZmm zmm27,zmm29,zmm30" +.else + EmitIfCountGE \RowCount\(), 1, "VpdpwssdZmmZmmBroadcast zmm17,zmm28,rdi" + EmitIfCountGE \RowCount\(), 2, "VpdpwssdZmmZmmBroadcast zmm19,zmm28,rdi,r10,1" + EmitIfCountGE \RowCount\(), 3, "VpdpwssdZmmZmmBroadcast zmm21,zmm28,rdi,r10,2" + EmitIfCountGE \RowCount\(), 4, "VpdpwssdZmmZmmBroadcast zmm23,zmm28,rbx" + EmitIfCountGE \RowCount\(), 5, "VpdpwssdZmmZmmBroadcast zmm25,zmm28,rbx,r10,1" + EmitIfCountGE \RowCount\(), 6, "VpdpwssdZmmZmmBroadcast zmm27,zmm28,rbx,r10,2" +.endif + + .endm + +// +// Generate the GEMM kernel. 
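For readers of the two U8U8 kernels above: the reason ProduceOutputBlock can seed the accumulators once from DepthValue, ColumnSumVector, and RowSumVector and then let ComputeBlock accumulate only raw products is the usual zero-point expansion of a quantized dot product. The stand-alone C++ check below is a minimal illustration of that identity only, not the MLAS code; the matrix sizes, data values, and zero points are made up, and the sign convention of the precomputed vectors is assumed to be folded in by the kernel's caller.

// Illustrative scalar check (not MLAS): sum_k (A[i][k]-za)*(B[k][j]-zb)
//   == sum_k A[i][k]*B[k][j]          (what ComputeBlock accumulates)
//    + K*za*zb                        (DepthValue)
//    - zb*sum_k A[i][k]               (RowSumVector term for row i)
//    - za*sum_k B[k][j]               (ColumnSumVector term for column j)
// The real kernel's caller pre-folds signs and zero points into the vectors it passes in.
#include <cassert>
#include <cstdint>
#include <vector>

int main() {
  const int M = 2, N = 3, K = 4;
  const int32_t za = 121, zb = 7;  // assumed zero points of A and B
  std::vector<uint8_t> A(M * K), B(K * N);
  for (int i = 0; i < M * K; ++i) A[i] = static_cast<uint8_t>(3 * i + 1);
  for (int i = 0; i < K * N; ++i) B[i] = static_cast<uint8_t>(5 * i + 2);

  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      int32_t direct = 0, dot = 0, row_sum = 0, col_sum = 0;
      for (int k = 0; k < K; ++k) {
        const int32_t a = A[i * K + k], b = B[k * N + j];
        direct += (a - za) * (b - zb);  // definition of the quantized product
        dot += a * b;                   // raw product accumulated by ComputeBlock
        row_sum += a;
        col_sum += b;
      }
      const int32_t depth_value = K * za * zb;
      assert(direct == dot + depth_value - zb * row_sum - za * col_sum);
    }
  }
  return 0;
}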
+// + +GemmU8U8KernelAvx512Function Avx512Vnni + + .end diff --git a/onnxruntime/core/mlas/lib/x86_64/SconvKernelAvx.S b/onnxruntime/core/mlas/lib/x86_64/SconvKernelAvx.S index 617119763a2b5..2163708dcb352 100644 --- a/onnxruntime/core/mlas/lib/x86_64/SconvKernelAvx.S +++ b/onnxruntime/core/mlas/lib/x86_64/SconvKernelAvx.S @@ -257,9 +257,15 @@ Arguments: .macro PostProcessBlock FilterCount, OutputCount .globl MlasConvPostProcessFloatAvxFilter\FilterCount\()Output\OutputCount\() +#if !defined(__APPLE__) + .hidden MlasConvPostProcessFloatAvxFilter\FilterCount\()Output\OutputCount\() +#endif MlasConvPostProcessFloatAvxFilter\FilterCount\()Output\OutputCount\(): .globl MlasConvPostProcessFloatFma3Filter\FilterCount\()Output\OutputCount\() +#if !defined(__APPLE__) + .hidden MlasConvPostProcessFloatFma3Filter\FilterCount\()Output\OutputCount\() +#endif MlasConvPostProcessFloatFma3Filter\FilterCount\()Output\OutputCount\(): .if \FilterCount\() > 2 diff --git a/onnxruntime/core/mlas/lib/x86_64/SconvKernelAvx512F.S b/onnxruntime/core/mlas/lib/x86_64/SconvKernelAvx512F.S index 873cb4dbf9431..55d2aa613f212 100644 --- a/onnxruntime/core/mlas/lib/x86_64/SconvKernelAvx512F.S +++ b/onnxruntime/core/mlas/lib/x86_64/SconvKernelAvx512F.S @@ -361,6 +361,9 @@ Arguments: .macro PostProcessBlock FilterCount, OutputCount .globl MlasConvPostProcessFloatAvx512FFilter\FilterCount\()Output\OutputCount\() +#if !defined(__APPLE__) + .hidden MlasConvPostProcessFloatAvx512FFilter\FilterCount\()Output\OutputCount\() +#endif MlasConvPostProcessFloatAvx512FFilter\FilterCount\()Output\OutputCount\(): .if \FilterCount\() > 2 diff --git a/onnxruntime/core/mlas/lib/x86_64/SconvKernelSse2.S b/onnxruntime/core/mlas/lib/x86_64/SconvKernelSse2.S index e5505ea48942e..4dbbf696e96f7 100644 --- a/onnxruntime/core/mlas/lib/x86_64/SconvKernelSse2.S +++ b/onnxruntime/core/mlas/lib/x86_64/SconvKernelSse2.S @@ -249,6 +249,9 @@ Arguments: .macro PostProcessBlock FilterCount, OutputCount .globl MlasConvPostProcessFloatSseFilter\FilterCount\()Output\OutputCount\() +#if !defined(__APPLE__) + .hidden MlasConvPostProcessFloatSseFilter\FilterCount\()Output\OutputCount\() +#endif MlasConvPostProcessFloatSseFilter\FilterCount\()Output\OutputCount\(): .if \FilterCount\() > 2 diff --git a/onnxruntime/core/mlas/lib/x86_64/SgemmKernelAvx.S b/onnxruntime/core/mlas/lib/x86_64/SgemmKernelAvx.S index 0147f08edc821..63c6d5d2c837e 100644 --- a/onnxruntime/core/mlas/lib/x86_64/SgemmKernelAvx.S +++ b/onnxruntime/core/mlas/lib/x86_64/SgemmKernelAvx.S @@ -374,10 +374,9 @@ C_UNDERSCORE(MlasSgemmKernel\Mode\()Avx): .L\Mode\().OutputMasked8x4Block: vmovd xmm0,r9d - mov rbp,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vshufps xmm0,xmm0,xmm0,0 - vpcmpgtd xmm1,xmm0,XMMWORD PTR [rbp+16] - vpcmpgtd xmm0,xmm0,XMMWORD PTR [rbp] + vpcmpgtd xmm1,xmm0,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip+16] + vpcmpgtd xmm0,xmm0,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] vinsertf128 ymm0,ymm0,xmm1,1 .ifeqs "\Mode\()","Add" vmaskmovps ymm8,ymm0,YMMWORD PTR [rdx] @@ -473,10 +472,9 @@ C_UNDERSCORE(MlasSgemmKernel\Mode\()Avx): .L\Mode\().OutputMasked8x2Block: vmovd xmm0,r9d - mov rbp,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vshufps xmm0,xmm0,xmm0,0 - vpcmpgtd xmm1,xmm0,XMMWORD PTR [rbp+16] - vpcmpgtd xmm0,xmm0,XMMWORD PTR [rbp] + vpcmpgtd xmm1,xmm0,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip+16] + vpcmpgtd xmm0,xmm0,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] vinsertf128 ymm0,ymm0,xmm1,1 .ifeqs "\Mode\()","Add" vmaskmovps ymm8,ymm0,YMMWORD PTR [rdx] @@ 
-540,10 +538,9 @@ C_UNDERSCORE(MlasSgemmKernel\Mode\()Avx): .L\Mode\().OutputMasked8x1Block: vmovd xmm0,r9d - mov rbp,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vshufps xmm0,xmm0,xmm0,0 - vpcmpgtd xmm1,xmm0,XMMWORD PTR [rbp+16] - vpcmpgtd xmm0,xmm0,XMMWORD PTR [rbp] + vpcmpgtd xmm1,xmm0,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip+16] + vpcmpgtd xmm0,xmm0,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] vinsertf128 ymm0,ymm0,xmm1,1 .ifeqs "\Mode\()","Add" vmaskmovps ymm8,ymm0,YMMWORD PTR [rdx] diff --git a/onnxruntime/core/mlas/lib/x86_64/SgemmKernelFma3.S b/onnxruntime/core/mlas/lib/x86_64/SgemmKernelFma3.S index cfeceb6be30f3..a7382f897946b 100644 --- a/onnxruntime/core/mlas/lib/x86_64/SgemmKernelFma3.S +++ b/onnxruntime/core/mlas/lib/x86_64/SgemmKernelFma3.S @@ -435,9 +435,8 @@ C_UNDERSCORE(MlasSgemmKernel\Mode\()Fma3): .L\Mode\().OutputMasked8x6Block: mov DWORD PTR [rsp+SgemmKernelFrame_mask],r9d - mov rbp,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vbroadcastss ymm0,DWORD PTR [rsp+SgemmKernelFrame_mask] - vpcmpgtd ymm0,ymm0,YMMWORD PTR [rbp] + vpcmpgtd ymm0,ymm0,YMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] .ifeqs "\Mode\()","Add" vmaskmovps ymm4,ymm0,YMMWORD PTR [rdx] vmaskmovps ymm6,ymm0,YMMWORD PTR [rdx+rax] @@ -550,9 +549,8 @@ C_UNDERSCORE(MlasSgemmKernel\Mode\()Fma3): .L\Mode\().OutputMasked8x3Block: mov DWORD PTR [rsp+SgemmKernelFrame_mask],r9d - mov rbp,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vbroadcastss ymm0,DWORD PTR [rsp+SgemmKernelFrame_mask] - vpcmpgtd ymm0,ymm0,YMMWORD PTR [rbp] + vpcmpgtd ymm0,ymm0,YMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] .ifeqs "\Mode\()","Add" vmaskmovps ymm4,ymm0,YMMWORD PTR [rdx] vmaskmovps ymm6,ymm0,YMMWORD PTR [rdx+rax] @@ -653,9 +651,8 @@ C_UNDERSCORE(MlasSgemmKernel\Mode\()Fma3): .L\Mode\().OutputMasked8x1Block: mov DWORD PTR [rsp+SgemmKernelFrame_mask],r9d - mov rbp,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vbroadcastss ymm0,DWORD PTR [rsp+SgemmKernelFrame_mask] - vpcmpgtd ymm0,ymm0,YMMWORD PTR [rbp] + vpcmpgtd ymm0,ymm0,YMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] .ifeqs "\Mode\()","Add" vmaskmovps ymm4,ymm0,YMMWORD PTR [rdx] vfmadd213ps ymm5,ymm2,ymm4 diff --git a/onnxruntime/core/mlas/lib/x86_64/SgemmKernelM1Avx.S b/onnxruntime/core/mlas/lib/x86_64/SgemmKernelM1Avx.S index 28fca0e956640..86bc82b23071b 100644 --- a/onnxruntime/core/mlas/lib/x86_64/SgemmKernelM1Avx.S +++ b/onnxruntime/core/mlas/lib/x86_64/SgemmKernelM1Avx.S @@ -80,10 +80,9 @@ C_UNDERSCORE(MlasSgemmKernelM1Avx): mov eax,r8d and eax,7 vmovd xmm7,eax - mov rbx,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vshufps xmm7,xmm7,xmm7,0 - vpcmpgtd xmm6,xmm7,XMMWORD PTR [rbx+16] - vpcmpgtd xmm7,xmm7,XMMWORD PTR [rbx] + vpcmpgtd xmm6,xmm7,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip+16] + vpcmpgtd xmm7,xmm7,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] vinsertf128 ymm7,ymm7,xmm6,1 // diff --git a/onnxruntime/core/mlas/lib/x86_64/SgemmKernelM1TransposeBAvx.S b/onnxruntime/core/mlas/lib/x86_64/SgemmKernelM1TransposeBAvx.S index 8d5ff17f90084..86bc9209fa248 100644 --- a/onnxruntime/core/mlas/lib/x86_64/SgemmKernelM1TransposeBAvx.S +++ b/onnxruntime/core/mlas/lib/x86_64/SgemmKernelM1TransposeBAvx.S @@ -79,10 +79,9 @@ C_UNDERSCORE(MlasSgemmKernelM1TransposeBAvx): mov eax,ecx and eax,7 vmovd xmm7,eax - mov rbx,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vshufps xmm7,xmm7,xmm7,0 - vpcmpgtd xmm6,xmm7,XMMWORD PTR [rbx+16] - vpcmpgtd xmm7,xmm7,XMMWORD PTR [rbx] + vpcmpgtd xmm6,xmm7,XMMWORD PTR 
C_UNDERSCORE(MlasMaskMoveAvx)[rip+16] + vpcmpgtd xmm7,xmm7,XMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] vinsertf128 ymm7,ymm7,xmm6,1 // diff --git a/onnxruntime/core/mlas/lib/x86_64/TanhKernelFma3.S b/onnxruntime/core/mlas/lib/x86_64/TanhKernelFma3.S index 61bbef5c91171..dd5584648dbe7 100644 --- a/onnxruntime/core/mlas/lib/x86_64/TanhKernelFma3.S +++ b/onnxruntime/core/mlas/lib/x86_64/TanhKernelFma3.S @@ -72,7 +72,7 @@ Return Value: .globl C_UNDERSCORE(MlasTanhKernelFma3) C_UNDERSCORE(MlasTanhKernelFma3): - mov rax,C_UNDERSCORE(MlasTanhConstants)@GOTPCREL[rip] + lea rax,C_UNDERSCORE(MlasTanhConstants)[rip] vbroadcastss ymm4,TanhConstants_LowerRange[rax] vbroadcastss ymm5,TanhConstants_UpperRange[rax] vbroadcastss ymm6,TanhConstants_alpha_13[rax] @@ -116,9 +116,8 @@ C_UNDERSCORE(MlasTanhKernelFma3): add rdx,8 # correct for over-subtract above jz .LExitKernel mov DWORD PTR TanhKernelFrame_CountN[rsp],edx - mov rcx,QWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)@GOTPCREL[rip] vbroadcastss ymm2,DWORD PTR TanhKernelFrame_CountN[rsp] - vpcmpgtd ymm2,ymm2,YMMWORD PTR [rcx] + vpcmpgtd ymm2,ymm2,YMMWORD PTR C_UNDERSCORE(MlasMaskMoveAvx)[rip] vmaskmovps ymm0,ymm2,YMMWORD PTR [rdi] vmaxps ymm0,ymm4,ymm0 # clamp lower bound vminps ymm0,ymm5,ymm0 # clamp upper bound diff --git a/onnxruntime/core/optimizer/optimizer_execution_frame.cc b/onnxruntime/core/optimizer/optimizer_execution_frame.cc index cd64d2398228b..fb14f762969d8 100644 --- a/onnxruntime/core/optimizer/optimizer_execution_frame.cc +++ b/onnxruntime/core/optimizer/optimizer_execution_frame.cc @@ -1,3 +1,5 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. #include "core/common/common.h" #include "core/common/status.h" @@ -9,7 +11,7 @@ #include "core/framework/mldata_type_utils.h" #include "core/framework/kernel_registry.h" #include "core/framework/fuse_nodes_funcs.h" -#include "core/common/callback.h" +#include "core/framework/callback.h" #include "core/optimizer/optimizer_execution_frame.h" namespace onnxruntime { diff --git a/onnxruntime/core/optimizer/optimizer_execution_frame.h b/onnxruntime/core/optimizer/optimizer_execution_frame.h index 41f5e85215699..cb507feb9a57a 100644 --- a/onnxruntime/core/optimizer/optimizer_execution_frame.h +++ b/onnxruntime/core/optimizer/optimizer_execution_frame.h @@ -11,7 +11,7 @@ #include "core/framework/execution_frame.h" #include "core/framework/ort_value_name_idx_map.h" #include "core/framework/ml_value.h" -#include "core/common/callback.h" +#include "core/framework/callback.h" namespace onnxruntime { class DataTransferManager; diff --git a/onnxruntime/core/optimizer/transformer_memcpy.cc b/onnxruntime/core/optimizer/transformer_memcpy.cc index c6f57900b9881..bdf44761bba18 100644 --- a/onnxruntime/core/optimizer/transformer_memcpy.cc +++ b/onnxruntime/core/optimizer/transformer_memcpy.cc @@ -52,7 +52,7 @@ class TransformerMemcpyImpl { std::string provider_; }; -/** Helper that returns a pointer to the corresponding TensorProto for a name if it is an initializer. +/** Helper that returns a pointer to the corresponding TensorProto for a name if it is an initializer. @param check_outer_scope If true and the graph is a subgraph, check parent graph/s for 'name' if not found in 'graph'. 
*/ static const onnx::TensorProto* GetInitializer(const Graph& graph, const std::string& name, bool check_outer_scope) { @@ -73,7 +73,6 @@ common::Status MemcpyTransformer::ApplyImpl(Graph& graph, bool& modified, int gr provider != onnxruntime::kMklDnnExecutionProvider && provider != onnxruntime::kNGraphExecutionProvider && provider != onnxruntime::kNupharExecutionProvider && - provider != onnxruntime::kTensorrtExecutionProvider && provider != onnxruntime::kOpenVINOExecutionProvider) { TransformerMemcpyImpl copy_impl(graph, provider); auto current_modified = copy_impl.ModifyGraph(registry_manager_); @@ -100,7 +99,7 @@ common::Status MemcpyTransformer::ApplyImpl(Graph& graph, bool& modified, int gr Overview: The transformer transforms the input graph as follows: -(1) For every initializer W that is referenced by both provider and non-provider nodes, +(1) For every initializer W that is referenced by both provider and non-provider nodes, we create a duplicate initializer W2 and change all provider nodes to reference this duplicate copy. @@ -167,7 +166,9 @@ bool TransformerMemcpyImpl::ModifyGraph(const KernelRegistryManager& kernel_regi } void TransformerMemcpyImpl::ProcessDefs(onnxruntime::Node& node, const KernelRegistryManager& kernel_registries, InitializedTensorSet& initializers_consumed) { - if (node.GetExecutionProviderType() == provider_) { + if (node.GetExecutionProviderType() == provider_ + || (node.GetExecutionProviderType() == kCudaExecutionProvider && provider_ == kTensorrtExecutionProvider) + || (node.GetExecutionProviderType() == kTensorrtExecutionProvider && provider_ == kCudaExecutionProvider)) { provider_nodes_.insert(&node); // note KernelCreateInfo might be nullptr for custom kernel const KernelCreateInfo* kci = nullptr; @@ -206,7 +207,7 @@ void TransformerMemcpyImpl::ProcessDefs(onnxruntime::Node& node, const KernelReg } } else { // TODO: copy between devices? i.e. multiple GPUs - if (node.GetExecutionProviderType() != onnxruntime::kCpuExecutionProvider && node.GetExecutionProviderType() != onnxruntime::kTensorrtExecutionProvider && + if (node.GetExecutionProviderType() != onnxruntime::kCpuExecutionProvider && node.GetExecutionProviderType() != onnxruntime::kNGraphExecutionProvider && !node.GetExecutionProviderType().empty()) { ORT_THROW("Execution type '", node.GetExecutionProviderType(), "' doesn't support memcpy "); } diff --git a/onnxruntime/core/platform/env.h b/onnxruntime/core/platform/env.h index d27cdbaae6833..c9199faf7f168 100644 --- a/onnxruntime/core/platform/env.h +++ b/onnxruntime/core/platform/env.h @@ -24,7 +24,7 @@ limitations under the License. 
#include #include "core/common/common.h" -#include "core/common/callback.h" +#include "core/framework/callback.h" #include "core/platform/env_time.h" #ifndef _WIN32 diff --git a/onnxruntime/core/platform/posix/env.cc b/onnxruntime/core/platform/posix/env.cc index 2b7e50dee38aa..2d34bea1e2bfc 100644 --- a/onnxruntime/core/platform/posix/env.cc +++ b/onnxruntime/core/platform/posix/env.cc @@ -41,7 +41,7 @@ namespace onnxruntime { namespace { constexpr int OneMillion = 1000000; -static void ORT_API_CALL DeleteBuffer(void* param) noexcept { ::free(param); } +static void DeleteBuffer(void* param) noexcept { ::free(param); } class UnmapFileParam { public: @@ -50,7 +50,7 @@ class UnmapFileParam { int fd; }; -static void ORT_API_CALL UnmapFile(void* param) noexcept { +static void UnmapFile(void* param) noexcept { UnmapFileParam* p = reinterpret_cast(param); int ret = munmap(p->addr, p->len); if (ret != 0) { @@ -124,7 +124,7 @@ class PosixEnv : public Env { } common::Status ReadFileAsString(const char* fname, off_t offset, void*& p, size_t& len, - OrtCallback& deleter) const override { + OrtCallback& deleter) const override { if (!fname) { return common::Status(common::ONNXRUNTIME, common::INVALID_ARGUMENT, "ReadFileAsString: 'fname' cannot be NULL"); } @@ -180,7 +180,7 @@ class PosixEnv : public Env { char buf[1024]; const char* msg = ""; if (e > 0) { -#if defined(__GLIBC__) && defined(_GNU_SOURCE) && !defined (__ANDROID__) +#if defined(__GLIBC__) && defined(_GNU_SOURCE) && !defined(__ANDROID__) msg = strerror_r(e, buf, sizeof(buf)); #else // for Mac OS X and Android lower than API 23 diff --git a/onnxruntime/core/platform/windows/env.cc b/onnxruntime/core/platform/windows/env.cc index 1f0e9bcac4410..010077d07b273 100644 --- a/onnxruntime/core/platform/windows/env.cc +++ b/onnxruntime/core/platform/windows/env.cc @@ -30,7 +30,7 @@ namespace onnxruntime { namespace { -static void ORT_API_CALL DeleteBuffer(void* param) noexcept { ::free(param); } +static void DeleteBuffer(void* param) noexcept { ::free(param); } class WindowsEnv : public Env { public: diff --git a/onnxruntime/core/providers/common.h b/onnxruntime/core/providers/common.h index 16a065cb8cc1a..c23dded2f9c40 100644 --- a/onnxruntime/core/providers/common.h +++ b/onnxruntime/core/providers/common.h @@ -4,6 +4,7 @@ #pragma once #include "core/common/common.h" +#include "core/framework/tensor.h" namespace onnxruntime { @@ -20,4 +21,25 @@ inline int64_t HandleNegativeAxis(int64_t axis, int64_t tensor_rank) { return axis = axis < 0 ? 
axis + tensor_rank : axis; } +/** +Returns true if given tensor is a scalar or 1D tensor of size 1 +**/ +inline bool IsScalarOr1ElementVector(const Tensor* input) { + if (input->Shape().NumDimensions() == 0 || + (input->Shape().NumDimensions() == 1 && input->Shape().GetDims().size() == 1)) { + return true; + } else { + return false; + } +} + +/** +Clamps input between provided min and max values +**/ +inline float clamp(float v, float lo, float hi) { + if (v < lo) return lo; + if (v > hi) return hi; + return v; +} + } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/controlflow/loop.cc b/onnxruntime/core/providers/cpu/controlflow/loop.cc index 3937ab4ed53f3..89dbf9f8d783d 100644 --- a/onnxruntime/core/providers/cpu/controlflow/loop.cc +++ b/onnxruntime/core/providers/cpu/controlflow/loop.cc @@ -257,7 +257,6 @@ Status LoopImpl::CreateFeedsFetchesManager(std::unique_ptr& feed_names.push_back(entry.first); } - FeedsFetchesInfo ffi(feed_names, subgraph_output_names_); auto status = FeedsFetchesManager::Create(feed_names, subgraph_output_names_, session_state_.GetOrtValueNameIdxMap(), ffm); diff --git a/onnxruntime/core/providers/cpu/controlflow/scan_utils.cc b/onnxruntime/core/providers/cpu/controlflow/scan_utils.cc index 821c84a78c723..2c696066556e4 100644 --- a/onnxruntime/core/providers/cpu/controlflow/scan_utils.cc +++ b/onnxruntime/core/providers/cpu/controlflow/scan_utils.cc @@ -125,7 +125,6 @@ Status CreateFeedsFetchesManager(const GraphViewer& subgraph, int num_variadic_i feed_names.push_back(entry.first); } - FeedsFetchesInfo ffi(feed_names, subgraph_output_names); auto status = FeedsFetchesManager::Create(feed_names, subgraph_output_names, ort_value_name_idx_map, ffm); return status; diff --git a/onnxruntime/core/providers/cpu/controlflow/utils.h b/onnxruntime/core/providers/cpu/controlflow/utils.h index b5a39bcfdef4d..d3427e9104a50 100644 --- a/onnxruntime/core/providers/cpu/controlflow/utils.h +++ b/onnxruntime/core/providers/cpu/controlflow/utils.h @@ -26,7 +26,8 @@ common::Status SubgraphExecuteHelper(std::unique_ptr& cache } else { // use a local instance until we know we're successful, and cache if it is std::unique_ptr new_ffm; - impl.CreateFeedsFetchesManager(new_ffm); + ORT_RETURN_IF_ERROR(impl.CreateFeedsFetchesManager(new_ffm)); + status = impl.Execute(&*new_ffm, nullptr); if (status.IsOK()) { cached_feeds_fetches_manager = std::move(new_ffm); diff --git a/onnxruntime/core/providers/cpu/cpu_execution_provider.cc b/onnxruntime/core/providers/cpu/cpu_execution_provider.cc index 08b6a31938111..eb1c26203308f 100644 --- a/onnxruntime/core/providers/cpu/cpu_execution_provider.cc +++ b/onnxruntime/core/providers/cpu/cpu_execution_provider.cc @@ -9,12 +9,16 @@ #include "contrib_ops/cpu_contrib_kernels.h" #endif +#ifdef MICROSOFT_AUTOML +#include "automl_ops/cpu_automl_kernels.h" +#endif + #include "core/framework/compute_capability.h" namespace onnxruntime { // Forward declarations of op kernels -class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 6, Clip); +class ONNX_OPERATOR_VERSIONED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 6, 10, Clip); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 6, Elu); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 6, HardSigmoid); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 6, LeakyRelu); @@ -132,6 +136,7 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, class 
ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, int32_t, ReduceLogSumExp); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, float, ReduceMax); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, int32_t, ReduceMax); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, int64_t, ReduceMax); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, float, ReduceMean); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, int32_t, ReduceMean); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, float, ReduceMin); @@ -141,6 +146,7 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, float, ReduceSum); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, double, ReduceSum); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, int32_t, ReduceSum); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, int64_t, ReduceSum); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, float, ReduceSumSquare); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, double, ReduceSumSquare); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 1, int32_t, ReduceSumSquare); @@ -218,6 +224,7 @@ class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, Mea class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int32_t, Greater); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int64_t, Greater); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int32_t, Less); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int64_t, Less); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, string, Cast); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, EyeLike); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, float, IsNaN); @@ -232,6 +239,8 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, float_float_float, OneHot); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int64_t_int32_t_float, OneHot); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int64_t_float_int64_t, OneHot); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int32_t_float_int32_t, OneHot); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int32_t_float_float, OneHot); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, MaxUnpool); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, Sinh); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, Cosh); @@ -247,9 +256,11 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, float, NonZero); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, 
kOnnxDomain, 9, int32_t, NonZero); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int64_t, NonZero); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, uint8_t, NonZero); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, string, Where); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, float, Where); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int32_t, Where); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 9, int64_t, Where); // Opset 10 class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, StringNormalizer); @@ -263,9 +274,11 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, ThresholdedRelu); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, uint8_t, DequantizeLinear); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, int8_t, DequantizeLinear); -class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, float, QuantizeLinear); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, uint8_t, QuantizeLinear); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, int8_t, QuantizeLinear); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, QLinearMatMul); -class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, MatMulInteger); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, uint8_t, MatMulInteger); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, int8_t, MatMulInteger); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, ConvInteger); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, QLinearConv); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, bool, Slice); @@ -288,9 +301,13 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, double, RoiAlign); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 10, ReverseSequence); +// opset 11 +class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 11, Clip); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 11, uint8_t, DynamicQuantizeLinear); + void RegisterOnnxOperatorKernels(KernelRegistry& kernel_registry) { static const BuildKernelCreateInfoFn function_table[] = { - BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, @@ -408,6 +425,7 @@ void RegisterOnnxOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, @@ -415,8 +433,9 @@ void RegisterOnnxOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, - BuildKernelCreateInfo, BuildKernelCreateInfo, + BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, @@ -494,6 +513,7 @@ void 
RegisterOnnxOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, @@ -508,6 +528,8 @@ void RegisterOnnxOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, + BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, @@ -523,9 +545,11 @@ void RegisterOnnxOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, + BuildKernelCreateInfo, // Opset 10 BuildKernelCreateInfo, @@ -539,9 +563,11 @@ void RegisterOnnxOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, - BuildKernelCreateInfo, + BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, - BuildKernelCreateInfo, + BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, @@ -563,6 +589,10 @@ void RegisterOnnxOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, + + //opset 11 + BuildKernelCreateInfo, + BuildKernelCreateInfo, }; for (auto& function_table_entry : function_table) { @@ -588,7 +618,8 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, int64_t_double, DictVectorizer); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, FeatureVectorizer); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, Imputer); -class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, LabelEncoder); + +class ONNX_OPERATOR_VERSIONED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, 1, LabelEncoder); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, float, LinearClassifier); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, double, LinearClassifier); class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, int64_t, LinearClassifier); @@ -615,6 +646,13 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, TreeEnsembleRegressor); class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 1, ZipMap); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 2, float_string, LabelEncoder); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 2, string_float, LabelEncoder); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 2, int64_float, LabelEncoder); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 2, float_int64, LabelEncoder); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 2, int64_string, LabelEncoder); +class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMLDomain, 2, string_int64, LabelEncoder); + void RegisterOnnxMLOperatorKernels(KernelRegistry& kernel_registry) { static const BuildKernelCreateInfoFn function_table[] = { BuildKernelCreateInfo, @@ -633,7 +671,7 @@ void 
RegisterOnnxMLOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, - BuildKernelCreateInfo, + BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, @@ -659,6 +697,13 @@ void RegisterOnnxMLOperatorKernels(KernelRegistry& kernel_registry) { BuildKernelCreateInfo, BuildKernelCreateInfo, BuildKernelCreateInfo, + + BuildKernelCreateInfo, + BuildKernelCreateInfo, + BuildKernelCreateInfo, + BuildKernelCreateInfo, + BuildKernelCreateInfo, + BuildKernelCreateInfo, }; for (auto& function_table_entry : function_table) { @@ -673,6 +718,9 @@ static void RegisterCPUKernels(KernelRegistry& kernel_registry) { #ifndef DISABLE_CONTRIB_OPS ::onnxruntime::contrib::RegisterCpuContribKernels(kernel_registry); #endif +#ifdef MICROSOFT_AUTOML + ::onnxruntime::automl::RegisterCpuAutoMLKernels(kernel_registry); +#endif } std::shared_ptr GetCpuKernelRegistry() { diff --git a/onnxruntime/core/providers/cpu/generator/random.cc b/onnxruntime/core/providers/cpu/generator/random.cc index fc8f7c5917873..a688241d71867 100644 --- a/onnxruntime/core/providers/cpu/generator/random.cc +++ b/onnxruntime/core/providers/cpu/generator/random.cc @@ -76,8 +76,6 @@ void GenerateData(std::default_random_engine& generator, TDistribution distribut static Status RandomNormalCompute(float mean, float scale, std::default_random_engine& generator, TensorProto::DataType dtype, Tensor& Y); static Status RandomUniformCompute(float high, float low, std::default_random_engine& generator, TensorProto::DataType dtype, Tensor& Y); -// Leaving in case we need to change to this approach -//static Status CreateOutputTensorFromTensorValues(OpKernelContext* ctx, const Tensor& X,Tensor** Y); static Status CreateOutputTensorFromTensorShape(OpKernelContext* ctx, const Tensor& X, Tensor** Y); static TensorProto::DataType InferDataType(const Tensor& tensor); @@ -168,53 +166,48 @@ static Status MultinomialCompute(OpKernelContext* ctx, Eigen::array Y_dims = {{batch_size, num_samples}}; Matrix output = Matrix(Y.template MutableData(), Y_dims); - // TODO (perf optimization) - the idea behind making this a lambda is so that we can parallelize across batches. - // When we do that this lamdba will act as one task given to a thread - auto DoWork = [ctx, num_samples, num_classes, &generator, &logits, &output](int64_t start_row, - int64_t limit_row) { - std::default_random_engine generator_copy = generator; - // BEGIN create temporary tensor - AllocatorPtr alloc; - ctx->GetTempSpaceAllocator(&alloc); - auto cdf_data = static_cast(alloc->Alloc(sizeof(double) * num_classes)); - BufferUniquePtr cdf_buffer(cdf_data, BufferDeleter(alloc)); - Eigen::array cdf_dims = {{num_classes}}; - auto cdf = EigenVector(cdf_data, cdf_dims); - // END create temporary tensor - - std::uniform_real_distribution dist(0.0, 1.0); // TODO: should this be initialized per batch? - for (int64_t b = start_row; b < limit_row; ++b) { - const float* logits_row = &(logits(b, 0)); - // Takes an along-class maximum (for numerical stability). 
- float maxx = std::numeric_limits::lowest(); - for (int64_t j = 0; j < num_classes; ++j) { - if (Eigen::numext::isfinite(logits_row[j])) { - maxx = std::max(maxx, logits_row[j]); - } + // BEGIN create temporary tensor + AllocatorPtr alloc; + ORT_RETURN_IF_ERROR(ctx->GetTempSpaceAllocator(&alloc)); + auto cdf_data = static_cast(alloc->Alloc(sizeof(double) * num_classes)); + BufferUniquePtr cdf_buffer(cdf_data, BufferDeleter(alloc)); + Eigen::array cdf_dims = {{num_classes}}; + auto cdf = EigenVector(cdf_data, cdf_dims); + // END create temporary tensor + + std::uniform_real_distribution dist(0.0, 1.0); // TODO: should this be initialized per batch? + + for (int64_t b = 0; b < batch_size; ++b) { + const float* logits_row = &(logits(b, 0)); + // Takes an along-class maximum (for numerical stability). + float maxx = std::numeric_limits::lowest(); + for (int64_t j = 0; j < num_classes; ++j) { + if (Eigen::numext::isfinite(logits_row[j])) { + maxx = std::max(maxx, logits_row[j]); } - const auto max_logit = static_cast(maxx); - - // Precompute cumulative probability distribution across classes. - // Note: This isn't normalized. - cdf = (logits.chip<0>(b).cast() - max_logit).exp(); - double running_total = 0; - for (int64_t j = 0; j < num_classes; ++j) { - if (Eigen::numext::isfinite(logits_row[j])) { - running_total += cdf(j); - } - cdf(j) = running_total; - } - // Generate each sample. - const double* cdf_begin = cdf.data(); - const double* cdf_end = cdf.data() + num_classes; - for (int64_t j = 0; j < num_samples; ++j) { - const double to_find = dist(generator_copy) * running_total; - auto found_iter = std::upper_bound(cdf_begin, cdf_end, to_find); - output(b, j) = static_cast(std::distance(cdf_begin, found_iter)); + } + const auto max_logit = static_cast(maxx); + + // Precompute cumulative probability distribution across classes. + // Note: This isn't normalized. + cdf = (logits.chip<0>(b).cast() - max_logit).exp(); + double running_total = 0; + for (int64_t j = 0; j < num_classes; ++j) { + if (Eigen::numext::isfinite(logits_row[j])) { + running_total += cdf(j); } + cdf(j) = running_total; + } + // Generate each sample. + const double* cdf_begin = cdf.data(); + const double* cdf_end = cdf.data() + num_classes; + for (int64_t j = 0; j < num_samples; ++j) { + const double to_find = dist(generator) * running_total; + auto found_iter = std::upper_bound(cdf_begin, cdf_end, to_find); + output(b, j) = static_cast(std::distance(cdf_begin, found_iter)); } - }; - DoWork(0, batch_size); + } + return Status::OK(); } @@ -262,32 +255,6 @@ Status Multinomial::Compute(OpKernelContext* ctx) const { return status; } -/* -alternative interpretation of the spec is that the input tensor contains the dimensions as ints. -Keeping this temporarily in case we go back to that. - -// read shape information from input tensor and create output tensor with it -static Status CreateOutputTensorFromTensorValues(OpKernelContext* ctx, const Tensor& X, Tensor** Y) { - const TensorShape& shape = X.Shape(); - auto size = shape.Size(); - auto num_dims = shape.NumDimensions(); - - if (num_dims != 1) { - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Expected 1 dimension tensor with shape information. 
Dimensions=", num_dims); - } - - std::vector dims; - dims.reserve(shape.Size()); - - auto data = gsl::make_span(tensor.template Data(), shape.Size()); - dims.insert(dims.cbegin(), data.cbegin(), data.cend()); - - *Y = ctx->Output(0, TensorShape(dims)); - - return Status::OK(); -} -*/ - // create output tensor using shape of input tensor static Status CreateOutputTensorFromTensorShape(OpKernelContext* ctx, const Tensor& X, Tensor** Y) { const TensorShape& shape = X.Shape(); @@ -363,9 +330,11 @@ static Status RandomUniformCompute(float low, float high, template void GenerateData(std::default_random_engine& generator, TDistribution distribution, Tensor& tensor) { - auto out = gsl::make_span(tensor.template MutableData(), tensor.Shape().Size()); - - std::for_each(out.begin(), out.end(), [&generator, &distribution](T& value) { value = distribution(generator); }); + T* out = tensor.MutableData(); + for (int64_t i = 0, end = tensor.Shape().Size(); i < end; ++i) { + *out = distribution(generator); + ++out; + } } } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/generator/random.h b/onnxruntime/core/providers/cpu/generator/random.h index 6ef8d2c460553..639341d1a29cc 100644 --- a/onnxruntime/core/providers/cpu/generator/random.h +++ b/onnxruntime/core/providers/cpu/generator/random.h @@ -20,11 +20,14 @@ class RandomNormal final : public OpKernel { // read optional seed attribute and generate if not provided float seed = 0.f; - if (!info.GetAttr("seed", &seed).IsOK()) { - seed = gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()); + if (info.GetAttr("seed", &seed).IsOK()) { + generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; + } + else { + generator_ = std::default_random_engine{ + gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()) + }; } - - generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; int64_t dtype; ORT_ENFORCE(info.GetAttr("dtype", &dtype).IsOK()); @@ -60,11 +63,14 @@ class RandomNormalLike final : public OpKernel { // read optional seed attribute and generate if not provided float seed = 0.f; - if (!info.GetAttr("seed", &seed).IsOK()) { - seed = gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()); + if (info.GetAttr("seed", &seed).IsOK()) { + generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; + } + else { + generator_ = std::default_random_engine{ + gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()) + }; } - - generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; int64_t dtype; if (info.GetAttr("dtype", &dtype).IsOK()) { @@ -94,11 +100,14 @@ class RandomUniform final : public OpKernel { // read optional seed attribute and generate if not provided float seed = 0.f; - if (!info.GetAttr("seed", &seed).IsOK()) { - seed = gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()); + if (info.GetAttr("seed", &seed).IsOK()) { + generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; + } + else { + generator_ = std::default_random_engine{ + gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()) + }; } - - generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; int64_t dtype; ORT_ENFORCE(info.GetAttr("dtype", &dtype).IsOK()); @@ -131,11 +140,14 @@ class RandomUniformLike final : public OpKernel { ORT_ENFORCE(info.GetAttr("low", &low_).IsOK()); // read optional seed attribute and generate if not provided float 
seed = 0.f; - if (!info.GetAttr("seed", &seed).IsOK()) { - seed = gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()); + if (info.GetAttr("seed", &seed).IsOK()) { + generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; + } + else { + generator_ = std::default_random_engine{ + gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()) + }; } - - generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; int64_t dtype; if (info.GetAttr("dtype", &dtype).IsOK()) { @@ -163,11 +175,14 @@ class Multinomial final : public OpKernel { ORT_ENFORCE(info.GetAttr("sample_size", &num_samples_).IsOK()); float seed = 0.f; - if (!info.GetAttr("seed", &seed).IsOK()) { - seed = gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()); + if (info.GetAttr("seed", &seed).IsOK()) { + generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; + } + else { + generator_ = std::default_random_engine{ + gsl::narrow_cast(std::chrono::high_resolution_clock::now().time_since_epoch().count()) + }; } - - generator_ = std::default_random_engine{gsl::narrow_cast(seed)}; int64_t output_dtype_tmp; if (!info.GetAttr("dtype", &output_dtype_tmp).IsOK()) { diff --git a/onnxruntime/core/providers/cpu/math/clip.cc b/onnxruntime/core/providers/cpu/math/clip.cc index 160d587df7238..dc99582ddca04 100644 --- a/onnxruntime/core/providers/cpu/math/clip.cc +++ b/onnxruntime/core/providers/cpu/math/clip.cc @@ -5,9 +5,16 @@ namespace onnxruntime { -ONNX_CPU_OPERATOR_KERNEL( +ONNX_CPU_OPERATOR_VERSIONED_KERNEL( Clip, 6, + 10, + KernelDefBuilder().MayInplace(0, 0).TypeConstraint("T", DataTypeImpl::GetTensorType()), + Clip_6); + +ONNX_CPU_OPERATOR_KERNEL( + Clip, + 11, KernelDefBuilder().MayInplace(0, 0).TypeConstraint("T", DataTypeImpl::GetTensorType()), Clip); diff --git a/onnxruntime/core/providers/cpu/math/clip.h b/onnxruntime/core/providers/cpu/math/clip.h index 653967547bb02..b4ef64398dddf 100644 --- a/onnxruntime/core/providers/cpu/math/clip.h +++ b/onnxruntime/core/providers/cpu/math/clip.h @@ -10,9 +10,9 @@ namespace onnxruntime { template -class Clip final : public OpKernel { +class Clip_6 final : public OpKernel { public: - Clip(const OpKernelInfo& info) : OpKernel(info) { + Clip_6(const OpKernelInfo& info) : OpKernel(info) { ORT_ENFORCE(info.GetAttr("max", &max_).IsOK()); ORT_ENFORCE(info.GetAttr("min", &min_).IsOK()); } @@ -32,4 +32,36 @@ class Clip final : public OpKernel { T min_; }; +template +class Clip final : public OpKernel { + public: + Clip(const OpKernelInfo& info) : OpKernel(info) { + } + + Status Compute(OpKernelContext* ctx) const override { + const auto* X = ctx->Input(0); + const auto* min = ctx->Input(1); + const auto* max = ctx->Input(2); + Tensor* Y = ctx->Output(0, X->Shape()); + + auto min_val = -std::numeric_limits::infinity(); + auto max_val = std::numeric_limits::infinity(); + if (min) { + ORT_ENFORCE(min->Shape().NumDimensions() == 0, "min should be a scalar."); + min_val = *(min->template Data()); + } + if (max) { + ORT_ENFORCE(max->Shape().NumDimensions() == 0, "max should be a scalar."); + max_val = *(max->template Data()); + } + + EigenVectorMap(Y->template MutableData(), Y->Shape().Size()) = + ConstEigenVectorMap(X->template Data(), X->Shape().Size()) + .cwiseMax(min_val) + .cwiseMin(max_val); + + return Status::OK(); + } +}; + } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/math/element_wise_ops.cc b/onnxruntime/core/providers/cpu/math/element_wise_ops.cc index 
ece68834e7525..7c13c98a745a6 100644 --- a/onnxruntime/core/providers/cpu/math/element_wise_ops.cc +++ b/onnxruntime/core/providers/cpu/math/element_wise_ops.cc @@ -18,6 +18,15 @@ namespace onnxruntime { KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), \ KERNEL_CLASS); +#define REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(OP_TYPE, VERSION, TYPE, KERNEL_CLASS) \ + ONNX_CPU_OPERATOR_TYPED_KERNEL( \ + OP_TYPE, \ + VERSION, \ + TYPE, \ + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()) \ + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), \ + KERNEL_CLASS); + #define REG_ELEMENTWISE_VERSIONED_TYPED_KERNEL(OP_TYPE, VERSION_FROM, VERSION_TO, TYPE, KERNEL_CLASS) \ ONNX_CPU_OPERATOR_VERSIONED_TYPED_KERNEL( \ OP_TYPE, \ @@ -26,6 +35,15 @@ namespace onnxruntime { KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), \ KERNEL_CLASS); +#define REG_ELEMENTWISE_LOGICALOP_VERSIONED_TYPED_KERNEL(OP_TYPE, VERSION_FROM, VERSION_TO, TYPE, KERNEL_CLASS) \ + ONNX_CPU_OPERATOR_VERSIONED_TYPED_KERNEL( \ + OP_TYPE, \ + VERSION_FROM, VERSION_TO, \ + TYPE, \ + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()) \ + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), \ + KERNEL_CLASS); + REG_ELEMENTWISE_TYPED_KERNEL(Add, 7, float, Add); REG_ELEMENTWISE_TYPED_KERNEL(Add, 7, double, Add); REG_ELEMENTWISE_TYPED_KERNEL(Add, 7, int32_t, Add); @@ -88,45 +106,55 @@ REG_ELEMENTWISE_VERSIONED_TYPED_KERNEL(Max, 6, 7, float, Max_6); REG_ELEMENTWISE_TYPED_KERNEL(Max, 8, float, Max_8); REG_ELEMENTWISE_TYPED_KERNEL(Max, 8, double, Max_8); -REG_ELEMENTWISE_VERSIONED_TYPED_KERNEL(Less, 7, 9, float, Less); -REG_ELEMENTWISE_TYPED_KERNEL(Less, 9, int32_t, Less); +REG_ELEMENTWISE_LOGICALOP_VERSIONED_TYPED_KERNEL(Less, 7, 9, float, Less); +REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Less, 9, int32_t, Less); +REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Less, 9, int64_t, Less); -REG_ELEMENTWISE_VERSIONED_TYPED_KERNEL(Greater, 7, 9, float, Greater) -REG_ELEMENTWISE_TYPED_KERNEL(Greater, 9, int32_t, Greater); -REG_ELEMENTWISE_TYPED_KERNEL(Greater, 9, int64_t, Greater); +REG_ELEMENTWISE_LOGICALOP_VERSIONED_TYPED_KERNEL(Greater, 7, 9, float, Greater) +REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Greater, 9, int32_t, Greater); +REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Greater, 9, int64_t, Greater); -REG_ELEMENTWISE_TYPED_KERNEL(Equal, 7, bool, Equal); -REG_ELEMENTWISE_TYPED_KERNEL(Equal, 7, int32_t, Equal); -REG_ELEMENTWISE_TYPED_KERNEL(Equal, 7, int64_t, Equal); -REG_ELEMENTWISE_TYPED_KERNEL(Equal, 11, float, Equal); +REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Equal, 7, bool, Equal); +REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Equal, 7, int32_t, Equal); +REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Equal, 7, int64_t, Equal); +REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Equal, 11, float, Equal); REG_ELEMENTWISE_VERSIONED_TYPED_KERNEL(Mean, 6, 7, float, Mean_6); REG_ELEMENTWISE_TYPED_KERNEL(Mean, 8, float, Mean_8); REG_ELEMENTWISE_TYPED_KERNEL(Erf, 9, float, Erf); +// REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Not, 1, bool, Not); +// REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(And, 7, bool, And); +// REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Or, 7, bool, Or); +// REG_ELEMENTWISE_LOGICALOP_TYPED_KERNEL(Xor, 7, bool, Xor); + ONNX_CPU_OPERATOR_KERNEL( Not, 1, - KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), Not); ONNX_CPU_OPERATOR_KERNEL( And, 7, - 
KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), And); ONNX_CPU_OPERATOR_KERNEL( Or, 7, - KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), Or); ONNX_CPU_OPERATOR_KERNEL( Xor, 7, - KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), Xor); template diff --git a/onnxruntime/core/providers/cpu/math/element_wise_ops.h b/onnxruntime/core/providers/cpu/math/element_wise_ops.h index 035ed3ee69cfb..e9d28c8314adb 100644 --- a/onnxruntime/core/providers/cpu/math/element_wise_ops.h +++ b/onnxruntime/core/providers/cpu/math/element_wise_ops.h @@ -320,6 +320,11 @@ struct BroadcastIterator { return index; } + void Reserve(int64_t max_dims) { + deltas_.reserve(max_dims); + counts_.reserve(max_dims); + } + void Init(int64_t axis, int64_t largest) { ORT_ENFORCE(axis == 1 || axis == largest, "Attempting to broadcast an axis by a dimension other than 1. ", axis, " by ", largest); @@ -368,6 +373,8 @@ struct Broadcaster { size_t dimension_count_max = std::max(shape1.size(), shape2.size()); size_t dimension_count_min = std::min(shape1.size(), shape2.size()); output_shape_.resize(dimension_count_max); + iterator1_.Reserve(dimension_count_max); + iterator2_.Reserve(dimension_count_max); auto iter1 = shape1.end(); auto iter2 = shape2.end(); @@ -395,22 +402,22 @@ struct Broadcaster { *--output_shape = axis; } index++; // Manually increment since we processed one axis - } - - for (; index < dimension_count_min; index++) { - auto axis1 = *--iter1; - auto axis2 = *--iter2; + } else { + for (; index < dimension_count_min; index++) { + auto axis1 = *--iter1; + auto axis2 = *--iter2; - auto largest = std::max(axis1, axis2); - *--output_shape = largest; + auto largest = std::max(axis1, axis2); + *--output_shape = largest; - if (largest == 1 && index + 1 < dimension_count_min) // Nothing to do in this case - continue; + if (largest == 1 && index + 1 < dimension_count_min) // Nothing to do in this case + continue; - iterator1_.Init(axis1, largest); - iterator2_.Init(axis2, largest); - index++; // Manually increment since we processed one axis - break; + iterator1_.Init(axis1, largest); + iterator2_.Init(axis2, largest); + index++; // Manually increment since we processed one axis + break; + } } for (; index < dimension_count_min; index++) { diff --git a/onnxruntime/core/providers/cpu/math/gemm.h b/onnxruntime/core/providers/cpu/math/gemm.h index a3aa724ab410d..225754141a6d7 100644 --- a/onnxruntime/core/providers/cpu/math/gemm.h +++ b/onnxruntime/core/providers/cpu/math/gemm.h @@ -8,6 +8,7 @@ #include "core/util/math.h" #include "core/util/math_cpuonly.h" #include "gemm_helper.h" +#include "core/framework/op_kernel_context_internal.h" namespace onnxruntime { @@ -27,6 +28,9 @@ class Gemm : public OpKernel { } Status Compute(OpKernelContext* context) const override { + auto ctx_internal = static_cast(context); + concurrency::ThreadPool* tp = ctx_internal->GetOperatorThreadPool(); + const auto X = context->Input(0); const auto W = context->Input(1); const auto B = context->Input(2); @@ -64,7 +68,7 @@ class Gemm : public OpKernel { } // W * x - math::Gemm( + math::Gemm( trans_A_, trans_B_, 
M, @@ -75,7 +79,7 @@ class Gemm : public OpKernel { W->template Data(), beta_, y_data, - &CPUMathUtil::Instance()); + tp); FuseActivation(activation_, y_data, M * N, leaky_relu_alpha_); diff --git a/onnxruntime/core/providers/cpu/math/logsoftmax.cc b/onnxruntime/core/providers/cpu/math/logsoftmax.cc index 281031e71568e..19fbb9897c699 100644 --- a/onnxruntime/core/providers/cpu/math/logsoftmax.cc +++ b/onnxruntime/core/providers/cpu/math/logsoftmax.cc @@ -4,6 +4,8 @@ #include "core/providers/cpu/math/logsoftmax.h" #include "core/framework/op_kernel.h" +#include "core/framework/op_kernel_context_internal.h" + #include "core/providers/common.h" #include "core/providers/cpu/math/softmax_shared.h" #include "core/util/math.h" @@ -12,6 +14,9 @@ namespace onnxruntime { template <> Status LogSoftmax::Compute(OpKernelContext* ctx) const { + auto ctx_internal = static_cast(ctx); + concurrency::ThreadPool* tp = ctx_internal->GetOperatorThreadPool(); + const auto* tensor_pointer = ctx->Input(0); if (tensor_pointer == nullptr) return Status(common::ONNXRUNTIME, common::FAIL, "input count mismatch"); const Tensor& X = *tensor_pointer; @@ -32,7 +37,7 @@ Status LogSoftmax::Compute(OpKernelContext* ctx) const { const bool logarithmic = true; auto status = SoftmaxCPU(N, D, X.template Data(), Ydata, - scale_.data(), sum_multiplier_.data(), logarithmic, rowmax_.data()); + scale_.data(), sum_multiplier_.data(), logarithmic, rowmax_.data(), tp); return status; } diff --git a/onnxruntime/core/providers/cpu/math/matmul.cc b/onnxruntime/core/providers/cpu/math/matmul.cc index 539157e92bd95..4f4bacc34baeb 100644 --- a/onnxruntime/core/providers/cpu/math/matmul.cc +++ b/onnxruntime/core/providers/cpu/math/matmul.cc @@ -1,6 +1,6 @@ // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. 
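// The LogSoftmax hunk above (and the Softmax / softmax_shared hunks further below) route an
// operator-level thread pool into SoftmaxCPU(N, D, X, Y, scale, sum_multiplier, logarithmic,
// rowmax, tp), which treats the input as an N x D matrix and applies softmax or log-softmax
// along each row. The function below is a minimal, single-threaded reference of that math only;
// the name ReferenceSoftmax is illustrative, and it deliberately ignores the scratch buffers and
// GEMM-based broadcasting the production code uses.
#include <algorithm>
#include <cmath>
#include <cstdint>

void ReferenceSoftmax(int64_t N, int64_t D, const float* X, float* Y, bool logarithmic) {
  for (int64_t n = 0; n < N; ++n) {
    const float* x = X + n * D;
    float* y = Y + n * D;
    // Subtract the row maximum before exponentiating for numerical stability.
    const float row_max = *std::max_element(x, x + D);
    float sum = 0.0f;
    for (int64_t d = 0; d < D; ++d) {
      y[d] = std::exp(x[d] - row_max);
      sum += y[d];
    }
    for (int64_t d = 0; d < D; ++d) {
      // log(exp(x - max) / sum) simplifies to (x - max) - log(sum).
      y[d] = logarithmic ? (x[d] - row_max) - std::log(sum) : y[d] / sum;
    }
  }
}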
- +#include "core/framework/op_kernel_context_internal.h" #include "core/providers/cpu/math/matmul.h" #include "core/util/math.h" @@ -53,6 +53,9 @@ ONNX_CPU_OPERATOR_VERSIONED_TYPED_KERNEL( template Status MatMul::Compute(OpKernelContext* ctx) const { + auto ctx_internal = static_cast(ctx); + concurrency::ThreadPool* thread_pool = ctx_internal->GetOperatorThreadPool(); + const auto* left_X = ctx->Input(0); const auto* right_X = ctx->Input(1); @@ -69,7 +72,7 @@ Status MatMul::Compute(OpKernelContext* ctx) const { static_cast(helper.K()), left_X->template Data() + helper.LeftOffsets()[i], right_X->template Data() + helper.RightOffsets()[i], - Y->template MutableData() + helper.OutputOffsets()[i]); + Y->template MutableData() + helper.OutputOffsets()[i], thread_pool); } return Status::OK(); diff --git a/onnxruntime/core/providers/cpu/math/matmul_helper.h b/onnxruntime/core/providers/cpu/math/matmul_helper.h index af82037a7c465..e5095e0ea1382 100644 --- a/onnxruntime/core/providers/cpu/math/matmul_helper.h +++ b/onnxruntime/core/providers/cpu/math/matmul_helper.h @@ -29,9 +29,8 @@ class MatMulComputeHelper { M_ = left_shape.SizeToDimension(left_num_dims - 1); K_ = left_shape[left_num_dims - 1]; N_ = right_shape[right_num_dims - 1]; - std::vector output_dims = left_shape.GetDims(); - output_dims[left_num_dims - 1] = N_; - output_shape_ = TensorShape(output_dims); + output_shape_ = left_shape; + output_shape_[left_num_dims - 1] = N_; output_offsets_ = {0}; left_offsets_ = {0}; right_offsets_ = {0}; diff --git a/onnxruntime/core/providers/cpu/math/matmul_integer.cc b/onnxruntime/core/providers/cpu/math/matmul_integer.cc index 9a64e3fe42094..eab5434d24fb9 100644 --- a/onnxruntime/core/providers/cpu/math/matmul_integer.cc +++ b/onnxruntime/core/providers/cpu/math/matmul_integer.cc @@ -1,49 +1,40 @@ // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. -#ifdef _MSC_VER -#pragma warning(disable : 4244) -#pragma warning(disable : 4267) -#endif - #include "core/providers/cpu/math/matmul_integer.h" #include "core/providers/cpu/math/matmul_helper.h" -#include "core/util/gemmlowp_common_wrapper.h" +#include "core/util/qmath.h" +#include "core/providers/common.h" namespace onnxruntime { // only register this operator if low precision computation is enabled. -ONNX_OPERATOR_KERNEL_EX( +ONNX_OPERATOR_TYPED_KERNEL_EX( MatMulInteger, kOnnxDomain, 10, + uint8_t, kCpuExecutionProvider, KernelDefBuilder() .TypeConstraint("T1", DataTypeImpl::GetTensorType()) .TypeConstraint("T2", DataTypeImpl::GetTensorType()) .TypeConstraint("T3", DataTypeImpl::GetTensorType()), - MatMulInteger); - -Status GemmlowpMultiply(const uint8_t* lhs_data, const uint8_t* rhs_data, - int32_t* result_data, const int lhs_offset, const int rhs_offset, - int m, int n, int k) { - const std::tuple<> empty_pipeline = {}; - // TODO exp ColMajor order for rhs and result. 
That may be faster - const auto matOrder = gemmlowp::MapOrder::RowMajor; - gemmlowp::MatrixMap lhs(lhs_data, m, k); - gemmlowp::MatrixMap rhs(rhs_data, k, n); - gemmlowp::MatrixMap result(result_data, m, n); + MatMulInteger); - gemmlowp::GemmContext gemm_context; - gemmlowp::GemmWithOutputPipeline( - &gemm_context, lhs, rhs, &result, -lhs_offset, -rhs_offset, empty_pipeline); - - return Status::OK(); -} +ONNX_OPERATOR_TYPED_KERNEL_EX( + MatMulInteger, + kOnnxDomain, + 10, + int8_t, + kCpuExecutionProvider, + KernelDefBuilder() + .TypeConstraint("T1", DataTypeImpl::GetTensorType()) + .TypeConstraint("T2", DataTypeImpl::GetTensorType()) + .TypeConstraint("T3", DataTypeImpl::GetTensorType()), + MatMulInteger); -template<> -Status MatMulInteger::Compute(OpKernelContext* ctx) const { +template <> +Status MatMulInteger::Compute(OpKernelContext* ctx) const { auto a = ctx->Input(0); auto b = ctx->Input(1); ORT_ENFORCE(a != nullptr && b != nullptr); @@ -53,34 +44,79 @@ Status MatMulInteger::Compute(OpKernelContext* ctx) c Tensor* y = ctx->Output(0, helper.OutputShape()); // validate zero points - int32_t a_offset = 0; - int32_t b_offset = 0; + uint8_t a_offset = 0; + uint8_t b_offset = 0; if (has_a_zero_point_) { auto a_zero_point = ctx->Input(2); - ORT_ENFORCE(a_zero_point->Shape().NumDimensions() == 0 || - (a_zero_point->Shape().NumDimensions() == 1 && a_zero_point->Shape().GetDims().size() == 1), - "Currently only scalar zero_point is supported. TODO: add per channel zero point support."); + ORT_ENFORCE(IsScalarOr1ElementVector(a_zero_point), + "MatmulInteger : input1 zero point must be a scalar or 1D tensor of size 1"); a_offset = static_cast(*a_zero_point->template Data()); } if (has_b_zero_point_) { auto b_zero_point = ctx->Input(3); - ORT_ENFORCE(b_zero_point->Shape().NumDimensions() == 0 || - (b_zero_point->Shape().NumDimensions() == 1 && b_zero_point->Shape().GetDims().size() == 1), - "Currently only scalar zero_point is supported. TODO: add per channel zero point support."); + ORT_ENFORCE(IsScalarOr1ElementVector(b_zero_point), + "MatmulInteger : input2 zero point must be a scalar or 1D tensor of size 1"); b_offset = static_cast(*b_zero_point->template Data()); } for (size_t i = 0; i < helper.OutputOffsets().size(); i++) { - GemmlowpMultiply(a->template Data() + helper.LeftOffsets()[i], - b->template Data() + helper.RightOffsets()[i], - y->template MutableData() + helper.OutputOffsets()[i], - a_offset, - b_offset, - static_cast(helper.M()), - static_cast(helper.N()), - static_cast(helper.K())); + QGemmu8u8_s32(static_cast(helper.M()), + static_cast(helper.N()), + static_cast(helper.K()), + a->template Data() + helper.LeftOffsets()[i], + static_cast(helper.K()), + a_offset, + b->template Data() + helper.RightOffsets()[i], + static_cast(helper.N()), + b_offset, + y->template MutableData() + helper.OutputOffsets()[i], + static_cast(helper.N()), + nullptr); } + return Status::OK(); +} +template <> +Status MatMulInteger::Compute(OpKernelContext* ctx) const { + auto a = ctx->Input(0); + auto b = ctx->Input(1); + ORT_ENFORCE(a != nullptr && b != nullptr); + + MatMulComputeHelper helper; + ORT_RETURN_IF_ERROR(helper.Compute(a->Shape(), b->Shape())); + Tensor* y = ctx->Output(0, helper.OutputShape()); + + if (has_a_zero_point_ || has_b_zero_point_) { + // currently zero point is only supported in Gemmlowp path above + // in future, the selection of Eigen/Gemmlowp/mklml/etc. 
should be in a common math library like SGEMM + + auto IsZeroPointTensorAllZero = [](OpKernelContext* ctx, int input_idx) -> bool { + auto t = ctx->Input(input_idx); + ORT_ENFORCE(t->Shape().NumDimensions() <= 1 && t->Shape().Size() == 1, + "Currently only scalar zero_point is supported. TODO: add per channel zero point support."); + ORT_ENFORCE(t->DataType() == DataTypeImpl::GetType() || + t->DataType() == DataTypeImpl::GetType()); + auto data = reinterpret_cast(t->DataRaw()); + auto vec = std::vector(data, data + t->Shape().Size()); + return std::all_of(vec.begin(), vec.end(), [](int8_t v) { return v == 0; }); + }; + + if ((has_a_zero_point_ && !IsZeroPointTensorAllZero(ctx, 2)) || + (has_b_zero_point_ && !IsZeroPointTensorAllZero(ctx, 3))) { + ORT_NOT_IMPLEMENTED("MatMulInteger: Unsupported input types with zero point"); + } + } + + // NOTE: Eigen based implementation is a reference implementation for accuracy only + for (int i = 0; i < static_cast(helper.OutputOffsets().size()); i++) { + EigenCastGEMM( + a->template Data() + helper.LeftOffsets()[i], + b->template Data() + helper.RightOffsets()[i], + y->template MutableData() + helper.OutputOffsets()[i], + static_cast(helper.M()), + static_cast(helper.N()), + static_cast(helper.K())); + } return Status::OK(); } } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/math/matmul_integer.h b/onnxruntime/core/providers/cpu/math/matmul_integer.h index d9b5bbfbc9361..36e9c11707674 100644 --- a/onnxruntime/core/providers/cpu/math/matmul_integer.h +++ b/onnxruntime/core/providers/cpu/math/matmul_integer.h @@ -9,14 +9,14 @@ namespace onnxruntime { -template +template class MatMulInteger final : public OpKernel { public: MatMulInteger(const OpKernelInfo& info) : OpKernel(info) { has_a_zero_point_ = false; has_b_zero_point_ = false; if (info.GetInputCount() > 2) { - has_a_zero_point_ = true; + has_a_zero_point_ = true; } if (info.GetInputCount() > 3) { has_b_zero_point_ = true; @@ -29,4 +29,4 @@ class MatMulInteger final : public OpKernel { bool has_a_zero_point_; bool has_b_zero_point_; }; -} // namespace onnxruntime \ No newline at end of file +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/math/quantize_linear_matmul.cc b/onnxruntime/core/providers/cpu/math/quantize_linear_matmul.cc index b3cf0dcbe7094..164b58f208f70 100644 --- a/onnxruntime/core/providers/cpu/math/quantize_linear_matmul.cc +++ b/onnxruntime/core/providers/cpu/math/quantize_linear_matmul.cc @@ -1,14 +1,9 @@ // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. 
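// The hunk below drops this file's local QuantizeMultiplier helper (the Compute body keeps
// calling QuantizeMultiplier, which appears to come from the shared gemmlowp_common header now
// included by quantize_linear_matmul.h). For reference, the helper decomposes the real multiplier
// a_scale * b_scale / y_scale into a Q31 fixed-point multiplier plus a right shift. The sketch
// below reproduces that decomposition with std::frexp; it is equivalent to the original bit
// manipulation for positive, normal inputs, and the name QuantizeMultiplierSketch is illustrative
// only.
#include <cmath>
#include <cstdint>

void QuantizeMultiplierSketch(float fp_multiplier, int32_t* integer_multiplier, int* right_shift) {
  // Bring the multiplier into [0.5, 1) and remember how many bits that shifted it,
  // so that fp_multiplier == bumped * 2^exponent.
  int exponent = 0;
  const float bumped = std::frexp(fp_multiplier, &exponent);
  *right_shift = -exponent;
  // Scale the [0.5, 1) value to signed Q31. For multipliers below 1 (the usual case),
  // x * fp_multiplier is then approximately ((x * integer_multiplier) >> 31) >> right_shift.
  *integer_multiplier = static_cast<int32_t>(std::llround(bumped * (1ll << 31)));
}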
-#ifdef _MSC_VER -#pragma warning(disable : 4244) -#pragma warning(disable : 4267) -#endif - #include "core/providers/cpu/math/quantize_linear_matmul.h" #include "core/providers/cpu/math/matmul_helper.h" -#include "core/util/gemmlowp_common_wrapper.h" +#include "core/providers/common.h" namespace onnxruntime { @@ -24,55 +19,7 @@ ONNX_OPERATOR_KERNEL_EX( .TypeConstraint("T3", DataTypeImpl::GetTensorType()), QLinearMatMul); -Status GemmlowpMultiply(const uint8_t* lhs_data, const uint8_t* rhs_data, uint8_t* result_data, - const int lhs_offset, const int rhs_offset, const int result_offset, - int m, int n, int k, int32_t int_multiplier, int32_t right_shift) { - gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage; - quantize_down_stage.result_offset_after_shift = result_offset; - quantize_down_stage.result_fixedpoint_multiplier = int_multiplier; - quantize_down_stage.result_shift = right_shift; - gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage; - const auto& output_pipeline = std::make_tuple(quantize_down_stage, saturating_cast_stage); - - // TODO exp ColMajor order for rhs and result. That may be faster - const auto matOrder = gemmlowp::MapOrder::RowMajor; - gemmlowp::MatrixMap lhs(lhs_data, m, k); - gemmlowp::MatrixMap rhs(rhs_data, k, n); - gemmlowp::MatrixMap result(result_data, m, n); - - gemmlowp::GemmContext gemm_context; - gemmlowp::GemmWithOutputPipeline( - &gemm_context, lhs, rhs, &result, -lhs_offset, -rhs_offset, output_pipeline); - - return Status::OK(); -} - -void QuantizeMultiplier(float fp_multiplier, std::int32_t* integer_multiplier, int* right_shift) { - auto* fp_as_bits = reinterpret_cast(&fp_multiplier); - auto current_exponent = (*fp_as_bits >> 23); - // bring multiplier in [.5,1) range and calculate the shift - auto bumped_multiplier_as_bits = - (*fp_as_bits & UINT32_C(0x007fffff)) | UINT32_C(0x3f000000); - auto* bumped_multiplier = reinterpret_cast(&bumped_multiplier_as_bits); - auto shift = 126 - current_exponent; - // convert to fixed point number - auto int_multiplier = static_cast(std::round(*bumped_multiplier * (1ll << 31))); - - *integer_multiplier = static_cast(int_multiplier); - *right_shift = shift; -} - -void ScaleAndZeropointPairValidationHelper(const Tensor* scale, const Tensor* zeropoint) { - ORT_ENFORCE(scale->Shape().NumDimensions() == 0 || - (scale->Shape().NumDimensions() == 1 && scale->Shape().GetDims().size() == 1), - "scale must be a scalar"); - ORT_ENFORCE(zeropoint->Shape().NumDimensions() == 0 || - (zeropoint->Shape().NumDimensions() == 1 && zeropoint->Shape().GetDims().size() == 1), - "zeropoint must be a scalar"); -} - -template<> +template <> Status QLinearMatMul::Compute(OpKernelContext* ctx) const { auto a = ctx->Input(0); auto b = ctx->Input(3); @@ -82,16 +29,27 @@ Status QLinearMatMul::Compute(OpKernelContext* ctx) c ORT_RETURN_IF_ERROR(helper.Compute(a->Shape(), b->Shape())); Tensor* y = ctx->Output(0, helper.OutputShape()); - // validate scale and zero points + // validate offsets + auto a_offset = ctx->Input(2); + auto b_offset = ctx->Input(5); + auto y_offset = ctx->Input(7); + ORT_ENFORCE(IsScalarOr1ElementVector(a_offset), + "QLinearMatmul : input zero point must be a scalar or 1D tensor of size 1"); + ORT_ENFORCE(IsScalarOr1ElementVector(b_offset), + "QLinearMatmul : weight zero point must be a scalar or 1D tensor of size 1"); + ORT_ENFORCE(IsScalarOr1ElementVector(y_offset), + "QLinearMatmul : result zero point must be a scalar or 1D tensor of size 1"); + + // validate scale auto a_scale = 
ctx->Input(1); - auto a_zero_point = ctx->Input(2); - ScaleAndZeropointPairValidationHelper(a_scale, a_zero_point); auto b_scale = ctx->Input(4); - auto b_zero_point = ctx->Input(5); - ScaleAndZeropointPairValidationHelper(b_scale, b_zero_point); auto y_scale = ctx->Input(6); - auto y_zero_point = ctx->Input(7); - ScaleAndZeropointPairValidationHelper(y_scale, y_zero_point); + ORT_ENFORCE(IsScalarOr1ElementVector(a_scale), + "QLinearMatmul : input scale must be a scalar or 1D tensor of size 1"); + ORT_ENFORCE(IsScalarOr1ElementVector(b_scale), + "QLinearMatmul : weight scale must be a scalar or 1D tensor of size 1"); + ORT_ENFORCE(IsScalarOr1ElementVector(y_scale), + "QLinearMatmul : result scale must be a scalar or 1D tensor of size 1"); auto a_scale_data = *(a_scale->template Data()); auto b_scale_data = *(b_scale->template Data()); @@ -103,17 +61,17 @@ Status QLinearMatMul::Compute(OpKernelContext* ctx) c QuantizeMultiplier(real_multiplier, &integer_multiplier, &right_shift); for (size_t i = 0; i < helper.OutputOffsets().size(); i++) { - GemmlowpMultiply(a->template Data() + helper.LeftOffsets()[i], - b->template Data() + helper.RightOffsets()[i], - y->template MutableData() + helper.OutputOffsets()[i], - *a_zero_point->template Data(), - *b_zero_point->template Data(), - *y_zero_point->template Data(), - static_cast(helper.M()), - static_cast(helper.N()), - static_cast(helper.K()), - integer_multiplier, - right_shift); + GemmlowpMultiplyu8u8_u8(a->template Data() + helper.LeftOffsets()[i], + b->template Data() + helper.RightOffsets()[i], + y->template MutableData() + helper.OutputOffsets()[i], + *a_offset->template Data(), + *b_offset->template Data(), + *y_offset->template Data(), + static_cast(helper.M()), + static_cast(helper.N()), + static_cast(helper.K()), + integer_multiplier, + right_shift); } return Status::OK(); diff --git a/onnxruntime/core/providers/cpu/math/quantize_linear_matmul.h b/onnxruntime/core/providers/cpu/math/quantize_linear_matmul.h index 778bb03ca0e84..aada308756e85 100644 --- a/onnxruntime/core/providers/cpu/math/quantize_linear_matmul.h +++ b/onnxruntime/core/providers/cpu/math/quantize_linear_matmul.h @@ -6,6 +6,7 @@ #include "core/common/common.h" #include "core/framework/op_kernel.h" #include "core/util/math_cpuonly.h" +#include "core/util/gemmlowp_common.h" namespace onnxruntime { @@ -16,6 +17,6 @@ class QLinearMatMul final : public OpKernel { } Status Compute(OpKernelContext* context) const override; - + }; } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/math/softmax.cc b/onnxruntime/core/providers/cpu/math/softmax.cc index 9242967901e46..542e20e79f79c 100644 --- a/onnxruntime/core/providers/cpu/math/softmax.cc +++ b/onnxruntime/core/providers/cpu/math/softmax.cc @@ -4,6 +4,7 @@ #include "core/providers/cpu/math/softmax.h" #include "core/framework/op_kernel.h" +#include "core/framework/op_kernel_context_internal.h" #include "core/providers/common.h" #include "core/providers/cpu/math/softmax_shared.h" #include "core/util/math.h" @@ -12,6 +13,9 @@ namespace onnxruntime { template <> Status Softmax::Compute(OpKernelContext* ctx) const { + auto ctx_internal = static_cast(ctx); + concurrency::ThreadPool* tp = ctx_internal->GetOperatorThreadPool(); + const auto* tensor_pointer = ctx->Input(0); if (tensor_pointer == nullptr) return Status(common::ONNXRUNTIME, common::FAIL, "input count mismatch"); const Tensor& X = *tensor_pointer; @@ -34,7 +38,7 @@ Status Softmax::Compute(OpKernelContext* ctx) const { const bool logarithmic = false; auto 
status = SoftmaxCPU(N, D, X.template Data(), Ydata, - scale_.data(), sum_multiplier_.data(), logarithmic, rowmax_.data()); + scale_.data(), sum_multiplier_.data(), logarithmic, rowmax_.data(), tp); return status; } diff --git a/onnxruntime/core/providers/cpu/math/softmax_shared.cc b/onnxruntime/core/providers/cpu/math/softmax_shared.cc index 7dd3a10cfc598..18277f6b4137c 100644 --- a/onnxruntime/core/providers/cpu/math/softmax_shared.cc +++ b/onnxruntime/core/providers/cpu/math/softmax_shared.cc @@ -31,6 +31,7 @@ #endif #include "core/providers/cpu/math/softmax_shared.h" + #include "core/util/math.h" #include "core/util/math_cpuonly.h" @@ -46,7 +47,7 @@ common::Status SoftmaxCPU(const int64_t N, float* scale, const float* sum_multiplier, bool logarithmic, - float* rowmax) { + float* rowmax, onnxruntime::concurrency::ThreadPool* tp) { // the Math functions SoftmaxCPU uses only support int32_t as input, so enforce that if (N * D > INT32_MAX || N > INT32_MAX || D > INT32_MAX) { std::ostringstream ss; @@ -65,7 +66,7 @@ common::Status SoftmaxCPU(const int64_t N, // Put the intermediate result X - max(X) into Y by first copying X to Y, and then subtracting max from each entry gsl::copy(gsl::make_span(Xdata, nd), gsl::make_span(Ydata, nd)); - math::Gemm(CblasNoTrans, CblasNoTrans, n, d, 1, -1, rowmax, sum_multiplier, 1, Ydata, nullptr); + math::Gemm(CblasNoTrans, CblasNoTrans, n, d, 1, -1, rowmax, sum_multiplier, 1, Ydata, tp); // Exponentiation math::Exp(nd, Ydata, Ydata, nullptr); diff --git a/onnxruntime/core/providers/cpu/math/softmax_shared.h b/onnxruntime/core/providers/cpu/math/softmax_shared.h index 3439b9717f051..26ffeb193fe4f 100644 --- a/onnxruntime/core/providers/cpu/math/softmax_shared.h +++ b/onnxruntime/core/providers/cpu/math/softmax_shared.h @@ -6,6 +6,9 @@ #include "core/common/status.h" namespace onnxruntime { +namespace concurrency { +class ThreadPool; +} /** Calculate Softmax using CPU memory. @param N Number of rows @@ -18,5 +21,5 @@ Calculate Softmax using CPU memory. @param rowmax Storage for calculation of maximum in each row. Size must be >= N. 
*/ common::Status SoftmaxCPU(int64_t N, int64_t D, const float* Xdata, float* Ydata, float* scale, - const float* sum_multiplier, bool logarithmic, float* rowmax); + const float* sum_multiplier, bool logarithmic, float* rowmax, concurrency::ThreadPool* tp); } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/ml/label_encoder.cc b/onnxruntime/core/providers/cpu/ml/label_encoder.cc index 4a2ac686b4480..b497300a72c89 100644 --- a/onnxruntime/core/providers/cpu/ml/label_encoder.cc +++ b/onnxruntime/core/providers/cpu/ml/label_encoder.cc @@ -9,15 +9,16 @@ using namespace ::onnxruntime::common; namespace onnxruntime { namespace ml { -ONNX_CPU_OPERATOR_ML_KERNEL( +ONNX_CPU_OPERATOR_VERSIONED_ML_KERNEL( LabelEncoder, - 1, + 1, 1, KernelDefBuilder().TypeConstraint("T1", std::vector{DataTypeImpl::GetTensorType(), DataTypeImpl::GetTensorType()}) .TypeConstraint("T2", std::vector{DataTypeImpl::GetTensorType(), - DataTypeImpl::GetTensorType()}), + DataTypeImpl::GetTensorType()}) + .SinceVersion(1, 2), LabelEncoder); Status LabelEncoder::Compute(OpKernelContext* context) const { @@ -67,5 +68,107 @@ Status LabelEncoder::Compute(OpKernelContext* context) const { return Status::OK(); } +ONNX_CPU_OPERATOR_TYPED_ML_KERNEL( + LabelEncoder, + 2, + float_string, + KernelDefBuilder().TypeConstraint("T1", + std::vector{DataTypeImpl::GetTensorType()}) + .TypeConstraint("T2", + std::vector{DataTypeImpl::GetTensorType()}), + LabelEncoder_2); + +template <> +void LabelEncoder_2::InitializeSomeFields(const OpKernelInfo& info) { + _key_field_name = "keys_floats"; + _value_field_name = "values_strings"; + info.GetAttrOrDefault("default_string", &_default_value, std::string("_Unused")); +}; + +ONNX_CPU_OPERATOR_TYPED_ML_KERNEL( + LabelEncoder, + 2, + string_float, + KernelDefBuilder().TypeConstraint("T1", + std::vector{DataTypeImpl::GetTensorType()}) + .TypeConstraint("T2", + std::vector{DataTypeImpl::GetTensorType()}), + LabelEncoder_2); + +template <> +void LabelEncoder_2::InitializeSomeFields(const OpKernelInfo& info) { + _key_field_name = "keys_strings"; + _value_field_name = "values_floats"; + info.GetAttrOrDefault("default_float", &_default_value, -0.0f); +}; + +ONNX_CPU_OPERATOR_TYPED_ML_KERNEL( + LabelEncoder, + 2, + int64_float, + KernelDefBuilder().TypeConstraint("T1", + std::vector{DataTypeImpl::GetTensorType()}) + .TypeConstraint("T2", + std::vector{DataTypeImpl::GetTensorType()}), + LabelEncoder_2); + +template <> +void LabelEncoder_2::InitializeSomeFields(const OpKernelInfo& info) { + _key_field_name = "keys_int64s"; + _value_field_name = "values_floats"; + info.GetAttrOrDefault("default_float", &_default_value, -0.0f); +}; + +ONNX_CPU_OPERATOR_TYPED_ML_KERNEL( + LabelEncoder, + 2, + float_int64, + KernelDefBuilder().TypeConstraint("T1", + std::vector{DataTypeImpl::GetTensorType()}) + .TypeConstraint("T2", + std::vector{DataTypeImpl::GetTensorType()}), + LabelEncoder_2); + +template <> +void LabelEncoder_2::InitializeSomeFields(const OpKernelInfo& info) { + _key_field_name = "keys_floats"; + _value_field_name = "values_int64s"; + info.GetAttrOrDefault("default_int64", &_default_value, (std::int64_t)-1); +}; + +ONNX_CPU_OPERATOR_TYPED_ML_KERNEL( + LabelEncoder, + 2, + int64_string, + KernelDefBuilder().TypeConstraint("T1", + std::vector{DataTypeImpl::GetTensorType()}) + .TypeConstraint("T2", + std::vector{DataTypeImpl::GetTensorType()}), + LabelEncoder_2) + +template <> +void LabelEncoder_2::InitializeSomeFields(const OpKernelInfo& info) { + _key_field_name = "keys_int64s"; + 
_value_field_name = "values_strings"; + info.GetAttrOrDefault("default_string", &_default_value, std::string("_Unused")); +}; + +ONNX_CPU_OPERATOR_TYPED_ML_KERNEL( + LabelEncoder, + 2, + string_int64, + KernelDefBuilder().TypeConstraint("T1", + std::vector{DataTypeImpl::GetTensorType()}) + .TypeConstraint("T2", + std::vector{DataTypeImpl::GetTensorType()}), + LabelEncoder_2) + +template <> +void LabelEncoder_2::InitializeSomeFields(const OpKernelInfo& info) { + _key_field_name = "keys_strings"; + _value_field_name = "values_int64s"; + info.GetAttrOrDefault("default_int64", &_default_value, (std::int64_t)-1); +}; + } // namespace ml } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/ml/label_encoder.h b/onnxruntime/core/providers/cpu/ml/label_encoder.h index 597cf240c6ed4..0f7c59b5740a0 100644 --- a/onnxruntime/core/providers/cpu/ml/label_encoder.h +++ b/onnxruntime/core/providers/cpu/ml/label_encoder.h @@ -43,5 +43,67 @@ class LabelEncoder final : public OpKernel { int64_t default_int_; }; +template +class LabelEncoder_2 final : public OpKernel { + public: + LabelEncoder_2(const OpKernelInfo& info) : OpKernel(info) { + // Let the specialized member function to tell which fields to load. + InitializeSomeFields(info); + + std::vector keys; + std::vector values; + + ORT_ENFORCE(info.GetAttrs(_key_field_name, keys).IsOK()); + ORT_ENFORCE(info.GetAttrs(_value_field_name, values).IsOK()); + + auto num_keys = keys.size(); + auto num_values = values.size(); + ORT_ENFORCE(num_keys == num_values, + "The ", _key_field_name, " and ", _value_field_name, " attribtues in LabelEncoder ", + "(name: ", info.node().Name(), ") must have the same length. ", + "However, the number of key is ", num_keys, " and the number of ", + "values is ", num_values, "."); + + for (size_t i = 0; i < num_keys; ++i) + _map[keys[i]] = values[i]; + } + + Status Compute(OpKernelContext* context) const override { + const auto* tensor_pointer = context->Input(0); + if (tensor_pointer == nullptr) return Status(common::ONNXRUNTIME, common::FAIL, "input count mismatch"); + const Tensor& X = *tensor_pointer; + const TensorShape& shape = X.Shape(); + Tensor& Y = *context->Output(0, TensorShape(shape)); + + auto input = X.template DataAsSpan(); + auto output = Y.template MutableDataAsSpan(); + + for (int64_t i = 0; i < shape.Size(); ++i) { + const auto found = _map.find(input[i]); + if (found == _map.end()) + output[i] = _default_value; + else + output[i] = found->second; + } + + return Status::OK(); + } + + private: + // Specialize this method to set attribute names. For example, if keys' type + // is 64-bit integer, _key_field_name should be "keys_int64s". Field names + // for other types can be found in ONNX spec. + void InitializeSomeFields(const OpKernelInfo& info); + + // A collection of key-value pairs. Each (a_key, a_value) pair + // means that the "a_key" in the input would be mapped to "a_value". + // If _map doesn't contain "a_key", we use _default_value as its output. + std::unordered_map _map; + TValue _default_value; + // ONNX attribute name to load keys. + std::string _key_field_name; + // ONNX attribute name to load values. 
+ std::string _value_field_name; +}; } // namespace ml } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/nn/Unpool.cc b/onnxruntime/core/providers/cpu/nn/Unpool.cc index 3b1c16f354a55..853bd05cdd8d0 100644 --- a/onnxruntime/core/providers/cpu/nn/Unpool.cc +++ b/onnxruntime/core/providers/cpu/nn/Unpool.cc @@ -18,9 +18,9 @@ ONNX_CPU_OPERATOR_KERNEL( MaxUnpool, 9, KernelDefBuilder() - .TypeConstraint("T", DataTypeImpl::GetTensorType()) - .TypeConstraint("I", DataTypeImpl::GetTensorType()) - .TypeConstraint("Y", DataTypeImpl::GetTensorType()), + .TypeConstraint("T1", DataTypeImpl::GetTensorType()) + .TypeConstraint("T2", DataTypeImpl::GetTensorType()), + // .TypeConstraint("Y", DataTypeImpl::GetTensorType()), MaxUnpool); Status MaxUnpool::Compute(OpKernelContext* context) const { diff --git a/onnxruntime/core/providers/cpu/nn/conv.cc b/onnxruntime/core/providers/cpu/nn/conv.cc index c3acbd02a62c5..c0091936704d8 100644 --- a/onnxruntime/core/providers/cpu/nn/conv.cc +++ b/onnxruntime/core/providers/cpu/nn/conv.cc @@ -14,6 +14,7 @@ * limitations under the License. */ /* Modifications Copyright (c) Microsoft. */ +#include "core/framework/op_kernel_context_internal.h" #include "core/providers/cpu/nn/conv.h" #include "core/framework/op_kernel_context_internal.h" @@ -24,6 +25,8 @@ namespace onnxruntime { template Status Conv::Compute(OpKernelContext* context) const { size_t num_inputs = OpKernel::Node().InputDefs().size(); + auto ctx_internal = static_cast(context); + concurrency::ThreadPool* tp = ctx_internal->GetOperatorThreadPool(); const auto* X = context->Input(0); const auto* W = context->Input(1); @@ -116,7 +119,7 @@ Status Conv::Compute(OpKernelContext* context) const { col_buffer_data, &CPUMathUtil::Instance()); } - math::Gemm( + math::Gemm( CblasNoTrans, CblasNoTrans, M / group_, @@ -127,7 +130,7 @@ Status Conv::Compute(OpKernelContext* context) const { col_buffer_data, 0, Ydata + group_id * Y_offset, - &CPUMathUtil::Instance()); + tp); } if (B != nullptr) { @@ -144,6 +147,9 @@ Status Conv::Compute(OpKernelContext* context) const { } Status Conv::Compute(OpKernelContext* context) const { + auto ctx_internal = static_cast(context); + concurrency::ThreadPool* tp = ctx_internal->GetOperatorThreadPool(); + size_t num_inputs = OpKernel::Node().InputDefs().size(); const auto* X = context->Input(0); const auto* W = context->Input(1); @@ -186,11 +192,6 @@ Status Conv::Compute(OpKernelContext* context) const { const size_t kernel_rank = kernel_shape.size(); if (kernel_rank == 2 || kernel_rank == 3) { - // Get access to the internal threadpool - // Temporarily derive concurrency parameters without access to session state - auto ctx_internal = static_cast(context); - auto thread_pool = ctx_internal->GetOperatorThreadPool(); - MLAS_CONV_PARAMETERS Parameters; size_t WorkingBufferSize; MlasConvPrepare(&Parameters, @@ -207,7 +208,7 @@ Status Conv::Compute(OpKernelContext* context) const { static_cast(M / group_), &activation_, &WorkingBufferSize, - const_cast(thread_pool)); + tp); auto working_data = WorkingBufferSize > 0 ? 
alloc->Alloc(sizeof(float) * WorkingBufferSize) : nullptr; BufferUniquePtr working_buffer(working_data, BufferDeleter(alloc)); @@ -218,7 +219,7 @@ Status Conv::Compute(OpKernelContext* context) const { Bdata, static_cast(working_buffer.get()), Ydata, - const_cast(thread_pool)); + tp); } else { const int64_t input_image_size = input_shape.Size(); const int64_t output_image_size = output_shape.Size(); @@ -253,7 +254,7 @@ Status Conv::Compute(OpKernelContext* context) const { static_cast(kernel_shape.size()), col_buffer_data, &CPUMathUtil::Instance()); - math::Gemm( + math::Gemm( CblasNoTrans, CblasNoTrans, M / group_, @@ -264,7 +265,7 @@ Status Conv::Compute(OpKernelContext* context) const { col_buffer_data, 0, Ydata + group_id * Y_offset, - &CPUMathUtil::Instance()); + tp); } MlasActivation(&activation_, Ydata, Bdata, M, output_image_size, output_image_size); diff --git a/onnxruntime/core/providers/cpu/nn/conv_integer.cc b/onnxruntime/core/providers/cpu/nn/conv_integer.cc index fbd182312f554..534cb75a6e840 100644 --- a/onnxruntime/core/providers/cpu/nn/conv_integer.cc +++ b/onnxruntime/core/providers/cpu/nn/conv_integer.cc @@ -1,15 +1,11 @@ // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. -#ifdef _MSC_VER -#pragma warning(disable : 4244) -#pragma warning(disable : 4267) -#endif - #include "core/providers/cpu/nn/conv_integer.h" #include "core/util/math.h" #include "core/util/math_cpuonly.h" -#include "core/util/gemmlowp_common_wrapper.h" +#include "core/util/qmath.h" +#include "core/providers/common.h" namespace onnxruntime { @@ -25,30 +21,21 @@ ONNX_OPERATOR_KERNEL_EX( ConvInteger); Status ConvInteger::Compute(OpKernelContext* context) const { + size_t num_inputs = OpKernel::Node().InputDefs().size(); const auto* X = context->Input(0); const auto* W = context->Input(1); - int32_t input_offset = 0; - int32_t filter_offset = 0; + uint8_t input_offset = 0; + uint8_t filter_offset = 0; if (num_inputs >= 3) { const auto* X_Zero_Point = context->Input(2); - if (X_Zero_Point->Shape().NumDimensions() == 0 || - (X_Zero_Point->Shape().NumDimensions() == 1 && X_Zero_Point->Shape().GetDims().size() == 1)) { - input_offset = static_cast(*(X_Zero_Point->Data())); - } else { - //TODO: Add support for per-channel quantization. - return Status(common::ONNXRUNTIME, common::FAIL, "Non per-tensor quantization is not supported now."); - } + ORT_ENFORCE(IsScalarOr1ElementVector(X_Zero_Point), "Must be a scalar or 1D tensor of size 1."); + input_offset = *(X_Zero_Point->Data()); } if (num_inputs >= 4) { const auto* W_Zero_Point = context->Input(3); - if (W_Zero_Point->Shape().NumDimensions() == 0 || - (W_Zero_Point->Shape().NumDimensions() == 1 && W_Zero_Point->Shape().GetDims().size() == 1)) { - filter_offset = static_cast(*(W_Zero_Point->Data())); - } else { - //TODO: Add support for per-channel quantization.
- return Status(common::ONNXRUNTIME, common::FAIL, "Non per-tensor quantization is not supported now."); - } + ORT_ENFORCE(IsScalarOr1ElementVector(W_Zero_Point), "Non per-tensor quantization is not supported now."); + filter_offset = *(W_Zero_Point->Data()); } const int64_t N = X->Shape()[0]; @@ -118,27 +105,21 @@ Status ConvInteger::Compute(OpKernelContext* context) const { static_cast(kernel_shape.size()), col_buffer_data, &CPUMathUtil::Instance(), - false, - input_offset); - - const uint8_t* filter_data_as_uint8 = W->template Data() + group_id * W_offset; - static const gemmlowp::MapOrder ResultOrder = gemmlowp::MapOrder::RowMajor; - static const gemmlowp::MapOrder LhsOrder = gemmlowp::MapOrder::RowMajor; - static const gemmlowp::MapOrder RhsOrder = gemmlowp::MapOrder::RowMajor; - gemmlowp::MatrixMap lhs( - filter_data_as_uint8, static_cast(M / group_), static_cast(kernel_dim)); - gemmlowp::MatrixMap rhs( - col_buffer_data, static_cast(kernel_dim), static_cast(output_image_size)); - gemmlowp::MatrixMap result( - Ydata + group_id * Y_offset, static_cast(M / group_), static_cast(output_image_size)); - const std::tuple<> empty_pipeline = {}; - - gemmlowp::GemmContext gemm_context; - // TODO: worker thread pool needs to be handled. - gemmlowp::GemmWithOutputPipeline( - &gemm_context, lhs, rhs, &result, -filter_offset, -input_offset, - empty_pipeline); + false, + input_offset); + + QGemmu8u8_s32(static_cast(M / group_), + static_cast(output_image_size), + static_cast(kernel_dim), + W->template Data() + group_id * W_offset, + static_cast(kernel_dim), + filter_offset, + col_buffer_data, + static_cast(output_image_size), + input_offset, + Ydata + group_id * Y_offset, + static_cast(output_image_size), + nullptr); } Xdata += X_offset * group_; diff --git a/onnxruntime/core/providers/cpu/nn/conv_transpose.cc b/onnxruntime/core/providers/cpu/nn/conv_transpose.cc index 14f13ccd20198..9fd9cd1502147 100644 --- a/onnxruntime/core/providers/cpu/nn/conv_transpose.cc +++ b/onnxruntime/core/providers/cpu/nn/conv_transpose.cc @@ -16,6 +16,8 @@ /* Modifications Copyright (c) Microsoft. */ #include "core/providers/cpu/nn/conv_transpose.h" +#include "core/framework/op_kernel_context_internal.h" + #include "core/util/math.h" #include "core/util/math_cpuonly.h" @@ -228,6 +230,9 @@ Status ConvTranspose::Compute(OpKernelContext* context) const { template Status ConvTranspose::DoConvTranspose(OpKernelContext* context, bool dynamic_padding) const { + auto ctx_internal = static_cast(context); + concurrency::ThreadPool* tp = ctx_internal->GetOperatorThreadPool(); + size_t num_inputs = OpKernel::Node().InputDefs().size(); Prepare p; bool has_bias = dynamic_padding ? 
num_inputs == 4 : num_inputs == 3; @@ -254,7 +259,7 @@ Status ConvTranspose::DoConvTranspose(OpKernelContext* context, bool dynamic_ for (auto image_id = 0; image_id < p.N; ++image_id) { for (int group_id = 0; group_id < group_; ++group_id) { // Weight term - math::Gemm( + math::Gemm( CblasTrans, CblasNoTrans, kernel_dim, @@ -265,7 +270,7 @@ Status ConvTranspose::DoConvTranspose(OpKernelContext* context, bool dynamic_ Xdata + group_id * X_offset, 0, col_buffer_data, - &CPUMathUtil::Instance()); + tp); // Col2im math::Col2im( diff --git a/onnxruntime/core/providers/cpu/nn/pool.cc b/onnxruntime/core/providers/cpu/nn/pool.cc index 367a9256a0c16..47bc8fc856bb3 100644 --- a/onnxruntime/core/providers/cpu/nn/pool.cc +++ b/onnxruntime/core/providers/cpu/nn/pool.cc @@ -190,7 +190,7 @@ Status PoolBase::Compute(OpKernelContext* context, MLAS_POOLING_KIND kind) const // Get access to the internal threadpool // Temporarily derive concurrency parameters without access to session state auto ctx_internal = static_cast(context); - auto thread_pool = ctx_internal->GetOperatorThreadPool(); + concurrency::ThreadPool* thread_pool = ctx_internal->GetOperatorThreadPool(); MlasPool(kind, pooling_dims, diff --git a/onnxruntime/core/providers/cpu/nn/pool_base.h b/onnxruntime/core/providers/cpu/nn/pool_base.h index 43f81982dd3a9..606ac909f08f1 100644 --- a/onnxruntime/core/providers/cpu/nn/pool_base.h +++ b/onnxruntime/core/providers/cpu/nn/pool_base.h @@ -99,10 +99,13 @@ class LpPool { }; class PoolBase { + private: + static bool IsGlobalPooling(const std::string& op_name) { + return op_name == "GlobalAveragePool" || op_name == "GlobalMaxPool" || op_name == "GlobalLpPool"; + } + protected: - PoolBase(const OpKernelInfo& info) { - op_name_ = info.GetKernelDef().OpName(); - global_pooling_ = (op_name_ == "GlobalAveragePool" || op_name_ == "GlobalMaxPool" || op_name_ == "GlobalLpPool"); + PoolBase(const OpKernelInfo& info) : op_name_(info.GetKernelDef().OpName()), global_pooling_(IsGlobalPooling(op_name_)) { int end; info.GetKernelDef().SinceVersion(&start_version_, &end); @@ -256,8 +259,8 @@ class PoolBase { Status Compute(OpKernelContext* context, MLAS_POOLING_KIND kind) const; protected: - std::string op_name_; - bool global_pooling_{}; + const std::string op_name_; + const bool global_pooling_; bool count_include_pad_{}; int64_t storage_order_{0}; // MaxPool_8 only. 0 is row major, and 1 is column major. Default is 0. int64_t ceil_mode_{0}; // Introduced in MaxPool_10 diff --git a/onnxruntime/core/providers/cpu/nn/qlinearconv.cc b/onnxruntime/core/providers/cpu/nn/qlinearconv.cc index 1cf064f7ea9e1..78a53679325e8 100644 --- a/onnxruntime/core/providers/cpu/nn/qlinearconv.cc +++ b/onnxruntime/core/providers/cpu/nn/qlinearconv.cc @@ -1,14 +1,10 @@ // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. 
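// The QLinearConv hunks that follow (like the QLinearMatMul and MatMulInteger changes above)
// feed zero-point-adjusted uint8 operands into a shared quantized GEMM and then requantize the
// int32 accumulator using the integer_multiplier / right_shift pair produced by
// QuantizeMultiplier. The function below is a plain floating-point reference of that end-to-end
// computation for a single matrix multiply; ReferenceQLinearMatMul is an illustrative name, the
// production path performs the requantization in fixed point, and QLinearConv additionally adds
// an int32 bias to the accumulator before requantizing.
#include <algorithm>
#include <cmath>
#include <cstdint>

void ReferenceQLinearMatMul(const uint8_t* A, const uint8_t* B, uint8_t* Y,
                            int M, int N, int K,
                            uint8_t a_zero, uint8_t b_zero, uint8_t y_zero,
                            float a_scale, float b_scale, float y_scale) {
  const float real_multiplier = (a_scale * b_scale) / y_scale;
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      // Accumulate the zero-point-adjusted products in int32, as the quantized GEMM does.
      int32_t acc = 0;
      for (int k = 0; k < K; ++k) {
        acc += (static_cast<int32_t>(A[m * K + k]) - a_zero) *
               (static_cast<int32_t>(B[k * N + n]) - b_zero);
      }
      // Requantize: scale back into the output's quantized domain, add its zero point,
      // and saturate to the uint8 range.
      const int32_t requantized = static_cast<int32_t>(std::lround(acc * real_multiplier)) + y_zero;
      Y[m * N + n] = static_cast<uint8_t>(std::min(255, std::max(0, requantized)));
    }
  }
}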
-#ifdef _MSC_VER -#pragma warning(disable : 4244) -#pragma warning(disable : 4267) -#endif - #include "core/providers/cpu/nn/qlinearconv.h" #include "core/util/math.h" #include "core/util/math_cpuonly.h" +#include "core/providers/common.h" namespace onnxruntime { ONNX_OPERATOR_KERNEL_EX( @@ -19,32 +15,40 @@ ONNX_OPERATOR_KERNEL_EX( KernelDefBuilder() .TypeConstraint("T1", DataTypeImpl::GetTensorType()) .TypeConstraint("T2", DataTypeImpl::GetTensorType()) - .TypeConstraint("T3", DataTypeImpl::GetTensorType()), + .TypeConstraint("T3", DataTypeImpl::GetTensorType()) + .TypeConstraint("T4", DataTypeImpl::GetTensorType()), QLinearConv); Status QLinearConv::Compute(OpKernelContext* context) const { const auto* X = context->Input(0); const auto* W = context->Input(3); - // validate scale and zero points - auto input_scale = context->Input(1); + // validate offsets auto input_offset = context->Input(2); - ScaleAndZeropointPairValidationHelper(input_scale, input_offset); - auto filter_scale = context->Input(4); auto filter_offset = context->Input(5); - ScaleAndZeropointPairValidationHelper(filter_scale, filter_offset); - auto result_scale = context->Input(6); auto result_offset = context->Input(7); - ScaleAndZeropointPairValidationHelper(result_scale, result_offset); + ORT_ENFORCE(IsScalarOr1ElementVector(input_offset), + "QLinearConv : input zero point must be a scalar or 1D tensor of size 1"); + ORT_ENFORCE(IsScalarOr1ElementVector(filter_offset), + "QLinearConv : filter zero point must be a scalar or 1D tensor of size 1"); + ORT_ENFORCE(IsScalarOr1ElementVector(result_offset), + "QLinearConv : result zero point must be a scalar or 1D tensor of size 1"); + + // validate scale + auto input_scale = context->Input(1); + auto filter_scale = context->Input(4); + auto result_scale = context->Input(6); + ORT_ENFORCE(IsScalarOr1ElementVector(input_scale), + "QLinearConv : input scale must be a scalar or 1D tensor of size 1"); + ORT_ENFORCE(IsScalarOr1ElementVector(filter_scale), + "QLinearConv : filter scale must be a scalar or 1D tensor of size 1"); + ORT_ENFORCE(IsScalarOr1ElementVector(result_scale), + "QLinearConv : result scale must be a scalar or 1D tensor of size 1"); auto input_scale_data = *(input_scale->template Data()); auto filter_scale_data = *(filter_scale->template Data()); auto result_scale_data = *(result_scale->template Data()); - auto input_offset_data = *(input_offset->template Data()); - auto filter_offset_data = *(filter_offset->template Data()); - auto result_offset_data = *(result_offset->template Data()); - const float real_multiplier = (input_scale_data * filter_scale_data) / result_scale_data; int32_t integer_multiplier; int right_shift; @@ -54,7 +58,7 @@ Status QLinearConv::Compute(OpKernelContext* context) const { const Tensor* bias = nullptr; if (num_inputs == 9) { bias = context->Input(8); - } + } const int64_t N = X->Shape()[0]; const int64_t C = X->Shape()[1]; @@ -95,7 +99,7 @@ Status QLinearConv::Compute(OpKernelContext* context) const { const int64_t kernel_size = TensorShape(kernel_shape).Size(); const int64_t X_offset = C / group_ * input_image_size; const int64_t Y_offset = Y->Shape().Size() / Y->Shape()[0] / group_; - const int64_t W_offset = W->Shape().Size() / group_; + const int64_t W_offset = W->Shape().Size() / group_; const int64_t kernel_dim = C / group_ * kernel_size; const int64_t col_buffer_size = kernel_dim * output_image_size; const int bias_offset = static_cast(M / group_); @@ -124,35 +128,21 @@ Status QLinearConv::Compute(OpKernelContext* context) 
const { static_cast(kernel_shape.size()), col_buffer_data, &CPUMathUtil::Instance(), - false, - input_offset_data); - - const uint8_t* filter_data_as_uint8 = W->template Data() + group_id * W_offset; - static const gemmlowp::MapOrder MatOrder = gemmlowp::MapOrder::RowMajor; - gemmlowp::MatrixMap lhs( - filter_data_as_uint8, static_cast(M / group_), static_cast(kernel_dim)); - gemmlowp::MatrixMap rhs( - col_buffer_data, static_cast(kernel_dim), static_cast(output_image_size)); - gemmlowp::MatrixMap result( - Ydata + group_id * Y_offset, static_cast(M / group_), static_cast(output_image_size)); - - // TODO: worker thread pool needs to be handled. - gemmlowp::GemmContext gemm_context; - if (bias == nullptr) { - auto output_pipeline = MakeOutputPipelineWithOutBias(result_offset_data, - integer_multiplier, right_shift); - gemmlowp::GemmWithOutputPipeline( - &gemm_context, lhs, rhs, &result, -filter_offset_data, -input_offset_data, - output_pipeline); - } else { - auto output_pipeline = MakeOutputPipelineWithBias(bias->template Data() + group_id * bias_offset, - static_cast(M / group_), result_offset_data, integer_multiplier, right_shift); - gemmlowp::GemmWithOutputPipeline( - &gemm_context, lhs, rhs, &result, -filter_offset_data, -input_offset_data, - output_pipeline); - } + false, + *input_offset->template Data()); + + GemmlowpMultiplyu8u8_u8(W->template Data() + group_id * W_offset, + col_buffer_data, + Ydata + group_id * Y_offset, + *filter_offset->template Data(), + *input_offset->template Data(), + *result_offset->template Data(), + static_cast(M / group_), + static_cast(output_image_size), + static_cast(kernel_dim), + integer_multiplier, + right_shift, + bias == nullptr ? nullptr : bias->template Data() + group_id * bias_offset); } Xdata += X_offset * group_; @@ -161,28 +151,4 @@ Status QLinearConv::Compute(OpKernelContext* context) const { return Status::OK(); } - -void QLinearConv::QuantizeMultiplier(float fp_multiplier, std::int32_t* integer_multiplier, int* right_shift) const { - auto* fp_as_bits = reinterpret_cast(&fp_multiplier); - auto current_exponent = (*fp_as_bits >> 23); - // bring multiplier in [.5,1) range and calculate the shift - auto bumped_multiplier_as_bits = - (*fp_as_bits & UINT32_C(0x007fffff)) | UINT32_C(0x3f000000); - auto* bumped_multiplier = reinterpret_cast(&bumped_multiplier_as_bits); - auto shift = 126 - current_exponent; - // convert to fixed point number - auto int_multiplier = static_cast(std::round(*bumped_multiplier * (1ll << 31))); - - *integer_multiplier = static_cast(int_multiplier); - *right_shift = shift; -} - -void QLinearConv::ScaleAndZeropointPairValidationHelper(const Tensor* scale, const Tensor* zeropoint) const { - ORT_ENFORCE(scale->Shape().NumDimensions() == 0 || - (scale->Shape().NumDimensions() == 1 && scale->Shape().GetDims().size() == 1), - "scale must be a scalar"); - ORT_ENFORCE(zeropoint->Shape().NumDimensions() == 0 || - (zeropoint->Shape().NumDimensions() == 1 && zeropoint->Shape().GetDims().size() == 1), - "zeropoint must be a scalar"); -} } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/nn/qlinearconv.h b/onnxruntime/core/providers/cpu/nn/qlinearconv.h index c5e7919371bc8..9179da587c1f4 100644 --- a/onnxruntime/core/providers/cpu/nn/qlinearconv.h +++ b/onnxruntime/core/providers/cpu/nn/qlinearconv.h @@ -4,7 +4,7 @@ #pragma once #include "core/providers/cpu/nn/conv_base.h" -#include "core/util/gemmlowp_common_wrapper.h" +#include "core/util/gemmlowp_common.h" namespace onnxruntime { class QLinearConv : public 
OpKernel, public ConvBase { @@ -12,44 +12,6 @@ class QLinearConv : public OpKernel, public ConvBase { explicit QLinearConv(const OpKernelInfo& info) : OpKernel(info), ConvBase(info) { } - Status Compute(OpKernelContext* context) const override; - - void QuantizeMultiplier(float fp_multiplier, std::int32_t* integer_multiplier, int* right_shift) const; - - void ScaleAndZeropointPairValidationHelper(const Tensor* scale, const Tensor* zeropoint) const; + Status Compute(OpKernelContext* context) const override; }; - -typedef gemmlowp::VectorMap ColVectorMap; - -inline std::tuple, - gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint, - gemmlowp::OutputStageSaturatingCastToUint8> -MakeOutputPipelineWithBias(const int32_t* bias, - int rows, - std::int32_t result_offset, - std::int32_t result_mult_int, - std::int32_t result_shift) { - ColVectorMap bias_vector(bias, rows); - gemmlowp::OutputStageBiasAddition bias_addition_stage; - bias_addition_stage.bias_vector = bias_vector; - gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage; - quantize_down_stage.result_offset_after_shift = result_offset; - quantize_down_stage.result_fixedpoint_multiplier = result_mult_int; - quantize_down_stage.result_shift = result_shift; - gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage; - return std::make_tuple(bias_addition_stage, quantize_down_stage, saturating_cast_stage); -} - -inline std::tuple -MakeOutputPipelineWithOutBias(std::int32_t result_offset, - std::int32_t result_mult_int, - std::int32_t result_shift) { - gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage; - quantize_down_stage.result_offset_after_shift = result_offset; - quantize_down_stage.result_fixedpoint_multiplier = result_mult_int; - quantize_down_stage.result_shift = result_shift; - gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage; - return std::make_tuple(quantize_down_stage, saturating_cast_stage); -} } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/object_detection/non_max_suppression.cc b/onnxruntime/core/providers/cpu/object_detection/non_max_suppression.cc index 66084547810ad..c1a376026ab14 100644 --- a/onnxruntime/core/providers/cpu/object_detection/non_max_suppression.cc +++ b/onnxruntime/core/providers/cpu/object_detection/non_max_suppression.cc @@ -141,7 +141,7 @@ Status NonMaxSuppression::Compute(OpKernelContext* ctx) const { for (int64_t batch_index = 0; batch_index < pc.num_batches_; ++batch_index) { for (int64_t class_index = 0; class_index < pc.num_classes_; ++class_index) { int64_t box_score_offset = (batch_index * pc.num_classes_ + class_index) * pc.num_boxes_; - int64_t box_offset = batch_index * pc.num_classes_ * pc.num_boxes_ * 4; + int64_t box_offset = batch_index * pc.num_boxes_ * 4; // Filter by score_threshold_ std::priority_queue> sorted_scores_with_index; const auto* class_scores = scores_data + box_score_offset; @@ -158,7 +158,7 @@ Status NonMaxSuppression::Compute(OpKernelContext* ctx) const { } ScoreIndexPair next_top_score; - std::vector selected_indicies_inside_class; + std::vector selected_indices_inside_class; // Get the next box with top score, filter by iou_threshold while (!sorted_scores_with_index.empty()) { next_top_score = sorted_scores_with_index.top(); @@ -166,7 +166,7 @@ Status NonMaxSuppression::Compute(OpKernelContext* ctx) const { bool selected = true; // Check with existing selected boxes for this class, suppress if exceed the IOU (Intersection Over Union) threshold - for (int64_t selected_index : 
selected_indicies_inside_class) { + for (int64_t selected_index : selected_indices_inside_class) { if (SuppressByIOU(boxes_data + box_offset, selected_index, next_top_score.index_, center_point_box, iou_threshold)) { selected = false; @@ -176,10 +176,10 @@ Status NonMaxSuppression::Compute(OpKernelContext* ctx) const { if (selected) { if (max_output_boxes_per_class > 0 && - static_cast(selected_indicies_inside_class.size()) >= max_output_boxes_per_class) { + static_cast(selected_indices_inside_class.size()) >= max_output_boxes_per_class) { break; } - selected_indicies_inside_class.push_back(next_top_score.index_); + selected_indices_inside_class.push_back(next_top_score.index_); selected_indices.emplace_back(batch_index, class_index, next_top_score.index_); } } //while diff --git a/onnxruntime/core/providers/cpu/object_detection/roialign.cc b/onnxruntime/core/providers/cpu/object_detection/roialign.cc index 9453039aa8753..4d27e957e9f44 100644 --- a/onnxruntime/core/providers/cpu/object_detection/roialign.cc +++ b/onnxruntime/core/providers/cpu/object_detection/roialign.cc @@ -268,7 +268,7 @@ void RoiAlignForward( } // for ph } // for c }; // for n - const_cast(ttp)->ParallelFor(static_cast(n_rois), work_object); + if (ttp != nullptr) const_cast(ttp)->ParallelFor(static_cast(n_rois), work_object); } } // namespace diff --git a/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc b/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc index 8c9143a238868..b418012574c7e 100644 --- a/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc +++ b/onnxruntime/core/providers/cpu/reduction/reduction_ops.cc @@ -30,15 +30,25 @@ namespace onnxruntime { KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), \ x); +#define REGISTER_UNARY_ELEMENTWISE_KERNEL_INT64_ONLY(x, sinceVersion) \ + ONNX_CPU_OPERATOR_TYPED_KERNEL( \ + x, \ + sinceVersion, \ + int64_t, \ + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), \ + x); + REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceL1, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceL2, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceLogSum, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceLogSumExp, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceMax, 1); +REGISTER_UNARY_ELEMENTWISE_KERNEL_INT64_ONLY(ReduceMax, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceMean, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceMin, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceProd, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceSum, 1); +REGISTER_UNARY_ELEMENTWISE_KERNEL_INT64_ONLY(ReduceSum, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL_DOUBLE_ONLY(ReduceSum, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL(ReduceSumSquare, 1); REGISTER_UNARY_ELEMENTWISE_KERNEL_DOUBLE_ONLY(ReduceSumSquare, 1); diff --git a/onnxruntime/core/providers/cpu/rnn/deep_cpu_gru.cc b/onnxruntime/core/providers/cpu/rnn/deep_cpu_gru.cc index c5be268f59e2d..0dd13269bfacd 100644 --- a/onnxruntime/core/providers/cpu/rnn/deep_cpu_gru.cc +++ b/onnxruntime/core/providers/cpu/rnn/deep_cpu_gru.cc @@ -1,5 +1,7 @@ // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. 
+#include "core/platform/threadpool.h" +#include "core/framework/op_kernel_context_internal.h" // there's no way to use a raw pointer as the copy destination with std::copy_n // (which gsl::copy uses with span::data() which returns a raw pointer) with the 14.11 toolset @@ -167,7 +169,8 @@ class UniDirectionalGru { UniDirectionalGru(AllocatorPtr allocator, int seq_length, int batch_size, int input_size, int hidden_size, bool linear_before_reset, Direction direction, const gsl::span& bias, const gsl::span& initial_hidden_state, const ActivationFuncs::Entry& activation_func_f, - const ActivationFuncs::Entry& activation_func_g, float clip); + const ActivationFuncs::Entry& activation_func_g, float clip, + onnxruntime::concurrency::ThreadPool* ttp); void Compute(const gsl::span& inputs, const gsl::span& sequence_lengths, int num_directions, const gsl::span& input_weights, const gsl::span& recurrent_weights, @@ -233,6 +236,8 @@ class UniDirectionalGru { deepcpu::GruOutputGateFuncPtr output_gate_{}; void AllocateBuffers(); + + onnxruntime::concurrency::ThreadPool* ttp_; }; } // namespace detail @@ -263,6 +268,9 @@ Status DeepCpuGruOp::Compute(OpKernelContext* context) const { template Status DeepCpuGruOp::ComputeImpl(OpKernelContext& context) const { + auto ctx_internal = static_cast(&context); + concurrency::ThreadPool* thread_pool = ctx_internal->GetOperatorThreadPool(); + const Tensor& X = *context.Input(0); // inputs. [seq_length, batch_size, input_size] const Tensor& W = *context.Input(1); // weights. [num_directions, 3*hidden_size, input_size] const Tensor& R = *context.Input(2); // recurrence weights. [num_directions, 3*hidden_size, hidden_size] @@ -367,7 +375,7 @@ Status DeepCpuGruOp::ComputeImpl(OpKernelContext& context) const { linear_before_reset_, Direction::kForward, bias_1, initial_hidden_1, activation_funcs_.Entries()[0], activation_funcs_.Entries()[1], - clip_); + clip_, thread_pool); fw.Compute(input, sequence_lens_span, num_directions_, input_weights_1, recurrent_weights_1, output_1, hidden_output_1); @@ -375,7 +383,7 @@ Status DeepCpuGruOp::ComputeImpl(OpKernelContext& context) const { linear_before_reset_, Direction::kReverse, bias_2, initial_hidden_2, activation_funcs_.Entries()[2], activation_funcs_.Entries()[3], - clip_); + clip_, thread_pool); bw.Compute(input, sequence_lens_span, num_directions_, input_weights_2, recurrent_weights_2, output_2, hidden_output_2); } else { @@ -383,7 +391,7 @@ Status DeepCpuGruOp::ComputeImpl(OpKernelContext& context) const { linear_before_reset_, direction_, bias_1, initial_hidden_1, activation_funcs_.Entries()[0], activation_funcs_.Entries()[1], - clip_); + clip_, thread_pool); gru_p.Compute(input, sequence_lens_span, num_directions_, input_weights_1, recurrent_weights_1, output_1, hidden_output_1); } @@ -412,7 +420,7 @@ UniDirectionalGru::UniDirectionalGru(AllocatorPtr allocator, const gsl::span& initial_hidden_state, const ActivationFuncs::Entry& activation_func_f, const ActivationFuncs::Entry& activation_func_g, - const float clip) + const float clip, onnxruntime::concurrency::ThreadPool* ttp) : allocator_(allocator), seq_length_(seq_length), batch_size_(batch_size), @@ -421,7 +429,8 @@ UniDirectionalGru::UniDirectionalGru(AllocatorPtr allocator, linear_before_reset_(linear_before_reset), clip_(clip), direction_(direction), - use_bias_(!bias.empty()) { + use_bias_(!bias.empty()), + ttp_(ttp) { clip_with_bias_ptr_ = use_bias_ ? 
deepcpu::clip_add_bias : deepcpu::clip_ignore_bias; // setup activation function pointers and alpha/beta values to use with them @@ -540,7 +549,7 @@ void UniDirectionalGru::Compute(const gsl::span& inputs_arg, input_weights.cbegin(), input_weights.cend(), input_size_, beta, outputZRH_.begin(), outputZRH_.end(), - hidden_size_x3); + hidden_size_x3, ttp_); DumpMatrix("inputs with weights applied", outputZRH_.data(), seq_length_ * batch_size_ * 3, hidden_size_); @@ -606,7 +615,7 @@ void UniDirectionalGru::Compute(const gsl::span& inputs_arg, recurrent_weightsZR.cbegin(), recurrent_weightsZR.cend(), hidden_size_, beta, outputZRH_.begin() + out_added_offset, outputZRH_.end(), - hidden_size_x3); + hidden_size_x3, ttp_); DumpMatrix("Ht-1 * R[zr] + Xt*(W[zr]^T)" + seqno_str, outputZRH_.data() + out_added_offset, batch_size_, hidden_size_x2, 0, hidden_size_x3); @@ -622,7 +631,7 @@ void UniDirectionalGru::Compute(const gsl::span& inputs_arg, recurrent_weightsH.cbegin(), recurrent_weightsH.cend(), // Rh^T hidden_size_, beta, linear_output_.begin(), linear_output_.end(), // pre: Rbh, post:output - hidden_size_); + hidden_size_, ttp_); DumpMatrix("Ht-1 * (Rh^T) + Rbh " + seqno_str, linear_output_.data(), batch_size_, hidden_size_); } @@ -693,7 +702,7 @@ void UniDirectionalGru::Compute(const gsl::span& inputs_arg, recurrent_weightsH.cbegin(), recurrent_weightsH.cend(), // Rh^T hidden_size_, beta, out_H, outputZRH_.end(), - hidden_size_x3); + hidden_size_x3, ttp_); } DumpMatrix("Xt*(Wh^T) + (" + label + ")" + seqno_str, outputZRH_.data() + out_added_offset, diff --git a/onnxruntime/core/providers/cpu/rnn/deep_cpu_lstm.cc b/onnxruntime/core/providers/cpu/rnn/deep_cpu_lstm.cc index 8f4e8236981f8..682dabd9262ca 100644 --- a/onnxruntime/core/providers/cpu/rnn/deep_cpu_lstm.cc +++ b/onnxruntime/core/providers/cpu/rnn/deep_cpu_lstm.cc @@ -9,6 +9,9 @@ #pragma warning(disable : 4996) #endif +#include "core/platform/threadpool.h" +#include "core/framework/op_kernel_context_internal.h" + #include "core/providers/cpu/rnn/deep_cpu_lstm.h" #include "core/common/common.h" @@ -193,7 +196,8 @@ class UniDirectionalLstm { const gsl::span& initial_hidden_state, const gsl::span& initial_cell_state, const ActivationFuncs::Entry& activation_func_f, const ActivationFuncs::Entry& activation_func_g, const ActivationFuncs::Entry& activation_func_h, float clip, - onnxruntime::concurrency::ThreadPool& ttp); + concurrency::ThreadPool& lstm_tp_, + concurrency::ThreadPool* mlas_tp_); void Compute(const gsl::span& inputs, const gsl::span& sequence_lengths, int num_directions, const gsl::span& input_weights, const gsl::span& recurrent_weights, @@ -275,7 +279,8 @@ class UniDirectionalLstm { ActivationInfo activation_g_; ActivationInfo activation_h_; - onnxruntime::concurrency::ThreadPool& ttp_; + concurrency::ThreadPool& lstm_tp_; + concurrency::ThreadPool* mlas_tp_; }; } // namespace detail @@ -309,6 +314,9 @@ DeepCpuLstmOp::Compute(OpKernelContext* context) const { template Status DeepCpuLstmOp::ComputeImpl(OpKernelContext& context) const { + auto ctx_internal = static_cast(&context); + concurrency::ThreadPool* mlas_thread_pool = ctx_internal->GetOperatorThreadPool(); + auto& logger = context.Logger(); const Tensor& X = *context.Input(0); // inputs. 
[seq_length, batch_size, input_size] @@ -452,7 +460,7 @@ Status DeepCpuLstmOp::ComputeImpl(OpKernelContext& context) const { activation_funcs_.Entries()[0], activation_funcs_.Entries()[1], activation_funcs_.Entries()[2], - clip_, ttp_); + clip_, lstm_tp_, mlas_thread_pool); detail::UniDirectionalLstm bw(alloc, logger, seq_length, batch_size, input_size, hidden_size_, Direction::kReverse, input_forget_, @@ -460,7 +468,7 @@ Status DeepCpuLstmOp::ComputeImpl(OpKernelContext& context) const { activation_funcs_.Entries()[3], activation_funcs_.Entries()[4], activation_funcs_.Entries()[5], - clip_, ttp_); + clip_, lstm_tp_, mlas_thread_pool); fw.Compute(input, sequence_lens_span, num_directions_, input_weights_1, recurrent_weights_1, output_1, hidden_output_1, last_cell_1); @@ -473,7 +481,7 @@ Status DeepCpuLstmOp::ComputeImpl(OpKernelContext& context) const { activation_funcs_.Entries()[0], activation_funcs_.Entries()[1], activation_funcs_.Entries()[2], - clip_, ttp_); + clip_, lstm_tp_, mlas_thread_pool); fw.Compute(input, sequence_lens_span, num_directions_, input_weights_1, recurrent_weights_1, output_1, hidden_output_1, last_cell_1); @@ -546,7 +554,8 @@ UniDirectionalLstm::UniDirectionalLstm(AllocatorPtr allocator, const ActivationFuncs::Entry& activation_func_g, const ActivationFuncs::Entry& activation_func_h, const float clip, - onnxruntime::concurrency::ThreadPool& ttp) + concurrency::ThreadPool& lstm_tp, + concurrency::ThreadPool* mlas_tp) : allocator_(allocator), logger_(logger), seq_length_(seq_length), @@ -558,7 +567,8 @@ UniDirectionalLstm::UniDirectionalLstm(AllocatorPtr allocator, clip_(clip), use_bias_(!bias.empty()), use_peepholes_(!peephole_weights.empty()), - ttp_(ttp) { + lstm_tp_(lstm_tp), + mlas_tp_(mlas_tp) { activation_f_ = {deepcpu::ActivationFuncByName(activation_func_f.name), activation_func_f.alpha, activation_func_f.beta}; @@ -774,7 +784,7 @@ void UniDirectionalLstm::Compute(const gsl::span& inputs_arg, input_weights.cbegin(), input_weights.cend(), // W[iofc] input_size_, beta, output_iofc_.begin(), output_iofc_.end(), - hidden_size_x4); + hidden_size_x4, mlas_tp_); DumpMatrix("Xt*(W[iofc]^T)", output_iofc_.data(), total_rows, hidden_size_x4); @@ -823,7 +833,7 @@ void UniDirectionalLstm::Compute(const gsl::span& inputs_arg, recurrent_weights.cbegin(), recurrent_weights.cend(), // R[iofc] hidden_size_, beta, step_out_IOFC, output_iofc_.end(), // input contains Xt*(W[iofc]^T) - hidden_size_x4); + hidden_size_x4, mlas_tp_); DumpMatrix("Xt*(W[iofc]^T) + Ht-t*R[iofc]" + row_str, &*step_out_IOFC, local_fused_hidden_rows, hidden_size_x4); @@ -874,7 +884,7 @@ void UniDirectionalLstm::Compute(const gsl::span& inputs_arg, } }; - ExecuteLambdaInParallel("Processing batch", hidden_gemm_and_activations, batch_size_, fused_hidden_rows, ttp_, logger_); + ExecuteLambdaInParallel("Processing batch", hidden_gemm_and_activations, batch_size_, fused_hidden_rows, lstm_tp_, logger_); } else { span_T_iter c_prev = batched_internal_state_prev_one_step.begin(); @@ -901,7 +911,7 @@ void UniDirectionalLstm::Compute(const gsl::span& inputs_arg, recurrent_weights.cbegin(), recurrent_weights.cend(), // R[iofc] hidden_size_, beta, step_out_IOFC, output_iofc_.end(), // input contains Xt*(W[iofc]^T) - hidden_size_x4); + hidden_size_x4, mlas_tp_); span_T_iter batched_output; span_T_iter batched_output_end; diff --git a/onnxruntime/core/providers/cpu/rnn/deep_cpu_lstm.h b/onnxruntime/core/providers/cpu/rnn/deep_cpu_lstm.h index 606dfbf5b190c..faf32e3a77a2f 100644 --- 
a/onnxruntime/core/providers/cpu/rnn/deep_cpu_lstm.h +++ b/onnxruntime/core/providers/cpu/rnn/deep_cpu_lstm.h @@ -82,8 +82,8 @@ class DeepCpuLstmOp final : public OpKernel { // across them. mutable due to this. // The alternative would be to create a threadpool in each call to Compute but that would incur thread creation // cost on every call. - mutable onnxruntime::concurrency::ThreadPool ttp_{"DEEPCPU_LSTM", - static_cast(std::thread::hardware_concurrency())}; + mutable onnxruntime::concurrency::ThreadPool lstm_tp_{"DEEPCPU_LSTM", + static_cast(std::thread::hardware_concurrency())}; }; } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/rnn/rnn.cc b/onnxruntime/core/providers/cpu/rnn/rnn.cc index 4030d65a94d45..d26b02f81ae68 100644 --- a/onnxruntime/core/providers/cpu/rnn/rnn.cc +++ b/onnxruntime/core/providers/cpu/rnn/rnn.cc @@ -1,5 +1,6 @@ // Copyright (c) Microsoft Corporation. All rights reserved. // Licensed under the MIT License. +#include "core/framework/op_kernel_context_internal.h" #include "core/providers/cpu/rnn/rnn.h" #include "core/providers/cpu/rnn/rnn_activation_functors.h" @@ -99,6 +100,8 @@ using EigenMatrixMapRowMajor = Eigen::Map< template <> Status RNN::Compute(OpKernelContext* ctx) const { using namespace rnn::detail; + auto ctx_internal = static_cast(ctx); + concurrency::ThreadPool* tp = ctx_internal->GetOperatorThreadPool(); // inputs const Tensor& X = *ctx->Input(0); @@ -160,7 +163,7 @@ Status RNN::Compute(OpKernelContext* ctx) const { } // X * W[direction]^t + B - math::Gemm( + math::Gemm( CblasNoTrans, CblasTrans, static_cast(seq_length * batch_size), @@ -171,7 +174,7 @@ Status RNN::Compute(OpKernelContext* ctx) const { W.template Data() + direction * hidden_size_ * input_size, 1, x_matmul_w_buffer_data, - &CPUMathUtil::Instance()); + tp); for (int64_t t = 0; t < seq_length; t++) { int64_t time_step = isReverse ? 
(seq_length - t - 1) : t; @@ -181,8 +184,12 @@ Status RNN::Compute(OpKernelContext* ctx) const { const float* h_prev = nullptr; if (t == 0) { - if (initial_h != nullptr) - h_prev = initial_h->template Data(); + if (initial_h != nullptr) { + // the shape of initial_h is [num_directions, batch_size, hidden_size] + // so pick the offset (multiple of Y_frame_size == batch_size * hidden_size_) + // based on the direction + h_prev = initial_h->template Data() + (direction * Y_frame_size); + } } else { if (isReverse) h_prev = Y_buffer_data_current_frame + num_directions * Y_frame_size; @@ -192,7 +199,7 @@ Status RNN::Compute(OpKernelContext* ctx) const { if (h_prev != nullptr) { // H_t_1 * R[direction]^t - math::Gemm( + math::Gemm( CblasNoTrans, CblasTrans, static_cast(batch_size), @@ -203,7 +210,7 @@ Status RNN::Compute(OpKernelContext* ctx) const { R.template Data() + direction * hidden_size_ * hidden_size_, 0, Y_buffer_data_current_frame, - &CPUMathUtil::Instance()); + tp); } else { math::Set(batch_size * hidden_size_, 0, Y_buffer_data_current_frame, &CPUMathUtil::Instance()); } diff --git a/onnxruntime/core/providers/cpu/rnn/rnn_helpers.h b/onnxruntime/core/providers/cpu/rnn/rnn_helpers.h index 2e3e5f88d72ec..f1038e63a350e 100644 --- a/onnxruntime/core/providers/cpu/rnn/rnn_helpers.h +++ b/onnxruntime/core/providers/cpu/rnn/rnn_helpers.h @@ -159,7 +159,7 @@ void ComputeGemm(const int M, const float beta, TSpanCIter C, TSpanCIter C_end, - const int ldc) { + const int ldc, concurrency::ThreadPool* tp) { // validate all the inputs // need to use the lda/ldb/ldc strides which should be >= the columns for the span ORT_ENFORCE(lda >= K && ldb >= K && ldc >= N); @@ -167,12 +167,12 @@ void ComputeGemm(const int M, ORT_ENFORCE(B + (N * ldb - (ldb - K)) <= B_end); ORT_ENFORCE(C + (M * ldc - (ldc - N)) <= C_end); - ::onnxruntime::math::GemmEx( + ::onnxruntime::math::GemmEx( CblasNoTrans, CblasTrans, M, N, K, alpha, &*A, lda, &*B, ldb, beta, - &*C, ldc, &CPUMathUtil::Instance()); + &*C, ldc, tp); } // helper to convert a span to a raw pointer diff --git a/onnxruntime/core/providers/cpu/symbols.txt b/onnxruntime/core/providers/cpu/symbols.txt index fc7560f5b7696..1d2750e5d3d3e 100644 --- a/onnxruntime/core/providers/cpu/symbols.txt +++ b/onnxruntime/core/providers/cpu/symbols.txt @@ -12,7 +12,7 @@ OrtCompareAllocatorInfo OrtCreateAllocatorInfo OrtCreateCpuAllocatorInfo OrtCreateCustomOpDomain -OrtCreateDefaultAllocator +OrtGetAllocatorWithDefaultOptions OrtCreateEnv OrtCreateEnvWithCustomLogger OrtCreateRunOptions @@ -41,7 +41,6 @@ OrtGetErrorMessage OrtGetStringTensorContent OrtGetStringTensorDataLength OrtGetTensorElementType -OrtGetTensorMemSizeInBytesFromTensorProto OrtGetTensorMutableData OrtGetTensorShapeElementCount OrtGetTensorTypeAndShape @@ -52,7 +51,6 @@ OrtGetValueType OrtGetVersionString OrtIsTensor OrtGetOnnxTypeFromTypeInfo -OrtReleaseAllocator OrtReleaseAllocatorInfo OrtReleaseCustomOpDomain OrtReleaseEnv @@ -64,10 +62,10 @@ OrtReleaseTensorTypeAndShapeInfo OrtReleaseTypeInfo OrtReleaseValue OrtRun -OrtRunCallback OrtRunOptionsGetRunLogVerbosityLevel OrtRunOptionsGetRunTag OrtRunOptionsSetRunLogVerbosityLevel +OrtRunOptionsSetRunLogSeverityLevel OrtRunOptionsSetRunTag OrtRunOptionsEnableTerminate OrtRunOptionsDisableTerminate @@ -82,6 +80,7 @@ OrtSetDimensions OrtSetSessionGraphOptimizationLevel OrtSetSessionLogId OrtSetSessionLogVerbosityLevel +OrtSetSessionLogSeverityLevel +OrtSetOptimizedModelFilePath OrtSetSessionThreadPoolSize OrtSetTensorElementType -OrtTensorProtoToOrtValue diff 
--git a/onnxruntime/core/providers/cpu/tensor/cast_op.cc b/onnxruntime/core/providers/cpu/tensor/cast_op.cc index c326d25ef17a0..0f8da8eaff2a6 100644 --- a/onnxruntime/core/providers/cpu/tensor/cast_op.cc +++ b/onnxruntime/core/providers/cpu/tensor/cast_op.cc @@ -10,7 +10,7 @@ #include "Eigen/src/Core/arch/GPU/Half.h" #include "core/common/common.h" -#if defined(USE_MLAS) && defined(_M_AMD64) +#if defined(_M_AMD64) #include "core/mlas/inc/mlas.h" #endif @@ -40,7 +40,7 @@ inline void CastData(const Tensor* in, Tensor* out, const Tens auto out_data = out->template MutableData(); auto in_data = in->template Data(); auto shape_size = shape.Size(); -#if defined(USE_MLAS) && defined(_M_AMD64) +#if defined(_M_AMD64) MlasConvertHalfToFloatBuffer(&in_data[0].val, out_data, shape_size); #else auto in_vector = ConstEigenVectorMap(static_cast(static_cast(in_data)), shape_size); diff --git a/onnxruntime/core/providers/cpu/tensor/compress.cc b/onnxruntime/core/providers/cpu/tensor/compress.cc index e732121adbf02..b3f82bf9fdc2a 100644 --- a/onnxruntime/core/providers/cpu/tensor/compress.cc +++ b/onnxruntime/core/providers/cpu/tensor/compress.cc @@ -9,7 +9,8 @@ namespace onnxruntime { ONNX_CPU_OPERATOR_KERNEL( Compress, 9, - KernelDefBuilder().TypeConstraint("T", DataTypeImpl::AllTensorTypes()), + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::AllTensorTypes()) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), Compress); Status Compress::Compute(OpKernelContext* ctx) const { diff --git a/onnxruntime/core/providers/cpu/tensor/concat.cc b/onnxruntime/core/providers/cpu/tensor/concat.cc index afca4d421efe8..0a26ea2a0dd42 100644 --- a/onnxruntime/core/providers/cpu/tensor/concat.cc +++ b/onnxruntime/core/providers/cpu/tensor/concat.cc @@ -34,16 +34,17 @@ Status ConcatBase::PrepareForCompute(OpKernelContext* ctx, int input_count, Prep auto& inputs_n = *tensor_pointer; const auto& inputs_n_dims = inputs_n.Shape().GetDims(); const size_t inputs_n_rank = inputs_n_dims.size(); - ORT_ENFORCE(inputs_n_rank == inputs_0_rank, "Ranks of input data are different, cannot concatenate them, " - "expected rank: ", std::to_string(inputs_0_rank), " got: ", std::to_string(inputs_n_rank)); + ORT_ENFORCE(inputs_n_rank == inputs_0_rank, + "Ranks of input data are different, cannot concatenate them. 
expected rank: ", + inputs_0_rank, " got: ", inputs_n_rank); // Ensure all the other (non-concat) axes match for (size_t axis_index = 0; axis_index < inputs_0_rank; ++axis_index) { num_elements *= inputs_n_dims[axis_index]; if (axis_index == p.axis) continue; ORT_RETURN_IF_NOT(inputs_n_dims[axis_index] == inputs_0_dims[axis_index], - "Non concat axis dimensions must match: Axis ", - axis_index, " has mismatched dimensions of ", inputs_n_dims[axis_index], + "Non concat axis dimensions must match: Axis ", + axis_index, " has mismatched dimensions of ", inputs_n_dims[axis_index], " and ", inputs_0_dims[axis_index]); } tensor_num_elements[index] = num_elements; @@ -58,7 +59,7 @@ Status ConcatBase::PrepareForCompute(OpKernelContext* ctx, int input_count, Prep // Calculate the shape of the output tensor std::vector dims(inputs_0_rank); - size_t num_elements = 1; // cache size of the first input along the way + size_t num_elements = 1; // cache size of the first input along the way for (size_t dimension_index = 0; dimension_index < inputs_0_rank; dimension_index++) { dims[dimension_index] = inputs_0_dims[dimension_index]; num_elements *= inputs_0_dims[dimension_index]; @@ -66,7 +67,7 @@ Status ConcatBase::PrepareForCompute(OpKernelContext* ctx, int input_count, Prep tensor_num_elements[0] = num_elements; dims[p.axis] = concat_axis_size; TensorShape output_shape(dims); - + auto& concat_result = *ctx->Output(0, output_shape); p.output_tensor = &concat_result; p.output_num_elements = output_shape.Size(); @@ -75,7 +76,7 @@ Status ConcatBase::PrepareForCompute(OpKernelContext* ctx, int input_count, Prep // there is no need to proceed further if (p.output_num_elements == 0) return Status::OK(); - + // The output_axis_pitch is the number of elements to add to move to the next split axis in the output p.output_axis_pitch = 1; for (size_t i = inputs_0_rank; i-- > p.axis;) p.output_axis_pitch *= dims[i]; @@ -110,7 +111,7 @@ Status Concat::Compute(OpKernelContext* ctx) const { auto is_string_type = ctx->Input(0)->DataType() == DataTypeImpl::GetType(); - int64_t output_offset = 0; + int64_t initial_output_offset = 0; // initial offset for each input auto element_bytes = p.output_tensor->DataType()->Size(); for (int input_index = 0; input_index < input_count; input_index++) { const auto& prep = p.inputs[input_index]; @@ -124,19 +125,29 @@ Status Concat::Compute(OpKernelContext* ctx) const { // Copy the data across. 
For every 'input_axis_pitch' values copied, we move over by the 'output_axis_pitch' uint8_t* output = static_cast(p.output_tensor->MutableDataRaw()); - for (size_t idxCopy = 0; idxCopy < input_size / input_axis_pitch; ++idxCopy) { + int64_t cur_out_offset = 0; + int64_t cur_in_offset = 0; + for (size_t idx_copy = 0, end = input_size / input_axis_pitch; idx_copy < end; ++idx_copy) { if (is_string_type) { - for (int idxItem = 0; idxItem < input_axis_pitch; ++idxItem) - reinterpret_cast(output)[output_offset + idxCopy * p.output_axis_pitch + idxItem] = - reinterpret_cast(input)[idxCopy * input_axis_pitch + idxItem]; - } else + size_t out = initial_output_offset + cur_out_offset; + for (int idx_item = 0; idx_item < input_axis_pitch; ++idx_item) { + reinterpret_cast(output)[out + idx_item] = + reinterpret_cast(input)[cur_in_offset + idx_item]; + } + } else { memcpy( - output + (output_offset + idxCopy * p.output_axis_pitch) * element_bytes, - input + idxCopy * input_axis_pitch * element_bytes, + output + (initial_output_offset + cur_out_offset) * element_bytes, + input + cur_in_offset * element_bytes, input_axis_pitch * element_bytes); + } + + cur_out_offset += p.output_axis_pitch; + cur_in_offset += input_axis_pitch; } - output_offset += input_axis_pitch; + + initial_output_offset += input_axis_pitch; } + return Status::OK(); } diff --git a/onnxruntime/core/providers/cpu/tensor/dynamicquantizelinear.cc b/onnxruntime/core/providers/cpu/tensor/dynamicquantizelinear.cc new file mode 100644 index 0000000000000..dafa3a322f5e8 --- /dev/null +++ b/onnxruntime/core/providers/cpu/tensor/dynamicquantizelinear.cc @@ -0,0 +1,75 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "dynamicquantizelinear.h" +#include "core/providers/common.h" +#include "core/util/math_cpuonly.h" +#include +#include + +namespace onnxruntime { + +ONNX_CPU_OPERATOR_TYPED_KERNEL( + DynamicQuantizeLinear, + 11, + uint8_t, + KernelDefBuilder() + .TypeConstraint("T2", DataTypeImpl::GetTensorType()), + DynamicQuantizeLinear); + + +static float RoundHalfToEven(float input) { + std::fesetround(FE_TONEAREST); + auto result = std::nearbyintf(input); + return result; +} + +// formula is Y = X / Scale + ZeroPoint +template +Status DynamicQuantizeLinear::Compute(OpKernelContext* ctx) const { + auto x_ptr = ctx->Input(0); + ORT_ENFORCE(x_ptr != nullptr); + auto& x = *x_ptr; + const auto* x_data = x.template Data(); + + auto& y = *ctx->Output(0, x.Shape()); + std::vector shape({}); + auto& y_scale = *ctx->Output(1, shape); + auto& y_zeropoint = *ctx->Output(2, shape); + + // find quantization range min and max + float qmax = std::numeric_limits::max(); + float qmin = std::numeric_limits::min(); + // Adjust the int8 range to -127 to 127 so that zero point can be 0 + if (qmin == -128) { + qmin = -127; + } + + // find input range min and max + auto min = ConstEigenVectorMap(x_data, x.Shape().Size()).minCoeff(); + min = std::min(min, qmin); + auto max = ConstEigenVectorMap(x_data, x.Shape().Size()).maxCoeff(); + max = std::max(max, qmin); + + // find scale and zero point + auto scale = (max - min) / (qmax - qmin); + auto* output_scale = y_scale.template MutableData(); + *output_scale = scale; + + const auto initial_zero_point = qmin - min / scale; + auto zero_point = static_cast(RoundHalfToEven(std::max(qmin, std::min(qmax, initial_zero_point)))); + auto* output_zp = y_zeropoint.template MutableData(); + *output_zp = zero_point; + + // quantize the data + auto* output = y.template 
MutableData(); + const auto num_of_elements = x.Shape().Size(); + + for (int i = 0; i < num_of_elements; ++i) { + output[i] = static_cast(clamp(RoundHalfToEven(static_cast(x_data[i] / scale)) + zero_point, qmin, qmax)); + } + + return Status::OK(); +} + +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/tensor/dynamicquantizelinear.h b/onnxruntime/core/providers/cpu/tensor/dynamicquantizelinear.h new file mode 100644 index 0000000000000..fa15cc9126cb6 --- /dev/null +++ b/onnxruntime/core/providers/cpu/tensor/dynamicquantizelinear.h @@ -0,0 +1,20 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include "core/common/common.h" +#include "core/framework/op_kernel.h" + +namespace onnxruntime { + +template +class DynamicQuantizeLinear final : public OpKernel { + public: + DynamicQuantizeLinear(const OpKernelInfo& info) : OpKernel(info) { + } + + Status Compute(OpKernelContext* context) const override; + +}; +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/tensor/identity_op.cc b/onnxruntime/core/providers/cpu/tensor/identity_op.cc index b7fe35c73f039..f431d9de70185 100644 --- a/onnxruntime/core/providers/cpu/tensor/identity_op.cc +++ b/onnxruntime/core/providers/cpu/tensor/identity_op.cc @@ -10,7 +10,8 @@ ONNX_CPU_OPERATOR_VERSIONED_KERNEL( 7, 9, KernelDefBuilder().TypeConstraint("T", {DataTypeImpl::GetTensorType(), DataTypeImpl::GetTensorType(), - DataTypeImpl::GetTensorType()}), + DataTypeImpl::GetTensorType()}) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), IdentityOp); ONNX_CPU_OPERATOR_KERNEL( diff --git a/onnxruntime/core/providers/cpu/tensor/nonzero_op.cc b/onnxruntime/core/providers/cpu/tensor/nonzero_op.cc index 7c725bba0f8b6..ef16693dc73c4 100644 --- a/onnxruntime/core/providers/cpu/tensor/nonzero_op.cc +++ b/onnxruntime/core/providers/cpu/tensor/nonzero_op.cc @@ -23,7 +23,7 @@ namespace onnxruntime { // start with a subset of types, enable more as needed... NONZERO_TYPED_KERNEL(bool) -//NONZERO_TYPED_KERNEL(uint8_t) +NONZERO_TYPED_KERNEL(uint8_t) //NONZERO_TYPED_KERNEL(uint16_t) //NONZERO_TYPED_KERNEL(uint32_t) //NONZERO_TYPED_KERNEL(uint64_t) @@ -40,24 +40,6 @@ NONZERO_TYPED_KERNEL(float) #undef NONZERO_TYPED_KERNEL_WITH_TYPE_NAME #undef NONZERO_TYPED_KERNEL -namespace { -void IncrementCoordinate(const TensorShape& shape, std::vector* coordinate) { - assert(coordinate->size() == shape.NumDimensions()); - - size_t i = 0; - const size_t i_end = coordinate->size(); - for (; i < i_end; ++i) { - const size_t i_from_back = i_end - i - 1; - if ((*coordinate)[i_from_back] != shape[i_from_back] - 1) break; - (*coordinate)[i_from_back] = 0; - } - - if (i < i_end) { - ++(*coordinate)[i_end - i - 1]; - } -} -} // namespace - template Status NonZero::Compute(OpKernelContext* context) const { const auto X = context->Input(0); @@ -71,19 +53,37 @@ Status NonZero::Compute(OpKernelContext* context) const { // reserve enough space for indices for every element of X non_zero_indices_buffer.reserve(X_shape.Size() * coordinate_size); + const T* data = X->Data(); + if (X_shape.IsScalar()) { - const T& value = *(X->Data()); + const T& value = *data; if (value != T{}) { non_zero_indices_buffer.push_back(0); } } else { std::vector coordinate(coordinate_size, 0); - for (const T& value : X->DataAsSpan()) { + + // as we iterate the entries, increment the coordinate for the current entry + // e.g. 
if shape is {2,2}, we start with 0,0 increment to 0,1 increment to 1,0 and finally 1,1 + auto increment_coordinate = [&coordinate, &coordinate_size, &X_shape]() { + for (int64_t idx = coordinate_size - 1; idx >= 0; --idx) { + int64_t& cur_coord = coordinate[idx]; + if (cur_coord != X_shape[idx] - 1) { + ++cur_coord; + break; + } + cur_coord = 0; + } + }; + + for (size_t i = 0, end = X_shape.Size(); i < end; ++i) { + const T& value = *data++; if (value != T{}) { non_zero_indices_buffer.insert(non_zero_indices_buffer.end(), coordinate.begin(), coordinate.end()); } - IncrementCoordinate(X_shape, &coordinate); + + increment_coordinate(); } } diff --git a/onnxruntime/core/providers/cpu/tensor/onehot.cc b/onnxruntime/core/providers/cpu/tensor/onehot.cc index 1dfbaaf37640f..c4f0c2479a069 100644 --- a/onnxruntime/core/providers/cpu/tensor/onehot.cc +++ b/onnxruntime/core/providers/cpu/tensor/onehot.cc @@ -18,8 +18,9 @@ limitations under the License. #include "core/util/eigen_common_wrapper.h" #include "core/platform/env.h" +#ifndef EIGEN_USE_THREADS #define EIGEN_USE_THREADS - +#endif using namespace ::onnxruntime::common; using namespace std; @@ -46,6 +47,8 @@ REG_ONE_HOT_OP(float, int64_t, int64_t); REG_ONE_HOT_OP(int64_t, string, int64_t); REG_ONE_HOT_OP(float, string, int64_t); REG_ONE_HOT_OP(int64_t, float, int64_t); +REG_ONE_HOT_OP(int32_t, float, int32_t); +REG_ONE_HOT_OP(int32_t, float, float); REG_ONE_HOT_OP(float, float, float); // added this to satisfy onnx model tests REG_ONE_HOT_OP(int64_t, int32_t, float); // added this to satisfy onnx model tests diff --git a/onnxruntime/core/providers/cpu/tensor/quantize_linear.cc b/onnxruntime/core/providers/cpu/tensor/quantize_linear.cc index 5846bc102f565..e345ad4da3cd8 100644 --- a/onnxruntime/core/providers/cpu/tensor/quantize_linear.cc +++ b/onnxruntime/core/providers/cpu/tensor/quantize_linear.cc @@ -63,21 +63,22 @@ Status DequantizeLinear::Compute(OpKernelContext* ctx) const { ONNX_CPU_OPERATOR_TYPED_KERNEL( QuantizeLinear, 10, - float, + uint8_t, KernelDefBuilder() .TypeConstraint("x", DataTypeImpl::GetTensorType()) - .TypeConstraint("y_scale", DataTypeImpl::GetTensorType()) .TypeConstraint("y_zero_point", DataTypeImpl::GetTensorType()) .TypeConstraint("y", DataTypeImpl::GetTensorType()), - QuantizeLinear); - -// clamp doesn't exist in the version of that we're using, so -// make a local one. 
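A minimal standalone sketch of the scale / zero-point math shared by the quantization kernels above and below, assuming a uint8 output and the formula y = saturate(round_half_to_even(x / scale) + zero_point) from the hunks; the helper names and the zero-range guard are illustrative additions, not part of the patch:

#include <algorithm>
#include <cfenv>
#include <cmath>
#include <cstdint>
#include <vector>

// Round-half-to-even, as in the kernels above.
static float RoundHalfToEvenSketch(float v) {
  std::fesetround(FE_TONEAREST);
  return std::nearbyintf(v);
}

struct QuantParamsSketch {
  float scale;
  uint8_t zero_point;
};

// Derive scale/zero point the way DynamicQuantizeLinear does: extend the input
// range to include 0 so zero stays exactly representable, then
// scale = (max - min) / (qmax - qmin) and zero_point = round(qmin - min / scale).
QuantParamsSketch ComputeQuantParamsSketch(const std::vector<float>& x) {
  const float qmin = 0.0f, qmax = 255.0f;  // uint8 quantization range
  float min = std::min(0.0f, *std::min_element(x.begin(), x.end()));
  float max = std::max(0.0f, *std::max_element(x.begin(), x.end()));
  float scale = (max - min) / (qmax - qmin);
  if (scale == 0.0f) scale = 1.0f;  // guard for an all-zero input; not in the hunk
  const float initial_zp = qmin - min / scale;
  const auto zp = static_cast<uint8_t>(
      RoundHalfToEvenSketch(std::max(qmin, std::min(qmax, initial_zp))));
  return {scale, zp};
}

// Quantize one value with the shared formula.
uint8_t QuantizeOneSketch(float x, const QuantParamsSketch& p) {
  const float v = RoundHalfToEvenSketch(x / p.scale) + p.zero_point;
  return static_cast<uint8_t>(std::max(0.0f, std::min(255.0f, v)));
}

For example, an input spanning [-1, 2] yields scale = 3/255 and zero_point = 85.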
-static float clamp(float v, float lo, float hi) { - if (v < lo) return lo; - if (v > hi) return hi; - return v; -} + QuantizeLinear); + +ONNX_CPU_OPERATOR_TYPED_KERNEL( + QuantizeLinear, + 10, + int8_t, + KernelDefBuilder() + .TypeConstraint("x", DataTypeImpl::GetTensorType()) + .TypeConstraint("y_zero_point", DataTypeImpl::GetTensorType()) + .TypeConstraint("y", DataTypeImpl::GetTensorType()), + QuantizeLinear); static float RoundHalfToEven(float input) { std::fesetround(FE_TONEAREST); @@ -85,9 +86,9 @@ static float RoundHalfToEven(float input) { return result; } -template <> +template // formula is Y = X / Scale + ZeroPoint -Status QuantizeLinear::Compute(OpKernelContext* ctx) const { +Status QuantizeLinear::Compute(OpKernelContext* ctx) const { auto& x = *ctx->Input(0); auto& y_scale = *ctx->Input(1); auto& y_zero_point = *ctx->Input(2); @@ -102,14 +103,18 @@ Status QuantizeLinear::Compute(OpKernelContext* ctx) const { ORT_ENFORCE(scale_shape.NumDimensions() == 0 || (scale_shape.NumDimensions() == 1 && scale_shape.GetDims().size() == 1), "x_scale must be a scalar."); ORT_ENFORCE(zero_point_shape.NumDimensions() == 0 || (zero_point_shape.NumDimensions() == 1 && zero_point_shape.GetDims().size() == 1), "x_zero_point must be a scalar."); - const uint8_t zero_point = *(y_zero_point.template Data()); + const T zero_point = *(y_zero_point.template Data()); const float scale = *(y_scale.template Data()); const auto* input = x.template Data(); - auto* output = y.template MutableData(); + auto* output = y.template MutableData(); const auto num_of_elements = x_shape.Size(); + const float qmax = std::numeric_limits::max(); + const float qmin_default = std::numeric_limits::min(); + // adjust qmin for int8 inputs. This is required to keep zero point as zero + const float qmin = qmin_default == -128 ? 
-127 : qmin_default; for (int i = 0; i < num_of_elements; ++i) { - output[i] = static_cast(clamp(RoundHalfToEven(static_cast(input[i]/scale)) + zero_point, 0.0f, float(UINT8_MAX))); + output[i] = static_cast(clamp(RoundHalfToEven(static_cast(input[i]/scale)) + zero_point, qmin, qmax)); } return Status::OK(); diff --git a/onnxruntime/core/providers/cpu/tensor/size.cc b/onnxruntime/core/providers/cpu/tensor/size.cc index 675c14b8cfee6..75bdd5bec204e 100644 --- a/onnxruntime/core/providers/cpu/tensor/size.cc +++ b/onnxruntime/core/providers/cpu/tensor/size.cc @@ -41,7 +41,8 @@ ONNX_CPU_OPERATOR_KERNEL( DataTypeImpl::GetTensorType(), DataTypeImpl::GetTensorType(), DataTypeImpl::GetTensorType(), - DataTypeImpl::GetTensorType()})), + DataTypeImpl::GetTensorType()})) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), Size); } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/tensor/tile.cc b/onnxruntime/core/providers/cpu/tensor/tile.cc index 984f490adec9d..1b0ab391fbe41 100644 --- a/onnxruntime/core/providers/cpu/tensor/tile.cc +++ b/onnxruntime/core/providers/cpu/tensor/tile.cc @@ -34,7 +34,8 @@ ONNX_CPU_OPERATOR_KERNEL( DataTypeImpl::GetTensorType(), DataTypeImpl::GetTensorType(), DataTypeImpl::GetTensorType(), - DataTypeImpl::GetTensorType()}), + DataTypeImpl::GetTensorType()}) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), Tile); Status TileCoreForFixedSizeTypes(const Tensor& input_tensor, Tensor& output_tensor, const int64_t* repeats, TensorAxisCounters& input_counters, const TensorPitches& output_pitches, size_t element_size) { diff --git a/onnxruntime/core/providers/cpu/tensor/upsample.cc b/onnxruntime/core/providers/cpu/tensor/upsample.cc index 95605dbef4a68..3dcfcb47a353b 100644 --- a/onnxruntime/core/providers/cpu/tensor/upsample.cc +++ b/onnxruntime/core/providers/cpu/tensor/upsample.cc @@ -3,6 +3,7 @@ #include "core/providers/cpu/tensor/upsample.h" #include +#include using namespace onnxruntime::common; using namespace std; @@ -61,14 +62,18 @@ Status UpsampleNearest(const T* input, T* output, const TensorShape& input_shape, const TensorShape& output_shape, - const vector& scales) { + const vector& scales, + bool is_resize) { if (!input || !output) - return Status(ONNXRUNTIME, FAIL, "Upsample: input/output value is nullptr"); + return Status(ONNXRUNTIME, FAIL, is_resize ? "Resize: input/output value is nullptr" : + "Upsample: input/output value is nullptr"); if (input_shape.NumDimensions() != output_shape.NumDimensions()) - return Status(ONNXRUNTIME, FAIL, "Upsample: input/output value's dimension mismatch"); + return Status(ONNXRUNTIME, FAIL, is_resize ? "Resize: input/output value's dimension mismatch" : + "Upsample: input/output value's dimension mismatch"); if (input_shape.NumDimensions() == 0) { return Status(common::ONNXRUNTIME, common::INVALID_ARGUMENT, - "Upsample: input shape needs to be at least a single dimension."); + is_resize ? "Resize: input shape needs to be at least a single dimension" : + "Upsample: input shape needs to be at least a single dimension."); } int64_t n_dim = static_cast(input_shape.NumDimensions()); @@ -192,11 +197,14 @@ Status upsampleLiner(const T* input, T* output, const TensorShape& input_shape, const TensorShape& output_shape, - const vector& scales) { + const vector& scales, + bool is_resize) { if (!input || !output) - return Status(ONNXRUNTIME, FAIL, "Upsample: input/output value is nullptr"); + return Status(ONNXRUNTIME, FAIL, is_resize ? 
"Resize: input / output value is nullptr" : + "Upsample: input / output value is nullptr"); if (input_shape.NumDimensions() != output_shape.NumDimensions()) - return Status(ONNXRUNTIME, FAIL, "Upsample: input/output value's dimension mismatch"); + return Status(ONNXRUNTIME, FAIL, is_resize ? "Resize: input/output value's dimension mismatch" : + "Upsample: input/output value's dimension mismatch"); auto n_dim = input_shape.NumDimensions(); for (size_t i = 0, size = output_shape.Size(); i < size; i++) { std::vector val1; @@ -242,6 +250,11 @@ Status upsampleLiner(const T* input, return Status::OK(); } +// The following method supports a 4-D input in 'Linear mode' +// that amounts to 'Bilinear' Upsampling/Resizing in the sense that it assumes +// the scale values for the outermost 2 dimensions are 1. +// This is the common use-case where the 4-D input (batched multi-channel images) +// is usually of shape [N, C, H, W] and the scales are [1.0, 1.0, height_scale, width_scale] template void upsampleBilinear( int64_t batch_size, @@ -327,9 +340,10 @@ Status Upsample::BaseCompute(OpKernelContext* context, const std::vector& dims = X->Shape().GetDims(); - if (dims.size() != scales.size()) { - return Status(ONNXRUNTIME, INVALID_ARGUMENT, "Upsample: input tensor's dimension does not match the scales."); - } + if (dims.size() != scales.size()) + return Status(ONNXRUNTIME, INVALID_ARGUMENT, + is_resize ? "Resize: input tensor's dimension does not match the scales." : + "Upsample: input tensor's dimension does not match the scales."); bool no_scale = true; std::vector Y_dims; @@ -348,26 +362,33 @@ Status Upsample::BaseCompute(OpKernelContext* context, const std::vector(X->template Data(), Y->template MutableData(), X->Shape(), Y->Shape(), scales); + return UpsampleNearest(X->template Data(), Y->template MutableData(), X->Shape(), Y->Shape(), scales, is_resize); case UpsampleMode::LINEAR: { - //What's the correct behavior of linear mode is not clear right now, - //Only support bilinear with 4D tensor to keep consistent with previous behavior - if (dims.size() != 4) - return Status(ONNXRUNTIME, FAIL, "Upsample: linear mode upsample only support 4-D tensor with NCHW layout"); + //The correct behavior of 'linear' mode for an N-D input is not clear right now, + //so only support 'bilinear' with 2-D or 4-D input tensor with outermost 2 scales as 1 in the 4-D case + if (dims.size() != 2 && dims.size() != 4) { + std::ostringstream oss; + oss << "'Linear' mode only support 2-D inputs ('Bilinear') or 4-D inputs " + "with the corresponding outermost 2 scale values being 1 in the "; + oss << (is_resize ? "Resize operator" : "Upsample operator"); + return Status(ONNXRUNTIME, FAIL, oss.str()); + } - const int64_t batch_size = dims[0]; - const int64_t num_channels = dims[1]; - const int64_t input_height = dims[2]; - const int64_t input_width = dims[3]; + bool is_2D = dims.size() == 2; + const int64_t batch_size = is_2D ? 1 : dims[0]; + const int64_t num_channels = is_2D ? 1 : dims[1]; + const int64_t input_height = is_2D ? dims[0] : dims[2]; + const int64_t input_width = is_2D ? dims[1] : dims[3]; AllocatorPtr alloc; ORT_RETURN_IF_ERROR(context->GetTempSpaceAllocator(&alloc)); upsampleBilinear(batch_size, num_channels, input_height, input_width, - scales[2], scales[3], X->template Data(), Y->template MutableData(), alloc); + is_2D ? scales[0] : scales[2], is_2D ? 
scales[1] : scales[3], + X->template Data(), Y->template MutableData(), alloc); return Status::OK(); } default: - return Status(ONNXRUNTIME, FAIL, "Upsample: unexpected mode"); + return Status(ONNXRUNTIME, FAIL, is_resize ? "Resize: unexpected mode" : "Upsample: unexpected mode"); } } @@ -380,9 +401,9 @@ Status Upsample::Compute(OpKernelContext* context) const { const auto* scales = context->Input(1); ORT_ENFORCE(scales != nullptr); int64_t scales_size = scales->Shape().Size(); - std::vector scales_arrary(scales_size); - ParseScalesData(scales, scales_arrary); - return BaseCompute(context, scales_arrary); + std::vector scales_array(scales_size); + ParseScalesData(scales, scales_array); + return BaseCompute(context, scales_array); } } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cpu/tensor/upsample.h b/onnxruntime/core/providers/cpu/tensor/upsample.h index 5c57295af5195..97b41e0915d89 100644 --- a/onnxruntime/core/providers/cpu/tensor/upsample.h +++ b/onnxruntime/core/providers/cpu/tensor/upsample.h @@ -72,9 +72,10 @@ class UpsampleBase { } if (UpsampleMode::LINEAR == mode) { - ORT_ENFORCE(scales.size() == 4, "Upsample: linear mode upsample only support bilinear with 4 dimension."); - ORT_ENFORCE(((scales[0] == 1) && (scales[1] == 1)), - "Upsample: linear mode upsample only support bilinear, the first 2 scales should be 1."); + ORT_ENFORCE(scales.size() == 2 || (scales.size() == 4 && scales[0] == 1 && scales[1] == 1), + "'Linear' mode only support 2-D inputs ('Bilinear') or 4-D inputs " + "with the corresponding outermost 2 scale values being 1 in the ", + is_resize ? "Resize operator" : "Upsample operator"); } } diff --git a/onnxruntime/core/providers/cpu/tensor/where_op.cc b/onnxruntime/core/providers/cpu/tensor/where_op.cc index 21e0243dcf46a..bd946c4619f1e 100644 --- a/onnxruntime/core/providers/cpu/tensor/where_op.cc +++ b/onnxruntime/core/providers/cpu/tensor/where_op.cc @@ -29,7 +29,7 @@ namespace onnxruntime { //WHERE_TYPED_KERNEL(int8_t) //WHERE_TYPED_KERNEL(int16_t) WHERE_TYPED_KERNEL(int32_t) -//WHERE_TYPED_KERNEL(int64_t) +WHERE_TYPED_KERNEL(int64_t) //WHERE_TYPED_KERNEL(MLFloat16) //WHERE_TYPED_KERNEL(BFloat16) WHERE_TYPED_KERNEL(float) diff --git a/onnxruntime/core/providers/cuda/cuda_allocator.cc b/onnxruntime/core/providers/cuda/cuda_allocator.cc index 44cbbd75d0fc2..5241545763c38 100644 --- a/onnxruntime/core/providers/cuda/cuda_allocator.cc +++ b/onnxruntime/core/providers/cuda/cuda_allocator.cc @@ -61,8 +61,7 @@ void CUDAPinnedAllocator::Free(void* p) { } const OrtAllocatorInfo& CUDAPinnedAllocator::Info() const { - static constexpr OrtAllocatorInfo cuda_allocator_info(CUDA_PINNED, OrtDeviceAllocator, OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, 0), 0, OrtMemTypeCPUOutput); - return cuda_allocator_info; + return info_; } FencePtr CUDAPinnedAllocator::CreateFence(const SessionState* session_state) { diff --git a/onnxruntime/core/providers/cuda/cuda_allocator.h b/onnxruntime/core/providers/cuda/cuda_allocator.h index 06f6caa784c0e..2840dcb4088c3 100644 --- a/onnxruntime/core/providers/cuda/cuda_allocator.h +++ b/onnxruntime/core/providers/cuda/cuda_allocator.h @@ -9,7 +9,7 @@ namespace onnxruntime { class CUDAAllocator : public IDeviceAllocator { public: - CUDAAllocator(int device_id) : info_(CUDA, OrtAllocatorType::OrtDeviceAllocator, OrtDevice(OrtDevice::GPU, OrtDevice::MemType::DEFAULT, device_id), device_id, OrtMemTypeDefault) {} + CUDAAllocator(int device_id, const char* name) : info_(name, OrtAllocatorType::OrtDeviceAllocator, 
OrtDevice(OrtDevice::GPU, OrtDevice::MemType::DEFAULT, device_id), device_id, OrtMemTypeDefault) {} virtual void* Alloc(size_t size) override; virtual void Free(void* p) override; virtual const OrtAllocatorInfo& Info() const override; @@ -25,10 +25,14 @@ class CUDAAllocator : public IDeviceAllocator { //TODO: add a default constructor class CUDAPinnedAllocator : public IDeviceAllocator { public: + CUDAPinnedAllocator(int device_id, const char* name) : info_(name, OrtAllocatorType::OrtDeviceAllocator, OrtDevice(OrtDevice::CPU, OrtDevice::MemType::CUDA_PINNED, device_id), device_id, OrtMemTypeCPUOutput) {} virtual void* Alloc(size_t size) override; virtual void Free(void* p) override; virtual const OrtAllocatorInfo& Info() const override; virtual FencePtr CreateFence(const SessionState* session_state) override; + + private: + const OrtAllocatorInfo info_; }; } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cuda/cuda_execution_provider.cc b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc index 6509cf01fdf9a..04a87a120bf0f 100644 --- a/onnxruntime/core/providers/cuda/cuda_execution_provider.cc +++ b/onnxruntime/core/providers/cuda/cuda_execution_provider.cc @@ -52,7 +52,7 @@ CUDAExecutionProvider::PerThreadContext::PerThreadContext(int device_id) { DeviceAllocatorRegistrationInfo default_allocator_info( {OrtMemTypeDefault, - [](int id) { return std::make_unique(id); }, std::numeric_limits::max()}); + [](int id) { return std::make_unique(id, CUDA); }, std::numeric_limits::max()}); allocator_ = CreateAllocator(default_allocator_info, device_id); } @@ -66,12 +66,17 @@ CUDAExecutionProvider::CUDAExecutionProvider(const CUDAExecutionProviderInfo& in CUDA_CALL_THROW(cudaSetDevice(device_id_)); DeviceAllocatorRegistrationInfo default_allocator_info( - {OrtMemTypeDefault, [](int id) { return std::make_unique(id); }, std::numeric_limits::max()}); + {OrtMemTypeDefault, [](int id) { return std::make_unique(id, CUDA); }, std::numeric_limits::max()}); InsertAllocator(CreateAllocator(default_allocator_info, device_id_)); DeviceAllocatorRegistrationInfo pinned_allocator_info( - {OrtMemTypeCPUOutput, [](int) { return std::make_unique(); }, std::numeric_limits::max()}); + {OrtMemTypeCPUOutput, [](int) { return std::make_unique(0, CUDA_PINNED); }, std::numeric_limits::max()}); InsertAllocator(CreateAllocator(pinned_allocator_info, device_id_)); + + // TODO: this is actually used for the cuda kernels which explicitly ask for inputs from CPU. + // This will be refactored/removed when allocator and execution provider are decoupled. 
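The pinned-allocator change above replaces a function-local static OrtAllocatorInfo with a per-instance info_ member, so allocators constructed with different names and devices report distinct metadata. A minimal sketch of that pattern, using illustrative stand-in types rather than the real ORT classes:

#include <string>

// Stand-in for OrtAllocatorInfo; only the fields needed for the illustration.
struct AllocatorInfoSketch {
  std::string name;
  int device_id;
};

class PinnedAllocatorSketch {
 public:
  PinnedAllocatorSketch(int device_id, const char* name) : info_{name, device_id} {}
  // Per-instance metadata instead of a single shared static, so two allocators
  // created with different names/devices no longer report the same info.
  const AllocatorInfoSketch& Info() const { return info_; }

 private:
  const AllocatorInfoSketch info_;
};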
+ DeviceAllocatorRegistrationInfo cpu_allocator_info({OrtMemTypeCPUInput, [](int) { return std::make_unique(std::make_unique("CUDA_CPU", OrtAllocatorType::OrtDeviceAllocator, OrtDevice(), 0, OrtMemTypeCPUInput)); }, std::numeric_limits::max()}); + InsertAllocator(CreateAllocator(cpu_allocator_info)); } CUDAExecutionProvider::~CUDAExecutionProvider() { @@ -1013,6 +1018,7 @@ CUDAExecutionProvider::GetCapability(const onnxruntime::GraphViewer& graph, // Note that nodes with only inputs from initializer would not be place on CUDA // Ideally, those nodes should be eliminated in constant folding bool should_force_outside = true; + bool all_input_are_initializer = true; node.ForEachWithIndex( node.InputDefs(), [&](const NodeArg& def, size_t index) { @@ -1020,12 +1026,17 @@ CUDAExecutionProvider::GetCapability(const onnxruntime::GraphViewer& graph, // The input is not a initializer and the input is from CPU // or the input declared as CPU memory and is from CPU // in that case we should still keep the node on CUDA - if ((!graph.GetInitializedTensor(def.Name(), initializer) && !defs_outside_cuda.count(&def)) || + bool initializer_input = graph.GetInitializedTensor(def.Name(), initializer); + if ((!initializer_input && !defs_outside_cuda.count(&def)) || (defs_outside_cuda.count(&def) && cuda_kernel_def->kernel_def->IsInputOnCpu(index))) should_force_outside = false; + if (!initializer_input) { + all_input_are_initializer = false; + } return Status::OK(); }); - if (should_force_outside) { + // If all the inputs are initialier, we shouldn't force it to CPU + if (should_force_outside && !all_input_are_initializer) { force_outside = true; } } diff --git a/onnxruntime/core/providers/cuda/cudnn_common.h b/onnxruntime/core/providers/cuda/cudnn_common.h index 02a3ba6b694bb..bfd233b68b65e 100644 --- a/onnxruntime/core/providers/cuda/cudnn_common.h +++ b/onnxruntime/core/providers/cuda/cudnn_common.h @@ -97,13 +97,13 @@ class CudnnDropout final { return dropout_desc_; } - private: Status CreateDescriptorIfNeeded() { if (!dropout_desc_) CUDNN_RETURN_IF_ERROR(cudnnCreateDropoutDescriptor(&dropout_desc_)); return Status::OK(); } + private: cudnnDropoutDescriptor_t dropout_desc_; }; diff --git a/onnxruntime/core/providers/cuda/math/binary_elementwise_ops.cc b/onnxruntime/core/providers/cuda/math/binary_elementwise_ops.cc index 16f6246b3df37..6f679c8a6cbf7 100644 --- a/onnxruntime/core/providers/cuda/math/binary_elementwise_ops.cc +++ b/onnxruntime/core/providers/cuda/math/binary_elementwise_ops.cc @@ -92,6 +92,17 @@ Status BinaryElementwise::Prepare(OpKernelContext* context, int KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()), \ x); +#define BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(x, ver, T) \ + ONNX_OPERATOR_TYPED_KERNEL_EX( \ + x, \ + kOnnxDomain, \ + ver, \ + T, \ + kCudaExecutionProvider, \ + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType()) \ + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), \ + x); + #define BINARY_ELEMENTWISE_REGISTER_KERNEL_VERSIONED_TYPED(x, startver, endver, T) \ ONNX_OPERATOR_VERSIONED_TYPED_KERNEL_EX( \ x, \ @@ -127,6 +138,11 @@ Status BinaryElementwise::Prepare(OpKernelContext* context, int BINARY_ELEMENTWISE_REGISTER_KERNEL_TYPED(name, ver, T) \ BINARY_ELEMENTWISE_COMPUTE(name, T) +#define BINARY_LOGICALOP_TYPED(name, ver, T) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, T) \ + BINARY_ELEMENTWISE_COMPUTE(name, T) + + // since different ops has different types, we cannot use BINARY_OPS() directly // the 
postfix of means the types supported by the op: // B: uint8_t @@ -155,10 +171,15 @@ Status BinaryElementwise::Prepare(OpKernelContext* context, int BINARY_OP_HFD(name, ver) #define BINARY_OP_REGISTER_OIL(name, ver) \ - BINARY_ELEMENTWISE_REGISTER_KERNEL_TYPED(name, ver, bool) \ + BINARY_ELEMENTWISE_REGISTER_KERNEL_TYPED(name, ver, bool) \ BINARY_ELEMENTWISE_REGISTER_KERNEL_TYPED(name, ver, int32_t) \ BINARY_ELEMENTWISE_REGISTER_KERNEL_TYPED(name, ver, int64_t) +#define BINARY_LOGICALOP_REGISTER_OIL(name, ver) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, bool) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, int32_t) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, int64_t) + #define BINARY_OP_REGISTER_HFD(name, ver) \ BINARY_ELEMENTWISE_REGISTER_KERNEL_TYPED(name, ver, MLFloat16) \ BINARY_ELEMENTWISE_REGISTER_KERNEL_TYPED(name, ver, float) \ @@ -171,6 +192,15 @@ Status BinaryElementwise::Prepare(OpKernelContext* context, int BINARY_ELEMENTWISE_REGISTER_KERNEL_TYPED(name, ver, int64_t) \ BINARY_OP_REGISTER_HFD(name, ver) +#define BINARY_LOGICALOP_REGISTER_UZILHFD(name, ver) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, uint32_t) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, uint64_t) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, int32_t) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, int64_t) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, MLFloat16) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, float) \ + BINARY_ELEMENTWISE_LOGICALOP_REGISTER_KERNEL_TYPED(name, ver, double) + #define BINARY_OP_REGISTER_VERSIONED_HFD(name, startver, endver) \ BINARY_ELEMENTWISE_REGISTER_KERNEL_VERSIONED_TYPED(name, startver, endver, MLFloat16) \ BINARY_ELEMENTWISE_REGISTER_KERNEL_VERSIONED_TYPED(name, startver, endver, float) \ @@ -188,9 +218,9 @@ BINARY_OP_UZILHFD(Sub, 7) BINARY_OP_UZILHFD(Mul, 7) BINARY_OP_UZILHFD(Div, 7) BINARY_OP_HFD(Pow, 7) -BINARY_OP_TYPED(And, 7, bool) -BINARY_OP_TYPED(Or, 7, bool) -BINARY_OP_TYPED(Xor, 7, bool) +BINARY_LOGICALOP_TYPED(And, 7, bool) +BINARY_LOGICALOP_TYPED(Or, 7, bool) +BINARY_LOGICALOP_TYPED(Xor, 7, bool) BINARY_OP_HFD(PRelu, 7) template @@ -440,7 +470,7 @@ Status Equal::ComputeInternal(OpKernelContext* context) const { BINARY_OP_REGISTER_UZILHFD(Sum, 8) BINARY_OP_REGISTER_VERSIONED_UZILHFD(Sum, 6, 7) -BINARY_OP_REGISTER_UZILHFD(Greater, 9) +BINARY_LOGICALOP_REGISTER_UZILHFD(Greater, 9) BINARY_OP_REGISTER_OIL(Equal, 7) BINARY_OP_REGISTER_VERSIONED_HFD(Greater, 7, 8) BINARY_OP_REGISTER_HFD(Max, 8) diff --git a/onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.cc b/onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.cc index e45eb16dc5508..2b13aa5882b57 100644 --- a/onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.cc +++ b/onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.cc @@ -15,7 +15,7 @@ void CudnnRnnBase::SetWeightBias(const cudnnHandle_t handle, const cudnnTensorDescriptor_t x_desc, const cudnnFilterDescriptor_t w_desc, const cudnnFilterDescriptor_t filter_desc, - const void* w_data, + const void* reorganized_w_data, const int lin_layer_id, const T* pos, int& offset, @@ -27,9 +27,9 @@ void CudnnRnnBase::SetWeightBias(const cudnnHandle_t handle, T* mem_offset; if (is_matrix) { - cudnnGetRNNLinLayerMatrixParams(handle, rnn_desc, pseudo_layer, x_desc, w_desc, w_data, lin_layer_id, filter_desc, (void**)&mem_offset); + cudnnGetRNNLinLayerMatrixParams(handle, rnn_desc, pseudo_layer, x_desc, 
w_desc, reorganized_w_data, lin_layer_id, filter_desc, (void**)&mem_offset); } else { - cudnnGetRNNLinLayerBiasParams(handle, rnn_desc, pseudo_layer, x_desc, w_desc, w_data, lin_layer_id, filter_desc, (void**)&mem_offset); + cudnnGetRNNLinLayerBiasParams(handle, rnn_desc, pseudo_layer, x_desc, w_desc, reorganized_w_data, lin_layer_id, filter_desc, (void**)&mem_offset); } cudnnGetFilterNdDescriptor(filter_desc, 3, &dt, &tf, &numDims, matDims.data()); @@ -42,25 +42,25 @@ Status CudnnRnnBase::SetCudnnRnnWeightBias(const cudnnHandle_t cudnn_handle, const cudnnRNNDescriptor_t rnn_desc, const cudnnTensorDescriptor_t x_desc, const cudnnFilterDescriptor_t w_desc, - void* w_data, + void* reorganized_w_data, const T* W_data, const T* R_data, const T* B_data) const { - //Onnx only support 1 layer int w_offset = 0; int r_offset = 0; int bias_offset = 0; - for (int layer = 0; layer < num_layers_ * num_directions_; ++layer) { + CudnnFilterDescriptor filter_desc; + for (int layer = 0; layer < RNN_NUM_LAYERS * num_directions_; ++layer) { for (size_t idx = 0; idx < W_lin_layer_id_.size(); ++idx) { - SetWeightBias(cudnn_handle, rnn_desc, layer, x_desc, w_desc, filter_desc_, w_data, W_lin_layer_id_[idx], W_data, w_offset, true); + SetWeightBias(cudnn_handle, rnn_desc, layer, x_desc, w_desc, filter_desc, reorganized_w_data, W_lin_layer_id_[idx], W_data, w_offset, true); if (B_data != nullptr) { - SetWeightBias(cudnn_handle, rnn_desc, layer, x_desc, w_desc, filter_desc_, w_data, W_lin_layer_id_[idx], B_data, bias_offset, false); + SetWeightBias(cudnn_handle, rnn_desc, layer, x_desc, w_desc, filter_desc, reorganized_w_data, W_lin_layer_id_[idx], B_data, bias_offset, false); } } for (size_t idx = 0; idx < R_lin_layer_id_.size(); ++idx) { - SetWeightBias(cudnn_handle, rnn_desc, layer, x_desc, w_desc, filter_desc_, w_data, R_lin_layer_id_[idx], R_data, r_offset, true); + SetWeightBias(cudnn_handle, rnn_desc, layer, x_desc, w_desc, filter_desc, reorganized_w_data, R_lin_layer_id_[idx], R_data, r_offset, true); if (B_data != nullptr) { - SetWeightBias(cudnn_handle, rnn_desc, layer, x_desc, w_desc, filter_desc_, w_data, R_lin_layer_id_[idx], B_data, bias_offset, false); + SetWeightBias(cudnn_handle, rnn_desc, layer, x_desc, w_desc, filter_desc, reorganized_w_data, R_lin_layer_id_[idx], B_data, bias_offset, false); } } } @@ -68,34 +68,11 @@ Status CudnnRnnBase::SetCudnnRnnWeightBias(const cudnnHandle_t cudnn_handle, return Status::OK(); } -template -Status CudnnRnnBase::SetCudnnRnnDesc() { - typedef typename ToCudaType::MappedType CudaT; - - cudnnDirectionMode_t cudnn_direction = CUDNN_UNIDIRECTIONAL; - if (direction_ == "bidirectional") { - cudnn_direction = CUDNN_BIDIRECTIONAL; - } else if (direction_ == "forward") { - cudnn_direction = CUDNN_UNIDIRECTIONAL; - } else if (direction_ == "reverse") { - cudnn_direction = CUDNN_UNIDIRECTIONAL; - // need to reverse data - reverse_ = true; - } - - cudnn_dropout_desc_.GetCudnnDropoutStatesSize(CudnnHandle(), state_size_); - state_buffer_ = GetScratchBuffer(state_size_); - cudnn_dropout_desc_.Set(CudnnHandle(), state_buffer_.get(), state_size_); - ORT_RETURN_IF_ERROR(rnn_desc_.Set(CudnnHandle(), hidden_size_, num_layers_, cudnn_dropout_desc_, - cudnn_direction, rnn_mode_, CudnnTensor::GetDataType())); - - return Status::OK(); -} - template Status CudnnRnnBase::ReorganizeWeights(const Tensor* W, const Tensor* R, const Tensor* B, - IAllocatorUniquePtr& target_w_data, - CudnnFilterDescriptor& target_w_desc) const { + IAllocatorUniquePtr& reorganized_w_data, + 
CudnnFilterDescriptor& target_w_desc, + CudnnRNN& rnn_desc) const { typedef typename ToCudaType::MappedType CudaT; int64_t input_size = W->Shape()[2]; // RNN W[num_directions_, hidden_size_, input_size] @@ -117,20 +94,21 @@ Status CudnnRnnBase::ReorganizeWeights(const Tensor* W, const Tensor* R, cons fake_x_desc.Set(fake_dims_x, CudnnTensor::GetDataType()); // Prepare the weight data - target_w_data = GetScratchBuffer(w_size * sizeof(T)); + reorganized_w_data = GetScratchBuffer(w_size * sizeof(T)); const T* W_data = W->template Data(); const T* R_data = R->template Data(); const T* B_data = B == nullptr ? nullptr : B->template Data(); - ORT_RETURN_IF_ERROR(SetCudnnRnnWeightBias(CudnnHandle(), rnn_desc_, fake_x_desc, target_w_desc, - target_w_data.get(), W_data, R_data, B_data)); + ORT_RETURN_IF_ERROR(SetCudnnRnnWeightBias(CudnnHandle(), rnn_desc, fake_x_desc, target_w_desc, + reorganized_w_data.get(), W_data, R_data, B_data)); return Status::OK(); } template Status CudnnRnnBase::CacheCudnnRnnWeights(const OpKernelInfo& info) { + typedef typename ToCudaType::MappedType CudaT; // Cache the weight const Tensor* W; const Tensor* R; @@ -140,10 +118,13 @@ Status CudnnRnnBase::CacheCudnnRnnWeights(const OpKernelInfo& info) { bool get_B = info.TryGetConstantInput(RNN_Input_Index::B, &B); if (get_W && get_R) { + CudnnRNN tmp_rnn_desc; + ORT_RETURN_IF_ERROR(tmp_rnn_desc.Set(CudnnHandle(), hidden_size_, RNN_NUM_LAYERS, cudnn_dropout_desc_, + cudnn_direction_mode_, rnn_mode_, CudnnTensor::GetDataType())); if (get_B) { - ORT_RETURN_IF_ERROR(ReorganizeWeights(W, R, B, w_data_cache_, w_desc_cache_)); + ORT_RETURN_IF_ERROR(ReorganizeWeights(W, R, B, w_data_cache_, w_desc_cache_, tmp_rnn_desc)); } else { - ORT_RETURN_IF_ERROR(ReorganizeWeights(W, R, nullptr, w_data_cache_, w_desc_cache_)); + ORT_RETURN_IF_ERROR(ReorganizeWeights(W, R, nullptr, w_data_cache_, w_desc_cache_, tmp_rnn_desc)); } weight_cached_ = true; } @@ -173,7 +154,7 @@ Status CudnnRnnBase::ComputeInternal(OpKernelContext* ctx) const { // optional outputs std::vector dims_Y({seq_length, num_directions_, batch_size, hidden_size_}); - std::vector dims_hxy({num_layers_ * num_directions_, batch_size, hidden_size_}); + std::vector dims_hxy({RNN_NUM_LAYERS * num_directions_, batch_size, hidden_size_}); std::vector dims_yc{num_directions_, batch_size, hidden_size_}; Tensor* Y = ctx->Output(Output_Index::Y, dims_Y); Tensor* Y_h = ctx->Output(Output_Index::Y_h, dims_hxy); @@ -198,16 +179,6 @@ Status CudnnRnnBase::ComputeInternal(OpKernelContext* ctx) const { ORT_RETURN_IF_ERROR(y_h_desc.Set(dims_hxy, CudnnTensor::GetDataType())); ORT_RETURN_IF_ERROR(y_c_desc.Set(dims_hxy, CudnnTensor::GetDataType())); - // Prepare the weight data - IAllocatorUniquePtr w_data; - CudnnFilterDescriptor w_desc; - if (!weight_cached_) { - const Tensor& W = *ctx->Input(RNN_Input_Index::W); - const Tensor& R = *ctx->Input(RNN_Input_Index::R); - const Tensor* B = ctx->Input(RNN_Input_Index::B); - ReorganizeWeights(&W, &R, B, w_data, w_desc); - } - IAllocatorUniquePtr x_reversed_data; const T* x_data = X->template Data(); if (reverse_) { @@ -239,16 +210,33 @@ Status CudnnRnnBase::ComputeInternal(OpKernelContext* ctx) const { const int32_t* sequence_lens_data = (sequence_lens == nullptr) ? 
nullptr : sequence_lens->template Data(); + CudnnRNN rnn_desc; + ORT_RETURN_IF_ERROR(rnn_desc.Set(CudnnHandle(), hidden_size_, RNN_NUM_LAYERS, cudnn_dropout_desc_, + cudnn_direction_mode_, rnn_mode_, CudnnTensor::GetDataType())); + + // Prepare the weight data + IAllocatorUniquePtr w_data; + CudnnFilterDescriptor w_desc; + if (!weight_cached_) { + const Tensor& W = *ctx->Input(RNN_Input_Index::W); + const Tensor& R = *ctx->Input(RNN_Input_Index::R); + const Tensor* B = ctx->Input(RNN_Input_Index::B); + ReorganizeWeights(&W, &R, B, w_data, w_desc, rnn_desc); + } + // CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED works with CUDNN_RNN_PADDED_IO_ENABLED, so that it will auto fill 0 for the shorter sequences - CUDNN_RETURN_IF_ERROR(cudnnSetRNNPaddingMode(rnn_desc_, CUDNN_RNN_PADDED_IO_ENABLED)); + CUDNN_RETURN_IF_ERROR(cudnnSetRNNPaddingMode(rnn_desc, CUDNN_RNN_PADDED_IO_ENABLED)); size_t workspace_bytes; - CUDNN_RETURN_IF_ERROR(cudnnGetRNNWorkspaceSize(CudnnHandle(), rnn_desc_, gsl::narrow_cast(seq_length), x_desc.data(), &workspace_bytes)); + CUDNN_RETURN_IF_ERROR(cudnnGetRNNWorkspaceSize(CudnnHandle(), rnn_desc, gsl::narrow_cast(seq_length), x_desc.data(), &workspace_bytes)); auto workspace_cuda = GetScratchBuffer(workspace_bytes); + int32_t zero_seq_count = 0; + std::vector zero_seq_index_cache(batch_size, 0); + int64_t zero_seq_index_cache_size = 0; if (CUDNN_RNN_RELU == rnn_mode_ || CUDNN_RNN_TANH == rnn_mode_ || nullptr == sequence_lens_data) { CUDNN_RETURN_IF_ERROR(cudnnRNNForwardInference(CudnnHandle(), - rnn_desc_, + rnn_desc, gsl::narrow_cast(seq_length), x_desc.data(), x_data_input, @@ -267,13 +255,35 @@ Status CudnnRnnBase::ComputeInternal(OpKernelContext* ctx) const { workspace_cuda.get(), workspace_bytes)); } else { + // cudnn doesn't support 0 sequence inside the batch, find the 0 sequence and set it to 1 + // there's a ZeroMask kernel to reset the result to 0 for the 0 sequence + std::vector seq_len_array(sequence_lens_data, sequence_lens_data + batch_size); + for (int i = 0; i < batch_size; ++i) { + if (0 == seq_len_array[i]) { + seq_len_array[i] = 1; + zero_seq_index_cache[zero_seq_count] = i; + ++zero_seq_count; + } + } + + // Calculate the zero position cache for reverse direction if it's bidirectional + // The cache is for Y_h or Y_c, and the 1st sequence for Y, no need to do it for other sequence in Y since + // we hacked the 0 sequence to 1 + if (zero_seq_count && num_directions_ > 1) { + zero_seq_index_cache_size = zero_seq_count * num_directions_; + zero_seq_index_cache.resize(zero_seq_index_cache_size); + for (int i = 0; i < zero_seq_count; ++i) { + zero_seq_index_cache[zero_seq_count + i] = static_cast(batch_size + zero_seq_index_cache[i]); + } + } + CudnnDataTensor x_desc; - x_desc.Set(CudnnTensor::GetDataType(), seq_length, batch_size, input_size, sequence_lens_data); + x_desc.Set(CudnnTensor::GetDataType(), seq_length, batch_size, input_size, seq_len_array.data()); CudnnDataTensor y_desc; - y_desc.Set(CudnnTensor::GetDataType(), seq_length, batch_size, hidden_size_ * num_directions_, sequence_lens_data); + y_desc.Set(CudnnTensor::GetDataType(), seq_length, batch_size, hidden_size_ * num_directions_, seq_len_array.data()); CUDNN_RETURN_IF_ERROR(cudnnRNNForwardInferenceEx(CudnnHandle(), - rnn_desc_, + rnn_desc, x_desc, x_data_input, hx_desc, @@ -292,8 +302,13 @@ Status CudnnRnnBase::ComputeInternal(OpKernelContext* ctx) const { nullptr, nullptr, nullptr, nullptr, workspace_cuda.get(), workspace_bytes)); + // Early terminate for this case since Y data is not required, and 
Y_h is obtained correctly, no need the following code to retrive Y_h from Y data. if (nullptr == Y) { + // Mask on output for 0 sequence batches + if (zero_seq_count > 0) { + SetZeroSequences(zero_seq_index_cache_size, zero_seq_index_cache, y_data, y_h_data, y_c_data); + } return Status::OK(); } } @@ -327,10 +342,14 @@ Status CudnnRnnBase::ComputeInternal(OpKernelContext* ctx) const { } } + // Mask on output for 0 sequence batches + if (zero_seq_count > 0) { + SetZeroSequences(zero_seq_index_cache_size, zero_seq_index_cache, y_data, y_h_data, y_c_data); + } + if ((CUDNN_RNN_RELU == rnn_mode_ || CUDNN_RNN_TANH == rnn_mode_) && sequence_lens_data != nullptr && y_h_data != nullptr && y_data != nullptr) { - auto count = sequence_lens->Shape().Size(); - CudaAsyncBuffer sequence_lens_buffer(this, GetDeviceId(), count); - memcpy(sequence_lens_buffer.CpuPtr(), sequence_lens_data, count * sizeof(int32_t)); + CudaAsyncBuffer sequence_lens_buffer(this, GetDeviceId(), batch_size); + memcpy(sequence_lens_buffer.CpuPtr(), sequence_lens_data, batch_size * sizeof(int32_t)); sequence_lens_buffer.CopyToGpu(); RnnMaskImpl(gsl::narrow_cast(num_directions_), gsl::narrow_cast(seq_length), @@ -345,6 +364,24 @@ Status CudnnRnnBase::ComputeInternal(OpKernelContext* ctx) const { return Status::OK(); } +template +void CudnnRnnBase::SetZeroSequences(const int64_t zero_seq_index_cache_size, + const std::vector zero_seq_index_cache, + T* y_data, + T* y_h_data, + T* y_c_data) const { + typedef typename ToCudaType::MappedType CudaT; + CudaAsyncBuffer zero_seq_index_cache_async_buffer(this, GetDeviceId(), zero_seq_index_cache_size); + memcpy(zero_seq_index_cache_async_buffer.CpuPtr(), zero_seq_index_cache.data(), zero_seq_index_cache_size * sizeof(int32_t)); + zero_seq_index_cache_async_buffer.CopyToGpu(); + MaskZeroSequences(gsl::narrow_cast(hidden_size_), + reinterpret_cast(y_data), + reinterpret_cast(y_h_data), + reinterpret_cast(y_c_data), + zero_seq_index_cache_async_buffer.GpuPtr(), + static_cast(zero_seq_index_cache_size)); +} + template class CudnnRnnBase; template class CudnnRnnBase; template class CudnnRnnBase; diff --git a/onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.h b/onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.h index 0afd35435cc7c..6b7f4e9c14f5e 100644 --- a/onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.h +++ b/onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.h @@ -21,26 +21,29 @@ enum RNN_Input_Index { initial_c = 6 }; +// Onnx RNN/GRU/LSTM only support 1 layer +const int RNN_NUM_LAYERS = 1; + class CudnnRNN { public: - CudnnRNN() : rnn_desc_(nullptr) { + CudnnRNN() : cudnn_rnn_desc_(nullptr) { } ~CudnnRNN() { - if (rnn_desc_ != nullptr) { - cudnnDestroyRNNDescriptor(rnn_desc_); - rnn_desc_ = nullptr; + if (cudnn_rnn_desc_ != nullptr) { + cudnnDestroyRNNDescriptor(cudnn_rnn_desc_); + cudnn_rnn_desc_ = nullptr; } } Status Set(const cudnnHandle_t& cudnnHandle, int64_t hidden_size, int num_layers, cudnnDropoutDescriptor_t cudnn_dropout_desc, cudnnDirectionMode_t cudnn_direction_model, cudnnRNNMode_t rnn_mode, cudnnDataType_t dataType) { - if (!rnn_desc_) - CUDNN_RETURN_IF_ERROR(cudnnCreateRNNDescriptor(&rnn_desc_)); + if (!cudnn_rnn_desc_) + CUDNN_RETURN_IF_ERROR(cudnnCreateRNNDescriptor(&cudnn_rnn_desc_)); CUDNN_RETURN_IF_ERROR(cudnnSetRNNDescriptor(cudnnHandle, - rnn_desc_, + cudnn_rnn_desc_, gsl::narrow_cast(hidden_size), num_layers, cudnn_dropout_desc, @@ -54,11 +57,11 @@ class CudnnRNN { } operator cudnnRNNDescriptor_t() const { - return rnn_desc_; + return cudnn_rnn_desc_; } 
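The zero-length-sequence handling in cudnn_rnn_base.cc above bumps such sequences to length 1 before calling cuDNN and records their batch indices so the corresponding outputs can be zeroed afterwards. A host-side sketch of that index-cache construction, assuming the same offset convention for the reverse direction; the function name is illustrative:

#include <cstdint>
#include <vector>

// Bump zero-length sequences to 1 (cuDNN requires length >= 1) and remember their
// batch indices; after the forward pass the cached indices are used to zero the
// matching rows of Y / Y_h / Y_c, mirroring the MaskZeroSequences call above.
std::vector<int32_t> BuildZeroSeqIndexCacheSketch(std::vector<int32_t>& seq_lens,
                                                  int batch_size, int num_directions) {
  std::vector<int32_t> zero_indices;
  for (int i = 0; i < batch_size; ++i) {
    if (seq_lens[i] == 0) {
      seq_lens[i] = 1;
      zero_indices.push_back(i);
    }
  }
  // For bidirectional runs the reverse-direction entries sit at offset batch_size,
  // following the offset convention used in the hunk above.
  if (!zero_indices.empty() && num_directions > 1) {
    const size_t forward_count = zero_indices.size();
    for (size_t i = 0; i < forward_count; ++i) {
      zero_indices.push_back(static_cast<int32_t>(batch_size) + zero_indices[i]);
    }
  }
  return zero_indices;
}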
private: - cudnnRNNDescriptor_t rnn_desc_; + cudnnRNNDescriptor_t cudnn_rnn_desc_; }; template @@ -68,23 +71,40 @@ class CudnnRnnBase : public CudaKernel { public: CudnnRnnBase(const OpKernelInfo& info) : CudaKernel{info} { reverse_ = false; - ORT_ENFORCE(info.GetAttr("direction", &direction_).IsOK()); - num_directions_ = direction_ == "bidirectional" ? 2 : 1; - ORT_ENFORCE(allowed_directions.find(direction_) != allowed_directions.end()); + std::string direction = "forward"; + direction = info.GetAttrOrDefault("direction", "forward"); + cudnn_direction_mode_ = CUDNN_UNIDIRECTIONAL; + if (direction == "bidirectional") { + cudnn_direction_mode_ = CUDNN_BIDIRECTIONAL; + } else if (direction == "forward") { + cudnn_direction_mode_ = CUDNN_UNIDIRECTIONAL; + } else if (direction == "reverse") { + cudnn_direction_mode_ = CUDNN_UNIDIRECTIONAL; + // need to reverse data + reverse_ = true; + } + + num_directions_ = cudnn_direction_mode_ == CUDNN_BIDIRECTIONAL ? 2 : 1; + ORT_ENFORCE(allowed_directions.find(direction) != allowed_directions.end()); ORT_ENFORCE(info.GetAttr("hidden_size", &hidden_size_).IsOK() && hidden_size_ > 0); rnn_mode_ = CUDNN_LSTM; - num_layers_ = 1; weight_cached_ = false; w_data_cache_ = nullptr; + + size_t state_size; + cudnn_dropout_desc_.CreateDescriptorIfNeeded(); + cudnn_dropout_desc_.GetCudnnDropoutStatesSize(CudnnHandle(), state_size); + state_buffer_ = GetScratchBuffer(state_size); + cudnn_dropout_desc_.Set(CudnnHandle(), state_buffer_.get(), state_size); } - Status SetCudnnRnnDesc(); - Status CacheCudnnRnnWeights(const OpKernelInfo& info); Status ComputeInternal(OpKernelContext* ctx) const override; + void SetRNNMode(cudnnRNNMode_t rnn_mode) { rnn_mode_ = rnn_mode; } + private: Status SetCudnnRnnWeightBias(const cudnnHandle_t cudnn_handle, const cudnnRNNDescriptor_t rnn_desc, @@ -97,7 +117,8 @@ class CudnnRnnBase : public CudaKernel { Status ReorganizeWeights(const Tensor* W, const Tensor* R, const Tensor* B, IAllocatorUniquePtr& target_w_data, - CudnnFilterDescriptor& target_w_desc) const; + CudnnFilterDescriptor& target_w_desc, + CudnnRNN& rnn_desc) const; void SetWeightBias(const cudnnHandle_t handle, const cudnnRNNDescriptor_t rnn_desc, @@ -111,27 +132,32 @@ class CudnnRnnBase : public CudaKernel { int& offset, bool is_matrix) const; + void SetZeroSequences(const int64_t zero_seq_index_cache_size, + const std::vector zero_seq_index_cache, + T* y_data, + T* y_h_data, + T* y_c_data) const; + protected: - int64_t num_directions_; - // required - int64_t hidden_size_; - cudnnRNNMode_t rnn_mode_; + // W_lin_layer_id_ & R_lin_layer_id_ are set in Constructor std::vector W_lin_layer_id_; std::vector R_lin_layer_id_; - CudnnRNN rnn_desc_; - bool reverse_; - int num_layers_; private: - // optional - std::string direction_; + cudnnDirectionMode_t cudnn_direction_mode_; + bool reverse_; + int64_t num_directions_; + // hidden_size_ from attribute + int64_t hidden_size_; + cudnnRNNMode_t rnn_mode_; + // w_desc_cache_ & w_data_cache_ are changed in Constructor if we can get the weights as constant input CudnnFilterDescriptor w_desc_cache_; - CudnnDropout cudnn_dropout_desc_; - CudnnFilterDescriptor filter_desc_; IAllocatorUniquePtr w_data_cache_; bool weight_cached_; + + // cudnn_dropout_desc_ is a cache, never to be changed IAllocatorUniquePtr state_buffer_; - size_t state_size_; + CudnnDropout cudnn_dropout_desc_; enum Output_Index { Y = 0, diff --git a/onnxruntime/core/providers/cuda/rnn/gru.h b/onnxruntime/core/providers/cuda/rnn/gru.h index 43a0ba4ab5878..ab9dabff5db36 
100644 --- a/onnxruntime/core/providers/cuda/rnn/gru.h +++ b/onnxruntime/core/providers/cuda/rnn/gru.h @@ -15,8 +15,7 @@ template class GRU final : public CudnnRnnBase { public: GRU(const OpKernelInfo& info) : CudnnRnnBase(info) { - CudnnRnnBase::rnn_mode_ = CUDNN_GRU; - CudnnRnnBase::SetCudnnRnnDesc(); + CudnnRnnBase::SetRNNMode(CUDNN_GRU); // ONNX W layout is Wzrh, WBzrh, mapping to RNNLinLayerMatrixParams the linLayerID is 1, 0, 2 CudnnRnnBase::W_lin_layer_id_.assign({1, 0, 2}); diff --git a/onnxruntime/core/providers/cuda/rnn/lstm.h b/onnxruntime/core/providers/cuda/rnn/lstm.h index 3ba719d61750d..3ed12cfa7fff9 100644 --- a/onnxruntime/core/providers/cuda/rnn/lstm.h +++ b/onnxruntime/core/providers/cuda/rnn/lstm.h @@ -13,8 +13,7 @@ class LSTM final : public CudnnRnnBase { public: LSTM(const OpKernelInfo& info) : CudnnRnnBase(info) { - CudnnRnnBase::rnn_mode_ = CUDNN_LSTM; - CudnnRnnBase::SetCudnnRnnDesc(); + CudnnRnnBase::SetRNNMode(CUDNN_LSTM); // ONNX W layout is W[iofc], WB[iofc], mapping to RNNLinLayerMatrixParams the linLayerID is 0, 3, 1, 2 CudnnRnnBase::W_lin_layer_id_.assign({0, 3, 1, 2}); diff --git a/onnxruntime/core/providers/cuda/rnn/rnn.h b/onnxruntime/core/providers/cuda/rnn/rnn.h index 246e8d1062df0..dbb0d2843fe11 100644 --- a/onnxruntime/core/providers/cuda/rnn/rnn.h +++ b/onnxruntime/core/providers/cuda/rnn/rnn.h @@ -20,11 +20,9 @@ class RNN final : public CudnnRnnBase { std::vector activations_; ORT_ENFORCE(info.GetAttrs("activations", activations_).IsOK()); if (activations_[0] == "Relu") - CudnnRnnBase::rnn_mode_ = CUDNN_RNN_RELU; + CudnnRnnBase::SetRNNMode(CUDNN_RNN_RELU); else if (activations_[0] == "Tanh") - CudnnRnnBase::rnn_mode_ = CUDNN_RNN_TANH; - - CudnnRnnBase::SetCudnnRnnDesc(); + CudnnRnnBase::SetRNNMode(CUDNN_RNN_TANH); // ONNX W mapping to RNNLinLayerMatrixParams the linLayerID is 0 CudnnRnnBase::W_lin_layer_id_.assign({0}); diff --git a/onnxruntime/core/providers/cuda/rnn/rnn_impl.cu b/onnxruntime/core/providers/cuda/rnn/rnn_impl.cu index ae210ae6818de..930c3a4ddd343 100644 --- a/onnxruntime/core/providers/cuda/rnn/rnn_impl.cu +++ b/onnxruntime/core/providers/cuda/rnn/rnn_impl.cu @@ -133,6 +133,48 @@ void RnnMaskImpl(const int32_t num_directions, div_dir_block, div_batch_block, y_output_data, y_h_output_data, (CUDA_LONG)N); } +template +__global__ void _MaskZeroSequences(const int32_t hidden_size, + T* y_output_data, + T* y_h_output_data, + T* y_c_output_data, + const int32_t* zeor_seq_index_cache, + const CUDA_LONG N) { + CALCULATE_ELEMENTWISE_INDEX_OR_EXIT(id, N); + + int32_t zero_seq_offset = zeor_seq_index_cache[id] * hidden_size; + + if (y_output_data != nullptr) { + for (int i = 0; i < hidden_size; ++i) { + y_output_data[zero_seq_offset + i] = 0; + } + } + + if (y_h_output_data != nullptr) { + for (int i = 0; i < hidden_size; ++i) { + y_h_output_data[zero_seq_offset + i] = 0; + } + } + + if (y_c_output_data != nullptr) { + for (int i = 0; i < hidden_size; ++i) { + y_c_output_data[zero_seq_offset + i] = 0; + } + } +} + +template +void MaskZeroSequences(const int32_t hidden_size, + T* y_output_data, + T* y_h_output_data, + T* y_c_output_data, + const int32_t* zeor_seq_index_cache, + const size_t N) { + int blocksPerGrid = (int)(ceil(static_cast(N) / GridDim::maxThreadsPerBlock)); + _MaskZeroSequences<<>>( + hidden_size, y_output_data, y_h_output_data, y_c_output_data, zeor_seq_index_cache, (CUDA_LONG)N); +} + #define SPECIALIZED_RNN_IMPL(T) \ template void RnnMaskImpl(const int32_t num_directions, \ const int32_t seq_length, \ @@ -153,7 +195,13 @@ 
void RnnMaskImpl(const int32_t num_directions, const int32_t hidden_size,\ const T* data, \ T* reordered_data, \ - const size_t N); + const size_t N); \ +template void MaskZeroSequences(const int32_t hidden_size, \ + T* y_output_data, \ + T* y_h_output_data, \ + T* y_c_output_data, \ + const int32_t* zeor_seq_index_cache, \ + const size_t N); SPECIALIZED_RNN_IMPL(half) SPECIALIZED_RNN_IMPL(float) diff --git a/onnxruntime/core/providers/cuda/rnn/rnn_impl.h b/onnxruntime/core/providers/cuda/rnn/rnn_impl.h index d25d71aed3fb1..78ceabf23bf2e 100644 --- a/onnxruntime/core/providers/cuda/rnn/rnn_impl.h +++ b/onnxruntime/core/providers/cuda/rnn/rnn_impl.h @@ -34,5 +34,12 @@ void RnnMaskImpl(const int32_t num_directions, T* y_h_output_data, const size_t N); +template +void MaskZeroSequences(const int32_t hidden_size, + T* y_output_data, + T* y_h_output_data, + T* y_c_output_data, + const int32_t* zeor_seq_index_cache_async_buffer, + const size_t N); } // namespace cuda } // namespace onnxruntime diff --git a/onnxruntime/core/providers/cuda/tensor/compress.cc b/onnxruntime/core/providers/cuda/tensor/compress.cc index 4e33a421846b9..9e23ad6a5fc1a 100644 --- a/onnxruntime/core/providers/cuda/tensor/compress.cc +++ b/onnxruntime/core/providers/cuda/tensor/compress.cc @@ -13,7 +13,8 @@ ONNX_OPERATOR_KERNEL_EX( kOnnxDomain, 9, kCudaExecutionProvider, - KernelDefBuilder().TypeConstraint("T", DataTypeImpl::AllFixedSizeTensorTypes()), + KernelDefBuilder().TypeConstraint("T", DataTypeImpl::AllFixedSizeTensorTypes()) + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), Compress); Status Compress::ComputeInternal(OpKernelContext* ctx) const { diff --git a/onnxruntime/core/providers/cuda/tensor/resize_impl.cu b/onnxruntime/core/providers/cuda/tensor/resize_impl.cu index f8df8a9689f02..55d7fcaf01f49 100644 --- a/onnxruntime/core/providers/cuda/tensor/resize_impl.cu +++ b/onnxruntime/core/providers/cuda/tensor/resize_impl.cu @@ -29,8 +29,13 @@ __global__ void _ResizeNearestKernel(const size_t rank, output_data[id] = input_data[input_index]; } +// The following method supports a 4-D input in 'Linear mode' +// that amounts to 'Bilinear' Upsampling/Resizing in the sense that it assumes +// the scale values for the outermost 2 dimensions are 1. 
+// This is the common use-case where the 4-D input (batched multi-channel images) +// is usually of shape [N, C, H, W] and the scales are [1.0, 1.0, height_scale, width_scale] template -__global__ void _ResizeBilinearKernel(const int64_t input_dim2, +__global__ void _ResizeBilinear4DInputKernel(const int64_t input_dim2, const int64_t* input_pitches, const fast_divmod* output_div_pitches, const float* scales, @@ -90,6 +95,62 @@ __global__ void _ResizeBilinearKernel(const int64_t input_dim2, x11 * static_cast(y_offset_0 * x_offset_0); } +// The following method supports a 2-D input in 'Linear mode' +template +__global__ void _ResizeBilinear2DInputKernel(const int64_t input_dim0, + const int64_t* input_pitches, + const fast_divmod* output_div_pitches, + const float* scales, + const T* input_data, + T* output_data, + const size_t N) { + CALCULATE_ELEMENTWISE_INDEX_OR_EXIT(id, N); + CUDA_LONG input_index = 0; + + int mod; + int index_of_dim0, index_of_dim1; + output_div_pitches[0].divmod(id, index_of_dim0, mod); + index_of_dim1 = mod; + int index_of_input_dim0, index_of_input_dim1; + float x_offset_0, y_offset_0, x_offset_1, y_offset_1; + index_of_input_dim0 = static_cast(index_of_dim0 / scales[0]); + index_of_input_dim1 = static_cast(index_of_dim1 / scales[1]); + input_index = index_of_input_dim0 * input_pitches[0] + index_of_input_dim1; + + T x00 = input_data[input_index]; + T x10, x01, x11; + + bool end_of_dim0 = false, end_of_dim1 = false; + if (index_of_input_dim0 == (input_dim0 - 1)) { + // It's the end in dimension 0 + x01 = x00; + end_of_dim0 = true; + } else { + x01 = input_data[input_index + input_pitches[0]]; + } + + if (index_of_input_dim1 == (input_pitches[0] - 1)) { + // It's the end in dimension 1 + x10 = x00; + x11 = x01; + end_of_dim1 = true; + } else { + x10 = input_data[input_index + 1]; + x11 = end_of_dim0 ? x10 : input_data[input_index + input_pitches[0] + 1]; + } + + y_offset_0 = end_of_dim0 ? 0.5f : index_of_dim0 / scales[0] - index_of_input_dim0; + y_offset_1 = 1.0f - y_offset_0; + x_offset_0 = end_of_dim1 ? 
0.5f : index_of_dim1 / scales[1] - index_of_input_dim1; + x_offset_1 = 1.0f - x_offset_0; + + output_data[id] = + x00 * static_cast(y_offset_1 * x_offset_1) + + x01 * static_cast(y_offset_0 * x_offset_1) + + x10 * static_cast(y_offset_1 * x_offset_0) + + x11 * static_cast(y_offset_0 * x_offset_0); +} + template void ResizeImpl(const onnxruntime::UpsampleMode upsample_mode, const size_t rank, @@ -105,8 +166,12 @@ void ResizeImpl(const onnxruntime::UpsampleMode upsample_mode, _ResizeNearestKernel<<>>( rank, input_pitches, output_div_pitches, scales_vals, input_data, output_data, N); - } else if (onnxruntime::UpsampleMode::LINEAR == upsample_mode) { - _ResizeBilinearKernel<<>>( + } else if (onnxruntime::UpsampleMode::LINEAR == upsample_mode && rank == 4) { + _ResizeBilinear4DInputKernel<<>>( + input_dim2, input_pitches, output_div_pitches, scales_vals, + input_data, output_data, N); + } else if (onnxruntime::UpsampleMode::LINEAR == upsample_mode && rank == 2) { + _ResizeBilinear2DInputKernel<<>>( input_dim2, input_pitches, output_div_pitches, scales_vals, input_data, output_data, N); } diff --git a/onnxruntime/core/providers/cuda/tensor/tile.cc b/onnxruntime/core/providers/cuda/tensor/tile.cc index 390d9139de58d..854c784c8a851 100644 --- a/onnxruntime/core/providers/cuda/tensor/tile.cc +++ b/onnxruntime/core/providers/cuda/tensor/tile.cc @@ -17,7 +17,8 @@ namespace cuda { kCudaExecutionProvider, \ KernelDefBuilder() \ .InputMemoryType(1) \ - .TypeConstraint("T", DataTypeImpl::GetTensorType()), \ + .TypeConstraint("T", DataTypeImpl::GetTensorType()) \ + .TypeConstraint("T1", DataTypeImpl::GetTensorType()), \ Tile); template diff --git a/onnxruntime/core/providers/cuda/tensor/upsample.cc b/onnxruntime/core/providers/cuda/tensor/upsample.cc index 88248983d70ae..3a9eb36c22f41 100644 --- a/onnxruntime/core/providers/cuda/tensor/upsample.cc +++ b/onnxruntime/core/providers/cuda/tensor/upsample.cc @@ -38,10 +38,21 @@ Status Upsample::BaseCompute(OpKernelContext* context, const std::vector& X_dims = X->Shape().GetDims(); auto rank = X_dims.size(); if (rank == 0) - return Status(ONNXRUNTIME, INVALID_ARGUMENT, "Upsample: input tensor cannot be scalar."); + return Status(ONNXRUNTIME, INVALID_ARGUMENT, + is_resize ? "Resize: input tensor cannot be scalar." : "Upsample: input tensor cannot be scalar."); if (rank != scales.size()) - return Status(ONNXRUNTIME, INVALID_ARGUMENT, "Upsample: input tensor's dimension does not match the scales."); + return Status(ONNXRUNTIME, INVALID_ARGUMENT, + is_resize ? "Resize: input tensor's dimension does not match the scales." : + "Upsample: input tensor's dimension does not match the scales."); + + if (UpsampleMode::LINEAR == mode_ && rank != 4 && rank != 2) { + std::ostringstream oss; + oss << "'Linear' mode only support 2-D inputs ('Bilinear') or 4-D inputs " + "with the corresponding outermost 2 scale values being 1 in the "; + oss << (is_resize ? 
"Resize operator" : "Upsample operator"); + return Status(ONNXRUNTIME, FAIL, oss.str()); + } std::vector Y_dims; for (std::size_t i = 0; i < rank; i++) { @@ -69,21 +80,12 @@ Status Upsample::BaseCompute(OpKernelContext* context, const std::vectorShape().Size(); - if (UpsampleMode::LINEAR == mode_) { - if (rank != 4) - if (is_resize) { - return Status(ONNXRUNTIME, FAIL, "Resize: linear mode only supports 4-D tensor with NCHW layout"); - } else { - return Status(ONNXRUNTIME, FAIL, "Upsample: linear mode only supports 4-D tensor with NCHW layout"); - } - } - if (is_resize) { CudaAsyncBuffer scales_vals(this, device_id, scales); scales_vals.CopyToGpu(); ResizeImpl(mode_, rank, - (UpsampleMode::LINEAR == mode_) ? X_dims[2] : 0, + (UpsampleMode::LINEAR == mode_) ? (rank == 2 ? X_dims[0] : X_dims[2]) : 0, input_strides.GpuPtr(), output_div_pitches.GpuPtr(), scales_vals.GpuPtr(), @@ -101,7 +103,7 @@ Status Upsample::BaseCompute(OpKernelContext* context, const std::vector -__global__ void _UpampleBilinearKernel(const int64_t input_dim2, +__global__ void _UpampleBilinear4DInputKernel(const int64_t input_dim2, const int64_t* input_pitches, const fast_divmod* output_div_pitches, const fast_divmod* scales_div, @@ -90,6 +95,59 @@ __global__ void _UpampleBilinearKernel(const int64_t input_dim2, output_data[id] = y0 + static_cast(x_offset_T * (y1 - y0) / scales_div3_T); } +// The following method supports a 2-D input in 'Linear mode' +template +__global__ void _UpampleBilinear2DInputKernel(const int64_t input_dim0, + const int64_t* input_pitches, + const fast_divmod* output_div_pitches, + const fast_divmod* scales_div, + const T* input_data, + T* output_data, + const size_t N) { + CALCULATE_ELEMENTWISE_INDEX_OR_EXIT(id, N); + CUDA_LONG input_index = 0; + + int mod; + int index_of_dim0, index_of_dim1; + output_div_pitches[0].divmod(id, index_of_dim0, mod); + index_of_dim1 = mod; + int index_of_input_dim0, index_of_input_dim1, x_offset, y_offset; + scales_div[0].divmod(index_of_dim0, index_of_input_dim0, y_offset); + scales_div[1].divmod(index_of_dim1, index_of_input_dim1, x_offset); + + input_index = index_of_input_dim0 * input_pitches[0] + index_of_input_dim1; + + T x00 = input_data[input_index]; + T x10, x01, x11; + + bool end_of_dim0 = false; + if (index_of_input_dim0 == (input_dim0 - 1)) { + // It's the end in dimension 0 + x01 = x00; + end_of_dim0 = true; + } else { + x01 = input_data[input_index + input_pitches[0]]; + } + + if (index_of_input_dim1 == (input_pitches[0] - 1)) { + // It's the end in dimension 1 + x10 = x00; + x11 = x01; + } else { + x10 = input_data[input_index + 1]; + x11 = end_of_dim0 ? 
x10 : input_data[input_index + input_pitches[0] + 1]; + } + + T y_offset_T = static_cast(y_offset); + T x_offset_T = static_cast(x_offset); + T scales_div0_T = static_cast(scales_div[0].d_); + T scales_div1_T = static_cast(scales_div[1].d_); + T y0 = x00 + static_cast(y_offset_T * (x01 - x00) / scales_div0_T); + T y1 = x10 + static_cast(y_offset_T * (x11 - x10) / scales_div0_T); + + output_data[id] = y0 + static_cast(x_offset_T * (y1 - y0) / scales_div1_T); +} + template void UpampleImpl(const onnxruntime::UpsampleMode upsample_mode, const size_t rank, @@ -105,8 +163,12 @@ void UpampleImpl(const onnxruntime::UpsampleMode upsample_mode, _UpampleNearestKernel<<>>( rank, input_pitches, output_div_pitches, scales_div, input_data, output_data, N); - } else if (onnxruntime::UpsampleMode::LINEAR == upsample_mode) { - _UpampleBilinearKernel<<>>( + } else if (onnxruntime::UpsampleMode::LINEAR == upsample_mode && rank == 4) { + _UpampleBilinear4DInputKernel<<>>( + input_dim2, input_pitches, output_div_pitches, scales_div, + input_data, output_data, N); + } else if (onnxruntime::UpsampleMode::LINEAR == upsample_mode && rank == 2) { + _UpampleBilinear2DInputKernel<<>>( input_dim2, input_pitches, output_div_pitches, scales_div, input_data, output_data, N); } diff --git a/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc b/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc index 93cb36116f964..a2908888c5b49 100644 --- a/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc +++ b/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.cc @@ -101,7 +101,7 @@ bool MKLDNNExecutionProvider::UseSubgraph(const onnxruntime::GraphViewer& graph_ index++; node = graph_viewer.GetNode(index); } - if (node->InputDefs()[0]->Type() != nullptr) + if (!node->InputDefs().empty() && node->InputDefs()[0]->Type() != nullptr) FP16_graph = node->InputDefs()[0]->Type()->find("16") != std::string::npos; } @@ -357,8 +357,8 @@ void MKLDNNExecutionProvider::CreateMetaDef(const onnxruntime::GraphViewer& grap std::vector>& result) const { std::string graph_fused_nodes; std::string node_list; - std::string subgraph_id = std::to_string(sub_var.subgraph_index); - sub_var.subgraph_index++; + std::string subgraph_id = std::to_string(subgraph_index_); + subgraph_index_++; // This is a list of initializers that subgraph considers as constants. // Example weights, reshape shape etc. 
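Stepping back to the 2-D 'Linear' kernels added above for Resize and Upsample: both gather the four neighbouring input samples, clamp at the last row and column, and blend them with the fractional offsets. The end-of-dimension-1 test compares against input_pitches[0], which works because, for a 2-D row-major tensor, the pitch of dimension 0 equals the extent of dimension 1. Below is a plain CPU reference of the same bilinear scheme, offered only as a sketch; ResizeBilinear2D is a hypothetical name and the float, row-major layout is an assumption.

// Hedged CPU reference for the 2-D bilinear resize implemented by the new kernels.
#include <cstdint>
#include <vector>

std::vector<float> ResizeBilinear2D(const std::vector<float>& in,
                                    int64_t in_h, int64_t in_w,
                                    float scale_h, float scale_w) {
  const int64_t out_h = static_cast<int64_t>(in_h * scale_h);
  const int64_t out_w = static_cast<int64_t>(in_w * scale_w);
  std::vector<float> out(out_h * out_w);
  for (int64_t y = 0; y < out_h; ++y) {
    for (int64_t x = 0; x < out_w; ++x) {
      const int64_t y0 = static_cast<int64_t>(y / scale_h);  // top-left input sample
      const int64_t x0 = static_cast<int64_t>(x / scale_w);
      const bool last_row = (y0 == in_h - 1);
      const bool last_col = (x0 == in_w - 1);
      const int64_t y1 = last_row ? y0 : y0 + 1;              // clamp at the border
      const int64_t x1 = last_col ? x0 : x0 + 1;
      const float dy = last_row ? 0.5f : y / scale_h - y0;    // fractional offsets
      const float dx = last_col ? 0.5f : x / scale_w - x0;
      const float p00 = in[y0 * in_w + x0], p01 = in[y1 * in_w + x0];
      const float p10 = in[y0 * in_w + x1], p11 = in[y1 * in_w + x1];
      out[y * out_w + x] = p00 * (1 - dy) * (1 - dx) + p01 * dy * (1 - dx) +
                           p10 * (1 - dy) * dx + p11 * dy * dx;
    }
  }
  return out;
}

The 0.5f offset used at a clamped border mirrors the kernels and is harmless: both neighbouring samples are identical there, so any blend weight yields the border value.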
@@ -378,7 +378,7 @@ void MKLDNNExecutionProvider::CreateMetaDef(const onnxruntime::GraphViewer& grap auto meta_def = std::make_unique<::onnxruntime::IndexedSubGraph::MetaDef>(); meta_def->attributes["initializers"] = initializers; - meta_def->name = "MkldnnCustomOp" + std::to_string(sub_var.subgraph_index); + meta_def->name = "MkldnnCustomOp" + std::to_string(subgraph_index_); meta_def->domain = kMSDomain; meta_def->since_version = 1; meta_def->status = ONNX_NAMESPACE::EXPERIMENTAL; diff --git a/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.h b/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.h index 2869698568bde..a57f290689382 100644 --- a/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.h +++ b/onnxruntime/core/providers/mkldnn/mkldnn_execution_provider.h @@ -147,6 +147,8 @@ class MKLDNNExecutionProvider : public IExecutionProvider { } private: + mutable int subgraph_index_ = 0; + // supported MklDnn Operators std::set mkldnn_ops_ = {"Conv", "BatchNormalization", "Relu", "Sum", "AveragePool", "GlobalMaxPool", "GlobalAveragePool", "MaxPool", "LRN"}; diff --git a/onnxruntime/core/providers/mkldnn/mkldnn_provider_factory.cc b/onnxruntime/core/providers/mkldnn/mkldnn_provider_factory.cc index c94060d5b2450..2cc7e112a29e2 100644 --- a/onnxruntime/core/providers/mkldnn/mkldnn_provider_factory.cc +++ b/onnxruntime/core/providers/mkldnn/mkldnn_provider_factory.cc @@ -27,6 +27,7 @@ std::unique_ptr MkldnnProviderFactory::CreateProvider() { std::shared_ptr CreateExecutionProviderFactory_Mkldnn(int device_id) { return std::make_shared(device_id); + //TODO: This is apparently a bug. The consructor parameter is create-arena-flag, not the device-id } } // namespace onnxruntime diff --git a/onnxruntime/core/providers/mkldnn/subgraph/subgraph.h b/onnxruntime/core/providers/mkldnn/subgraph/subgraph.h index 6e3f967dab65d..b63692bce04ec 100644 --- a/onnxruntime/core/providers/mkldnn/subgraph/subgraph.h +++ b/onnxruntime/core/providers/mkldnn/subgraph/subgraph.h @@ -48,12 +48,8 @@ struct Subgraph { std::vector outputs; std::vector outputs_as_input_other_node; std::vector subgraph_node_indexes; - int subgraph_index = 0; - SubgraphVariables() { - subgraph_index = 0; - } - void Reset() { + void Reset() { subgraph_node_indexes.clear(); inputs.clear(); outputs.clear(); diff --git a/onnxruntime/core/providers/ngraph/ngraph_custom_op.cc b/onnxruntime/core/providers/ngraph/ngraph_custom_op.cc index 5ac6354bbb11b..326e878cbcd95 100644 --- a/onnxruntime/core/providers/ngraph/ngraph_custom_op.cc +++ b/onnxruntime/core/providers/ngraph/ngraph_custom_op.cc @@ -25,20 +25,23 @@ namespace onnxruntime { namespace ngraph_ep { +#define NGRAPH_EP_LRU_CACHE_DEFAULT_SIZE 500 + static bool check_ngraph_dump_ops() { #ifdef _WIN32 size_t env_name_len = 0; char* env_name = nullptr; - return (_dupenv_s(&env_name, &env_name_len, "ONNXRUNTIME_NGRAPH_DUMP_OPS") == 0); + return (_dupenv_s(&env_name, &env_name_len, "ONNXRUNTIME_NGRAPH_DUMP_OPS") == 0 && env_name != nullptr); #else return (std::getenv("ONNXRUNTIME_NGRAPH_DUMP_OPS") != nullptr); #endif } -NGRAPHCustomOp::NGRAPHCustomOp(const ComputeContext* context, const ONNX_NAMESPACE::ModelProto& model_proto, - const std::shared_ptr& ng_backend) - : ng_backend_{ng_backend}, - model_proto_{model_proto} { +NGRAPHCustomOp::NGRAPHCustomOp(const ComputeContext* context, + const ONNX_NAMESPACE::ModelProto& model_proto, + const std::shared_ptr& ng_backend) : + ng_backend_{ng_backend}, model_proto_{model_proto} +{ allocate_func_ = context->allocate_func; 
release_func_ = context->release_func; allocator_ = context->allocator_handle; @@ -59,7 +62,6 @@ NGRAPHCustomOp::~NGRAPHCustomOp() { //This method gets called in critical path of execution: Optimize void NGRAPHCustomOp::Initialize(const OrtCustomOpApi* api, OrtKernelContext* context) const { Ort::CustomOpApi ort{*api}; - LOGS_DEFAULT(INFO) << "nGraph compiling customOp: " << name_; size_t num_inputs = ort.KernelContext_GetInputCount(context); @@ -80,7 +82,45 @@ void NGRAPHCustomOp::Initialize(const OrtCustomOpApi* api, OrtKernelContext* con uniq_input_shape.append(reinterpret_cast(tensor_shape.data()), ndim * sizeof(int64_t)); } - auto it = ng_exe_map_.insert({uniq_input_shape, nullptr}); //TODO: Limit the size of map with configurable size. + // Get cache size from environment + std::string tempSize; + #ifdef _WIN32 + char *buf{nullptr}; + size_t bufSize = 0; + if (!_dupenv_s(&buf, &bufSize, "ONNXRUNTIME_NGRAPH_LRU_CACHE_SIZE") && buf) { + tempSize = buf; + free(buf); + } + #else + if (std::getenv("ONNXRUNTIME_NGRAPH_LRU_CACHE_SIZE")) { + tempSize = std::getenv("ONNXRUNTIME_NGRAPH_LRU_CACHE_SIZE"); + } + #endif + size_t cacheSize = tempSize.empty() ? NGRAPH_EP_LRU_CACHE_DEFAULT_SIZE : std::stoi(tempSize); + + // Not in cache + if (ng_exe_map_.find(uniq_input_shape) == ng_exe_map_.end()) { + // Check if full + if (keyCache.size() == cacheSize) { + // Delete least recently used element + std::string last = keyCache.back(); + + // Pop the last elmeent + keyCache.pop_back(); + + // Erase the last element from cache + ng_exe_map_.erase(ng_exe_map_.find(last)); + } + } + + // Found in cache + else { + keyCache.remove(uniq_input_shape); + } + + // update reference + keyCache.push_front(uniq_input_shape); + auto it = ng_exe_map_.insert({uniq_input_shape, nullptr}); //ng_exe with current shape already exists if (!it.second) { @@ -88,6 +128,9 @@ void NGRAPHCustomOp::Initialize(const OrtCustomOpApi* api, OrtKernelContext* con return; } else { auto graph_proto = model_proto_.mutable_graph(); + + LOGS_DEFAULT(INFO) << "[NGRAPHCustomOp] Compiling customOp: " << name_; + // Clear previous shapes if any and set new input shapes for (size_t i = 0; i < num_inputs; i++) { auto g_in_shape = graph_proto->mutable_input((int)i)->mutable_type()->mutable_tensor_type()->mutable_shape(); @@ -108,12 +151,12 @@ void NGRAPHCustomOp::Initialize(const OrtCustomOpApi* api, OrtKernelContext* con try { ng_function = ngraph::onnx_import::import_onnx_model(model_stream); } catch (const std::exception& exp) { - LOGS_DEFAULT(FATAL) << "[" << name_ << "] " - << "Exception while converting onnx to nGraph: " << std::string(exp.what()); + LOGS_DEFAULT(FATAL) << "[NGRAPHCustomOp] " << " - " << name_ << " - " + << "Exception while importing model to nGraph: " << std::string(exp.what()); throw; } catch (...) { - LOGS_DEFAULT(FATAL) << "[" << name_ << "] " - << "Unknown exception while converting onnx to nGraph"; + LOGS_DEFAULT(FATAL) << "[NGRAPHCustomOp] " << " - " << name_ << " - " + << "Unknown exception while importing model to nGraph"; throw; } @@ -125,9 +168,10 @@ void NGRAPHCustomOp::Initialize(const OrtCustomOpApi* api, OrtKernelContext* con try { ng_curr_exe_ = ng_backend_->compile(ng_function); } catch (const std::exception& exp) { - LOGS_DEFAULT(FATAL) << "Exception while compiling nGraph Op: " << name_ << std::string(exp.what()); + LOGS_DEFAULT(FATAL) << "[NGRAPHCustomOp] " << " - " << name_ << " - " + << "Exception while compiling ngraph::Function: " << std::string(exp.what()); } catch (...) 
{ - LOGS_DEFAULT(FATAL) << "Unknown exception while compiling nGraph Op: " << name_; + LOGS_DEFAULT(FATAL) << "[NGRAPHCustomOp] " << " - " << name_ << " - " << "Unknown exception while compiling ngraph::Function"; } it.first->second = ng_curr_exe_; } @@ -137,11 +181,11 @@ void NGRAPHCustomOp::Initialize(const OrtCustomOpApi* api, OrtKernelContext* con Status NGRAPHCustomOp::Compute(const OrtCustomOpApi* api, OrtKernelContext* context) const { Ort::CustomOpApi ort{*api}; - //TODO: Minimize locked region - std::lock_guard lock(compute_lock_); - // Initialize nGraph function if it is not already initialized. - Initialize(api, context); + { + std::lock_guard lock(compute_lock_); + Initialize(api, context); + } ORT_ENFORCE(ng_curr_exe_ != nullptr); @@ -154,12 +198,13 @@ Status NGRAPHCustomOp::Compute(const OrtCustomOpApi* api, OrtKernelContext* cont for (const auto& ng_param : ng_curr_exe_->get_parameters()) { const OrtValue* input_tensor = ort.KernelContext_GetInput(context, input_index++); void* input_data = const_cast(ort.GetTensorData(input_tensor)); + std::lock_guard lock(compute_lock_); ng_inputs.emplace_back(ng_backend_->create_tensor(ng_param->get_output_element_type(0), ng_param->get_output_shape(0), input_data)); } } catch (const std::exception& exp) { - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Exception while copying input data to nGraph: " + std::string(exp.what())); + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, name_ + ": Exception while copying input data to nGraph: " + std::string(exp.what())); } catch (...) { - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Unknown exception while copying input data to nGraph"); + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, name_ + ": Unknown exception while copying input data to nGraph"); } // Initialize output tensors @@ -173,22 +218,24 @@ Status NGRAPHCustomOp::Compute(const OrtCustomOpApi* api, OrtKernelContext* cont std::vector ort_shape{shape.begin(), shape.end()}; OrtValue* output_tensor = ort.KernelContext_GetOutput(context, output_index++, ort_shape.data(), ort_shape.size()); void* output_data = ort.GetTensorMutableData(output_tensor); + std::lock_guard lock(compute_lock_); ng_outputs.emplace_back(ng_backend_->create_tensor(dtype, shape, output_data)); } } catch (const std::exception& exp) { - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Exception while creating nGraph output Tensor: " + std::string(exp.what())); + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, name_ + ": Exception while creating nGraph output Tensor: " + std::string(exp.what())); } catch (...) { - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Unknown exception while creating nGraph output Tensor"); + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, name_ + ": Unknown exception while creating nGraph output Tensor"); } // Run the graph through nGraph. try { + std::lock_guard lock(compute_lock_); if (!ng_curr_exe_->call(ng_outputs, ng_inputs)) - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Error while executing nGraph computation"); + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, name_ + ": Error while executing nGraph computation"); } catch (const std::exception& exp) { - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Exception while executing nGraph computation: " + std::string(exp.what())); + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, name_ + ": Exception while executing nGraph computation: " + std::string(exp.what())); } catch (...) 
{ - return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Unknown exception while executing nGraph computation"); + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, name_ + ": Unknown exception while executing nGraph computation"); } return Status::OK(); diff --git a/onnxruntime/core/providers/ngraph/ngraph_custom_op.h b/onnxruntime/core/providers/ngraph/ngraph_custom_op.h index 6661fdb378e56..ad9955872d7fb 100644 --- a/onnxruntime/core/providers/ngraph/ngraph_custom_op.h +++ b/onnxruntime/core/providers/ngraph/ngraph_custom_op.h @@ -25,7 +25,9 @@ namespace ngraph_ep { class NGRAPHCustomOp { public: - NGRAPHCustomOp(const ComputeContext* context, const ONNX_NAMESPACE::ModelProto& model_proto, const std::shared_ptr& ng_backend); + NGRAPHCustomOp(const ComputeContext* context, + const ONNX_NAMESPACE::ModelProto& model_proto, + const std::shared_ptr& ng_backend); Status Compute(const OrtCustomOpApi* api, OrtKernelContext* context) const; @@ -54,7 +56,8 @@ class NGRAPHCustomOp { key = [3,1,2,3,2,4,5] */ mutable std::unordered_map> ng_exe_map_; - + mutable std::list keyCache; + mutable std::mutex compute_lock_; mutable ONNX_NAMESPACE::ModelProto model_proto_; diff --git a/onnxruntime/core/providers/ngraph/ngraph_execution_provider.cc b/onnxruntime/core/providers/ngraph/ngraph_execution_provider.cc index 459deae2c81f9..60fad0071ccf6 100644 --- a/onnxruntime/core/providers/ngraph/ngraph_execution_provider.cc +++ b/onnxruntime/core/providers/ngraph/ngraph_execution_provider.cc @@ -33,19 +33,36 @@ constexpr const char* NGRAPH = "nGraph"; NGRAPHExecutionProvider::NGRAPHExecutionProvider(const NGRAPHExecutionProviderInfo& info) : IExecutionProvider{onnxruntime::kNGraphExecutionProvider} { - DeviceAllocatorRegistrationInfo default_allocator_info({OrtMemTypeDefault, - [](int) { return std::make_unique(std::make_unique(NGRAPH, OrtAllocatorType::OrtDeviceAllocator)); }, - std::numeric_limits::max()}); + + ORT_ENFORCE(info.ng_backend_type == "CPU", "nGraph Execution Provider for onnxruntime currently is only supported for CPU backend."); + + auto default_allocator_factory = [](int) { + auto allocator_info = std::make_unique(NGRAPH, OrtAllocatorType::OrtDeviceAllocator); + return std::make_unique(std::move(allocator_info)); + }; + + DeviceAllocatorRegistrationInfo default_allocator_info{ + OrtMemTypeDefault, + std::move(default_allocator_factory), + std::numeric_limits::max() + }; InsertAllocator(CreateAllocator(default_allocator_info)); - DeviceAllocatorRegistrationInfo cpu_allocator_info({OrtMemTypeCPUOutput, - [](int) { return std::make_unique(std::make_unique(NGRAPH, OrtAllocatorType::OrtDeviceAllocator, OrtDevice(), 0, OrtMemTypeCPUOutput)); }, - std::numeric_limits::max()}); - InsertAllocator(CreateAllocator(cpu_allocator_info)); + auto cpu_allocator_factory = [](int) { + auto allocator_info = std::make_unique( + NGRAPH, OrtAllocatorType::OrtDeviceAllocator, OrtDevice(), 0, OrtMemTypeCPUOutput); + return std::make_unique(std::move(allocator_info)); + }; - ORT_ENFORCE(info.ng_backend_type == "CPU", "nGraph Execution Provider for onnxruntime currently is only supported for CPU backend."); + DeviceAllocatorRegistrationInfo cpu_allocator_info{ + OrtMemTypeCPUOutput, + std::move(cpu_allocator_factory), + std::numeric_limits::max() + }; + + InsertAllocator(CreateAllocator(cpu_allocator_info)); try { ng_backend_ = ngraph::runtime::Backend::create(info.ng_backend_type); @@ -57,25 +74,6 @@ NGRAPHExecutionProvider::NGRAPHExecutionProvider(const NGRAPHExecutionProviderIn } } -/** - * Checks if a tensor represented by srcLocation 
can be copied into the dstLocation tensor - * @param src_location result of Location().name call on the source tensor - * @param dst_location result of Location().name call on the destination tensor - * @return true if src and dest locations combination allows copying - */ -bool TensorCopyPossible(const std::string& src_location, const std::string& dst_location) { - // contains allowed combinations of source and destination locations for tensors copying purposes - // the first element of a pair denotes a source, the second - destination - static const std::map allowed_copy_directions = { - {NGRAPH, CPU}, {NGRAPH, NGRAPH}, {CPU, NGRAPH}}; - - // copying of tensors is allowed only if the params match any of the allowed combinations - return std::any_of(allowed_copy_directions.begin(), - allowed_copy_directions.end(), [&](const auto& copy_direction) { - return src_location == copy_direction.first && dst_location == copy_direction.second; - }); -} - // Returns true only if op is in a mode that is not currently supported static bool IsUnsupportedOpMode(const Node* node, const onnxruntime::GraphViewer& graph_viewer) { const auto& optype = node->OpType(); @@ -131,11 +129,6 @@ static bool IsUnsupportedOpMode(const Node* node, const onnxruntime::GraphViewer return true; } } - } else if (optype == "Cast") { - //support of casting to bool in nGraph is in progress - const auto& attributes = node->GetAttributes(); - const auto to_attr = attributes.find("to"); - return to_attr->second.i() == ONNX_NAMESPACE::TensorProto::BOOL; } else if (optype == "Slice") { //Slice in opset 10 is currently not supported. //unsupported inputs: starts, ends, axes, steps @@ -164,6 +157,63 @@ static bool IsUnsupportedOpMode(const Node* node, const onnxruntime::GraphViewer if (ceil_attr != attributes.end() && ceil_attr->second.i() != 0) { return true; } + } else if (optype == "Split") { + const auto& attributes = node->GetAttributes(); + const auto split_attr = attributes.find("split"); + + if (split_attr != attributes.end()) { + // split implementation contains a bug that doesn't throw for incorrect split values + // disabling temporarily until it's fixed in the next release of nGraph + const auto splits = split_attr->second.ints(); + return std::any_of(std::begin(splits), std::end(splits), + [](const auto split) { return split <= 0; }); + } + } else if (optype == "QLinearMatMul") { + const auto& a_zero_point = node->InputDefs()[2]; + const auto& b_zero_point = node->InputDefs()[5]; + const auto& y_zero_point = node->InputDefs()[7]; + + bool non_const_zero_point = false; + + // check if any of the zero points is NOT in the initializers list + non_const_zero_point |= initializers.find(a_zero_point->Name()) == initializers.end(); + non_const_zero_point |= initializers.find(b_zero_point->Name()) == initializers.end(); + non_const_zero_point |= initializers.find(y_zero_point->Name()) == initializers.end(); + + // QLinearMatMul is not supported if any of the zero points is a dynamic input + return non_const_zero_point; + } else if (optype == "MatMulInteger") { + // all MatMulInteger zero points need to be constants + const auto inputs = node->InputDefs(); + if (inputs.size() == 3) { + const auto& a_zero_point = node->InputDefs()[2]; + + // not found in initializers -> not const + return initializers.find(a_zero_point->Name()) == initializers.end(); + } else if (inputs.size() == 4) { + const auto& a_zero_point = node->InputDefs()[2]; + const auto& b_zero_point = node->InputDefs()[3]; + + // not found in initializers -> not const 
+ return initializers.find(a_zero_point->Name()) == initializers.end() || + initializers.find(b_zero_point->Name()) == initializers.end(); + } // else -> azp & bzp are 0 by default according to ONNX spec + } else if (optype == "ConvInteger") { + // all ConvInteger zero points need to be constants + const auto inputs = node->InputDefs(); + if (inputs.size() == 3) { + const auto& x_zero_point = node->InputDefs()[2]; + + // not found in initializers -> not const + return initializers.find(x_zero_point->Name()) == initializers.end(); + } else if (inputs.size() == 4) { + const auto& x_zero_point = node->InputDefs()[2]; + const auto& w_zero_point = node->InputDefs()[3]; + + // not found in initializers -> not const + return initializers.find(x_zero_point->Name()) == initializers.end() || + initializers.find(w_zero_point->Name()) == initializers.end(); + } // else -> xzp & wzp are 0 by default according to ONNX spec } //Op doesn't fall into known any of unsupported modes. @@ -237,21 +287,10 @@ static void AppendClusterToSubGraph(const std::vector& nodes, const onnxruntime::GraphViewer& graph_viewer, const std::vector& inputs, const std::vector& outputs, - const std::unordered_set& ng_required_initializers, std::vector>& result) { static size_t op_counter = 0; - // Create ng_required_initializers attribute of NGraphCustomOp - ONNX_NAMESPACE::AttributeProto initializers; - initializers.set_name("initializers"); - initializers.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_TENSORS); - for (const auto& init : ng_required_initializers) { - auto tensor = initializers.add_tensors(); - *tensor = *(graph_viewer.GetAllInitializedTensors().at(init)); - } - auto meta_def = std::make_unique(); - meta_def->attributes["initializers"] = initializers; meta_def->name = "NGRAPHCustomOp_" + std::to_string(++op_counter); meta_def->domain = kNGraphDomain; meta_def->since_version = 1; @@ -259,6 +298,13 @@ static void AppendClusterToSubGraph(const std::vector& nodes, meta_def->inputs = inputs; meta_def->outputs = outputs; + //store the name of the graph this node belongs to - used to retrieve graph initializers from the cache + ONNX_NAMESPACE::AttributeProto graph_name; + graph_name.set_name("graph_name"); + graph_name.set_type(ONNX_NAMESPACE::AttributeProto_AttributeType::AttributeProto_AttributeType_STRING); + graph_name.set_s(graph_viewer.Name()); + meta_def->attributes["graph_name"] = graph_name; + std::unique_ptr sub_graph = std::make_unique(); sub_graph->nodes = nodes; sub_graph->SetMetaDef(meta_def); @@ -274,7 +320,7 @@ static std::map> GetNgSupportedOps(const int std::map> ng_supported_ops; ng_supported_ops.emplace(kOnnxDomain, ngraph::onnx_import::get_supported_operators(onnx_opset, kOnnxDomain)); - const std::set ng_disabled_ops = {}; //Place-holder for ops not supported. + const std::set ng_disabled_ops = {"LSTM", "Gather"}; //Place-holder for ops not supported. 
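A few hunks above, NGRAPHCustomOp::Initialize gained a bounded cache of compiled executables: keyCache, a std::list kept most-recently-used first, paired with ng_exe_map_, with the capacity read from the ONNXRUNTIME_NGRAPH_LRU_CACHE_SIZE environment variable (default 500). The sketch below shows that list-plus-map LRU pattern in isolation; LruCache is a hypothetical helper, not part of the patch, and it stores list iterators in the map so the "touch" step is O(1) rather than the O(n) std::list::remove used above.

// Hedged sketch of the list + unordered_map LRU pattern used by the nGraph EP cache.
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

template <typename V>
class LruCache {
 public:
  explicit LruCache(size_t capacity) : capacity_(capacity) {}

  // Returns a pointer to the cached value, or nullptr on a miss.
  V* Get(const std::string& key) {
    auto it = map_.find(key);
    if (it == map_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second.second);  // move key to the front
    return &it->second.first;
  }

  void Put(const std::string& key, V value) {
    if (V* existing = Get(key)) {    // refresh an existing entry
      *existing = std::move(value);
      return;
    }
    if (map_.size() == capacity_) {  // evict the least recently used key
      map_.erase(order_.back());
      order_.pop_back();
    }
    order_.push_front(key);
    map_.emplace(key, std::make_pair(std::move(value), order_.begin()));
  }

 private:
  size_t capacity_;
  std::list<std::string> order_;  // front = most recently used
  std::unordered_map<std::string, std::pair<V, std::list<std::string>::iterator>> map_;
};

In the patch the key is the serialized input-shape string and the value is the compiled ngraph executable.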
for (const auto& disabled_op : ng_disabled_ops) { ng_supported_ops.at(kOnnxDomain).erase(disabled_op); @@ -283,7 +329,8 @@ static std::map> GetNgSupportedOps(const int return ng_supported_ops; } -static std::vector GetUnsupportedNodeIndices(const GraphViewer& graph_viewer, /*out*/ std::unordered_set& ng_required_initializers) { +static std::vector +GetUnsupportedNodeIndices(const GraphViewer& graph_viewer, /*out*/ std::unordered_set& ng_required_initializers) { const auto ng_supported_ops = GetNgSupportedOps(GetOnnxOpSet(graph_viewer)); std::vector unsupported_nodes_idx; @@ -303,10 +350,12 @@ static std::vector GetUnsupportedNodeIndices(const GraphViewer& graph return unsupported_nodes_idx; } -/* Returns a vector clusters(or node_idx). For each unsupported node, the graph is split into 3 parts. - supported_cluster + (UNsupported_node + rest_of_the_graph). This functions returns vector of all supported_clusters by nGraph -*/ -static std::vector> GetPartitionedClusters(const std::vector& topological_order, const std::vector& unsupported_nodes) { +/** + * Returns a vector clusters(or node_idx). For each unsupported node, the graph is split into 3 parts. + * supported_cluster + (UNsupported_node + rest_of_the_graph). This functions returns vector of all supported_clusters by nGraph + */ +static std::vector> +GetPartitionedClusters(const std::vector& topological_order, const std::vector& unsupported_nodes) { std::vector> ng_clusters; auto prev = topological_order.begin(); @@ -457,7 +506,7 @@ NGRAPHExecutionProvider::GetCapability(const onnxruntime::GraphViewer& graph_vie [&outputs](const NodeArg* node_arg) { outputs.push_back(node_arg->Name()); }); // Create and add this graph to result. - AppendClusterToSubGraph(graph_viewer.GetNodesInTopologicalOrder(), graph_viewer, inputs, outputs, ng_required_initializers, result); + AppendClusterToSubGraph(graph_viewer.GetNodesInTopologicalOrder(), graph_viewer, inputs, outputs, result); } else { // unsupported_nodes_idx.empty() const auto ng_clusters = GetPartitionedClusters(graph_viewer.GetNodesInTopologicalOrder(), unsupported_nodes); @@ -467,7 +516,7 @@ NGRAPHExecutionProvider::GetCapability(const onnxruntime::GraphViewer& graph_vie GetInputsOutputsOfCluster(graph_viewer, this_cluster, ng_required_initializers, cluster_inputs, cluster_outputs); if (!cluster_inputs.empty()) { - AppendClusterToSubGraph(this_cluster, graph_viewer, cluster_inputs, cluster_outputs, ng_required_initializers, result); + AppendClusterToSubGraph(this_cluster, graph_viewer, cluster_inputs, cluster_outputs, result); } } } @@ -476,35 +525,21 @@ NGRAPHExecutionProvider::GetCapability(const onnxruntime::GraphViewer& graph_vie } static ONNX_NAMESPACE::ModelProto GetModelProtoFromFusedNode(const onnxruntime::Node* fused_node) { - const auto& attributes = fused_node->GetAttributes(); - const auto& initializers = attributes.at("initializers").tensors(); - - ONNX_NAMESPACE::ModelProto model_proto; - auto graph_proto = model_proto.mutable_graph(); - const auto& fused_graph = fused_node->GetFunctionBody()->Body(); + const auto* node_function = fused_node->GetFunctionBody(); - for (const auto& node : fused_graph.Nodes()) { - node.ToProto(*(graph_proto->add_node())); - } + ORT_ENFORCE(node_function != nullptr, "Could not extract function body for node: ", fused_node->Name()); - for (const auto& input : fused_node->InputDefs()) { - auto valueInfoProto = graph_proto->add_input(); - *valueInfoProto = input->ToProto(); - } + const Graph& node_subgraph = node_function->Body(); + 
onnxruntime::Model model{node_subgraph.Name(), true}; - for (const auto& output : fused_node->OutputDefs()) { - auto valueInfoProto = graph_proto->add_output(); - *valueInfoProto = output->ToProto(); - } + ONNX_NAMESPACE::ModelProto model_proto = model.ToProto(); + model_proto.set_ir_version(ONNX_NAMESPACE::Version::IR_VERSION); - for (const auto& initializer : initializers) { - graph_proto->add_initializer()->CopyFrom(initializer); - } + *(model_proto.mutable_graph()) = node_subgraph.ToGraphProto(); auto opset = model_proto.add_opset_import(); opset->set_domain(kOnnxDomain); - opset->set_version(fused_graph.DomainToVersionMap().at(kOnnxDomain)); - model_proto.set_ir_version(ONNX_NAMESPACE::Version::IR_VERSION); + opset->set_version(node_subgraph.DomainToVersionMap().at(kOnnxDomain)); return model_proto; } @@ -512,14 +547,14 @@ static ONNX_NAMESPACE::ModelProto GetModelProtoFromFusedNode(const onnxruntime:: Status NGRAPHExecutionProvider::Compile(const std::vector& fused_nodes, std::vector& node_compute_funcs) { for (const auto& fused_node : fused_nodes) { - auto model_proto = GetModelProtoFromFusedNode(fused_node); - NodeComputeInfo compute_info; // Local copy of backend since, class members cannot be captured. auto ngraph_backend = ng_backend_; - compute_info.create_state_func = [model_proto, ngraph_backend](ComputeContext* context, FunctionState* state) { - auto* p = new onnxruntime::ngraph_ep::NGRAPHCustomOp(context, model_proto, ngraph_backend); + compute_info.create_state_func = [model_proto = GetModelProtoFromFusedNode(fused_node), ngraph_backend] + (ComputeContext* context, FunctionState* state) + { + auto* p = new ngraph_ep::NGRAPHCustomOp(context, model_proto, ngraph_backend); *state = p; return 0; }; diff --git a/onnxruntime/core/providers/ngraph/ngraph_execution_provider.h b/onnxruntime/core/providers/ngraph/ngraph_execution_provider.h index f4081a43a555b..daade7022d44d 100644 --- a/onnxruntime/core/providers/ngraph/ngraph_execution_provider.h +++ b/onnxruntime/core/providers/ngraph/ngraph_execution_provider.h @@ -4,12 +4,13 @@ #pragma once #include "core/framework/execution_provider.h" +#include namespace ngraph { -namespace runtime { -class Backend; + namespace runtime { + class Backend; + } } -} // namespace ngraph namespace onnxruntime { @@ -35,4 +36,4 @@ class NGRAPHExecutionProvider : public IExecutionProvider { std::shared_ptr ng_backend_; }; -} // namespace onnxruntime +} diff --git a/onnxruntime/core/providers/nuphar/common/analysis/analysis.h b/onnxruntime/core/providers/nuphar/common/analysis/analysis.h new file mode 100644 index 0000000000000..9c4e771761814 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/analysis.h @@ -0,0 +1,45 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
+ +#pragma once +#include "core/codegen/common/common.h" +#include "core/common/common.h" +#include "core/graph/graph_viewer.h" +#include "core/providers/nuphar/common/nuphar_subgraph.h" + +namespace onnxruntime { +namespace nuphar { + +// abstract class for Analysis +template +class AnalysisBase { + public: + AnalysisBase() {} + + AnalysisBase(const std::string& name) + : name_(name) {} + + virtual ~AnalysisBase() = default; + + virtual void Evaluate(INPUT_TYPE) = 0; + + const std::string& Name() const { + return name_; + } + + protected: + const std::string name_{"Unknown"}; + + private: + ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(AnalysisBase); +}; + +using OrtAnalysis = AnalysisBase; +using NupharAnalysis = AnalysisBase; + +// Add Promote for OrtAnalysis and NupharAnalysis +DYNAMIC_PROMOTE(OrtAnalysis) +DYNAMIC_PROMOTE(NupharAnalysis) + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/graph_stats.h b/onnxruntime/core/providers/nuphar/common/analysis/graph_stats.h new file mode 100644 index 0000000000000..f86a28695b9b9 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/graph_stats.h @@ -0,0 +1,77 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include "core/codegen/common/common.h" +#include "core/common/common.h" +#include "core/graph/graph_viewer.h" +#include "core/providers/nuphar/common/analysis/analysis.h" + +#include "core/providers/nuphar/common/nuphar_subgraph.h" +// Base class of GraphStatsBase +// GraphStatsBase holds analysis results from a graph +// GraphStatsBase can hold multiple analyses + +namespace onnxruntime { +namespace nuphar { + +template +class GraphStatsBase { + public: + GraphStatsBase(const std::string& name) + : name_(name) {} + + GraphStatsBase() {} + + virtual ~GraphStatsBase() = default; + + // Evaluate all passes + virtual void Evaluate(INPUT_TYPE graph) { + for (auto& pass : passes_) { + pass->Evaluate(graph); + } + } + + // Set passes externally + void SetAllPasses(const std::vector>>& passes) { + passes_.clear(); + for (auto& pass : passes) { + passes_.push_back(pass); + } + } + + // Set existed evaluated passes externally + void SetAllExistedEvaluatedPasses( + const std::vector>>& passes) { + existed_eval_passes_.clear(); + for (auto& pass : passes) { + existed_eval_passes_.push_back(pass); + } + } + + const std::string& Name() const { + return name_; + } + + protected: + const std::string name_{"Unknown"}; + + std::vector>> passes_; + + private: + // existed eval passes not requiring evaluation + std::vector>> existed_eval_passes_; + + ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(GraphStatsBase); +}; + +using OrtGraphStats = GraphStatsBase; +using NupharSubgraphUnitStats = GraphStatsBase; + +// Add Promote for OrtGraphStats and NupharSubgraphUnitStats +DYNAMIC_PROMOTE(OrtGraphStats) +DYNAMIC_PROMOTE(NupharSubgraphUnitStats) + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/output_alias_analysis.cc b/onnxruntime/core/providers/nuphar/common/analysis/output_alias_analysis.cc new file mode 100644 index 0000000000000..1283f9cd409a8 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/output_alias_analysis.cc @@ -0,0 +1,109 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
+ +#include "core/providers/nuphar/common/analysis/output_alias_analysis.h" + +#include "core/codegen/common/common.h" + +namespace onnxruntime { +namespace nuphar { + +void OutputAliasAnalysis::Traverse(const std::vector& nodes, + const std::set& graph_inputs, + const std::set& graph_outputs) { + for (auto& node : nodes) { + if (node->NodeType() == Node::Type::Fused) { + // unboxing of fused node + const auto& func_body = GraphViewer(node->GetFunctionBody()->Body()); + Traverse(ConvertGraphNodesToNodePtrs(func_body.Nodes()), graph_inputs, graph_outputs); + } else { + // TODO: change identity to other alias + bool is_identity = (node->OpType() == "Identity"); + node->ForEachWithIndex( + node->OutputDefs(), + [&](const NodeArg& def, size_t) { + if (graph_outputs.count(def.Name()) > 0) { + NodeKey key = GetKey(node); + output_nodes_.insert(key); + if (is_identity) { + auto input_def = node->InputDefs()[0]; + // regard as aliased if input_def is not graph input + // otherwise, we still generate Identity ops in TVM + // TODO: remove once we have a better solution for alias optimization + if (graph_inputs.count(input_def->Name()) == 0) { + alias_use_defs_.insert(std::make_pair(key, input_def)); + NodeKey input_key = GetKey(input_def); + output_nodes_.insert(input_key); + } + } + } + return Status::OK(); + }); + } + } +} + +// TODO: please reimplement output alias using the right algorithm. +// Currently we only copy it from old graph_stats, which is still wrong one +void OutputAliasAnalysis::Evaluate(const onnxruntime::nuphar::NupharSubgraphUnit& graph) { + if (graph.IsSingleNode()) { + const Node* node = graph.nodes.front(); + auto subgraph = GetSubgraph(*node); + + if (nullptr != subgraph) { + std::set graph_inputs; + std::set graph_outputs; + const auto& graph_viewer = GraphViewer(*subgraph); + for (const auto* def : graph_viewer.GetInputs()) { + if (nullptr != def) { + graph_inputs.insert(def->Name()); + } + } + for (const auto* def : graph_viewer.GetOutputs()) { + if (nullptr != def) { + graph_outputs.insert(def->Name()); + } + } + Traverse(ConvertGraphNodesToNodePtrs(graph_viewer.Nodes()), graph_inputs, graph_outputs); + } else { + NodeKey key = GetKey(node); + output_nodes_.insert(key); + } + } else { + // outputs names + std::set graph_inputs; + std::set graph_outputs; + for (const auto* def : graph.inputs) { + if (nullptr != def) { + graph_inputs.insert(def->Name()); + } + } + for (const auto* def : graph.outputs) { + if (nullptr != def) { + graph_outputs.insert(def->Name()); + } + } + Traverse(graph.nodes, graph_inputs, graph_outputs); + } +} + +bool OutputAliasAnalysis::IsOutputNode(const onnxruntime::Node* node) const { + return output_nodes_.count(GetKey(node)) != 0; +} + +bool OutputAliasAnalysis::IsOutputAlias(const onnxruntime::Node* node) const { + auto key = GetKey(node); + return alias_use_defs_.count(key) != 0; +} + +const onnxruntime::NodeArg* +OutputAliasAnalysis::SourceDefOfOutputAlias(const onnxruntime::NodeArg* node) const { + auto iter = alias_use_defs_.find(GetKey(node)); + if (iter != alias_use_defs_.end()) { + return iter->second; + } + return nullptr; +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/output_alias_analysis.h b/onnxruntime/core/providers/nuphar/common/analysis/output_alias_analysis.h new file mode 100644 index 0000000000000..57a86205c5041 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/output_alias_analysis.h @@ -0,0 +1,43 @@ +// Copyright (c) Microsoft 
Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include "core/codegen/common/common.h" +#include "core/graph/graph.h" +#include "core/providers/nuphar/common/analysis/analysis.h" + +namespace onnxruntime { +namespace nuphar { + +class OutputAliasAnalysis : public NupharAnalysis { + public: + OutputAliasAnalysis() + : NupharAnalysis("OutputAliasAnalysis") {} + + ~OutputAliasAnalysis() = default; + + void Evaluate(const onnxruntime::nuphar::NupharSubgraphUnit& graph) override; + + bool IsOutputNode(const onnxruntime::Node* node) const; + + bool IsOutputAlias(const onnxruntime::Node* node) const; + + const onnxruntime::NodeArg* SourceDefOfOutputAlias(const onnxruntime::NodeArg* node) const; + + private: + // a set for output nodes + std::set output_nodes_; + // a map from an output alias to its input + std::map alias_use_defs_; + + void Traverse(const std::vector& nodes, + const std::set& graph_inputs, + const std::set& graph_outputs); + + private: + ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(OutputAliasAnalysis); +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/shape_expr.h b/onnxruntime/core/providers/nuphar/common/analysis/shape_expr.h new file mode 100644 index 0000000000000..76495c77df662 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/shape_expr.h @@ -0,0 +1,243 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once +#include "core/common/common.h" + +// TODO retire this file + +namespace onnxruntime { + +// A mini IR layer for shape inference +// Currently just use tvm::Expr but can be replaced by others later +// Following features are needed: +// 1. represent symbolic int +// 2. represent +-*/ +// 3. check if two DimExpr is the same +// 4. 
simplify if needed +// For now only symbolic int is supported +class SimpleDimExpr { + public: + SimpleDimExpr() : has_value_(false) {} + SimpleDimExpr(int64_t i) : value_(i), has_value_(true) {} + SimpleDimExpr(const std::string& sym) : symbol_(sym), has_value_(false) {} + bool IsConst() const { return has_value_; } + bool IsOne() const { return has_value_ && value_ == 1; } + bool operator==(const SimpleDimExpr& expr) const { + if (has_value_ != expr.has_value_) + return false; + + if (has_value_) + return value_ == expr.value_; + else + return symbol_ == expr.symbol_; + } + + bool operator!=(const SimpleDimExpr& expr) const { + return !(*this == expr); + } + + SimpleDimExpr operator+(const SimpleDimExpr& other) const { + ORT_ENFORCE(has_value_ && other.has_value_); + return SimpleDimExpr(value_ + other.value_); + } + + SimpleDimExpr operator-(const SimpleDimExpr& other) const { + ORT_ENFORCE(has_value_ && other.has_value_); + return SimpleDimExpr(value_ - other.value_); + } + + SimpleDimExpr operator*(const SimpleDimExpr& other) const { + if (has_value_ && other.has_value_) + return SimpleDimExpr(value_ * other.value_); + else if (IsOne()) + return other; + else if (other.IsOne()) + return *this; + else + ORT_ENFORCE(false, "unsupported symbolic dim computation"); + } + + SimpleDimExpr operator/(const SimpleDimExpr& other) const { + if (has_value_ && other.has_value_) + return SimpleDimExpr(value_ / other.value_); + else if (other.IsOne()) + return *this; + else + ORT_ENFORCE(false, "unsupported symbolic dim computation"); + } + + SimpleDimExpr operator%(const SimpleDimExpr& other) const { + ORT_ENFORCE(has_value_ && other.has_value_); + return SimpleDimExpr(value_ % other.value_); + } + + int64_t Value() const { + ORT_ENFORCE(IsConst()); + return value_; + } + + const std::string& Symbol() const { + ORT_ENFORCE(!IsConst()); + return symbol_; + } + + std::string ToString() const { + if (has_value_) + return std::to_string(value_); + else + return symbol_; + } + + private: + std::string symbol_; + int64_t value_; + bool has_value_; +}; + +template +class ShapeExprT { + public: + ShapeExprT() = default; + ShapeExprT(const ShapeExprT& expr) = default; + ShapeExprT(ShapeExprT&& expr) = default; + ShapeExprT(size_t size) { dims_.resize(size); } + ShapeExprT(const std::vector& dims) : dims_(dims) {} + ShapeExprT(const std::vector& dims) { + for (auto dim : dims) + dims_.push_back(DimT(dim)); + } + + size_t Rank() const { + return dims_.size(); + } + + int64_t TotalKnown() const { + if (dims_.size() == 0) + return 1; + int64_t total = 1; + for (size_t i = 0; i < dims_.size(); ++i) { + if (dims_[i].IsConst()) + total = total * dims_[i].Value(); + } + return total; + } + + size_t KnownFromDimension() const { + size_t min_index = dims_.size(); + for (int i = static_cast(dims_.size() - 1); i >= 0; i--) { + if (!dims_[i].IsConst()) + break; + min_index = static_cast(i); + } + return min_index; + } + + std::vector TailedKnown() const { + std::vector result; + + for (size_t i = KnownFromDimension(); i < Rank(); ++i) { + result.push_back(dims_[i].Value()); + } + return result; + } + + int64_t TotalTailedKnown() const { + int64_t result = 1; + for (size_t i = KnownFromDimension(); i < Rank(); ++i) { + result *= dims_[i].Value(); + } + return result; + } + + /** + Return the total number of elements up to the specified dimension. + @param dim Return size up to this dimension. Value must be >= 0 and < this.Size(). 
+  */
+  DimT SizeToDimension(size_t dim) const {
+    DimT total(1);
+    for (size_t i = 0; i < std::min(dim, dims_.size()); ++i)
+      total = total * dims_[i];
+    return total;
+  }
+
+  /**
+  Return the total number of elements from the specified dimension to the end of the tensor shape.
+  @param dim Return size from this dimension to the end. 0 <= dim < this.Size().
+  */
+  DimT SizeFromDimension(size_t dim) const {
+    DimT total(1);
+    for (size_t i = dim; i < dims_.size(); ++i)
+      total = total * dims_[i];
+    return total;
+  }
+
+  bool IsConst() const {
+    return std::all_of(dims_.begin(), dims_.end(), [](const DimT& dim) { return dim.IsConst(); });
+  }
+
+  bool operator==(const ShapeExprT& shape) const {
+    if (Rank() != shape.Rank())
+      return false;
+
+    for (size_t dim = 0; dim < Rank(); ++dim) {
+      if (dims_[dim] != (shape.dims_[dim]))
+        return false;
+    }
+    return true;
+  }
+
+  const ShapeExprT& operator=(const ShapeExprT& shape) {
+    dims_ = shape.dims_;
+    return *this;
+  }
+
+  const DimT& at(size_t dim) const {
+    ORT_ENFORCE(dim < Rank());
+    return dims_[dim];
+  }
+
+  DimT& at(size_t dim) {
+    ORT_ENFORCE(dim < Rank());
+    return dims_[dim];
+  }
+
+  const DimT& operator[](size_t dim) const {
+    ORT_ENFORCE(dim < Rank());
+    return dims_[dim];
+  }
+
+  DimT& operator[](size_t dim) {
+    ORT_ENFORCE(dim < Rank());
+    return dims_[dim];
+  }
+
+  const std::vector Value() const {
+    ORT_ENFORCE(IsConst());
+    std::vector result;
+    for (size_t i = 0; i < Rank(); ++i) {
+      result.push_back(dims_[i].Value());
+    }
+    return result;
+  }
+
+  std::string ToString() const {
+    std::ostringstream oss;
+    oss << "(";
+    for (size_t i = 0; i < Rank(); ++i) {
+      if (i > 0)
+        oss << ", ";
+      oss << dims_[i].ToString();
+    }
+    oss << ")";
+    return oss.str();
+  }
+
+ private:
+  std::vector dims_;
+};
+
+typedef SimpleDimExpr DimExpr;
+typedef ShapeExprT ShapeExpr;
+
+} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/common/analysis/subgraph_codegen_stats.cc b/onnxruntime/core/providers/nuphar/common/analysis/subgraph_codegen_stats.cc
new file mode 100644
index 0000000000000..836c3b98291c8
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/common/analysis/subgraph_codegen_stats.cc
@@ -0,0 +1,63 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
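+//
+// NOTE: the analysis results below are fetched by fixed offsets into passes_,
+// so the push_back order in the constructor must stay in sync with the
+// *AnalysisOffset constants defined in this file.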
+ +#include "core/providers/nuphar/common/analysis/subgraph_codegen_stats.h" + +#include "core/providers/nuphar/common/analysis/output_alias_analysis.h" +#include "core/providers/nuphar/common/analysis/use_count_analysis.h" + +namespace onnxruntime { +namespace nuphar { + +// CodeGenUnitStats has two analysis passes +// The first pass, offset as 0, is UseCountAnalysis +// The second pass, offset as 1, is OutputAliasAnalysis +constexpr int UseCountAnalysisOffset = 0; +constexpr int OutputAliasAnalysisOffset = 1; + +// True reuse count for cheap Op +constexpr int CheapNodeTrueReuseCount = 2; + +// Constructor +CodeGenUnitStats::CodeGenUnitStats( + const std::shared_ptr& shape_infernece) + : NupharSubgraphUnitStats("CodeGenUnitStats") { + auto use_count_pass = std::make_shared(shape_infernece); + passes_.push_back(use_count_pass); + + auto output_alias_pass = std::make_shared(); + passes_.push_back(output_alias_pass); +} + +int CodeGenUnitStats::NodeUseCount(const onnxruntime::Node* node) const { + ORT_ENFORCE(passes_.size() > UseCountAnalysisOffset); + return Promote(passes_[UseCountAnalysisOffset])->NodeUseCount(node); +} + +bool CodeGenUnitStats::IsCheapNodeReuse(const onnxruntime::Node* node) const { + ORT_ENFORCE(passes_.size() > UseCountAnalysisOffset); + // Define cheap nodes include Add / Sub / Mul + if (node->OpType() == "Add" || node->OpType() == "Sub" || node->OpType() == "Mul") + return Promote(passes_[UseCountAnalysisOffset])->NodeUseCount(node) > CheapNodeTrueReuseCount; + + // Otherwise return true and use count is determined by NodeUseCount + return true; +} + +bool CodeGenUnitStats::IsOutputNode(const onnxruntime::Node* node) const { + ORT_ENFORCE(passes_.size() > OutputAliasAnalysisOffset); + return Promote(passes_[OutputAliasAnalysisOffset])->IsOutputNode(node); +} + +bool CodeGenUnitStats::IsOutputAlias(const onnxruntime::Node* node) const { + ORT_ENFORCE(passes_.size() > OutputAliasAnalysisOffset); + return Promote(passes_[OutputAliasAnalysisOffset])->IsOutputAlias(node); +} + +const onnxruntime::NodeArg* CodeGenUnitStats::SourceDefOfOutputAlias(const onnxruntime::NodeArg* node) const { + ORT_ENFORCE(passes_.size() > OutputAliasAnalysisOffset); + return Promote(passes_[OutputAliasAnalysisOffset])->SourceDefOfOutputAlias(node); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/subgraph_codegen_stats.h b/onnxruntime/core/providers/nuphar/common/analysis/subgraph_codegen_stats.h new file mode 100644 index 0000000000000..69b135f73e34d --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/subgraph_codegen_stats.h @@ -0,0 +1,34 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
+ +#pragma once +#include "core/codegen/common/common.h" +#include "core/providers/nuphar/common/analysis/graph_stats.h" +#include "core/providers/nuphar/common/analysis/use_count_analysis.h" +#include "core/providers/nuphar/common/nuphar_subgraph.h" + +namespace onnxruntime { +namespace nuphar { + +class CodeGenUnitStats : public NupharSubgraphUnitStats { + public: + CodeGenUnitStats(const std::shared_ptr& shape_infernece); + + ~CodeGenUnitStats() = default; + + int NodeUseCount(const onnxruntime::Node* node) const; + + bool IsCheapNodeReuse(const onnxruntime::Node* node) const; + + bool IsOutputNode(const onnxruntime::Node* node) const; + + bool IsOutputAlias(const onnxruntime::Node* node) const; + + const onnxruntime::NodeArg* SourceDefOfOutputAlias(const onnxruntime::NodeArg* node) const; + + private: + ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(CodeGenUnitStats); +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/subgraph_partition_stats.cc b/onnxruntime/core/providers/nuphar/common/analysis/subgraph_partition_stats.cc new file mode 100644 index 0000000000000..b35b456bd6bbd --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/subgraph_partition_stats.cc @@ -0,0 +1,28 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/common/analysis/subgraph_partition_stats.h" + +#include "core/providers/nuphar/common/analysis/use_count_analysis.h" + +namespace onnxruntime { +namespace nuphar { + +// TODO: Add memory analysis +// SubgraphPartitionStats has one analysis pass +// The first pass, offset as 0, is UseCountAnalysis +constexpr int UseCountAnalysisOffset = 0; + +void SubgraphPartitionStats::SetShapeInference( + const std::shared_ptr& shape_infernece) { + passes_.clear(); + passes_.emplace_back(std::make_shared(shape_infernece)); +} + +int SubgraphPartitionStats::NodeUseCount(const onnxruntime::Node* node) const { + ORT_ENFORCE(passes_.size() > UseCountAnalysisOffset); + return Promote(passes_[UseCountAnalysisOffset])->NodeUseCount(node); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/subgraph_partition_stats.h b/onnxruntime/core/providers/nuphar/common/analysis/subgraph_partition_stats.h new file mode 100644 index 0000000000000..afbcf0d2a3886 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/subgraph_partition_stats.h @@ -0,0 +1,31 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
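+
+// SubgraphPartitionStats exposes node use counts at graph-partitioning time.
+// Unlike CodeGenUnitStats, it currently runs only the use-count analysis pass
+// (memory analysis is still a TODO in subgraph_partition_stats.cc).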
+ +#pragma once + +#include "core/codegen/common/common.h" +#include "core/providers/nuphar/common/analysis/graph_stats.h" +#include "core/providers/nuphar/compiler/traverse_shape_infer.h" + +namespace onnxruntime { +namespace nuphar { + +// TODO: rename class name to more target-specific in the tvm refactoring +// Maybe GraphPartitionStatsX86 +class SubgraphPartitionStats : public OrtGraphStats { + public: + SubgraphPartitionStats() + : OrtGraphStats("SubgraphPartitionStats") {} + + ~SubgraphPartitionStats() = default; + + void SetShapeInference(const std::shared_ptr& shape_infernece); + + int NodeUseCount(const onnxruntime::Node* node) const; + + private: + ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(SubgraphPartitionStats); +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/use_count_analysis.cc b/onnxruntime/core/providers/nuphar/common/analysis/use_count_analysis.cc new file mode 100644 index 0000000000000..c754db4377fba --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/use_count_analysis.cc @@ -0,0 +1,264 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/common/analysis/use_count_analysis.h" + +#include "core/codegen/common/common.h" +#include "core/graph/function.h" + +namespace onnxruntime { +namespace nuphar { + +constexpr int PRESET_USE_COUNT_FOR_UNKNOWN = 10; +constexpr int PRESET_USE_COUNT_FOR_SOFTMAX = 3; + +static void CountGemmOp(const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function shape_func, + std::unordered_map& node_use_counts); + +static void CountMatMulOp(const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function shape_func, + std::unordered_map& node_use_counts); + +static void CountRecurrentOp(const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function shape_func, + std::unordered_map& node_use_counts); + +static void CountMatrixArgs(const onnxruntime::NodeArg* A, + const onnxruntime::NodeArg* B, + const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function shape_func, + std::unordered_map& node_use_counts); + +static void CountNodeArg(const onnxruntime::NodeArg* input_def, + const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::unordered_map& node_use_counts, + int use_cnt); + +static bool IsMatMulOp(const std::string& op) { + return op == "MatMul" || op == "MatMulInteger" || op == "MatMulInteger16"; +} + +void CountGemmOp(const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function shape_func, + std::unordered_map& node_use_counts) { + ORT_ENFORCE(node.OpType() == "Gemm"); + + auto inputs = node.InputDefs(); + CountMatrixArgs(inputs[0], inputs[1], node, graph_inputs, shape_func, node_use_counts); + // C's use cnt is fixed. 
+ CountNodeArg(inputs[2], node, graph_inputs, node_use_counts, 1); +} + +void CountMatMulOp(const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function shape_func, + std::unordered_map& node_use_counts) { + ORT_ENFORCE(IsMatMulOp(node.OpType())); + auto inputs = node.InputDefs(); + CountMatrixArgs(inputs[0], inputs[1], node, graph_inputs, shape_func, node_use_counts); +} + +void CountRecurrentOp(const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function, + std::unordered_map& node_use_counts) { + int use_count = PRESET_USE_COUNT_FOR_UNKNOWN; + + node.ForEachWithIndex( + node.InputDefs(), + [&node, &graph_inputs, &node_use_counts, &use_count](const NodeArg& def, size_t) { + CountNodeArg(&def, node, graph_inputs, node_use_counts, use_count); + return Status::OK(); + }); +} + +static bool IsSoftmaxOp(const std::string& op) { + return op == "Softmax" || op == "LogSoftmax"; +} + +void CountSoftmaxOp(const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function, + std::unordered_map& node_use_counts) { + // Use preset use count for Softmax/LogSoftmax input + int use_count = PRESET_USE_COUNT_FOR_SOFTMAX; + + node.ForEachWithIndex( + node.InputDefs(), + [&node, &graph_inputs, &node_use_counts, &use_count](const NodeArg& def, size_t) { + CountNodeArg(&def, node, graph_inputs, node_use_counts, use_count); + return Status::OK(); + }); +} + +void CountMatrixArgs(const onnxruntime::NodeArg* A, + const onnxruntime::NodeArg* B, + const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::function shape_func, + std::unordered_map& node_use_counts) { + int use_cnt = PRESET_USE_COUNT_FOR_UNKNOWN; + const ShapeExpr* a_shape = shape_func(A); + if (nullptr != a_shape) { + // B's use cnt is based on the rows of A + // skip symbolic dimensions for Sequence and batch + auto a_cols = (a_shape->Rank() > 0 && a_shape->at(a_shape->Rank() - 1).IsConst()) ? a_shape->at(a_shape->Rank() - 1).Value() : 1; + use_cnt = a_shape->TotalTailedKnown() / a_cols; + } + CountNodeArg(B, node, graph_inputs, node_use_counts, use_cnt); + + // reset use_cnt + use_cnt = PRESET_USE_COUNT_FOR_UNKNOWN; + const ShapeExpr* b_shape = shape_func(B); + if (nullptr != b_shape) { + const DimExpr& dim = b_shape->Rank() > 1 ? b_shape->at(b_shape->Rank() - 1) : DimExpr(1); + // A's use cnt is based on the cols of B. 
If B is 1-D, use cnt is 1 + if (dim.IsConst()) + use_cnt = dim.Value(); + } + + CountNodeArg(A, node, graph_inputs, node_use_counts, use_cnt); +} + +void CountNodeArg(const onnxruntime::NodeArg* input_def, + const onnxruntime::Node& node, + const std::vector& graph_inputs, + std::unordered_map& node_use_counts, + int use_cnt) { + // Skip graph's input args nodes + if (std::find(graph_inputs.begin(), graph_inputs.end(), input_def) != graph_inputs.end()) + return; + + const Node* input_node = GetInputNode(node, input_def); + + if (nullptr != input_node) { + node_use_counts[GetKey(input_node)] += use_cnt; + } +} + +InternalUseCountAnalysis::InternalUseCountAnalysis(const std::shared_ptr& shape_inference) { + shape_func_ = [&shape_inference](const onnxruntime::NodeArg* X) { + return shape_inference->Lookup(X); + }; +} + +void InternalUseCountAnalysis::Traverse( + const std::vector& nodes, + const std::vector& graph_inputs, + const std::vector& graph_outputs) { + for (auto& node : nodes) { + auto op_type = node->OpType(); + if (op_type == "Gemm") { + CountGemmOp(*node, graph_inputs, shape_func_, node_use_counts_); + } else if (IsMatMulOp(op_type)) { + CountMatMulOp(*node, graph_inputs, shape_func_, node_use_counts_); + } else if (op_type == "Scan") { + auto subgraph = node->GetGraphAttribute("body"); + Evaluate(GraphViewer(*subgraph)); + int use_count = PRESET_USE_COUNT_FOR_UNKNOWN; + node->ForEachWithIndex( + node->InputDefs(), + [this, &node, &graph_inputs, &use_count](const NodeArg& def, size_t) { + CountNodeArg(&def, *node, graph_inputs, node_use_counts_, use_count); + return Status::OK(); + }); + } else if (IsRecurrentNode(*node)) { + CountRecurrentOp(*node, graph_inputs, shape_func_, node_use_counts_); + } else if (node->NodeType() == Node::Type::Fused) { + // note: when unboxing subgraph in fused node, use outermost graph input/output + const auto& func_body = GraphViewer(node->GetFunctionBody()->Body()); + Traverse(ConvertGraphNodesToNodePtrs(func_body.Nodes()), graph_inputs, graph_outputs); + } else if (IsSoftmaxOp(op_type)) { + CountSoftmaxOp(*node, graph_inputs, shape_func_, node_use_counts_); + } else { + int use_count = 1; + node->ForEachWithIndex( + node->InputDefs(), + [this, &node, &graph_inputs, &use_count](const NodeArg& def, size_t) { + CountNodeArg(&def, *node, graph_inputs, node_use_counts_, use_count); + return Status::OK(); + }); + } + + NodeKey key = GetKey(node); + // For any output_def of the node that is part of graph's outputs but not from graph.Nodes(), + // we need to increase the node's use cnt accordingly. Otherwise, we would lose those uses. 
+ node->ForEachWithIndex( + node->OutputDefs(), + [this, &graph_outputs, &key](const NodeArg& def, size_t) { + if (std::find(graph_outputs.begin(), graph_outputs.end(), &def) != graph_outputs.end()) { + node_use_counts_[key]++; + } + return Status::OK(); + }); + } +} + +void InternalUseCountAnalysis::Evaluate(const onnxruntime::GraphViewer& graph) { + const auto& graph_inputs = graph.GetInputs(); + const auto& graph_outputs = graph.GetOutputs(); + Traverse(ConvertGraphNodesToNodePtrs(graph.Nodes()), graph_inputs, graph_outputs); +} + +void InternalUseCountAnalysis::Evaluate(const onnxruntime::nuphar::NupharSubgraphUnit& graph) { + const auto& graph_inputs = graph.inputs; + const auto& graph_outputs = graph.outputs; + Traverse(graph.nodes, graph_inputs, graph_outputs); +} + +void InternalUseCountAnalysis::IncrementCount(const onnxruntime::NodeArg* def) { + node_use_counts_[GetKey(def)]++; +} + +int InternalUseCountAnalysis::NodeUseCount(const onnxruntime::Node* node) const { + auto node_iter = node_use_counts_.find(GetKey(node)); + if (node_iter != node_use_counts_.end()) { + return node_iter->second; + } else { + return 0; + } +} + +OrtUseCountAnalysis::OrtUseCountAnalysis(const std::shared_ptr& shape_inference) + : OrtAnalysis("OrtUseCountAnalysis") { + internal_analysis_ = std::make_unique(shape_inference); +} + +void OrtUseCountAnalysis::Evaluate(const onnxruntime::GraphViewer& graph) { + internal_analysis_->Evaluate(graph); +} + +void OrtUseCountAnalysis::IncrementCount(const onnxruntime::NodeArg* def) { + internal_analysis_->IncrementCount(def); +} + +int OrtUseCountAnalysis::NodeUseCount(const onnxruntime::Node* node) const { + return internal_analysis_->NodeUseCount(node); +} + +NupharUseCountAnalysis::NupharUseCountAnalysis(const std::shared_ptr& shape_inference) + : NupharAnalysis("NupharUseCountAnalysis") { + internal_analysis_ = std::make_unique(shape_inference); +} + +void NupharUseCountAnalysis::Evaluate(const onnxruntime::nuphar::NupharSubgraphUnit& graph) { + internal_analysis_->Evaluate(graph); +} + +void NupharUseCountAnalysis::IncrementCount(const onnxruntime::NodeArg* def) { + internal_analysis_->IncrementCount(def); +} + +int NupharUseCountAnalysis::NodeUseCount(const onnxruntime::Node* node) const { + return internal_analysis_->NodeUseCount(node); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/analysis/use_count_analysis.h b/onnxruntime/core/providers/nuphar/common/analysis/use_count_analysis.h new file mode 100644 index 0000000000000..4d79af579b609 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/analysis/use_count_analysis.h @@ -0,0 +1,83 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
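+
+// Use-count analysis estimates how often each node's output is consumed:
+// matrix operands of Gemm/MatMul are weighted by the known dimensions of the
+// other operand, recurrent ops and Scan bodies fall back to a preset count,
+// Softmax/LogSoftmax inputs use a smaller preset, and every other use counts
+// once. Producing a graph output adds one extra use to the producing node.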
+ +#pragma once +#include "core/codegen/common/common.h" +#include "core/providers/nuphar/common/analysis/analysis.h" +#include "core/providers/nuphar/common/analysis/shape_expr.h" +#include "core/providers/nuphar/compiler/traverse_shape_infer.h" +#include "core/graph/graph.h" + +#include +#include + +// TODO change namespace from codegen to nuphar + +namespace onnxruntime { +namespace nuphar { + +class InternalUseCountAnalysis { + public: + InternalUseCountAnalysis(const std::shared_ptr& shape_inference); + + ~InternalUseCountAnalysis() = default; + + void Evaluate(const onnxruntime::GraphViewer& graph); + + void Evaluate(const NupharSubgraphUnit& graph); + + void IncrementCount(const onnxruntime::NodeArg* arg); + + int NodeUseCount(const onnxruntime::Node* node) const; + + private: + void Traverse(const std::vector& nodes, + const std::vector& graph_inputs, + const std::vector& graph_outputs); + + std::unordered_map node_use_counts_; + std::function shape_func_; + + private: + ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(InternalUseCountAnalysis); +}; + +// TODO analysis move to namespace nuphar + +class OrtUseCountAnalysis : public OrtAnalysis { + public: + OrtUseCountAnalysis(const std::shared_ptr& shape_inference); + ~OrtUseCountAnalysis() = default; + + void Evaluate(const onnxruntime::GraphViewer& graph) override; + + void IncrementCount(const onnxruntime::NodeArg* arg); + + int NodeUseCount(const onnxruntime::Node* node) const; + + private: + std::unique_ptr internal_analysis_; + ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(OrtUseCountAnalysis); +}; + +class NupharUseCountAnalysis : public NupharAnalysis { + public: + NupharUseCountAnalysis(const std::shared_ptr& shape_inference); + + ~NupharUseCountAnalysis() = default; + + void Evaluate(const onnxruntime::nuphar::NupharSubgraphUnit& graph) override; + + void IncrementCount(const onnxruntime::NodeArg* arg); + + int NodeUseCount(const onnxruntime::Node* node) const; + + private: + std::unique_ptr internal_analysis_; + + private: + ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(NupharUseCountAnalysis); +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/nuphar_settings.cc b/onnxruntime/core/providers/nuphar/common/nuphar_settings.cc new file mode 100644 index 0000000000000..1e3981004ba9b --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/nuphar_settings.cc @@ -0,0 +1,132 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
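+
+// Settings arrive as a comma-separated list of "key:value" pairs, e.g.
+//   "nuphar_fast_math:short_polynormial_math, nuphar_cache_path:/tmp/nuphar_cache"
+// (the cache path above is only an illustrative value). Keys are validated
+// against valid_keys. Unless a key was set explicitly in the settings string,
+// an upper-cased environment variable of the same name (e.g. NUPHAR_CACHE_PATH)
+// can override its value.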
+ +#include "core/providers/nuphar/common/nuphar_settings.h" + +#include "core/codegen/common/common.h" +#include "core/codegen/common/utils.h" +#include "core/common/logging/logging.h" +#include "core/providers/nuphar/nuphar_execution_provider.h" + +#include +#include +#include +#include + +namespace onnxruntime { +namespace nuphar { + +static const std::unordered_set valid_keys = { + codegen::CodeGenSettings::kDumpAllOptions, + codegen::CodeGenSettings::kCodeGenDumpModule, + codegen::CodeGenSettings::kCodeGenDumpLower, + codegen::CodeGenSettings::kCodeGenDumpSchedule, + kNupharFastMath, + kNupharFastActivation, + kNupharDumpFusedNodes, + kNupharDumpPartition, + kNupharIMatMulForceMkl, + kNupharMatmulExec, + kNupharCachePath, + kNupharCacheVersion, + kNupharCacheSoName, + kNupharCacheModelChecksum, + kNupharCacheForceNoJIT, + kNupharCodeGenTarget}; + +void SetDefaultOptions(std::map& options) { + // create two temporary strings to get rid of the odr-use issue introduced + // The issue would trigger missing definition errors for static constexpr members + // at link time. + std::string fast_math_opt(kNupharFastMath); + std::string select_fast_math(kNupharFastMath_ShortPolynormial); + options.insert(std::make_pair(fast_math_opt, select_fast_math)); + + std::string fast_act_opt(kNupharFastActivation); + std::string select_fast_act(kNupharActivations_DeepCpu); + options.insert(std::make_pair(fast_act_opt, select_fast_act)); + + // set jit cache so name + std::string cache_so_name_opt(kNupharCacheSoName); + std::string cache_so_name_default(kNupharCacheSoName_Default); + options.insert(std::make_pair(cache_so_name_opt, cache_so_name_default)); +} + +void CreateNupharCodeGenSettings(const NupharExecutionProviderInfo& info) { + std::map options; + SetDefaultOptions(options); + + std::unordered_set required_options; + if (!info.settings.empty()) { + const std::string& str = info.settings; + + // tokenize settings + std::regex reg("\\s*,\\s*"); + std::sregex_token_iterator iter(str.begin(), str.end(), reg, -1); + std::sregex_token_iterator iter_end; + std::vector pairs(iter, iter_end); + + ORT_ENFORCE(pairs.size() > 0); + for (const auto& pair : pairs) { + auto pos_colon = pair.find(':'); + ORT_ENFORCE(pos_colon != std::string::npos, "Invalid key value pair."); + std::string key = pair.substr(0, pos_colon); + std::string value = pair.substr(pos_colon + 1); + + // trim leading and trailing spaces from key/value + auto trim = [](const std::string& str) -> std::string { + const std::string WHITESPACE = " \n\r\t\f\v"; + size_t start = str.find_first_not_of(WHITESPACE); + if (start == std::string::npos) { + return ""; + } else { + size_t end = str.find_last_not_of(WHITESPACE); + ORT_ENFORCE(end != std::string::npos); + return str.substr(start, end + 1); + } + }; + key = trim(key); + value = trim(value); + + if (valid_keys.count(key) == 0) { + ORT_NOT_IMPLEMENTED("NupharCodeGenSettings: unknown option (", key, ")"); + } + required_options.insert(key); + options.insert(std::make_pair(key, value)); + } + } + +#ifndef GOLDEN_BUILD + // environment variables override existing settings + for (const auto& key : valid_keys) { + std::string env_key; + // env var is always upper case + std::transform(key.begin(), key.end(), std::back_inserter(env_key), (int (*)(int))std::toupper); + if (IsEnvVarDefined(env_key.c_str())) { + // value is case-sensitive + auto value = std::string(GetEnv(env_key.c_str()).get()); + + if (required_options.count(key) > 0 && options.at(key) != value) { + 
LOGS_DEFAULT(CODEGEN_SETTINGS_LOG_LEVEL)
+            << "NupharCodeGenSettings: option(" << key
+            << ") from environment variable is ignored because of existing required option value: "
+            << options.at(key);
+      } else {
+        options[key] = value;
+      }
+    }
+  }
+#endif
+
+  codegen::CodeGenSettings& settings = codegen::CodeGenSettings::Instance();
+  settings.Clear();  // remove previous settings and start from scratch
+
+  settings.InsertOptions(options);
+
+  if (settings.HasOption(codegen::CodeGenSettings::kDumpAllOptions)) {
+    settings.DumpOptions();
+  }
+}
+
+} // namespace nuphar
+} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/common/nuphar_settings.h b/onnxruntime/core/providers/nuphar/common/nuphar_settings.h
new file mode 100644
index 0000000000000..91d2f03a4b583
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/common/nuphar_settings.h
@@ -0,0 +1,48 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+#include "core/codegen/common/settings.h"
+
+namespace onnxruntime {
+
+// forward declaration
+struct NupharExecutionProviderInfo;
+
+namespace nuphar {
+constexpr static const char* kNupharDumpPartition = "nuphar_dump_partition";
+constexpr static const char* kNupharDumpFusedNodes = "nuphar_dump_fused_nodes";
+constexpr static const char* kNupharMatmulExec = "nuphar_matmul_exec";
+constexpr static const char* kNupharCachePath = "nuphar_cache_path";
+constexpr static const char* kNupharCacheVersion = "nuphar_cache_version";
+constexpr static const char* kNupharCacheSoName = "nuphar_cache_so_name";
+constexpr static const char* kNupharCacheModelChecksum = "nuphar_cache_model_checksum";
+constexpr static const char* kNupharCacheForceNoJIT = "nuphar_cache_force_no_jit";
+// force to use IMatMulExternMKL/IMatMul16ExternMKL
+constexpr static const char* kNupharIMatMulForceMkl = "nuphar_imatmul_force_mkl";
+
+constexpr static const char* kNupharMatMulExec_ExternCpu = "extern_cpu";
+
+constexpr static const char* kNupharFastMath = "nuphar_fast_math";  // fast math
+constexpr static const char* kNupharFastMath_Polynormial = "polynormial_math";  // generic polynomial fast math for exp and log
+constexpr static const char* kNupharFastMath_ShortPolynormial = "short_polynormial_math";  // generic shorter polynomial fast math for exp and log
+
+constexpr static const char* kNupharFastActivation = "nuphar_fast_activation";  // fast activation
+constexpr static const char* kNupharActivations_DeepCpu = "deep_cpu_activation";
+
+// Option to control nuphar code generation target (avx2 or avx512)
+constexpr static const char* kNupharCodeGenTarget = "nuphar_codegen_target";
+
+// cache version number (MAJOR.MINOR.PATCH) following https://semver.org/
+// 1. MAJOR version when you make incompatible changes, so that old cache files no longer work,
+// 2. MINOR version when you add functionality in a backwards-compatible manner, and
+// 3. PATCH version when you make backwards-compatible bug fixes.
+// NOTE this version needs to be updated when generated code may change +constexpr static const char* kNupharCacheVersion_Current = "1.0.0"; + +constexpr static const char* kNupharCacheSoName_Default = "jit.so"; + +void CreateNupharCodeGenSettings(const NupharExecutionProviderInfo& info); + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/nuphar_subgraph.h b/onnxruntime/core/providers/nuphar/common/nuphar_subgraph.h new file mode 100644 index 0000000000000..06e105150ad54 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/nuphar_subgraph.h @@ -0,0 +1,106 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once +#include "core/framework/tensor.h" +#include "core/graph/graph.h" +#include "core/graph/graph_viewer.h" + +#include +#include +#include + +namespace onnxruntime { +namespace nuphar { + +using FindInitializerFunc = std::function; + +struct OrtSubgraphAllocationInfo { + std::unordered_map internal_allocator_offset; + std::unordered_map inputs; + std::unordered_map outputs; + int offset_count; + + OrtSubgraphAllocationInfo(const Node& node) : offset_count(0) { + int input_counter = 0; + int output_counter = 0; + + node.ForEachDef( + [&input_counter, &output_counter, this](const NodeArg& def, bool is_input) { + const std::string& def_name = def.Name(); + if (is_input) { + if (inputs.count(def_name) == 0) { + inputs.emplace(def_name, input_counter); + } + input_counter++; + } else { + outputs.emplace(def_name, output_counter++); + } + }); + } + + int CreateOrGetInternalAllocatorOffset(const std::string& def_name) { + if (internal_allocator_offset.count(def_name) > 0) { + return internal_allocator_offset.at(def_name); + } + internal_allocator_offset.insert(std::make_pair(def_name, offset_count)); + return offset_count++; + } +}; + +enum class NodeArgTileAttribute : int { + None = 0, + Forward = 1, + Backward = 2, + NoMerger = 3, +}; + +// NupharSubgraphUnit is a data struct under Ort Subgraph. 
+// It is a customized data struct in nuphar +// to enable concurrent function codegen within a Ort Kernel (which maps to an Ort Subgraph) +struct NupharSubgraphUnit { + NupharSubgraphUnit() { + id_ = counter++; + } + + std::vector nodes; + + // inputs include each input of this NupharSubgraphUnit (input of Partition AND this NupharSubgraphUnit at the same time) + // it also includes initializers + std::vector inputs; + + // outputs include each output of this NupharSubgraphUnit and real_output (output of Partition AND this NupharSubgraphUnit at the same time) + std::vector outputs; + + // initializers include each intializer of this NupharSubgraphUnit + std::map initializers; + + // optional + std::vector input_attrs; + std::vector output_attrs; + + bool IsSingleNode() const { + return nodes.size() == 1; + } + + const std::string& Name() const { + return nodes.front()->Name(); + } + + std::string UniqueId() const { + return std::to_string(id_); + } + + public: + // counter used for subgraph id + // reset outside after cache generated + // to avoid same inference session continue + // increase the counter + thread_local static int64_t counter; + + private: + int64_t id_; +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/nuphar_tvm_utils.cc b/onnxruntime/core/providers/nuphar/common/nuphar_tvm_utils.cc new file mode 100644 index 0000000000000..ec6566b0f8ff4 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/nuphar_tvm_utils.cc @@ -0,0 +1,174 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/common/nuphar_tvm_utils.h" + +#include "core/providers/nuphar/common/nuphar_subgraph.h" +#include "core/providers/nuphar/common/nuphar_settings.h" +#include "core/codegen/common/common.h" +#include "core/codegen/common/target_info.h" + +#include "core/common/logging/logging.h" +#include "core/platform/env.h" +#include "core/providers/common.h" +#include "gsl/gsl_util" +#include +#include +#include +#include +namespace fs = std::experimental::filesystem; + +namespace onnxruntime { +namespace nuphar { + +static bool GetOrCreateTVMModuleCacheDirectory(fs::path& path, bool create) { + codegen::CodeGenSettings& settings = codegen::CodeGenSettings::Instance(); + + if (!settings.HasOption(kNupharCachePath)) + return false; + + std::string version; + if (settings.HasOption(kNupharCacheVersion)) { + version = settings.GetOptionValue(kNupharCacheVersion); + } else { + version = kNupharCacheVersion_Current; + } + + path = settings.GetOptionValue(kNupharCachePath); + if (!create && !fs::is_directory(path)) + return false; + + if (!fs::is_directory(path)) + if (!fs::create_directory(path)) { + throw std::runtime_error("Failed to create directory " + path.string()); + } + + path.append(version); + if (!create && !fs::is_directory(path)) + return false; + + if (!fs::is_directory(path)) + if (!fs::create_directory(path)) { + throw std::runtime_error("Failed to create directory " + path.string()); + } + + return true; +} + +static bool GetCacheSoFilePath(std::string& so_path) { + codegen::CodeGenSettings& settings = codegen::CodeGenSettings::Instance(); + fs::path path; + if (!GetOrCreateTVMModuleCacheDirectory(path, /*create*/ false)) + return false; + + auto so_name = settings.GetOptionValue(kNupharCacheSoName); + path.append(so_name); + if (fs::is_regular_file(path)) { + so_path = path.string(); + return true; + } + return false; +} + +static void* 
GetFuncFromLibrary(const std::string& so_path, const std::string& func_name, bool throw_if_not_found = true) { + void* so_handle; + ORT_ENFORCE(Env::Default().LoadDynamicLibrary(so_path, &so_handle).IsOK()); + void* func = nullptr; + Status s = Env::Default().GetSymbolFromLibrary(so_handle, func_name, &func); + if (throw_if_not_found && !s.IsOK()) + ORT_ENFORCE(false, "Cannot find ", func_name, " in ", so_path); + return func; +} + +static bool disable_caching_due_to_checksum_failure = false; + +static bool VerifyTVMModuleChecksum(const std::string& so_path) { + static std::string last_so_path; + static bool last_checksum_validated = false; + static std::mutex checksum_mutex; + if (last_so_path != so_path) { + std::lock_guard lock(checksum_mutex); + if (last_so_path != so_path) { + disable_caching_due_to_checksum_failure = false; // reset disabled caching for a new file + last_so_path = so_path; + void* f = GetFuncFromLibrary(so_path, "_ORTInternal_GetCheckSum", /*throw_if_not_found*/ false); + if (f) { + typedef void (*GetChecksumFunc)(const char*&, size_t&); + GetChecksumFunc func = reinterpret_cast(f); + const char* model_checksum; + size_t model_checksum_len; + func(model_checksum, + model_checksum_len); + + codegen::CodeGenSettings& setting = codegen::CodeGenSettings::Instance(); + // When checksum is expected by dll/so, user must set environment variable + // NUPHAR_CACHE_MODEL_CHECKSUM from md5 digest of running model. + // User may choose to run with base model or simplified mode and any match + // would be regarded as validated. + // Note that checksum validation here is not designed as a security measurement, + // so checksum compute is not done inside ORT. + last_checksum_validated = + setting.OptionMatches( + kNupharCacheModelChecksum, + std::string(model_checksum, model_checksum_len)); + + if (!last_checksum_validated) { + LOGS_DEFAULT(CODEGEN_SETTINGS_LOG_LEVEL) << "Cache checksum validation failed, using JIT..."; + disable_caching_due_to_checksum_failure = true; + } + } else { + // do not validate checksum if dll didn't require it (usually during debugging) + // TODO: force checksum validation in final release + last_checksum_validated = true; + } + } + } + return last_checksum_validated; +} + +tvm::runtime::PackedFunc LoadTVMPackedFuncFromCache(const std::string& func_name) { + std::string so_path; + if (!GetCacheSoFilePath(so_path)) + return nullptr; + + if (!VerifyTVMModuleChecksum(so_path)) + return nullptr; + + tvm::runtime::Module module = tvm::runtime::Module::LoadFromFile(so_path); + tvm::runtime::PackedFunc func = module.GetFunction(func_name); + if (func == nullptr) { + LOGS_DEFAULT(CODEGEN_SETTINGS_LOG_LEVEL) << "Cannot find " << func_name << " in cache, using JIT..."; + } + return func; +} + +thread_local int saved_tvm_model_cnt = 0; + +void SaveTVMModuleToCache(const std::string& filename, tvm::runtime::Module& module) { + fs::path path; + + if (disable_caching_due_to_checksum_failure) + return; + + static std::mutex save_cache_mutex; + static std::unordered_set existing_files; + std::lock_guard lock(save_cache_mutex); + if (existing_files.count(filename) == 0 && + GetOrCreateTVMModuleCacheDirectory(path, /*create*/ true)) { + existing_files.insert(filename); + path.append("cached_" + std::to_string(saved_tvm_model_cnt++) + ".o"); + if (fs::exists(path)) { + LOGS_DEFAULT(CODEGEN_SETTINGS_LOG_LEVEL) << "Object file " << path << " already exists, skip saving..."; + return; + } + module->SaveToFile(path.string(), "o"); + } +} + +std::string GetPackedFuncName(const 
nuphar::NupharSubgraphUnit& subgraph, const CodeGenTarget& codegen_target) { + // in C, a function does not allow its name starting with a digit. + return NormalizeCppName("_" + subgraph.UniqueId() + " " + codegen_target.GetTargetName()); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/nuphar_tvm_utils.h b/onnxruntime/core/providers/nuphar/common/nuphar_tvm_utils.h new file mode 100644 index 0000000000000..3c26a0c6f61f9 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/nuphar_tvm_utils.h @@ -0,0 +1,26 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once +#include +#include + +#include "core/graph/graph.h" + +namespace onnxruntime { +class CodeGenTarget; //forward + +namespace nuphar { + +struct NupharSubgraphUnit; //forward +// Helper functions to create or load from offline cached dll +// note after saving to obj file, we need to use tvm Python to create dll +// using script at onnxruntime/core/codegen/mti/scripts/create_shared.py +tvm::runtime::PackedFunc +LoadTVMPackedFuncFromCache(const std::string& func_name); +void SaveTVMModuleToCache(const std::string& filename, tvm::runtime::Module& module); + +std::string GetPackedFuncName(const nuphar::NupharSubgraphUnit& subgraph, const CodeGenTarget& codegen_target); + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/utils.cc b/onnxruntime/core/providers/nuphar/common/utils.cc new file mode 100644 index 0000000000000..848e368a71da4 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/utils.cc @@ -0,0 +1,76 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/common/utils.h" + +#include "core/framework/tensorprotoutils.h" +#include "core/providers/common.h" + +namespace onnxruntime { +namespace nuphar { + +bool NodeArgShapeUnknownOnAxis(const NodeArg* def, int64_t axis) { + auto shape = def->Shape(); + axis = HandleNegativeAxis(axis, shape->dim_size()); + ORT_ENFORCE(axis < shape->dim_size()); + auto dim = shape->dim(axis); + return dim.has_dim_param() || (!dim.has_dim_param() && !dim.has_dim_value()); +} + +bool HasUnknownShapeOnAxis(const ConstPointerContainer>& defs, int64_t axis) { + for (const NodeArg* def : defs) { + if (NodeArgShapeUnknownOnAxis(def, axis)) { + return true; + } + } + return false; +} + +bool HasUnknownShapeOnAxes(const NodeArg* def, std::vector& axes) { + for (auto axis : axes) { + if (NodeArgShapeUnknownOnAxis(def, axis)) { + return true; + } + } + return false; +} + +Status GetSliceAxesFromTensorProto(std::vector& axes, + const ONNX_NAMESPACE::TensorProto& axes_tp) { + size_t tp_sz_in_bytes; + ORT_RETURN_IF_ERROR(utils::GetSizeInBytesFromTensorProto<0>(axes_tp, &tp_sz_in_bytes)); + OrtValue ort_value; + std::unique_ptr data(new char[tp_sz_in_bytes]); + +#define UNPACK_TENSOR(T) \ + T* p = reinterpret_cast(data.get()); \ + ORT_RETURN_IF_ERROR(utils::UnpackTensor( \ + axes_tp, \ + axes_tp.raw_data().size() ? 
axes_tp.raw_data().data() : nullptr, \ + axes_tp.raw_data().size(), \ + p, \ + tp_sz_in_bytes / sizeof(T))); \ + std::vector tmp_axes(p, p + tp_sz_in_bytes / sizeof(T)); + + switch (axes_tp.data_type()) { + case ONNX_NAMESPACE::TensorProto_DataType_INT32: { + UNPACK_TENSOR(int32_t); + for (auto axis : tmp_axes) { + axes.push_back(static_cast(axis)); + } + break; + } + case ONNX_NAMESPACE::TensorProto_DataType_INT64: { + UNPACK_TENSOR(int64_t); + axes.insert(axes.end(), tmp_axes.begin(), tmp_axes.end()); + break; + } + default: + ORT_NOT_IMPLEMENTED("Unimplemented type: ", axes_tp.data_type()); + } + + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/common/utils.h b/onnxruntime/core/providers/nuphar/common/utils.h new file mode 100644 index 0000000000000..a2c1a702f606d --- /dev/null +++ b/onnxruntime/core/providers/nuphar/common/utils.h @@ -0,0 +1,23 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once +#include "core/graph/graph.h" + +// forward declaration +struct OrtAllocatorInfo; + +namespace onnxruntime { +namespace nuphar { + +bool NodeArgShapeUnknownOnAxis(const NodeArg* def, int64_t axis); + +bool HasUnknownShapeOnAxis(const ConstPointerContainer>& defs, int64_t axis); + +bool HasUnknownShapeOnAxes(const NodeArg* def, std::vector& axes); + +Status GetSliceAxesFromTensorProto(std::vector& axes, + const ONNX_NAMESPACE::TensorProto& axes_tp); + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/codegen_manager.cc b/onnxruntime/core/providers/nuphar/compiler/codegen_manager.cc new file mode 100644 index 0000000000000..981e14d2bec5a --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/codegen_manager.cc @@ -0,0 +1,233 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/codegen_manager.h" + +#include "core/codegen/common/op_macro.h" +#include "core/codegen/passes/op_ir_creator/all_ops.h" +#include "core/codegen/passes/scheduler/all_schedules.h" +#include "core/codegen/passes/weight_layout/transpose_2d.h" +#include "core/codegen/passes/weight_layout/vertical_stripes_2d.h" +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" +#include "core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h" + +namespace onnxruntime { +namespace codegen { +// explicit instantiation +template class RegistryBase; +} // namespace codegen + +namespace nuphar { + +//// All Creator instance registration +// 1. Create Customized Op IR creator instances + +// BEGIN: NupharTVM X86 IR creator classes + +#define ADD_OP_ITEM(name) \ + op_ir_registry->Register(std::move(std::make_unique())); + +#define REDUCE_V_OP(name) ADD_OP_ITEM(name) +#define UNARY_OP(name) ADD_OP_ITEM(name) + +static void RegisterAllNupharX86OpIRCreators(tvm_codegen::OpIRRegistry* op_ir_registry) { + LIST_ALL_X86_OPS() +} + +#undef ADD_OP_ITEM +#undef REDUCE_V_OP +#undef UNARY_OP + +// END: NupharTVM X86 IR creator classes + +// 2. 
Create Scheduler instances +// BEGIN: Nuphar Scheduler classes + +static void RegisterAllNupharSchedulers(tvm_codegen::TVMScheduleRegistry* sched_registry) { + // Add Generic TVM Rule schedules + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + + // Add Generic OpType schedules + sched_registry->Register( + std::move(std::make_unique())); + + // Add NupharX86 TVM Rule schedules + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + + // Add NupharX86 Tensorization schedules + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + + // Add NupharX86 OpType schedules + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + + // Add NupharX86 use count schedules + sched_registry->Register( + std::move(std::make_unique())); + sched_registry->Register( + std::move(std::make_unique())); + + // Add NupharX86 partial result schedules + sched_registry->Register( + std::move(std::make_unique())); +} + +// END: Nuphar Scheduler classes + +// 3. Create Weight layout instances +// BEGIN: Nuphar Weight Layouts classes +static void RegisterAllNupharWeightLayouts(tvm_codegen::WeightLayoutRegistry* layout_registry) { + layout_registry->Register( + std::move(std::make_unique(ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_FLOAT, 8))); + layout_registry->Register( + std::move(std::make_unique(ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_FLOAT))); + layout_registry->Register( + std::move(std::make_unique(ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_INT8))); + layout_registry->Register( + std::move(std::make_unique(ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_UINT8))); + layout_registry->Register( + std::move(std::make_unique(ONNX_NAMESPACE::TensorProto_DataType::TensorProto_DataType_INT16))); +} + +// END: Nuphar Weight Layouts classes + +//// All Plugins for Nuphar provider +// 1. 
Plugin IR creator classes + +// BEGIN: Nuphar TVM X86 IR creator classes +#define ADD_OP_ITEM(name) \ + dispatcher->Register(#name, registry->Get(NUPHAR_TVM_X86_OP_IR_CREATOR_STRING(name))); + +#define REDUCE_V_OP(name) ADD_OP_ITEM(name) +#define UNARY_OP(name) ADD_OP_ITEM(name) + +static void RegisterNupharX86Dispatcher(const std::shared_ptr& builder, + const tvm_codegen::OpIRRegistry* registry) { + auto dispatcher = std::make_unique("OptypeNupharTVMX86Creators"); + LIST_ALL_X86_OPS() + builder->InsertDispatcher(std::move(dispatcher)); +} + +#undef ADD_OP_ITEM +#undef REDUCE_V_OP +#undef UNARY_OP +// END: Nuphar TVM X86 IR creator classes + +// 2 Plugin Scheduler classes + +// BEGIN: TVM rule Scheduler +static void RegisterNupharX86TVMRuleSchedulers(const std::shared_ptr& builder, + const tvm_codegen::TVMScheduleRegistry* registry) { + auto dispatcher = std::make_unique("NupharX86TVMRuleSchedulers"); + + // Register a scheduler for TVM External Tensor + dispatcher->Register(tvm_codegen::GetTVMOpRule(tvm_codegen::TVMOpRuleType::Extern), + registry->Get(TVM_SCHEDULER_STRING(Extern, NupharX86TVMRule))); + // Register a scheduler for TVM Reduce Tensor + dispatcher->Register(tvm_codegen::GetTVMOpRule(tvm_codegen::TVMOpRuleType::ComputeReduce), + registry->Get(TVM_SCHEDULER_STRING(Reduce, NupharX86TVMRule))); + + builder->InsertDispatcher(std::move(dispatcher)); +} +// END: TVM rule Scheduler + +// BEGIN: ORT OpType Scheduler +static void RegisterNupharX86OrtOpTypeSchedulers(const std::shared_ptr& builder, + const tvm_codegen::TVMScheduleRegistry* registry) { + auto dispatcher = std::make_unique("NupharX86OrtOpTypeSchedulers"); + + // Register a scheduler for Ort Softmax OpType + dispatcher->Register("Softmax", + registry->Get(TVM_SCHEDULER_STRING(Softmax, NupharX86OrtOpType))); + + dispatcher->Register("Split", + registry->Get(TVM_SCHEDULER_STRING(Split, NupharX86OrtOpType))); + + builder->InsertDispatcher(std::move(dispatcher)); +} +// END: ORT OpType Scheduler + +// BEGIN: Reuse Count Analysis Scheduler +static void RegisterNupharX86UseCountSchedulers(const std::shared_ptr& builder, + const tvm_codegen::TVMScheduleRegistry* registry) { + auto dispatcher = std::make_unique("NupharX86UseCountSchedulers"); + + // Register a scheduler for Reuse count > 1 + dispatcher->Register("True", + registry->Get(TVM_SCHEDULER_STRING(True, NupharX86UseCount))); + + // Register a scheduler for Reuse count <= 1 + dispatcher->Register("False", + registry->Get(TVM_SCHEDULER_STRING(False, NupharX86UseCount))); + + builder->InsertDispatcher(std::move(dispatcher)); +} +// END: Reuse Count Analysis Scheduler + +// BEGIN: Partial Result Scheduler +static void RegisterNupharX86PartialResultSchedulers(const std::shared_ptr& builder, + const tvm_codegen::TVMScheduleRegistry* registry) { + auto dispatcher = std::make_unique("NupharX86PartialResultSchedulers"); + dispatcher->Register("True", + registry->Get(TVM_SCHEDULER_STRING(True, NupharX86PartialResult))); + + builder->InsertDispatcher(std::move(dispatcher)); +} +// END: Partial Result Scheduler + +TVMCodeGenManager::TVMCodeGenManager() { + op_ir_registry_ = std::make_unique(); + layout_registry_ = std::make_unique(); + schedule_registry_ = std::make_unique(); +} + +void TVMCodeGenManager::Initialization() { + RegisterAllNupharX86OpIRCreators(op_ir_registry_.get()); + RegisterAllGenericOpIRCreators(op_ir_registry_.get()); + + RegisterAllNupharWeightLayouts(layout_registry_.get()); + RegisterAllNupharSchedulers(schedule_registry_.get()); +} + +// TODO Add isa support 
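+// SetCodeGenHandle wires the registries owned by this manager into the given
+// NupharCodeGenHandle: it shares the weight layout registry and builds the
+// op IR and schedule dispatcher sets that later compiler passes look up.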
+void TVMCodeGenManager::SetCodeGenHandle(NupharCodeGenHandle* handle) { + // layout registry + handle->layout_registry = layout_registry_.get(); + + // Op IR creators + handle->op_ir_builder = + std::make_shared("Nuphar_Op_IR_Builder"); + RegisterNupharX86Dispatcher(handle->op_ir_builder, op_ir_registry_.get()); + RegisterGenericOrtOpTypeDispatcher(handle->op_ir_builder, op_ir_registry_.get()); + + // Schedulers + handle->schedule_builder = + std::make_shared("Nuphar_Schedule_Builder"); + + RegisterNupharX86TVMRuleSchedulers(handle->schedule_builder, schedule_registry_.get()); + RegisterNupharX86OrtOpTypeSchedulers(handle->schedule_builder, schedule_registry_.get()); + RegisterNupharX86UseCountSchedulers(handle->schedule_builder, schedule_registry_.get()); + RegisterNupharX86PartialResultSchedulers(handle->schedule_builder, schedule_registry_.get()); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/codegen_manager.h b/onnxruntime/core/providers/nuphar/compiler/codegen_manager.h new file mode 100644 index 0000000000000..75a88002fe9fb --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/codegen_manager.h @@ -0,0 +1,43 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once +#include "core/codegen/passes/op_ir_creator/tvm_op_creator.h" +#include "core/codegen/passes/op_ir_creator/tvm_ir_builder.h" +#include "core/codegen/passes/scheduler/tvm_schedule_builder.h" +#include "core/codegen/passes/weight_layout/weight_layout.h" +#include "core/providers/nuphar/compiler/nuphar_handle.h" + +namespace onnxruntime { +namespace nuphar { + +// TVMCodeGenManager contains all registries +// including 1) TVM IR builder registry +// 2) Weight layout transformer registry +// 3) TVM scheduler registry, etc. +// These registries include all applicable passes for specific arch +// AND might also include non-applicable passes, like passes for another arch. + +// TVMCodeGenManager keeps the ownerships of all registries, passes, +// and planners. + +// TVMCodeGenManager also sets NupharCodeGenHandle for a specific arch. + +class TVMCodeGenManager { + public: + TVMCodeGenManager(); + + // TODO add a list of condition to handle dynamic registration + void Initialization(); + + // TODO: add target as an input + void SetCodeGenHandle(NupharCodeGenHandle* handle); + + private: + std::unique_ptr op_ir_registry_; + std::unique_ptr layout_registry_; + std::unique_ptr schedule_registry_; +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/func_info.cc b/onnxruntime/core/providers/nuphar/compiler/func_info.cc new file mode 100644 index 0000000000000..711a396a8de87 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/func_info.cc @@ -0,0 +1,562 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
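+
+// This file populates NupharFuncInfo from a compiled subgraph: it records the
+// packed TVM function, maps ORT inputs/outputs to allocator slots, collects
+// initializer tensors, and captures dtype/shape metadata (including symbolic
+// dimensions). FillScanExecInfo additionally extracts Scan control-flow
+// attributes (directions and axes) into a ScanExecInfo.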
+ +#include "core/providers/nuphar/compiler/func_info.h" + +#include "core/providers/nuphar/runtime/control_flow/scan_exec_ctx.h" +#include "core/framework/op_kernel.h" +#include "core/framework/tensorprotoutils.h" +#include "core/codegen/common/common.h" +#include "core/providers/nuphar/common/analysis/subgraph_codegen_stats.h" +#include + +// from onnxruntime_typeinf.cc, in global namespace +const onnxruntime::DataTypeImpl* ElementTypeFromProto(int type); + +namespace onnxruntime { +namespace nuphar { + +static void FillBasicFuncInfo(NupharFuncInfo* func_info, + nuphar::OrtSubgraphAllocationInfo* partition_info, + const nuphar::NupharSubgraphUnit& subgraph, + const NupharCodeGenCtx& codegen_ctx, + tvm::Target tvm_target, + tvm::runtime::PackedFunc packed_func, + const std::string& name) { + ORT_ENFORCE(nullptr != func_info); + ORT_ENFORCE(nullptr != partition_info); + + func_info->name = name; + func_info->packed_func = packed_func; + func_info->device_type = static_cast(tvm_target->device_type); + + int tvm_input_idx = 0; + int def_index = 0; + // Handle inputs + func_info->ort_input_count = subgraph.inputs.size(); + // Assign Input meta + for (auto& def : subgraph.inputs) { + // fill in allocator info + NupharFuncInfo::AllocatorMeta input_allocator; + if (partition_info->inputs.count(def->Name()) > 0) { + // if an input is from external + input_allocator.index = partition_info->inputs.at(def->Name()); + input_allocator.is_external = true; + func_info->ort_input_allocator_is_collided_output.push_back(false); + } else if (partition_info->outputs.count(def->Name()) > 0) { + // if an input is from a previous real output + input_allocator.index = partition_info->outputs.at(def->Name()); + input_allocator.is_external = true; // a real output is always from external + func_info->ort_input_allocator_is_collided_output.push_back(true); + } else { + // else, an input is from an internal allocator + input_allocator.index = partition_info->CreateOrGetInternalAllocatorOffset(def->Name()); + input_allocator.is_external = false; + func_info->ort_input_allocator_is_collided_output.push_back(false); + } + + func_info->ort_input_allocators.push_back(input_allocator); + + if (codegen_ctx.IsInitializer(def->Name())) { + ++def_index; + continue; // skip initializers + } + + // fill in func args + NupharFuncInfo::FuncArgMeta input_meta; + input_meta.dtype = ElementTypeFromProto(def->TypeAsProto()->tensor_type().elem_type()); + input_meta.ort_arg_index = def_index; + + // fill in shape info and symobolic info + for (int dim = 0; dim < gsl::narrow(ShapeRank(def)); ++dim) { + if (ShapeHasSymbol(def, dim)) { + input_meta.inferred_shape.push_back(Dimension_Unknown); + input_meta.dim_symbols.push_back(std::make_pair(gsl::narrow(dim), ShapeSymbol(def, dim))); + } else if (ShapeHasValue(def, dim)) { + input_meta.inferred_shape.push_back(ShapeValue(def, dim)); + } else { + input_meta.inferred_shape.push_back(Dimension_Unknown); + } + } + + func_info->input_metas.push_back(input_meta); + + ++tvm_input_idx; + ++def_index; + } + + // Handle initializers + // Initializer meta + std::vector& intializers = func_info->intializers; + // Assign Initializer meta + for (const auto& item : codegen_ctx.GetWeightLayoutMap()) { + const WeightLayoutCodegenInfo* layout_info = item.second.get(); + bool is_marshalled = layout_info->is_marshalled; + const Tensor* t = + is_marshalled ? 
layout_info->marshalled_initializer + : codegen_ctx.GetOrtInitializerTensor(item.first); + + intializers.push_back(t); + ++tvm_input_idx; + } + + // set input_count = the number of inputs + the number of initializers + func_info->func_input_count = gsl::narrow(tvm_input_idx); + + // Handle Outputs + + func_info->ort_output_count = subgraph.outputs.size(); + // Assign Output meta + int tvm_output_idx = 0; + std::unordered_map visited_output_def_indices; + def_index = 0; + for (auto& def : subgraph.outputs) { + // fill in allocator info + NupharFuncInfo::AllocatorMeta output_allocator; + if (partition_info->outputs.count(def->Name()) > 0) { + // if an output is from external + output_allocator.index = partition_info->outputs.at(def->Name()); + output_allocator.is_external = true; + } else { + // else, an output is from an internal allocator + output_allocator.index = partition_info->CreateOrGetInternalAllocatorOffset(def->Name()); + output_allocator.is_external = false; + } + + func_info->ort_output_allocators.push_back(output_allocator); + + // check output alias + const NodeArg* source_def = Promote(codegen_ctx.GetGraphStats()) + ->SourceDefOfOutputAlias(def); + + if (nullptr != source_def) { + // if def is an alias + auto key = GetKey(source_def); + if (visited_output_def_indices.count(key) != 0) { + // source_def has visisted ==> def is a duplicated output + // record the pair (dst of ort arg index, src of tvm func index) + func_info->ort_aliased_output_to_func_indices.emplace_back(def_index, + func_info->func_input_count + + visited_output_def_indices[key]); + + ++def_index; + continue; + } + // update visited_output_def_indices + visited_output_def_indices.insert(std::make_pair(key, gsl::narrow_cast(tvm_output_idx))); + } else { + auto key = GetKey(def); + if (visited_output_def_indices.count(key) != 0) { + // def has visisted ==> def is a duplicated output + // record the pair (dst of ort arg index, src of tvm func index) + func_info->ort_aliased_output_to_func_indices.emplace_back(def_index, + func_info->func_input_count + + visited_output_def_indices[key]); + + ++def_index; + continue; + } + visited_output_def_indices.insert(std::make_pair(key, gsl::narrow_cast(tvm_output_idx))); + } + + NupharFuncInfo::FuncArgMeta output_meta; + output_meta.dtype = ElementTypeFromProto(def->TypeAsProto()->tensor_type().elem_type()); + output_meta.ort_arg_index = def_index; + + // fill in shape info and symobolic info + for (int dim = 0; dim < gsl::narrow(ShapeRank(def)); ++dim) { + if (ShapeHasSymbol(def, dim)) { + auto p = std::make_pair(gsl::narrow(dim), ShapeSymbol(def, dim)); + output_meta.dim_symbols.push_back(p); + output_meta.inferred_shape.push_back(Dimension_Unknown); + } else if (ShapeHasValue(def, dim)) { + output_meta.inferred_shape.push_back(ShapeValue(def, dim)); + } else { + output_meta.inferred_shape.push_back(Dimension_Unknown); + } + } + + func_info->output_metas.push_back(output_meta); + ++def_index; + ++tvm_output_idx; + } + + // set output_count as the real output count + func_info->func_output_count = gsl::narrow_cast(tvm_output_idx); + + // set tvm type_codes + func_info->type_codes.resize(func_info->func_input_count + func_info->func_output_count, TVMTypeCode::kNDArrayContainer); +} + +static void FillScanExecInfo(NupharFuncInfo* func_info, + nuphar::OrtSubgraphAllocationInfo* partition_info, + const Node& node, + const NupharCodeGenCtx& codegen_ctx, + tvm::Target tvm_target, + tvm::runtime::PackedFunc packed_func, + const std::string& name) { + ORT_ENFORCE(nullptr != 
func_info); + ORT_ENFORCE(nullptr != partition_info); + + // create Scan control-flow info + auto scan_info = std::make_unique(); + + int64_t num_state_variables; + int64_t num_scan_inputs; + int64_t num_scan_outputs; + + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper attrs(&ctx); + + // extract num_scan_inputs + bool attr_is_ok = attrs.GetAttr("num_scan_inputs", &num_scan_inputs).IsOK(); + ORT_UNUSED_PARAMETER(attr_is_ok); + ORT_ENFORCE_DEBUG(attr_is_ok); + + auto subgraph = GetSubgraph(node); + ORT_ENFORCE(subgraph != nullptr); + size_t num_variadic_inputs = subgraph->GetInputs().size(); + size_t num_variadic_outputs = subgraph->GetOutputs().size(); + + num_state_variables = gsl::narrow(num_variadic_inputs) - num_scan_inputs; + num_scan_outputs = gsl::narrow(num_variadic_outputs) - num_state_variables; + + // Set ScanExecInfo's parameter count meta + scan_info->num_state_variables = num_state_variables; + scan_info->num_scan_inputs = num_scan_inputs; + scan_info->num_scan_outputs = num_scan_outputs; + scan_info->num_scan_implicit_inputs = gsl::narrow_cast(node.ImplicitInputDefs().size()); + + // ScanExecInfo's control flow Meta + std::vector& scan_input_forwards = scan_info->scan_input_forwards; + std::vector& scan_output_forwards = scan_info->scan_output_forwards; + std::vector& scan_input_axes = scan_info->scan_input_axes; + std::vector& scan_output_axes = scan_info->scan_output_axes; + + scan_input_forwards.resize(num_scan_inputs); + scan_output_forwards.resize(num_scan_outputs); + + // extract directions and axes + std::vector scan_input_directions; + std::vector scan_output_directions; + + // scan_input_directions + if (attrs.GetAttrs("scan_input_directions", scan_input_directions).IsOK()) { + ORT_ENFORCE(gsl::narrow_cast(scan_input_directions.size()) == num_scan_inputs, + "Number of entries in 'scan_input_directions ' was ", scan_input_directions.size(), + ". Must match 'num_scan_inputs' of ", num_scan_inputs); + ORT_ENFORCE(std::all_of(scan_input_directions.cbegin(), scan_input_directions.cend(), + [](int64_t i) { return i == 0 || + i == 1; }), + "Invalid values in 'scan_input_directions'. 0 == forward. 1 == reverse."); + } else { + // default to forward + scan_input_directions = std::vector(num_scan_inputs, 0); + } + + // scan_input_forwards + for (size_t i = 0; i < gsl::narrow(num_scan_inputs); ++i) { + scan_input_forwards[i] = scan_input_directions[i] == 0; + } + + // scan_output_directions + if (attrs.GetAttrs("scan_output_directions", scan_output_directions).IsOK()) { + ORT_ENFORCE(gsl::narrow_cast(scan_output_directions.size()) == num_scan_outputs, + "Number of entries in 'scan_output_directions ' was ", scan_output_directions.size(), + ". Must match 'num_scan_outputs' of ", num_scan_outputs); + ORT_ENFORCE(std::all_of(scan_output_directions.cbegin(), scan_output_directions.cend(), + [](int64_t i) { return i == 0 || + i == 1; }), + "Invalid values in 'scan_output_directions'. 0 == forward. 1 == reverse."); + } else { + // default to forward + scan_output_directions = std::vector(num_scan_outputs, 0); + } + + // scan_output_forwards + for (size_t i = 0; i < gsl::narrow(num_scan_outputs); ++i) { + scan_output_forwards[i] = scan_output_directions[i] == 0; + } + + // scan_input_axes + if (attrs.GetAttrs("scan_input_axes", scan_input_axes).IsOK()) { + ORT_ENFORCE(gsl::narrow_cast(scan_input_axes.size()) == num_scan_inputs, + "Number of entries in 'scan_input_axes ' was ", scan_input_axes.size(), + ". 
Must match 'num_scan_inputs' of ", num_scan_inputs); + + } else { + // default to axis 0 + scan_input_axes = std::vector(num_scan_inputs, 0); + } + + // scan_output_axes + if (attrs.GetAttrs("scan_output_axes", scan_output_axes).IsOK()) { + ORT_ENFORCE(gsl::narrow_cast(scan_output_axes.size()) == num_scan_outputs, + "Number of entries in 'scan_output_axes ' was ", scan_output_axes.size(), + ". Must match 'num_scan_outputs' of ", num_scan_outputs); + + } else { + // default to axis 0 + scan_output_axes = std::vector(num_scan_outputs, 0); + } + + // handle NupharFuncInfo + func_info->name = name; + func_info->packed_func = packed_func; + func_info->device_type = static_cast(tvm_target->device_type); + + int tvm_input_idx = 0; + // Handle state inputs & inputs + func_info->ort_input_count = num_variadic_inputs; + + // assign state inputs & inputs + for (size_t ort_input_idx = 0; ort_input_idx < num_variadic_inputs; ++ort_input_idx) { + // fill in allocator info + NupharFuncInfo::AllocatorMeta input_allocator; + const NodeArg* main_graph_def = node.InputDefs()[ort_input_idx]; + ORT_ENFORCE(nullptr != main_graph_def); + if (partition_info->inputs.count(main_graph_def->Name()) > 0) { + // if an input is from external + input_allocator.index = partition_info->inputs.at(main_graph_def->Name()); + input_allocator.is_external = true; + func_info->ort_input_allocator_is_collided_output.push_back(false); + } else if (partition_info->outputs.count(main_graph_def->Name()) > 0) { + // if an input is from a previous real output + input_allocator.index = partition_info->outputs.at(main_graph_def->Name()); + input_allocator.is_external = true; // a real output is always from external + func_info->ort_input_allocator_is_collided_output.push_back(true); + } else { + // else, an input is from an internal allocator + input_allocator.index = partition_info->CreateOrGetInternalAllocatorOffset(main_graph_def->Name()); + input_allocator.is_external = false; + func_info->ort_input_allocator_is_collided_output.push_back(false); + } + + func_info->ort_input_allocators.push_back(input_allocator); + + const NodeArg* def = subgraph->GetInputs()[ort_input_idx]; + ORT_ENFORCE(nullptr != def); + + if (ort_input_idx >= gsl::narrow(num_state_variables)) { + // initializer should only happen in real inputs, not in state inputs + if (codegen_ctx.IsInitializer(def->Name())) { + continue; // skip initializers + } + } + + NupharFuncInfo::FuncArgMeta input_meta; + input_meta.dtype = ElementTypeFromProto(def->TypeAsProto()->tensor_type().elem_type()); + input_meta.ort_arg_index = gsl::narrow_cast(ort_input_idx); + + // fill in shape info and symobolic info + for (int dim = 0; dim < gsl::narrow(ShapeRank(def)); ++dim) { + if (ShapeHasSymbol(def, dim)) { + auto p = std::make_pair(gsl::narrow(dim), ShapeSymbol(def, dim)); + input_meta.dim_symbols.push_back(p); + input_meta.inferred_shape.push_back(Dimension_Unknown); + } else if (ShapeHasValue(def, dim)) { + input_meta.inferred_shape.push_back(ShapeValue(def, dim)); + } else { + input_meta.inferred_shape.push_back(Dimension_Unknown); + } + } + + func_info->input_metas.push_back(input_meta); + ++tvm_input_idx; + } + + size_t ort_input_idx = num_variadic_inputs; + // Handle implicit inputs + for (const NodeArg* def : node.ImplicitInputDefs()) { + NupharFuncInfo::AllocatorMeta input_allocator; + if (partition_info->inputs.count(def->Name()) > 0) { + // if an input is from external + input_allocator.index = partition_info->inputs.at(def->Name()); + input_allocator.is_external = true; + 
func_info->ort_input_allocator_is_collided_output.push_back(false); + } else if (partition_info->outputs.count(def->Name()) > 0) { + // if an input is from a previous real output + input_allocator.index = partition_info->outputs.at(def->Name()); + input_allocator.is_external = true; + func_info->ort_input_allocator_is_collided_output.push_back(true); + } else { + // else, an input is from an internal allocator + input_allocator.index = partition_info->CreateOrGetInternalAllocatorOffset(def->Name()); + input_allocator.is_external = false; + func_info->ort_input_allocator_is_collided_output.push_back(false); + } + + func_info->ort_input_allocators.push_back(input_allocator); + + // skip initializers + if (codegen_ctx.IsInitializer(def->Name())) { + ++ort_input_idx; + continue; // skip initializers + } + + NupharFuncInfo::FuncArgMeta input_meta; + input_meta.dtype = ElementTypeFromProto(def->TypeAsProto()->tensor_type().elem_type()); + input_meta.ort_arg_index = gsl::narrow_cast(ort_input_idx); + + std::vector> symbols; + for (int dim = 0; dim < gsl::narrow(ShapeRank(def)); ++dim) { + if (ShapeHasSymbol(def, dim)) { + auto p = std::make_pair(gsl::narrow(dim), ShapeSymbol(def, dim)); + input_meta.dim_symbols.push_back(p); + input_meta.inferred_shape.push_back(Dimension_Unknown); + } else if (ShapeHasValue(def, dim)) { + input_meta.inferred_shape.push_back(ShapeValue(def, dim)); + } else { + input_meta.inferred_shape.push_back(Dimension_Unknown); + } + } + func_info->input_metas.push_back(input_meta); + ++tvm_input_idx; + ++ort_input_idx; + } + + // Handle initializers + // Initializer meta + std::vector& intializers = func_info->intializers; + + // Assign Initializer meta + for (const auto& item : codegen_ctx.GetWeightLayoutMap()) { + const WeightLayoutCodegenInfo* layout_info = item.second.get(); + + bool is_marshalled = layout_info->is_marshalled; + const Tensor* t = + is_marshalled ? layout_info->marshalled_initializer + : codegen_ctx.GetOrtInitializerTensor(item.first); + + intializers.push_back(t); + ++tvm_input_idx; + } + + // set input_count = the number of inputs (real inputs + state inputs) + the number of initializers + func_info->func_input_count = gsl::narrow(tvm_input_idx); + + // Handle State Outputs and Outputs + func_info->ort_output_count = num_variadic_outputs; + + // Since in Scan, we only allow state using output's memory during Execution, not the other around. + // When one input and one state are aliased, the kept one can only be the input. + // Therefore, we do alias detection starting from inputs first. + std::unordered_map visited_output_def_indices; + for (size_t ort_output_idx = gsl::narrow(num_state_variables); ort_output_idx < num_variadic_outputs; ++ort_output_idx) { + const NodeArg* def = subgraph->GetOutputs()[ort_output_idx]; + ORT_ENFORCE(nullptr != def); + const NodeArg* source_def = Promote(codegen_ctx.GetGraphStats()) + ->SourceDefOfOutputAlias(def); + if (nullptr != source_def) { + auto key = GetKey(source_def); + ORT_ENFORCE(visited_output_def_indices.count(key) == 0, + "Scan has alias btw two inputs. 
Nuphar only support aliasing btw state and output in Scan"); + visited_output_def_indices.insert(std::make_pair(key, gsl::narrow(ort_output_idx))); + } else { + auto key = GetKey(def); + visited_output_def_indices.insert(std::make_pair(key, gsl::narrow(ort_output_idx))); + } + } + + // assign state outputs and outputs + size_t tvm_output_idx = 0; + std::unordered_map visited_output_state_func_indices; + for (size_t ort_output_idx = 0; ort_output_idx < num_variadic_outputs; ++ort_output_idx) { + // fill in allocator info + NupharFuncInfo::AllocatorMeta output_allocator; + const NodeArg* main_graph_def = node.OutputDefs()[ort_output_idx]; + ORT_ENFORCE(nullptr != main_graph_def); + if (partition_info->outputs.count(main_graph_def->Name()) > 0) { + output_allocator.index = partition_info->outputs.at(main_graph_def->Name()); + output_allocator.is_external = true; + } else { + output_allocator.index = partition_info->CreateOrGetInternalAllocatorOffset(main_graph_def->Name()); + output_allocator.is_external = false; + } + func_info->ort_output_allocators.push_back(output_allocator); + + // perform alias analysis + const NodeArg* def = subgraph->GetOutputs()[ort_output_idx]; + ORT_ENFORCE(nullptr != def); + const NodeArg* source_def = Promote(codegen_ctx.GetGraphStats()) + ->SourceDefOfOutputAlias(def); + + // Determine alias btw output and state output + auto key = source_def != nullptr ? GetKey(source_def) : GetKey(def); + + int ort_arg_index = gsl::narrow_cast(ort_output_idx); + if (ort_output_idx < gsl::narrow(num_state_variables)) { + // if ort_output_idx is a state output + if (visited_output_def_indices.count(key) != 0) { + // If state output is an alias + // record i_output for the lookup of the aliased output later + visited_output_state_func_indices.insert(std::make_pair(key, gsl::narrow(func_info->func_input_count + tvm_output_idx))); + + // also record ort_aliased_output_to_func_indices + func_info->ort_aliased_output_to_func_indices.push_back(std::make_pair(gsl::narrow(ort_output_idx), + func_info->func_input_count + tvm_output_idx)); + + scan_info->state_to_output_indices.push_back(visited_output_def_indices[key] - gsl::narrow_cast(num_state_variables)); + // override ort_arg_index using the output index + ort_arg_index = visited_output_def_indices[key]; + } else { + // the state output not aliased(no scan output shares with it) + scan_info->state_to_output_indices.push_back(NupharFuncInfo::Index_NonAliasedOutput); + } + } else { + // if ort_output_idx is an output + if (visited_output_state_func_indices.count(key) != 0) { + if (source_def != nullptr) { + // skip a duplicated output, since it was counted in the duplicated state output previously + continue; + } + } + } + + NupharFuncInfo::FuncArgMeta output_meta; + output_meta.dtype = ElementTypeFromProto(def->TypeAsProto()->tensor_type().elem_type()); + output_meta.ort_arg_index = ort_arg_index; + + // shape and symbols + for (int dim = 0; dim < gsl::narrow(ShapeRank(def)); ++dim) { + if (ShapeHasSymbol(def, dim)) { + auto p = std::make_pair(gsl::narrow(dim), ShapeSymbol(def, dim)); + output_meta.dim_symbols.push_back(p); + output_meta.inferred_shape.push_back(Dimension_Unknown); + } else if (ShapeHasValue(def, dim)) { + output_meta.inferred_shape.push_back(ShapeValue(def, dim)); + } else { + output_meta.inferred_shape.push_back(Dimension_Unknown); + } + } + func_info->output_metas.push_back(output_meta); + ++tvm_output_idx; + } + + // set output_count as the real output count + func_info->func_output_count = tvm_output_idx; + + // 
set tvm type_codes
+  func_info->type_codes.resize(func_info->func_input_count + func_info->func_output_count, TVMTypeCode::kNDArrayContainer);
+
+  // set control-flow info
+  func_info->cf_info = std::move(scan_info);
+}
+
+void FillNupharFuncInfo(NupharFuncInfo* func_info,
+                        nuphar::OrtSubgraphAllocationInfo* partition_info,
+                        const nuphar::NupharSubgraphUnit& subgraph,
+                        const NupharCodeGenCtx& codegen_ctx,
+                        tvm::Target tvm_target,
+                        tvm::runtime::PackedFunc packed_func,
+                        const std::string& name) {
+  if (subgraph.nodes.front()->OpType() == "Scan") {
+    FillScanExecInfo(func_info, partition_info, *subgraph.nodes.front(), codegen_ctx, tvm_target, packed_func, name);
+    return;
+  }
+
+  FillBasicFuncInfo(func_info, partition_info, subgraph, codegen_ctx, tvm_target, packed_func, name);
+}
+
+}  // namespace nuphar
+}  // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/compiler/func_info.h b/onnxruntime/core/providers/nuphar/compiler/func_info.h
new file mode 100644
index 0000000000000..6b2780e6657de
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/func_info.h
@@ -0,0 +1,122 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+#include "core/codegen/common/common.h"
+#include "core/common/common.h"
+#include "core/framework/data_types.h"
+#include "core/framework/tensor.h"
+#include "core/graph/graph.h"
+#include "core/providers/nuphar/common/nuphar_subgraph.h"
+#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h"
+
+#include
+#include
+#include
+
+namespace onnxruntime {
+namespace nuphar {
+
+enum class ControlFlowInfoType : unsigned int {
+  Scan = 1,
+};
+
+// abstract class for control flow info
+struct ControlFlowInfo {
+ private:
+  ControlFlowInfoType type;
+
+ public:
+  ControlFlowInfo(ControlFlowInfoType _type) : type(_type) {}
+
+  virtual ~ControlFlowInfo() = default;
+
+  DYN_PROMOTE_BASE(ControlFlowInfo, ControlFlowInfoType, type)
+};
+
+// Add Promote support for ControlFlowInfo
+// Note here we need to use DYN_PROMOTE instead of DYNAMIC_PROMOTE
+// since ControlFlowInfo is on a critical path
+DYN_PROMOTE(ControlFlowInfo)
+
+// NupharFuncInfo holds a tvm::runtime::PackedFunc (the generated function)
+// and the corresponding static metadata needed to call it, such as argument counts and offsets.
+// Note NupharFuncInfo includes ONLY parameters from codegen
+// but DOES NOT include any runtime information.
+
+// The owner of NupharFuncInfo is currently NupharKernelState.
+// NupharFuncInfo is created in NupharCompiler and is consumed by ExecBlock.
+// Note all vectors are bounded by the number of the PackedFunc's parameters
+// (meaning vector.size() == number of the PackedFunc's parameters),
+// except those prefixed with ort, which are bounded by the number of the ORT op's parameters.
+// A -1 may be inserted as a bubble to keep positions and sizes consistent for later lookup.
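+//
+// Illustrative example (see FillBasicFuncInfo in func_info.cc): a fused subgraph
+// with ORT inputs [X, W], where W is an initializer, and a single ORT output Y
+// yields PackedFunc arguments [X, W (possibly layout-marshalled), Y]. In that case
+// func_input_count == 2 (one real input plus one initializer), func_output_count == 1,
+// and each FuncArgMeta::ort_arg_index records the argument's position among the
+// ORT op's inputs or outputs.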
+struct NupharFuncInfo {
+  // special value for *_func_indices
+  enum : int {
+    Index_NonAliasedOutput = -1,
+  };
+
+  // PackedFunc name
+  std::string name;
+
+  // PackedFunc
+  tvm::runtime::PackedFunc packed_func;
+
+  // TVM DLDevice
+  DLDeviceType device_type;
+
+  struct FuncArgMeta {
+    MLDataType dtype;
+    // shapes with dimensions statically known or inferred at compile time
+    // a symbolic dim will have Dimension_Unknown and be patched at runtime
+    std::vector inferred_shape;
+    std::vector> dim_symbols;
+    int ort_arg_index;
+  };
+
+  std::vector input_metas;
+  std::vector output_metas;
+  std::vector> ort_aliased_output_to_func_indices;  // A pair of (Ort dst index, TVM src index)
+
+  struct AllocatorMeta {
+    int index;
+    bool is_external;
+  };
+
+  std::vector ort_input_allocators;
+  std::vector ort_output_allocators;
+
+  // Note an input can also be an external output,
+  // since a NodeArg can be used by Nodes both inside
+  // and outside a subgraph at the same time.
+  // When that happens, we need to label it as a collided output
+  // and record that external output's allocator index.
+  std::vector ort_input_allocator_is_collided_output;
+
+  // initializers meta
+  std::vector intializers;
+
+  // Note the total arg number == input_count + output_count
+  size_t func_input_count;   // input_count == real inputs + initializers
+  size_t func_output_count;  // real outputs
+
+  // tvm args (including inputs and outputs)
+  std::vector type_codes;
+
+  // control-flow info for the generated function
+  std::unique_ptr cf_info;
+
+  size_t ort_input_count;
+  size_t ort_output_count;
+};
+
+void FillNupharFuncInfo(NupharFuncInfo* func_info,
+                        nuphar::OrtSubgraphAllocationInfo* partition_info,
+                        const nuphar::NupharSubgraphUnit& subgraph,
+                        const NupharCodeGenCtx& codegen_ctx,
+                        tvm::Target tvm_target,
+                        tvm::runtime::PackedFunc packed_func,
+                        const std::string& name);
+
+}  // namespace nuphar
+}  // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/compiler/initializer_info.h b/onnxruntime/core/providers/nuphar/compiler/initializer_info.h
new file mode 100644
index 0000000000000..b2fd47829828e
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/initializer_info.h
@@ -0,0 +1,34 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
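+// initializer_info.h describes how an initializer (weight) is marshalled into a
+// target-specific layout: WeightLayoutCodegenInfo tracks the marshalled tensor and
+// layout name, and InitializerInfo pairs the original ORT tensor with that layout info.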
+ +#pragma once +#include "core/framework/tensor.h" +#include + +// TODO: move to nuphar +namespace onnxruntime { +namespace nuphar { + +// TODO: move it to weight layout place +struct WeightLayoutCodegenInfo { + const Tensor* marshalled_initializer = nullptr; // TODO: change it to unique_ptr + std::string layout = ""; // layout name + tvm::Tensor marshalled_tensor; + tvm::Tensor unmarshalled_tensor; + bool is_marshalled; + + WeightLayoutCodegenInfo(const tvm::Tensor& tvm_tensor) + : marshalled_tensor(tvm_tensor), unmarshalled_tensor(tvm_tensor), is_marshalled(false) {} +}; + +struct InitializerInfo { + const Tensor* original_initializer = nullptr; // original ort tensor + std::unique_ptr layout_info = nullptr; + + InitializerInfo(const Tensor* tensor) : original_initializer(tensor) {} +}; + +using InitializerMap = std::map; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_codegen_ctx.cc b/onnxruntime/core/providers/nuphar/compiler/nuphar_codegen_ctx.cc new file mode 100644 index 0000000000000..1c9c4dae9c39b --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_codegen_ctx.cc @@ -0,0 +1,247 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "nuphar_codegen_ctx.h" + +#include "core/codegen/common/common.h" +#include "core/codegen/common/utils.h" +#include "core/codegen/mti/mti_tvm_utils.h" // TODO: remove this after decoupling layout compile and run +#include "core/providers/nuphar/common/analysis/subgraph_codegen_stats.h" +#include "core/codegen/passes/utils/ort_tvm_utils.h" // TODO: remove this after decoupling layout compile and run +#include // TODO: remove this after decoupling layout compile and run + +#include "core/providers/nuphar/common/nuphar_tvm_utils.h" + +namespace onnxruntime { +namespace nuphar { + +NupharCodeGenCtx::NupharCodeGenCtx( + const Node& node, + const std::map& initializers, + std::unordered_map>& global_generated_initializers, + const NupharCodeGenHandle* handle) + : CodeGenContext(handle), + nuphar_handle_(handle), + initializers_(initializers), + global_generated_initializers_(global_generated_initializers) { + // construct graph_stats + graph_stats_ = std::make_unique(nuphar_handle_->shape_inference); +} + +NupharCodeGenCtx::NupharCodeGenCtx( + const nuphar::NupharSubgraphUnit& subgraph, + std::unordered_map>& global_generated_initializers, + const NupharCodeGenHandle* handle) + : CodeGenContext(handle), + nuphar_handle_(handle), + initializers_(subgraph.initializers), + global_generated_initializers_(global_generated_initializers) { + graph_stats_ = std::make_unique(nuphar_handle_->shape_inference); + Promote(graph_stats_)->Evaluate(subgraph); +} + +// This is a temp function before we decouple weight layout compilation and run +// This will be moved. +// TODO: remove this. 
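+// LowerLayoutFunc creates the TVM op and schedule that marshal an initializer into
+// the target weight layout, then returns the corresponding "<layout name>_marshall"
+// PackedFunc, either loaded from the on-disk cache or freshly lowered, built, and cached.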
+static tvm::runtime::PackedFunc LowerLayoutFunc(const tvm_codegen::WeightLayout* layout) { + tvm::Array inputs; + tvm::Array outputs; + + layout->CreateLayoutMarshallingTVMOp(inputs, outputs); + + auto config = tvm::build_config(); + config->disable_select_rewriting = true; + auto S = tvm::create_schedule({outputs[0]->op}); + S[outputs[0]->op].compute_root(); + + std::string func_name = layout->Name() + "_marshall"; + + tvm::runtime::PackedFunc cached_func = nuphar::LoadTVMPackedFuncFromCache(func_name); + + if (cached_func == nullptr) { + auto lowered = tvm::lower(S, {inputs[0], outputs[0]}, func_name, {}, config); + auto module = tvm::build(lowered, tvm::target::llvm(), tvm::Target(), config); + tvm_codegen::DumpTVMModuleToFile(func_name, module); + nuphar::SaveTVMModuleToCache(func_name, module); + cached_func = module.GetFunction(func_name); + } + return cached_func; +} + +// This is a temp function before we decouple weight layout compilation and run. +// This will be moved. +// TODO: remove this. +static const Tensor* Marshalling( + const std::string& initializer_name, + std::unordered_map>& global_generated_initializers, + const Tensor* original_initializer, + const tvm_codegen::WeightLayout* layout_ptr, + WeightLayoutCtx& ctx_layout, + AllocatorPtr allocator) { + tvm::runtime::PackedFunc packed_func; + + const std::string& layout_key = layout_ptr->Name(); + if (ctx_layout.weight_layout_to_packed_func.count(layout_key) == 0) { + packed_func = LowerLayoutFunc(layout_ptr); + ctx_layout.weight_layout_to_packed_func.insert(std::make_pair(layout_key, packed_func)); + } else { + packed_func = ctx_layout.weight_layout_to_packed_func[layout_key]; + } + + std::vector marshalled_shape = layout_ptr->ToActualShape(original_initializer); + auto marshalled_size = TotalSize(marshalled_shape); + auto byte_size = original_initializer->DataType()->Size(); + + std::unique_ptr out_ptr; + void* p_data = allocator->Alloc(marshalled_size * byte_size); + out_ptr = std::make_unique( + original_initializer->DataType(), + TensorShape(marshalled_shape), + p_data, + allocator->Info()); + + global_generated_initializers.emplace(initializer_name, std::move(out_ptr)); + + int num_args = 2; + DLContext tvm_ctx{kDLCPU, 0}; + std::vector lvalues(num_args); + std::vector tvm_tensors(num_args); + + // input + const auto& tensor_shape = original_initializer->Shape(); + auto input_shape = tensor_shape.GetDims(); + if (input_shape.empty()) + input_shape.push_back(1); + const void* input_data = original_initializer->DataRaw(); + DLDataType tvm_dtype = tvm_codegen::ToTvmDLDataType(original_initializer->DataType()); + + tvm_tensors[0] = {const_cast(input_data), tvm_ctx, + gsl::narrow_cast(input_shape.size()), tvm_dtype, + input_shape.data(), nullptr, 0}; + lvalues[0].v_handle = &(tvm_tensors[0]); + + // output + tvm_tensors[1] = {p_data, tvm_ctx, + gsl::narrow_cast(marshalled_shape.size()), tvm_dtype, + marshalled_shape.data(), nullptr, 0}; + lvalues[1].v_handle = &(tvm_tensors[1]); + + auto types_code = std::vector(num_args, kNDArrayContainer); + tvm::TVMArgs tvm_args(lvalues.data(), types_code.data(), num_args); + tvm::TVMRetValue rvalue; + packed_func.CallPacked(tvm_args, &rvalue); + return global_generated_initializers.at(initializer_name).get(); +} + +// on the fly WeightLayout transformer +tvm::Tensor NupharCodeGenCtx::ApplyWeightLayout( + const std::string& layout_key, + const std::string& initializer_name, + const tvm::Tensor& X, + bool returnMarshalled) { + tvm::Tensor marshalled; + 
ORT_ENFORCE(IsInitializer(initializer_name)); + auto layout_info = GetWeightLayoutInfo(initializer_name); + ORT_ENFORCE(nullptr != layout_info); + + const Tensor* original_initializer = GetOrtInitializerTensor(initializer_name); + + auto layout_ptr = nuphar_handle_->layout_registry->Get(layout_key); + ORT_ENFORCE(nullptr != layout_ptr); + + // check whether the weight is applied layout marshalling + if (nullptr == layout_info->marshalled_initializer) { + ORT_ENFORCE(!layout_info->is_marshalled); // initializer should not have been marshalled before + + // TODO: change to delayed call + layout_info->layout = layout_ptr->Name(); + + // TODO: change to delayed call + layout_info->marshalled_initializer = + Marshalling(initializer_name, + global_generated_initializers_, + original_initializer, + layout_ptr, + weight_layout_ctx_, + nuphar_handle_->allocator); + + layout_info->marshalled_tensor = tvm::placeholder(layout_ptr->ToActualShape(X), X->dtype, initializer_name + "_marshalled"); + layout_info->unmarshalled_tensor = tvm::compute( + X->shape, + [&](const tvm::Array& nominal_coord) { + tvm::Array cc; + for (auto v : nominal_coord) + cc.push_back(v); + + auto coord_trans_func = layout_ptr->ToActual(X); + return layout_info->marshalled_tensor(coord_trans_func(cc)); + }, + initializer_name + "_unmarshalled"); + + layout_info->is_marshalled = true; + + } else { + ORT_ENFORCE(layout_ptr->Name() == layout_info->layout); + } + + if (returnMarshalled) { + return layout_info->marshalled_tensor; + } + return layout_info->unmarshalled_tensor; +} + +const NupharSubgraphUnitStats* NupharCodeGenCtx::GetGraphStats() const { + return graph_stats_.get(); +} + +bool NupharCodeGenCtx::IsInitializer(const std::string& name) const { + return initializers_.count(name) > 0; +} + +const Tensor* NupharCodeGenCtx::GetOrtInitializerTensor(const std::string& name) const { + if (IsInitializer(name)) + return initializers_.at(name); + return nullptr; +} + +WeightLayoutCodegenInfo* NupharCodeGenCtx::GetWeightLayoutInfo(const std::string& name) { + if (initializer_layouts_.count(name) > 0) + return initializer_layouts_.at(name).get(); + return nullptr; +} + +const WeightLayoutCodegenInfo* NupharCodeGenCtx::GetWeightLayoutInfo(const std::string& name) const { + if (initializer_layouts_.count(name) > 0) + return initializer_layouts_.at(name).get(); + return nullptr; +} + +void NupharCodeGenCtx::CreateWeightLayoutInfo(const std::string& name, const tvm::Tensor& tensor) { + ORT_ENFORCE(initializer_layouts_.count(name) == 0); + initializer_layouts_.emplace(name, std::move(std::make_unique(tensor))); +} + +const std::map>& NupharCodeGenCtx::GetWeightLayoutMap() const { + return initializer_layouts_; +} + +void NupharCodeGenCtx::RecordTensorToNode(const tvm::Tensor& t, const Node* node) { + // Insert tvm::Tensor and Node to the lookup table + // But bypass it when node is a output alias + if (!Promote(graph_stats_)->IsOutputAlias(node)) + tvm_tensor_to_node_lookup_.insert(std::make_pair(t->op.get(), node)); +} + +const Node* NupharCodeGenCtx::FindNode(const tvm::Tensor& t) const { + auto p = tvm_tensor_to_node_lookup_.find(t->op.get()); + if (p != tvm_tensor_to_node_lookup_.end()) + return p->second; + return nullptr; +} + +const NupharCodeGenHandle* NupharCodeGenCtx::GetCodeGenHandle() const { + return nuphar_handle_; +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_codegen_ctx.h b/onnxruntime/core/providers/nuphar/compiler/nuphar_codegen_ctx.h new file mode 
100644 index 0000000000000..69ffa04adb3cd --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_codegen_ctx.h @@ -0,0 +1,147 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include "core/codegen/common/common.h" +#include "core/codegen/passes/utils/codegen_context.h" +#include "core/common/common.h" +#include "core/graph/graph.h" +#include "core/providers/nuphar/common/analysis/graph_stats.h" +#include "core/providers/nuphar/common/nuphar_subgraph.h" +#include "core/providers/nuphar/compiler/initializer_info.h" +#include "core/providers/nuphar/compiler/nuphar_handle.h" + +#include + +namespace onnxruntime { +namespace nuphar { + +// Nuphar Tensor Context +struct TVMTensorCtx { + std::map inputs; + std::map> ops; + std::map> input_from; + + bool Lookup(const NodeArg* def, tvm::Tensor& tensor) { + const std::string& def_name = def->Name(); + auto iter = inputs.find(def_name); + if (iter != inputs.end()) { + tensor = iter->second; + return true; + } + + auto iter_out_index = input_from.find(def_name); + + if (iter_out_index == input_from.end()) { + return false; + } + + const Node* from_node = iter_out_index->second.first; + size_t index = iter_out_index->second.second; + auto iter_op = ops.find(from_node); + ORT_ENFORCE(iter_op != ops.end()); + tensor = iter_op->second[index]; + return true; + } + + const tvm::Tensor + Lookup(const NodeArg* def) const { + const std::string& def_name = def->Name(); + auto iter = inputs.find(def_name); + if (iter != inputs.end()) { + return iter->second; + } + + auto iter_out_index = input_from.find(def_name); + + ORT_ENFORCE(iter_out_index != input_from.end()); + + const Node* from_node = iter_out_index->second.first; + size_t index = iter_out_index->second.second; + auto iter_op = ops.find(from_node); + ORT_ENFORCE(iter_op != ops.end()); + return iter_op->second[index]; + } +}; + +struct WeightLayoutCtx { + //std::map initializer_to_weight_layout; // unused yet. 
This is for decoupling weight layout compile and run + std::unordered_map weight_layout_to_packed_func; +}; + +// NupharCodeGenCtx is Nuphar-specific CodeGenContext +class NupharCodeGenCtx : public tvm_codegen::CodeGenContext { + public: + NupharCodeGenCtx(const Node& node, + const std::map& initializers, + std::unordered_map>& global_generated_initializers, + const NupharCodeGenHandle* handle); + + NupharCodeGenCtx(const nuphar::NupharSubgraphUnit& subgraph, + std::unordered_map>& global_generated_initializers, + const NupharCodeGenHandle* handle); + + virtual ~NupharCodeGenCtx() = default; + + const NupharSubgraphUnitStats* GetGraphStats() const; + + bool IsInitializer(const std::string& name) const; + const Tensor* GetOrtInitializerTensor(const std::string& name) const; + WeightLayoutCodegenInfo* GetWeightLayoutInfo(const std::string& name); + const WeightLayoutCodegenInfo* GetWeightLayoutInfo(const std::string& name) const; + void CreateWeightLayoutInfo(const std::string& name, const tvm::Tensor& tensor); + const std::map>& GetWeightLayoutMap() const; + + // On-the-fly apply an existing layout + tvm::Tensor ApplyWeightLayout( + const std::string& layout_key, + const std::string& initializer_name, + const tvm::Tensor& X, + bool returnMarshalled); + + void RecordTensorToNode(const tvm::Tensor& t, const Node* node); + const Node* FindNode(const tvm::Tensor& t) const; + + const NupharCodeGenHandle* GetCodeGenHandle() const; + + // TODO remove this after decoupling compiler and runtime of WeightLayout + template + IAllocatorUniquePtr AllocateT(size_t size) const { return IAllocator::MakeUniquePtr(nuphar_handle_->allocator, size); } + // TODO remove this after decoupling compiler and runtime of WeightLayout + IAllocatorUniquePtr Allocate(size_t size) const { return AllocateT(size); } + + // Keep for CodeGenContext + TVMTensorCtx& GetTVMTensorCtx() { + return tvm_tensor_ctx_; + } + + // Keep for CodeGenContext + const TVMTensorCtx& GetTVMTensorCtx() const { + return tvm_tensor_ctx_; + } + + private: + std::unique_ptr graph_stats_; + + const NupharCodeGenHandle* nuphar_handle_; + + const std::map& initializers_; + + // A table from tvm::Tensor (its unchanged source tvm::Node*) to ORT Node + std::unordered_map tvm_tensor_to_node_lookup_; + + // All TVM Tensor and correponidng shape context + TVMTensorCtx tvm_tensor_ctx_; + + // local copy + std::map> initializer_layouts_; + + std::unordered_map>& global_generated_initializers_; + + // all layouts + WeightLayoutCtx weight_layout_ctx_; +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_compiler.cc b/onnxruntime/core/providers/nuphar/compiler/nuphar_compiler.cc new file mode 100644 index 0000000000000..61716f1a1f40e --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_compiler.cc @@ -0,0 +1,229 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
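+
+// nuphar_compiler.cc: NupharCompiler builds TVM IR for a NupharSubgraphUnit (Build)
+// or for a Scan node's subgraph (BuildSubgraph), then lowers and compiles it into a
+// cached tvm::runtime::PackedFunc while filling the corresponding NupharFuncInfo
+// (Lower / GetLoweredPackedFunc).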
+ +#include "core/providers/nuphar/compiler/nuphar_compiler.h" + +#include "core/codegen/common/profile.h" +#include "core/codegen/common/settings.h" +#include "core/codegen/mti/mti_tvm_utils.h" +#include "core/codegen/passes/utils/ort_tvm_utils.h" +#include "core/mlas/inc/mlas.h" +#include "core/providers/nuphar/common/analysis/subgraph_codegen_stats.h" +#include "core/providers/nuphar/common/nuphar_settings.h" +#include "core/providers/nuphar/common/nuphar_tvm_utils.h" +#include "core/providers/nuphar/compiler/nuphar_handle.h" +#include "core/providers/nuphar/compiler/nuphar_op_ir_builder.h" +#include "core/providers/nuphar/compiler/nuphar_schedule_builder.h" + +namespace onnxruntime { +namespace nuphar { + +static void HandleAllOutputs( + const std::vector& outputs, + tvm::Array& tvm_args, + tvm::Array& tvm_outputs, + const NupharCodeGenCtx& context) { + // find out all outputs + std::set visited_alias_def; + auto add_tvm_arg_and_output = [&](const onnxruntime::NodeArg* def) { + auto& tvm_tensor = context.GetTVMTensorCtx().Lookup(def); + tvm_args.push_back(tvm_tensor); + tvm_outputs.push_back(tvm_tensor); + }; + + for (const NodeArg* def : outputs) { + const NodeArg* input_def = Promote(context.GetGraphStats())->SourceDefOfOutputAlias(def); + if (input_def) { + auto key = GetKey(input_def); + if (visited_alias_def.count(key) == 0) { + visited_alias_def.insert(key); + add_tvm_arg_and_output(input_def); + } + } else { + auto key = GetKey(def); + if (visited_alias_def.count(key) == 0) { + visited_alias_def.insert(key); + add_tvm_arg_and_output(def); + } + } + } +} + +// Constructor for Node +// This is mainly for single node support +// For multiple subgraph support, please call the next constructor +NupharCompiler::NupharCompiler(const Node& node, + const std::map& initializer, + std::unordered_map>& generated_initializers, + const NupharCodeGenHandle* handle) + : num_initializers_in_graph_inputs_(0), + context_(node, initializer, generated_initializers, handle) {} + +NupharCompiler::NupharCompiler(const nuphar::NupharSubgraphUnit& subgraph, + std::unordered_map>& generated_initializers, + const NupharCodeGenHandle* handle) + : num_initializers_in_graph_inputs_(0), + context_(subgraph, generated_initializers, handle) {} + +Status NupharCompiler::Build(const nuphar::NupharSubgraphUnit& subgraph) { + if (subgraph.nodes.front()->OpType() == "Scan") { + return BuildSubgraph(*subgraph.nodes.front()); + } + + tvm_args_ = tvm::Array(); + tvm_outputs_ = tvm::Array(); + + ORT_RETURN_IF_ERROR(CreateTVMIR(subgraph, context_)); + + // fill in all non-initializer inputs + num_initializers_in_graph_inputs_ = 0; + for (auto& def : subgraph.inputs) { + if (context_.IsInitializer(def->Name())) { + ++num_initializers_in_graph_inputs_; + } else { + tvm_args_.push_back(context_.GetTVMTensorCtx().Lookup(def)); + } + } + + // fill in all initializers + for (const auto& item : context_.GetWeightLayoutMap()) { + const WeightLayoutCodegenInfo* layout_info = item.second.get(); + tvm_args_.push_back(layout_info->marshalled_tensor); + } + + // find out all outputs, and save the output shapes + HandleAllOutputs(subgraph.outputs, tvm_args_, tvm_outputs_, context_); + + return Status::OK(); +} + +// BuildSubgraph drive a graph traversal that calls CreateInput and CreateOutputs metioned above for a subgraph. +// And collect args among nodes. 
+// We need another API other than Build, because name mismatching +Status NupharCompiler::BuildSubgraph(const Node& node) { + tvm_args_ = tvm::Array(); + tvm_outputs_ = tvm::Array(); + + auto subgraph = GetSubgraph(node); + + ORT_RETURN_IF_ERROR(CreateTVMIR(GraphViewer(*subgraph), context_, /*use_placeholder_for_input*/ true)); + + num_initializers_in_graph_inputs_ = 0; + // fill in all non-initializer inputs + + for (const auto& input : subgraph->GetInputs()) { + if (context_.IsInitializer(input->Name())) { + ++num_initializers_in_graph_inputs_; + } else { + tvm_args_.push_back(context_.GetTVMTensorCtx().Lookup(input)); + } + } + + // fill in implicit inputs + for (const auto& input : node.ImplicitInputDefs()) { + if (context_.IsInitializer(input->Name())) { + ++num_initializers_in_graph_inputs_; + } else { + tvm_args_.push_back(context_.GetTVMTensorCtx().Lookup(input)); + } + } + + // fill in all initializers + for (const auto& item : context_.GetWeightLayoutMap()) { + const WeightLayoutCodegenInfo* layout_info = item.second.get(); + tvm_args_.push_back(layout_info->marshalled_tensor); + } + + // find out all outputs + HandleAllOutputs(subgraph->GetOutputs(), tvm_args_, tvm_outputs_, context_); + + return Status::OK(); +} + +tvm::runtime::PackedFunc NupharCompiler::GetLoweredPackedFunc( + const std::string& func_name, + tvm::Target tvm_target, + tvm::Target tvm_host_target, + const tvm::BuildConfig& config, + const std::string& subgraph_type, + const std::string& subgraph_name) { + // TODO: refactor the following logic for both JIT-caching and AOT support + // JIT-caching and AOT are mutual exclusive. + // Change it by not always saving a compiled func unless it is in JIT-Caching model. + // In AOT, there should be another member func explicitly loading + tvm::runtime::PackedFunc cached_func = nuphar::LoadTVMPackedFuncFromCache(func_name); + if (cached_func == nullptr) { + codegen::CodeGenSettings& settings = codegen::CodeGenSettings::Instance(); + + if (settings.HasOption(kNupharCacheForceNoJIT)) { + if (settings.OptionMatches(kNupharCacheForceNoJIT, "on")) { + ORT_THROW("Force not using JIT code!"); + } + } + + tvm::Schedule tvm_schedule = CreateSchedule(tvm_outputs_, context_); + std::unordered_map binds; + tvm::Array lowered = tvm::lower(tvm_schedule, tvm_args_, func_name, binds, config); + + if (settings.HasOption(codegen::CodeGenSettings::kCodeGenDumpLower)) { + if (settings.OptionMatches(codegen::CodeGenSettings::kCodeGenDumpLower, "verbose") || + settings.OptionMatches(codegen::CodeGenSettings::kCodeGenDumpLower, subgraph_type)) { + for (const auto& func : lowered) + LOGS_DEFAULT(CODEGEN_SETTINGS_LOG_LEVEL) << "[CODEGEN_DUMP_LOWER] Dumping lowered func: " << func << std::endl + << func->body; + } else if (settings.OptionMatches(codegen::CodeGenSettings::kCodeGenDumpLower, "concise")) { + LOGS_DEFAULT(CODEGEN_SETTINGS_LOG_LEVEL) << "[CODEGEN_DUMP_LOWER] Subgraph Type: " + << subgraph_type << ", name: " << subgraph_name + << " #lowered funcs: " << lowered.size() << std::endl; + } + } + + tvm::runtime::Module module = tvm::build(lowered, tvm_target, tvm_host_target, config); + tvm_codegen::DumpTVMModuleToFile(func_name, module); + nuphar::SaveTVMModuleToCache(func_name, module); + cached_func = module.GetFunction(func_name); + } + + return cached_func; +} + +static tvm::BuildConfig CreateConfig(const Node& node, + bool allow_unaligned_buffers) { + tvm::BuildConfig config = tvm::build_config(); + config->disable_select_rewriting = true; + + if (allow_unaligned_buffers) { + 
config->data_alignment = 1; // aligned to 1 + } else { + config->data_alignment = gsl::narrow(MlasGetPreferredBufferAlignment()); + } + + config->restricted_func = true; + return config; +} + +// Lower compiles the tvm::Tensor to a function +Status NupharCompiler::Lower(const nuphar::NupharSubgraphUnit& subgraph, + tvm::Target tvm_target, + tvm::Target tvm_host_target, + NupharFuncInfo* func_info, + nuphar::OrtSubgraphAllocationInfo* partition_info) { + const auto& target_codegen = *context_.GetCodeGenHandle()->codegen_target; + std::string func_name = nuphar::GetPackedFuncName(subgraph, target_codegen); + tvm::BuildConfig config = CreateConfig(*subgraph.nodes.front(), + context_.GetCodeGenHandle()->allow_unaligned_buffers); + + // using "subgraph" for type and name for now + // TODO: change name + tvm::runtime::PackedFunc cached_func = + GetLoweredPackedFunc( + func_name, tvm_target, tvm_host_target, + config, "subgraph", "subgraph"); + + FillNupharFuncInfo(func_info, partition_info, subgraph, context_, tvm_target, cached_func, func_name); + + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_compiler.h b/onnxruntime/core/providers/nuphar/compiler/nuphar_compiler.h new file mode 100644 index 0000000000000..1b4b4d5376f99 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_compiler.h @@ -0,0 +1,65 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include "core/codegen/common/common.h" +#include "core/providers/nuphar/common/nuphar_subgraph.h" +#include "core/providers/nuphar/compiler/func_info.h" +#include "core/providers/nuphar/compiler/initializer_info.h" +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" +#include "core/providers/nuphar/compiler/nuphar_handle.h" +#include "core/providers/nuphar/compiler/traverse_shape_infer.h" +#include "core/framework/op_kernel.h" +#include "core/graph/graph.h" +#include "gsl/gsl_util" + +#include +#include + +namespace onnxruntime { +namespace nuphar { + +class NupharCompiler { + public: + NupharCompiler(const Node& node, + const std::map& initializers, + std::unordered_map>& generated_initializers, + const NupharCodeGenHandle* handle); + + NupharCompiler(const nuphar::NupharSubgraphUnit& subgraph, + std::unordered_map>& generated_initializers, + const NupharCodeGenHandle* handle); + + // Build builds tvm IR and apply passes + Status Build(const nuphar::NupharSubgraphUnit& subgraph); + + // Lower lowers the built tvm IR to llvm ir and compiles it + Status Lower(const nuphar::NupharSubgraphUnit& subgraph, + tvm::Target tvm_target, + tvm::Target tvm_host_target, + NupharFuncInfo* ctx_func, + nuphar::OrtSubgraphAllocationInfo* partition_info); + + tvm::runtime::PackedFunc GetLoweredPackedFunc( + const std::string& func_name, + tvm::Target tvm_target, + tvm::Target tvm_host_target, + const tvm::BuildConfig& config, + const std::string& subgraph_type, + const std::string& subgraph_name); + + private: + size_t num_initializers_in_graph_inputs_; + + // BuildSubgraph builds tvm IR and apply passes for a subgraph + Status BuildSubgraph(const Node& node); + + NupharCodeGenCtx context_; + + tvm::Array tvm_args_; + tvm::Array tvm_outputs_; +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_handle.h b/onnxruntime/core/providers/nuphar/compiler/nuphar_handle.h new file mode 100644 index 
0000000000000..84be4555ba0f4 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_handle.h @@ -0,0 +1,40 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include "core/codegen/common/common.h" +#include "core/codegen/common/handle.h" +#include "core/codegen/common/target_info.h" +#include "core/codegen/passes/weight_layout/weight_layout.h" +#include "core/framework/allocator.h" // TODO: get rid of this +#include "core/providers/nuphar/compiler/traverse_shape_infer.h" // TODO: get rid of this + +namespace onnxruntime { + +// forwarding +namespace tvm_codegen { +class TVMIRBuilder; +class TVMScheduleBuilder; +} // namespace tvm_codegen + +namespace nuphar { + +// TVM is a wrapper containing CodeGen related setting +// TODO: make this the Base +// TODO: create one for nuphar +struct NupharCodeGenHandle : codegen::CodeGenHandle { + std::shared_ptr op_ir_builder; // keep + std::shared_ptr schedule_builder; // keep + // maybe add a layout + tvm_codegen::WeightLayoutRegistry* layout_registry; + bool enable_per_node_parallelized; // TODO: change to config + + bool allow_unaligned_buffers; // move to another place + + AllocatorPtr allocator; // remove + std::shared_ptr shape_inference; // remove +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_op_ir_builder.cc b/onnxruntime/core/providers/nuphar/compiler/nuphar_op_ir_builder.cc new file mode 100644 index 0000000000000..4578134d359ec --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_op_ir_builder.cc @@ -0,0 +1,311 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/nuphar_op_ir_builder.h" + +#include "core/codegen/common/op_macro.h" +#include "core/codegen/mti/mti_tvm_utils.h" +#include "core/codegen/passes/op_ir_creator/all_ops.h" +#include "core/codegen/passes/op_ir_creator/tvm_ir_builder.h" +#include "core/codegen/passes/utils/ort_tvm_utils.h" +#include "core/common/common.h" +#include "core/providers/nuphar/compiler/initializer_info.h" +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +namespace onnxruntime { +namespace nuphar { + +// Declaration of GetOrCreateInitializer +// GetOrCreateInitializer create tvm::placeholder for a marshalled weight +// with correpsonding data layout transfomration for a weight, +// Note the weight is fed during build +static const tvm::Tensor& GetOrCreateInitializer(const std::string& name, + const Tensor* tensor, + bool is_sliced, + NupharCodeGenCtx& ctx_codegen); + +static const tvm::Tensor& GetOrCreateInitializer(const NodeArg* def, + const Tensor* tensor, + bool is_sliced, + NupharCodeGenCtx& ctx_codegen); + +// CreateInputPlaceholder create tvm input placeholder (tvm::Tensor) +// NOTE: here we assume axis 0 is sequence +// TODO: add support for sequence not axis 0 +static tvm::Tensor CreateInputPlaceholder(const tvm::Array& shape, + HalideIR::Type halide_type, + const std::string& name, + bool is_sliced) { + return tvm::placeholder(is_sliced && shape.size() > 1 ? 
tvm_codegen::SliceShapeFromDimension(shape, 1) : shape, halide_type, name); +} + +// CreateInput creats tvm::Tensor of corresponding ORT input +// Inputs are either initializer or regular input placeholder +static bool CreateInput( + const NodeArg* def, + tvm::Tensor& input, + bool initializer_only, + bool is_sliced, + NupharCodeGenCtx& ctx_codegen) { + const Tensor* initialized_tensor = ctx_codegen.GetOrtInitializerTensor(def->Name()); + if (nullptr == initialized_tensor && initializer_only) + return false; + + ORT_ENFORCE(def->Shape()); + if (nullptr != initialized_tensor) { + input = GetOrCreateInitializer(def, initialized_tensor, is_sliced, ctx_codegen); + } else { + // Handle inputs without initializer + std::string name = NormalizeNodeArgName(def); + MLDataType ONNXRUNTIME_data_type = DataTypeImpl::TypeFromProto(*def->TypeAsProto()); + DLDataType dtype = tvm_codegen::ToTvmDLDataType(ONNXRUNTIME_data_type); + HalideIR::Type halide_type((halideir_type_code_t)dtype.code, dtype.bits, dtype.lanes); + tvm::Array shape = ShapeToTvmArray(def, ctx_codegen); + + // Create InputPlaceholder + // Slice InputPlaceholder if it is asked for. + input = CreateInputPlaceholder(shape, halide_type, name, is_sliced); + } + return true; +} + +// GetOrCreateInitializer create tvm::placeholder for a marshalled weight +// with correpsonding data layout transfomration for a weight, +// Note the weight is fed during build +const tvm::Tensor& GetOrCreateInitializer(const std::string& name, + const Tensor* tensor, + bool is_sliced, + NupharCodeGenCtx& ctx_codegen) { + ORT_ENFORCE(ctx_codegen.IsInitializer(name)); + + auto layout_info = ctx_codegen.GetWeightLayoutInfo(name); + if (nullptr != layout_info) { + return layout_info->marshalled_tensor; + } + + auto ONNXRUNTIME_data_type = tensor->DataType(); + DLDataType dtype = tvm_codegen::ToTvmDLDataType(ONNXRUNTIME_data_type); + HalideIR::Type halide_type((halideir_type_code_t)dtype.code, dtype.bits, dtype.lanes); + std::string normalized_name = NormalizeCppName(name); + auto tvm_shape = tvm_codegen::ToTvmArray(tensor->Shape().GetDims()); + auto tvm_tensor = CreateInputPlaceholder(tvm_shape, halide_type, normalized_name, is_sliced); + // create the layout info + ctx_codegen.CreateWeightLayoutInfo(name, tvm_tensor); + return ctx_codegen.GetWeightLayoutInfo(name)->marshalled_tensor; +} + +const tvm::Tensor& GetOrCreateInitializer(const NodeArg* def, + const Tensor* tensor, + bool is_sliced, + NupharCodeGenCtx& ctx_codegen) { + return GetOrCreateInitializer(def->Name(), tensor, is_sliced, ctx_codegen); +} + +// CreateOutputs constructs tvm::Tensor with corresponding computation +static Status CreateOutputs(const Node* node, + const tvm::Array& inputs, + tvm::Array& outputs, + NupharCodeGenCtx& ctx_codegen) { + ORT_RETURN_IF_ERROR(ctx_codegen.GetCodeGenHandle() + ->op_ir_builder + ->Evaluate(inputs, *node, ctx_codegen, outputs)); + + // Collect constructed tvm::Node to onnxruntime::Node mapping + // Both states and outputs + for (const auto& t : outputs) { + ctx_codegen.RecordTensorToNode(t, node); + } + + return Status::OK(); +} + +// CreateTVMIR is the entry function for building TVM IR +// It will call TVMIRBuilder (in CreateOutputs) from CodeGenContext +Status CreateTVMIR( + const GraphViewer& graph, + NupharCodeGenCtx& ctx_codegen, + bool use_placeholder_for_input) { + TVMTensorCtx& ctx_tensor = ctx_codegen.GetTVMTensorCtx(); + + if (use_placeholder_for_input) { + // build graph inputs + const auto& graph_inputs = graph.GetInputs(); + for (size_t i = 0; i < 
graph_inputs.size(); ++i) { + tvm::Tensor value; + if (CreateInput(graph_inputs[i], value, + /*initializer_only*/ false, /*is_sliced*/ false, + ctx_codegen)) { + ctx_tensor.inputs.emplace(graph_inputs[i]->Name(), std::move(value)); + } + } + } + + for (const auto& node : graph.Nodes()) { + // initializers + node.ForEachWithIndex( + node.InputDefs(), + [&ctx_codegen, &ctx_tensor](const NodeArg& def, size_t) { + tvm::Tensor value; + if (CreateInput(&def, value, /*initializer_only*/ true, /*is_sliced*/ false, + ctx_codegen)) { + ctx_tensor.inputs.emplace(def.Name(), std::move(value)); + } + return Status::OK(); + }); + } + + // iterate through the graph and create op (outputs) + for (auto node_index : graph.GetNodesInTopologicalOrder()) { + const auto& node = *graph.GetNode(node_index); + tvm::Array inputs; + for (const NodeArg* def : node.InputDefs()) { + tvm::Tensor input; + if (def->Exists()) { + bool exist = ctx_tensor.Lookup(def, input); + if (!exist) { + tvm::Tensor value; + if (CreateInput(def, value, + /*initializer_only*/ false, /*is_sliced*/ false, + ctx_codegen)) { + ctx_tensor.inputs.emplace(def->Name(), std::move(value)); + } + input = ctx_tensor.Lookup(def); + } + } + inputs.push_back(input); + } + + auto subgraph = GetSubgraph(node); + if (nullptr != subgraph) { + // unboxing + GraphViewer subgraph_viewer(*subgraph); + ORT_RETURN_IF_ERROR(CreateTVMIR(subgraph_viewer, ctx_codegen, /*use_placeholder_for_input*/ false)); + } else { + tvm::Array op_outputs; + ORT_RETURN_IF_ERROR(CreateOutputs(&node, inputs, op_outputs, ctx_codegen)); + ctx_tensor.ops.emplace(&node, std::move(op_outputs)); + + // input_from_ + node.ForEachWithIndex( + node.OutputDefs(), + [&node, &ctx_tensor](const NodeArg& def, size_t index) { + ORT_ENFORCE(ctx_tensor.input_from.count(def.Name()) == 0); + ctx_tensor.input_from.emplace(def.Name(), std::make_pair(&node, index)); + return Status::OK(); + }); + } + } + + return Status::OK(); +} + +// CreateTVMIR is the entry function for building TVM IR +// It will call TVMIRBuilder (in CreateOutputs) from CodeGenContext +Status CreateTVMIR( + const Node& node, + NupharCodeGenCtx& ctx_codegen) { + // wrapper + TVMTensorCtx& ctx_tensor = ctx_codegen.GetTVMTensorCtx(); + bool has_loop = HasLoop(node); + + // create real Inputs + node.ForEachWithIndex( + node.InputDefs(), + [&has_loop, &ctx_codegen, &ctx_tensor](const NodeArg& def, size_t) { + tvm::Tensor value; + if (CreateInput(&def, value, /*initializer_only*/ false, /*is_sliced*/ has_loop, + ctx_codegen)) { + ctx_tensor.inputs.emplace(def.Name(), std::move(value)); + } + return Status::OK(); + }); + + // input_from_ + node.ForEachWithIndex( + node.OutputDefs(), + [&node, &ctx_tensor](const NodeArg& def, size_t index) { + ctx_tensor.input_from.emplace(def.Name(), std::make_pair(&node, index)); + return Status::OK(); + }); + + tvm::Array inputs; + for (const NodeArg* def : node.InputDefs()) { + inputs.push_back(def->Exists() ? 
ctx_tensor.Lookup(def) : tvm::Tensor()); + } + + // create ops (outputs) + tvm::Array op_outputs; + ORT_RETURN_IF_ERROR(CreateOutputs(&node, inputs, op_outputs, ctx_codegen)); + ctx_tensor.ops.emplace(&node, std::move(op_outputs)); + + return Status::OK(); +} + +// CreateTVMIR is the entry function for building TVM IR +// It will call TVMIRBuilder (in CreateOutputs) from CodeGenContext +Status CreateTVMIR( + const nuphar::NupharSubgraphUnit& subgraph, + NupharCodeGenCtx& ctx_codegen) { + //////////////////////////////////////// + // handle a special case for a single node + //////////////////////////////////////// + if (subgraph.IsSingleNode()) { + const Node* node = subgraph.nodes.front(); + + const Graph* onnx_graph = GetSubgraph(*node); + + if (nullptr != onnx_graph) { + return CreateTVMIR(GraphViewer(*onnx_graph), ctx_codegen, true); + } + return CreateTVMIR(*node, ctx_codegen); + } + + ////////////////////////////// + // handle a generic subgraph below + ////////////////////////////// + TVMTensorCtx& ctx_tensor = ctx_codegen.GetTVMTensorCtx(); + + // build subgraph inputs + for (const NodeArg* def : subgraph.inputs) { + tvm::Tensor value; + + if (CreateInput(def, value, /*initializer_only*/ false, /*is_sliced*/ false, + ctx_codegen)) { + ctx_tensor.inputs.emplace(def->Name(), std::move(value)); + } + } + + // build subgraph initializers + for (auto& p : subgraph.initializers) { + tvm::Tensor value = GetOrCreateInitializer(p.first, p.second, false, ctx_codegen); + ctx_tensor.inputs.emplace(p.first, std::move(value)); + } + + // iterate through the subgraph nodes and create op (outputs) + for (auto& node : subgraph.nodes) { + tvm::Array inputs; + + // collects local inputs + for (const NodeArg* def : node->InputDefs()) { + inputs.push_back(def->Exists() ? ctx_tensor.Lookup(def) : tvm::Tensor()); + } + + tvm::Array op_outputs; + ORT_RETURN_IF_ERROR(CreateOutputs(node, inputs, op_outputs, ctx_codegen)); + ctx_tensor.ops.emplace(node, std::move(op_outputs)); + + // input_from_ + node->ForEachWithIndex( + node->OutputDefs(), + [&node, &ctx_tensor](const NodeArg& def, size_t index) { + ORT_ENFORCE(ctx_tensor.input_from.count(def.Name()) == 0); + ctx_tensor.input_from.emplace(def.Name(), std::make_pair(node, index)); + return Status::OK(); + }); + } + + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_op_ir_builder.h b/onnxruntime/core/providers/nuphar/compiler/nuphar_op_ir_builder.h new file mode 100644 index 0000000000000..532e917b5d8cc --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_op_ir_builder.h @@ -0,0 +1,34 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
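+
+// Note: the GraphViewer overload below is used when compiling a node's ONNX
+// subgraph (e.g. Scan, see NupharCompiler::BuildSubgraph), the Node overload
+// handles a single node, and the NupharSubgraphUnit overload is the entry
+// point for fused subgraphs (NupharCompiler::Build).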
+ +#pragma once + +#include "core/common/common.h" +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" + +#include "core/providers/nuphar/common/nuphar_subgraph.h" + +namespace onnxruntime { +namespace nuphar { + +// CreateTVMIR function traverses a GraphViewer +// and builds tvm ir (and store them in CodeGenContext) +// based on corresponding ORT ir +Status CreateTVMIR(const GraphViewer& graph, + NupharCodeGenCtx& ctx_codegen, + bool use_placeholder_for_input); + +// CreateTVMIR function traverses a single node +// and builds tvm ir (and store them in CodeGenContext) +// based on corresponding ORT ir +Status CreateTVMIR(const Node& node, + NupharCodeGenCtx& ctx_codegen); + +// CreateTVMIR function traverses a NupharSubgraphUnit +// and builds tvm ir (and store them in CodeGenContext) +// based on corresponding ORT ir +Status CreateTVMIR(const nuphar::NupharSubgraphUnit& subgraph, + NupharCodeGenCtx& ctx_codegen); + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_schedule_builder.cc b/onnxruntime/core/providers/nuphar/compiler/nuphar_schedule_builder.cc new file mode 100644 index 0000000000000..2755f0c01aed1 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_schedule_builder.cc @@ -0,0 +1,77 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/nuphar_schedule_builder.h" + +#include "core/codegen/common/settings.h" +#include "core/codegen/passes/scheduler/schedule_utils.h" +#include "core/codegen/passes/scheduler/tvm_schedule_builder.h" + +#include "core/providers/nuphar/common/analysis/subgraph_codegen_stats.h" + +// TODO change name space +namespace onnxruntime { +namespace nuphar { + +// Traverse iterates a tvm::Tensor and itself dependencies +// and builds schedule (in ScheduleContext) +// based on corresponding ORT ir and TVM ir +static void Traverse(const tvm::Tensor& tensor, + const Node* node, + NupharCodeGenCtx& ctx_codegen, + tvm_codegen::ScheduleContext& ctx_schedule) { + // no need to traverse on nodes already marked as closured + if (ctx_schedule.scheduled_tensors.count(tensor->op.get()) > 0) { + if (ctx_schedule.scheduled_tensors[tensor->op.get()] == tvm_codegen::ScheduleType::ScheduleClosure) { + return; + } + } + + ctx_codegen.GetCodeGenHandle()->schedule_builder->Evaluate(tensor, node, ctx_codegen, ctx_schedule); + + // for real ouput + bool is_real_output = nullptr != node && + Promote(ctx_codegen.GetGraphStats())->IsOutputNode(node); + + if (is_real_output) { + // TODO change it to the value from Target + int64_t natural_vector_size = 16; + + TryVectorization(tensor, natural_vector_size, ctx_schedule); // to x86 + InsertRootScheduleAndClosure(tensor, ctx_schedule); + } + + // Traverse tensor's children + for (auto& t : tensor->op->InputTensors()) { + // check whether it is a tensor having inputs + if (t->op->InputTensors().size() > 0) { + auto current_node = ctx_codegen.FindNode(t); + Traverse(t, current_node, ctx_codegen, ctx_schedule); + } + } +} + +tvm::Schedule CreateSchedule(const tvm::Array& outs, + NupharCodeGenCtx& ctx_codegen) { + // Create scheudule object + tvm::Array out_ops; + for (auto& t : outs) { + out_ops.push_back(t->op); + } + + if (codegen::CodeGenSettings::Instance().HasOption(codegen::CodeGenSettings::kCodeGenDumpSchedule)) + ctx_codegen.GetCodeGenHandle()->schedule_builder->DumpAllSchedulers(); + + tvm_codegen::ScheduleContext ctx_schedule(out_ops); + + // 
Schedule all outputs + for (const auto& t : outs) { + const Node* node = ctx_codegen.FindNode(t); + Traverse(t, node, ctx_codegen, ctx_schedule); + } + + return ctx_schedule.schedule; +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/nuphar_schedule_builder.h b/onnxruntime/core/providers/nuphar/compiler/nuphar_schedule_builder.h new file mode 100644 index 0000000000000..de4631b154afe --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/nuphar_schedule_builder.h @@ -0,0 +1,20 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once + +#include +#include "core/common/common.h" +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" + +// TODO change name space +namespace onnxruntime { +namespace nuphar { + +// CreateSchedule iterates a tvm::Array of output tensors +// and builds the whole schedule (in ScheduleContext) +tvm::Schedule CreateSchedule(const tvm::Array& outs, + NupharCodeGenCtx& ctx_codegen); + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/traverse_shape_infer.cc b/onnxruntime/core/providers/nuphar/compiler/traverse_shape_infer.cc new file mode 100644 index 0000000000000..4a20343113a07 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/traverse_shape_infer.cc @@ -0,0 +1,128 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/traverse_shape_infer.h" + +#include "core/codegen/common/common.h" +#include "core/common/common.h" +#include "core/framework/tensorprotoutils.h" + +// TODO retire this file + +namespace onnxruntime { +namespace nuphar { + +// local shape inference function for input +static bool CreateInput(const NodeArg* def, + const GraphViewer& graph, + ShapeExpr& input, + bool initializer_only) { + if (initializer_only && graph.GetAllInitializedTensors().count(def->Name()) == 0) + return false; + + auto def_shape = def->Shape(); + if (!def_shape) + return false; + + int rank = def_shape->dim_size(); + input = ShapeExpr(rank); + for (int i = 0; i < rank; ++i) { + const auto& dim = def_shape->dim()[i]; + if (dim.has_dim_value()) + input[i] = DimExpr(dim.dim_value()); + else if (dim.has_dim_param()) + input[i] = DimExpr(dim.dim_param()); + else { + input[i] = DimExpr(NormalizeNodeArgName(def) + "_dim" + std::to_string(i)); + } + } + return true; +} + +// local shape inference function for output
static Status CreateOutputs( + const Node* node, + const std::vector& inputs, + std::vector& outputs) { + outputs.resize(node->OutputDefs().size()); + node->ForEachWithIndex( + node->OutputDefs(), + [&](const NodeArg& def, size_t index) { + auto shape_proto = def.Shape(); + if (shape_proto) { + TensorShape shape{utils::GetTensorShapeFromTensorShapeProto(*shape_proto)}; + ShapeExpr output_shape(shape.NumDimensions()); + for (int d = 0; d < gsl::narrow(shape.NumDimensions()); ++d) { + if (shape[d] > 0) { + output_shape[d] = DimExpr(shape[d]); + } else { + ORT_RETURN_IF_NOT(shape_proto->dim_size() > d && shape_proto->dim(d).has_dim_param()); + output_shape[d] = DimExpr(shape_proto->dim(d).dim_param()); + } + } + outputs[index] = output_shape; + } + return Status::OK(); + }); + return Status::OK(); +} + +// The main function for shape inference +Status ShapeInference( + const GraphViewer& graph, + ShapeExprContext& context) { + // build graph inputs + const auto& graph_inputs = graph.GetInputs(); +
for (size_t i = 0; i < graph_inputs.size(); ++i) { + ShapeExpr value; + if (CreateInput(graph_inputs[i], graph, value, /*initializer_only*/ false)) { + context.inputs.emplace(graph_inputs[i]->Name(), std::move(value)); + } + } + + // perform shape inference using the topological order from ORT + for (const NodeIndex& node_index : graph.GetNodesInTopologicalOrder()) { + const Node& node = *graph.GetNode(node_index); + // initializers + node.ForEachWithIndex( + node.InputDefs(), + [&graph, &context](const NodeArg& def, size_t) { + ShapeExpr value; + if (CreateInput(&def, graph, value, /*initializer_only*/ true)) { + context.inputs.emplace(def.Name(), std::move(value)); + } + return Status::OK(); + }); + + // handle subgraph + const Graph* subgraph = GetSubgraph(node); + if (nullptr != subgraph) { + GraphViewer subgraph_viewer(*subgraph); + ShapeInference(subgraph_viewer, context); + } + + // collect inputs before creating outputs + std::vector inputs; + for (const NodeArg* def : node.InputDefs()) { + inputs.push_back(def->Exists() ? context.Lookup(def) : nullptr); + } + + // create outputs + std::vector op_outputs; + ORT_RETURN_IF_ERROR(CreateOutputs(&node, inputs, op_outputs)); + context.ops.emplace(&node, std::move(op_outputs)); + + // recall input_from_ + node.ForEachWithIndex( + node.OutputDefs(), + [&node, &context](const NodeArg& def, size_t index) { + context.input_from.emplace(def.Name(), std::make_pair(&node, index)); + return Status::OK(); + }); + } + + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/traverse_shape_infer.h b/onnxruntime/core/providers/nuphar/compiler/traverse_shape_infer.h new file mode 100644 index 0000000000000..deaa5777a3c66 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/traverse_shape_infer.h @@ -0,0 +1,49 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
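CreateInput above turns a NodeArg's shape proto into a ShapeExpr: a known dim_value becomes a constant dimension, a dim_param keeps its symbolic name, and an unnamed dynamic dimension gets a generated name. A small sketch of that mapping under simplified assumptions; SimpleDim and MakeShape are hypothetical stand-ins for DimExpr/ShapeExpr, not the real classes.

#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for DimExpr: either a constant or a symbol.
struct SimpleDim {
  long long value = -1;   // >= 0 when the dimension is known
  std::string symbol;     // non-empty when the dimension is symbolic
};

// Mirrors the CreateInput mapping: known dims become constants, named dynamic
// dims keep their dim_param, unnamed ones get a generated "<arg>_dim<i>" name.
std::vector<SimpleDim> MakeShape(const std::vector<std::pair<long long, std::string>>& proto_dims,
                                 const std::string& arg_name) {
  std::vector<SimpleDim> shape;
  for (size_t i = 0; i < proto_dims.size(); ++i) {
    const auto& d = proto_dims[i];
    if (d.first >= 0)
      shape.push_back({d.first, ""});
    else if (!d.second.empty())
      shape.push_back({-1, d.second});
    else
      shape.push_back({-1, arg_name + "_dim" + std::to_string(i)});
  }
  return shape;
}

int main() {
  // e.g. an input "X" with shape ["batch", 128, <unknown>]
  auto s = MakeShape({{-1, "batch"}, {128, ""}, {-1, ""}}, "X");
  assert(s[0].symbol == "batch" && s[1].value == 128 && s[2].symbol == "X_dim2");
  return 0;
}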
+ +#pragma once + +#include "core/providers/nuphar/common/analysis/shape_expr.h" +#include "core/common/common.h" +#include "core/framework/tensor.h" +#include "core/graph/graph_viewer.h" + +namespace onnxruntime { +namespace nuphar { + +// A collection of ShapeExpr +struct ShapeExprContext { + std::map inputs; + std::map> ops; + std::map> input_from; + + const ShapeExpr* Lookup(const NodeArg* def) const { + const std::string& def_name = def->Name(); + auto iter = inputs.find(def_name); + if (iter != inputs.end()) + return &(iter->second); + + auto iter_out_index = input_from.find(def_name); + + // OK if shape inference is incomplete + // This is for some per-node unit test where NodeArg does not even have shape ranks + // We ignore the shape inference in ToCapacity computation in per-node unit tests + if (iter_out_index == input_from.end()) + return nullptr; + + const Node* from_node = iter_out_index->second.first; + size_t index = iter_out_index->second.second; + auto iter_op = ops.find(from_node); + ORT_ENFORCE(iter_op != ops.end()); + return &(iter_op->second[index]); + } +}; + +// Traverse function traverses a GraphViewer, +// performs shape infernce, +// and builds ShapeExpr in ShapeExprContext +Status ShapeInference(const GraphViewer& graph, + ShapeExprContext& context); + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h new file mode 100644 index 0000000000000..2d2c55b17c169 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h @@ -0,0 +1,64 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#pragma once +#include "core/codegen/passes/utils/codegen_context.h" +#include "core/codegen/passes/op_ir_creator/tvm_op_creator.h" + +namespace onnxruntime { +namespace nuphar { + +// Declare a TVM IR builder based on the ORT OP type +// with postfix NupharTVMX86 +#define DECLARE_NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(OP) \ + DECLARE_OP_IR_CREATOR_CLASS_EX(OP, NupharTVM, X86) + +// Return a TVM IR builder class name such as OP type +// with postfix NupharTVMX86 +#define NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(OP) \ + OP_IR_CREATOR_CLASS_EX(OP, NupharTVM, X86) + +#define NUPHAR_TVM_X86_OP_IR_CREATOR_STRING(OP) \ + STRINGIZE(NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(OP)) + +#define LIST_X86_UNARY_OPS() \ + UNARY_OP(Erf) \ + UNARY_OP(Exp) \ + UNARY_OP(Log) \ + UNARY_OP(ParametricSoftplus) \ + UNARY_OP(ScaledTanh) \ + UNARY_OP(Selu) \ + UNARY_OP(Sigmoid) \ + UNARY_OP(Softplus) \ + UNARY_OP(Tanh) + +#define LIST_REDUCE_V_OPS() \ + REDUCE_V_OP(ReduceMax) \ + REDUCE_V_OP(ReduceMin) \ + REDUCE_V_OP(ReduceSum) + +#define LIST_ALL_X86_OPS() \ + LIST_REDUCE_V_OPS() \ + LIST_X86_UNARY_OPS() \ + ADD_OP_ITEM(Gemm) \ + ADD_OP_ITEM(LogSoftmax) \ + ADD_OP_ITEM(MatMul) \ + ADD_OP_ITEM(MatMulInteger) \ + ADD_OP_ITEM(MatMulInteger16) \ + ADD_OP_ITEM(Slice) \ + ADD_OP_ITEM(Softmax) \ + ADD_OP_ITEM(Tile) + +// Define all OPs for NupharTVMX86 +#define ADD_OP_ITEM(OP) DECLARE_NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(OP) +#define REDUCE_V_OP(OP) ADD_OP_ITEM(OP) +#define UNARY_OP(OP) ADD_OP_ITEM(OP) + +LIST_ALL_X86_OPS() + +#undef ADD_OP_ITEM +#undef REDUCE_V_OP +#undef UNARY_OP + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/gemm.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/gemm.cc new 
file mode 100644 index 0000000000000..5ac2adf738017 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/gemm.cc @@ -0,0 +1,52 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. +#include "core/codegen/mti/math/binary_ops.h" +#include "core/codegen/mti/math/gemm.h" +#include "core/framework/op_kernel_info.h" +#include "core/providers/common.h" +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" +#include "core/providers/nuphar/mti_x86/math/matmul_ops.h" + +namespace onnxruntime { +namespace nuphar { + +Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(Gemm)::Evaluate( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm::Array& outputs) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper info(&ctx); + + tvm::Tensor Y; + auto& A = inputs[0]; + auto& B = inputs[1]; + auto& C = inputs[2]; + int64_t trans_a, trans_b; + float alpha, beta; + ORT_RETURN_IF_ERROR(info.GetAttr("transA", &trans_a)); + ORT_RETURN_IF_ERROR(info.GetAttr("transB", &trans_b)); + ORT_RETURN_IF_ERROR(info.GetAttr("alpha", &alpha)); + ORT_RETURN_IF_ERROR(info.GetAttr("beta", &beta)); + + // use native sgemm for floating point + if (A->dtype == HalideIR::Float(32) && + B->dtype == HalideIR::Float(32) && + MatMulExternCpu(A, B, Y, !!trans_a, !!trans_b, node.Name() + "_gemm")) { + if (beta != 0) { + tvm::Tensor beta_bias = (beta == 1) ? C : tvm_codegen::Mul(tvm::make_const(tvm::Float(32), beta), C); + Y = tvm_codegen::Add((alpha == 1) ? Y : tvm_codegen::Mul(tvm::make_const(tvm::Float(32), alpha), Y), beta_bias, node.Name() + "_add_bias"); + } + outputs.push_back(Y); + return Status::OK(); + } + + // fallback to default MTI ops + Y = tvm_codegen::Gemm(A, B, C, trans_a, trans_b, alpha, beta, node.Name()); + outputs.push_back(Y); + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/logsoftmax.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/logsoftmax.cc new file mode 100644 index 0000000000000..aef32e3d3c81d --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/logsoftmax.cc @@ -0,0 +1,32 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
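When the extern sgemm path above is taken, alpha and beta are folded in afterwards as elementwise ops, so the result still matches the usual Gemm definition Y = alpha * (A x B) + beta * C, with the alpha == 1 and beta == 0/1 short-cuts skipping the redundant multiplies. A scalar reference sketch of that composition; GemmRef is illustrative only (row-major, no transpose).

#include <cassert>
#include <vector>

// Minimal scalar reference: Y = alpha * (A x B) + beta * C for 2-D inputs.
std::vector<float> GemmRef(const std::vector<float>& A, const std::vector<float>& B,
                           const std::vector<float>& C, int M, int K, int N,
                           float alpha, float beta) {
  std::vector<float> Y(M * N, 0.0f);
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
      // beta == 0 drops the bias term entirely, as in the creator above
      Y[m * N + n] = alpha * acc + (beta == 0.0f ? 0.0f : beta * C[m * N + n]);
    }
  return Y;
}

int main() {
  // 1x2 * 2x1 with alpha=2, beta=3: 2*(1*3 + 2*4) + 3*5 = 37
  auto Y = GemmRef({1, 2}, {3, 4}, {5}, 1, 2, 1, 2.0f, 3.0f);
  assert(Y[0] == 37.0f);
  return 0;
}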
+ +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +#include "core/providers/nuphar/mti_x86/math/logsoftmax.h" +#include "core/framework/op_kernel_info.h" +#include "core/providers/common.h" + +namespace onnxruntime { +namespace nuphar { + +// Evaluate of LogSoftmax OpIRCreator +Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(LogSoftmax)::Evaluate( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext&, + tvm::Array& outputs) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper info(&ctx); + + int64_t axis_i64; + ORT_RETURN_IF_ERROR(info.GetAttr("axis", &axis_i64)); + + axis_i64 = HandleNegativeAxis(axis_i64, gsl::narrow_cast(inputs[0]->shape.size())); + tvm::Tensor Y = nuphar::LogSoftmax(inputs[0], axis_i64); + outputs.push_back(Y); + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/matmul.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/matmul.cc new file mode 100644 index 0000000000000..e81ef497c50a8 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/matmul.cc @@ -0,0 +1,148 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" +#include "core/providers/nuphar/mti_x86/math/matmul_ops.h" +#include "core/codegen/mti/mti_tvm_utils.h" +#include "core/codegen/passes/weight_layout/transpose_2d.h" +#include "core/codegen/passes/weight_layout/vertical_stripes_2d.h" +#include "core/providers/nuphar/compiler/x86/x86_target_info.h" + +#include + +namespace onnxruntime { +namespace nuphar { + +// TODO: remove tvm core function + +// local helper functions + +static bool MatMul_weights2D( + ONNX_NAMESPACE::TensorProto_DataType proto_type, + const tvm::Tensor& A, + const tvm::Tensor& B, + const std::string& initializer_name, + NupharCodeGenCtx& ctx_codegen, + tvm::Tensor& Y, + const std::string& name = "matmul_weights2d") { + NupharCodeGenCtx* ctx_nuphar = Promote(&ctx_codegen); + + // optimizations for B being 2D weights + + // The 2D weight is marshalled with stripe_width. 
+ // This should be 2x nature vector width + int stripe_width = 8; + int block_size = 32; + + onnxruntime::CodeGenTargetX86* target = + dynamic_cast(ctx_codegen.GetCodeGenHandle()->codegen_target); + if (nullptr != target) { + stripe_width = 2 * target->NaturalVectorWidth(B->dtype.bits()); + } + + // align A, B to multiple of block size + const auto& A_shape = A->shape; + tvm::Expr A0_size = tvm_codegen::SizeToDimension(A_shape, -1); + auto A0_roundup = tvm_codegen::RoundUp(A0_size, block_size); + tvm::Expr A1_size = tvm_codegen::SizeFromDimension(A_shape, -1); + auto A1_roundup = tvm_codegen::RoundUp(A1_size, block_size); + bool A0_need_pad = !tvm::ir::Equal(A0_roundup, A0_size); + bool A1_need_pad = !tvm::ir::Equal(A1_roundup, A1_size); + + const auto& B_shape = B->shape; + tvm::Expr B0_size = tvm_codegen::SizeToDimension(B_shape, 1); + auto B0_roundup = tvm_codegen::RoundUp(B0_size, block_size); + tvm::Expr B1_size = tvm_codegen::SizeFromDimension(B_shape, 1); + auto B1_roundup = tvm_codegen::RoundUp(B1_size, block_size); + bool B1_need_pad = !tvm::ir::Equal(B1_roundup, B1_size); + + ORT_ENFORCE(tvm::ir::Equal(A1_roundup, B0_roundup)); + + // Currently only support padding in B1, as it's free with memory marshalling + if (A0_need_pad || A1_need_pad || B1_need_pad) + return false; + + auto layout_key = tvm_codegen::WeightLayoutVerticalStripe2D::GetKey(proto_type, stripe_width); + auto B_unmarshalled = ctx_nuphar->ApplyWeightLayout(layout_key, initializer_name, B, false); + + ORT_ENFORCE(B_unmarshalled->op.as()); + + tvm::Array Y_shape; + for (size_t d = 0; d < A->shape.size() - 1; ++d) + Y_shape.push_back(A->shape[d]); + Y_shape.push_back(B->shape[1]); + + auto k = tvm::reduce_axis(tvm::Range(0, A1_size), "k"); + Y = tvm::compute( + Y_shape, + [&](const tvm::Array& idx) { + tvm::Array A_indices; + for (size_t d = 0; d < idx.size() - 1; ++d) + A_indices.push_back(idx[d]); + A_indices.push_back(k); + return tvm::sum(A(A_indices) * B_unmarshalled(k, idx[idx.size() - 1]), {k}); + }, + name); + + return true; +} + +static bool MatMulF32ExternCpuEx( + ONNX_NAMESPACE::TensorProto_DataType proto_type, + NupharCodeGenCtx& ctx_nuphar, + const tvm::Tensor& A, + const tvm::Tensor& B, + tvm::Tensor& Y, + const std::string& B_initializer_name = "", + bool trans_a = false, + bool trans_b = false, + const std::string& name = "matmul_extern_cpu_ex") { + // transpose weights if not already + tvm::Tensor actual_B = B; + + if (ctx_nuphar.IsInitializer(B_initializer_name) && !trans_b) { + auto layout_key = tvm_codegen::WeightLayoutTranspose2D::GetKey(proto_type); + actual_B = ctx_nuphar.ApplyWeightLayout(layout_key, B_initializer_name, B, true); + trans_b = true; + } + + return nuphar::MatMulExternCpu(A, actual_B, Y, trans_a, trans_b, name); +} + +Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(MatMul)::Evaluate( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm::Array& outputs) { + NupharCodeGenCtx* ctx_nuphar = Promote(&ctx_codegen); + + auto proto_type = TensorProtoDataType(node.InputDefs()[1]); + + tvm::Tensor Y; + auto& A = inputs[0]; + auto& B = inputs[1]; + const std::string& input_1_name = node.InputDefs()[1]->Name(); + + if (A->dtype == HalideIR::Float(32) && + B->dtype == HalideIR::Float(32) && + MatMulF32ExternCpuEx(proto_type, *ctx_nuphar, A, B, Y, input_1_name)) { + outputs.push_back(Y); + return Status::OK(); + } + + if (ShapeRank(node.InputDefs()[1]) == 2 && ctx_nuphar->IsInitializer(input_1_name)) { + if (MatMul_weights2D(proto_type, A, B, 
input_1_name, *ctx_nuphar, Y)) { + outputs.push_back(Y); + return Status::OK(); + } + } + + Y = nuphar::MatMul(A, B, node.Name()); + outputs.push_back(Y); + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/quantize/matmul_integer.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/quantize/matmul_integer.cc new file mode 100644 index 0000000000000..1fbf3516c46f1 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/quantize/matmul_integer.cc @@ -0,0 +1,126 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +#include "core/codegen/mti/math/binary_ops.h" +#include "core/codegen/mti/math/matmul_ops.h" +#include "core/codegen/mti/mti_tvm_utils.h" +#include "core/codegen/mti/tensor/cast_ops.h" +#include "core/codegen/mti/tensor/reshape_ops.h" +#include "core/codegen/mti/tensor/transpose.h" +#include "core/codegen/passes/weight_layout/transpose_2d.h" +#include "core/common/cpuid_info.h" // TODO: refactor to control through config +#include "core/providers/nuphar/common/nuphar_settings.h" +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" +#include "core/providers/nuphar/mti_x86/quantize/imatmul_extern.h" +#include "core/providers/nuphar/mti_x86/quantize/imatmul16_extern.h" + +namespace onnxruntime { +namespace nuphar { + +// Evaluate of MatMulInteger or MatMulInteger16 +static Status EvaluateMatMulInteger( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm::Array& outputs) { + NupharCodeGenCtx* ctx_nuphar = Promote(&ctx_codegen); + + const auto& A = inputs[0]; + const auto& B = inputs[1]; + auto& name = node.Name(); + + if (B->shape.size() == 2) { + const int64_t* p_input_dim = tvm::as_const_int(B->shape[0]); + const int64_t* p_embed_dim = tvm::as_const_int(B->shape[1]); + + if (p_input_dim != nullptr && p_embed_dim != nullptr) { + int64_t input_dim = *p_input_dim; + int64_t embed_dim = *p_embed_dim; + + bool is16bitSymm = (A->dtype == HalideIR::type_of() && + B->dtype == HalideIR::type_of()); + bool is8bitAsymm = (A->dtype == HalideIR::type_of() && + B->dtype == HalideIR::type_of()); + + if (is16bitSymm || is8bitAsymm) { + auto A_rank = gsl::narrow_cast(A->shape.size()); + + tvm::Array output_shape; + for (int i = 0; i < A_rank - 1; ++i) { + output_shape.push_back(A->shape[i]); + } + output_shape.push_back(tvm::Expr(gsl::narrow_cast(embed_dim))); + + tvm::Tensor B_marshalled; + auto B_NodeArg = node.InputDefs()[1]; + const std::string& B_name = B_NodeArg->Name(); + + if (ctx_nuphar->IsInitializer(B_name)) { + auto layout_key = tvm_codegen::WeightLayoutTranspose2D::GetKey(TensorProtoDataType(B_NodeArg)); + B_marshalled = ctx_nuphar->ApplyWeightLayout(layout_key, B_name, B, true); + } else { + B_marshalled = tvm_codegen::Transpose(B, {1, 0}); + } + + // TODO: add reserved_bits attribute + bool use_AVX2; + const codegen::CodeGenSettings& settings = codegen::CodeGenSettings::Instance(); + if (settings.HasOption(kNupharIMatMulForceMkl)) { + use_AVX2 = false; + } else { + use_AVX2 = CPUIDInfo::GetCPUIDInfo().HasAVX2(); + } + auto output_tensor = + is16bitSymm ? use_AVX2 ? 
IMatMul16ExternAVX2(B_marshalled, A, + output_shape, input_dim, embed_dim, + name + "_IMatMul16ExternAVX2") + : IMatMul16ExternMKL(B_marshalled, A, + output_shape, input_dim, embed_dim, + name + "_IMatMul16ExternMKL") + : use_AVX2 ? IMatMulExternAVX2(B_marshalled, A, + output_shape, input_dim, embed_dim, + name + "_IMatMulExternAVX2") + : IMatMulExternMKL(B_marshalled, A, + output_shape, input_dim, embed_dim, + name + "_IMatMulExternMKL"); + + outputs.push_back(output_tensor); + return Status::OK(); + } + } + } + // slow path, cast to int32 for now + // Support skipped trailing inputs + auto A_Int32 = (node.InputDefs().size() >= 3 && node.InputDefs()[2]->Exists()) + ? tvm_codegen::Sub(tvm_codegen::Cast(A, HalideIR::Int(32)), tvm_codegen::Cast(inputs[2], HalideIR::Int(32))) + : tvm_codegen::Cast(A, HalideIR::Int(32)); + auto B_Int32 = (node.InputDefs().size() >= 4 && node.InputDefs()[3]->Exists()) + ? tvm_codegen::Sub(tvm_codegen::Cast(B, HalideIR::Int(32)), tvm_codegen::Cast(inputs[3], HalideIR::Int(32))) + : tvm_codegen::Cast(B, HalideIR::Int(32)); + tvm::Tensor Y = tvm_codegen::MatMul(A_Int32, B_Int32, name); + outputs.push_back(Y); + return Status::OK(); +} + +// Evaluate of MatMulInteger OpIRCreator +Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(MatMulInteger)::Evaluate( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm::Array& outputs) { + return EvaluateMatMulInteger(inputs, node, ctx_codegen, outputs); +} + +// Evaluate of MatMulInteger16 OpIRCreator +Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(MatMulInteger16)::Evaluate( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm::Array& outputs) { + return EvaluateMatMulInteger(inputs, node, ctx_codegen, outputs); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/reduce_ops.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/reduce_ops.cc new file mode 100644 index 0000000000000..5dee08e0b51ca --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/reduce_ops.cc @@ -0,0 +1,169 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
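The slow path above implements the standard MatMulInteger semantics: widen both operands to int32, subtract the optional zero points (inputs 2 and 3) when present, then multiply-accumulate. A scalar reference sketch; MatMulIntegerRef is illustrative only and assumes per-tensor zero points.

#include <cassert>
#include <cstdint>
#include <vector>

// Scalar reference for the slow path: (A - a_zero) x (B - b_zero) in int32.
std::vector<int32_t> MatMulIntegerRef(const std::vector<uint8_t>& A, const std::vector<int8_t>& B,
                                      int M, int K, int N, int32_t a_zero, int32_t b_zero) {
  std::vector<int32_t> Y(M * N, 0);
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        Y[m * N + n] += (static_cast<int32_t>(A[m * K + k]) - a_zero) *
                        (static_cast<int32_t>(B[k * N + n]) - b_zero);
  return Y;
}

int main() {
  // 1x2 * 2x1 with zero points 128 and 0: (130-128)*3 + (131-128)*(-4) = -6
  auto Y = MatMulIntegerRef({130, 131}, {3, -4}, 1, 2, 1, 128, 0);
  assert(Y[0] == -6);
  return 0;
}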
+ +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +#include "core/providers/nuphar/mti_x86/math/reduce_ops.h" +#include "core/framework/op_kernel_info.h" +#include "core/providers/common.h" + +#include // for sort + +namespace onnxruntime { +namespace nuphar { + +using ReduceVFunc = tvm::Tensor (*)(const tvm::Tensor& X, + const std::vector& axes, + bool keep_dims, + int32_t vector_size, + bool last_dim_aligned, + int32_t fuse_dim, + const std::string& name); + +// This function gives a proper vector width and fuse dim for reduce +// It avoids vector_width larger than shape +// Fuse dim implies mulitple reduce axis could be fused together to form a longer vector_width +// It can avoid too small vector_width +static std::tuple VectorWidthAndFuseDimForReduce(int natural_width, + std::vector axes, + const NodeArg* def) { + int64_t rank = ShapeRank(def); + if (rank == 0) { + return std::make_tuple(1, 0); + } + + int tail_size = 1; + + // reduce all + if (axes.size() == 0) { + for (int i = gsl::narrow_cast(rank) - 1; i >= 0; --i) { + if (ShapeHasValue(def, i)) { + tail_size *= gsl::narrow_cast(ShapeValue(def, i)); + } else { + if (i > 0) + return std::make_tuple(tail_size, i - 1); + else + return std::make_tuple(natural_width, 0); + } + + if (tail_size >= natural_width) { + return std::make_tuple(natural_width, i); + } + } + + return std::make_tuple(tail_size, 0); + } + + //reduce last + int j = axes.size() - 1; + if (axes.back() == (rank - 1)) { + for (int i = gsl::narrow_cast(rank) - 1; i >= 0; --i) { + if (ShapeHasValue(def, i) && axes[j] == gsl::narrow_cast(i)) { + tail_size *= gsl::narrow_cast(ShapeValue(def, i)); + if (j > 0) + --j; + } else { + if (i > 0) { + return std::make_tuple(tail_size, i - 1); + } else { + return std::make_tuple(natural_width, 0); + } + } + + if (tail_size >= natural_width) { + return std::make_tuple(natural_width, i); + } + } + + return std::make_tuple(tail_size, 0); + } + + // reduce other + for (int i = gsl::narrow_cast(rank) - 1; i >= 0; --i) { + if (ShapeHasValue(def, i) && axes[j] != gsl::narrow_cast(i)) { + tail_size *= gsl::narrow_cast(ShapeValue(def, i)); + if (j > 0) + --j; + } else { + if (i > 0) + return std::make_tuple(tail_size, i - 1); + else + return std::make_tuple(natural_width, 0); + } + + if (tail_size >= natural_width) { + return std::make_tuple(natural_width, i); + } + } + + return std::make_tuple(tail_size, 0); +} + +class FuncReduceV { + public: + FuncReduceV(const Node& node, + ReduceVFunc func, + std::function natural_vector, + const NodeArg* def, + const std::string& name) : def_(def) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper info(&ctx); + axes_ = info.GetAttrsOrDefault("axes"); + std::sort(axes_.begin(), axes_.end()); //ReduceV requires sorted axes + int64_t keepdims_i = 1; + ORT_ENFORCE(info.GetAttr("keepdims", &keepdims_i).IsOK()); + keep_dims_ = (keepdims_i == 1); + func_ = func; + name_ = node.Name() + "_" + name; + natural_vector_ = natural_vector; + } + + tvm::Tensor operator()(const tvm::Tensor& X) const { + std::vector axes; + for (auto i : axes_) { + axes.push_back(HandleNegativeAxis(i, gsl::narrow_cast(X->shape.size()))); + } + + auto p = VectorWidthAndFuseDimForReduce(natural_vector_(X->dtype.bits()), axes, def_); + int vector_width = std::get<0>(p); + int fuse_dim = std::get<1>(p); + + bool last_dim_aligned = false; + const int64_t* p_last_dim_size = tvm::as_const_int(X->shape[X->shape.size() - 1]); + + if (p_last_dim_size != nullptr) { + last_dim_aligned = (*p_last_dim_size) % 
vector_width == 0; + } + + return func_(X, axes, keep_dims_, vector_width, last_dim_aligned, fuse_dim, name_); + } + + private: + std::vector axes_; + bool keep_dims_; + ReduceVFunc func_; + std::string name_; + std::function natural_vector_; + const NodeArg* def_; +}; + +#define REDUCE_V_OP(name) \ + Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(name)::Evaluate( \ + const tvm::Array& inputs, \ + const Node& node, \ + tvm_codegen::CodeGenContext& ctx_codegen, \ + tvm::Array& outputs) { \ + auto natural_vector = [&](int bits) { \ + return ctx_codegen.GetCodeGenHandle()->codegen_target->NaturalVectorWidth(bits); \ + }; \ + tvm::Tensor Y = FuncReduceV(node, &nuphar::name, natural_vector, node.InputDefs()[0], #name)(inputs[0]); \ + outputs.push_back(Y); \ + return Status::OK(); \ + } + +LIST_REDUCE_V_OPS() + +#undef REDUCE_V_OP + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/softmax.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/softmax.cc new file mode 100644 index 0000000000000..28efe2cf1c0f8 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/softmax.cc @@ -0,0 +1,32 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +#include "core/providers/nuphar/mti_x86/math/softmax.h" +#include "core/framework/op_kernel_info.h" +#include "core/providers/common.h" + +namespace onnxruntime { +namespace nuphar { + +// Evaluate of Softmax OpIRCreator +Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(Softmax)::Evaluate( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext&, + tvm::Array& outputs) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper info(&ctx); + + int64_t axis_i64; + ORT_RETURN_IF_ERROR(info.GetAttr("axis", &axis_i64)); + + axis_i64 = HandleNegativeAxis(axis_i64, gsl::narrow_cast(inputs[0]->shape.size())); + tvm::Tensor Y = Softmax(inputs[0], axis_i64); + outputs.push_back(Y); + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/unary_ops.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/unary_ops.cc new file mode 100644 index 0000000000000..79cf4ecbd38cd --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/math/unary_ops.cc @@ -0,0 +1,124 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
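VectorWidthAndFuseDimForReduce walks from the innermost dimension, accumulating the tail extent until it covers the natural vector width; that dimension becomes the fuse dim. A simplified sketch of the reduce-all branch only, assuming every dimension is statically known; WidthAndFuseDim is illustrative, not the full function.

#include <cassert>
#include <tuple>
#include <vector>

// Simplified reduce-all case: grow the tail from the innermost dim and stop
// once it covers the natural vector width; that dim is the fuse dim.
std::tuple<int, int> WidthAndFuseDim(int natural_width, const std::vector<int>& shape) {
  int tail = 1;
  for (int i = static_cast<int>(shape.size()) - 1; i >= 0; --i) {
    tail *= shape[i];
    if (tail >= natural_width) return std::make_tuple(natural_width, i);
  }
  return std::make_tuple(tail, 0);
}

int main() {
  // natural width 16, shape [4, 8, 8]: dim 2 gives tail 8, dim 1 gives 64 >= 16
  assert(WidthAndFuseDim(16, {4, 8, 8}) == std::make_tuple(16, 1));
  // shape smaller than the vector width: fall back to the whole fused extent
  assert(WidthAndFuseDim(16, {2, 3}) == std::make_tuple(6, 0));
  return 0;
}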
+ +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +#include "core/codegen/common/op_macro.h" +#include "core/framework/op_kernel_info.h" +#include "core/providers/nuphar/mti_x86/math/unary_ops.h" + +namespace onnxruntime { +namespace nuphar { + +// helper class for unary_ops with alpha +class FuncWithAlpha { + public: + FuncWithAlpha(const Node& node) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper attrs(&ctx); + ORT_ENFORCE(attrs.GetAttr("alpha", &alpha_).IsOK()); + } + + protected: + float alpha_; +}; + +// helper class for unary_ops with alpha and beta +class FuncWithAlphaBeta { + public: + FuncWithAlphaBeta(const Node& node) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper attrs(&ctx); + ORT_ENFORCE(attrs.GetAttr("alpha", &alpha_).IsOK()); + ORT_ENFORCE(attrs.GetAttr("beta", &beta_).IsOK()); + } + + protected: + float alpha_; + float beta_; +}; + +// helper class for unary_ops with alpha and gamma +class FuncWithAlphaGamma { + public: + FuncWithAlphaGamma(const Node& node) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper attrs(&ctx); + ORT_ENFORCE(attrs.GetAttr("alpha", &alpha_).IsOK()); + ORT_ENFORCE(attrs.GetAttr("gamma", &gamma_).IsOK()); + } + + protected: + float alpha_; + float gamma_; +}; + +// helper macro declares unary_ops helper class without attribute +#define FuncClass(name) \ + class Func##name { \ + public: \ + Func##name(const Node&) {} \ + tvm::Tensor operator()(const tvm::Tensor& X) const { \ + return name(X); \ + } \ + } + +// helper macro declares unary_ops helper class with alpha +#define FuncClassAlpha(name) \ + class Func##name : public FuncWithAlpha { \ + public: \ + Func##name(const Node& node) : FuncWithAlpha(node) {} \ + tvm::Tensor operator()(const tvm::Tensor& X) const { \ + return name(X, alpha_); \ + } \ + } + +// helper macro declares unary_ops helper class with alpha and beta +#define FuncClassAlphaBeta(name) \ + class Func##name : public FuncWithAlphaBeta { \ + public: \ + Func##name(const Node& node) : FuncWithAlphaBeta(node) {} \ + tvm::Tensor operator()(const tvm::Tensor& X) const { \ + return name(X, alpha_, beta_); \ + } \ + } + +// helper macro declares unary_ops helper class with alpha and gamma +#define FuncClassAlphaGamma(name) \ + class Func##name : public FuncWithAlphaGamma { \ + public: \ + Func##name(const Node& node) : FuncWithAlphaGamma(node) {} \ + tvm::Tensor operator()(const tvm::Tensor& X) const { \ + return name(X, alpha_, gamma_); \ + } \ + } + +FuncClass(Erf); +FuncClass(Exp); +FuncClass(Log); +FuncClassAlphaBeta(ParametricSoftplus); +FuncClassAlphaBeta(ScaledTanh); +FuncClassAlphaGamma(Selu); +FuncClass(Sigmoid); +FuncClass(Softplus); +FuncClass(Tanh); + +// helper macro defines Evaluate of UNARY_OP OpIRCreators +#define UNARY_OP(name) \ + Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(name)::Evaluate( \ + const tvm::Array& inputs, \ + const Node& node, \ + tvm_codegen::CodeGenContext&, \ + tvm::Array& outputs) { \ + tvm::Tensor Y = Func##name(node)(inputs[0]); \ + outputs.push_back(Y); \ + return Status::OK(); \ + } + +// helper local macros to replace some calls in LIST_UNARY_OPS +LIST_X86_UNARY_OPS() + +#undef UNARY_OP + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/tensor/slice.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/tensor/slice.cc new file mode 100644 index 0000000000000..d303f2f6411d7 --- /dev/null +++ 
b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/tensor/slice.cc @@ -0,0 +1,72 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +#include "core/codegen/mti/tensor/tile.h" +#include "core/framework/op_kernel_info.h" +#include "core/providers/common.h" +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" + +namespace onnxruntime { +namespace tvm_codegen { + +// Forwarding +Status SliceCommon(const tvm::Array& inputs, + const Node& node, + tvm::Array& outputs, + const std::vector& starts, + const std::vector& ends, + const std::vector& axes); + +} // namespace tvm_codegen + +namespace nuphar { + +// Evaluate of Slice OpIRCreator +Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(Slice)::Evaluate( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm::Array& outputs) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper info(&ctx); + NupharCodeGenCtx* ctx_nuphar = Promote(&ctx_codegen); + + std::vector> slice_params; + int version = ctx_codegen.GetCodeGenHandle()->domain_version_lookup_func(node.Domain()); + if (version <= 9) { + std::vector starts, ends, axes; + ORT_RETURN_IF_ERROR(info.GetAttrs("starts", starts)); + ORT_RETURN_IF_ERROR(info.GetAttrs("ends", ends)); + ORT_RETURN_IF_NOT(starts.size() == ends.size()); + axes = info.GetAttrsOrDefault("axes"); + slice_params.push_back(starts); + slice_params.push_back(ends); + slice_params.push_back(axes); + } else { + // for opset 10 Slice, input 1/2/3/4 are starts/ends/axes/steps + // while axes and steps are optional + ORT_ENFORCE(node.InputDefs().size() < 5, "Slice opset 10: steps is not supported yet"); + for (size_t i = 1; i < 4; ++i) { + if (i < node.InputDefs().size()) { + const auto* tensor = ctx_nuphar->GetOrtInitializerTensor(node.InputDefs()[i]->Name()); + if (tensor) { + if (tensor->DataType() == DataTypeImpl::GetType()) { + const int64_t* data = tensor->Data(); + slice_params.push_back(std::vector(data, data + tensor->Shape().Size())); + } else { + const int32_t* data = tensor->Data(); + slice_params.push_back(std::vector(data, data + tensor->Shape().Size())); + } + continue; + } + } + slice_params.push_back(std::vector()); + } + } + return tvm_codegen::SliceCommon(inputs, node, outputs, slice_params[0], slice_params[1], slice_params[2]); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/tensor/tile.cc b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/tensor/tile.cc new file mode 100644 index 0000000000000..2841206eea8c0 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/op_ir_creator/tensor/tile.cc @@ -0,0 +1,34 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
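For opset <= 9 the starts/ends/axes come from node attributes; from opset 10 onward they are read from initializer inputs 1-3, and steps are rejected above. Either way the parameters feed the same slicing semantics, sketched here in 1-D; Slice1D is illustrative only and ignores steps and the full ONNX clamping rules.

#include <cassert>
#include <vector>

// 1-D sketch of the semantics forwarded to SliceCommon:
// copy the elements in [start, end) with negative indices counted from the end.
std::vector<int> Slice1D(const std::vector<int>& x, long long start, long long end) {
  long long n = static_cast<long long>(x.size());
  if (start < 0) start += n;          // negative index -> offset from the end
  if (end < 0) end += n;
  if (end > n) end = n;               // clamp an oversized end to the extent
  std::vector<int> y;
  for (long long i = start; i < end; ++i) y.push_back(x[static_cast<size_t>(i)]);
  return y;
}

int main() {
  std::vector<int> x{0, 1, 2, 3, 4};
  assert((Slice1D(x, 1, 4) == std::vector<int>{1, 2, 3}));
  assert((Slice1D(x, -3, 1000) == std::vector<int>{2, 3, 4}));  // negative start, clamped end
  return 0;
}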
+ +#include "core/providers/nuphar/compiler/x86/op_ir_creator/all_ops.h" + +#include "core/codegen/mti/tensor/tile.h" +#include "core/framework/op_kernel_info.h" +#include "core/providers/common.h" +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" + +namespace onnxruntime { +namespace nuphar { + +// Evaluate of Tile OpIRCreator +Status NUPHAR_TVM_X86_OP_IR_CREATOR_CLASS(Tile)::Evaluate( + const tvm::Array& inputs, + const Node& node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm::Array& outputs) { + ProtoHelperNodeContext ctx(node); + OpNodeProtoHelper info(&ctx); + NupharCodeGenCtx* ctx_nuphar = Promote(&ctx_codegen); + const auto* repeats = ctx_nuphar->GetOrtInitializerTensor(node.InputDefs()[1]->Name()); + ORT_RETURN_IF_NOT(repeats != nullptr); + ORT_RETURN_IF_NOT(repeats->Shape().Size() == gsl::narrow(inputs[0]->shape.size())); + const int64_t* repeats_data = repeats->Data(); + const auto repeats_vector = std::vector(repeats_data, repeats_data + inputs[0]->shape.size()); + tvm::Tensor Y = tvm_codegen::Tile(inputs[0], repeats_vector, node.Name() + "_Tile"); + outputs.push_back(Y); + return Status::OK(); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/analysis_schedule.cc b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/analysis_schedule.cc new file mode 100644 index 0000000000000..485aa328d75e2 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/analysis_schedule.cc @@ -0,0 +1,31 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h" + +#include "core/codegen/passes/scheduler/schedule_utils.h" + +namespace onnxruntime { +namespace nuphar { + +// This is for UseCount +bool TVM_SCHEDULER_CLASS(True, NupharX86UseCount)::Evaluate( + const tvm::Tensor& tensor, + const Node*, + tvm_codegen::CodeGenContext&, + tvm_codegen::ScheduleContext& ctx_sched) { + bool status_vec = TryVectorizationX86(tensor, ctx_sched); + bool status_r_and_c = tvm_codegen::InsertRootScheduleAndClosure(tensor, ctx_sched); + return status_vec || status_r_and_c; +} + +bool TVM_SCHEDULER_CLASS(False, NupharX86UseCount)::Evaluate( + const tvm::Tensor& tensor, + const Node*, + tvm_codegen::CodeGenContext&, + tvm_codegen::ScheduleContext& ctx_sched) { + return tvm_codegen::TryInlineSchedule(tensor, ctx_sched); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.cc b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.cc new file mode 100644 index 0000000000000..736c92ed4fc1c --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.cc @@ -0,0 +1,53 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
+ +#include "core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h" + +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" +#include "core/providers/nuphar/common/analysis/subgraph_codegen_stats.h" + +namespace onnxruntime { +namespace nuphar { + +tvm_codegen::Scheduler* SCHEDULE_DISPATCHER_CLASS(NupharX86UseCount):: + Find(const tvm::Tensor&, const Node* node, tvm_codegen::CodeGenContext& ctx) { + if (nullptr == node) + return nullptr; + + NupharCodeGenCtx* ctx_nuphar = Promote(&ctx); + bool reused = Promote(ctx_nuphar->GetGraphStats())->NodeUseCount(node) > 1; + bool cheap_node_reused = Promote(ctx_nuphar->GetGraphStats())->IsCheapNodeReuse(node); + + if (reused && cheap_node_reused) { + return DispatcherBase::Get("True"); + } + return DispatcherBase::Get("False"); +} + +tvm_codegen::Scheduler* SCHEDULE_DISPATCHER_CLASS(NupharX86PartialResult):: + Find(const tvm::Tensor&, const Node* node, tvm_codegen::CodeGenContext&) { + if (nullptr == node) + return DispatcherBase::Get("True"); + return nullptr; +} + +tvm_codegen::Scheduler* SCHEDULE_DISPATCHER_CLASS(NupharX86Tensorize):: + Find(const tvm::Tensor& tensor, const Node* node, tvm_codegen::CodeGenContext&) { + if (nullptr == node) + return nullptr; + + // special checking to bypass tensorization + // when fall back to extern function call + if (tensor->op->InputTensors().size() > 0) { + auto& imatmul = tensor->op->InputTensors()[0]; + auto extern_op = imatmul->op.as(); + // Extern function call + if (nullptr != extern_op) + return nullptr; + } + + return DispatcherBase::Get(node->OpType()); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h new file mode 100644 index 0000000000000..766d251235ea4 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h @@ -0,0 +1,41 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
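The UseCount dispatcher above picks between two schedulers: a node that is consumed more than once and passes the cheap-node-reuse check gets the "True" scheduler (compute_root plus vectorize, so the result is materialized once), everything else falls back to inlining. A minimal sketch of that decision; PickSchedule and SchedKind are hypothetical names, not part of the provider.

#include <cassert>

enum class SchedKind { RootAndVectorize, Inline };

// Mirrors the dispatcher: only reused nodes that pass the cheap-node-reuse
// check are materialized; single-use (or filtered) nodes are inlined.
SchedKind PickSchedule(bool reused_by_multiple_consumers, bool cheap_node_reuse_ok) {
  if (reused_by_multiple_consumers && cheap_node_reuse_ok) return SchedKind::RootAndVectorize;
  return SchedKind::Inline;
}

int main() {
  assert(PickSchedule(true, true) == SchedKind::RootAndVectorize);
  assert(PickSchedule(true, false) == SchedKind::Inline);
  assert(PickSchedule(false, true) == SchedKind::Inline);
  return 0;
}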
+ +#pragma once +#include "core/codegen/passes/scheduler/tvm_scheduler.h" +#include + +namespace onnxruntime { +namespace nuphar { + +DECLARE_SCHEDULE_DISPATCHER_CLASS(NupharX86UseCount) +DECLARE_SCHEDULE_DISPATCHER_CLASS(NupharX86PartialResult) +DECLARE_SCHEDULE_DISPATCHER_CLASS(NupharX86Tensorize) + +DECLARE_TVM_SCHEDULER_CLASS(Extern, NupharX86TVMRule) +DECLARE_TVM_SCHEDULER_CLASS(Reduce, NupharX86TVMRule) + +DECLARE_TVM_SCHEDULER_CLASS(MatMulInteger, NupharX86Tensorize) +DECLARE_TVM_SCHEDULER_CLASS(MatMulInteger16, NupharX86Tensorize) +DECLARE_TVM_SCHEDULER_CLASS(Softmax, NupharX86OrtOpType) +DECLARE_TVM_SCHEDULER_CLASS(Gemm, NupharX86OrtOpType) +DECLARE_TVM_SCHEDULER_CLASS(Conv, NupharX86OrtOpType) +DECLARE_TVM_SCHEDULER_CLASS(MatMul, NupharX86OrtOpType) +DECLARE_TVM_SCHEDULER_CLASS(Split, NupharX86OrtOpType) + +DECLARE_TVM_SCHEDULER_CLASS(True, NupharX86UseCount) +DECLARE_TVM_SCHEDULER_CLASS(False, NupharX86UseCount) + +DECLARE_TVM_SCHEDULER_CLASS(True, NupharX86PartialResult) + +// utilities +bool TryVectorizationX86( + const tvm::Tensor& tensor, + tvm_codegen::ScheduleContext& ctx); + +bool InputRootScheduleWithVectorizationX86( + const tvm::Tensor& tensor, + tvm_codegen::ScheduleContext& ctx); + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/ort_type_schedule.cc b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/ort_type_schedule.cc new file mode 100644 index 0000000000000..2acdd995826b3 --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/ort_type_schedule.cc @@ -0,0 +1,270 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h" + +#include "core/providers/nuphar/common/analysis/subgraph_codegen_stats.h" +#include "core/providers/nuphar/compiler/nuphar_codegen_ctx.h" +#include "core/codegen/passes/scheduler/schedule_utils.h" +#include "core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.h" +#include "core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.h" +#include "core/framework/op_kernel_info.h" +#include + +namespace onnxruntime { +namespace nuphar { + +bool TryVectorizationX86( + const tvm::Tensor& tensor, + tvm_codegen::ScheduleContext& ctx) { + // TODO change it to the value from Target + int64_t natural_vector_size = 16; + + return TryVectorization(tensor, natural_vector_size, ctx); +} + +bool InputRootScheduleWithVectorizationX86( + const tvm::Tensor& tensor, + tvm_codegen::ScheduleContext& ctx) { + bool status = false; + for (auto& t : tensor->op->InputTensors()) { + if (t->op->InputTensors().size() > 0) { + bool status_vec = TryVectorizationX86(t, ctx); + bool status_root = InsertRootSchedule(t, ctx); + status = status || status_root || status_vec; + } + } + return status; +} + +bool TVM_SCHEDULER_CLASS(Softmax, NupharX86OrtOpType)::Evaluate( + const tvm::Tensor& tensor, + const Node*, + tvm_codegen::CodeGenContext&, + tvm_codegen::ScheduleContext& ctx_sched) { + bool status_softmax_itself = TryInlineSchedule(tensor, ctx_sched); + + // compute root the exp since it is reused more than once + auto& tensor_exp = tensor->op->InputTensors()[0]; + bool status_vec = TryVectorizationX86(tensor_exp, ctx_sched); + bool status_root = InsertRootSchedule(tensor_exp, ctx_sched); + return status_softmax_itself || status_vec || status_root; +} + +bool TVM_SCHEDULER_CLASS(Split, NupharX86OrtOpType)::Evaluate( + 
const tvm::Tensor& tensor, + const Node*, + tvm_codegen::CodeGenContext&, + tvm_codegen::ScheduleContext& ctx_sched) { + auto& tensor_split_input = tensor->op->InputTensors()[0]; + // force inline for split since to avoid extra copy + bool status_split_itself = TryInlineSchedule(tensor, ctx_sched); + + // add root for split's inputs to avoid inline of the inputs + bool status_vec = TryVectorizationX86(tensor_split_input, ctx_sched); + bool status_input_root = InsertRootSchedule(tensor_split_input, ctx_sched); + return status_split_itself || status_vec || status_input_root; +} + +// Illustration purpose only for tensorization +static Status MatMulTensorization(const tvm::Tensor& tensor, + tvm_codegen::ScheduleContext& ctx) { + if (tensor->shape.size() != 2) + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Gemm output shape should be 2D"); + + // TODO: remove compute_root + InsertRootScheduleAndClosure(tensor, ctx); + +// Demo for Tensorization with llvm extern function +#if 1 + int32_t factor_int32 = 16; + NaiveLLVMExternGemvTensorization tensorization_method("NaiveLLVMExternGemv_Example", {factor_int32, factor_int32}); + + auto shape = tensorization_method.Shape(); + auto compute_op = tensor->op.as(); + auto xy = compute_op->axis; + auto x = xy[0]; + auto y = xy[1]; + auto z = compute_op->reduce_axis[0]; + + tvm::IterVar yo, yi; + ctx.schedule[tensor->op].split(y, shape[0], &yo, &yi); + tvm::IterVar zo, zi; + ctx.schedule[tensor->op].split(z, shape[1], &zo, &zi); + ctx.schedule[tensor->op].reorder({x, yo, zo, yi, zi}); + ctx.schedule[tensor->op].tensorize(yi, tensorization_method.CreateTensorIntrin()); + ctx.schedule[tensor->op].pragma(yo, "import_llvm", tensorization_method.LLVMImportDef()); +#endif + +// Demo for Tensorization with llvm intrisic IR +#if 0 + NaiveLLVMIRGemvTensorization tensorization_method("NaiveLLVMIRGemv_Example"); + + auto shape = tensorization_method.Shape(); + auto compute_op = tensor->op.as(); + auto xy = compute_op->axis; + auto x = xy[0]; + auto y = xy[1]; + auto z = compute_op->reduce_axis[0]; + + tvm::IterVar yo, yi; + ctx.schedule[tensor->op].split(y, shape[0], &yo, &yi); + tvm::IterVar zo, zi; + ctx.schedule[tensor->op].split(z, shape[1], &zo, &zi); + ctx.schedule[tensor->op].reorder({x, yo, zo, yi, zi}); + ctx.schedule[tensor->op].tensorize(yi, tensorization_method.CreateTensorIntrin()); +#endif + + return Status::OK(); +} + +// this is not tested in onnxruntime_test_all, since extern has higher priority +// don't register it +bool TVM_SCHEDULER_CLASS(Gemm, NupharX86OrtOpType)::Evaluate( + const tvm::Tensor& tensor, + const Node* node, + tvm_codegen::CodeGenContext&, + tvm_codegen::ScheduleContext& ctx_sched) { + ProtoHelperNodeContext ctx(*node); + OpNodeProtoHelper attrs(&ctx); + int64_t trans_A_64, trans_B_64; + bool status_a = attrs.GetAttr("transA", &trans_A_64).IsOK(); + ORT_ENFORCE(status_a); + bool status_b = attrs.GetAttr("transB", &trans_B_64).IsOK(); + ORT_ENFORCE(status_b); + + if (trans_A_64 == 0 && trans_B_64 == 1) { + return MatMulTensorization(tensor, ctx_sched).IsOK(); + } + return InsertRootSchedule(tensor, ctx_sched); +} + +// OLD code from Conv schedule +static Status ConvScheduleX86(const tvm::Tensor& tensor, + NupharCodeGenCtx& ctx_codegen, + tvm_codegen::ScheduleContext& ctx_sched, + int block_size) { + if (tensor->shape.size() != 4) + return ORT_MAKE_STATUS(ONNXRUNTIME, FAIL, "Conv output shape should be 4D"); + + InsertRootScheduleAndClosure(tensor, ctx_sched); + + auto compute_op = tensor->op.as(); + auto ncyx = compute_op->axis; + auto 
b = ncyx[0]; + auto oc = ncyx[1]; + auto y = ncyx[2]; + auto x = ncyx[3]; + auto ic = compute_op->reduce_axis[0]; + auto m = compute_op->reduce_axis[1]; + auto n = compute_op->reduce_axis[2]; + + tvm::Expr kfactor(4); // todo: this factor for vectorization is tuned for conv2d_performance on AVX2, will need to be addressed later + tvm::IterVar oc_chunk, oc_block; + ctx_sched.schedule[tensor->op].split(oc, kfactor, &oc_chunk, &oc_block); + + tvm::Expr factor(block_size); // factor for tiling and blocking + tvm::IterVar ic_chunk, ic_block; + ctx_sched.schedule[tensor->op].split(ic, factor, &ic_chunk, &ic_block); + + tvm::IterVar xo, xi; + ctx_sched.schedule[tensor->op].split(x, factor, &xo, &xi); + + ctx_sched.schedule[tensor->op].reorder({b, oc_chunk, y, xo, ic_chunk, m, n, ic_block, xi, oc_block}); + + if (ctx_codegen.GetCodeGenHandle()->enable_per_node_parallelized) { + tvm::Array fused_axis; + fused_axis.push_back(b); + fused_axis.push_back(oc_chunk); + fused_axis.push_back(y); + fused_axis.push_back(xo); + tvm::IterVar parallel_axis; + ctx_sched.schedule[tensor->op].fuse(fused_axis, ¶llel_axis); + ctx_sched.schedule[tensor->op].parallel(parallel_axis); + } + ctx_sched.schedule[tensor->op].vectorize(oc_block); + + return Status::OK(); +} + +bool TVM_SCHEDULER_CLASS(Conv, NupharX86OrtOpType)::Evaluate( + const tvm::Tensor& tensor, + const Node* node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm_codegen::ScheduleContext& ctx_sched) { + NupharCodeGenCtx* ctx_nuphar = Promote(&ctx_codegen); + return ConvScheduleX86(tensor, *ctx_nuphar, ctx_sched, 16).IsOK(); +} // namespace tvm_codegen + +// seems only tested in double path +static Status MatMul_2DWeight_Schedule( + const tvm::Tensor& tensor_C, + NupharCodeGenCtx& ctx_codegen, + tvm_codegen::ScheduleContext& ctx_sched, + int block_size) { + // implementation adapted from: + // https://docs.tvm.ai/tutorials/optimize/opt_gemm.html#sphx-glr-tutorials-optimize-opt-gemm-py + InsertRootScheduleAndClosure(tensor_C, ctx_sched); + + // write cache, note this needs to happen before any axis ops in tensor_C + auto CC = ctx_sched.schedule.cache_write(tensor_C, "global"); + + const auto& C_axis = tensor_C->op.as()->axis; + auto C_rank = C_axis.size(); + auto x = C_axis[C_rank - 2]; + auto y = C_axis[C_rank - 1]; + tvm::Expr block(block_size); + tvm::IterVar xo, yo, xi, yi; + ctx_sched.schedule[tensor_C->op].tile(x, y, block, block, &xo, &yo, &xi, &yi); + ctx_sched.schedule[CC->op].compute_at(ctx_sched.schedule[tensor_C->op], yo); + + // new inner axes + const auto& CC_axis = CC->op.as()->axis; + auto xc = CC_axis[C_rank - 2]; + auto yc = CC_axis[C_rank - 1]; + + constexpr int num_unrolls = 4; + auto split_factor = tvm::Expr(num_unrolls); + auto k = ctx_sched.schedule[CC->op]->op.as()->reduce_axis[0]; + tvm::IterVar ko, ki; + ctx_sched.schedule[CC->op].split(k, split_factor, &ko, &ki); + tvm::Array reordered_axis; + for (size_t d = 0; d < C_rank - 2; ++d) + reordered_axis.push_back(CC_axis[d]); + reordered_axis.push_back(ko); + reordered_axis.push_back(xc); + reordered_axis.push_back(ki); + reordered_axis.push_back(yc); + ctx_sched.schedule[CC->op].reorder(reordered_axis); + ctx_sched.schedule[CC->op].unroll(ki); + ctx_sched.schedule[CC->op].vectorize(yc); + + if (ctx_codegen.GetCodeGenHandle()->enable_per_node_parallelized) { + // parallelize + tvm::Array fused_axis; + for (size_t d = 0; d < C_rank - 2; ++d) + fused_axis.push_back(C_axis[d]); + fused_axis.push_back(xo); + tvm::IterVar fused_xo; + ctx_sched.schedule[tensor_C->op].fuse(fused_axis, 
&fused_xo); + ctx_sched.schedule[tensor_C->op].parallel(fused_xo); + } + + return Status::OK(); +} + +bool TVM_SCHEDULER_CLASS(MatMul, NupharX86OrtOpType)::Evaluate( + const tvm::Tensor& tensor, + const Node* node, + tvm_codegen::CodeGenContext& ctx_codegen, + tvm_codegen::ScheduleContext& ctx_sched) { + NupharCodeGenCtx* ctx_nuphar = Promote(&ctx_codegen); + + if (tensor->dtype != HalideIR::Float(32)) { + return MatMul_2DWeight_Schedule(tensor, *ctx_nuphar, ctx_sched, 16).IsOK(); + } + return InsertRootSchedule(tensor, ctx_sched); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/partial_schedule.cc b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/partial_schedule.cc new file mode 100644 index 0000000000000..4af5e016467ba --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/partial_schedule.cc @@ -0,0 +1,21 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "core/providers/nuphar/compiler/x86/scheduler/nuphar_scheduler.h" + +#include "core/codegen/passes/scheduler/schedule_utils.h" + +namespace onnxruntime { +namespace nuphar { + +// This is for ReuseCount +bool TVM_SCHEDULER_CLASS(True, NupharX86PartialResult)::Evaluate( + const tvm::Tensor& tensor, + const Node*, + tvm_codegen::CodeGenContext&, + tvm_codegen::ScheduleContext& ctx_sched) { + return TryInlineSchedule(tensor, ctx_sched); +} + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_16bit.cc b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_16bit.cc new file mode 100644 index 0000000000000..74470087d53cd --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_16bit.cc @@ -0,0 +1,100 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
+ +#include "intrin_gemv_16bit.h" +#include "core/providers/nuphar/compiler/x86/scheduler/tensorize/tensorize_utilities.h" +#include +#include +#include + +namespace onnxruntime { +namespace nuphar { + +Gemv16bitTensorization::Gemv16bitTensorization(const std::string& name, const std::vector& vshape) + : TensorizeBase(name, "Gemv16bitTensorization_Parameter", {vshape[0], vshape[1]}) {} + +tvm::TensorIntrin Gemv16bitTensorization::CreateTensorIntrin() { + tvm::Expr m(shape_[0]); + tvm::Expr l(shape_[1]); + + auto a = tvm::placeholder({l}, HalideIR::Int(16)); + auto b = tvm::placeholder({m, l}, HalideIR::Int(16)); + auto k = tvm::reduce_axis({0, l}); + + auto c = tvm::compute({m}, [&](tvm::Var i) { + return tvm::sum(tvm::cast(HalideIR::Int(32), a(k)) * tvm::cast(HalideIR::Int(32), b(i, k)), {k}); + }); + + auto a_buf = tvm::BufferNode::make( + tvm::Var("a", tvm::Handle()), + a->dtype, + a->shape, + /*strides*/ {1}, + tvm::Var("a_offset"), + "a", + "", + 0, + /*offset_factor*/ 1); + + auto b_buf = tvm::BufferNode::make( + tvm::Var("b", tvm::Handle()), + b->dtype, + b->shape, + /*strides*/ {tvm::Var("s1"), 1}, + tvm::Var("b_offset"), + "b", + "", + 0, + /*offset_factor*/ 1); + + auto c_buf = tvm::BufferNode::make( + tvm::Var("c", tvm::Handle()), + c->dtype, + c->shape, + /*strides*/ {1}, + tvm::Var("c_offset"), + "c", + "", + 0, + /*offset_factor*/ 1); + + int h_unroll = shape_[1] / 16; + auto sum_int32x8 = tvm::make_const(HalideIR::Int(32, 8), 0); + + for (int i = 0; i < h_unroll; ++i) { + auto a_int16x16 = a_buf.vload({i * 16}, HalideIR::Int(16, 16)); + auto b_int16x16 = b_buf.vload({0, i * 16}, HalideIR::Int(16, 16)); + + auto axb_int32x8 = tvm_codegen::LLVMIntrinsic(HalideIR::Int(32, 8), + "llvm.x86.avx2.pmadd.wd", + {a_int16x16, b_int16x16}); + sum_int32x8 += axb_int32x8; + } + + sum_int32x8 = tvm_codegen::LLVMIntrinsic(HalideIR::Int(32, 8), + "llvm.x86.avx2.phadd.d", + {sum_int32x8, sum_int32x8}); + sum_int32x8 = tvm_codegen::LLVMIntrinsic(HalideIR::Int(32, 8), + "llvm.x86.avx2.phadd.d", + {sum_int32x8, sum_int32x8}); + + auto sum_int32x4_l = tvm_codegen::VectorLow(sum_int32x8); + auto sum_int32x4_h = tvm_codegen::VectorHigh(sum_int32x8); + auto sum_int32x4 = sum_int32x4_l + sum_int32x4_h; + auto sum_int32x1 = tvm_codegen::ExtractElement(sum_int32x4, 0); + + auto reset = c_buf.vstore({0}, tvm::make_const(HalideIR::Int(32, 1), 0)); + auto body = c_buf.vstore({0}, sum_int32x1); + auto update = c_buf.vstore({0}, sum_int32x1 + c_buf.vload({0}, HalideIR::Int(32, 1))); + + return tvm::TensorIntrinNode::make( + "intrin_gemv_16bit", + c->op, + {a, b}, + {a_buf, b_buf, c_buf}, + body, + reset, + update); +} +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_16bit.h b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_16bit.h new file mode 100644 index 0000000000000..0f9e460c632aa --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_16bit.h @@ -0,0 +1,20 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. 
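The tensorization above leans on llvm.x86.avx2.pmadd.wd, which multiplies adjacent signed 16-bit pairs and accumulates each pair into a 32-bit lane, so one intrinsic performs 16 multiplies and 8 pair-adds; the phadd.d calls then reduce across lanes. A scalar model of the pmaddwd step; PmaddwdModel is illustrative only, not the actual TVM lowering.

#include <cassert>
#include <cstdint>
#include <vector>

// Scalar model of pmaddwd: out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1] in int32.
std::vector<int32_t> PmaddwdModel(const std::vector<int16_t>& a, const std::vector<int16_t>& b) {
  std::vector<int32_t> out(a.size() / 2);
  for (size_t i = 0; i < out.size(); ++i)
    out[i] = static_cast<int32_t>(a[2 * i]) * b[2 * i] +
             static_cast<int32_t>(a[2 * i + 1]) * b[2 * i + 1];
  return out;
}

int main() {
  std::vector<int16_t> a{1, 2, 3, 4};
  std::vector<int16_t> b{5, 6, 7, 8};
  auto out = PmaddwdModel(a, b);  // {1*5 + 2*6, 3*7 + 4*8}
  assert(out[0] == 17 && out[1] == 53);
  return 0;
}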
+ +#pragma once +#include "tensorize_base.h" + +namespace onnxruntime { +namespace nuphar { + +class Gemv16bitTensorization : public tvm_codegen::TensorizeBase { + public: + Gemv16bitTensorization(const std::string& name, const std::vector& vshape); + + virtual ~Gemv16bitTensorization() = default; + + tvm::TensorIntrin CreateTensorIntrin() override; +}; + +} // namespace nuphar +} // namespace onnxruntime diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_8bit.cc b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_8bit.cc new file mode 100644 index 0000000000000..4a4dc1695303e --- /dev/null +++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_8bit.cc @@ -0,0 +1,104 @@ +// Copyright (c) Microsoft Corporation. All rights reserved. +// Licensed under the MIT License. + +#include "intrin_gemv_8bit.h" +#include "core/providers/nuphar/compiler/x86/scheduler/tensorize/tensorize_utilities.h" +#include +#include +#include + +namespace onnxruntime { +namespace nuphar { + +Gemv8bitTensorization::Gemv8bitTensorization(const std::string& name, const std::vector& vshape) + : TensorizeBase(name, "Gemv8bitTensorization_Parameter", {vshape[0], vshape[1]}) {} + +tvm::TensorIntrin Gemv8bitTensorization::CreateTensorIntrin() { + tvm::Expr m(shape_[0]); + tvm::Expr l(shape_[1]); + + auto a = tvm::placeholder({l}, HalideIR::UInt(8)); + auto b = tvm::placeholder({m, l}, HalideIR::Int(8)); + auto k = tvm::reduce_axis({0, l}); + + auto c = tvm::compute({m}, [&](tvm::Var i) { + return tvm::sum(tvm::cast(HalideIR::Int(32), a(k)) * tvm::cast(HalideIR::Int(32), b(i, k)), {k}); + }); + + auto a_buf = tvm::BufferNode::make( + tvm::Var("a", tvm::Handle()), + a->dtype, + a->shape, + /*strides*/ {1}, + tvm::Var("a_offset"), + "a", + "", + 0, + /*offset_factor*/ 1); + + auto b_buf = tvm::BufferNode::make( + tvm::Var("b", tvm::Handle()), + b->dtype, + b->shape, + /*strides*/ {tvm::Var("s1"), 1}, + tvm::Var("b_offset"), + "b", + "", + 0, + /*offset_factor*/ 1); + + auto c_buf = tvm::BufferNode::make( + tvm::Var("c", tvm::Handle()), + c->dtype, + c->shape, + /*strides*/ {1}, + tvm::Var("c_offset"), + "c", + "", + 0, + /*offset_factor*/ 1); + + int h_unroll = shape_[1] / 32; + auto sum_int32x8 = tvm::make_const(HalideIR::Int(32, 8), 0); + auto one = tvm::make_const(HalideIR::Int(16, 16), 1); + + for (int i = 0; i < h_unroll; ++i) { + auto a_uint8x32 = a_buf.vload({i * 32}, HalideIR::UInt(8, 32)); + auto b_int8x32 = b_buf.vload({0, i * 32}, HalideIR::Int(8, 32)); + + auto axb_int16x16 = tvm_codegen::LLVMIntrinsic(HalideIR::Int(16, 16), + "llvm.x86.avx2.pmadd.ub.sw", + {a_uint8x32, b_int8x32}); + auto axb_int32x8 = tvm_codegen::LLVMIntrinsic(HalideIR::Int(32, 8), + "llvm.x86.avx2.pmadd.wd", + {axb_int16x16, one}); + sum_int32x8 += axb_int32x8; + } + + sum_int32x8 = tvm_codegen::LLVMIntrinsic(HalideIR::Int(32, 8), + "llvm.x86.avx2.phadd.d", + {sum_int32x8, sum_int32x8}); + sum_int32x8 = tvm_codegen::LLVMIntrinsic(HalideIR::Int(32, 8), + "llvm.x86.avx2.phadd.d", + {sum_int32x8, sum_int32x8}); + + auto sum_int32x4_l = tvm_codegen::VectorLow(sum_int32x8); + auto sum_int32x4_h = tvm_codegen::VectorHigh(sum_int32x8); + auto sum_int32x4 = sum_int32x4_l + sum_int32x4_h; + auto sum_int32x1 = tvm_codegen::ExtractElement(sum_int32x4, 0); + + auto reset = c_buf.vstore({0}, tvm::make_const(HalideIR::Int(32, 1), 0)); + auto body = c_buf.vstore({0}, sum_int32x1); + auto update = c_buf.vstore({0}, sum_int32x1 + c_buf.vload({0}, HalideIR::Int(32, 
+
+  return tvm::TensorIntrinNode::make(
+      "intrin_gemv_8bit",
+      c->op,
+      {a, b},
+      {a_buf, b_buf, c_buf},
+      body,
+      reset,
+      update);
+}
+} // namespace nuphar
+} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_8bit.h b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_8bit.h
new file mode 100644
index 0000000000000..83d366ca1ecde
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_8bit.h
@@ -0,0 +1,20 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+#include "tensorize_base.h"
+
+namespace onnxruntime {
+namespace nuphar {
+
+class Gemv8bitTensorization : public tvm_codegen::TensorizeBase {
+ public:
+  Gemv8bitTensorization(const std::string& name, const std::vector& vshape);
+
+  virtual ~Gemv8bitTensorization() = default;
+
+  tvm::TensorIntrin CreateTensorIntrin() override;
+};
+
+} // namespace nuphar
+} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.cc b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.cc
new file mode 100644
index 0000000000000..0aedc8178c72f
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.cc
@@ -0,0 +1,103 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#include "core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.h"
+#include "core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.h"
+#include
+#include
+
+namespace onnxruntime {
+namespace nuphar {
+
+const char* gemv_update_func_name = "gemv_update";
+const char* gemv_reset_func_name = "gemv_reset";
+
+NaiveLLVMExternGemvTensorization::NaiveLLVMExternGemvTensorization(const std::string& name,
+                                                                   const std::vector& shape)
+    : TensorizeWithLLVMImport(name, "NaiveLLVMExternGemvTensorization_Parameter", shape) {}
+
+tvm::TensorIntrin NaiveLLVMExternGemvTensorization::CreateTensorIntrin() {
+  tvm::Expr m(shape_[0]);
+  tvm::Expr l(shape_[1]);
+
+  auto a = tvm::placeholder({l});
+  auto b = tvm::placeholder({m, l});
+  auto k = tvm::reduce_axis({0, l});
+
+  auto c = tvm::compute({m}, [&](tvm::Var i) {
+    return tvm::sum(a(k) * b(i, k), {k});
+  });
+
+  auto a_buf = tvm::BufferNode::make(
+      tvm::Var("a", tvm::Handle()),
+      a->dtype,
+      a->shape,
+      /*strides*/ {1},
+      tvm::Var("a_offset"),
+      "a",
+      "",
+      0,
+      /*offset_factor*/ 1);
+
+  auto b_buf = tvm::BufferNode::make(
+      tvm::Var("b", tvm::Handle()),
+      b->dtype,
+      b->shape,
+      /*strides*/ {tvm::Var("s1"), 1},
+      tvm::Var("b_offset"),
+      "b",
+      "",
+      0,
+      /*offset_factor*/ 1);
+
+  auto c_buf = tvm::BufferNode::make(
+      tvm::Var("c", tvm::Handle()),
+      c->dtype,
+      c->shape,
+      /*strides*/ {1},
+      tvm::Var("c_offset"),
+      "c",
+      "",
+      0,
+      /*offset_factor*/ 1);
+
+  auto body = tvm::ir::Call::make(
+      HalideIR::Type(HalideIR::Type::Int, 32, 1),
+      gemv_update_func_name,
+      {
+          c_buf.access_ptr(static_cast(tvm::AccessMask::kWrite)),
+          a_buf.access_ptr(static_cast(tvm::AccessMask::kRead)),
+          b_buf.access_ptr(static_cast(tvm::AccessMask::kRead)),
+          m,
+          l,
+          /*stride*/ b_buf->strides[0],
+      },
+      tvm::ir::Call::CallType::Extern);
+
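+  // body invokes the extern "C" helper gemv_update (see ll/gemv_impl.cpp,
+  // linked in via the LLVM IR string returned by LLVMImportDef) with a
+  // writable pointer to c, read-only pointers to a and b, the extents m and l,
+  // and the row stride of b.  reduce_init below calls gemv_reset to zero the
+  // m output elements before accumulation begins.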
+  auto reduce_init = tvm::ir::Call::make(
+      HalideIR::Type(HalideIR::Type::Int, 32, 1),
+      gemv_reset_func_name,
+      {
+          c_buf.access_ptr(static_cast(tvm::AccessMask::kWrite)),
+          m,
+      },
+      tvm::ir::Call::CallType::Extern);
+
+  auto reduce_update = body;
+
+  return tvm::TensorIntrinNode::make(
+      "intrin_gemv_ll_extern",
+      c->op,
+      {a, b},
+      {a_buf, b_buf, c_buf},
+      tvm::ir::Evaluate::make(body),
+      tvm::ir::Evaluate::make(reduce_init),
+      tvm::ir::Evaluate::make(reduce_update));
+}
+
+const std::string NaiveLLVMExternGemvTensorization::LLVMImportDef() {
+  return std::string(gemv_stubs_ir);
+}
+
+} // namespace nuphar
+} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.h b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.h
new file mode 100644
index 0000000000000..6a227c746a9e5
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_extern.h
@@ -0,0 +1,13 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+#include "tensorize_base.h"
+
+namespace onnxruntime {
+namespace nuphar {
+
+TENSORIZE_CLASS_WITH_LLVM_IMPORT(NaiveLLVMExternGemvTensorization)
+
+} // namespace nuphar
+} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.cc b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.cc
new file mode 100644
index 0000000000000..cdaefe9f87e4c
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.cc
@@ -0,0 +1,96 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#include "intrin_gemv_ll_ir.h"
+
+#include "core/providers/nuphar/compiler/x86/scheduler/tensorize/tensorize_utilities.h"
+#include
+#include
+#include
+
+namespace onnxruntime {
+namespace nuphar {
+
+const int32_t dim0 = 1;
+const int32_t dim1 = 8;
+
+NaiveLLVMIRGemvTensorization::NaiveLLVMIRGemvTensorization(const std::string& name)
+    : TensorizeBase(name, "NaiveLLVMIRGemvTensorization_Parameter", {dim0, dim1}) {}
+
+tvm::TensorIntrin NaiveLLVMIRGemvTensorization::CreateTensorIntrin() {
+  tvm::Expr m(dim0);
+  tvm::Expr l(dim1);
+
+  auto a = tvm::placeholder({l});
+  auto b = tvm::placeholder({m, l});
+  auto k = tvm::reduce_axis({0, l});
+
+  auto c = tvm::compute({m}, [&](tvm::Var i) {
+    return tvm::sum(a(k) * b(i, k), {k});
+  });
+
+  auto a_buf = tvm::BufferNode::make(
+      tvm::Var("a", tvm::Handle()),
+      a->dtype,
+      a->shape,
+      /*strides*/ {1},
+      tvm::Var("a_offset"),
+      "a",
+      "",
+      0,
+      /*offset_factor*/ 1);
+
+  auto b_buf = tvm::BufferNode::make(
+      tvm::Var("b", tvm::Handle()),
+      b->dtype,
+      b->shape,
+      /*strides*/ {tvm::Var("s1"), 1},
+      tvm::Var("b_offset"),
+      "b",
+      "",
+      0,
+      /*offset_factor*/ 1);
+
+  auto c_buf = tvm::BufferNode::make(
+      tvm::Var("c", tvm::Handle()),
+      c->dtype,
+      c->shape,
+      /*strides*/ {1},
+      tvm::Var("c_offset"),
+      "c",
+      "",
+      0,
+      /*offset_factor*/ 1);
+
+  auto a_float32x8 = a_buf.vload({0}, HalideIR::Float(32, 8));
+  auto b_float32x8 = b_buf.vload({0, 0}, HalideIR::Float(32, 8));
+  auto z_float32x8 = tvm::make_const(HalideIR::Float(32, 8), 0);
+
+  auto axb = tvm_codegen::LLVMIntrinsic(HalideIR::Float(32, 8),
+                                        "llvm.x86.fma.vfmadd.ps.256",
+                                        {a_float32x8,
+                                         b_float32x8,
+                                         z_float32x8});
+
+  auto sum = tvm_codegen::ExtractElement(axb, 0);
+
+  for (int i = 1; i < 8; ++i) {
+    auto z0 = tvm_codegen::ExtractElement(axb, i);
+    sum += z0;
+  }
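+  // The single 256-bit FMA above multiplies a with the one row of b
+  // (accumulating into a zero vector), and the loop folds the eight lanes
+  // into a scalar by extracting and adding each element.  The stores that
+  // follow (body/reset/update) give TVM the write-out, initialization and
+  // accumulation steps it needs for a reduction tensor intrinsic.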
+  auto body = c_buf.vstore({0}, sum);
+  auto reset = c_buf.vstore({0}, tvm::make_const(HalideIR::Float(32, 1), 0));
+  auto update = c_buf.vstore({0}, sum + c_buf.vload({0}, HalideIR::Float(32, 1)));
+
+  return tvm::TensorIntrinNode::make(
+      "intrin_gemv_ll_ir",
+      c->op,
+      {a, b},
+      {a_buf, b_buf, c_buf},
+      body,
+      reset,
+      update);
+}
+} // namespace nuphar
+} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.h b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.h
new file mode 100644
index 0000000000000..7dad78b35723f
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/intrin_gemv_ll_ir.h
@@ -0,0 +1,20 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+#include "tensorize_base.h"
+
+namespace onnxruntime {
+namespace nuphar {
+
+class NaiveLLVMIRGemvTensorization : public tvm_codegen::TensorizeBase {
+ public:
+  NaiveLLVMIRGemvTensorization(const std::string& name);
+
+  virtual ~NaiveLLVMIRGemvTensorization() = default;
+
+  tvm::TensorIntrin CreateTensorIntrin() override;
+};
+
+} // namespace nuphar
+} // namespace onnxruntime
diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.cpp b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.cpp
new file mode 100644
index 0000000000000..7eff192b1829e
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.cpp
@@ -0,0 +1,18 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+extern "C" int gemv_update(float* cc, float* aa, float* bb, int m, int l, int stride) {
+  for (int i = 0; i < m; ++i) {
+    for (int j = 0; j < l; ++j) {
+      cc[i] += aa[j] * bb[i * stride + j];
+    }
+  }
+  return 0;
+}
+
+extern "C" int gemv_reset(float* cc, int m) {
+  for (int i = 0; i < m; ++i) {
+    cc[i] = 0.0;
+  }
+  return 0;
+}
diff --git a/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.h b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.h
new file mode 100644
index 0000000000000..89fdfe5e51148
--- /dev/null
+++ b/onnxruntime/core/providers/nuphar/compiler/x86/scheduler/tensorize/ll/gemv_impl.h
@@ -0,0 +1,137 @@
+// The string in this file is generated using clang:
+// clang++.exe -fno-preserve-as-comments -S -emit-llvm gemv_impl.cpp
+
+namespace onnxruntime {
+namespace nuphar {
+
+const char* gemv_stubs_ir = R"gemv_stub_escape(
+; ModuleID = 'gemv_stubs.cpp'
+source_filename = "gemv_stubs.cpp"
+target datalayout = "e-m:w-i64:64-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-pc-windows-msvc19.11.25548"
+
+; Function Attrs: noinline nounwind optnone uwtable
+define i32 @gemv_update(float*, float*, float*, i32, i32, i32) #0 {
+  %7 = alloca i32, align 4
+  %8 = alloca i32, align 4
+  %9 = alloca i32, align 4
+  %10 = alloca float*, align 8
+  %11 = alloca float*, align 8
+  %12 = alloca float*, align 8
+  %13 = alloca i32, align 4
+  %14 = alloca i32, align 4
+  store i32 %5, i32* %7, align 4
+  store i32 %4, i32* %8, align 4
+  store i32 %3, i32* %9, align 4
+  store float* %2, float** %10, align 8
+  store float* %1, float** %11, align 8
+  store float* %0, float** %12, align 8
+  store i32 0, i32* %13, align 4
+  br label %15
+
+;