
optimize com.microsoft.MatMulNbits operator #28504

Open: wants to merge 1 commit into base: master
Conversation

bopeng1234

This PR optimizes the ONNX frontend com.microsoft.MatMulNbits operator.

With these changes:

  1. Constant folding is disabled; previously it used 75 GB of memory for the phi3 INT4 model and 200+ GB for the llama3 INT4 model.
  2. The oneDNN matmul primitives are triggered, which significantly benefits GPU performance.

We tested these changes along with PR #28163 and confirmed that the phi3/llama3 INT4 models run well on LNL.

    ### Details:
        - Use the Convert op instead of ConvertLike; this disables constant
          folding, so int2/4/8 dequantization runs online at inference time
          rather than being constant-folded at compile time, which benefits
          compile-time memory usage and inference latency.
        - Use the zero point as uint2/4/8; this triggers the oneDNN kernel,
          which significantly benefits GPU performance.
@bopeng1234 bopeng1234 requested a review from a team as a code owner January 17, 2025 06:47
@github-actions github-actions bot added the category: ONNX FE OpenVINO ONNX FrontEnd label Jan 17, 2025
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Jan 17, 2025
@ilya-lavrenov ilya-lavrenov added this to the 2025.1 milestone Jan 17, 2025
@ilya-lavrenov
Contributor

build_jenkins

Labels
category: ONNX FE OpenVINO ONNX FrontEnd ExternalIntelPR External contributor from Intel
4 participants