
Tianxing/moe quantization #693

Draft: Chi-Chu319 wants to merge 55 commits into main_perf from tianxing/moe-quantization
Conversation


@Chi-Chu319 commented Jan 3, 2025

MoE int8 and fp8 quantization support (INT8_W8A16 and FP8_W8A8). Based on the tianxing/moe-quantization branch.
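For reference, the two schemes differ in what is stored in 8 bits: W8A8 quantizes both weights and activations (fp8 here), while W8A16 quantizes only the weights to int8 and dequantizes them in-kernel against fp16 activations. Below is a minimal PyTorch sketch of symmetric per-tensor int8 weight quantization as an illustration only; the PR's actual scale granularity (per-tensor, per-channel, or per-token) is not shown in this excerpt, and this is not the kernel code.

```python
import torch

def quantize_per_tensor_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = x.abs().amax() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

# W8A16: weights live as int8 plus a scale; the kernel dequantizes on the fly.
w = torch.randn(4096, 14336)
q_w, w_scale = quantize_per_tensor_int8(w)
w_deq = q_w.float() * w_scale              # what the kernel reconstructs per tile
assert (w - w_deq).abs().max() <= w_scale  # error bounded by one quantization step
```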

FP8_W8A8 Benchmark

| M | K | N | E | top_k | TFLOPS | Bandwidth (GB/s) |
|-----:|-----:|------:|--:|------:|-----------:|------------:|
| 64 | 128 | 256 | 8 | 2 | 0.092190 | 3.719927 |
| 64 | 1024 | 1792 | 8 | 2 | 5.155517 | 169.963879 |
| 64 | 4096 | 7168 | 8 | 2 | 31.803841 | 1007.491579 |
| 128 | 4096 | 7168 | 8 | 2 | 62.831290 | 997.923160 |
| 1024 | 4096 | 7168 | 8 | 2 | 378.030246 | 750.201456 |
| 4096 | 4096 | 7168 | 8 | 2 | 684.761333 | 531.158431 |
| 64 | 4096 | 14336 | 8 | 2 | 47.124442 | 1480.603863 |
| 128 | 4096 | 14336 | 8 | 2 | 93.602312 | 1478.376964 |
| 256 | 4096 | 14336 | 8 | 2 | 166.424183 | 1358.410607 |
| 512 | 4096 | 14336 | 8 | 2 | 314.778601 | 1296.118960 |
| 1024 | 4096 | 14336 | 8 | 2 | 469.500765 | 1057.664647 |
| 2048 | 4096 | 14336 | 8 | 2 | 617.068864 | 772.391025 |
| 4096 | 4096 | 14336 | 8 | 2 | 672.719851 | 514.981773 |

Model Results:

| Model | M | N | K | E | top_k | TFLOPS | Bandwidth (GB/s) |
|-------------|-----:|------:|------:|--:|------:|-----------:|-----------:|
| mistral-7B | 4096 | 28672 | 4096 | 8 | 2 | 661.756375 | 489.792274 |
| mistral-7B | 4096 | 4096 | 14336 | 8 | 2 | 694.422561 | 410.142435 |
| mistral-22B | 4096 | 32768 | 6144 | 8 | 2 | 674.569986 | 434.342128 |
| mistral-22B | 4096 | 6144 | 16384 | 8 | 2 | 712.699644 | 420.907449 |
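As a reading aid for these numbers, here is one plausible accounting of how TFLOPS and bandwidth follow from (M, K, N, E, top_k) and a measured kernel time. The benchmark script's exact FLOP/byte convention is not shown in this excerpt, so treat the formulas as an assumption; with the byte counts below, the large FP8 rows land within roughly 2% of the table's TFLOPS-to-bandwidth ratio.

```python
def moe_gemm_metrics(ms: float, M: int, K: int, N: int, E: int, top_k: int,
                     a_bytes: int = 1, w_bytes: int = 1, out_bytes: int = 2):
    """Assumed model: each of the M tokens is routed to top_k experts, so the
    fused GEMM does ~2 * M * top_k * K * N FLOPs; traffic counts activations
    read once, all E expert weight matrices read, and fp16 outputs written."""
    flops = 2.0 * M * top_k * K * N
    bytes_moved = (M * K * a_bytes                # activations in
                   + E * K * N * w_bytes          # expert weights in
                   + M * top_k * N * out_bytes)   # outputs out
    tflops = flops / ms * 1e-9   # ms -> seconds (1e-3), FLOPs -> TFLOPs (1e-12)
    gbps = bytes_moved / ms * 1e-6
    return tflops, gbps
```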

INT8_W8A16 Benchmark

| M | K | N | E | top_k | TFLOPS | Bandwidth (GB/s) |
|-----:|-----:|------:|--:|------:|-----------:|------------:|
| 64 | 128 | 256 | 8 | 2 | 0.092512 | 3.852741 |
| 64 | 1024 | 1792 | 8 | 2 | 5.222558 | 170.757878 |
| 64 | 4096 | 7168 | 8 | 2 | 30.971172 | 977.097146 |
| 128 | 4096 | 7168 | 8 | 2 | 61.452061 | 993.447948 |
| 1024 | 4096 | 7168 | 8 | 2 | 282.687560 | 619.585019 |
| 4096 | 4096 | 7168 | 8 | 2 | 397.113785 | 318.947949 |
| 64 | 4096 | 14336 | 8 | 2 | 44.764809 | 1394.195296 |
| 128 | 4096 | 14336 | 8 | 2 | 85.919723 | 1308.488185 |
| 256 | 4096 | 14336 | 8 | 2 | 145.609915 | 1164.399782 |
| 512 | 4096 | 14336 | 8 | 2 | 246.880804 | 1023.148388 |
| 1024 | 4096 | 14336 | 8 | 2 | 331.187691 | 735.699149 |
| 2048 | 4096 | 14336 | 8 | 2 | 391.701333 | 493.718930 |
| 4096 | 4096 | 14336 | 8 | 2 | 425.553113 | 326.596512 |

Model Results:

| Model | M | N | K | E | top_k | TFLOPS | Bandwidth (GB/s) |
|-------------|-----:|------:|------:|--:|------:|-----------:|-----------:|
| mistral-7B | 4096 | 28672 | 4096 | 8 | 2 | 422.868241 | 325.256533 |
| mistral-7B | 4096 | 4096 | 14336 | 8 | 2 | 426.262209 | 288.454299 |
| mistral-22B | 4096 | 32768 | 6144 | 8 | 2 | 434.789953 | 290.210389 |
| mistral-22B | 4096 | 6144 | 16384 | 8 | 2 | 424.348817 | 268.953209 |
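Part of the FP8-vs-INT8 gap is plausibly the cost W8A16 pays to dequantize weights inside the kernel before an fp16 dot. Below is a minimal Triton sketch of that dequantize-then-dot pattern, under loud assumptions: per-tensor scale, tile sizes that divide the problem exactly (so no masking), and no MoE routing; the PR's fused kernel is necessarily far more involved.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def w8a16_gemm_kernel(a_ptr, w_ptr, c_ptr, w_scale, M, N, K,
                      stride_am, stride_ak, stride_wk, stride_wn,
                      stride_cm, stride_cn,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                      BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * stride_am
                          + (k + offs_k)[None, :] * stride_ak)
        w_q = tl.load(w_ptr + (k + offs_k)[:, None] * stride_wk
                            + offs_n[None, :] * stride_wn)
        # The W8A16 tax: widen int8 weights and rescale before the fp16 dot.
        w = (w_q.to(tl.float32) * w_scale).to(tl.float16)
        acc += tl.dot(a, w)
    tl.store(c_ptr + offs_m[:, None] * stride_cm
                   + offs_n[None, :] * stride_cn, acc.to(tl.float16))

# Hypothetical launch; shapes chosen to divide the tile sizes exactly.
M, N, K = 64, 256, 128
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
w_q = torch.randint(-128, 128, (K, N), device="cuda", dtype=torch.int8)
c = torch.empty(M, N, device="cuda", dtype=torch.float16)
grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
w8a16_gemm_kernel[grid](a, w_q, c, 0.01, M, N, K,
                        a.stride(0), a.stride(1),
                        w_q.stride(0), w_q.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
```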
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - I have added tests.
    - `/test` for lit tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - This PR does not need a test because FILL THIS IN.
- Select one of the following.
  - I have not added any lit tests.
  - The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

@Chi-Chu319 self-assigned this Jan 3, 2025
@Chi-Chu319 mentioned this pull request Jan 4, 2025
@Chi-Chu319 force-pushed the tianxing/moe-quantization branch from b83a9f9 to 78c50a4 on January 8, 2025 14:01