
Tianxing/moe quantization #693

Draft: Chi-Chu319 wants to merge 55 commits into main_perf from tianxing/moe-quantization
Conversation


@Chi-Chu319 commented Jan 3, 2025

MoE int8 and fp8 quantization support (INT8_W8A16 and FP8_W8A8). Based on the tianxing/moe-quantization branch.
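For reference, the two schemes differ in what is stored in 8 bits: W8A8 quantizes both weights and activations (fp8 here), while W8A16 quantizes only the weights to int8 and dequantizes them in-kernel against fp16 activations. Below is a minimal PyTorch sketch of symmetric per-tensor int8 weight quantization as an illustration only; the PR's actual scale granularity (per-tensor, per-channel, or per-token) is not shown in this excerpt, and this is not the kernel code.

```python
import torch

def quantize_per_tensor_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = x.abs().amax() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

# W8A16: weights live as int8 plus a scale; the kernel dequantizes on the fly.
w = torch.randn(4096, 14336)
q_w, w_scale = quantize_per_tensor_int8(w)
w_deq = q_w.float() * w_scale              # what the kernel reconstructs per tile
assert (w - w_deq).abs().max() <= w_scale  # error bounded by one quantization step
```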

FP8_W8A8 Benchmark

| M | K | N | E | top_k | TFLOPS | Bandwidth (GB/s) |
|-----:|-----:|------:|--:|------:|-----------:|------------:|
| 64 | 128 | 256 | 8 | 2 | 0.092190 | 3.719927 |
| 64 | 1024 | 1792 | 8 | 2 | 5.155517 | 169.963879 |
| 64 | 4096 | 7168 | 8 | 2 | 31.803841 | 1007.491579 |
| 128 | 4096 | 7168 | 8 | 2 | 62.831290 | 997.923160 |
| 1024 | 4096 | 7168 | 8 | 2 | 378.030246 | 750.201456 |
| 4096 | 4096 | 7168 | 8 | 2 | 684.761333 | 531.158431 |
| 64 | 4096 | 14336 | 8 | 2 | 47.124442 | 1480.603863 |
| 128 | 4096 | 14336 | 8 | 2 | 93.602312 | 1478.376964 |
| 256 | 4096 | 14336 | 8 | 2 | 166.424183 | 1358.410607 |
| 512 | 4096 | 14336 | 8 | 2 | 314.778601 | 1296.118960 |
| 1024 | 4096 | 14336 | 8 | 2 | 469.500765 | 1057.664647 |
| 2048 | 4096 | 14336 | 8 | 2 | 617.068864 | 772.391025 |
| 4096 | 4096 | 14336 | 8 | 2 | 672.719851 | 514.981773 |

Model Results:

| Model | M | N | K | E | top_k | TFLOPS | Bandwidth (GB/s) |
|-------------|-----:|------:|------:|--:|------:|-----------:|-----------:|
| mistral-7B | 4096 | 28672 | 4096 | 8 | 2 | 661.756375 | 489.792274 |
| mistral-7B | 4096 | 4096 | 14336 | 8 | 2 | 694.422561 | 410.142435 |
| mistral-22B | 4096 | 32768 | 6144 | 8 | 2 | 674.569986 | 434.342128 |
| mistral-22B | 4096 | 6144 | 16384 | 8 | 2 | 712.699644 | 420.907449 |
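As a reading aid for these numbers, here is one plausible accounting of how TFLOPS and bandwidth follow from (M, K, N, E, top_k) and a measured kernel time. The benchmark script's exact FLOP/byte convention is not shown in this excerpt, so treat the formulas as an assumption; with the byte counts below, the large FP8 rows land within roughly 2% of the table's TFLOPS-to-bandwidth ratio.

```python
def moe_gemm_metrics(ms: float, M: int, K: int, N: int, E: int, top_k: int,
                     a_bytes: int = 1, w_bytes: int = 1, out_bytes: int = 2):
    """Assumed model: each of the M tokens is routed to top_k experts, so the
    fused GEMM does ~2 * M * top_k * K * N FLOPs; traffic counts activations
    read once, all E expert weight matrices read, and fp16 outputs written."""
    flops = 2.0 * M * top_k * K * N
    bytes_moved = (M * K * a_bytes                # activations in
                   + E * K * N * w_bytes          # expert weights in
                   + M * top_k * N * out_bytes)   # outputs out
    tflops = flops / ms * 1e-9   # ms -> seconds (1e-3), FLOPs -> TFLOPs (1e-12)
    gbps = bytes_moved / ms * 1e-6
    return tflops, gbps
```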

INT8_W8A16 Benchmark

| M | K | N | E | top_k | TFLOPS | Bandwidth (GB/s) |
|-----:|-----:|------:|--:|------:|-----------:|------------:|
| 64 | 128 | 256 | 8 | 2 | 0.092512 | 3.852741 |
| 64 | 1024 | 1792 | 8 | 2 | 5.222558 | 170.757878 |
| 64 | 4096 | 7168 | 8 | 2 | 30.971172 | 977.097146 |
| 128 | 4096 | 7168 | 8 | 2 | 61.452061 | 993.447948 |
| 1024 | 4096 | 7168 | 8 | 2 | 282.687560 | 619.585019 |
| 4096 | 4096 | 7168 | 8 | 2 | 397.113785 | 318.947949 |
| 64 | 4096 | 14336 | 8 | 2 | 44.764809 | 1394.195296 |
| 128 | 4096 | 14336 | 8 | 2 | 85.919723 | 1308.488185 |
| 256 | 4096 | 14336 | 8 | 2 | 145.609915 | 1164.399782 |
| 512 | 4096 | 14336 | 8 | 2 | 246.880804 | 1023.148388 |
| 1024 | 4096 | 14336 | 8 | 2 | 331.187691 | 735.699149 |
| 2048 | 4096 | 14336 | 8 | 2 | 391.701333 | 493.718930 |
| 4096 | 4096 | 14336 | 8 | 2 | 425.553113 | 326.596512 |

Model Results:

| Model | M | N | K | E | top_k | TFLOPS | Bandwidth (GB/s) |
|-------------|-----:|------:|------:|--:|------:|-----------:|-----------:|
| mistral-7B | 4096 | 28672 | 4096 | 8 | 2 | 422.868241 | 325.256533 |
| mistral-7B | 4096 | 4096 | 14336 | 8 | 2 | 426.262209 | 288.454299 |
| mistral-22B | 4096 | 32768 | 6144 | 8 | 2 | 434.789953 | 290.210389 |
| mistral-22B | 4096 | 6144 | 16384 | 8 | 2 | 424.348817 | 268.953209 |
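Part of the FP8-vs-INT8 gap is plausibly the cost W8A16 pays to dequantize weights inside the kernel before an fp16 dot. Below is a minimal Triton sketch of that dequantize-then-dot pattern, under loud assumptions: per-tensor scale, tile sizes that divide the problem exactly (so no masking), and no MoE routing; the PR's fused kernel is necessarily far more involved.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def w8a16_gemm_kernel(a_ptr, w_ptr, c_ptr, w_scale, M, N, K,
                      stride_am, stride_ak, stride_wk, stride_wn,
                      stride_cm, stride_cn,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                      BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * stride_am
                          + (k + offs_k)[None, :] * stride_ak)
        w_q = tl.load(w_ptr + (k + offs_k)[:, None] * stride_wk
                            + offs_n[None, :] * stride_wn)
        # The W8A16 tax: widen int8 weights and rescale before the fp16 dot.
        w = (w_q.to(tl.float32) * w_scale).to(tl.float16)
        acc += tl.dot(a, w)
    tl.store(c_ptr + offs_m[:, None] * stride_cm
                   + offs_n[None, :] * stride_cn, acc.to(tl.float16))

# Hypothetical launch; shapes chosen to divide the tile sizes exactly.
M, N, K = 64, 256, 128
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
w_q = torch.randint(-128, 128, (K, N), device="cuda", dtype=torch.int8)
c = torch.empty(M, N, device="cuda", dtype=torch.float16)
grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
w8a16_gemm_kernel[grid](a, w_q, c, 0.01, M, N, K,
                        a.stride(0), a.stride(1),
                        w_q.stride(0), w_q.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
```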
- I am not making a trivial change, such as fixing a typo in a comment.
- I have written a PR description following these rules.
- I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - I have added tests.
    - `/test` for lit tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - This PR does not need a test because FILL THIS IN.
- Select one of the following.
  - I have not added any lit tests.
  - The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

@Chi-Chu319 self-assigned this Jan 3, 2025
@Chi-Chu319 mentioned this pull request Jan 4, 2025
@Chi-Chu319 force-pushed the tianxing/moe-quantization branch from b83a9f9 to 78c50a4 on January 8, 2025 14:01