-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ggml : group all experts in a single ggml_mul_mat_id #6505
Conversation
cuda : improve mmid row copy
This comment was marked as off-topic.
This comment was marked as off-topic.
Benchmarked this is on a Ryzen 5800x (64GB ddr4@3733MT CL16) and Tesla P40 24GB. 28/33 layers offloaded. Model is Bagel Mistery Tour 8x7b (mixtral)
|
Alright, wow. PP went down from 11,85 ms/t to 4,95 ms/t (with partial offloading, 5 layers Mixtral and 2060). Simply incredible, but I'm not surprised anymore as Slaren always delivers. Llama.cpp's MOE implementation is now extremly robust. |
I also noticed that this pull uses significantly less CUDA Buffer (50% less) compared to master which allowed an extra layer at low ctx. PR:
Master (build = 2620 (d4f220a))
|
Just tested with 8k context... not as much as a savings.
Master
|
@JohannesGaessler I had already tested most of these, but on my 3090 I didn't see a meaningful improvement. Anyway I have pushed my changes that I think already cover all of that. In the long term the goal is to use a grouped GEMM with cutlass without requiring a synchronization. I think that this will also allow removing the row rearrangement entirely, which has a significant cost. |
Some quick performance comparisons from me:
I am measuring a performance difference from the changes:
That would definitely help. When you look into this I think it would also make sense to check whether it is possible to set the input/output type to FP32 to avoid the conversion of some tensors. (With the input I suspect that it's probably not possible though.) |
Definite speedup on P40 with larger iq4_xs quant with partial offloading.
|
@ggerganov I really do not want to have to modify 21 functions in exactly the same way again, I would rather spend some time refactoring. Did you find any reason that would prevent making the Metal |
No reason at all - I simply wasn't able to fit this into templates. Thanks for doing it - it was very ugly before |
…nstead of silu. Do not pass too much time on this function as it will be replaced in #6505
Hey there, just wondering if there's any reason this isn't ready to be merged yet. Have heard a couple of reports that it's really beneficial for mixtral PP speed for some people who have tried it. |
It's still missing a Metal implementation. It should be good for CPU and CUDA already. |
Let's rebase on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! M2 Ultra results (-ub 256
is optimal):
./scripts/compare-commits.sh master sl/moe-rework-2 -m models/mixtral-8x7b-32k-fast/ggml-model-f16.gguf -ub 256 -p 1,2,4,8,16,32,64,128,256,512
CPU | Model | Test | t/s master | t/s sl/moe-rework-2 | Speedup |
---|---|---|---|---|---|
M2 Ultra | llama 8x7B F16 | pp1 | 22.28 | 23.15 | 1.04 |
M2 Ultra | llama 8x7B F16 | pp2 | 21.18 | 22.11 | 1.04 |
M2 Ultra | llama 8x7B F16 | pp4 | 26.67 | 27.40 | 1.03 |
M2 Ultra | llama 8x7B F16 | pp8 | 30.73 | 44.21 | 1.44 |
M2 Ultra | llama 8x7B F16 | pp16 | 50.73 | 79.88 | 1.57 |
M2 Ultra | llama 8x7B F16 | pp32 | 90.07 | 154.87 | 1.72 |
M2 Ultra | llama 8x7B F16 | pp64 | 155.10 | 263.48 | 1.70 |
M2 Ultra | llama 8x7B F16 | pp128 | 256.59 | 357.97 | 1.40 |
M2 Ultra | llama 8x7B F16 | pp256 | 319.72 | 370.37 | 1.16 |
M2 Ultra | llama 8x7B F16 | pp512 | 319.97 | 370.87 | 1.16 |
M2 Ultra | llama 8x7B F16 | tg128 | 22.38 | 23.12 | 1.03 |
@NeoZhangJianyu @airMeng This change will break mul_mat_id in SYCL again. Sorry for the inconvenience, the change to the interface was necessary to improve performance. |
@ggerganov Do you know why the ggml-ci cuda-v100 failed? The log ends during a |
Yes, it exceeded 30 min. On We can either increase to 40 min or maybe not run |
Got it! I will study and fix later. Thank for reminding! |
@slaren can we have a workaround like macros in llama.cpp or a fallback to CPU to maintain SYCL capabilities? Then SYCL will not block your merging, and we can have more time on SYCL kernels (I am just assigned a JIRA about MOE, maybe I can re-use the efforts) |
We could disable offloading of MoE models when using SYCL by setting |
I think that the problem is that there are too many types. We can run the full tests only for a few types, and a basic test only for the rest. |
Yes, for now should I bump the timeout to 40 min and figure a test reduction later on |
I think this is good enough for now. There are full tests with a few types to verify the logic, and then a simple test with the other types to check if they work at all. |
Should improve performance of MoE models with CUDA significantly. Also improved the rearrangement of the rows in the CUDA backend with custom kernels instead of memcpys, that's about 50% of the speedup here.