metal: new q4_0 mat-vec mul kernel #2188
Conversation
Please feel free to share results on M1/M2 Pro/Max/Ultra!
Well, I had not checked out the PR branch. With the PR branch I get on M2 Max with 30-core GPU, 64 GB RAM:
The above is for an empty prompt. If I use
Looks promising! From what I've read before, there is still a lot of room for Metal code improvements, so this could be very welcome. I will try to test it out on my M1 Max machine as well.
Nice! I updated the measurements with the new prompt. Looks like this PR brings the M1 series to the same speed as the M2 series. I guess the M2 series has better memory access prediction, so its speed is already good without this PR. Nonetheless, for the 65B model I guess we can see a larger performance improvement for the M2 series.
Would be nice to see results on M1 Ultra or M2 Ultra chips. Their two-die design may also suffer from cache misses.
Force-pushed 0e3eeb6 to 38ec9a2
Here's my benchmark: MBP 14 Apple M2 Max 32GB 12C CPU, 30C GPU
M1 Pro 32GB
Another data point: M2 Max with 30-core GPU, 65B model:
Impressive! 🦙
ggml-metal.m (outdated)
@@ -660,7 +662,11 @@ void ggml_metal_graph_compute(
    nth0 = 8;
    nth1 = 8;
    [encoder setComputePipelineState:ctx->pipeline_mul_mat_q4_0_f32];
    if (ne01 % 8 == 0) {
Is it somehow possible to remove this restriction without affecting the performance? If so, we can simply delete the old pipeline_mul_mat_q4_0_f32 kernel.
This is for models like WizardLM. In those models the last mat-vec multiplication in inference has a row count of 32001, which is not a multiple of 8, while our new kernel consumes 8 rows at a time. Let me test later whether adding the ability to handle fewer than 8 rows in the kernel makes it run slower or faster.
Removed the old kernel. There might be a minor performance gain or loss depending on model size and n_vocab. Tested on llama-7B and wizardlm-30B; generation results are the same.
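For intuition, here is a heavily simplified sketch of the boundary handling being discussed: one thread per output row, plain f32 instead of q4_0, and no prefetching or threadgroup reduction. Names and launch geometry are illustrative, not the PR's actual kernel.

#include <metal_stdlib>
using namespace metal;

// Sketch only: the point is the bounds check, which lets a single kernel cover
// row counts that are not a multiple of the per-dispatch row batch
// (e.g. n_vocab = 32001 in WizardLM-style models: 32001 = 4000*8 + 1).
kernel void mul_mat_vec_rows_guard_sketch(
        device const float  * src0,   // [ne01][ne00] matrix
        device const float  * src1,   // [ne00] input vector
        device       float  * dst,    // [ne01] output vector
        constant    int64_t & ne00,
        constant    int64_t & ne01,
        uint row [[thread_position_in_grid]]) {
    if (row >= ne01) {
        return; // the grid may be rounded up past the last valid row
    }
    device const float * x = src0 + row*ne00;
    float acc = 0.0f;
    for (int64_t i = 0; i < ne00; ++i) {
        acc += x[i]*src1[i];
    }
    dst[row] = acc;
}

The actual kernel instead has each threadgroup cooperate on a batch of 8 rows and reduce partial sums in threadgroup memory, so the equivalent guard is applied per row batch rather than per thread.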
And here is a full list of all Meta LLaMA models on a 30-core M2 Max with
MacBook Pro, M2 Pro, 32GB, Ventura 13.4.1
minor issue
With M2 Mac Studio Ultra, 128GB RAM
Force-pushed 38ec9a2 to fa0e4cf:
Prefetch data to improve GPU utilization. ~48% faster for 33B model.
Force-pushed fa0e4cf to 5150582
@ggerganov How do I properly run benchmarks? I would love to contribute!
kernel void kernel_mul_mat_q4_0_f32(
        device const void * src0,
        device const float * src1,
        device float * dst,
        constant int64_t & ne00,
        constant int64_t & ne10,
        constant int64_t & ne0,
        threadgroup float * sum [[threadgroup(0)]],
        constant int64_t & ne01 [[buffer(4)]],
What is the function of [[buffer(4)]] here?
This gets the value from the buffer at index 4, corresponding to the following line in ggml-metal.m:
[encoder setBytes:&ne01 length:sizeof(ne01) atIndex:4];
@gauravpathak-infa I also have an M2 Ultra, and I am getting less than half of your numbers on these models; could you share your command line, please?
@amj Here are the commands: If you are getting half the numbers, the problem may be somewhere else. It may be useful to check whether you have compiled llama.cpp with Metal. If not:
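For the Metal check, a typical rebuild at the time of this PR (assuming the Makefile build rather than CMake) would be along the lines of:

make clean
LLAMA_METAL=1 make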
Yep, that did it; LLAMA_METAL had been dropped during some clean/build step. I'm getting the same numbers you were showing. Thanks!
Prefetch data to achieve better memory bandwidth utilization. With the new kernel, token generation is ~48% faster for the 33B model and ~14% faster for the 7B model. Tests with the 65B model are welcome.
The GPUs of the M1 Max / M2 Max can sustain 340 GB/s; can we reach that in llama.cpp? 😳
* GPU read bandwidth measured using the Developer Tools that come with Xcode.
** Measured with the following command on an M1 Max 32GB. Generation results are the same between this PR and the master branch.
./main -m model_file -n 128 -c 512 -s 12 -ngl 1 --no-mmap
Prompt 1 “”
Prompt 2 “I believe the meaning of life is”
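On the 340 GB/s question above, a rough upper bound: assuming a 7B q4_0 model is roughly 3.8 GB of weights and that generating each token has to read essentially all of the weights once, 340 GB/s / 3.8 GB ≈ 90 tokens/s, and for a roughly 18 GB 33B q4_0 model the same estimate gives about 19 tokens/s. Real numbers will be lower because of compute, synchronization, and non-ideal access patterns, but it shows that memory bandwidth, not arithmetic throughput, is the ceiling for single-stream token generation.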