≈65% speedup of the AVX-512 implementation of ggml_vec_dot_q4_0()
#933
Conversation
Today, an "official" microbenchmark for

- `make benchmark` output for 0e07e6a (the "old" AVX-512 version)
- `make benchmark` output for 0e07e6a with AVX-512 disabled (the AVX2 version)
- `make benchmark` output for d787348 (the "new" AVX-512 version)
I used the following patch for 0e07e6a to force AVX2 instead of AVX-512:

```diff
diff --git a/ggml.c b/ggml.c
index 42e3ee3..9c72456 100644
--- a/ggml.c
+++ b/ggml.c
@@ -1967,7 +1967,7 @@ static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void * rest
     }

     sumf = sum0 + sum1;
-#elif defined(__AVX512F__)
+#elif 0 && defined(__AVX512F__)
     // Initialize accumulator with zeros
     __m512 acc0 = _mm512_setzero_ps();
     __m512 acc1 = _mm512_setzero_ps();
```

I would appreciate it if someone with AVX-512-enabled hardware could also run the official microbenchmark to determine if this is worth merging.
FWIW, I didn't intend to close the PR. I merged the latest upstream commits to run the microbenchmark and accidentally force-pushed 0e07e6a (the latest upstream master I used for microbenchmarking) into my branch. GitHub apparently interpreted this as "this PR has already been merged, let's close it". I fixed this now (by force-pushing d787348 and re-opening the PR); sorry for the noise.
According to the datasheet, my CPU has 2 "AVX-512 FMA Units", so I could give it a try. However, I noticed that the benchmark folder has no
I think I was able to build
Below you can find the outputs for both the up-to-date branch and your commit. However, I think I'm doing something wrong (or something stupid 😅): I do not obtain any output from the benchmark when I build with

- benchmark when using 43ddef, all settings default
- benchmark when using d787348 with AVX512=ON | AVX512-VBMI=OFF | AVX512-VNNI=OFF
- benchmark when using d787348 with AVX512=ON | AVX512-VBMI=OFF | AVX512-VNNI=ON
- benchmark when using d787348 with AVX512=ON | AVX512-VBMI=ON | AVX512-VNNI=OFF
I can confirm that this is a big speedup for me, well done!
master branch:
It does make me think we are spending a lot of extra effort shuffling values around because of the memory layout.
@KASR

```cpp
#include <immintrin.h>
#include <iostream>

int main() {
    __m512 x = _mm512_set1_ps(1.0f);
    // Note: _mm512_set1_epi8() returns __m512i, so a (GCC vector-extension)
    // cast to __m512bh is needed for this to compile.
    __m512bh y = (__m512bh)_mm512_set1_epi8(2);
    __m512bh z = (__m512bh)_mm512_set1_epi8(3);
    char out[64]{};
    _mm512_storeu_ps(out, _mm512_dpbf16_ps(x, y, z));
    std::cout << static_cast<int>(out[1]) << std::endl;
}
```

I compiled it with
When I run the program under debugger, I can see that it traps:
So I believe this is the reason you "do not obtain any output from the benchmark when I build with AVX512-VBMI enabled". There's a
That sounds like a neat idea. I tried implementing a quick proof-of-concept to see how fast it can potentially be (for AVX-512 only), but it appears to be harder than I thought:
At this point I gave up. If anyone wants to take a stab at this, that would be great.
Here are the benchmark results so far, summarized (the value is the average of
I think that at this point it is only worth benchmarking the new AVX-512 implementation against the current AVX2 implementation, since the current AVX-512 implementation is consistently worse than the current AVX2 one. I know of two ways to force AVX2 for matrix multiplication:
I cannot test this as I don't have the hardware.
The reported results are encouraging so I think it is OK to merge this and continue to support AVX512 for now.
The test on Raptor Lake fails, as Intel has dropped support for AVX512 in the latest generations of consumer CPUs. It does support AVX-VNNI, though (256-bit instructions instead of the 512-bit ones in AVX512); would it be possible to use that?
Fixed a trivial merge conflict after 0ad9646. Otherwise, nothing has changed. For posterity's sake: one thing really bothering me in this PR is that I can't use the `_mm*_sign_ps()` trick from the AVX implementation because apparently

@Ameobea As the author of the original AVX-512 implementation (in #320), would you be willing to take a look at this PR before I merge it? I would appreciate any feedback (I will make separate PRs if necessary). Also, you mentioned here that you have an AMD CPU supporting AVX-512. It would be interesting to run the benchmarks on an AMD CPU, because so far we only have measurements for Intel.
I think it's possible. The current AVX2 implementation (where AVX-VNNI would belong) multiplies 16-bit integers (split across two different registers) instead of 8-bit integers, so we would only save one instruction instead of two. However, the patch should be just a couple of lines (i.e., replace

I don't have access to any AVX-VNNI-enabled hardware, though, so I can't test/benchmark it. I can make a separate PR if you're willing to help test it.
Sure. I can run it on an i7-13700.
I have an AMD 7950X, but I need instructions to run the benchmark: `make benchmark` on dfyz:master is not working; there is no such make target.
Huh, this is strange. The head of

Anyway, I just rebased onto the latest master and it appears that

I think I'm going to merge this as is, and then work on optimizing
Apologies for the slightly clickbaity title: while technically true, as mentioned in this comment, the current AVX-512 implementation is slower than the regular AVX2 implementation. Compared to the current AVX2 implementation, the new AVX-512 implementation only gets a ≈35% speedup on modern hardware.

The measurements below were made on a laptop with a Tiger Lake CPU (`i7-1165G7`). 8 threads were used.

- LLaMA 7B (`./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 256`)
- LLaMA 13B (`./main -m ./models/13B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 256`)

For clarity, I only include the time of generation per token, but the prompt processing improvements are very similar.

These speedup percentages are, of course, an oversimplification:

- Token generation doesn't spend all of its time in `ggml_vec_dot_q4_0()`, it also does quite a lot of other stuff.
- I used `gcc 12.2.1`; the results with other compilers can vary (but not too much). `clang`, in particular, does some "optimizations" that only hurt performance (fortunately, not in a major way).
- The new implementation uses the `VBMI` and `VNNI` extensions, when available. If they are not present, the code runs slightly slower.

However, my microbenchmark (`./build.sh && ./main` on a Linux system with `libbenchmark` and `gcc` installed) suggests solid improvements over both the AVX2 and old AVX-512 implementations across a variety of CPUs. Unless I screwed up something major, the improvements in the microbenchmark should directly lead to improvements in generation. I would appreciate any performance measurements with this PR applied.

Implementation-wise, the basic idea is that built-in masking and register-wide shuffles in AVX-512 allow us to operate on two Q4_0 blocks at once. I tried to comment the code extensively.