10+% performance improvement of ggml_vec_dot_q4_0 on AVX2 #654
Conversation
This PR would change the behavior of evaluation. It makes the output of LLaMA extremely different from what Apple M1 produces on the same inputs. Please compare your change with 02c5b27, which was fine. What's interesting is that this PR causes behavior to regress in the exact same way #642 did, which was merged a moment ago. I raised the same concern there too: #642 (comment). I've confirmed that this PR gives an 11% performance boost on inference, which I measured on an Intel(R) Core(TM) i9-9900 CPU @ 3.10GHz. Would it be possible to get that performance boost without changing the behavior of LLaMA? If not, why is it not possible to preserve consistency of behavior?
There are issues with compiling on MSVC with this pull request: error C2065: '__m128i_u': undeclared identifier. It's probably resolvable as mentioned here: https://stackoverflow.com/questions/68939552/can-i-assign-the-result-of-intrinsic-that-returns-m128i-to-variable-of-the-typ
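For illustration, a hedged sketch of the kind of workaround the linked Stack Overflow answer points at: MSVC does not define GCC/Clang's `__m128i_u` type, so instead of dereferencing through it, the unaligned load can go through `_mm_loadu_si128`, which is portable across compilers. The helper name below is made up for the example.

```c
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang-only, fails on MSVC with C2065:
 *     const __m128i v = *(const __m128i_u *)src;
 * Portable alternative: use the unaligned-load intrinsic instead. */
static inline __m128i load_u128(const uint8_t *src)  /* illustrative helper name */
{
    return _mm_loadu_si128((const __m128i *)src);
}
```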
Could you elaborate a bit more on what you mean by "changes the behavior of the evaluation" and how you produced that outcome? I tried to replicate your findings on my machine, but I was not successful. I've tried to reproduce this with three execution runs, all with the same seed:
The text output was identical; the only difference was the improved timings (751.68 ms per run -> 516.78 ms per run), see below.
Joining the discussion here, coming from #642... I'm seeing a tiny speed improvement of ~2% here, certainly not 1.5x. This is on an i3-8100. No change in output compared to master after the merge of #642, which is in line with @jart's observation:
I've also tried to replicate with
@sw That's very interesting; apparently the performance is highly dependent on the specific constellation (CPU, compiler, operating system, ...). Would it be possible for you to try to compile with the compiler switch
The problem (at least on my machine) is: even with the SAME executable, I get a lot of variance in the "ms per run" eval timings. Between the minimum (448 ms per run) and the maximum (598 ms per run) there was a delta of ~150 ms, which is 33% of the total time per run (see below). My hypothesis is that it highly depends on what else is happening on the system. This means it's going to be very hard to derive any conclusion from a single run. I think we need a consistent "test harness" with identical settings and compiler switches that executes several runs to get to a statistical view (a sketch of what I mean follows below). I've started to work towards something like this in PR #653.
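A minimal sketch of what such a harness could look like (names and structure are illustrative, not the actual code from #653): time the same workload many times and report min/median/max rather than trusting a single run.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_RUNS 100

// Placeholder for the code under test, e.g. a ggml_vec_dot_q4_0 micro-benchmark.
extern void run_benchmark_once(void);

static int cmp_double(const void *a, const void *b)
{
    const double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

int main(void)
{
    double ms[N_RUNS];

    for (int i = 0; i < N_RUNS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_benchmark_once();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ms[i] = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    // Sort the per-run timings and report a statistical view instead of one number.
    qsort(ms, N_RUNS, sizeof(double), cmp_double);
    printf("min %.2f ms, median %.2f ms, max %.2f ms\n",
           ms[0], ms[N_RUNS / 2], ms[N_RUNS - 1]);
    return 0;
}
```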
Help run a stats calculation based on FLOPS per us:
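As a rough, illustrative sketch of how a FLOP-per-µs figure for this kind of dot product can be derived (assuming roughly two floating-point operations per element pair, i.e. one multiply and one add):

```c
// Illustrative only: a dot product over n element pairs costs roughly
// 2*n floating-point operations (one multiply and one add per pair).
static double flop_per_us(long n_elements, double elapsed_us)
{
    return 2.0 * (double)n_elements / elapsed_us;
}
```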
@howard0su Nice! What do "a" and "b" stand for in your comment?
Impressive work! I honestly don't fully understand how this provides performance improvements over the already optimized code from #642, but I can replicate the improvement locally on my machine as well.

It seems that after these changes, the AVX512 version can't keep pace, and it's actually significantly slower to use that version now, even on AVX512-capable hardware. I've made some attempts to port your changes over to the AVX512 code as well, but without success. My best guess for the reason is that the extremely dense 4-bit data has more overhead than benefit for AVX512 at this point. For example, there is no single instruction to broadcast a 32-bit value from memory to a zmm/512-bit register. Additionally, I've noticed that while shorter in instruction count, the AVX512 code actually has a longer latency due to data dependencies. The CPU seems to be able to parallelize work across three execution ports in your improved AVX2 version (AVX2 execution trace) vs. only two in the AVX512 version (AVX512 execution trace).

I've created a godbolt comparison between your AVX2 version and my attempt at porting it to AVX512. You can switch between them by toggling the define on line 20.

Anyway, I figure there's a good chance you don't have access to an AVX512-enabled CPU yourself, so if you're not able to replicate or continue with the AVX512 porting of your improvements, I think the best thing to do for now would be to remove or disable the AVX512 codepath and default back to AVX2 for all CPUs that support it.

EDIT: I've done some more investigation, and it seems that

Updated godbolt comparison using clang 13: https://c.godbolt.org/z/Yv5szs6Kh
clang-13 AVX2 execution trace: https://uica.uops.info/tmp/63000579f2b24ee182d450170a4f5dd8_trace.html
clang-13 AVX512 execution trace: https://uica.uops.info/tmp/d65cfb55599a41ffbb0cbcbe22c8dcbe_trace.html

With clang-13 on my 7950x CPU, I'm seeing:
With gcc-12, I'm seeing:
I've wasted a couple more hours on this. I've realized that once I get ~6 threads going, there are no detectable differences between the AVX2 and AVX512 versions. The whole thing becomes completely memory bound, and no change of compiler or AVX2 vs. AVX512 moves more than a percent or two away from 100 ms/token. I'm not going to spend any more time trying to optimize this myself at this point. My opinion is that we should just drop the AVX512 implementation. It provides no detectable benefit for me in the best case, and makes things slower in the worst case.
I agree with @Ameobea that we should drop AVX512 for now, until we have an implementation that actually benefits from these instructions. Moreover, now that we use -march=native by default ...

Regarding reproducibility of the results: it is totally expected for the produced results to vary between different SIMD implementations, since we are dealing with floating-point round-off errors and the model is auto-regressive, meaning that small variability at the start can cause totally different results at the end. The thing that we are interested in remaining approximately the same is the perplexity. This is the quality metric that we currently have, and as long as it is not degraded, we can accept changes in the SIMD implementation that improve the performance.

I haven't tested this PR on x86 since it is more tedious for me, but if you observe good perplexity values and a performance improvement, please go ahead and merge these changes.
Important
When you evaluate the perplexity, remember that for a batch size >= 32, ggml will currently switch to using the OpenBLAS sgemm if you are linking against it, and this piece of code will not be evaluated at all. So make sure to either use a smaller batch or, better, disable OpenBLAS when evaluating the perplexity (an illustrative sketch of this dispatch follows below).
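To make concrete why a large batch bypasses this kernel, here is a hedged sketch of that dispatch. The function and parameter names are made up for illustration and are not the actual ggml source.

```c
// Illustrative sketch only (names are invented, not the actual ggml source):
// with BLAS linked in and a large enough batch (roughly >= 32), the matrix
// multiplication is handed to sgemm and the quantized dot-product kernel is
// never executed.
static void mul_mat_q4_0_sketch(int n_batch, int have_blas)
{
    if (have_blas && n_batch >= 32) {
        // dequantize + cblas_sgemm(...)   -> ggml_vec_dot_q4_0 is NOT evaluated
    } else {
        // ggml_vec_dot_q4_0(...)          -> the kernel this PR optimizes
    }
}
```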
I would be very happy to verify this. But: I started the perplexity binary and it tells me:
Is there someone with a beefier machine who could help out?
@ggerganov I would suggest moving that to a separate PR and keeping the removal in a single commit. This would help keep things clean, and it would be easier to revert the commit if the AVX512 code were still wanted by someone.
Awesome, thank you for the links - I did not know those existed, they're super helpful.
Yes, I currently do not have access to an AVX512 CPU. This is why I suggested moving that into a separate issue here: #654 (comment)
I was also under the impression that clang seems to produce slightly better assembly. But I am not sure whether I am comparing it to gcc without
Doing some short but repetitive testing of the latest commits: about the same performance as the initial commit of this PR. Compared to current master, it's on average 190 ms per run versus 165 ms per run, with this PR being the faster one. So a 10-15% performance increase is still pretty accurate. Output remains the same.
Given @ggerganov's approval on a previous revision, and the detailed statistics provided by @SebastianApel (thanks!), I'm merging this even though I see no improvement on my machine.
Glad to see this merged. Little by little, inference is getting faster.
This brought in a bunch of trailing whitespace, sorry for not catching that. @SebastianApel: please configure your editor properly.
UPDATE:
@slaren @rabidcopy @sw @howard0su
First, let me say "sorry" for the confusion: I was not clear in my original post about what the baseline was. This was mainly because #642 was merged AFTER I had branched to run the tests whose results I reported.
When I published this pull request, I had not realized that #642 existed, nor that it had been merged. My results were therefore based on the pre-#642 code version.
I have re-run the benchmarks now with three versions: the "pre-PR642" AVX2 code, the AVX2 code from #642 and the AVX2 code from this PR (#654).
At least on my machine I get (for 100 iterations of the benchmark from #653) a 14% performance increase, which is similar to what @jart reported in #654 (comment).
I have therefore changed the title of the PR. Again, sorry for the confusion.
Please let me know your thoughts.
Results Summary (single thread performance)
Please find the raw data of the benchmark runs here: benchmark-data.csv
Boxplot of the results:
Technical details
PREVIOUS POST:
(I leave it here for the sake of transparency)
This change produces a ~1.5x performance increase of ggml_vec_dot_q4_0 on AVX2.

Root causes for the performance improvement: the existing code interleaves the nibbles in bytesFromNibbles and then separates them again in ggml_vec_dot_q4_0.
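For context, a minimal sketch (not the exact ggml implementation) of the kind of AVX2 nibble expansion bytesFromNibbles performs: 16 packed bytes holding 32 four-bit values are widened into 32 separate bytes, which the dot product then consumes. The helper name and the output layout here are illustrative assumptions.

```c
#include <immintrin.h>
#include <stdint.h>

// Illustrative sketch, not the exact ggml code: widen 32 packed 4-bit values
// (16 bytes) into 32 separate bytes. Low nibbles end up in the lower 128-bit
// lane, high nibbles in the upper lane.
static inline __m256i expand_nibbles_avx2(const uint8_t *p)
{
    const __m128i packed  = _mm_loadu_si128((const __m128i *)p);
    const __m256i both    = _mm256_broadcastsi128_si256(packed);      // duplicate into both lanes
    const __m256i shifted = _mm256_srli_epi16(both, 4);               // move high nibbles down
    const __m256i mixed   = _mm256_blend_epi32(both, shifted, 0xF0);  // lower lane: original, upper lane: shifted
    return _mm256_and_si256(mixed, _mm256_set1_epi8(0x0F));           // keep only 4 bits per byte
}
```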
Benchmark data for the code in the master branch (commit 02c5b27) - produced with the tool from pull request #653.
Benchmark data for code in this pull request: