[WIP] Improve performance on x86 #295
Conversation
This worked with gcc 11.3 but gave only a slight improvement. I've been looking at that function as well since it's clearly the hotspot. I was playing around with the AVX2 instructions, but it seems to be pretty much memory-bound. I tried using …
The hard-coded 32 in the prefetch distance is quite arbitrary; I wonder if different numbers would work better for your machine. As you approach the limits of your machine, other things you have running will also add more variation to the measured performance, so you might have to run it multiple times and compare the best results. The original code runs very consistently at ~425±5 ms/token for me, whereas the modified version varies between 340 and 380 ms/token across runs.
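For reference, here is a minimal sketch of the kind of prefetch being tuned, with the lookahead pulled out as a constant. This is an assumed scalar stand-in, not the actual ggml kernel; the function name and loop shape are illustrative only.

```c
#include <stddef.h>

// Illustrative dot-product loop with software prefetch. PREFETCH_DIST is the
// hard-coded lookahead discussed above (32 here) and is the knob to tune per
// machine: too short and the data isn't ready, too long and it evicts useful
// lines on chips with smaller caches.
#define PREFETCH_DIST 32

static float dot_f32(const float *restrict x, const float *restrict y, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        // rw = 0 (read), locality = 3 (keep in all levels, i.e. pull into L1)
        __builtin_prefetch(x + i + PREFETCH_DIST, 0, 3);
        __builtin_prefetch(y + i + PREFETCH_DIST, 0, 3);
        sum += x[i] * y[i];
    }
    return sum;
}
```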
I was independently trying to do something similar on the Q4_1 code here. I managed to squeeze out somewhere around 5% more performance by rearranging the SIMD math and avoiding a double load on the constant offsets, but saw no improvements from prefetching anything on my setup (a Skylake mobile Xeon, GCC 11).
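To illustrate the "avoid a double load on the constant offsets" idea in isolation: the sketch below is a simplified dequantize-and-accumulate loop, not the real Q4_1 kernel, and the function name and block layout are assumptions. The point is just that the per-block scale and offset are broadcast once, outside the SIMD inner loop, instead of being reloaded on every iteration.

```c
#include <immintrin.h>  // requires -mavx2 -mfma

// Simplified per-block update y[0..31] += d*q + m, where d (scale) and m
// (offset) are per-block constants. Broadcasting them once per block avoids
// re-loading the same constants on every SIMD iteration of the inner loop.
static void block_madd_f32(float d, float m, const float *q, float *y) {
    const __m256 vd = _mm256_set1_ps(d);  // scale, broadcast once
    const __m256 vm = _mm256_set1_ps(m);  // offset, broadcast once
    for (int j = 0; j < 32; j += 8) {
        const __m256 vq = _mm256_loadu_ps(q + j);
        __m256 vy = _mm256_loadu_ps(y + j);
        vy = _mm256_add_ps(vy, _mm256_fmadd_ps(vd, vq, vm));  // y += d*q + m
        _mm256_storeu_ps(y + j, vy);
    }
}
```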
I took a quick look at the wiki for the Skylake mobile Xeon; it looks like the L3 cache size there (8 MB) is less than the L1 cache (13 MB) on this i7-7700 desktop chip. The prefetch distance in this PR might be way too far for your chip? Here it's also trying to prefetch into L1; you might have better luck prefetching into L3 given the smaller cache size.
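If anyone wants to experiment with the target cache level, the x86 prefetch hints expose that directly. A hedged sketch follows; the mapping of hints to specific cache levels is approximate and varies by microarchitecture, and the helper names are made up.

```c
#include <xmmintrin.h>

// _MM_HINT_T0 requests the line in all cache levels (closest to the core),
// while _MM_HINT_T2 only requests it in the outer levels (roughly the LLC).
// On a part with a small L3, the gentler hint and/or a shorter prefetch
// distance may work better than prefetching aggressively into L1.
static inline void prefetch_into_l1(const void *p) {
    _mm_prefetch((const char *)p, _MM_HINT_T0);
}
static inline void prefetch_into_llc(const void *p) {
    _mm_prefetch((const char *)p, _MM_HINT_T2);
}
```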
Personally, I noticed an improvement on a 10700KF on Windows 10: from 270 ms to 241 ms per token on 13B Alpaca. The only part I took from this commit was the main-loop modification, since there is no #include <sched.h> on Windows, and I assume the thread-affinity code would need adjustment to work there as well. The gains could probably be bigger on the 30B and 65B models, and also if I got the thread-affinity part working on Windows.
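For the Windows side, a minimal sketch of the equivalent pinning via the Win32 API (untested here and not part of this PR; the helper name is made up):

```c
#ifdef _WIN32
#include <windows.h>

// Win32 counterpart of the <sched.h>-based affinity code in this PR:
// pin the calling thread to a single logical CPU.
static int pin_current_thread(int cpu) {
    DWORD_PTR mask = (DWORD_PTR)1 << cpu;
    // SetThreadAffinityMask returns the previous mask on success, 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}
#endif
```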
Please reopen when and if this is ready to merge |
Could someone please take over this pull request?
Unfortunately, I'm quite behind on a few other obligations, so I won't be able to continue exploring here. Feel free to take this as inspiration and make a production-ready version!
I did some initial exploration of various ways to squeeze more performance out of the main loop on my Ubuntu desktop with an i7-7700K CPU.
The code was compiled with gcc-10 and invoked with
./main -m ./models/7B/ggml-model-q4_0.bin -s 1679164839 -n 1280
Since inference is usually memory-bound, I specifically looked for ways to improve memory access.
It seems like a combination of prefetching + CPU pinning + loop unrolling can improve performance by up to ~25%.
The changes here are only tested on my machine, and I suspect the code won't even compile on other platforms.
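For anyone who wants to reproduce the CPU-pinning part, here is a minimal sketch of pinning a worker thread with <sched.h>/pthreads on Linux. Where exactly this hooks into the thread pool is an assumption on my part, not a copy of this diff.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU so its working set stays in that
// core's private caches; combined with prefetching and loop unrolling, this
// is the kind of change the ~25% figure above comes from.
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Returns 0 on success, an error number otherwise.
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```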