Performance Discrepancy: gpt4all Faster than Optimized llama.cpp #603
Comments
The screenshot on the left has FMA and SSE3 enabled; is that the one that is faster? Try building both with the same flags.
No, that's what I'm saying. The left is optimized llama.cpp, the right is unoptimized gpt4all. The unoptimized gpt4all is significantly faster than the optimized llama.cpp. So there's something wrong.
Without bisecting the exact commit that introduced a performance regression it is hard to do much about it. I suspect that it happened when the code for transposed matrix multiplication was replaced with a copy.
First I'm just trying to check if someone else can reproduce the behavior. I def wanna use the llama.cpp version because it has more features haha
Double check you have built
Personally I had a small performance regression (13B model) over the last 14 days, from 207 ms to 238 ms, but it's no biggie.
I see significant slowness on Windows when comparing the latest llama.cpp 30B with the gpt4all LoRA.
I also notice performance drops on x86-64 Linux; it also uses a lot more memory than before. I compiled the project following the instructions in the README.md.
@MillionthOdin16 How did you do that? I couldn't get this to work yesterday. See this issue
Did you try it in the last few hours? There were some commits a couple of hours ago that made it easy. Look at the README section that was added for GPT4All. It kinda drives me crazy that the one dude forked llama.cpp and then stopped maintaining it, because other repos are forking his repo, which is outdated.
What is gpt4all, and what are the changes compared to llama.cpp and alpaca.cpp? I arrived here by parachute, I don't really know the context, I apologize.
We should definitely look into this as this definitely shouldn't be the case. I'm pretty confident, though, that enabling the optimizations didn't do that, since when we did that in #375 the perf was pretty well researched. If performance got lost and memory usage went up somewhere along the way, we'll need to look at where this happened. If it doesn't run well, everyone is just going to fork from an older point in time instead. @BrunoIsaac27 alpaca.cpp is pretty much a fork of an older llama.cpp version (which is apparently faster), but not much is really changed except a few default variables. gpt4all is a fork of the alpaca.cpp fork with modifications tailored specifically to the gpt4all model.
Someone on Windows should try and bisect to see where this alleged degradation happens - I don't have a Windows machine. I'm pretty sure there is no performance degradation on macOS and Linux. The only slowness introduced, as @slaren mentioned, was the removal of the transposed matrix multiplication. But overall, we need some more specific info about your setup and timing numbers to be able to pinpoint the problem.
alpaca.cpp / gpt4all was forked specifically at this point: 9b4a15b. There are exactly 180 commits between then and now, obviously too many to test manually. Commit list
I was thinking what we'd need is a script which: start -> create directory -> git clone and build commit id -> log the performance of some runs to a file -> remove directory -> loop to start. That would narrow it down to exactly where it happened. Unfortunately the GitHub Action runners are very limited; otherwise it would be easy to incorporate a performance test, run on every pull request, into the test suite and nip these issues in the bud before the situation devolves like this and we have to go bug hunting through a list of 180 commits.
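A minimal sketch of the kind of driver script described above. Everything concrete in it is an assumption: the `make`-based build, the `./main` flags, the model path, and the idea that wall-clock time of a fixed 256-token generation is a good-enough proxy for per-token speed; it would need adjusting to the real setup.

```python
#!/usr/bin/env python3
"""Rough sketch of the build-and-benchmark loop described above."""
import csv
import shutil
import subprocess
import time

REPO = "https://github.com/ggerganov/llama.cpp"
COMMITS = ["9b4a15b", "9cbc404"]        # range endpoints from the thread; fill in the rest
MODEL = "/path/to/model.bin"            # hypothetical model path

def run(cmd, cwd=None):
    subprocess.run(cmd, cwd=cwd, check=True)

def bench(commit: str) -> float:
    """Clone, check out and build one commit, then time a fixed generation."""
    workdir = f"bench-{commit}"
    run(["git", "clone", REPO, workdir])
    run(["git", "checkout", commit], cwd=workdir)
    run(["make"], cwd=workdir)          # assumes a plain `make` builds ./main
    start = time.perf_counter()
    run(["./main", "-m", MODEL, "-n", "256", "-t", "8", "-p", "Hello"], cwd=workdir)
    elapsed = time.perf_counter() - start
    shutil.rmtree(workdir)
    return elapsed

if __name__ == "__main__":
    with open("results.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["commit", "seconds"])
        for c in COMMITS:
            writer.writerow([c, f"{bench(c):.2f}"])
```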
I'm thinking it would be a good thing to measure performance more accurately. Wall-clock time is not good enough, especially if there are other things happening on the system; you have to let it run for a long time to get a good enough average. Another pitfall may be that a test suite causes downclocking of the processor, so the first test will get a cold processor running at full speed, and the later tests will have to run on a hot, slow processor.
@sw Very good points. "Preheating" a processor before a set of runs would ensure stability between sets of runs. Then again, the first-run advantage quickly dissipates when doing a larger set of runs; especially on desktop computers with good cooling this isn't much of a problem, unless you happened to just fire up your PC and instantly start a test. There is also the problem, especially in the case of Windows (from Windows 8 onwards, each new iteration being worse than the one before), of the OS being so noisy that it affects performance greatly, so a large set of runs is required to achieve anything resembling an accurate average. That is more about perf testing generally, though; in this case the perf drop is significant enough to be visible on inspection, so probably just a run or three would be enough to narrow down where it happened. I think that is the most feasible way to go about this, since manually trying to figure out from the now much-changed codebase where the problem lies would be harder than just letting a script run and having the perf test tell you. edit: the downclocking article you linked is exactly the thing I was trying to remember. I actually posted something earlier about AVX512 sometimes having worse performance for some workloads, especially on the earlier Intel processors which first introduced the set, but couldn't remember the exact cause - that was definitely it. Whatever improvement AVX512 brought to the table was offset by the downclocking, so that overall performance actually decreased.
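One way to blunt the cold-start and downclocking effects discussed above is to throw away a few warm-up runs and report the median rather than the mean. A small helper along those lines (the function name and the example command are illustrative, not from the repository):

```python
import statistics
import subprocess
import time

def measure(cmd, warmup: int = 2, runs: int = 5) -> float:
    """Time a command several times; discard warm-up runs and return the median.

    The warm-up runs heat caches and let the CPU settle into a steady clock state,
    and the median is less sensitive to one-off interference from the OS.
    """
    for _ in range(warmup):
        subprocess.run(cmd, check=True, capture_output=True)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Hypothetical usage with a llama.cpp binary:
# print(measure(["./main", "-m", "model.bin", "-n", "128", "-p", "Hello"]))
```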
This can be done as a binary search too; the `git bisect` command automates it.
Very interesting, I actually had no idea such a command existed. Seems useful in many cases. However, in this case, since there was also the format change somewhere in the middle, it'd simply be easier to go about it sequentially until you run into the commit where the format was changed, change the model only once and proceed. Or run two binary searches, for the commits before and after the model change; that is also an option. To be honest, when you've already gone to the trouble of setting the script up and have some compute time set aside for it, a sequential run would also give a log of the performance deltas of all the commits and show which ones increased performance and which ones decreased it, as it might not be any single commit that's causing it but a pile-up of smaller decreases here and there. There have obviously been steps forward but also steps back in the bunch.
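If one does go the `git bisect run` route, the predicate script only has to build the checked-out commit, time a fixed run, and signal the result through its exit code (0 = good, 1 = bad, 125 = skip this commit). A sketch under assumed paths and an assumed, illustrative time threshold:

```python
#!/usr/bin/env python3
"""Sketch of a predicate for `git bisect run`; threshold and commands are assumptions."""
import subprocess
import sys
import time

THRESHOLD_S = 60.0   # illustrative cutoff for "slow" on a fixed 256-token run

def main() -> int:
    if subprocess.run(["make"]).returncode != 0:
        return 125                       # unbuildable commit: tell bisect to skip it
    start = time.perf_counter()
    result = subprocess.run(
        ["./main", "-m", "/path/to/model.bin", "-n", "256", "-p", "Hello"],
        capture_output=True,
    )
    if result.returncode != 0:
        return 125                       # e.g. incompatible model format: skip
    return 0 if time.perf_counter() - start < THRESHOLD_S else 1

if __name__ == "__main__":
    sys.exit(main())
```

Assuming the sketch is saved as an executable `bisect_bench.py`, it would be driven with `git bisect start`, `git bisect bad <recent commit>`, `git bisect good 9b4a15b`, and `git bisect run ./bisect_bench.py`.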
I've written a small Python script to benchmark the token time as a function of the number of threads. I've added the script as an attachment if anyone wants to try it ( --> benchmark_threads.txt , I had to change the extension in order to upload it). It could be useful to benchmark performance for different versions. If not, just ignore this message :) Below you can see the results from my PC. I'm using Windows 10; the typical system information looks like:
I didn't go up to the 36 threads; you can see the results below. The script runs each prompt 5 times and plots the average token times. For some reason the timings go up at around 18 threads (i.e. the number of physical cores). I will try it again later to see how robust the timings are. Feel free to propose suggestions to make the benchmark more reliable. Edit: I've updated the plot so that it includes the eval and prompt eval as well (don't mind the typos). It's really strange that the performance drops around 18 threads (i.e. the number of physical cores) and afterwards drops again...
@KASR at the moment it seems that you are only measuring the prompt eval time; I would recommend considering the prompt eval time and the regular (generation) eval time separately. You can safely assume that 1 run = 1 token in the eval time.
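For scripting that separation, the per-run summary can be scraped from the program output. A sketch that assumes timing lines shaped like `...: prompt eval time = 123.45 ms ...` and `...: eval time = 678.90 ms ...`; the exact labels (e.g. `predict time` in older builds) and the output stream have varied between versions, so the regex is deliberately loose:

```python
import re
import subprocess

# Matches lines like "llama_print_timings: eval time = 1234.56 ms ..." or
# "main: predict time = 1234.56 ms ..."; anything else is ignored.
TIMING_RE = re.compile(r"^\s*\S+:\s*(?P<label>.+?)\s*=\s*(?P<ms>[\d.]+)\s*ms", re.MULTILINE)

def timings(cmd):
    """Run the binary and return {label: milliseconds} from its timing summary."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    text = out.stdout + out.stderr   # timings have been printed to stderr in some versions
    return {m.group("label"): float(m.group("ms")) for m in TIMING_RE.finditer(text)}

if __name__ == "__main__":
    # Hypothetical invocation; flags differ between llama.cpp versions.
    t = timings(["./main", "-m", "model.bin", "-n", "128", "-p", "Hello"])
    print(t.get("prompt eval time"), t.get("eval time"))
```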
Wow, thanks for the critical thinking guys. You've mentioned some pretty interesting points. It's pretty crazy to see what affects performance, and looking at some of the discussions it seems like there are things like Intel perf cores that can have a significant impact (although not in my case 🙃). I can definitely help with testing and performance metrics, I just need to make a script that'll get reliable builds between versions for my environment. It's pretty picky and often needs tweaks to make the build succeed. One of the differences/struggles right now is that the current llama.cpp gives much more performance metric info than the build used in gpt4all, so it's hard to see the specific timings in the older gpt4all version. Apart from that, I'd want to make sure that the info I'm collecting while running builds for specific commits is actually the info that will help us.
As for my build and build process, I have a Ryzen 3900x (12c, 24t) and use CMake and ninja to build my executables. I've also built with blas linked, but haven't seen a noticeable difference while using the library vs not. Other than that I use avx avx2 maxv maxv2 f16c sss3 on Release. I usually run with -t 8. And the models I use are the 4-bit quantized 7B.
Where should I expect to see the performance increases when I'm running with BLAS? Is it during larger completions after the prompt is loaded? I could also do things in WSL2, but I'm not sure about the performance impacts, which is why I currently don't use it. If you think it would be better let me know. Again, awesome job guys! You're having a huge impact on making these models accessible to normal people 🔥🔥🔥
@MillionthOdin16 For Windows, definitely the most common configuration (4 or 8 threads, AVX=yes, SSE3=yes, F16C=yes, AVX2=yes, AVX512=no, BLAS=no, WSL2=no) would be the best to base the benchmarks on. Obviously, if you want and have the time for it, more is always better. The most important thing to know would be the performance data between commits, starting from 9b4a15b and ending at 9cbc404. That is the thing which will help in understanding where the decreases happened. Since there have been many commits with optimizations and performance increases, it makes no sense that gpt4all/alpaca/llama-9b4a15b is faster; it should be slower, because they don't have any of the recent optimizations. That leads to only one conclusion: there must have been significant decreases at some points in the timeline. It can be something not easily seen, like the compiler dropping inlining because of some change (inline and force-inline aren't the same, and the compiler can even drop force-inline), or a mistake somewhere; we can't really know. Only data can save us. 😄
I've been able to observe some performance degradation on Arch Linux as well. I didn't have time to look for the precise commit yet, but I found the potentially helpful information that the degradation seems to have happened after the ggml format migration, which may help simplify the exploration. I think it would be nice if someone else could confirm this and make sure this isn't something that happens for me only 😅 I'll keep doing some exploration, but here are the numbers I observed so far. System info: Arch Linux - CPU: Intel Core i7 7700K. Compiled using I only use 4 threads, as 8 threads tend to cause performance degradation for me. ed3c680
074bea2 (first commit using new format)
This might come in handy for the tech-savvy lads here who need a slight performance boost: #295
Thank you @cyyynthia , something is definitely up here. Interestingly, the new format made load times go up 40%, but the sampling and predict times stayed the same (within margin of error). I've only now woken up to this since you don't tend to see marginal changes (in general, in anything) when you've always been on the latest version and didn't notice the performance degrading gradually. But obviously now everything is much slower - loading, sampling, prompt evaluation - and this is a high-priority issue. Anyone trying out gpt4all/alpaca.cpp vs current-gen llama.cpp will find it painfully obvious, while for someone just developing incrementally it has gone by unnoticed.
After a bit more digging, #439 seems to be a very clear culprit of performance drops: eval time goes from 175ms (better than before!) @ 404e1da to 244ms @ 483bab2. It seems other timings do fluctuate in some intriguing ways, with increased loading times and sample times. I'll try to put together a test script and plot the evolution of these values over time on my machine.
I've done a first test and I've already gathered some interesting data. I have run my script on 18 commits from the range cited earlier, skipping 10 commits each time. Doing it on the full range will take a while, so I wanted to see what I could get without too much precision. Here are the graphs I've gathered, and the raw CSV data if you want to explore it some more. I'll run it on all 180 commits later, probably tomorrow.
@cyyynthia: it's great that you're putting in all this work. I had a go at reverting #439 and #500, don't know if that would help you: sw@f747a43 (edit: some mistakes hopefully fixed: sw@f6d83dc). It's faster than latest master at first glance, but I really haven't tested it thoroughly.
@sw Oh yes it is! Only 333s runtime! 107ms per token on average, this is by far the fastest run of them all!! Here is the graph (with only interesting commits): And the raw CSV data:
@cyyynthia relevant commits (for context tooltips). Also thanks to the awesome people who are diving into performance analysis! Really appreciate the effort by all of you <3
The vertical axis is in microseconds, and the horizontal axis is the number of tokens. I ran main with set parameters and let it generate 2000 tokens. For each token, I logged the time it took to generate it and plotted it on the graph. Each line is a single test on a specific commit (single run). The time it takes to generate a single token grows over time, as there are more and more things in the context store to care about. The yellow line represents when #439 was introduced, where we can see the time it takes to generate a single token grows much faster than before (the green line). The blue line is the current master branch, and we can see the regression is still very pronounced. The red line is sw's patch, which is master with #439 (and #500) reverted. We can see the time it takes to generate a single token is much lower, and grows much more slowly (exactly like it used to). We can also see there is an overall performance increase, from all the SIMD optimizations that have been done in the past days. I hope this explains it well enough! edit: for reference, here are the total run times for all runs (from newest to oldest):
That's awesome! I always wonder what the actual performance increase is when I see commits with additional optimizations. This plot is super useful for evaluating performance. I wonder if we can turn it into a script that runs for each new commit / is published as a workflow, both as a performance sanity check and just because it's cool to see the progress :) Just an idea for the future. But awesome job!!!
I did not notice any performance improvement for sw@f747a43 =.=
Main:
I'm highly unsure whether perplexity is a meaningful test for measuring performance, at least here. As I've illustrated in my graphs, at the beginning the token time is fine as long as only very few tokens are in the context, and only as the context fills up does the degradation show, and we get further and further away from the baseline. At 512 tokens, my data shows a 2x slowdown in token time, and at 2000 tokens a 7.5x slowdown. I think the fact that OpenBLAS is broken is sort of expected, as the point of the branch was to see if the same degradation of token time was occurring if we reverted the offending commit but kept everything else - and according to my data the degradation completely vanishes - not to find a proper fix for the degradation yet. edit: I'm even more convinced that perplexity and the shown ETA are not a good test at all, as it seems the measurement is done on an empty context (see here), meaning it is totally blind to the core of this issue and what my analysis points out.
I see a significant improvement for sw@f747a43 . My experience on Windows matches cyyynthia's data. I didn't realize how bad it had gotten until I ran the reverted version and generation was so much faster. Thanks :)
Glad to see that this shows an improvement for some of you. I must admit I just quickly threw this together. I haven't looked at OpenBLAS at all, so I apologize for that being broken (edit: can't get OpenBLAS to link right away, but some mistakes are hopefully fixed here: sw@f6d83dc). Reverting #439 is straightforward, but my second commit, which purports to revert #500, is a bit disingenuous as far as the commit description is concerned. It certainly deserves further scrutiny. Also, my intention is certainly not to have this slapped back onto master without further investigation. The intentions behind #439 and #500, namely reducing the complexity of the code base, were good after all.
@ivanstepanovftw I agree with @cyyynthia, we first need to confirm the root cause of the performance degradation; right now it does appear to be #439. On a side note, it has been discussed that perplexity is improved when using BLAS (see #406 for more info), so it's difficult to compare the results with/without BLAS. I also agree with @sw: depending on the number of tokens, especially at n=2048, the improvements are very big (at least on Windows when using CMake and VS to build, can't say anything for other configurations), i.e. this has a significant impact during interactive mode.
I believe this is incorrect. It's not the -n that matters, it's how many things are in the context memory (i.e. -n_ctx and how far we are in the generation/interaction). On the revert branch, I've had significantly faster responses in interactive mode on the 13B model. I don't have data to back all this up, but I'm pretty sure the impact is the same in interactive mode.
Thank you for this hard work - I missed this regression because I rarely run generations with more than a few tens of tokens. The problem is that the transpose operation for the V matrix is very slow and becomes slower and slower with every new token added. I think I have provided a fix here: #775. Tested only on M1 so far.
#775 was merged pretty quickly. @cyyynthia, if it's not too much trouble, could you update your pretty graphs with that?
Also @cyyynthia, is there a way we can adapt your process to analyze future builds? Do you have a script or something that you were using that could help us create something like that? People have done some pretty cool stuff and I want to make sure it isn't lost once the issue is resolved.
@sw The graphs aren't super pretty this time because I didn't take the time to properly close everything and had a bunch of things open in the background while the test was running 😅 That being said, the regression appears to be gone. 🎉 Here are the graphs and the raw CSV:
@MillionthOdin16 As described earlier, my process was simply shoving a per-token timer into the generation loop and logging each token's time. It could be added to llama.cpp itself behind a compile flag for enabling it more easily. But it's not rocket science, and I didn't use any special script for it.
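For anyone wanting to recreate that kind of plot from their own logs, a matplotlib sketch is below. The CSV layout (`commit,token_index,microseconds`, one row per generated token) is a made-up format for illustration, not the one used in the runs above:

```python
import csv
from collections import defaultdict

import matplotlib.pyplot as plt

# Hypothetical input format: rows of "commit,token_index,microseconds".
series = defaultdict(list)
with open("token_times.csv", newline="") as f:
    for commit, token_index, us in csv.reader(f):
        series[commit].append((int(token_index), float(us)))

# One line per tested commit: per-token generation time as the context fills up.
for commit, points in series.items():
    points.sort()
    plt.plot([i for i, _ in points], [t for _, t in points], label=commit)

plt.xlabel("tokens generated")
plt.ylabel("time per token (µs)")
plt.legend()
plt.savefig("token_times.png")
```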
Okay, thanks for summarizing the process. That graph makes me so happy ❤️
Well I guess that settles it. The Great Wizard Georgi has saved the day! Thanks to @cyyynthia and @KASR for putting in the hard work of tracking this down. I have opened #790 to track the discrepancy in the different partial times vs total time. I think this issue could be closed. Thanks everyone.
Expected Behavior
I am comparing the performance of two executables: llama.cpp (current version) and the default gpt4all executable (which uses a previous version of llama.cpp). I am using the same language model for both executables, and I expect the current version of llama.cpp (which is built specifically for the hardware) to perform at least as fast as the default gpt4all executable.
Current Behavior
The default gpt4all executable, which uses a previous version of llama.cpp, performs significantly faster than the current version of llama.cpp. Despite building the current version of llama.cpp with hardware-specific compiler flags, it consistently performs significantly slower when using the same model as the default gpt4all executable.
Environment and Context
I am running the comparison on a Windows platform, using the default gpt4all executable and the current version of llama.cpp with the model included in the gpt4all project. The version of llama.cpp is the latest available (after compatibility with the gpt4all model was added).
Steps to Reproduce
Here's some context/config when I'm doing the runs:
(left panel is latest llama.cpp, right panel is gpt4all build)
This is the older version that gpt4all uses (with some tweaks): https://github.com/zanussbaum/gpt4all.cpp
To quickly test the difference yourself, you can use the gpt4all default binaries here: https://github.com/nomic-ai/gpt4all/tree/main/chat