Token generation is extremely slow when using 13B models on an M1 Pro with llama.cpp, but it runs at a fine speed with Dalai (which uses an older version of llama.cpp) #767
Comments
A couple of things to check:
|
I followed the instructions in the repo (simply git clone the repo, cd into the folder and make, so I suppose by default it builds in release mode). I tried with f16 (which I understand are the q4_1 ones) and q4_0, with similar results. I can add that if I load a basic command like (video: CleanShot.2023-04-04.at.20.05.20.mp4). Video of the chat-13B example script: (video: CleanShot.2023-04-04.at.20.07.29.mp4) |
Hm, I don’t see a |
It’s chat-13B.sh but with the path to the vicuna model; I get the exact same results using the default chat-13B.sh with the standard Alpaca 13B model (and with every other example script in that folder). |
Probably relevant, #603 |
My guess is that one or more of the additional options the script is passing to |
The older version used by Dalai (https://github.com/candywrap/llama.cpp) doesn't include the changes pointed out in #603, which appear to have caused a significant performance regression. My assumption is that it's related to what we're investigating over there. |
I tried changing and removing the additional options, without results. Moreover, the strangest thing is that now even the simple (video: CleanShot.2023-04-04.at.23.56.29.mp4) |
I don't know that this is specifically the issue that I describe in #603. His behavior is different than mine, and might be related to memory and swap issues. I've seen problems for some users since the mmap() update, and what he's describing sounds more similar to one of those where performance plummets to unbearable levels straight from the start. I didn't even see this issue because it was closed before I got a chance to. The only reason I saw it was because it was referenced in my issue. @serovar can you look at your disk utilization as it's processing? |
Here: (video: CleanShot.2023-04-05.at.12.05.58.mp4) |
Okay, thanks. What does your RAM usage look like? Do you have spare RAM, or is it all allocated? Edit: but in relation to issue 603, this issue is different, and I think it only started happening in the last few days for users. The reason you don't have the issue with Dalai is that it doesn't have some of the more recent updates from this repo. |
It does not seem like the performance is directly correlated with RAM allocation. I installed htop to have a complete view of the process, and I managed to record two different sessions (one with decent speed and one very slow) with the same chat script. Here is the rapid one: (video: CleanShot.2023-04-05.at.13.00.20.mp4). Here is the slow one: (video: CleanShot.2023-04-05.at.14.13.55.mp4) |
Okay, so just to be clear: you're running the exact same command, and sometimes generation speed is horrible and other times it generates normally? One thing I noticed is that in the fast generation video your uptime looked like 6 minutes, while in the slow example your uptime was much longer. This may seem odd, but after restarting your computer and running the model for the first time, do you get faster generation? |
Can confirm I had this exact problem on an M1 Pro 16GB ram and rebooting fixed the issue 😄 |
Okay, we need some mmap people in here then, because there's definitely something that changed with it, and users aren't getting a clear indication of what's going on other than horrible performance. It may relate to mlock, but I'm on Windows and don't use that, so I'm not familiar with it. |
Aaaaand after loading Chrome and doing some other stuff, it's now back to being extremely slow. Also, I'm running this exact setup on an M1 Max with 64 GB RAM and not seeing the issue. It doesn't seem to be spiking CPU or RAM usage, though it's reading from disk at ~830 MB/s while trying to respond. |
Are you using mlock? I think what's happening is that mmap is allowing you to load a larger model than you'd normally be able to load, because you don't have enough memory, but the trade-off is that it performs very poorly because it can't keep what it needs in RAM. |
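To make that trade-off concrete, here is a minimal POSIX sketch, not llama.cpp's actual loader: the model path is a placeholder, and the point is only that an mmap'd file is faulted in lazily and can be evicted under memory pressure, while mlock pins it in RAM (and fails if the OS won't let you pin that much).

```cpp
// Minimal POSIX sketch (not llama.cpp's real loader) of mmap vs. mlock.
// The model path below is a placeholder.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char * path = "ggml-model-q4_0.bin";  // hypothetical model file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the file read-only; pages are faulted in lazily on first access
    // and can be evicted again whenever the OS is short on memory.
    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // Pinning the mapping is roughly what an --mlock-style option boils down to:
    // evictions stop, but the call fails if you aren't allowed to pin that much.
    if (mlock(addr, st.st_size) != 0) {
        perror("mlock");  // weights stay usable, but can still be paged out
    }

    // ... run inference over the mapped weights here ...

    munlock(addr, st.st_size);
    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```

If the model is larger than the RAM you can spare, the evict-and-re-read cycle shows up as exactly the kind of sustained disk traffic reported above.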
ah adding |
Same behavior here, thanks! |
Specifying --mlock did not fix the issue for me. I should preface that I'm not using Apple silicon, but I did experience poor performance on the 13B model compared with Dalai, as the OP described. Oddly, the only thing that ended up working for me was explicitly setting the number of threads to a substantially lower number than what is available on my system. Anecdotally, I got the best performance when specifying a thread count that is 1/4 of my available core count. For context, I have an i9-12900K processor that has 24 virtual cores available. When running with all 24 virtual cores it's basically unusable; each token takes many, many seconds to generate. This continues to be the case until I set the thread count to about 3/4 (16) of my available cores, but even then there are intermittent pauses where nothing happens for several seconds. Only when I get down to around half of my available cores (12) does it start to perform nominally, and it seems to improve further going down to 1/4 of my available cores. Hope this insight helps someone. |
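As a rough illustration of the heuristic in the comment above (an assumption about how to estimate a starting value, not project guidance): std::thread::hardware_concurrency() reports logical threads, 24 on an i9-12900K, so dividing by two to four lands near the physical or performance core count.

```cpp
// Rough sketch: derive a conservative -t value from the logical thread count.
// hardware_concurrency() counts logical threads (24 on an i9-12900K), so the
// divisor below is a tunable guess, not a rule.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    unsigned logical = std::thread::hardware_concurrency();  // may return 0
    if (logical == 0) logical = 4;                            // conservative fallback
    unsigned suggested = std::max(1u, logical / 4);           // e.g. 24 -> 6
    std::printf("logical threads: %u, suggested -t: %u\n", logical, suggested);
    return 0;
}
```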
I'm experiencing similar issues using llama.cpp with the 13B model on Ubuntu 22.10. The token generation is initially fast, but becomes unbearably slow as more tokens are generated. Here's the code snippet I'm using, but I'm not a C++ programmer and haven't worked with LLMs in ages, so the error might lie elsewhere:

```cpp
llama_context_params params = llama_context_default_params();
ctx = llama_init_from_file(model_path.c_str(), params);
if (!ctx)
{
    throw std::runtime_error("Failed to initialize the llama model from file: " + model_path);
}

std::vector<llama_token> tokens(llama_n_ctx(ctx));
int token_count = llama_tokenize(ctx, input.c_str(), tokens.data(), tokens.size(), true);
if (token_count < 0) {
    throw std::runtime_error("Failed to tokenize the input text.");
}
tokens.resize(token_count);

int n_predict = 50; // Number of tokens to generate
std::string output_str;
for (int i = 0; i < n_predict; ++i) {
    int result = llama_eval(ctx, tokens.data(), token_count, 0, 8);
    if (result != 0) {
        throw std::runtime_error("Failed to run llama inference.");
    }
    llama_token top_token = llama_sample_top_p_top_k(ctx, tokens.data(), token_count, 40, 0.9f, 1.0f, 1.0f);
    const char *output_token_str = llama_token_to_str(ctx, top_token);
    output_str += std::string(output_token_str);
    std::cout << output_str << std::endl;

    // Update context with the generated token
    tokens.push_back(top_token);
    token_count++;
}
return output_str;
```
Edit: Turns out I can't code. Works fine now, just had to get rid of the first token in each iteration. |
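For anyone who hits the same slowdown: the loop above re-evaluates the entire token list at position 0 on every iteration, so each step gets more expensive as the sequence grows. Below is a sketch of the usual incremental pattern with the llama.cpp C API as it existed at the time (llama_eval / llama_sample_top_p_top_k); it reuses ctx, tokens, token_count and n_predict from the snippet above, and the thread count is illustrative.

```cpp
// Sketch only: evaluate the prompt once, then feed a single new token per step
// with an advancing n_past, instead of re-running the whole sequence each time.
const int n_threads = 8;  // illustrative

// Evaluate the full prompt once.
if (llama_eval(ctx, tokens.data(), token_count, 0, n_threads) != 0) {
    throw std::runtime_error("Failed to evaluate the prompt.");
}
int n_past = token_count;

std::string output_str;
for (int i = 0; i < n_predict; ++i) {
    // Sample the next token from the logits of the last evaluated position.
    llama_token tok = llama_sample_top_p_top_k(
        ctx, tokens.data(), (int) tokens.size(), 40, 0.9f, 1.0f, 1.0f);
    output_str += llama_token_to_str(ctx, tok);

    // Feed only the new token; n_past tells the model where it belongs.
    if (llama_eval(ctx, &tok, 1, n_past, n_threads) != 0) {
        throw std::runtime_error("Failed to evaluate the generated token.");
    }
    tokens.push_back(tok);
    ++n_past;
}
return output_str;
```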
I'm having a similar experience with the following line: `main -i --threads 12 --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-model-q4_1.bin` It's a Ryzen 9 with 12 cores. Each token takes at least 2 seconds to appear. |
@ssuukk Does adding |
I haven't seen any case where setting your thread count high significantly improves people's performance. If you're on Intel, you want to set your thread count to the number of performance cores that you have. I have a Ryzen, and I could potentially use 24 threads, but I don't get any better performance at 18 than I do at 12. Usually when I run I use between 6 and 12, depending on what else is going on. People definitely don't want to be using anywhere near the max number of threads they can use, though. |
With --mlock it is as slow as without, but maybe even slower - now it takes 2 seconds to generate parts of the words! |
You should be setting -t to the number of P cores in your system. Your system has 8+8 IIRC (8*2 + 8 = 24), so set -t 8. You can modify this script to measure the scaling on your machine: https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py |
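A C++ sketch of the same measurement, assuming the llama.cpp C API of this period (llama_init_from_file / llama_tokenize / llama_eval); the model path and prompt are placeholders, and it times a single prompt evaluation per thread count.

```cpp
// Time one prompt evaluation at several thread counts and report throughput.
#include <chrono>
#include <cstdio>
#include <vector>
#include "llama.h"

int main() {
    llama_context_params params = llama_context_default_params();
    llama_context * ctx = llama_init_from_file("ggml-model-q4_0.bin", params);
    if (!ctx) return 1;

    std::vector<llama_token> tokens(llama_n_ctx(ctx));
    int n = llama_tokenize(ctx, "Building a website can be done in 10 simple steps:",
                           tokens.data(), (int) tokens.size(), true);
    if (n <= 0) { llama_free(ctx); return 1; }

    const int thread_counts[] = {4, 6, 8, 12, 16, 24};
    for (int n_threads : thread_counts) {
        auto t0 = std::chrono::steady_clock::now();
        if (llama_eval(ctx, tokens.data(), n, 0, n_threads) != 0) break;
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        std::printf("-t %2d: %.2f prompt tokens/s\n", n_threads, n / s);
    }

    llama_free(ctx);
    return 0;
}
```

Note that successive runs warm the page cache, so the first thread count measured can look slower than it really is.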
Expected Behavior
I can load a 13B model and generate text with it at a decent token generation speed on an M1 Pro CPU (16 GB RAM).
Current Behavior
When I load a 13B model with llama.cpp (like Alpaca 13B or other models based on it) and try to generate some text, every token takes several seconds to generate, to the point that these models are unusable because of how unbearably slow they are. But they work at reasonable speed using Dalai, which uses an older version of llama.cpp.
Environment and Context
MacBook Pro with M1 Pro, 16 GB RAM, macOS Ventura 13.3.
Python 3.9.16
GNU Make 3.81
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix
If you need some kind of log or other information, I will post everything you need. Thanks in advance.