Token generation is extremely slow when using 13B models on an M1 Pro with llama.cpp, but it runs at a fine speed with Dalai (which uses an older version of llama.cpp) #767
Comments
A couple of things to check:
|
I followed the instructions in the repo (simply git clone the repo, cd into the folder and make, so I suppose by default it builds in release mode). I tried with f16 (which I understand are the q4_1 ones) and q4_0, with similar results. I can add that if I load a basic command like (video: CleanShot.2023-04-04.at.20.05.20.mp4). Video of the chat-13B example script: (video: CleanShot.2023-04-04.at.20.07.29.mp4) |
Hm, I don’t see a |
It’s chat-13B.sh but with the path to the vicuna model; I get the exact same results using the default chat-13B.sh with the standard Alpaca 13B model (and with every other example script in that folder). |
Probably relevant, #603 |
My guess is that one or more of the additional options the script is passing to |
The older version used by Dalai (https://github.com/candywrap/llama.cpp) doesn't include the changes pointed out in #603, which appear to have caused a significant performance regression. My assumption is that it's related to what we're investigating over there. |
I tried changing and removing the additional options, without results. Moreover, the strangest thing is that now even the simple (video: CleanShot.2023-04-04.at.23.56.29.mp4) |
I don't know that this is specifically the issue that I describe in #603. His behavior is different than mine, and might be related to memory and swap issues. I've seen problems for some users since the mmap() update, and what he's describing sounds more similar to one of those where performance plummets to unbearable levels straight from the start. I didn't even see this issue because it was closed before I got a chance to. The only reason I saw it was because it was referenced in my issue. @serovar can you look at your disk utilization as it's processing? |
Here: (video: CleanShot.2023-04-05.at.12.05.58.mp4) |
Okay, thanks. What does your RAM usage look like? Do you have spare RAM, or is it all allocated? Edit: but in relation to issue 603, this issue is different, and I think it only started happening in the last few days for users. The reason you don't have the issue with Dalai is that it doesn't have some of the more recent updates from this repo. |
It does not seem like the performance is directly correlated with RAM allocation. I installed htop to have a complete view of the process, and I managed to record two different sessions (one with decent speed and one very slow) with the same chat script. Here is the rapid one: (video: CleanShot.2023-04-05.at.13.00.20.mp4). Here is the slow one: (video: CleanShot.2023-04-05.at.14.13.55.mp4) |
Okay, so just to be clear: you're running the exact same command, and sometimes generation speed is horrible and other times it generates normally? One thing I noticed is that in the fast generation video your uptime looked like 6 minutes, while in the slow example your uptime was much longer. This may seem odd, but after restarting your computer and running the model for the first time, do you get faster generation? |
Can confirm I had this exact problem on an M1 Pro 16GB ram and rebooting fixed the issue 😄 |
Okay, we need some mmap people in here then, because there's definitely something that changed with it, and users aren't getting a clear indication of what's going on other than horrible performance. It may relate to mlock, but I'm on Windows and don't use that, so I'm not familiar with it. |
Aaaaand after loading Chrome and doing some other stuff, it's now back to being extremely slow. Also, I'm running this exact setup on an M1 Max with 64 GB RAM and not seeing the issue. It doesn't seem to be spiking CPU or RAM usage, though it's reading from disk at ~830 MB/s while trying to respond. |
Are you using mlock? I think what's happening is that mmap is allowing you to load a larger model than you'd normally be able to load, because you don't have enough memory, but the trade-off is that it performs very poorly because it can't keep what it needs in RAM. |
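To make that trade-off concrete, here is a minimal POSIX sketch, not llama.cpp's actual loader: the model path is a placeholder, and the point is only that an mmap'd file is faulted in lazily and can be evicted under memory pressure, while mlock pins it in RAM (and fails if the OS won't let you pin that much).

```cpp
// Minimal POSIX sketch (not llama.cpp's real loader) of mmap vs. mlock.
// The model path below is a placeholder.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char * path = "ggml-model-q4_0.bin";  // hypothetical model file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Map the file read-only; pages are faulted in lazily on first access
    // and can be evicted again whenever the OS is short on memory.
    void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // Pinning the mapping is roughly what an --mlock-style option boils down to:
    // evictions stop, but the call fails if you aren't allowed to pin that much.
    if (mlock(addr, st.st_size) != 0) {
        perror("mlock");  // weights stay usable, but can still be paged out
    }

    // ... run inference over the mapped weights here ...

    munlock(addr, st.st_size);
    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```

If the model is larger than the RAM you can spare, the evict-and-re-read cycle shows up as exactly the kind of sustained disk traffic reported above.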
ah adding |
Same behavior here, thanks! |
Specifying --mlock did not fix the issue for me. I should preface that I'm not using Apple silicon, but I did experience poor performance on the 13B model compared with Dalai, as the OP described. Oddly, the only thing that ended up working for me was explicitly setting the number of threads to a substantially lower number than what is available on my system. Anecdotally, I got the best performance when specifying a thread count that is 1/4 of my available core count. For context, I have an i9-12900K processor that has 24 virtual cores available. When running with all 24 virtual cores it's basically unusable; each token takes many, many seconds to generate. This continues to be the case until I set the thread count to about 3/4 (16) of my available cores, but even then there are intermittent pauses where nothing happens for several seconds. Only when I get down to around half of my available cores (12) does it start to perform nominally, and it seems to improve further going down to 1/4 of my available cores. Hope this insight helps someone. |
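As a rough illustration of the heuristic in the comment above (an assumption about how to estimate a starting value, not project guidance): std::thread::hardware_concurrency() reports logical threads, 24 on an i9-12900K, so dividing by two to four lands near the physical or performance core count.

```cpp
// Rough sketch: derive a conservative -t value from the logical thread count.
// hardware_concurrency() counts logical threads (24 on an i9-12900K), so the
// divisor below is a tunable guess, not a rule.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    unsigned logical = std::thread::hardware_concurrency();  // may return 0
    if (logical == 0) logical = 4;                            // conservative fallback
    unsigned suggested = std::max(1u, logical / 4);           // e.g. 24 -> 6
    std::printf("logical threads: %u, suggested -t: %u\n", logical, suggested);
    return 0;
}
```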
I'm experiencing similar issues using llama.cpp with the 13B model on Ubuntu 22.10. The token generation is initially fast, but becomes unbearably slow as more tokens are generated. Here's the code snippet I'm using, but I'm not a C++ programmer and haven't worked with LLMs in ages, so the error might lie elsewhere:

```cpp
llama_context_params params = llama_context_default_params();
ctx = llama_init_from_file(model_path.c_str(), params);
if (!ctx)
{
    throw std::runtime_error("Failed to initialize the llama model from file: " + model_path);
}

std::vector<llama_token> tokens(llama_n_ctx(ctx));
int token_count = llama_tokenize(ctx, input.c_str(), tokens.data(), tokens.size(), true);
if (token_count < 0) {
    throw std::runtime_error("Failed to tokenize the input text.");
}
tokens.resize(token_count);

int n_predict = 50; // Number of tokens to generate
std::string output_str;
for (int i = 0; i < n_predict; ++i) {
    int result = llama_eval(ctx, tokens.data(), token_count, 0, 8);
    if (result != 0) {
        throw std::runtime_error("Failed to run llama inference.");
    }
    llama_token top_token = llama_sample_top_p_top_k(ctx, tokens.data(), token_count, 40, 0.9f, 1.0f, 1.0f);
    const char *output_token_str = llama_token_to_str(ctx, top_token);
    output_str += std::string(output_token_str);
    std::cout << output_str << std::endl;

    // Update context with the generated token
    tokens.push_back(top_token);
    token_count++;
}
return output_str;
```
Edit: Turns out I can't code. Works fine now, just had to get rid of the first token in each iteration. |
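For anyone who hits the same slowdown: the loop above re-evaluates the entire token list at position 0 on every iteration, so each step gets more expensive as the sequence grows. Below is a sketch of the usual incremental pattern with the llama.cpp C API as it existed at the time (llama_eval / llama_sample_top_p_top_k); it reuses ctx, tokens, token_count and n_predict from the snippet above, and the thread count is illustrative.

```cpp
// Sketch only: evaluate the prompt once, then feed a single new token per step
// with an advancing n_past, instead of re-running the whole sequence each time.
const int n_threads = 8;  // illustrative

// Evaluate the full prompt once.
if (llama_eval(ctx, tokens.data(), token_count, 0, n_threads) != 0) {
    throw std::runtime_error("Failed to evaluate the prompt.");
}
int n_past = token_count;

std::string output_str;
for (int i = 0; i < n_predict; ++i) {
    // Sample the next token from the logits of the last evaluated position.
    llama_token tok = llama_sample_top_p_top_k(
        ctx, tokens.data(), (int) tokens.size(), 40, 0.9f, 1.0f, 1.0f);
    output_str += llama_token_to_str(ctx, tok);

    // Feed only the new token; n_past tells the model where it belongs.
    if (llama_eval(ctx, &tok, 1, n_past, n_threads) != 0) {
        throw std::runtime_error("Failed to evaluate the generated token.");
    }
    tokens.push_back(tok);
    ++n_past;
}
return output_str;
```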
I'm having a similar experience with the following line: `main -i --threads 12 --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-model-q4_1.bin` It's a Ryzen 9 with 12 cores. Each token takes at least 2 seconds to appear. |
@ssuukk Does adding |
I haven't seen any case where setting your thread count high significantly improves people's performance. If you're on Intel, you want to set your thread count to the number of performance cores that you have. I have a Ryzen, and I could potentially use 24 threads, but I don't get any better performance at 18 than I do at 12. Usually when I run I use between 6 and 12, depending on what else is going on. People definitely don't want to be using anywhere near the max number of threads they can use, though. |
With --mlock it is as slow as without, but maybe even slower - now it takes 2 seconds to generate parts of the words! |
You should be setting -t to the number of P cores in your system. Your system has 8+8 IIRC (8*2 + 8 = 24), so set -t 8. You can modify this script to measure the scaling on your machine: https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py |
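A C++ sketch of the same measurement, assuming the llama.cpp C API of this period (llama_init_from_file / llama_tokenize / llama_eval); the model path and prompt are placeholders, and it times a single prompt evaluation per thread count.

```cpp
// Time one prompt evaluation at several thread counts and report throughput.
#include <chrono>
#include <cstdio>
#include <vector>
#include "llama.h"

int main() {
    llama_context_params params = llama_context_default_params();
    llama_context * ctx = llama_init_from_file("ggml-model-q4_0.bin", params);
    if (!ctx) return 1;

    std::vector<llama_token> tokens(llama_n_ctx(ctx));
    int n = llama_tokenize(ctx, "Building a website can be done in 10 simple steps:",
                           tokens.data(), (int) tokens.size(), true);
    if (n <= 0) { llama_free(ctx); return 1; }

    const int thread_counts[] = {4, 6, 8, 12, 16, 24};
    for (int n_threads : thread_counts) {
        auto t0 = std::chrono::steady_clock::now();
        if (llama_eval(ctx, tokens.data(), n, 0, n_threads) != 0) break;
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        std::printf("-t %2d: %.2f prompt tokens/s\n", n_threads, n / s);
    }

    llama_free(ctx);
    return 0;
}
```

Note that successive runs warm the page cache, so the first thread count measured can look slower than it really is.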
Expected Behavior
I can load a 13B model and generate text with it at a decent token generation speed on an M1 Pro CPU (16 GB RAM).
Current Behavior
When I load a 13B model with llama.cpp (like Alpaca 13B or other models based on it) and try to generate some text, every token takes several seconds to generate, to the point that these models are unusable because of how unbearably slow they are. But they work at reasonable speed using Dalai, which uses an older version of llama.cpp.
Environment and Context
MacBook Pro with M1 Pro, 16 GB RAM, macOS Ventura 13.3.
Python 3.9.16
GNU Make 3.81
Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix
If you need some kind of log or other information, I will post everything you need. Thanks in advance.