Possible solution for poor token generation performance in llama.cpp on dual Epyc Genoa/Turin systems #11744
Comments
When I run llama-bench, it hangs like this (nothing is happening):
./llama-bench --numa distribute -t 18 -m "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" -r 1 -p 0
I opened another WSL instance and ran llama-cli. This one hangs too. The commands are:
./llama-bench --numa distribute -t 18 -m "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" -r 1 -p 0
./llama-cli --model "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" --ctx-size 400 --no-kv-offload --threads 18 --numa distribute -no-cnv --prio 3 --temp 0.65 --top_k 40 --top_p 0.9 --min-p 0.05 --seed 42 --prompt "<|User|>Why is the sky blue?<|Assistant|>"
Didn't use …
@nekiee13 Sorry, I've never used llama.cpp on Windows (why do you torture yourself with this abomination?) so I can't help with that. Maybe model loading simply takes a long time and you have to wait longer?
No, it's WSL (Windows Subsystem for Linux), a feature of MS Windows that allows using a Linux environment without a separate virtual machine or dual booting. So my cmds are ok (Linux-wise)? I mean, do you run two terminals also? I'll try booting clean Linux from USB next and see if that works...
Dual AMD 9124 (Linux) - Total time
Your tweak works great for dense models. For MoE, the numbers remain similar (single test). I have to do more testing (by sheer volume & more systematically), but I can live with these numbers. You rule... 🥇
@nekiee13 Great that it worked for you! Regarding the MoE models, my initial tests on DeepSeek R1 also found limited improvement when using 2 CPUs, so now I'm going to investigate the reason for this. Edit: older MoE models like Mixtral 8x7B and 8x22B don't seem to be affected by this. They use …
I tested deepseek-r1 2.51-bit on my dual E5 v2 CPU + 4 x 2080 Ti box. I could get 3.3 tokens per second tg while using --numa distribute and offloading the MoE weights to system memory. However, if I set it up to use only 1 CPU, the generation speed boosts to 3.8 t/s. I guess llama.cpp could be improved by splitting tensor placement and computation across NUMA nodes, just as #11333 proposed.
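For comparison, one way to force llama.cpp onto a single CPU as described above is to pin both threads and memory to one NUMA node with numactl. This is only a sketch, not the exact setup from that comment; the node number, thread count, model path and prompt are placeholders:
# Pin threads and allocations to NUMA node 0, and let llama.cpp follow numactl's CPU map.
numactl --cpunodebind=0 --membind=0 ./llama-cli -m /path/to/model.gguf -t 16 --numa numactl -p "Why is the sky blue?"
The --numa numactl mode is meant to make llama.cpp use the CPU map provided by numactl instead of distributing threads across all nodes.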
I have temporary access to a dual Epyc Turin system and found a little trick that restores normal token generation performance in llama.cpp on dual Epyc systems. The trick is to load and cache the model in memory during token generation rather than during prompt processing. You can use llama-bench for this.
First drop caches as root:
echo 3 > /proc/sys/vm/drop_caches
and then run llama-bench with only the generation benchmark:
llama-bench --numa distribute -t <number of threads> -m <model> -r 1 -p 0
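Taken together, the warm-up can be wrapped in a small script (a sketch; the thread count and model path below are placeholders, not the values from the results further down):
#!/bin/sh
# Run as root: flush the page cache, then reload the model via a tg-only benchmark.
echo 3 > /proc/sys/vm/drop_caches
# -p 0 skips the prompt-processing benchmark and -r 1 does a single repetition,
# so only the token generation pass reads the model into the page cache.
llama-bench --numa distribute -t 32 -m /path/to/model.gguf -r 1 -p 0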
Then use llama.cpp as usual (but don't drop caches, so the model stays loaded in memory). Of course you have to pass the same
--numa distribute -t <number of threads>
arguments to llama-cli or llama-server. On the tested system this increased the token generation rate by 80% (dual Epyc 9175F, 16 x DDR5 6400 MT/s RAM, Llama-3.1-70B-Instruct, f16; tg increased from 2.4 t/s to 4.31 t/s).
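For example, a follow-up llama-server run with matching flags could look like this (a sketch; the context size, host, port, thread count and path are made-up placeholders, and only the --numa and -t arguments need to match the warm-up):
llama-server -m /path/to/model.gguf --numa distribute -t 32 --ctx-size 4096 --host 127.0.0.1 --port 8080
Before starting it you can check that the model is still resident in the page cache with free -h or by looking at the Cached line in /proc/meminfo.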
Let me know if it works for you.