
Possible solution for poor token generation performance in llama.cpp on dual Epyc Genoa/Turin systems #11744

Open
fairydreaming opened this issue Feb 8, 2025 · 6 comments

Comments

@fairydreaming
Collaborator

I have temporary access to a dual Epyc Turin system and found a little trick that restores normal token generation performance in llama.cpp on dual Epyc systems. The trick is to load and cache the model in memory during token generation rather than during prompt processing. You can use llama-bench for this.

First drop caches as root:

echo 3 > /proc/sys/vm/drop_caches

and then run llama-bench with only the generation benchmark:

llama-bench --numa distribute -t <number of threads> -m <model> -r 1 -p 0

Then use llama.cpp as usual (but don't drop caches again, so the model stays cached in memory). Of course you have to pass the same --numa distribute -t <number of threads> arguments to llama-cli or llama-server.
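Putting it together, the warm-up can be wrapped in a small script like the sketch below. MODEL, THREADS and the binary paths are placeholders you have to adjust for your system, and the llama-server line is only one example of "use llama.cpp as usual":

#!/bin/bash
# Placeholders - adjust for your system.
MODEL=/path/to/model.gguf
THREADS=32

# Drop the page cache (needs root) so the model is re-read from scratch.
sync
echo 3 > /proc/sys/vm/drop_caches

# Load and cache the model during token generation only:
# -p 0 skips the prompt-processing benchmark, -r 1 runs a single repetition.
./llama-bench --numa distribute -t $THREADS -m "$MODEL" -r 1 -p 0

# From here on, don't drop caches; pass the same NUMA/thread arguments.
./llama-server --numa distribute -t $THREADS -m "$MODEL"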

On the tested system this increased the token generation rate by about 80% (dual Epyc 9175F, 16 x DDR5 6400 MT/s RAM, Llama-3.1-70B-Instruct model, f16; tg increased from 2.4 t/s to 4.31 t/s).

Let me know if it works for you.

@nekiee13

nekiee13 commented Feb 9, 2025

When I run llama-bench, it hangs like this (nothing is happening)

./llama-bench --numa distribute -t 18 -m "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" -r 1 -p 0

| model | size | params | backend | ngl | threads | test | t/s |

I opened another WSL instance and ran llama-cli. This one hangs on:
....(truncated)
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)


The commands are:

./llama-bench --numa distribute -t 18 -m "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" -r 1 -p 0

./llama-cli --model "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" --ctx-size 400 --no-kv-offload --threads 18 --numa distribute -no-cnv --prio 3 --temp 0.65 --top_k 40 --top_p 0.9 --min-p 0.05 --seed 42 --prompt "<|User|>Why is the sky blue?<|Assistant|>"


Didn't use -r 1, as I don't want to halt prompt generation...

@fairydreaming
Collaborator Author

@nekiee13 Sorry, I've never used llama.cpp on Windows (why do you torture yourself with this abomination?), so I can't help with that. Maybe model loading simply takes a long time and you have to wait longer?
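One way to tell (assuming you can open another shell) is to watch whether the page cache is still growing while mmap pages the model in; if it grows at roughly disk speed, the "hang" is just slow loading:

# In a second terminal: buff/cache should keep growing while the model loads.
watch -n 2 free -h

# Optional: per-device disk utilization (iostat comes from the sysstat package).
iostat -x 2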

@nekiee13

nekiee13 commented Feb 9, 2025

No, it's WSL (Windows Subsystem for Linux), a feature of MS Windows that allows using a Linux environment without a separate virtual machine or dual booting.

So, my commands are OK (Linux-wise)? I mean, do you also run two terminals?

Next I'll try booting a clean Linux from USB and see if that works...

@nekiee13

Dual AMD 9124 (Linux) - Total time

  1. Mistral-Small-24B-Instruct-2501.BF16 - 12.51 tokens per second (ctx 35000)
  2. DeepSeek-R1-Distill-Llama-70B-GGUF Q8_0 - 4.56 tokens per second (ctx 35000)
  3. DeepSeek-R1-UD-Q2_K_XL - 2.13 tokens per second (ctx 32092)

Your tweak works great for dense models. For MoE, the numbers remain similar (single test).

I have to do more testing (a larger volume of runs, and more systematic), but I can live with these numbers.

You rule...

🥇

@fairydreaming
Collaborator Author

fairydreaming commented Feb 10, 2025

@nekiee13 Great that it worked for you! Regarding the MoE models my initial tests on DeepSeek R1 also found limited improvement when using 2 CPUs, so now I'm going to investigate the reason for this.

Edit: older MoE models like Mixtral 8x7B and 8x22B don't seem to be affected by this. They use ggml_mul_mat_id() in FFN that does not use llamafile_sgemm() internally, maybe that's the reason why.
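If anyone wants to check where these code paths live, a couple of greps over the source tree should be enough (paths are only a guess; the layout differs between llama.cpp versions):

# Dense mul_mat has a llamafile_sgemm fast path in the CPU backend...
grep -rn "llamafile_sgemm" ggml/src

# ...while the MoE expert matmul goes through the separate mul_mat_id path.
grep -rn "mul_mat_id" ggml/src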

@Readon

Readon commented Feb 23, 2025

> @nekiee13 Great that it worked for you! Regarding the MoE models my initial tests on DeepSeek R1 also found limited improvement when using 2 CPUs, so now I'm going to investigate the reason for this.
>
> Edit: older MoE models like Mixtral 8x7B and 8x22B don't seem to be affected by this. They use ggml_mul_mat_id() in FFN that does not use llamafile_sgemm() internally, maybe that's the reason why.

I tested DeepSeek-R1 2.51-bit on my dual E5 v2 CPU + 4 x 2080 Ti box. I could get 3.3 tokens per second tg while using --numa distribute and MoE offloading to system memory.

However, if I set it up to use only 1 CPU, the generation speed increases to 3.8 t/s.
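(For anyone trying to reproduce the single-CPU number: one way to restrict a run to one socket is numactl, pinning both cores and memory to a single node. The node number, model path and thread count below are just placeholders.)

numactl --cpunodebind=0 --membind=0 ./llama-cli -m /path/to/model.gguf -t 18 -p "Why is the sky blue?"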

I guess llama.cpp could be improved by splitting tensor placement and computation across NUMA nodes, just as #11333 proposed.
