
Possible solution for poor token generation performance in llama.cpp on dual Epyc Genoa/Turin systems #11744

Open
fairydreaming opened this issue Feb 8, 2025 · 6 comments

Comments

@fairydreaming
Collaborator

I have temporary access to a dual Epyc Turin system and found a little trick that restores normal token generation performance in llama.cpp on dual Epyc systems. The trick is to load and cache the model in memory during token generation rather than during prompt processing. You can use llama-bench for this.

First drop caches as root:

echo 3 > /proc/sys/vm/drop_caches

and then run llama-bench with only the generation benchmark:

llama-bench --numa distribute -t <number of threads> -m <model> -r 1 -p 0

Then use llama.cpp as usual (but don't drop caches again, so the model stays cached in memory). Of course you have to pass the same --numa distribute -t <number of threads> arguments to llama-cli or llama-server.
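Putting it together, the warm-up can be wrapped in a small script like the sketch below. MODEL, THREADS and the binary paths are placeholders you have to adjust for your system, and the llama-server line is only one example of "use llama.cpp as usual":

#!/bin/bash
# Placeholders - adjust for your system.
MODEL=/path/to/model.gguf
THREADS=32

# Drop the page cache (needs root) so the model is re-read from scratch.
sync
echo 3 > /proc/sys/vm/drop_caches

# Load and cache the model during token generation only:
# -p 0 skips the prompt-processing benchmark, -r 1 runs a single repetition.
./llama-bench --numa distribute -t $THREADS -m "$MODEL" -r 1 -p 0

# From here on, don't drop caches; pass the same NUMA/thread arguments.
./llama-server --numa distribute -t $THREADS -m "$MODEL"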

On the tested system this increased the token generation rate by about 80% (dual Epyc 9175F, 16 x DDR5 6400 MT/s RAM, Llama-3.1-70B-Instruct model, f16; tg increased from 2.4 t/s to 4.31 t/s).

Let me know if it works for you.

@nekiee13

nekiee13 commented Feb 9, 2025

When I run llama-bench, it hangs like this (nothing is happening)

./llama-bench --numa distribute -t 18 -m "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" -r 1 -p 0

| model | size | params | backend | ngl | threads | test | t/s |

I opened another WSL instance and ran llama-cli. This one hangs on:
....(truncated)
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)


The commands are:

./llama-bench --numa distribute -t 18 -m "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" -r 1 -p 0

./llama-cli --model "/mnt/i/LLMs/Qwen/Qwen2.5-32B-Instruct-GGUF/qwen2.5-32b-instruct-fp16-00001-of-00017.gguf" --ctx-size 400 --no-kv-offload --threads 18 --numa distribute -no-cnv --prio 3 --temp 0.65 --top_k 40 --top_p 0.9 --min-p 0.05 --seed 42 --prompt "<|User|>Why is the sky blue?<|Assistant|>"


Didn't use -r 1, as I don't want to halt prompt generation...

@fairydreaming
Collaborator Author

@nekiee13 Sorry, I've never used llama.cpp on Windows (why do you torture yourself with this abomination?), so I can't help with that. Maybe model loading simply takes a long time and you have to wait longer?
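One way to tell (assuming you can open another shell) is to watch whether the page cache is still growing while mmap pages the model in; if it grows at roughly disk speed, the "hang" is just slow loading:

# In a second terminal: buff/cache should keep growing while the model loads.
watch -n 2 free -h

# Optional: per-device disk utilization (iostat comes from the sysstat package).
iostat -x 2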

@nekiee13

nekiee13 commented Feb 9, 2025

No, it's WSL (Windows Subsystem for Linux), a feature of MS Windows that allows using a Linux environment without a separate virtual machine or dual booting.

So, my commands are OK (Linux-wise)? I mean, do you also run two terminals?

Next I'll try booting a clean Linux from USB and see if that works...

@nekiee13

Dual AMD 9124 (Linux) - Total time

  1. Mistral-Small-24B-Instruct-2501.BF16 - 12.51 tokens per second (ctx 35000)
  2. DeepSeek-R1-Distill-Llama-70B-GGUF Q8_0 - 4.56 tokens per second (ctx 35000)
  3. DeepSeek-R1-UD-Q2_K_XL - 2.13 tokens per second (ctx 32092)

Your tweak works great for dense models. For MoE, the numbers remain similar (single test).

I have to do more testing (a larger volume of runs, and more systematic), but I can live with these numbers.

You rule...

🥇

@fairydreaming
Collaborator Author

fairydreaming commented Feb 10, 2025

@nekiee13 Great that it worked for you! Regarding the MoE models my initial tests on DeepSeek R1 also found limited improvement when using 2 CPUs, so now I'm going to investigate the reason for this.

Edit: older MoE models like Mixtral 8x7B and 8x22B don't seem to be affected by this. They use ggml_mul_mat_id() in FFN that does not use llamafile_sgemm() internally, maybe that's the reason why.
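If anyone wants to check where these code paths live, a couple of greps over the source tree should be enough (paths are only a guess; the layout differs between llama.cpp versions):

# Dense mul_mat has a llamafile_sgemm fast path in the CPU backend...
grep -rn "llamafile_sgemm" ggml/src

# ...while the MoE expert matmul goes through the separate mul_mat_id path.
grep -rn "mul_mat_id" ggml/src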

@Readon

Readon commented Feb 23, 2025

> @nekiee13 Great that it worked for you! Regarding the MoE models my initial tests on DeepSeek R1 also found limited improvement when using 2 CPUs, so now I'm going to investigate the reason for this.
>
> Edit: older MoE models like Mixtral 8x7B and 8x22B don't seem to be affected by this. They use ggml_mul_mat_id() in FFN that does not use llamafile_sgemm() internally, maybe that's the reason why.

I tested DeepSeek-R1 2.51-bit on my dual E5 v2 CPU + 4 x 2080 Ti box. I could get 3.3 tokens per second tg while using --numa distribute and MoE offloading to system memory.

However, if I set it up to use only 1 CPU, the generation speed increases to 3.8 t/s.
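(For anyone trying to reproduce the single-CPU number: one way to restrict a run to one socket is numactl, pinning both cores and memory to a single node. The node number, model path and thread count below are just placeholders.)

numactl --cpunodebind=0 --membind=0 ./llama-cli -m /path/to/model.gguf -t 18 -p "Why is the sky blue?"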

I guess llama.cpp could be improved by splitting tensor placement and computation across NUMA nodes, just as #11333 proposed.
