NUMA-aware KV cache buffer type (experimental) #11580

Draft · wants to merge 1 commit into base: master

Conversation

fairydreaming (Collaborator) commented Feb 1, 2025

This PR contains an experimental NUMA-aware KV cache buffer implementation so that people can try it and check whether it improves performance on multi-CPU systems.

IMPORTANT: this mechanism works only in conjunction with the --no-kv-offload option.

Also, this code most likely won't compile on Windows, so it's not merge-ready.

The idea behind this is to allocate memory for KV buffer tensors the same way as for model tensors - that is, by using the mmap() function. In this case we are not mapping an existing file into memory but passing the MAP_ANONYMOUS flag - you can think of it as a zero-filled virtual file. The purpose is to let the first-touch policy place the pages backing the KV cache on the NUMA node that will use the given KV cache memory fragment.

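A minimal sketch of the allocation step (illustrative only, not the PR's actual code; the helper name kv_buffer_reserve is made up for this example): reserve the KV buffer with an anonymous mmap() so that no physical pages are committed up front, leaving page placement to the kernel's first-touch policy.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Reserve a zero-filled, file-less mapping for the KV cache buffer.
// No physical pages are allocated here; each page is faulted in on its
// first write and placed on the NUMA node of the CPU doing that write.
static void * kv_buffer_reserve(size_t size) {
    void * ptr = mmap(nullptr, size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS,  // the "zero-filled virtual file"
                      /*fd=*/ -1, /*offset=*/ 0);
    if (ptr == MAP_FAILED) {
        perror("mmap");
        return nullptr;
    }
    return ptr;
}
```
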
This prevents allocating the whole KV cache on a single NUMA node, as is currently done. To illustrate the problem, examine this numactl --hardware output taken in the middle of loading a very large model:

available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 32 33 34 35
node 0 size: 48051 MB
node 0 free: 19959 MB
node 1 cpus: 4 5 6 7 36 37 38 39
node 1 size: 48381 MB
node 1 free: 20286 MB
node 2 cpus: 8 9 10 11 40 41 42 43
node 2 size: 48381 MB
node 2 free: 20168 MB
node 3 cpus: 12 13 14 15 44 45 46 47
node 3 size: 48381 MB
node 3 free: 20425 MB
node 4 cpus: 16 17 18 19 48 49 50 51
node 4 size: 48381 MB
node 4 free: 20279 MB
node 5 cpus: 20 21 22 23 52 53 54 55
node 5 size: 48381 MB
node 5 free: 154 MB
node 6 cpus: 24 25 26 27 56 57 58 59
node 6 size: 48337 MB
node 6 free: 20392 MB
node 7 cpus: 28 29 30 31 60 61 62 63
node 7 size: 48338 MB
node 7 free: 20200 MB

Note that NUMA node 5 has no free memory left while the other nodes still have around 20 GB free. In this case the KV cache was around 20 GB, so it was allocated entirely on NUMA node 5. Since this node now has no free memory, model weights that would originally have been placed in its local memory cannot be, so CPU cores associated with this node have to make foreign (non-local) memory accesses to use them.
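
For illustration, one way to steer first-touch placement (a hedged sketch of the approach, not code from this PR; the helper name and the even per-node slicing are assumptions) is to have a worker pinned to each NUMA node perform the first writes to the slice of the anonymous KV mapping that node will later use, so those pages end up local:

```cpp
#include <numa.h>      // libnuma; link with -lnuma
#include <cstring>
#include <thread>
#include <vector>

// Have each NUMA node fault in its own slice of the KV mapping so that
// first-touch places those pages in that node's local memory.
static void first_touch_kv_slices(char * kv_base, size_t kv_size, int n_nodes) {
    const size_t slice = kv_size / n_nodes;   // assumes an even split for simplicity
    std::vector<std::thread> workers;
    for (int node = 0; node < n_nodes; ++node) {
        workers.emplace_back([=] {
            numa_run_on_node(node);            // restrict this thread to the given node
            std::memset(kv_base + (size_t) node * slice, 0, slice); // first touch -> local pages
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```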

With this PR we have uniform memory usage across all NUMA nodes:

available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 32 33 34 35
node 0 size: 48051 MB
node 0 free: 20269 MB
node 1 cpus: 4 5 6 7 36 37 38 39
node 1 size: 48381 MB
node 1 free: 20847 MB
node 2 cpus: 8 9 10 11 40 41 42 43
node 2 size: 48381 MB
node 2 free: 20340 MB
node 3 cpus: 12 13 14 15 44 45 46 47
node 3 size: 48381 MB
node 3 free: 20751 MB
node 4 cpus: 16 17 18 19 48 49 50 51
node 4 size: 48381 MB
node 4 free: 20747 MB
node 5 cpus: 20 21 22 23 52 53 54 55
node 5 size: 48337 MB
node 5 free: 20701 MB
node 6 cpus: 24 25 26 27 56 57 58 59
node 6 size: 48381 MB
node 6 free: 20739 MB
node 7 cpus: 28 29 30 31 60 61 62 63
node 7 size: 48338 MB
node 7 free: 20495 MB

It would be nice if someone with a dual AMD Epyc Linux system could check whether this PR makes any difference in performance. The test would be as follows:

  1. As root, run echo 0 > /proc/sys/kernel/numa_balancing
  2. Run llama-bench on some large model without this PR, using the --numa distribute --no-kv-offload 1 options. Make sure the model is fully cached in memory; it's best to run the command twice.
  3. As root, run echo 3 > /proc/sys/vm/drop_caches
  4. Run llama-bench compiled with this PR applied, using the same --numa distribute --no-kv-offload 1 options. Again, run the command twice.
  5. Post the outputs of the second llama-bench runs here.

Edit: tested this, found no meaningful difference in performance.

Commit: llama : use NUMA-aware buffer type for KV cache … the first-touch policy

fairydreaming marked this pull request as draft February 1, 2025 17:36
github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Feb 1, 2025

saood06 commented Feb 3, 2025

Edit: tested this, found no meaningful difference in performance.

Not my experience so far; this seems to help me.

cpumaxx (Contributor) commented Feb 3, 2025

Sad to say I also found exactly the same performance on both HEAD as well as this PR branch.
I tried a half-dozen times with llama-bench and re-verified I was using your branch's code before concluding this.

fairydreaming (Collaborator, Author)

Sad to say I also found exactly the same performance on both HEAD as well as this PR branch. I tried a half-dozen times with llama-bench and re-verified I was using your branch's code before concluding this.

It could be that the performance gains are only visible when using longer context. But to measure this you'd have to use #11126 and pass for example -gp 4096,64 to llama-bench.

saood06 commented Feb 4, 2025

It could be that the performance gains are only visible when using longer context.

Based on my hardware and my habit of watching numastat while running models on my dual-socket Xeon E5 v3, I'm not sure how much of a difference this makes in that regard. numactl does look a bit more balanced when checking with this on vs. off for Deepseek-R1.

The real benefit is avoiding performance loss. Without this PR, the MLA branch (which has the old KV cache allocated but not used) would benchmark fine with llama-batched-bench up to 8K, but the server would have worse performance due to paging in from disk. With this branch I was able to run the server all the way up to 30K tokens without paging out to disk. There is still a performance loss from going to high context, as is to be expected, but eliminating the loss from paging made going past 8K to 30K feasible.

jukofyork (Contributor) commented Feb 6, 2025

Have you guys tried using numactl --interleave=all <command>?

(after echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null + echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null)

Using --numa distribute alone on my dual-E5-2696v4 machine didn't actually spread the pages evenly (as can be seen via numactl -H), but as soon as I used both numactl --interleave=all and --numa distribute I got a huge performance boost when offloading only the non-shared experts' tensors and keeping everything else on the GPU using #11397.
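
For reference, numactl --interleave=all sets the process's default memory policy so that new page allocations are spread round-robin across all NUMA nodes; a rough in-process equivalent (just a sketch assuming libnuma, not something the llama.cpp code does) would be:

```cpp
#include <numa.h>   // libnuma; link with -lnuma

// Interleave this process's future anonymous allocations across every
// configured NUMA node, similar to running under numactl --interleave=all.
static void set_interleave_all() {
    if (numa_available() < 0) {
        return; // no NUMA support on this system
    }
    numa_set_interleave_mask(numa_all_nodes_ptr);
}
```

Like the CLI flag, this only influences pages that are faulted in after the policy is set.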

I made a custom Q5_K_XL 463GB quant where everything is Q8_0 apart from the non-shared experts' tensors (Q5_K for up/gate projections and Q6_K for down projections), and can run this at around 4.25 tokens per second (4.75 tokens per second for the Q4_K_XL 413GB variant too).

To put this in context:

  • This is nearly 2x what I was getting with a ~250GB Q2_K / Q4_K / Q8_0 mix earlier without using numactl --interleave=all on the same machine.
  • Around half of what I got when I linked 6 A6000 GPUs using RPC for an IQ2_S / IQ3_S / Q6_0 mix (~9 tokens per second).

I only have around 78 GB/s per socket, so I would assume it would work even better for dual EPYC setups.

FWIW: I have the BIOS set to "Home Snoop with Directory" NUMA mode too (see: https://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/), but no idea what options are available for EPYCs.

saood06 commented Feb 6, 2025

I made a custom Q5_K_XL 463GB quant where everything is Q8_0 apart from the non-shared experts' tensors (Q5_K for up/gate projections and Q6_K for down projections).
a ~250GB Q2_K / Q4_K / Q8_0 mix earlier
a IQ2_S / IQ3_S / Q6_0 mix (~9 tokens per second).

Can you run the first 8-16 chunks of perplexity on wiki.test.raw on some of your mixes?
I'm currently testing and collecting some data.

| Quant | [1] | [2] | [3] | [4] | [5] | [6] | [7] | [8] | [9] | [10] | [11] | [12] | [13] | [14] | [15] | [16] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IQ2_XXS ** | 3.39 | 4.56 | 3.44 | 3.27 | 3.27 | 3.20 | 3.12 | 3.12 | | | | | | | | |
| IQ3_XXS ** | 2.69 | 3.53 | 2.51 | 2.11 | 1.91 | 1.78 | 1.69 | 1.62 | | | | | | | | |
| 4.52BPW mix (V1) | 2.5954 | 3.3338 | 2.3993 | 1.9972 | 1.8080 | 1.6659 | 1.5697 | 1.5047 | 1.4555 | 1.4154 | 1.4007 | 1.4493 | 1.4581 | 1.5866 | 1.7193 | 1.7815 |
| UD-IQ1_M ** | 3.4155 | 4.2311 | 3.0817 | 2.8601 | 2.6933 | 2.5792 | 2.5123 | 2.5239 | | | | | | | | |
| UD-IQ1_S ** | 3.8939 | 4.7189 | 3.7812 | 3.6799 | 3.6215 | 3.6922 | 3.6442 | 3.7472 | 3.8353 | 3.7663 | 3.8983 | 4.0621 | | | | |
| 1.65BPW mix (V2) | 3.7554 | 4.6569 | 3.5681 | 3.4458 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 1.65BPW mix (V2) -b 4096 | 3.7554 | 4.6569 | 3.5681 | 3.4458 | 3.5419 | 3.5822 | 3.5429 | 3.6624 | 3.7312 | 3.6580 | 3.7719 | 3.9520 | nan | nan | nan | nan |
| 1.62BPW Mix (V1) -b 4096 | 3.6625 | 4.5832 | 3.5418 | 3.4340 | nan | nan | nan | nan | | | | | | | | |

** is data that was posted by other people online, not my tests.
UD refers to Unsloth quants.
(V2) for 1.65 BPW refers to the second low BPW mix I tested.
(V1) for 1.65 BPW refers to the first low BPW mix I tested.
(V1) for 4.52BPW refers to the fact that I am currently making a slightly bigger quant that should be better.

jukofyork (Contributor) commented Feb 7, 2025

Can you run the first 8-16 chunks of perplexity on wiki.test.raw on some of your mixes? I'm currently testing and collecting some data.

I can run it on the custom Q5_K_XL and Q4_K_XL quants, but the other 2 were just testers and long since deleted (one was the maximum quant I could fit on 6 x 48GB GPUs for testing RPC, and the other was the maximum I could fit on a single NUMA node of a 512GB system [and the "IQ" quants work horribly on CPU!]).

Also, I'm not sure the negative effects of using lower quants for the attention matrices will show up in perplexity unless we were to use longer sequences (IIRC, it's 512 tokens by default?). The negative effects become very obvious if you ask a model to write several chapters of a story where the POV switches back and forth each chapter (e.g. the defender(s) of a castle and the attacker(s) of a castle)... Low-bit quants very quickly get mixed up, whereas the same model using Q8_0+ for the attention matrices works fine.

ubergarm commented

@jukofyork

I used both numactl --interleave=all and --numa distribute

Thanks for this tip; it got my tg128 from ~5 up to ~5.4 on a dual-socket Intel Xeon 6980P. I had been using explicit numactl for all 6 nodes and 512 threads (numactl -N 0-5 -b -m 0-5 -C 0-511 llama-bench --numa numactl), and changing to numactl --interleave=all llama-bench --numa distribute seems to have moved the needle.

I tried both with and without disabling numa_balancing, but the difference wasn't significant, though setting it to 0 was very slightly better (within the variance) in just a few comparison benchmarks.

Best speed seems to be with only around 86 threads when testing the unsloth R1 GGUF Q4_K_M quant across both sockets. Still slower than 86 threads on a single CPU socket at ~7.7 tok/sec lol...
