NUMA-aware KV cache buffer type (experimental) #11580
base: master
Conversation
… the first-touch policy
llama : use NUMA-aware buffer type for KV cache
Not my experience so far, this seems to help me.
Sad to say, I also found exactly the same performance on both HEAD and this PR branch.
It could be that the performance gains are only visible when using longer contexts. But to measure this you'd have to use #11126 and pass, for example, …
Based on my hardware and my habit of looking at numastat while running models on my dual-socket Xeon E5 v3, I'm not sure how much of a difference this makes in that regard. numactl does look a bit more balanced with this on vs. off for Deepseek-R1. The real benefit is avoiding performance loss: without this PR, the MLA branch (which has the old KV cache allocated but not used) would benchmark fine with llama-batched-bench up to 8K, but the server would have worse performance due to paging in from disk. With this branch I was able to run the server all the way up to 30K tokens without paging out to disk. There is still a performance loss from going to high context, as is to be expected, but eliminating the performance loss from paging made going past 8K to 30K feasible.
Have you guys tried using … (after …)? Using … I made a custom … To put this in context:
I only have around 78 GB/s per socket, so I would assume it would work even better for dual EPYC setups. FWIW, I have the BIOS set to "Home Snoop with Directory" NUMA mode too (see: https://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/), but I have no idea what options are available for EPYCs.
Can you run the first 8-16 chunks of perplexity on wiki.test.raw on some of your mixes?
** is data that was posted by other people online, not my tests.
I can run it on the custom … Also, I'm not sure the negative effects of using lower quants for the attention matrices will show up using …
Thanks for this tip, which got my tg128 from ~5 up to ~5.4 on a dual-socket Intel Xeon 6980P. I had been using explicit numactl for the entire 6 nodes / 512 threads. I tried both with and without disabling … Best speed seems to be around only 86 threads, testing the unsloth R1 GGUF.
This PR contains an experimental NUMA-aware KV cache buffer implementation so that people can try it and check if it improves performance on multi-CPU systems.
IMPORTANT: this mechanism works only in conjunction with the --no-kv-offload option. Also, most likely this code won't compile on Windows, so it's not merge-ready.
The idea behind this is to allocate memory for KV buffer tensors the same way as for model tensors - that is, by using the mmap() function. In this case we are not mapping an existing file into memory but use the MAP_ANONYMOUS flag - you can think of it as a zeroized virtual file. The purpose is to allocate the pages backing the KV cache according to the first-touch policy, i.e. on the NUMA node that will actually use the given KV cache memory fragment.
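To illustrate the allocation idea, here is a minimal standalone sketch (not the PR's actual code): an anonymous mapping whose backing pages are placed by first touch, with each thread zeroing its own slice. The 1 GiB size and the OpenMP touch loop are illustrative assumptions; in the PR this logic lives in a ggml buffer type instead.

```c
// Minimal first-touch allocation sketch (assumes Linux).
// Build: gcc -O2 -fopenmp first_touch.c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <omp.h>

int main(void) {
    const size_t kv_size = 1ull << 30; // hypothetical 1 GiB KV cache buffer

    // MAP_ANONYMOUS gives zero-filled pages not backed by any file;
    // no physical pages are assigned to a NUMA node yet.
    void * buf = mmap(NULL, kv_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    // First touch: each thread writes its own slice, so the kernel places
    // the backing pages on the NUMA node that thread runs on (assuming the
    // threads are spread across nodes, e.g. with --numa distribute).
    #pragma omp parallel
    {
        const int    nthreads = omp_get_num_threads();
        const int    tid      = omp_get_thread_num();
        const size_t chunk    = kv_size / nthreads;
        const size_t begin    = (size_t) tid * chunk;
        const size_t end      = (tid == nthreads - 1) ? kv_size : begin + chunk;
        memset((char *) buf + begin, 0, end - begin);
    }

    // ... the buffer would back the KV cache tensors here ...

    munmap(buf, kv_size);
    return 0;
}
```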
This prevents allocating the whole KV cache on a single NUMA node, as is currently done. To illustrate the problem, examine this numactl --hardware output taken in the middle of loading a very large model:
Note that NUMA node 5 has no free memory left while the other nodes still have around 20GB free. In this case the KV cache size was around 20GB, so it was allocated entirely on NUMA node 5. Since this node has no free memory left, model weights that would normally be placed in its local memory end up on other nodes, and the CPU cores associated with this node have to make foreign (non-local) memory accesses to use them.
With this PR we have uniform memory usage across all NUMA nodes:
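As an aside, the per-node totals reported by numactl --hardware can also be read programmatically, which is handy for logging memory balance during a run. A minimal sketch using libnuma (an assumption of this example; the PR itself does not require it), linked with -lnuma:

```c
// Print total and free memory per NUMA node, the same numbers
// that `numactl --hardware` reports. Build: gcc query_nodes.c -lnuma
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    const int max_node = numa_max_node();
    for (int node = 0; node <= max_node; node++) {
        long long free_bytes = 0;
        long long size_bytes = numa_node_size64(node, &free_bytes);
        if (size_bytes < 0) {
            continue; // node not present or has no memory
        }
        printf("node %d: size %lld MiB, free %lld MiB\n",
               node, size_bytes >> 20, free_bytes >> 20);
    }
    return 0;
}
```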
It would be nice if someone with a dual AMD Epyc Linux system could check if this PR makes any difference in performance. The test would be as follows:
1. Disable automatic NUMA balancing: echo 0 > /proc/sys/kernel/numa_balancing
2. Run the benchmark without this PR, with the --numa distribute --no-kv-offload 1 options. Make sure the model is fully cached in memory, so it's best to run the command twice.
3. Drop the page cache: echo 3 > /proc/sys/vm/drop_caches
4. Run the benchmark with this PR applied, again with the --numa distribute --no-kv-offload 1 options. Also run the command twice.

Edit: tested this, found no meaningful difference in performance.
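For anyone running this comparison, one way to confirm where the KV cache pages actually end up (rather than inferring it from numastat) is to query page residency with the move_pages(2) syscall in query mode. A minimal hedged sketch, assuming libnuma is installed; buf/len are hypothetical and stand for any first-touched allocation such as the anonymous mapping sketched earlier:

```c
// Count how many pages of a buffer reside on each NUMA node, using
// move_pages() in query mode (nodes == NULL). Build: gcc residency.c -lnuma
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <numaif.h>

static void print_page_residency(void * buf, size_t len) {
    const size_t page_size = (size_t) sysconf(_SC_PAGESIZE);
    const size_t n_pages   = (len + page_size - 1) / page_size;

    void ** pages  = malloc(n_pages * sizeof(void *));
    int   * status = malloc(n_pages * sizeof(int));

    for (size_t i = 0; i < n_pages; i++) {
        pages[i] = (char *) buf + i * page_size;
    }

    // With nodes == NULL, move_pages() moves nothing - it only fills status[]
    // with the node each page resides on (or a negative errno, e.g. -ENOENT
    // for pages that were never touched).
    if (move_pages(0, n_pages, pages, NULL, status, 0) == 0) {
        enum { MAX_NODES = 64 };
        size_t per_node[MAX_NODES] = {0};
        for (size_t i = 0; i < n_pages; i++) {
            if (status[i] >= 0 && status[i] < MAX_NODES) {
                per_node[status[i]]++;
            }
        }
        for (int node = 0; node < MAX_NODES; node++) {
            if (per_node[node] > 0) {
                printf("node %d: %zu pages\n", node, per_node[node]);
            }
        }
    }

    free(pages);
    free(status);
}

int main(void) {
    const size_t len = 64u << 20;   // hypothetical 64 MiB test buffer
    char * buf = malloc(len);
    memset(buf, 0, len);            // first touch from the current thread
    print_page_residency(buf, len);
    free(buf);
    return 0;
}
```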