llama : store non-RoPEd K cache #3234
base: custom-attention-mask
Conversation
These are the results I obtained on CPU (13900k):
Yup, these results might be a strong argument against the non-RoPEd K cache.
I suspect that the main cost is the copy of
This is likely the case. We can verify this by replacing the rope with a cpy.
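For reference, a standalone sketch (plain C++, not ggml) of the kind of comparison this suggests: timing a simplified adjacent-pair RoPE rotation against a plain copy over a K-sized buffer. The sizes and the scalar RoPE are illustrative assumptions and say nothing about the optimized kernels.

```cpp
// Standalone micro-benchmark (not ggml): rough comparison of a simplified
// adjacent-pair RoPE rotation versus a plain copy over a K-cache-sized
// buffer. Sizes are illustrative assumptions (13B-like n_embd, 512 tokens);
// the real kernels are vectorized, so treat this only as a sanity check.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n_embd   = 5120; // assumed embedding size
    const int n_tokens = 512;  // assumed batch size (PP 512)

    std::vector<float> src((size_t) n_embd * n_tokens, 1.0f);
    std::vector<float> dst(src.size());

    const auto t0 = std::chrono::steady_clock::now();

    // plain copy of the whole buffer
    std::copy(src.begin(), src.end(), dst.begin());

    const auto t1 = std::chrono::steady_clock::now();

    // simplified RoPE: rotate adjacent pairs by pos * 10000^(-i/n_embd)
    for (int t = 0; t < n_tokens; ++t) {
        const float * x = src.data() + (size_t) t * n_embd;
        float       * y = dst.data() + (size_t) t * n_embd;
        for (int i = 0; i < n_embd; i += 2) {
            const float theta = t * std::pow(10000.0f, -float(i) / n_embd);
            const float c = std::cos(theta);
            const float s = std::sin(theta);
            y[i]     = x[i] * c - x[i + 1] * s;
            y[i + 1] = x[i] * s + x[i + 1] * c;
        }
    }

    const auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("copy: %8.3f ms\n", ms(t1 - t0).count());
    printf("rope: %8.3f ms\n", ms(t2 - t1).count());

    return 0;
}
```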
Do I understand correctly that this makes the cache values independent of the token position? Would this make it possible to precompute the KV cache for all (or maybe just the most common) tokens in the vocabulary, so that at inference time you only need to copy it and apply RoPE?
The memory requirements would be too huge for this to work
Really? When I run a 13B model with 4096 context, the output I get suggests that 4096 tokens of context require 3.2GB of memory. I believe Llama 2 has a vocabulary size of 32000, so wouldn't that mean that precomputing the cache for all tokens requires 32000 / 4096 * 3.2GB = 25GB? Though I don't know if memory requirements can actually be calculated like this, maybe all of this is wrong.
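For what it's worth, that estimate is consistent with a back-of-the-envelope calculation. The sketch below assumes a 13B-class model with 40 layers, n_embd = 5120 and an f16 cache; these parameters are assumptions, not taken from the output above.

```cpp
// Back-of-the-envelope KV cache size check. The parameters (40 layers,
// n_embd = 5120, f16 elements) are assumptions for a 13B-class model and
// are only used to sanity-check the 3.2GB / 25GB figures above.
#include <cstdio>

int main() {
    const long long n_layer  = 40;
    const long long n_embd   = 5120;
    const long long elt_size = 2; // f16 bytes per element

    // K and V, for every layer, for one token
    const long long bytes_per_token = 2 * n_layer * n_embd * elt_size;

    const double GiB = 1024.0 * 1024.0 * 1024.0;
    printf("per token    : %lld bytes\n", bytes_per_token);
    printf("4096 context : ~%.2f GiB\n", bytes_per_token * 4096  / GiB);
    printf("32000 tokens : ~%.2f GiB\n", bytes_per_token * 32000 / GiB);
    return 0;
}
```

With these numbers, 4096 tokens come out at about 3.1 GiB and the full 32000-token vocabulary at about 24.4 GiB, so the 25GB estimate is in the right ballpark.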
Thinking more about this, I guess the idea would work, but only if you had a single layer in the transformer. In that case the KV is always computed from the token embeddings of the model, and those are indeed independent of the token position. For the deeper layers, the inputs already mix information from the whole context through attention, so their K and V depend on more than just the token identity. But let's give it some more thought - I could be missing something
Oh, I didn't realize that (my understanding of how transformers work is extremely basic), thank you for clarifying. Though I probably should've guessed that it wouldn't work out as nicely as I imagined.
You can test the best scenario of this by removing the mul mats with
I'm a bit confused about the reason behind this: what do we use a non-RoPE'd cache for? If it's to have a temporary 0-position rope available when needed, couldn't we just run a "rope graph" on a cache copy to "un-rope" it? If that's the case, and I think it is, couldn't we just add a couple of nice API functions to re-rope/process the KV cache (either modifying it in place or into a copy/seq)? I'd guess that's useful for sliding windows and similar tricks?
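Since RoPE is just a position-dependent rotation of dimension pairs, "un-roping" a cached row amounts to applying the same rotation with the position negated. A minimal scalar sketch of such a hypothetical re-rope helper (not the actual llama.cpp / ggml API):

```cpp
// Minimal scalar sketch of "re-roping" a cached K row. RoPE is a
// position-dependent rotation, so rotating by -p_old undoes the original
// RoPE and rotating by +p_new applies it for the new position.
// These are hypothetical helpers, not the actual llama.cpp / ggml API.
#include <cmath>
#include <cstdio>

// simplified RoPE: rotate adjacent pairs of x[0..n) by pos * 10000^(-i/n)
static void rope_rotate(float * x, int n, float pos) {
    for (int i = 0; i < n; i += 2) {
        const float theta = pos * std::pow(10000.0f, -float(i) / n);
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}

// move a RoPE'd K row from position p_old to p_new without recomputing K
static void rerope(float * k_row, int n, float p_old, float p_new) {
    rope_rotate(k_row, n, -p_old); // undo the original rotation
    rope_rotate(k_row, n,  p_new); // apply the rotation for the new position
}

int main() {
    float k[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    rope_rotate(k, 4, 7.0f);  // K as it would sit in a RoPE'd cache at pos 7
    rerope(k, 4, 7.0f, 3.0f); // shift it to pos 3, e.g. for a sliding window
    printf("%f %f %f %f\n", k[0], k[1], k[2], k[3]);
    return 0;
}
```

One practical difference: un-roping and re-roping the stored (possibly f16) values accumulates rounding error over repeated shifts, while a non-RoPE'd cache always applies the rotation fresh from the original values.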
Implemented just for CPU and Metal.
With `Q4_0` there is a ~8% performance hit for TG 500 and ~4% for PP 512, but it might be possible to optimize the rope kernel further to compensate.
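To make the trade-off concrete, here is a simplified scalar sketch (assumed data layout, not the actual ggml graph) of what a non-RoPEd K cache implies: K rows are stored as projected, and a RoPE'd copy of the whole cache is built at attention time.

```cpp
// Conceptual sketch of the non-RoPEd K cache (simplified scalar code with
// an assumed layout, not the actual ggml graph): K rows are stored exactly
// as projected, and a RoPE'd copy of the whole cache is built at attention
// time. That per-eval copy + rotation is roughly where the extra TG/PP cost
// mentioned above comes from.
#include <cmath>
#include <vector>

struct KCacheNoRope {
    int n_embd = 0;
    std::vector<float> data; // [n_cached, n_embd], stored without RoPE
    std::vector<int>   pos;  // position of each cached row
};

// insert: no rotation at all, just append the projected K row and its position
static void kcache_insert(KCacheNoRope & kc, const std::vector<float> & k_row, int p) {
    kc.data.insert(kc.data.end(), k_row.begin(), k_row.end());
    kc.pos.push_back(p);
}

// attention time: build a RoPE'd view of the whole cache (copy + rotate)
static std::vector<float> kcache_roped_view(const KCacheNoRope & kc) {
    std::vector<float> view = kc.data;
    const int n = kc.n_embd;
    for (size_t t = 0; t < kc.pos.size(); ++t) {
        float * x = view.data() + t * n;
        for (int i = 0; i < n; i += 2) { // simplified adjacent-pair RoPE
            const float theta = kc.pos[t] * std::pow(10000.0f, -float(i) / n);
            const float c = std::cos(theta);
            const float s = std::sin(theta);
            const float x0 = x[i];
            const float x1 = x[i + 1];
            x[i]     = x0 * c - x1 * s;
            x[i + 1] = x0 * s + x1 * c;
        }
    }
    return view; // Q (already RoPE'd for its own positions) attends over this
}

int main() {
    KCacheNoRope kc;
    kc.n_embd = 4;
    kcache_insert(kc, { 1.0f, 2.0f, 3.0f, 4.0f }, 0);
    kcache_insert(kc, { 1.0f, 2.0f, 3.0f, 4.0f }, 5);
    const std::vector<float> view = kcache_roped_view(kc);
    return (int) view.size(); // 8 floats: 2 cached rows of 4 dims each
}
```

With a RoPEd cache the rotation happens once per inserted token instead, but the stored values then depend on absolute positions, which is what makes tricks like shifting or reusing cache entries awkward.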