llama : store non-RoPEd K cache #3234

Draft · wants to merge 1 commit into base: custom-attention-mask

Conversation

ggerganov
Member

Implemented just for CPU and Metal.
With Q4_0 there is a ~8% performance hit for TG 500 and ~4% for PP 512, but it might be possible to optimize the rope kernel further to compensate.

@ggerganov ggerganov added the demo label (Demonstrate some concept or idea, not intended to be merged) on Sep 17, 2023
@slaren
Member

slaren commented Sep 18, 2023

These are the results I obtained on CPU (13900k):

| model | size | params | backend | th | test | Master t/s | PR t/s | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp 512 | 34.35 ± 0.31 | 36.52 ± 0.70 | 1.06 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 128 | 16.03 ± 0.21 | 15.59 ± 0.16 | 0.97 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 256 | 15.69 ± 0.35 | 14.97 ± 0.28 | 0.95 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 512 | 15.56 ± 1.06 | 13.87 ± 0.06 | 0.89 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 1024 | 15.12 ± 0.17 | 11.44 ± 0.22 | 0.76 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 2048 | 14.60 ± 0.22 | 8.48 ± 0.10 | 0.58 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 4096 | 12.89 ± 0.10 | 5.52 ± 0.04 | 0.42 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 8192 | 10.55 ± 0.11 | 3.25 ± 0.01 | 0.31 |

@ggerganov
Member Author

Yup, these results might be a strong argument against the non-RoPEd K cache.

@slaren
Member

slaren commented Sep 18, 2023

I suspect that the main cost is the copy of K, but the computational cost of calculating the RoPE shouldn't be too bad, and most of it could be replaced with a lookup table if needed. So a fused attention op that applies RoPE on the fly during the computation of KQ could be a viable way to do this.
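
A minimal NumPy sketch of that fused idea (a sketch only, not the ggml implementation; the dimension pairing and table layout are simplified assumptions): the K cache stays un-rotated, and the rotation for each cached position comes from a precomputed cos/sin lookup table while the Q·K scores are computed, instead of first materializing a RoPE'd copy of the whole K cache.

```python
import numpy as np

def rope_rotate(x, pos, cos_tab, sin_tab):
    # rotate consecutive dimension pairs of x by the angles for position `pos`
    # (pairing convention simplified; real RoPE variants pair dims differently)
    x0, x1 = x[0::2], x[1::2]
    c, s = cos_tab[pos], sin_tab[pos]
    out = np.empty_like(x)
    out[0::2] = x0 * c - x1 * s
    out[1::2] = x0 * s + x1 * c
    return out

def kq_scores_fused(q, q_pos, k_cache, cos_tab, sin_tab):
    # k_cache holds non-RoPE'd keys, one row per cached position; each key is
    # rotated on the fly from the table while scoring, so no RoPE'd copy of K is stored
    q_r = rope_rotate(q, q_pos, cos_tab, sin_tab)
    return np.array([q_r @ rope_rotate(k, p, cos_tab, sin_tab)
                     for p, k in enumerate(k_cache)])

head_dim, n_ctx = 64, 128
theta = 10000.0 ** (-np.arange(0, head_dim, 2) / head_dim)
pos = np.arange(n_ctx)[:, None]
cos_tab, sin_tab = np.cos(pos * theta), np.sin(pos * theta)  # the lookup table

k_cache = np.random.randn(16, head_dim)   # 16 cached, un-rotated K rows
q = np.random.randn(head_dim)             # query for the token at position 16
print(kq_scores_fused(q, 16, k_cache, cos_tab, sin_tab))
```

A real kernel would fuse the rotation into the dot-product loop (and batch over heads) rather than rotating whole rows, but the table-lookup part is the same.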

@ggerganov
Member Author

> I suspect that the main cost is the copy of K

This is likely the case. We can verify this by replacing the rope with a cpy.

@ggerganov ggerganov force-pushed the custom-attention-mask branch from 5bda9e2 to 0161372 on September 18, 2023 17:37
@Olexorus

Do I understand correctly that this makes the cache values independent of the token position?

Would this make it possible to precompute the KV cache for all (or maybe just the most common) tokens in the vocabulary, so that during inference time you only need to copy it and apply RoPE?

@ggerganov
Member Author

The memory requirements would be too huge for this to work

@Olexorus

Olexorus commented Oct 2, 2023

> The memory requirements would be too huge for this to work

Really? When I run a 13B model with 4096 context, I get the following output:
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 3200.00 MB
llama_new_context_with_model: compute buffer total size = 363.88 MB

I'm guessing this means that 4096 tokens of context require 3.2 GB of memory. I believe Llama 2 has a vocabulary size of 32000; wouldn't that mean that precomputing the cache for all tokens requires 32000 / 4096 * 3.2 GB = 25 GB?
While that is a lot, it doesn't seem unrealistic, at least when talking about CPU RAM. Also, I'm guessing this could be halved with #2969, so only 12.5 GB, and it could probably be decreased much further by only storing the most common tokens. I imagine this could massively speed up prompt processing on CPU. It might also be useful for very long context lengths, since it would actually consume less memory if the context is larger than the vocabulary size.

Though I don't know if memory requirements can actually be calculated like this, maybe all of this is wrong.
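
As a back-of-the-envelope check of the figures above (a sketch that only rescales the reported `kv self size`; the real cache layout may differ):

```python
# reported: kv self size = 3200.00 MB at n_ctx = 4096 (K and V together, 13B model)
kv_self_mb = 3200.0
n_ctx = 4096
n_vocab = 32000                                   # Llama 2 vocabulary size

per_token_mb = kv_self_mb / n_ctx                 # ~0.78 MB per cached token
all_vocab_gb = per_token_mb * n_vocab / 1024      # ~24.4 GB for every vocab entry
print(f"{per_token_mb:.3f} MB/token -> ~{all_vocab_gb:.1f} GB for all {n_vocab} tokens")
# a K-only table, a quantized cache (#2969), or keeping only frequent tokens
# would shrink this further, as suggested above
```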

@ggerganov
Member Author

Thinking more about this, I guess the idea would work but only if you had a single layer of the transformer. In that case the KV is always computed on the token embeddings from the model and they are indeed n_vocab in count and thus could be precomputed. However, in each layer after that, the embeddings for the KV would have some extra information intermingled from the other tokens in the context due to the attention from the previous layer. And therefore I think the idea breaks down.

But let's give it some more thought - I could be missing something
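
A toy numeric illustration of that argument (random weights, made-up sizes, and a crude causal-averaging stand-in for attention, so just a sketch): the first layer's K depends only on the token's own embedding and could be looked up per vocab id, while the next layer's input already mixes the whole prefix.

```python
import numpy as np

np.random.seed(0)
n_embd, n_vocab, n_ctx = 8, 16, 4
tok_embd = np.random.randn(n_vocab, n_embd)
wk0, wk1 = np.random.randn(2, n_embd, n_embd)    # K projections of layers 0 and 1

tokens = [3, 7, 3, 1]                            # token 3 appears twice
x0 = tok_embd[tokens]                            # layer-0 input: per-token, context-free
k0 = x0 @ wk0                                    # could be precomputed per vocab id

# crude stand-in for layer 0's attention: each output row averages its causal prefix
mix = np.tril(np.ones((n_ctx, n_ctx))) / np.arange(1, n_ctx + 1)[:, None]
x1 = mix @ x0                                    # layer-1 input depends on the whole prefix
k1 = x1 @ wk1                                    # cannot be precomputed per vocab id

assert np.allclose(k0[0], k0[2])                 # same token -> same layer-0 K
assert not np.allclose(k1[0], k1[2])             # same token, different context -> different layer-1 K
```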

@Olexorus

Olexorus commented Oct 3, 2023

> However, in each layer after that, the embeddings for the KV would have some extra information intermingled from the other tokens in the context due to the attention from the previous layer. And therefore I think the idea breaks down.

Oh, I didn't realize that (my understanding of how transformers work is extremely basic), thank you for clarifying. Though I probably should've guessed that it wouldn't work out as nicely as I imagined.

@slaren
Member

slaren commented Oct 3, 2023

You can test the best-case scenario of this by removing the mul mats with wk, wq and wv. I tested this, and with 7B models on CPU I got between 10% and 40% higher t/s, depending on the model. GQA models (Mistral) benefit less. But I agree with @ggerganov that this wouldn't work for anything other than the first layer, and in that case the performance difference would very likely be negligible.

@cmp-nct
Contributor

cmp-nct commented Jan 25, 2024

I'm a bit confused about the reason behind this: what do we need a non-RoPE'd cache for?
Isn't RoPE additive? If we need to modify the RoPE of the cache, can't we reprocess it to add/remove positional rotations from it?

So instead of keeping a non-RoPE'd (position-0) cache, couldn't we just run a "rope graph" on a cache copy to "un-rope" it when needed?

If that's the case (and I think it is), can't we just add a couple of nice API functions to re-RoPE the KV cache (either in place or into a copy/sequence)? I'd guess that would be useful for sliding window and similar tricks.
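
A quick numeric check of that "additive" intuition (a standalone sketch, not llama.cpp code, with a simplified pairing convention): RoPE is a pure rotation per dimension pair, so rotating by position p1 and then by the delta (p2 - p1) lands exactly on the rotation for p2, and a negative delta un-ropes a cached entry back to its raw value.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # rotate consecutive dimension pairs of x by pos * theta_i
    head_dim = x.shape[-1]
    theta = base ** (-np.arange(0, head_dim, 2) / head_dim)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x0, x1 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x0 * c - x1 * s
    out[1::2] = x0 * s + x1 * c
    return out

k = np.random.randn(64)                          # one cached K row
p1, p2 = 100, 7

# re-rope: position p1 followed by delta (p2 - p1) equals position p2 directly
assert np.allclose(rope(rope(k, p1), p2 - p1), rope(k, p2))
# un-rope: a negative delta recovers the original, non-RoPE'd value
assert np.allclose(rope(rope(k, p1), -p1), k)
```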
