-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KV cache quantized to q8_0 #2969
Conversation
On second thought, maybe not. The order of tensors would be matrix multiplication -> scale -> mask -> softmax -> pad -> matrix multiplication. For optimal performance you should be fusing those tensors anyways and adding one more unary tensor to the pipeline would not make a difference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything immediately wrong with this approach, so interesting if we can make it work all the way.
Not sure if we would win much from a dedicated PAD op. We can have it as a combination of view + cpy as you have proposed. Either way would be fine I guess
Why do you add 128 again?
Also, probably better to utilize the wdata
buffer that each op uses for writing work data
I'm adding 128 so each row has some extra space to temporarily store the unquantized values as the context fills up. |
This seems to work on ROCM (RX 6600) but weirdly uses considerably more memory than |
From what I can tell
I think this is due to the hack that I'm currently using to pad one of the tensors. It should not be an issue in the final version. |
Unless the plan is to call the new AMX (and VNNI) instructions manually, not using MKL would be slower on any newer (post Sapphire Rapids) Intel CPUs. These new instructions perform tiled Int8 multiplies in HW. |
I added new functions like I replaced the padding hack with an extension to the matrix multiplication ggml code. The solution that I came up with is to quantize the last block of the hidden state without any SIMD instruction if it is incomplete. The rest of the tensor is still quantiized using SIMD instructions. This should keep implementation/maintenance overhead low with negligible impact on performance. I pushed CUDA implementations that produce correct results but the performance is bad. |
9331895
to
e569a9f
Compare
e569a9f
to
35b7f08
Compare
test-opt and test-grad0 are failing an assertion:
edit: |
Alright, this PR should now be feature complete. I added a new CLI argument
The perplexity changes as follows:
The increase in perplexity is on the same order as the perplexity differences between f16 and q8_0 or q8_0 and q6_K. |
I have a GTX 970 installed (compute 5.2), maybe that's the problem? You can test with |
554d4bc
to
82b34d8
Compare
The failing asserts were added by me and I removed them again since the assumptions that went into them seem to have been incorrect. The issue with CUDA compilation was caused by bad fallback values (which get used by cmake for the CC 5.2 PTX code) and should be fixed now. |
Running perplexity with ROCM, I just get NANs when
Token generation seemed to work okay. |
Do you still get NaN with |
Yes, I do.
I get them slower! (Not that this information is likely to be too helpful.) |
I get an assertion failure if I set my GTX 970 as the main GPU with
Backtrace:
|
I've pushed a fix for incorrect results with compute capabilities < 6.1. The results should now be correct but the performance will be bad. |
Tested at e0d5c0b - no difference as far as I can see. With |
It's not just a check for compute capability, it's primarily that the old |
Alrighty. Just let me know if/when you want me to try a new version, different settings, etc. I can test out whatever. |
@KerfuffleV2 which GPU do you have again? |
Arch Linux, the ROCM/ROCBLAS stuff seems to be at 5.6.0 if that makes a difference. |
You could maybe try editing line 6593 of |
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index ebe267f..98847cf 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -6590,7 +6590,7 @@ void ggml_cuda_mul_mat(const ggml_tensor * src0, const ggml_tensor * src1, ggml_
}
// no quantized non-contiguous support for lower CC kernels implemented
- const bool nc_okay = src0->type == GGML_TYPE_F16 || g_compute_capabilities[g_main_device] >= MIN_CC_DP4A;
+ const bool nc_okay = false;
if (all_on_device && nc_okay && ggml_is_permuted(src0) && ggml_is_permuted(src1) && src1->ne[1] == 1) {
ggml_cuda_mul_mat_vec_p021(src0, src1, dst); No effect. Also tested with edit: Kind of just randomly changing stuff but I also tried: - if (src1->ne[1] == 1 && src0->ne[0] % GGML_CUDA_DMMV_X == 0) {
+ if (false && src1->ne[1] == 1 && src0->ne[0] % GGML_CUDA_DMMV_X == 0) { Also no difference. edit 2: Continuing that approach, I forced it to end up at line 6612:
Still no difference. (And I verified it was getting there with with a |
Okay, that's the piece that I was missing. Using the same model I can now reproduce the issue. The problem seems to be that 3b has |
Can confirm this fixes it. Awesome work! |
EDIT3: seems to be a bug in the setup/build process, tracked by: #3202 ppl for f16:
ppl for q8_0:
ppl for q8_0 nommq:
edit: also with e0d0a0f edit2: i realize that this is more of a generic mmq issue |
I pushed a fix for small models with n_embd_head % 32 != 0. I fixed it by adding |
Could you resolve the merge conflicts? I wanted to try this out on my fork, but there have been a lot of changes to ggml-cuda.cu on master. |
Already on it, I'll push it later today. |
980604d
to
c620f4d
Compare
I pushed the rebased version. From what I can tell it's working correctly. If you do encounter issues, please compare against both master and the branch |
char * dst_ptr = (char *) dst->data; | ||
float src0_padded[ne00_padded]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why the CI runs are failing. From what I can tell the issue is that for some reason it tries to allocate an array of size 0 here. This would imply ne00_padded == 0
which in turn would imply ne00 == 0
. Is there a situation in which calling this function with that configuration actually makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MSVC doesn't support VLA, the size of the array must be a compile-time constant.
Tested with my weird 3B Orca model and a LLaMA2-70B for token generation and perplexity. Both seem to work fine now on ROCM. |
CodeLlama 34b (Airoboros, Spicyboros, WizardLM) outputs rubbish, tested both in KoboldCPP chat and with LLama perplexity test (KV_Q_8_0 branch, around 50,000 perplexity at every context length). Example : Output: _\\\\:\ \Item ( =,, No problem with CodeLlama 7b & 13b (although I didn't test all quants). Also, the Q_8_0 branch shows again the VRAM "leak", while the Q_8_0_6 branches doesn't suffer from the problem, and shows a flat VRAM occupation as the context grows (as does show a flat VRAM occupation the current KoboldCPP Experimental build with the usual "KV16" implementation since my previous occurrence of the "VRAM leak" problem was resolved a week ago). Sorry for my lack of precise technical terms, I'm just a user testing above my pay-grade ! |
On the latest build, got the same errors as before on android in termux:
tested various models: |
I have been using q8_0 kv cache on my local fork without issues. But I have to disable it for now because there are merge conflicts after #3301. |
Personally, I would definately like to have this merged as a seperate option. I didn't encounter any issues on my end and it does help a lot. |
Hi, I use this pr to quantize KV cache and I found that when I offload 32 repeated layers to GPU, the answer of the model is almost the same as the fp16 kv cache, but when I load one or some repeated layers to CPU, the answer of this kv int8 model for long prompt is total wrong(for example: always output "!!!!!!!!!!!!!!!!!!!!!!!!!"), so I wonder how to fix this bug? Thanks! I debug further and found some intermediate tensor's data is nan value if use kv cache q8_0.
So how should I debug next? |
I merged this onto 45855b3, among some other local changes, and hit this assertion while using it via oobabooga's UI:
This must be related to this PR, because nothing else in the model I'm testing with should be quantized to 8-bit. I had just edited the prompt, so it rewound the KV cache (eval was called with a smaller n_past than before). |
I'm assuming the issue is caused by the assert in
Note that I did not test this code and that I currently do not have the time to work on llama.cpp, especially since there is no indication that this PR would be merged. |
@ggerganov This PR really deserves to be promised a merge, and his author to be encouraged. KV-Q8_0 is really a boner for LlamaCPP and its users compared to the competition. |
If anyone wants to open a long enough but easy to achieve context window, let's say the long llamaLora merge, 16k q8_0 kv could save 10g+ memory usage. |
Note that this is just one random person's opinion: I'd say it's not quite that simple. How using less memory is useful for the KV cache is very easy to see but it's not the only consideration. One thing to note is the changes are pretty complicated and there have been a lot of issues that have cropped up in testing. It seems like we haven't even reached the end of finding new ones. This isn't an easy thing to get right, even for someone as skilled as JG. It also isn't just a self-contained set of changes that only affects people that decide to opt into using the Q8_0 quantized KV: the changes touch a lot of stuff that can have an effect whether or not quantized KV is actually being used. Also another consideration is that supporting quantized KV can limit the other design choices that are possible. Just an example, KV cache fragmentation is currently an active issue which can come up when doing parallel generation. If the KV cache is quantized, it becomes a lot harder (maybe even not possible) to just move the entries around. Of course, right now it has conflicts and can't be merged. (Also, resolving those conflicts and incorporating other changes can cause new bugs to crop up.) Anyway, someone maintaining a project can't really think like "I have to merge this pull or the author will be unhappy/discouraged". They have to look at the broader picture, or they'll kill their own project by making bad choices. Personally, I know I couldn't handle it. Of course, it also feels bad to put a lot of work into a pull and have it rejected or not know if will ever be accepted. |
Unfortunately, I still hit the same assertion with the patch applied. |
What I meant is that with the patch it should be fine to remove the assert. |
Forgive me again but i asked myself why llama.cpp? Why a lot of people use it in the very beginning? I think it's because, Because we all are GPU poor people. Because we all want something freaking great. 😂 |
I have the patch applied and the assertion removed, but now in similar situations the KV cache gets corrupted and the model just starts generating hash marks no matter how many times I regenerate the response or even remove tokens from the prompt. Reloading the model fixes it. |
And with or without replacing the assertion with the syncthreads call, using the 'main' example with spicyboros-c34b-2.2.Q5_0.gguf produces garbage output for me on this PR. |
obsoleted by #4312 |
This PR aims to implement quantizing the KV cache as q8_0. As of right now it is a proof of concept and I would like to discuss the best way of doing a proper implementation ahead of time.
Potential benefits:
The current state of this PR is that only the CPU implementation works (with some hacks). These are the performance results on my system:
The GPU is an RTX 3090. The use of CUDA up to 33 layers is possible because only at 34 layers the KV cache starts being offloaded. In its current form there seems to be a small performance regression. The Perplexity on wikitext-2 changed from 5.9679 +/- 0.03352 to 5.9668 +/- 0.03349 which is actually a very small improvement, likely due to randomness (calculated with 33 layers offloaded).
The implementation for the K component of the KV cache was fairly straightforward: the values copied to the K component always have a row size of
n_embd_head
which is a multiple of 32 (q8_0 block size). So you can essentially just change the datatype of the K component to q8_0 and the preexisting implementations for copying and matrix multiplication will work out-of-the-box.The implementation for the V component was more tricky because the row size is equal to the number of processed tokens which then need to be copied to an arbitrary location in a q8_0 block. I did the following things to make it work:
ggml_compute_forward_dup_f32
by an extra case that so far was not implemented: copying non-contiguous f32 values to non-contiguous q8_0 values. My solution was to fill a zero-initialized temporary buffer behind the q8_0 block with the unquantized values. Then re-create the q8_0 block from those when either the buffer contains enough values for an entire q8_0 block or when there are no more new unquantized values to copy for a row. This requires an additional 128 bytes per V component row (hacked by just increasing the size by 128 elements).op_params
of the corresponding tensor inllm_build_llama
.I plan to do a proper implementation of this PR in the following way:
ggml_view
so that the position inside a block is written toop_params
if the underlying tensor is quantized.GGML_OP
likeGGML_OP_PAD
that pads a tensor with a constant value. Alternatively the code for quantizing the hidden state could be modified which would probably have better performance. But this would also be more complicated and require changes across a much wider range of hardware (to some of which I don't have access for testing). For the CUDA implementation only very minimal changes would be necessary to avoid padding the hidden state in an extraGGML_OP
.--memory-f32
CLI argument with a new one that lets you choose between q8_0, f16, and f32 for the KV cache type.If there are issues with the way I plan to do the implementation, please let me know ahead of time.