K Quant 64 support - quite a feat to integrate #34
I was mainly considering the feedback from some people that there are too many quantization options after the addition of the k-quants when I decided to make the 64-blocks a compile-time option. But I can see that this is not very ergonomic for Falcon users. Let me think about a better solution.
Oh, and concerning fp16, I agree with you that it would be better if we standardized on fp16 for CUDA.
Great to hear :) In hindsight, my 16-bit modifications to the dequantizers may have overshot the mark; it might have been possible to just write a single wrapper that converts the kernels. In the long run I'd personally prefer to cut all 32-bit code out of the CUDA path and go with half precision, but all the sub-functions appear to run on 32 bit, so it's not a quick change. Would a global block_size 64/256 variable introduce a performance downgrade?
The quantization type is known. There is no need for a global variable. All that is needed is to make separate types with 64 and 256 block sizes, and then decide which one to use when quantizing. After that everything will just work.
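For illustration, a rough sketch of that "separate types" idea follows. The field layouts and names (block_q4_K_256, block_q4_K_64, pick_q4_K_block_size) are hypothetical and simplified, not the actual ggml definitions; the point is only that the super-block size becomes part of the type and the choice is made once at quantization time.

```cuda
// Rough sketch only: hypothetical, simplified layouts (not the actual
// ggml definitions) showing how the super-block size could live in the
// type itself instead of a compile-time QK_K-style define.
#include <stdint.h>
#include <stddef.h>

typedef uint16_t ggml_fp16_t;       // assumption: half stored as raw bits

// 256-element super-block variant (field sizes illustrative)
typedef struct {
    ggml_fp16_t d, dmin;            // super-block scale and min
    uint8_t scales[12];             // packed sub-block scales
    uint8_t qs[256 / 2];            // 4-bit quants
} block_q4_K_256;

// 64-element super-block variant for small-row models such as Falcon 7B
typedef struct {
    ggml_fp16_t d, dmin;
    uint8_t scales[2];              // fewer sub-blocks, fewer scales
    uint8_t qs[64 / 2];
} block_q4_K_64;

// The row length is known at quantization time, so the block size is
// chosen once here; afterwards every kernel dispatches on the stored type.
static inline int pick_q4_K_block_size(size_t n_per_row) {
    return (n_per_row % 256 == 0) ? 256 : 64;
}
```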
A large patch was just integrated into llama.cpp (ggerganov#2001), another stunning job by @ikawrakow.
In the long run we need it: K-quants are better for 7B and have more flexibility, but two obstacles need to be solved.
The 64 block size is currently a compile-time define; it would have to be handled either by splitting and duplicating the code or by using a global variable instead of the define. Otherwise we'd need distinctly compiled binaries for 7B and 40B.
That's not a huge thing to change, but it doubles the kernels (again), and I'm a bit afraid of maintaining so many of them.
Maybe instead of duplicating all kernels from 32 bit to 16 bit it would be possible to write a wrapper: let the kernels work in 32 bit and convert the result to half precision. Given the parallelization, that wouldn't require much VRAM. A sketch of that wrapper idea follows.
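As a rough sketch of that wrapper idea: instead of duplicating every kernel in half precision, the existing fp32 dequantizers could be reused behind a thin shim that narrows the result to half. The names and signatures below (dequantize_f32_t, dequantize_block_f16) are assumptions for illustration, loosely modeled on the template-parameter dispatch already used in the CUDA code, not the project's actual API.

```cuda
// Rough sketch of the wrapper idea: reuse an existing fp32 dequantizer
// and narrow its output to half once, instead of duplicating the kernel.
// dequantize_f32_t and dequantize_block_f16 are illustrative names.
#include <cuda_fp16.h>

typedef void (*dequantize_f32_t)(const void * vx, int ib, int iqs, float2 & v);

template <dequantize_f32_t dequantize_block_f32>
static __device__ __forceinline__ void dequantize_block_f16(
        const void * vx, int ib, int iqs, half2 & v) {
    float2 tmp;
    dequantize_block_f32(vx, ib, iqs, tmp);   // existing 32-bit path
    v = __floats2half2_rn(tmp.x, tmp.y);      // single fp32 -> fp16 narrowing
}
```

The conversion happens on values already in registers, so it shouldn't need extra VRAM; whether the narrowing costs anything measurable would have to be benchmarked.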
I'm a bit afraid of investing hours integrating such custom variants in case another big push comes from upstream.