K Quant 64 support - quite a feat to integrate #34

Open
cmp-nct opened this issue Jun 28, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@cmp-nct
Owner

cmp-nct commented Jun 28, 2023

A large patch was just integrated into llama.cpp (ggerganov#2001), another stunning job by @ikawrakow.

In the long run we need it: K quants are better for 7B and offer more flexibility. But two obstacles need to be solved first:

  1. We need to modify that PR so it's not a compiler switch anymore; it needs to support both the 256 and the 64 super-block size.
    Either by splitting and duplicating it or by using a global variable instead of the define.
    Otherwise we'd need separately compiled binaries for 7B and 40B.
  2. These are 32 bit dequantizers, but we use 16 bit for cuBLAS to save 50% VRAM.
    It's not a huge thing to change, but it doubles the kernels (again) and I'm a bit afraid of maintaining so many of them.
    Maybe instead of duplicating all kernels from 32 to 16 bit it would be possible to write a wrapper: let the kernels work in 32 bit and convert the result into half precision. Given the parallelization, that wouldn't require much VRAM.
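
To illustrate the wrapper idea from point 2, here is a minimal host-side C++ sketch, not the project's actual CUDA code. The arithmetic stays in 32 bit and only the final store depends on the destination type. `half_bits`, `float_to_half`, `store`, `dequantize_one` and `dequantize_row` are hypothetical names, and the toy "scale times int8" format stands in for a real quantization layout. In CUDA the conversion would presumably be a half-precision intrinsic rather than the hand-written truncation below.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a 16-bit float; in CUDA this would be `half`.
struct half_bits { uint16_t bits; };

// Truncating float -> IEEE half conversion, for illustration only:
// flushes subnormals to zero and maps overflow and NaN to infinity.
inline half_bits float_to_half(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const int32_t  exp  = static_cast<int32_t>((x >> 23) & 0xFFu) - 127 + 15;
    const uint32_t mant = (x >> 13) & 0x3FFu;
    if (exp <= 0)  return { static_cast<uint16_t>(sign) };
    if (exp >= 31) return { static_cast<uint16_t>(sign | 0x7C00u) };
    return { static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp) << 10) | mant) };
}

// Output conversion, selected by the destination type.
inline void store(float & dst, float v)     { dst = v; }
inline void store(half_bits & dst, float v) { dst = float_to_half(v); }

// Toy 32-bit "dequantizer": stands in for the body of a real kernel.
inline float dequantize_one(float scale, int8_t q) {
    return scale * static_cast<float>(q);
}

// Single wrapper: the math runs in fp32, only the final store depends on dst_t.
template <typename dst_t>
std::vector<dst_t> dequantize_row(float scale, const std::vector<int8_t> & q) {
    std::vector<dst_t> out(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) {
        store(out[i], dequantize_one(scale, q[i]));
    }
    return out;
}
```

With this shape, `dequantize_row<float>` and `dequantize_row<half_bits>` share one kernel body, so adding a quantization type would mean writing one dequantizer instead of two.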

I'm a bit afraid of investing hours integrating such custom variants in case another big push comes from upstream.

@ikawrakow
Collaborator

I was mainly considering the feedback from some people that there are too many quantization options after the addition of the k-quants when I decided to make the 64-blocks a compile time option. But I can see that this is not very ergonomic for Falcon users. Let me think about a better solution.

@ikawrakow
Collaborator

Oh, and concerning fp16, I agree with you that it would be better if we standardized on fp16 for CUDA

@cmp-nct
Owner Author

cmp-nct commented Jun 28, 2023

Great to hear :)
The amount of changes and features you commit regularly is astonishing.

In hindsight I think my 16 bit modifications to the dequantizers might have overshot it; maybe it would have been possible to just create a single wrapper that converts the kernels' output.
I'm sure what I did to make it 16 bit was clumsy compared to the best possible solution. I had not done any CUDA before and I probably should have spent more time planning it out.
So currently we have a 32 and a 16 bit variant of each block and row dequantization kernel, for both K and traditional Q types, which is a lot to maintain.

In the long run I'd personally prefer to cut all 32 bit out of CUDA and go with half precision, but all the sub functions appear to run on 32 bit, so it's not a quick change.

Would a global block_size 64/256 variable introduce a downgrade in performance?
Optimal (from a usability point of view) would be if the quantized data itself contained the information about its superblock size and the dequantizer just adapted based on that.
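
One way to keep both block sizes in a single binary without a global consulted inside the inner loop is to template the routine on the block size and dispatch once per call. The following is a rough C++ sketch under that assumption. `dequantize_blocks` and `dequantize` are hypothetical names and the one-scale-per-super-block layout is a toy format, not the real k-quant layout.

```cpp
#include <cstddef>
#include <stdexcept>

// Toy format: one float scale per super-block of QK int8 values.
template <int QK>
void dequantize_blocks(const float * scales, const signed char * q,
                       float * out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // QK is a compile-time constant inside this loop.
        out[i] = scales[i / QK] * static_cast<float>(q[i]);
    }
}

// Runtime dispatch: one branch per call rather than one per element.
inline void dequantize(int block_size, const float * scales,
                       const signed char * q, float * out, std::size_t n) {
    switch (block_size) {
        case 64:  dequantize_blocks<64>(scales, q, out, n);  break;
        case 256: dequantize_blocks<256>(scales, q, out, n); break;
        default:  throw std::invalid_argument("unsupported super-block size");
    }
}
```

Whether the remaining per-call branch is measurable in practice would need benchmarking; the sketch only shows that the choice does not have to sit in the hot loop.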

@ikawrakow
Collaborator

> Would a global block_size 64/256 variable introduce a downgrade in performance?
> Optimal (from a usability point of view) would be if the quantized data itself contained the information about its superblock size and the dequantizer just adapted based on that.

The quantization type is known. There is no need for a global variable. All that is needed is to make separate types with 64 and 256 block sizes, and then decide which one to use when quantizing. After that everything will just work.
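
The suggestion above could look roughly like the following C++ sketch, in which the super-block size is a property of the quantization type and the type is picked at quantization time. The enum values and function names are hypothetical and do not correspond to identifiers in llama.cpp.

```cpp
#include <cstddef>

// Hypothetical type ids, one per super-block size.
enum class quant_type { Q4_K_256, Q4_K_64 };

// The super-block size follows from the type, so no global variable is needed.
constexpr std::size_t super_block_size(quant_type t) {
    return t == quant_type::Q4_K_256 ? 256 : 64;
}

// At quantization time: prefer 256 and fall back to 64 when the row width requires it.
// Returns false if neither super-block size divides the row width.
inline bool pick_quant_type(std::size_t row_width, quant_type & out) {
    if (row_width % 256 == 0) { out = quant_type::Q4_K_256; return true; }
    if (row_width % 64  == 0) { out = quant_type::Q4_K_64;  return true; }
    return false;
}
```

For example, a row width of 8192 would select the 256 variant, while 4544 (divisible by 64 but not by 256) would select the 64 variant. Every later stage can then call `super_block_size` on the stored type.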

@cmp-nct cmp-nct added the enhancement New feature or request label Jul 13, 2023