K Quant 64 support - quite a feat to integrate #34
I was mainly considering the feedback from some people that there are too many quantization options after the addition of the k-quants when I decided to make the 64-blocks a compile-time option. But I can see that this is not very ergonomic for Falcon users. Let me think about a better solution.
Oh, and concerning fp16, I agree with you that it would be better if we standardized on fp16 for CUDA.
Great to hear :) In hindsight, my 16-bit modifications to the dequantizers may have overshot the mark; it might have been possible to just write a single wrapper that converts the kernels. In the long run I'd personally prefer to cut all 32-bit code out of the CUDA path and go with half precision, but all the sub-functions appear to run on 32 bit, so it's not a quick change. Would a global block_size 64/256 variable introduce a performance downgrade?
The quantization type is known. There is no need for a global variable. All that is needed is to make separate types with 64 and 256 block sizes, and then decide which one to use when quantizing. After that everything will just work.
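For illustration, a rough sketch of that "separate types" idea follows. The field layouts and names (block_q4_K_256, block_q4_K_64, pick_q4_K_block_size) are hypothetical and simplified, not the actual ggml definitions; the point is only that the super-block size becomes part of the type and the choice is made once at quantization time.

```cuda
// Rough sketch only: hypothetical, simplified layouts (not the actual
// ggml definitions) showing how the super-block size could live in the
// type itself instead of a compile-time QK_K-style define.
#include <stdint.h>
#include <stddef.h>

typedef uint16_t ggml_fp16_t;       // assumption: half stored as raw bits

// 256-element super-block variant (field sizes illustrative)
typedef struct {
    ggml_fp16_t d, dmin;            // super-block scale and min
    uint8_t scales[12];             // packed sub-block scales
    uint8_t qs[256 / 2];            // 4-bit quants
} block_q4_K_256;

// 64-element super-block variant for small-row models such as Falcon 7B
typedef struct {
    ggml_fp16_t d, dmin;
    uint8_t scales[2];              // fewer sub-blocks, fewer scales
    uint8_t qs[64 / 2];
} block_q4_K_64;

// The row length is known at quantization time, so the block size is
// chosen once here; afterwards every kernel dispatches on the stored type.
static inline int pick_q4_K_block_size(size_t n_per_row) {
    return (n_per_row % 256 == 0) ? 256 : 64;
}
```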
A large patch was just integrated into llama.cpp (ggerganov#2001), another stunning job by @ikawrakow.
In the long run we need it: K-quants are better for 7B and have more flexibility, but two obstacles need to be solved.
The 64 block size is currently a compile-time define; it would have to be handled either by splitting and duplicating the code or by using a global variable instead of the define. Otherwise we'd need distinctly compiled binaries for 7B and 40B.
That's not a huge thing to change, but it doubles the kernels (again), and I'm a bit afraid of maintaining so many of them.
Maybe instead of duplicating all kernels from 32 bit to 16 bit it would be possible to write a wrapper: let the kernels work in 32 bit and convert the result to half precision. Given the parallelization, that wouldn't require much VRAM. A sketch of that wrapper idea follows.
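As a rough sketch of that wrapper idea: instead of duplicating every kernel in half precision, the existing fp32 dequantizers could be reused behind a thin shim that narrows the result to half. The names and signatures below (dequantize_f32_t, dequantize_block_f16) are assumptions for illustration, loosely modeled on the template-parameter dispatch already used in the CUDA code, not the project's actual API.

```cuda
// Rough sketch of the wrapper idea: reuse an existing fp32 dequantizer
// and narrow its output to half once, instead of duplicating the kernel.
// dequantize_f32_t and dequantize_block_f16 are illustrative names.
#include <cuda_fp16.h>

typedef void (*dequantize_f32_t)(const void * vx, int ib, int iqs, float2 & v);

template <dequantize_f32_t dequantize_block_f32>
static __device__ __forceinline__ void dequantize_block_f16(
        const void * vx, int ib, int iqs, half2 & v) {
    float2 tmp;
    dequantize_block_f32(vx, ib, iqs, tmp);   // existing 32-bit path
    v = __floats2half2_rn(tmp.x, tmp.y);      // single fp32 -> fp16 narrowing
}
```

The conversion happens on values already in registers, so it shouldn't need extra VRAM; whether the narrowing costs anything measurable would have to be benchmarked.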
I'm a bit afraid of investing hours integrating such custom variants in case another big push comes from upstream.