Feature - Internal ggml precision GGML_TYPE_F16 support #1492
Comments
This is already supported and in use? I'm not sure which parts you are referring to.
matmul is designed for 32-bit only; the precision is hardcoded for the dst. That's why all src1 matmuls are 32-bit, even in 4-bit quantized mode.
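For reference, a minimal sketch of the behavior this comment describes, based on ggml's public API (the tensor shapes are arbitrary and only for illustration):

```c
#include "ggml.h"

// Sketch: regardless of the operand types, ggml_mul_mat currently
// produces an F32 destination tensor. Sizes here are arbitrary.
int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024 * 1024,  // 16 MB scratch, arbitrary
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // src0 may be F16 (or quantized), but src1 is expected as F32 ...
    struct ggml_tensor * src0 = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 64, 64);
    struct ggml_tensor * src1 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 32);

    // ... and the dst type is hardcoded to F32 inside ggml_mul_mat:
    struct ggml_tensor * dst = ggml_mul_mat(ctx, src0, src1);
    // dst->type == GGML_TYPE_F32, independent of src0->type / src1->type

    ggml_free(ctx);
    return 0;
}
```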
I believe this is what #959 was about.
Is it really slow? My expectation is that it would be completely negligible.
I ran a test yesterday and had significantly faster inference, but it was a hacked-together test. With my recent upstream pull, all my local code needs to be adapted again; I'll run a second test once I put the pieces back together to confirm it. @ggerganov: Do you know if the 32-bit precision comes with a real quality benefit compared to half precision?
There is no measurable difference in perplexity between F16 and F32.
I ran the test again and could not replicate the performance gain; maybe I had two changes in place yesterday.
This issue was closed because it has been inactive for 14 days since being marked as stale.
It might be too much to ask for now, given that it roots deep into ggml, but in the long term I believe it's important to support 16-bit precision.
Especially as GPU support gains more and more traction in ggml, the 32-bit requirement is a significant performance burden while providing no benefit on the multiplications.
After all, the multiplications inside the GPU are all 16-bit; converting src1 from 32-bit to 16-bit for every calculation costs quite noticeable performance.
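To make the described overhead concrete, here is a rough sketch, not the actual GPU path in ggml, of the extra conversion pass a 16-bit matmul incurs when src1 arrives as F32. It assumes ggml's scalar conversion helper `ggml_fp32_to_fp16` from `ggml.h`; the function name `convert_src1_f32_to_f16` is made up for illustration, and in the real backend this would be a device kernel rather than a plain loop:

```c
#include <stddef.h>
#include "ggml.h"

// Hypothetical helper: before a 16-bit GPU GEMM can run, an F32 src1
// has to be converted to F16 element by element. This is one full
// extra read and write over src1 on every matmul call.
static void convert_src1_f32_to_f16(const float * src1_f32,
                                    ggml_fp16_t * src1_f16,
                                    size_t n) {
    for (size_t i = 0; i < n; ++i) {
        src1_f16[i] = ggml_fp32_to_fp16(src1_f32[i]);
    }
}
```

If src1 were kept in F16 end to end, this per-multiplication conversion pass would disappear entirely, which is the saving the issue is asking for.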