Q4_1 acceleration #193
Conversation
The generation at 13B Q4_1 feels a bit iffy to me. Note the immediate misspelling (it changes the assistant's name to "Allice" sooner or later in over half of the chats), the sheer drunkenness (fantasy words, believing St Basil's Cathedral to be a bridge), and the final repetition loop despite the repetition penalty. I'll try and see if eliminating the RMS patch makes it better.
Some limited evidence that the RMS patch may be problematic. I cherry-picked it away, getting the following:

Same seed: [output omitted]

Different seed: [output omitted]
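For context, the RMS patch swaps the model's normalisation for RMSNorm, which rescales by the root mean square instead of mean-centring first. A minimal scalar sketch of the idea; the function name and epsilon are assumptions for illustration, not ggml's actual API:

```c
#include <math.h>
#include <stddef.h>

/* Illustrative scalar RMSNorm over one row of activations. Unlike
   classic LayerNorm, it does not subtract the mean first; it only
   rescales by the root mean square, then applies a learned weight. */
static void rms_norm_row(float *x, const float *weight, size_t n) {
    const float eps = 1e-6f; /* assumed epsilon, for numerical safety */
    float sum_sq = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum_sq += x[i] * x[i];
    }
    const float scale = 1.0f / sqrtf(sum_sq / (float) n + eps);
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] * scale * weight[i];
    }
}
```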
Requesting review; it seems that it wouldn't actually let me review and merge myself (...even though I possibly could have merged locally and pushed straight to master?)
@blackhole89 Thank you! Great work as usual 🦙
* Add AVX2 version of ggml_vec_dot_q4_1
* Small optimisations to q4_1 dot product (@Const-me)
* Rearrange Q4_1 quantization to work for multipart models. (Fix ggerganov#152)
* Fix ggml_vec_mad_q4_1 too
* Fix non-vectorised q4_1 vec mul
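For reference, here is a non-vectorised sketch of what a Q4_1 dot product computes. The block layout (a scale d, a minimum m, and QK/2 packed nibbles with QK = 32) and all names are assumptions in the spirit of ggml, not its exact structs; each 4-bit value decodes as d*q + m before multiplying:

```c
#include <stdint.h>
#include <stddef.h>

#define QK 32  /* values per quantization block (assumed) */

/* Illustrative Q4_1 block: each 4-bit value q decodes to d*q + m. */
typedef struct {
    float   d;          /* scale */
    float   m;          /* per-block minimum */
    uint8_t qs[QK / 2]; /* QK packed 4-bit values, two per byte */
} block_q4_1;

/* Reference (non-vectorised) dot product of two Q4_1-quantized rows.
   n is the number of scalar elements; both rows use the same layout. */
static float vec_dot_q4_1_ref(size_t n, const block_q4_1 *x, const block_q4_1 *y) {
    float sum = 0.0f;
    for (size_t b = 0; b < n / QK; b++) {
        for (size_t j = 0; j < QK / 2; j++) {
            const uint8_t bx = x[b].qs[j], by = y[b].qs[j];
            /* low and high nibbles hold two quantized values */
            const float x0 = x[b].d * (float) (bx & 0x0F) + x[b].m;
            const float x1 = x[b].d * (float) (bx >> 4)   + x[b].m;
            const float y0 = y[b].d * (float) (by & 0x0F) + y[b].m;
            const float y1 = y[b].d * (float) (by >> 4)   + y[b].m;
            sum += x0 * y0 + x1 * y1;
        }
    }
    return sum;
}
```

The extra m term in every product is part of why Q4_1 inference costs more than Q4_0; the AVX2 version performs the same arithmetic on whole registers of nibbles at once.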
Thanks for doing that! I successfully quantised 30B to Q4_1; here's my test.
```
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: prompt: ' Transcript of a conversation between a User and an AI Assistant named Judy. Judy is a helpful, kind, honest and never fails to answer the User's requests immediately and in great detail. Judy knows the answer to any question.'
main: interactive mode on.
sampling parameters: temp = 0.700000, top_k = 40, top_p = 0.500000, repeat_last_n = 64, repeat_penalty = 1.176470

== Running in interactive mode. ==

main: mem per token = 43600900 bytes
```
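For context on those sampling parameters: a hedged sketch of one common repeat-penalty formulation (assumed here, not necessarily llama.cpp's exact rule), which dampens the logits of the last repeat_last_n tokens before sampling:

```c
#include <stddef.h>

/* Illustrative repeat penalty: one common formulation (an assumption,
   not necessarily the rule this codebase uses) divides positive logits
   of recently seen tokens by the penalty and multiplies negative ones,
   making repeats less likely either way. */
static void apply_repeat_penalty(float *logits, size_t n_vocab,
                                 const int *last_tokens, size_t n_last,
                                 float penalty /* e.g. 1.176470 */) {
    for (size_t i = 0; i < n_last; i++) {
        const int tok = last_tokens[i];
        if (tok < 0 || (size_t) tok >= n_vocab) continue;
        if (logits[tok] > 0.0f) {
            logits[tok] /= penalty;
        } else {
            logits[tok] *= penalty;
        }
    }
}
```

With penalty ≈ 1.18 the dampening is mild, which is consistent with a model still falling into repetition loops despite it.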
Thank you! @FNsi, is this the Q4_1 quantization you are testing? The filename says models/30B/ggml-model-q4_0.bin. How is the performance at 30B?
Sorry, I just found the mistake I made. I've changed the line to the 4_1 bin and resubmitted the current response.
Includes vectorised inference code, quantisation and a counterpart to the Q4_0 multipart fix we introduced a while ago. Tested working up to 13B, though I can't confidently say anything about the impact on quality (especially since the RMS norm patch also just landed). Speed overheads relative to Q4_0 seem to be about 50%. This should give us a viable framework to evaluate Q4_1 quantization on x86 machines.
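As a rough sketch of the quantisation side (illustrative; the names and the block size QK = 32 are assumptions, not the exact ggml code): unlike Q4_0, which stores only a scale, Q4_1 keeps a per-block minimum as well, so each block of 32 floats is reduced to a minimum, a scale, and 32 4-bit indices:

```c
#include <stdint.h>

#define QK 32  /* block size (assumed) */

/* Illustrative Q4_1 quantisation of one block of QK floats.
   Values later decode as d*q + m, where q is a 4-bit index. */
static void quantize_block_q4_1(const float *x, float *d, float *m, uint8_t *qs) {
    float min = x[0], max = x[0];
    for (int i = 1; i < QK; i++) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
    }
    *m = min;
    *d = (max - min) / 15.0f;                 /* 4 bits -> 16 levels */
    const float id = (*d != 0.0f) ? 1.0f / *d : 0.0f;
    for (int i = 0; i < QK; i += 2) {
        /* round each value to its nearest 4-bit level and clamp */
        uint8_t q0 = (uint8_t) ((x[i]     - min) * id + 0.5f);
        uint8_t q1 = (uint8_t) ((x[i + 1] - min) * id + 0.5f);
        if (q0 > 15) q0 = 15;
        if (q1 > 15) q1 = 15;
        qs[i / 2] = (uint8_t) (q0 | (q1 << 4)); /* two values per byte */
    }
}
```

Storing the minimum lets asymmetric value distributions use all 16 levels, which is the quality argument for Q4_1 over Q4_0 and what the evaluation above is meant to test.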
What's missing is accelerated inference code for ARM NEON - I have no access to any machine that has it, so I'm going to have to delegate there.