
Q4_1 inference appears broken for 13B parameters #152

Closed
blackhole89 opened this issue Mar 15, 2023 · 5 comments
Labels
bug Something isn't working model Model specific

Comments

@blackhole89
Contributor

I have been experimenting with q4_1 quantisation (since some preliminary results suggest it should perform better), and noticed that something in the pipeline for the 13B parameter model is broken (whether it is the quantization itself, the saving, or the loading). This results in all inferred tokens coming out as #. Meanwhile, 7B works well.

I know we had a patch a while ago that first made the 13B+ models work for q4_0 - did whatever fixes it made not cover q4_1?

blackhole89 added the bug label on Mar 15, 2023
@tisfeng

tisfeng commented Mar 15, 2023

Yes, I use the 13B model and it doesn't work properly for either chatting or tasks. It simply ends without any response, which is very confusing to me.

[screenshot]

@ggerganov
Owner

ggerganov commented Mar 15, 2023

First - great work!

Most likely the cause is that when I changed the Q4_0 scaling storage, I skipped the Q4_1 routines:

007a8f6

This change was necessary to make the larger models work: when we merge rows from different shards, the scaling factors have to sit next to the integer quants. Originally, the scaling factors of a row ended up at the start of the memory buffer, before all the int quants, and this caused the merging in main.cpp to fail.

In short: you need to rearrange the scaling and offset factors for a chunk to be located right before the int quants in the memory buffer in order to have correct merging of the shards.

P.S. By the way, it is strange that you don't see any asserts firing. Are you building with -DNDEBUG while developing? If you remove -DNDEBUG, you should see infs or nans at the very start of inference.

@tisfeng

tisfeng commented Mar 15, 2023

I didn't use -DNDEBUG. I am a newcomer to this; I just converted the 13B model using the latest code from the main branch, and no errors were reported during the conversion.

make -j && ./main -m ./models/13B/ggml-model-q4_0.bin -p "What is the best programming language in the world? Why?" -t 8 -n 512

When using the 7B model, it works fine.

[screenshot]

@blackhole89
Contributor Author

In 13B Q4_1 mode:

[screenshot]

It seems to work now, so I might make a PR against the main repository.

@tisfeng
Copy link

tisfeng commented Mar 16, 2023

Great, thank you!

ggerganov pushed a commit that referenced this issue Mar 17, 2023
* Add AVX2 version of ggml_vec_dot_q4_1

* Small optimisations to q4_1 dot product (@Const-me)

* Rearrange Q4_1 quantization to work for multipart models. (Fix #152)

* Fix ggml_vec_mad_q4_1 too

* Fix non-vectorised q4_1 vec mul
anzz1 referenced this issue in anzz1/alpaca.cpp Mar 17, 2023 (same commit message as above, with "Fix antimatter15#152")
mudler pushed a commit to go-skynet/llama that referenced this issue Mar 17, 2023 (same commit message as above, with "Fix ggerganov#152")