
Q4_1 acceleration #193

Merged 5 commits into master on Mar 17, 2023
Conversation

blackhole89
Contributor

Includes vectorised inference code, quantisation and a counterpart to the Q4_0 multipart fix we introduced a while ago. Tested working up to 13B, though I can't confidently say anything about the impact on quality (especially since the RMS norm patch also just landed). Speed overheads relative to Q4_0 seem to be about 50%. This should give us a viable framework to evaluate Q4_1 quantization on x86 machines.

What's missing is accelerated inference code for ARM NEON - I have no access to any machine that has it, so I'm going to have to delegate there.
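For readers unfamiliar with the format, here is a minimal Python sketch of the Q4_1 block scheme this PR accelerates. It assumes the 32-element blocks ggml used at the time; the helper names are illustrative, not the actual ggml C API:

```python
QK = 32  # block size ggml used for Q4_x at the time (assumption)

def quantize_q4_1(block):
    """Quantize one block of QK floats to 4-bit codes plus (min, delta).

    Q4_1 stores a per-block minimum and scale, so each value is
    reconstructed as min + q * delta with q in 0..15. Q4_0, by
    contrast, stores only a scale and is symmetric around zero,
    which is why Q4_1 costs extra storage and work per block.
    """
    lo, hi = min(block), max(block)
    d = (hi - lo) / 15 or 1.0          # avoid div-by-zero on flat blocks
    qs = [max(0, min(15, round((x - lo) / d))) for x in block]
    return lo, d, qs

def dequantize_q4_1(lo, d, qs):
    """Reconstruct the floats a Q4_1 block represents."""
    return [lo + q * d for q in qs]
```

With an exact per-block min and max, the round-trip error of each value is bounded by half the scale `d`.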

@blackhole89
Contributor Author

blackhole89 commented Mar 16, 2023

The generation at 13B Q4_1 feels a bit iffy to me. Note the immediate misspelling (it changes the assistant's name to "Allice" sooner or later in over half of the chats), the sheer drunkenness (fantasy words, believing St Basil's Cathedral to be a bridge) and the final repetition loop despite the repetition penalty. I'll try to see if eliminating the RMS patch makes it better.

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.
Transcript of a conversation between a User and an AI Assistant named Alice. Alice is helpful, kind, honest and never fails to answer the User's requests immediately and in great detail. Alice knows the answer to any question.

User: Hello, Alice.
Alice: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Alice: Sure. The largest city entirely in Europe is Moscow, the capital of Russia. Its population is estimated at 13.0 million residents within the city limits, and over 21.5 million in the metropolitan area.
User: Thanks. What is the second largest city?
Allice: St. Petersburg, Russia (also known as Petrograd or Leningrad) has a population of 4.8 million. It's on the Baltic Sea and was founded by Tsar Peter the Great in 1703.
User: I mean the second largest city in Europe.
Allice: The second largest city in Europe is London, England with a population of about 9.5 million people. The Greater London area including suburbs has around 12.8 million residents.
User: Explain the differences between London and Moscow, writing in the style of the King James Bible.
Allice: Moscovadoes not have a long history of being the capital of Russia like London does. The London Bridge is not an iconic bridge in the way that the famous bridge across the Moskva River called Saint Basil's Cathedral is. While there are many cathedrals in London, it is not known for its cathedrals as much as Moscow is. In Moscovadoes not have a long history of being the capital of Russia like London does. The London Bridge is not an iconic bridge in the way that the famous bridge across the Moskva River called Saint Basil's Cathedral is. While there are many cathedrals in London, it is not known for its cathedrals as much as Moscow is. In Moscovadoes not have a long history of being the capital of Russia like London does. The London Bridge is not an iconic bridge in the way that the famous bridge across the Moskva River called Saint Basil's Cathedral is. While there are many cathedrals in London, it is not known for its cathedrals as much

main: mem per token = 22357508 bytes
main:     load time = 198048.97 ms
main:   sample time =   225.60 ms
main:  predict time = 323510.88 ms / 633.09 ms per token
main:    total time = 619781.00 ms

@blackhole89
Contributor Author

Some limited evidence that the RMS patch may be problematic. I cherry-picked it away, getting the following:

Same seed:

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.
Transcript of a conversation between a User and an AI Assistant named Alice. Alice is helpful, kind, honest and never fails to answer the User's requests immediately and in great detail. Alice knows the answer to any question.

User: Hello, Alice.
Alice: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Alice: Sure. The largest city entirely in Europe is Moscow, the capital of Russia. Its population is estimated at 13.0 million residents within the city limits, and over 21.5 million in the metropolitan area.
User: Thanks. What is the second largest city? 
Alice: The second largest city by total land area is St Petersburg, with a population of around 6 million. It was Russia's imperial capital from 1712 to 1918.
User: I mean the second largest city in Europe.
Alice: Oh, sorry. Second largest city entirely within Europe is Istanbul, Turkey, with an estimated population of 13.4 million (as of 2015), and a land area of 5,343 km² within its city limits.
User: Explain the differences between Istanbul and Moscow, writing in the style of the King James Bible.
Alice: It is my pleasure to serve you today. Moscow has an estimated population of around 13.0 million residents. Istanbul has a population of around 13.4 million inhabitants. Moscow is the capital city of Russia. Istanbul was formerly known as Constantinople, and it was the capital of the Roman Empire between 330 and 1204, when the Latin Empire was established. Istanbul is now the most populous city in Turkey and its cultural, economic and financial centre.

Different seed:

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.
Transcript of a conversation between a User and an AI Assistant named Alice. Alice is helpful, kind, honest and never fails to answer the User's requests immediately and in great detail. Alice knows the answer to any question.

User: Hello, Alice.
Alice: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Alice: Sure. The largest city entirely in Europe is Moscow, the capital of Russia. Its population is estimated at 13.0 million residents within the city limits, and over 21.5 million in the metropolitan area.
User:  Thanks. What is the second largest city?
Alice: Next, according to Wikipedia, is Istanbul with a population of 14.76 million people. It's the only metropolis that straddles two continents—Europe and Asia.
User: Explain the differences between Istanbul and Moscow, writing in the style of the King James Bible.
Alice: That should be easy enough. Let me think a minute...

\begin{itemize}
\item \textit{Istanbul is large; it hath 14.76 million people.}
\item \textit{Moscow, Moscow is a city full of people; it hath 13.0 million.}
\end{itemize}

@blackhole89 blackhole89 mentioned this pull request Mar 16, 2023
@blackhole89 blackhole89 requested a review from ggerganov March 16, 2023 14:18
@blackhole89
Contributor Author

Requesting review; it seems GitHub wouldn't actually let me review and merge my own PR (...even though I possibly could have merged locally and pushed straight to master?)

@blackhole89 blackhole89 requested a review from hoangmit March 16, 2023 23:58
@ggerganov ggerganov merged commit 904d2a8 into master Mar 17, 2023
@ggerganov
Owner

ggerganov commented Mar 17, 2023

@blackhole89 Thank you! Great work as usual 🦙

@j-f1 j-f1 deleted the q4_1_accel branch March 17, 2023 12:44
anzz1 referenced this pull request in anzz1/alpaca.cpp Mar 17, 2023
* Add AVX2 version of ggml_vec_dot_q4_1

* Small optimisations to q4_1 dot product (@Const-me)

* Rearrange Q4_1 quantization to work for multipart models. (Fix antimatter15#152)

* Fix ggml_vec_mad_q4_1 too

* Fix non-vectorised q4_1 vec mul
mudler pushed a commit to go-skynet/llama that referenced this pull request Mar 17, 2023
* Add AVX2 version of ggml_vec_dot_q4_1

* Small optimisations to q4_1 dot product (@Const-me)

* Rearrange Q4_1 quantization to work for multipart models. (Fix ggerganov#152)

* Fix ggml_vec_mad_q4_1 too

* Fix non-vectorised q4_1 vec mul
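To make the commit titles above concrete, here is a scalar sketch of the per-block dot product that the AVX2 `ggml_vec_dot_q4_1` kernel vectorises. The function name, argument layout and block representation here are illustrative, not ggml's actual signature:

```python
def vec_dot_q4_1_block(min_x, d_x, qx, min_y, d_y, qy):
    """Scalar reference for one block of the Q4_1 dot product.

    Each operand is a quantized block: (min, delta, list of 4-bit ints).
    Each element dequantizes to min + delta * q, so the dot product is
    the sum of products of the reconstructed values. The AVX2 kernel
    computes the same sum with vector instructions instead of a loop.
    """
    assert len(qx) == len(qy)
    return sum((min_x + d_x * a) * (min_y + d_y * b)
               for a, b in zip(qx, qy))
```

Because of the per-block minimum, this expands into more terms than the Q4_0 case (which has no min), which is consistent with the ~50% speed overhead noted in the PR description.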
@FNsi
Contributor

FNsi commented Mar 20, 2023

Thanks for doing that! I successfully quantised the 30B model to Q4_1; here's my test.

apx run ./main -m models/30B/ggml-model-q4_1.bin -n 4096 --temp 0.7 --top_p 0.5 --repeat_penalty 1.17647 -c 4096 -r "User:" -i --color -p "Transcript of a conversation between a User and an AI Assistant named Judy. Judy is a helpful, kind, honest and never fails to answer the User's requests immediately and in great detail. Judy knows the answer to any question."
main: seed = 1679285352
llama_model_load: loading model from 'models/30B/ggml-model-q4_1.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 4096
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 3
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: ggml ctx size = 35749.16 MB
llama_model_load: memory_size = 12480.00 MB, n_mem = 245760
llama_model_load: loading model part 1/4 from 'models/30B/ggml-model-q4_1.bin'
llama_model_load: ................................................................... done
llama_model_load: model size = 5819.56 MB / num tensors = 543
llama_model_load: loading model part 2/4 from 'models/30B/ggml-model-q4_1.bin.1'
llama_model_load: ................................................................... done
llama_model_load: model size = 5819.56 MB / num tensors = 543
llama_model_load: loading model part 3/4 from 'models/30B/ggml-model-q4_1.bin.2'
llama_model_load: ................................................................... done
llama_model_load: model size = 5819.56 MB / num tensors = 543
llama_model_load: loading model part 4/4 from 'models/30B/ggml-model-q4_1.bin.3'
llama_model_load: ................................................................... done
llama_model_load: model size = 5819.56 MB / num tensors = 543

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

main: prompt: ' Transcript of a conversation between a User and an AI Assistant named Judy. Judy is a helpful, kind, honest and never fails to answer the User's requests immediately and in great detail. Judy knows the answer to any question.'
main: number of tokens in prompt = 53
1 -> ''
4103 -> ' Trans'
924 -> 'cript'
310 -> ' of'
263 -> ' a'
14983 -> ' conversation'
1546 -> ' between'
263 -> ' a'
4911 -> ' User'
322 -> ' and'
385 -> ' an'
29871 -> ' '
23869 -> 'AI'
4007 -> ' Ass'
22137 -> 'istant'
4257 -> ' named'
8660 -> ' Jud'
29891 -> 'y'
29889 -> '.'
8660 -> ' Jud'
29891 -> 'y'
338 -> ' is'
263 -> ' a'
8444 -> ' helpful'
29892 -> ','
2924 -> ' kind'
29892 -> ','
15993 -> ' honest'
322 -> ' and'
2360 -> ' never'
8465 -> ' fails'
304 -> ' to'
1234 -> ' answer'
278 -> ' the'
4911 -> ' User'
29915 -> '''
29879 -> 's'
7274 -> ' requests'
7389 -> ' immediately'
322 -> ' and'
297 -> ' in'
2107 -> ' great'
9493 -> ' detail'
29889 -> '.'
8660 -> ' Jud'
29891 -> 'y'
9906 -> ' knows'
278 -> ' the'
1234 -> ' answer'
304 -> ' to'
738 -> ' any'
1139 -> ' question'
29889 -> '.'

main: interactive mode on.
main: reverse prompt: 'User:'
main: number of tokens in reverse prompt = 2
2659 -> 'User'
29901 -> ':'

sampling parameters: temp = 0.700000, top_k = 40, top_p = 0.500000, repeat_last_n = 64, repeat_penalty = 1.176470

== Running in interactive mode. ==

 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.
    Transcript of a conversation between a User and an AI Assistant named Judy. Judy is a helpful, kind, honest and never fails to answer the User's requests immediately and in great detail. Judy knows the answer to any question.
    User:Hello Judy.
    Judy: Hello! How can I help you?
    User:Please tell me the largest city in Europe.
    Judy: The largest city in Europe is Istanbul, Turkey with a population of 14 million people.
    User:Thanks. What's the second largest city?
    Judy:The second largest city is Moscow, Russia with a population of 12 million people.
    User:I mean the second largest city in Europe.
    Judy:Oh! I see what you mean now. The second largest city in Europe is London, England with a population of 8 million people.
    User:exlpain the difference between London and Istanbol.
    Judy:London is the capital of Great Britain while Istanbul is the capital of Turkey.
    User:can you writing in the style of King James Bible?
    Judy:I'm sorry, but I don't understand what that means. Can you please explain it to me?
    User:Using style of King James Bible to explain the difference between London and Istanbul.
    Judy:The King James version of the bible was written in 1604 by a group of scholars led by King James VI of Scotland. The translation is still used today, although many newer translations have been published since then.
    User:thank you, see you later Judy.
    Judy:You're welcome! Have a nice day! [end of text]

main: mem per token = 43600900 bytes
main: load time = 13053.25 ms
main: sample time = 120.58 ms
main: predict time = 249360.69 ms / 666.74 ms per token

@blackhole89
Contributor Author

Thank you!

@FNsi, is this the Q4_1 quantisation you are testing? The filename says models/30B/ggml-model-q4_0.bin. How is the performance at 30B?

@FNsi
Contributor

FNsi commented Mar 20, 2023

Sorry, I just noticed the mistake I made. I've changed that line to the q4_1 bin and resubmitted the current response.

Thank you!

@FNsi, is this the Q4_1 quantisation you are testing? The filename says models/30B/ggml-model-q4_0.bin. How is the performance at 30B?
