Do not recreate context while LLaMA is writing #828
Comments
Seems like it's due to context swapping. The context limit of llama is 2048 tokens; after that, it performs a "context swap":
https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L256
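For reference, here is a minimal sketch of that swapping idea, assuming the behavior around the linked line in main.cpp: once the context fills up, the first n_keep prompt tokens are preserved and roughly the last half of the remaining history is re-fed to the model. The names below mirror that pattern but are illustrative, not the exact llama.cpp code.

```cpp
#include <vector>
#include <cstdio>

// Illustrative sketch of the context-swap idea (not the exact main.cpp code):
// when the evaluated tokens would exceed n_ctx, keep the first n_keep prompt
// tokens and re-feed roughly the last half of the remaining history.
std::vector<int> swap_context(const std::vector<int>& history, int n_ctx, int n_keep) {
    if ((int)history.size() <= n_ctx) return history;   // still fits, nothing to do

    int n_left = (int)history.size() - n_keep;          // tokens past the kept prompt
    std::vector<int> next(history.begin(), history.begin() + n_keep);

    // Take the most recent half of the overflowing region so the model keeps
    // some short-term memory. These tokens must all be re-evaluated, which is
    // where the visible mid-generation slowdown comes from.
    next.insert(next.end(), history.end() - n_left / 2, history.end());
    return next;
}

int main() {
    std::vector<int> history(3000);                      // pretend token ids
    for (int i = 0; i < (int)history.size(); ++i) history[i] = i;

    auto next = swap_context(history, /*n_ctx=*/2048, /*n_keep=*/64);
    std::printf("history %zu -> %zu tokens re-fed after swap\n", history.size(), next.size());
}
```

Because the swap happens whenever the limit is hit, it can fire in the middle of a response, which matches the stall described in this issue.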
@ngxson This makes sense. This is invisible in, say, ChatGPT because there, this context recreation happens only after it has finished writing, when it's the user's turn.
Would it make sense to track how full the context is in interactive mode? That way we could swap the context (or in this case clear a part of it) while the user is typing the next question.
It could also work like ChatGPT. There, the context is recreated every time the user sends a message: the tokens in the message are counted, the maximum response length is added to that, and as much history as still fits is prepended. Though I don't know how that would perform, since context recreation seems rather expensive. Here, I think a better solution would be to recreate the context as soon as LLaMA stops typing. We would assume that the user's query plus LLaMA's response must not exceed a certain limit.
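A rough sketch of that ChatGPT-style budgeting, under the stated assumption that a fixed response budget is reserved up front (the function and variable names are hypothetical, not part of llama.cpp):

```cpp
#include <vector>
#include <algorithm>

// Hypothetical sketch of the budgeting described above: reserve room for the
// new message and the maximum response, then keep as many of the most recent
// history tokens as still fit in the context window.
std::vector<int> build_context(const std::vector<int>& history_tokens,
                               const std::vector<int>& message_tokens,
                               int n_ctx, int max_response_tokens) {
    // tokens left over for history after the message and the response budget
    int budget = n_ctx - (int)message_tokens.size() - max_response_tokens;
    budget = std::max(budget, 0);

    int n_hist = std::min<int>(budget, (int)history_tokens.size());

    std::vector<int> ctx(history_tokens.end() - n_hist, history_tokens.end());
    ctx.insert(ctx.end(), message_tokens.begin(), message_tokens.end());
    return ctx;  // evaluated once, before generation starts, so no mid-answer stall
}

int main() {
    std::vector<int> history(5000), message(200);
    auto ctx = build_context(history, message, /*n_ctx=*/2048, /*max_response_tokens=*/512);
    // ctx now holds the message plus the newest history, leaving 512 tokens free
    return (int)ctx.size() > 2048 ? 1 : 0;
}
```

The trade-off is that the whole rebuilt context must be re-evaluated on every user turn, whereas doing the trim right after the model stops typing hides that cost while the user is reading or composing the next message.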
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Tokens are generated at an approximately constant rate, i.e. N tokens per second on a given machine.
Current Behavior
Sometimes the LLM takes much longer than usual to generate a token; it can be a 10x slowdown.
Environment and Context
Setup
MacBook Pro 14-inch 2021
10-core Apple M1 Pro CPU
16 GB RAM
OS
macOS Ventura 13.3 (22E252)
clang --version
Steps to Reproduce
Run

./main -m ./models/ggml-vicuna-7b-4bit-rev1.bin -n 512 --color -f prompts/chat-with-vicuna.txt --seed 42 --mlock
The model will get stuck after "of":
...or visit one of▏
the city's many restaurants...

Failure Logs
Video