Figure out truncation strategy for continue conversation mode #73
Comments
This just came up again for the …
There's a really fancy version of this, where any time you run low on tokens you get the LLM to summarize the previous conversation history in order to condense it down. Not sure if that should be a feature of LLM directly, but it's pretty interesting.
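A minimal sketch of that condense-on-low-tokens idea, not anything llm does today: the token counting uses tiktoken, the context limit and headroom numbers are made up, and `prompt_model` is a stand-in for whatever completion call is available rather than part of llm's API.

```python
# Sketch: when the accumulated history gets close to the (assumed) context
# limit, replace it with an LLM-generated summary.
import tiktoken

CONTEXT_LIMIT = 4096   # assumed limit for the model in use
HEADROOM = 1000        # tokens to leave free for the next exchange

encoding = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

def maybe_condense(history: str, prompt_model) -> str:
    """Return the history unchanged, or a summary of it if we're running low."""
    if count_tokens(history) < CONTEXT_LIMIT - HEADROOM:
        return history
    # prompt_model is a placeholder callable: str -> str
    return prompt_model(
        "Summarize this conversation so far, keeping facts and decisions:\n\n"
        + history
    )
```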
I'm using the …
I'm surprised at how much space that is. I've been getting the model to tell me jokes, tell stories etc. and I'm still only at about 2713 tokens, counting them like this:
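One way to reproduce a count like that with the GPT-4 tokenizer is tiktoken; this is just one possible command, not necessarily the one used here.

```python
# count_tokens.py - count tokens on stdin with the GPT-4 tokenizer (tiktoken).
import sys
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
print(len(encoding.encode(sys.stdin.read())))
```

Run as, for example, `python count_tokens.py < conversation.log`.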
That's the GPT-4 tokenizer, not the Llama 2 one, but I imagine the number is pretty close.
I'm at 4492 now and it's still going?
Up to 6563 now. I'm suspecting Llama 2 (with MLC) may be truncating the input for me.
Here's the full log of that conversation: https://gist.github.com/simonw/603dba123237a6e4b36e9bc5bc70b583
Yeah, Llama 2 MLC doesn't seem to have a limit. I piped in a 27000 token CSV:
The response cut off, but it clearly caught the end of the script:
cat simon-wordcamp.csv | llm -m gpt-3.5-turbo --system 'summary'
I like the idea of truncating middle text: keeping the first prompt(s) and as much of the most recent conversation as will still fit in the context.
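A minimal sketch of that middle-truncation idea, assuming the history is a list of plain strings; the tokenizer choice and the 3500-token budget are made-up stand-ins.

```python
# Keep the first prompt(s) and as many of the most recent messages as still
# fit in the token budget; drop the middle.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

def truncate_middle(messages: list[str], budget: int = 3500, keep_first: int = 1) -> list[str]:
    head = messages[:keep_first]
    used = sum(len(encoding.encode(m)) for m in head)
    tail: list[str] = []
    # Walk backwards from the newest message, keeping whatever still fits.
    for message in reversed(messages[keep_first:]):
        cost = len(encoding.encode(message))
        if used + cost > budget:
            break
        tail.insert(0, message)
        used += cost
    return head + tail
```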
In a normal …

Using the Python API I've experimented with versions that summarize the whole past conversation (the previous N-1 messages) into a "history so far" that becomes the input context for future turns, and even this naive approach works to some degree. It never hits this modern version of a buffer overflow, the context running out. The bot knows what's going on and how we got here. But it doesn't know what I mean if the follow-up is "that sounds neat, can you make it shorter", that is, a reference to the exact previous message. In my naive implementation the previous message is the whole history so far. Yet it's surprisingly effective at carrying a conversation.

I feel like the …

Using something like -1000, -2000 or -3000 tokens (relative to the context limit) as the history → summary cutoff point might produce the right effect for all future models. As with long ChatGPT-style conversations, over time it will accumulate hallucinations, but usually not errors. It's a bit of a hack, but the result is magic. The status quo is an error; this alternative at least keeps going and stays aware of many messages from the past.

And that gives permission for the non-chat mode to be very absolute, "what you send is what you get", errors and all. Seeing the low-level errors is important when manually testing large context windows with large inputs. The docs could guide users to use chat when they want magic, and non-chat when they want the details. Together those would cover both ends of the context management spectrum. Python API users can choose to mimic what …

That's from the perspective of using chat, non-chat and the Python API with a lot of models. What a great little tool this is.
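One way to read the cutoff idea in that comment, sketched with made-up numbers: keep the most recent couple of thousand tokens of history verbatim and summarize everything older than that. The `summarize` callable is a placeholder, not part of llm's API.

```python
# Split history roughly N tokens from the end: summarize the older part,
# keep the recent part verbatim. CUTOFF is the "-1000/-2000/-3000" knob above.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
CUTOFF = 2000  # tokens of recent history to keep verbatim (assumed value)

def condense_history(history: str, summarize) -> str:
    tokens = encoding.encode(history)
    if len(tokens) <= CUTOFF:
        return history
    older = encoding.decode(tokens[:-CUTOFF])
    recent = encoding.decode(tokens[-CUTOFF:])
    # summarize is a placeholder callable: str -> str
    return summarize(older) + "\n\n" + recent
```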
Originally posted by @simonw in #65 (comment)