
Figure out truncation strategy for continue conversation mode #73

Open
simonw opened this issue Jul 1, 2023 · 10 comments
Labels: help wanted, research

Comments

simonw commented Jul 1, 2023

I'm still not clear on the best way to truncate messages in continue mode. For now I'm going to leave that and allow the model to return an error, but it would be good to have a strategy involving automatic truncation later on.

Originally posted by @simonw in #65 (comment)


simonw commented Sep 5, 2023

This just came up again for the llm chat command:

Single biggest unanswered question, which goes for the existing llm -c conversation mode as well: what happens if the conversation gets longer than the context window?

I assume different models break in different ways. But how to fix this? Two options:

  1. Prevent the conversation from continuing past that point
  2. Truncate the conversation's start (though keep injecting the system prompt) to fit

But in both cases I need to detect when this happens. I could try to catch the error and retry, but that depends on knowing what the error looks like.

I could count tokens and predict the error will occur, but I need to have rock-solid token counting for that (which I can get using tiktoken for the OpenAI models, but no idea how I'd get it for other models in plugins).

Maybe part of the answer here is introducing a new standard exception - llm.PromptTooLong perhaps - and then updating all the plugins to raise that exception.
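
Nothing like this exists in llm today, but here's a minimal sketch of what the detection side could look like for the OpenAI models, using tiktoken for the counting. PromptTooLong is the exception name floated above; check_prompt, context_limit and the message format are just illustrative:

```python
import tiktoken


class PromptTooLong(Exception):
    """Raised when a prompt would exceed the model's context window."""


def count_openai_tokens(messages, model="gpt-3.5-turbo"):
    # tiktoken only covers OpenAI models; other plugins would need their
    # own tokenizer, or could skip the check entirely.
    encoding = tiktoken.encoding_for_model(model)
    return sum(len(encoding.encode(m["content"])) for m in messages)


def check_prompt(messages, model="gpt-3.5-turbo", context_limit=4097):
    total = count_openai_tokens(messages, model)
    if total > context_limit:
        raise PromptTooLong(
            f"{total} tokens exceeds the {context_limit} token limit for {model}"
        )
    return total
```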

simonw added the help wanted and research labels Sep 5, 2023

simonw commented Sep 5, 2023

There's a really fancy version of this, where any time you run low on tokens you get the LLM to summarize the previous conversation history in order to condense it down.

Not sure if that should be a feature of LLM directly, but it's pretty interesting.
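
Not proposing this lands in llm itself, but roughly what that could look like as a helper built on the public Python API (llm.get_model() / model.prompt() / response.text()); summarize_history and the system prompt wording are just illustrative:

```python
import llm


def summarize_history(transcript: str, model_id: str = "gpt-3.5-turbo") -> str:
    """Collapse a long transcript into a short summary that can be
    re-injected as context for the next prompt."""
    model = llm.get_model(model_id)
    response = model.prompt(
        transcript,
        system="Summarize this conversation in a few short paragraphs, "
        "preserving names, decisions and open questions.",
    )
    return response.text()
```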


simonw commented Sep 5, 2023

I'm using the llm chat prototype to see how much it takes to break Llama 2 13B, which has a documented token limit of 4,096.

I'm surprised at how much space that is. I've been getting the model to tell me jokes, tell stories, etc., and I'm still only at about 2,713 tokens, counting them like this:

llm logs -c | ttok

That's the GPT-4 tokenizer, not the Llama 2 one, but I imagine the number is pretty close.
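
For an exact count, something along these lines should work with the real Llama 2 tokenizer via Hugging Face transformers (assuming access to the gated meta-llama/Llama-2-13b-hf repo; the script name is made up):

```python
# count_llama_tokens.py: count stdin with the actual Llama 2 tokenizer.
import sys

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
print(len(tokenizer.encode(sys.stdin.read())))
```

Then: llm logs -c | python count_llama_tokens.py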


simonw commented Sep 5, 2023

I'm at 4492 now and it's still going?


simonw commented Sep 5, 2023

Up to 6563 now. I'm suspecting Llama 2 (with MLC) may be truncating the input for me.


simonw commented Sep 5, 2023

Here's the full log of that conversation. https://gist.github.com/simonw/603dba123237a6e4b36e9bc5bc70b583


simonw commented Sep 5, 2023

Yeah, Llama 2 MLC doesn't seem to have a limit. I piped in a 27000 token CSV:

cat simon-wordcamp.csv | llm -m llama2 --system 'summary' 

The response cut off, but it clearly caught the end of the script:

The transcript you provided is a video of a Q&A session with a group of people discussing various topics related to artificial intelligence (AI) and machine learning. The speakers are discussing their experiences and perspectives on the current state of AI research, including the limitations of current models and the need for more recent training data. They also discuss the use of retrieval augmented generation as a state-of-the-art technique for factual questions, and the potential for directing what the AI is indexing.

Here are some key points that can be gleaned from the transcript:


simonw commented Sep 5, 2023

gpt-3.5-turbo on the other hand:

cat simon-wordcamp.csv | llm -m gpt-3.5-turbo --system 'summary'

Error: This model's maximum context length is 4097 tokens. However, your messages resulted in 29753 tokens. Please reduce the length of the messages.

garyblankenship commented

I like the idea of truncating the middle of the conversation: keep the first prompt(s) plus as many of the most recent messages as will still fit in the context.
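
A non-authoritative sketch of that idea; truncate_middle is a made-up name and count_tokens is a caller-supplied function (e.g. built on tiktoken):

```python
# Keep the start, drop the middle: retain the first prompt(s) and as many
# of the most recent messages as fit within the token budget.
def truncate_middle(messages, count_tokens, budget, keep_first=1):
    head = messages[:keep_first]
    used = sum(count_tokens(m) for m in head)
    tail = []
    # Walk backwards from the newest message, keeping whatever still fits.
    for message in reversed(messages[keep_first:]):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        tail.append(message)
        used += cost
    return head + list(reversed(tail))
```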

vividfog commented Oct 5, 2023

In a normal chat, "it keeps going" is real: it feels like a lot of room. But if the conversation starts with an info dump (command-line RAG, cat-ing in a text file, or a long !multi ... !end), the context runs out just when the conversation gets interesting.

Using the Python API I've experimented with versions that summarize the whole past conversation (everything up to N-1) into a "history so far" that gets fed back in as future context, and even this naive approach works to some degree. It never hits this modern version of a buffer overflow where the context runs out. The bot knows what's going on and how we got here. But it doesn't know what I mean if the follow-up is "that sounds neat, can you make it shorter", i.e. a reference to the exact previous message, because in my naive implementation the previous message is the whole history so far. It's still surprisingly effective at carrying a conversation.

I feel like the chat side of llm should be opinionated about context management → chat mode as pure magic. The hard part then is knowing what the context size actually is, because that is near-unknown for a gguf, fully unknown for an OpenAI-compatible REST API, and only readable from a .json file for gptq/awq.

Using something like the last 1,000, 2,000 or 3,000 tokens as the history → summary cutoff point might produce the right effect for all future models: ChatGPT-like long conversations that over time accumulate hallucinations, but usually not errors. It's a bit of a hack, but the result is magic. The status quo is "error"; this alternative at least keeps going and stays fully aware of many messages of the past. (There's a sketch of this at the end of this comment.)

And that gives permission for the non-chat mode to be very literal: "what you send is what you get", errors and all. Seeing the low-level errors is important when manually testing large context windows with large inputs. The docs could guide users to use chat when they want magic, and non-chat when they want the details.

Together those would cover both ends of the context management spectrum.

Python API users can choose to mimic what chat does somewhere in between those two ends. They can use conversation step -1 or -5 or whatever they think is the correct cutoff point for past summarization, depending on their RAG chunk size and the model they know, and they can do that themselves, following the CLI code as a reference. I don't think the Python API needs more than a documented way to point to "conversations from N-3 and before".

That's from the perspective of using chat, non-chat and the Python API with a lot of models. What a great little tool this is.
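
A rough sketch of the cutoff idea above: keep roughly the last N tokens verbatim and fold everything older into a single summary message. condense_history, count_tokens and summarize are all made-up names; the caller supplies the counting (e.g. tiktoken) and the summarizing (e.g. via the llm Python API):

```python
# Illustrative only: retain the most recent messages up to `tail_tokens`
# and replace everything older with one summary message.
def condense_history(messages, count_tokens, summarize, tail_tokens=2000):
    tail, used = [], 0
    # Walk backwards from the newest message, keeping whatever still fits.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > tail_tokens:
            break
        tail.append(message)
        used += cost
    tail.reverse()
    older = messages[: len(messages) - len(tail)]
    if not older:
        return messages
    summary = {
        "role": "system",
        "content": "Summary of the earlier conversation: " + summarize(older),
    }
    return [summary] + tail
```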
