[REQUEST] Accept raw token IDs in stop parameter #1360
Comments
For reference, I am using the chat template described here, which seems to be working perfectly other than the stopping issue. The official JSON files are under meta-llama/Meta-Llama-3-70B-Instruct.
Meta has updated their repos to specify that both stop tokens should be used as EOS: ggerganov/llama.cpp#6745 (comment)
Apparently I was using a non-instruct version!
But you shouldn't have to hack the metadata to get the model to work as intended... And in any case I think this could be a useful feature in general, not just for Llama 3.
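For context, one quick way to see which EOS token id a particular quant actually ships with is to read the gguf metadata through llama-cpp-python. This is a minimal sketch; the model path is hypothetical:

```python
from llama_cpp import Llama

# Hypothetical path to a Llama 3 Instruct quantization
model = Llama(model_path="./Meta-Llama-3-70B-Instruct.Q4_K_M.gguf", verbose=False)

# EOS token id stored in the gguf metadata (as a string), and the id llama.cpp reports
print(model.metadata.get("tokenizer.ggml.eos_token_id"))  # e.g. "128001" or "128009"
print(model.token_eos())
```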
@ddh0 thanks for reporting, sorry I haven't had a chance to get to this till now (I just had a chance to load up Llama 3 and ran into the same issue). I think there are a couple of things that need to be updated here:

1. Add support for multiple stop token ids.
2. Use the stop / EOS token id from the gguf metadata when building the chat format from the gguf's chat template.

Should be resolved shortly.
Yup, that works with the NousResearch repo! (screen recording: 2024-04-19.23-46-25.mp4)
Kk, I've implemented 2. from above and published it in v0.2.63; this should fix Llama 3 instruct when using the chat format from the gguf metadata. Tested, and it works with the NousResearch quantization that specifies 128009 as the stop token id. I'll implement 1. as well to add support for multiple stop token ids if anyone can link a gguf file with that metadata.
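For reference, this is roughly how that looks from the user side once v0.2.63 is installed. A minimal sketch with a hypothetical model path; no explicit chat_format is passed, so the chat template stored in the gguf metadata is used:

```python
from llama_cpp import Llama

# Hypothetical path; with no chat_format argument, the chat template (and its stop
# token) comes from the gguf metadata.
model = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", n_ctx=8192)

out = model.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello, then stop."}]
)
print(out["choices"][0]["message"]["content"])
```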
NousResearch was already working. The ones that don't work are MaziyarPanahi and LoneStriker.
If I understand correctly, the llama.cpp folks haven't decided how exactly to support multiple EOS tokens in GGUF metadata.
Would this require me to switch from `Llama.create_completion()`?
@etemiz yes, sorry, I was mistaken. The following should work to override the chat format in the ggufs:

```python
from llama_cpp.llama_chat_format import Jinja2ChatFormatter

model.chat_handler = Jinja2ChatFormatter(
    template=model.metadata["tokenizer.chat_template"],
    bos_token=model.bos_token(),
    eos_token=model.eos_token(),
    stop_token_ids=[128001, 128009]
).to_chat_handler()
```
@ddh0 no, you can still use `Llama.create_completion()`; just pass a custom `stopping_criteria`:

```python
import llama_cpp

token_ids = {
    model.token_eos(),
    int(model.metadata["tokenizer.ggml.eos_token_id"])
}

def stop_on_token_ids(tokens, *args, **kwargs):
    return tokens[-1] in token_ids

stopping_criteria = llama_cpp.StoppingCriteriaList([stop_on_token_ids])
model.create_completion(prompt=prompt, stopping_criteria=stopping_criteria)
```
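If a gguf's metadata only carries one of the two Llama 3 end tokens, a variant of the same idea can hard-code both ids taken from the issue description. This is a sketch, assuming `model` and `prompt` are already defined as above:

```python
import llama_cpp

# Both Llama 3 end tokens from the issue: <|end_of_text|> and <|eot_id|>
llama3_stop_ids = {128001, 128009}

def stop_on_llama3_ids(tokens, *args, **kwargs):
    # Stop as soon as the most recently sampled token is either end token
    return tokens[-1] in llama3_stop_ids

model.create_completion(
    prompt=prompt,
    stopping_criteria=llama_cpp.StoppingCriteriaList([stop_on_llama3_ids]),
)
```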
**Is your feature request related to a problem? Please describe.**
I use `Llama.create_completion()` for my workflow, which allows me to pass stopping strings to end the generation. However, actually stopping when the model generates an EOS token is still sometimes a problem.

My current issue is with the newly released Llama 3 family of models, which use multiple stop tokens: token ID 128001, which is `<|end_of_text|>`, and token ID 128009, which is `<|eot_id|>`. The former works as expected, stopping generation, but the latter does not stop generation.

Even aside from the current issue with Llama 3 models specifically, I think this could be a very useful feature.
**Describe the solution you'd like**
It would be nice if I could specify `stop=[128001, 128009]` or similar so that the generation ends when either token is generated, not only `<|end_of_text|>`.
**Describe alternatives you've considered**
I have tried to specify `stop=['<|eot_id|>']` to add the stop token, but this doesn't work.

Thank you @abetlen for your time and all your hard work. Let me know if there's anything I can do to help. :)