
[REQUEST] Accept raw token IDs in stop parameter #1360

Open · ddh0 opened this issue Apr 18, 2024 · 11 comments
Labels: enhancement (New feature or request)

@ddh0 (Contributor) commented Apr 18, 2024

Is your feature request related to a problem? Please describe.

I use Llama.create_completion() in my workflow, which lets me pass stop strings to end generation. However, getting generation to actually stop when the model produces an EOS token is still sometimes a problem.

My current issue is with the newly released Llama 3 family of models, which uses two stop tokens: token ID 128001 ("<|end_of_text|>") and token ID 128009 ("<|eot_id|>"). The former stops generation as expected, but the latter does not.
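For what it's worth, the two token IDs can be verified by tokenizing the special tokens directly. A minimal sketch, assuming a local Llama 3 Instruct GGUF (the path is a placeholder):

```python
from llama_cpp import Llama

# Placeholder path to a Llama 3 Instruct GGUF
llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", verbose=False)

# special=True keeps the special tokens intact instead of splitting them
print(llm.tokenize(b"<|end_of_text|>", add_bos=False, special=True))  # [128001]
print(llm.tokenize(b"<|eot_id|>", add_bos=False, special=True))       # [128009]
```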

Even aside from the current issue with Llama 3 models specifically, I think this could be a very useful feature.

Describe the solution you'd like

It would be nice if I could specify stop=[128001, 128009] or similar so that the generation ends when either token is generated, not only "<|end_of_text|>".

Describe alternatives you've considered

I have tried to specify stop=['<|eot_id|>'] to add the stop token, but this doesn't work:

Hello there! I'm Llama 3, nice to meet you! Is there something I can help you with or would you like to chat about something in particular?assistant

Not much, just saying hi! It's nice to have someone to talk to. Do you have any fun plans or activities coming up?assistant

[...]
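For reference, a minimal sketch of the kind of call that produces the output above (the model path and prompt are placeholders):

```python
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")  # placeholder path

# Pre-formatted Llama 3 Instruct prompt (placeholder conversation)
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

# As described above, generation does not stop at "<|eot_id|>" even though
# it is passed as a stop string
output = llm.create_completion(prompt, max_tokens=256, stop=["<|eot_id|>"])
print(output["choices"][0]["text"])
```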

Thank you @abetlen for your time and all your hard work. Let me know if there's anything I can do to help. :)

@ddh0 (Contributor, Author) commented Apr 18, 2024

For reference, I am using the chat template described here, which seems to be working perfectly other than the stopping issue. The official JSON files are under meta-llama/Meta-Llama-3-70B-Instruct.

@etemiz commented Apr 19, 2024

My issue seems to be both stopping and repetitive sentences.

@ddh0 (Contributor, Author) commented Apr 19, 2024

Meta has updated their repos to specify that both stop tokens should be used as EOS: ggerganov/llama.cpp#6745 (comment)

@etemiz commented Apr 19, 2024

Apparently I was using a non-instruct version!
I am now having success with https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF because they set tokenizer.ggml.eos_token_id to 128009 in the GGUF file.
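A quick way to check which EOS token ID a given GGUF declares, in case it helps others (a minimal sketch; the path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-70B-Instruct.Q4_K_M.gguf", verbose=False)  # placeholder

# The GGUF key/value metadata is exposed as a dict of strings
print(llm.metadata.get("tokenizer.ggml.eos_token_id"))  # "128009" in the NousResearch GGUF
print(llm.token_eos())                                  # EOS token ID the loaded model reports
```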

@ddh0 (Contributor, Author) commented Apr 19, 2024

But you shouldn't have to hack the metadata to get the model to work as intended... And in any case, I think this could be a useful feature in general, not just for Llama 3.

@abetlen added the enhancement (New feature or request) label on Apr 20, 2024

@abetlen (Owner) commented Apr 20, 2024

@ddh0 thanks for reporting, and sorry I haven't had a chance to get to this until now (I just loaded up Llama 3 and ran into the same issue). I think there are a couple of things that need to be updated here:

  • first, we need to be able to accept multiple EOS tokens from the GGUF metadata (a straightforward check after deserialising the JSON)
  • second, we need a way to stop on token IDs as well as strings. I would prefer to use StoppingCriteria for this rather than expanding the scope of the stop argument. I'm going to update ChatFormatterResponse in llama_chat_format to have an optional stopping_criteria property, which will be set by the Jinja2ChatFormatter.

Should be resolved shortly.

@abetlen (Owner) commented Apr 20, 2024

Yup that works with the NousResearch repo!

(screen recording: 2024-04-19.23-46-25.mp4)

@abetlen (Owner) commented Apr 20, 2024

Kk, I've implemented 2. from above and published it in v0.2.63; this should fix Llama 3 Instruct when using the chat format from the GGUF metadata. Tested, and it works with the NousResearch quantization that specifies 128009 as the stop token ID.

I'll implement 1. as well to add support for multiple stop token IDs if anyone can link a GGUF file with that metadata.
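With v0.2.63 and a GGUF that carries both the chat template and the correct EOS token ID in its metadata, a plain chat call should now stop correctly; a minimal sketch (the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")  # placeholder path

# chat_format is left unset so the handler is built from the GGUF metadata,
# which now includes the stop token handling added in v0.2.63
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])
```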

@etemiz commented Apr 20, 2024

NousResearch was already working. The ones that don't work are: MaziyarPanahi and LoneStriker.

@ddh0 (Contributor, Author) commented Apr 20, 2024

> I'll implement 1. as well to add support for multiple stop token IDs if anyone can link a GGUF file with that metadata.

If I understand correctly, the llama.cpp folks haven't decided exactly how to support multiple EOS tokens in GGUF metadata.

> second, we need a way to stop on token IDs as well as strings. I would prefer to use StoppingCriteria for this rather than expanding the scope of the stop argument. I'm going to update ChatFormatterResponse in llama_chat_format to have an optional stopping_criteria property, which will be set by the Jinja2ChatFormatter.

Would this require me to switch from Llama.create_completion() to one of the chat methods instead? Or would there be a way to specify stop tokens in create_completion? Currently I just pass my pre-formatted prompt to create_completion with the necessary stop sequences.

@abetlen (Owner) commented Apr 20, 2024

@etemiz yes, sorry, I was mistaken; it looks like stop is sufficient if the correct EOS token is specified in the GGUF. While both are valid, the 128009 token is the one the model actually uses in practice to end a conversation turn. I think the change should be made in the GGUF file or specified in a custom chat handler.

The following should work to override the chat format from the GGUFs.

```python
from llama_cpp.llama_chat_format import Jinja2ChatFormatter

model.chat_handler = Jinja2ChatFormatter(
    template=model.metadata["tokenizer.chat_template"],
    bos_token=model.bos_token(),
    eos_token=model.eos_token(),
    stop_token_ids=[128001, 128009],  # <|end_of_text|> and <|eot_id|>
).to_chat_handler()
```
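Once the handler is set, chat calls go through it as usual; a minimal sketch of the follow-up usage:

```python
response = model.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])
```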

@ddh0 no, you can still use create_completion with stopping_criteria instead of stop, like this:

```python
import llama_cpp

# Stop on the model's reported EOS token as well as the EOS token ID
# declared in the GGUF metadata
token_ids = {
    model.token_eos(),
    int(model.metadata["tokenizer.ggml.eos_token_id"]),
}

def stop_on_token_ids(tokens, *args, **kwargs):
    # Stop as soon as the most recently sampled token is one of the stop IDs
    return tokens[-1] in token_ids

stopping_criteria = llama_cpp.StoppingCriteriaList([stop_on_token_ids])

model.create_completion(prompt=prompt, stopping_criteria=stopping_criteria)
```
