
[REQUEST] Accept raw token IDs in stop parameter #1360

Open · ddh0 opened this issue Apr 18, 2024 · 11 comments
Labels: enhancement (New feature or request)

@ddh0 (Contributor) commented Apr 18, 2024

Is your feature request related to a problem? Please describe.

I use Llama.create_completion() in my workflow, which lets me pass stop strings to end generation. However, getting generation to actually stop when the model produces an EOS token is still sometimes a problem.

My current issue is with the newly released Llama 3 family of models, which uses two stop tokens: token ID 128001 ("<|end_of_text|>") and token ID 128009 ("<|eot_id|>"). The former stops generation as expected, but the latter does not.
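For what it's worth, the two token IDs can be verified by tokenizing the special tokens directly. A minimal sketch, assuming a local Llama 3 Instruct GGUF (the path is a placeholder):

```python
from llama_cpp import Llama

# Placeholder path to a Llama 3 Instruct GGUF
llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", verbose=False)

# special=True keeps the special tokens intact instead of splitting them
print(llm.tokenize(b"<|end_of_text|>", add_bos=False, special=True))  # [128001]
print(llm.tokenize(b"<|eot_id|>", add_bos=False, special=True))       # [128009]
```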

Even aside from the current issue with Llama 3 models specifically, I think this could be a very useful feature.

Describe the solution you'd like

It would be nice if I could specify stop=[128001, 128009] or similar so that the generation ends when either token is generated, not only "<|end_of_text|>".

Describe alternatives you've considered

I have tried to specify stop=['<|eot_id|>'] to add the stop token, but this doesn't work:

Hello there! I'm Llama 3, nice to meet you! Is there something I can help you with or would you like to chat about something in particular?assistant

Not much, just saying hi! It's nice to have someone to talk to. Do you have any fun plans or activities coming up?assistant

[...]
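For reference, a minimal sketch of the kind of call that produces the output above (the model path and prompt are placeholders):

```python
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")  # placeholder path

# Pre-formatted Llama 3 Instruct prompt (placeholder conversation)
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

# As described above, generation does not stop at "<|eot_id|>" even though
# it is passed as a stop string
output = llm.create_completion(prompt, max_tokens=256, stop=["<|eot_id|>"])
print(output["choices"][0]["text"])
```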

Thank you @abetlen for your time and all your hard work. Let me know if there's anything I can do to help. :)

@ddh0 (Contributor, Author) commented Apr 18, 2024

For reference, I am using the chat template described here, which seems to be working perfectly other than the stopping issue. The official JSON files are under meta-llama/Meta-Llama-3-70B-Instruct.

@etemiz commented Apr 19, 2024

My issue seems to be both stopping and repetitive sentences.

@ddh0 (Contributor, Author) commented Apr 19, 2024

Meta has updated their repos to specify that both stop tokens should be used as EOS: ggerganov/llama.cpp#6745 (comment)

@etemiz commented Apr 19, 2024

Apparently I was using a non-instruct version!
I am now having success with https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF because they set tokenizer.ggml.eos_token_id to 128009 in the GGUF file.
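A quick way to check which EOS token ID a given GGUF declares, in case it helps others (a minimal sketch; the path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-70B-Instruct.Q4_K_M.gguf", verbose=False)  # placeholder

# The GGUF key/value metadata is exposed as a dict of strings
print(llm.metadata.get("tokenizer.ggml.eos_token_id"))  # "128009" in the NousResearch GGUF
print(llm.token_eos())                                  # EOS token ID the loaded model reports
```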

@ddh0 (Contributor, Author) commented Apr 19, 2024

But you shouldn't have to hack the metadata to get the model to work as intended... And in any case, I think this could be a useful feature in general, not just for Llama 3.

@abetlen added the enhancement (New feature or request) label on Apr 20, 2024

@abetlen (Owner) commented Apr 20, 2024

@ddh0 thanks for reporting, and sorry I haven't had a chance to get to this until now (I just loaded up Llama 3 and ran into the same issue). I think there are a couple of things that need to be updated here:

  • first, we need to be able to accept multiple EOS tokens from the GGUF metadata (a straightforward check after deserialising the JSON)
  • second, we need a way to stop on token IDs as well as strings. I would prefer to use StoppingCriteria for this rather than expanding the scope of the stop argument. I'm going to update ChatFormatterResponse in llama_chat_format to have an optional stopping_criteria property, which will be set by the Jinja2ChatFormatter.

Should be resolved shortly.

@abetlen (Owner) commented Apr 20, 2024

Yup that works with the NousResearch repo!

(screen recording: 2024-04-19.23-46-25.mp4)

@abetlen (Owner) commented Apr 20, 2024

Kk, I've implemented 2. from above and published it in v0.2.63; this should fix Llama 3 Instruct when using the chat format from the GGUF metadata. Tested, and it works with the NousResearch quantization that specifies 128009 as the stop token ID.

I'll implement 1. as well to add support for multiple stop token IDs if anyone can link a GGUF file with that metadata.
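With v0.2.63 and a GGUF that carries both the chat template and the correct EOS token ID in its metadata, a plain chat call should now stop correctly; a minimal sketch (the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")  # placeholder path

# chat_format is left unset so the handler is built from the GGUF metadata,
# which now includes the stop token handling added in v0.2.63
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])
```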

@etemiz commented Apr 20, 2024

NousResearch was already working. The ones that don't work are: MaziyarPanahi and LoneStriker.

@ddh0 (Contributor, Author) commented Apr 20, 2024

> I'll implement 1. as well to add support for multiple stop token IDs if anyone can link a GGUF file with that metadata.

If I understand correctly, the llama.cpp folks haven't decided exactly how to support multiple EOS tokens in GGUF metadata.

> second, we need a way to stop on token IDs as well as strings. I would prefer to use StoppingCriteria for this rather than expanding the scope of the stop argument. I'm going to update ChatFormatterResponse in llama_chat_format to have an optional stopping_criteria property, which will be set by the Jinja2ChatFormatter.

Would this require me to switch from Llama.create_completion() to one of the chat methods instead? Or would there be a way to specify stop tokens in create_completion? Currently I just pass my pre-formatted prompt to create_completion with the necessary stop sequences.

@abetlen (Owner) commented Apr 20, 2024

@etemiz yes, sorry, I was mistaken; it looks like stop is sufficient if the correct EOS token is specified in the GGUF. While both are valid, the 128009 token is the one the model actually uses in practice to end a conversation turn. I think the change should be made in the GGUF file or specified in a custom chat handler.

The following should work to override the chat format from the GGUFs.

```python
from llama_cpp.llama_chat_format import Jinja2ChatFormatter

model.chat_handler = Jinja2ChatFormatter(
    template=model.metadata["tokenizer.chat_template"],
    bos_token=model.bos_token(),
    eos_token=model.eos_token(),
    stop_token_ids=[128001, 128009],  # <|end_of_text|> and <|eot_id|>
).to_chat_handler()
```
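Once the handler is set, chat calls go through it as usual; a minimal sketch of the follow-up usage:

```python
response = model.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])
```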

@ddh0 no, you can still use create_completion with stopping_criteria instead of stop, like this:

```python
import llama_cpp

# Stop on the model's reported EOS token as well as the EOS token ID
# declared in the GGUF metadata
token_ids = {
    model.token_eos(),
    int(model.metadata["tokenizer.ggml.eos_token_id"]),
}

def stop_on_token_ids(tokens, *args, **kwargs):
    # Stop as soon as the most recently sampled token is one of the stop IDs
    return tokens[-1] in token_ids

stopping_criteria = llama_cpp.StoppingCriteriaList([stop_on_token_ids])

model.create_completion(prompt=prompt, stopping_criteria=stopping_criteria)
```
