Spaces are not being added after added tokens when `legacy: true` is used #7094
Comments
I think the MistralAI models behave differently compared to ChatML-based models. This note is from the mistralai 8x7B model card on HF (the pseudo-code is truncated here):

> `def tokenize(text): ... [BOS_ID] + ...`
>
> In the pseudo-code above, note that the tokenize method should not add a BOS or EOS token automatically, but should add a prefix space. In the Transformers library, one can use chat templates which make sure the right format is applied.

The problem I found is that the Jinja chat templates of even the official mistralai models on HF do not seem to correspond to the above note. However, when I implemented the tokenization exactly as described above through a patched llama.cpp, where I take complete control of both the chat template definitions and whether the space prefix is added in the `llama_tokenize()` call, Mistral 8x7B behaved noticeably differently on a couple of short test prompts (it got rid of a leading `:` in one response that should not have been there).

The Dolphin 2.8 Transformers tokenization can be matched using these template definitions, with the spaces manually added:

```
# ChatML with special tokens and inserted spaces to match transformers tokenizer spaces
[1715004344] input prefix: '<|im_start|> user
```

Most likely Llama 3 Instruct from Meta does not have these spaces (guessing), but I am not sure whether the new Dolphin 2.9 and Hermes 2 Pro ChatML-based Llama 3 fine-tunes coming out still have them or not.

I think the bottom line is that exactly matching instruct-tune templates is always going to be hit or miss unless the model creators define text-in -> tokens-out test cases for one full turn in their documentation, exactly as you have done in your issue note. I am still not sure about Mistral 7B 0.1, 0.2, 8x22B, etc., whether they change this space behaviour from fine-tune to fine-tune; it seems hard to reverse-engineer from the model itself except by testing with and without spaces in various places and empirically determining which works best.
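For reference, here is a minimal sketch of the manual tokenization scheme the note describes, using Transformers' `encode` with `add_special_tokens=False`. The model id and messages are illustrative placeholders, not taken from the issue:

```python
# Sketch of the Mistral note's scheme: tokenize() adds no BOS/EOS itself,
# and the special tokens are spliced in as raw ids around each turn.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

def tokenize(text):
    # No automatic BOS/EOS; the tokenizer's prefix-space handling should
    # still supply the leading space the note asks for.
    return tok.encode(text, add_special_tokens=False)

user_message = "What is your favourite condiment?"
bot_message = "Well, I'm quite partial to fresh lemon juice."

ids = (
    [tok.bos_token_id]
    + tokenize("[INST] " + user_message + " [/INST]")
    + tokenize(bot_message)
    + [tok.eos_token_id]
)
print(tok.convert_ids_to_tokens(ids))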
This issue still exists.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I think LLaMa-1, LLaMa-2, Mistral-v0.1, Mistral-v0.2, Solar (which is based on Mistral-v0.1), and probably a few others all use `"legacy": true`.

Trainers like Axolotl use Transformers to tokenize datasets for training, and if this setting is `true` it will add a space after special/added tokens. A bit weird in my opinion, but that's probably why they consider it legacy. Weirdness aside, this all seems fine as long as inference tokenization matches training tokenization, which happens with anything that uses Transformers, but doesn't seem to be the case with llama.cpp (see the sketch at the end of this post).

LLaMa-3 has no mention of `legacy` in its `tokenizer_config.json`, so it likely no longer follows this behaviour, and llama.cpp won't need any changes in this regard and works as-is.

I used the latest KoboldCPP here because I can't figure out how to tokenize with llama.cpp, since I only ever use KoboldCPP and wanted to open this issue sooner rather than later. I assume they tokenize the same.
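To make the difference concrete, here is a hedged sketch of the behaviour via the slow `LlamaTokenizer` and its `legacy` kwarg. The model id is just one example of a tokenizer whose config sets `"legacy": true`, and the commented outputs are what I would expect rather than captured output:

```python
# Sketch of the legacy space behaviour: with legacy=True, the text segment
# following a special/added token picks up a SentencePiece prefix space.
from transformers import LlamaTokenizer

repo = "mistralai/Mistral-7B-v0.1"  # example of a config with "legacy": true
legacy = LlamaTokenizer.from_pretrained(repo, legacy=True)
fixed = LlamaTokenizer.from_pretrained(repo, legacy=False)

text = "<s>user"  # a special token immediately followed by plain text
print(legacy.tokenize(text))  # e.g. ['<s>', '▁user'] -- space inserted after the special token
print(fixed.tokenize(text))   # e.g. ['<s>', 'user']  -- no inserted space
```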
@ehartford pinging you here since I used your model to test and figured you would want to know about this behaviour.