Different tokenization than AutoTokenizer when word is adjacent to non-special added token #7049
Comments
I'm able to reproduce your results with a different model.
@Jeximo thanks! Updating issue title to reflect this.
I have also seen this issue: passing HF-tokenized text into llama.cpp gives good perplexity, but llama.cpp-tokenized text gives bad perplexity.
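One way to quantify such a comparison (a sketch, not the setup used above): score the same text under both tokenizations with the same model and compare perplexities. Here `hf_ids` and `llamacpp_ids` are placeholders for token-id lists produced by AutoTokenizer and by llama.cpp for the same text, and the repo id is assumed from the linked quantization.

```python
import math
import torch
from transformers import AutoModelForCausalLM

# Sketch only: repo id assumed; any causal LM works for the method.
model = AutoModelForCausalLM.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
model.eval()

def perplexity(ids: list[int]) -> float:
    input_ids = torch.tensor([ids])
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean NLL over the sequence
    return math.exp(loss.item())

# `hf_ids` / `llamacpp_ids` are placeholders for the two tokenizations.
print("HF tokenization:       ", perplexity(hf_ids))
print("llama.cpp tokenization:", perplexity(llamacpp_ids))
```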
There is logic in llama.cpp, lines 12690 to 12695 at commit 947d3ad, that handles this case. That's why the word is tokenized differently when it directly follows the added token.
Sounds reasonable. Is it a project goal to be compatible with Transformers, or is it acceptable that llama.cpp behaves differently? As far as I can tell, it tokenizes the string as [32001, 2188] regardless. I'll just close the issue if it's not a problem that llama.cpp behaves differently than AutoTokenizer.
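For reference, the AutoTokenizer side is easy to check directly (a sketch; the Hub repo id is assumed to be the base model of the linked GGUF):

```python
from transformers import AutoTokenizer

# Repo id assumed from the linked quantization.
tok = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

# Compare the word on its own, adjacent to the added token, and after a space.
print(tok.encode("user", add_special_tokens=False))
print(tok.encode("<|im_start|>user", add_special_tokens=False))
print(tok.encode("<|im_start|> user", add_special_tokens=False))
```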
This is a valid question and one I'm keen to have answered.
It's probably a good idea to stick to Transformers' behaviour, so if there is a suggestion for how to fix our logic, we can merge it. I'm not 100% sure, though; I think this might be one more instance of the added-token handling discussed in #7144.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Original issue description:

llama.cpp commit: 6ecf318 (current master)
llama.cpp example server with https://huggingface.co/RichardErkhov/NousResearch_-_Nous-Hermes-2-Mixtral-8x7B-DPO-gguf/blob/main/Nous-Hermes-2-Mixtral-8x7B-DPO.Q3_K_M.gguf
llama.cpp changes the tokenization of the word "user" when it comes directly after the added token <|im_start|>; AutoTokenizer does not.
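A minimal reproduction of the comparison (a sketch, assuming the example server is running locally on its default port and that the matching HF repo id is the base model of the linked GGUF):

```python
import requests
from transformers import AutoTokenizer

text = "<|im_start|>user"

# Tokenize with the running llama.cpp example server.
r = requests.post("http://localhost:8080/tokenize", json={"content": text})
print("llama.cpp:    ", r.json()["tokens"])

# Tokenize with AutoTokenizer for the corresponding HF model (repo id assumed).
tok = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")
print("AutoTokenizer:", tok.encode(text, add_special_tokens=False))
```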
Could it be relevant that <|im_start|> is not a special token in this model? (See its tokenizer_config.json.)
A different model where <|im_start|> is a special token does not show this behaviour. (See that model's tokenizer_config.json.)
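One way to check the flag directly is to inspect added_tokens_decoder in a downloaded tokenizer_config.json (a sketch; the local file path is a placeholder):

```python
import json

# Path is a placeholder for a locally downloaded tokenizer_config.json.
with open("tokenizer_config.json") as f:
    cfg = json.load(f)

# added_tokens_decoder maps token ids to metadata, including the "special" flag.
for tok_id, info in cfg.get("added_tokens_decoder", {}).items():
    if info.get("content") == "<|im_start|>":
        print(f"id={tok_id} special={info.get('special')}")
```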