added_tokens with byte-map characters in ByteLevel could not be decoded correctly #1392
Comments
Hey! Thanks for the report. AddedTokens usually have to be sent to the decoder, because pre-tokenization is applied to them.
Is it possible to let added_tokens_map_r of AddedVocabulary store the mapping of id to tokens after pre-tokenization, so that it generates the same output after the decoder?
It's not possible, no; normalizers are used for that purpose, however. And you can also add the token like this:
Ah, I see. Thanks for your explanation. Is there any plan to fix this bug?
This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
A PR for a fix that is backward compatible is welcome! Otherwise I won't have time to dive into this 🤗
meta-llama/llama3#67 (comment) TL;DR, this should help:
>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False, False).pre_tokenize_str("Bác")
[('BÃ¡c', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False, special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác'
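The byte-level behaviour at the heart of this thread can be illustrated without the library. The sketch below is a plain-Python re-implementation for illustration (not the tokenizers API): it builds the GPT-2 byte-to-unicode table that ByteLevel uses, and shows why an added token that skips pre-tokenization gets corrupted when it is pushed through the byte-level decoder anyway.

```python
def bytes_to_unicode():
    """GPT-2 style byte -> unicode-character table used by ByteLevel."""
    # Printable/Latin-1 bytes map to themselves; the rest are shifted above U+0100.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def byte_level_encode(text: str) -> str:
    """What the ByteLevel pre-tokenizer does to ordinary text before BPE."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def byte_level_decode(text: str) -> str:
    """What the ByteLevel decoder does to every token string it receives."""
    return bytes(CHAR_TO_BYTE[c] for c in text).decode("utf-8", errors="replace")

print(byte_level_encode("Bác"))                      # the form BPE actually sees
print(byte_level_decode(byte_level_encode("Bác")))   # round-trip is lossless
print(byte_level_decode("Bác"))                      # raw added token: corrupted
```

Decoding the pre-tokenized form round-trips cleanly, which is why the decoder assumes every token string it sees has already gone through ByteLevel; an added token stored verbatim, as in this issue, violates that assumption and its multi-byte characters come back as replacement characters.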
Re-opening as the merge on main will be reverted for a better fix soon |
I just found that added tokens containing characters that appear in the byte map of the ByteLevel pre-tokenizer could not be decoded correctly.
This is a script to reproduce the problem with version 0.14.1:
The output will be:
I believe the problem comes from:
https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/tokenizer/mod.rs#L832-L836
I don't think added tokens should be sent to the ByteLevel decoder, since they are extracted before pre-tokenization.
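The behaviour the report asks for can be sketched in minimal Python pseudocode (all names here are hypothetical; the real logic is in Rust in `tokenizer/mod.rs`): decode added tokens verbatim, and only run the byte-level decoder over runs of ordinary tokens.

```python
def decode(ids, id_to_token, added_ids, byte_level_decode):
    """Concatenate tokens, bypassing the byte-level decoder for added tokens."""
    out, run = [], []
    for i in ids:
        tok = id_to_token[i]
        if i in added_ids:
            # Flush the pending run of ordinary tokens through the decoder.
            if run:
                out.append(byte_level_decode("".join(run)))
                run = []
            # Added tokens are extracted before pre-tokenization,
            # so their stored string is already plain text.
            out.append(tok)
        else:
            run.append(tok)
    if run:
        out.append(byte_level_decode("".join(run)))
    return "".join(out)
```

With this split, the added token "Bác" survives decoding unchanged, while ordinary tokens still get their byte-map characters converted back to real bytes.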