⚠️⚠️[T5Tokenize] Fix T5 family tokenizers⚠️⚠️ #24565
Conversation
The documentation is not available anymore as the PR was closed or merged.
This can also be made non-breaking with a flag. Up for debate, since it is a bug fix.
Thanks for the fix! Let's roll with it since it's a bug fix and if people complain about the breaking change we will see if we add a flag to enable the buggy behavior.
Co-authored-by: Sylvain Gugger <[email protected]>
Edit: just to make sure, I did more testing, and unfortunately there is one bug:

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']
```

instead of:

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']
```

This is because we have to prepend a
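The spurious `'▁'` in the buggy output above can be reproduced with a toy model of the pipeline. This is a hypothetical sketch (`toy_sp_tokenize` and `legacy_tokenize` are illustrative names, not transformers APIs): special tokens are split out first, and a stand-in for SentencePiece's `add_dummy_prefix` behaviour then turns every space, including the one left dangling next to the special token, into a `'▁'` marker. The token order differs from the real tokenizer, but the failure mode (an extra lone `'▁'`) is the same.

```python
import re

MARKER = "\u2581"  # '▁', SentencePiece's word-boundary marker

def toy_sp_tokenize(text):
    # Stand-in for SentencePiece with add_dummy_prefix=True: a space is
    # prepended, then every space becomes a '▁' boundary marker.
    pieces = (" " + text).replace(" ", " " + MARKER).split(" ")
    return [p for p in pieces if p]

def legacy_tokenize(text):
    # Split on special tokens first, then run the toy model on each
    # plain-text chunk — mirroring the buggy order of operations.
    out = []
    for chunk in re.split(r"(<extra_id_\d+>)", text):
        if chunk.startswith("<extra_id_"):
            out.append(chunk)
        elif chunk:
            out.extend(toy_sp_tokenize(chunk))
    return out

# The space before the special token survives as a lone '▁' token:
print(legacy_tokenize("Hello <extra_id_0>"))  # ['▁Hello', '▁', '<extra_id_0>']
```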
I'm getting this legacy behaviour warning come up when simply loading a T5 tokenizer — it appears even before using the tokenizer. Is there an updated way to load the tokenizer? The warning appears when running the following lines of code:

```python
from transformers import AutoTokenizer
```

The error is:
Yep, just set `legacy=False`.
What does this PR do?
Fixes the `T5Tokenizer` (not the fast one yet). At the same time, this addresses part of #11531.

When converting `UMT5`, I created a reproduction snippet for any t5x model from the original repo. I realized that a very, very small variation in the input completely changes the output for non-finetuned models. The issue lies with the way we process `<extra_id_xx>`.

Example:

The reason is that t5x wraps around `sentencepiece` and adds the extra ids to the vocab, but they are not saved that way. We don't add them to the vocab, so when we tokenize, we split on special tokens, and thus the sentencepiece model only sees:
While the original model never sees a `.` (or a lot of other characters) alone, and thus we add an extra space... This is a bug fix with regards to training; it is breaking in the sense that it should remove the space.

TODO:

- `tokenizer.encode(". Hello")`, as it removes the prefix space that is normally added.
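The prefix space the TODO refers to comes from SentencePiece's dummy-prefix behaviour, which can be illustrated with a toy sketch (`add_dummy_prefix` here is an illustrative stand-in, not the real SentencePiece call): a single space is conceptually prepended before tokenization, so every leading word — including a lone `"."` — gets a `'▁'` marker. A fix that strips that marker unconditionally would therefore also drop the prefix space that `". Hello"` is supposed to keep.

```python
MARKER = "\u2581"  # '▁', SentencePiece's word-boundary marker

def add_dummy_prefix(text):
    # Toy model: prepend one space, so each whitespace-separated piece
    # starts with the '▁' boundary marker.
    return [MARKER + w for w in text.split()]

print(add_dummy_prefix(". Hello"))  # ['▁.', '▁Hello']
```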