How to split special tokens in encode? #1391
Hey! This is currently not supported, I'll be working on this for the next release!
Thank you for your reply. But I also found that when I use tokenizer.add_tokens(), not tokenizer.add_special_tokens(), the added tokens still cannot be split. Why is this?
It's just that when you add a token, whether it's special or not, it will not be split. The `normalized` flag is what controls whether the token is first normalized. If your normalizer adds a…
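For illustration, a minimal sketch of the behaviour described above, assuming a tokenizer.json exported by this library (the file name and sample text are placeholders, not from the original thread):

```python
from tokenizers import Tokenizer, AddedToken

# Illustrative file name; any tokenizer.json produced by the library works.
tokenizer = Tokenizer.from_file("tokenizer.json")

# Registered via add_tokens(), not add_special_tokens(). The `normalized`
# flag only decides whether the input is normalized before the added token
# is matched; it does not make the token splittable.
tokenizer.add_tokens([AddedToken("[gMASK]", normalized=True, single_word=False)])

encoding = tokenizer.encode("some text [gMASK] more text")
print(encoding.tokens)  # "[gMASK]" shows up as a single, unsplit token
```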
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
#1419 will fix this, waiting to close!
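For context, the user-facing knob on the transformers side is the `split_special_tokens` argument. A minimal sketch, assuming a transformers version where the flag is supported for the tokenizer in use (the model name and outputs are only examples, and the exact API that ships with the fix may differ):

```python
from transformers import AutoTokenizer

# Default behaviour: special/added tokens are matched as single units.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("hello [SEP] world"))
# e.g. ['hello', '[SEP]', 'world']

# With split_special_tokens=True, "[SEP]" is tokenized like ordinary text.
tok_split = AutoTokenizer.from_pretrained("bert-base-uncased", split_special_tokens=True)
print(tok_split.tokenize("hello [SEP] world"))
# e.g. ['hello', '[', 'sep', ']', 'world']
```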
I have converted a slow tokenizer into PreTrainedTokenizerFast and got a tokenizer.json file. But I found that this tokenizer did not split special tokens. Here is my add_special_tokens call:
```python
tokenizer.add_special_tokens(
    [
        AddedToken("[gMASK]", normalized=True, single_word=False),
        AddedToken("sop", normalized=True, single_word=False),
    ]
)
```
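As a quick sanity check (the file path and sample string below are illustrative, not from the original report), the converted tokenizer can be loaded from tokenizer.json and asked to tokenize text containing the added tokens; if they come back as single pieces, they are being matched atomically rather than split:

```python
from transformers import PreTrainedTokenizerFast

# Load the exported tokenizer.json directly into a fast tokenizer.
fast_tok = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

print(fast_tok.tokenize("question sop answer [gMASK]"))
# If "sop" and "[gMASK]" each appear as one token, the added tokens
# are not being split by the underlying model.
```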