
How to split special token in encode? #1391

Closed
leizhao1234 opened this issue Nov 15, 2023 · 5 comments · Fixed by #1437

Comments

@leizhao1234

I have converted a slow tokenizer into a PreTrainedTokenizerFast and got a tokenizer.json file. But I found that this tokenizer does not split special tokens. Here are my added tokens in tokenizer.json:

tokenizer.add_special_tokens(
    [
        AddedToken("[gMASK]", normalized=True, single_word=False),
        AddedToken("sop", normalized=True, single_word=False),
    ]
)
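The behavior being asked about can be reproduced with a small self-contained sketch (the word-level vocab below is an assumption for illustration, not the issue author's actual model): once a token is added, it is extracted from the input before tokenization and therefore never split.

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy word-level vocab -- assumed for illustration only.
vocab = {"[": 0, "gMASK": 1, "]": 2, "hello": 3, "[UNK]": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Before adding: "[gMASK]" is plain text, so the pre-tokenizer splits it
before = tok.encode("[gMASK] hello").tokens  # ["[", "gMASK", "]", "hello"]

tok.add_special_tokens([AddedToken("[gMASK]", normalized=True, single_word=False)])

# After adding: the token is extracted before tokenization, never split
after = tok.encode("[gMASK] hello").tokens  # ["[gMASK]", "hello"]
print(before, after)
```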

@ArthurZucker
Collaborator

Hey! This is currently not supported; I'll be working on it for the next release!
This will be either:

  • a flag that can be passed to encode, which will skip normalization
  • a normalizer that splits the tokens

The former is probably the simplest and makes the most sense imo!

@leizhao1234
Author

Thank you for your reply. But I also found that when I use tokenizer.add_tokens(), not tokenizer.add_special_tokens(), the added tokens still aren't split. Why is this?

@ArthurZucker
Collaborator

It's just that when you add a token, whether it's special or not, it will not be split. The normalized flag on AddedToken controls whether the token is normalized before matching. If your normalizer adds a _ before the token, for example, then <s> will be split, because the stored representation of the token will be _<s> and will not match the input.
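The _<s> example above can be made concrete with a toy setup (the Prepend("_") normalizer and word-level vocab are assumptions chosen to mirror the explanation): with normalized=False the token matches the raw text and survives whole; with normalized=True its stored form becomes _<s>, never matches, and the text falls through to the model.

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel
from tokenizers.normalizers import Prepend
from tokenizers.pre_tokenizers import Whitespace

# Toy setup, assumed for illustration: a normalizer that prepends "_".
vocab = {"_hello": 0, "<": 1, "s": 2, ">": 3, "[UNK]": 4}

def make_tok(normalized: bool) -> Tokenizer:
    tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
    tok.normalizer = Prepend("_")
    tok.pre_tokenizer = Whitespace()
    # add_tokens (not add_special_tokens) -- the token is still atomic
    tok.add_tokens([AddedToken("<s>", normalized=normalized)])
    return tok

# normalized=False: matched against the raw text, so "<s>" survives whole
kept = make_tok(False).encode("hello <s>").tokens

# normalized=True: the token's stored form becomes "_<s>", which never
# occurs in the normalized text "_hello <s>", so "<s>" is split
split = make_tok(True).encode("hello <s>").tokens
print(kept, split)
```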

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 16, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 21, 2023
@ArthurZucker ArthurZucker reopened this Jan 3, 2024
@ArthurZucker
Copy link
Collaborator

#1419 will fix this; waiting to close!
