
How to split special token in encode? #1391

Closed
leizhao1234 opened this issue Nov 15, 2023 · 5 comments · Fixed by #1437

Comments

@leizhao1234

I have converted a slow tokenizer into a PreTrainedTokenizerFast and got a tokenizer.json file. But I found that this tokenizer does not split special tokens. Here are my added tokens in tokenizer.json:

tokenizer.add_special_tokens(
    [
        AddedToken("[gMASK]", normalized=True, single_word=False),
        AddedToken("sop", normalized=True, single_word=False),
    ]
)
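The behavior being asked about can be reproduced with a small self-contained sketch (the word-level vocab below is an assumption for illustration, not the issue author's actual model): once a token is added, it is extracted from the input before tokenization and therefore never split.

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy word-level vocab -- assumed for illustration only.
vocab = {"[": 0, "gMASK": 1, "]": 2, "hello": 3, "[UNK]": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Before adding: "[gMASK]" is plain text, so the pre-tokenizer splits it
before = tok.encode("[gMASK] hello").tokens  # ["[", "gMASK", "]", "hello"]

tok.add_special_tokens([AddedToken("[gMASK]", normalized=True, single_word=False)])

# After adding: the token is extracted before tokenization, never split
after = tok.encode("[gMASK] hello").tokens  # ["[gMASK]", "hello"]
print(before, after)
```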

@ArthurZucker
Collaborator

Hey! This is currently not supported; I'll be working on it for the next release!
This will be either:

  • a flag that can be passed to encode, which will skip normalization
  • a normalizer that splits the tokens

The former is probably the simplest and makes the most sense imo!

@leizhao1234
Author

Thank you for your reply. But I also found that when I use tokenizer.add_tokens(), not tokenizer.add_special_tokens(), the added tokens still aren't split. Why is this?

@ArthurZucker
Collaborator

It's just that when you add a token, whether it's special or not, it will not be split. The normalized flag on AddedToken controls whether the token is normalized before matching. If your normalizer adds a _ before the token, for example, then <s> will be split, because the stored representation of the token will be _<s> and will not match the input.
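The _<s> example above can be made concrete with a toy setup (the Prepend("_") normalizer and word-level vocab are assumptions chosen to mirror the explanation): with normalized=False the token matches the raw text and survives whole; with normalized=True its stored form becomes _<s>, never matches, and the text falls through to the model.

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel
from tokenizers.normalizers import Prepend
from tokenizers.pre_tokenizers import Whitespace

# Toy setup, assumed for illustration: a normalizer that prepends "_".
vocab = {"_hello": 0, "<": 1, "s": 2, ">": 3, "[UNK]": 4}

def make_tok(normalized: bool) -> Tokenizer:
    tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
    tok.normalizer = Prepend("_")
    tok.pre_tokenizer = Whitespace()
    # add_tokens (not add_special_tokens) -- the token is still atomic
    tok.add_tokens([AddedToken("<s>", normalized=normalized)])
    return tok

# normalized=False: matched against the raw text, so "<s>" survives whole
kept = make_tok(False).encode("hello <s>").tokens

# normalized=True: the token's stored form becomes "_<s>", which never
# occurs in the normalized text "_hello <s>", so "<s>" is split
split = make_tok(True).encode("hello <s>").tokens
print(kept, split)
```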

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 16, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 21, 2023
@ArthurZucker ArthurZucker reopened this Jan 3, 2024
@ArthurZucker
Copy link
Collaborator

#1419 will fix this; waiting to close!
