-
Notifications
You must be signed in to change notification settings - Fork 27.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🚨🚨 🚨🚨 [Tokenizer
] attemp to fix add_token issues🚨🚨 🚨🚨
#23909
Conversation
The documentation is not available anymore as the PR was closed or merged. |
TODO: the test should be for all the |
token = AddedToken(new_tokens[i].content.lower(), single_word = new_tokens[i].single_word, lstrip = new_tokens[i].lstrip, rstrip = new_tokens[i].rstrip, normalized = new_tokens[i].normalized) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This highlights how bad the API is:
- you can't modify the content of the
AddedToken
- if you only add the content to the unique no split:
- when decoding you will have a problem,since decoding uses
_additional_special_tokens
- when encoding, the
rstrip
andlstrip
etc logic will be ignored
- when decoding you will have a problem,since decoding uses
- Adding a token does not necessarily update the
unique_no_split
etc
Is work on this still going on? I'm interested in working on tokenization. |
It is! Had to focus on a new model for a while but this is close to being over |
Cool! So, I think the best way to do this is that I fork your branch, and then do PRs there? Any changes you like would then propagate to this PR? Is that the preferred way of collaborating? |
Haha sorry what I meant is that I’ll take care of this one probably today, tomorrow so no need to dive ! |
Ah ok, my bad! I totally misunderstood. Good luck with the PR 😸 |
…to fix-add-tokens
…23909) * fix test for bart. Order is correct now let's skip BPEs * ouf * styling * fix bert.... * slow refactoring * current updates * massive refactoring * update * NICE! * update to see where I am at * updates * update * update * revert * updates * updates * start supporting legacy_save * styling * big update * revert some changes * nits * nniiiiiice * small fixes * kinda fix t5 with new behaviour * major update * fixup * fix copies * today's updates * fix byt5 * upfate * update * update * updates * update vocab size test * Barthez does not use not need the fairseq offset ids * super calll must be after * calll super * move all super init * move other super init * fixup * nits * more fixes * nits * more fixes * nits * more fix * remove useless files * ouch all of them are affected * and more! * small imporvements * no more sanitize token * more changes around unique no split tokens * partially fix more things * keep legacy save but add warning * so... more fixes * updates * guess deberta tokenizer could be nuked * fixup * fixup did some bad things * nuke it if it breaks * remove prints and pretrain fast from slow with new format. * fixups * Apply suggestions from code review Co-authored-by: Sylvain Gugger <[email protected]> * fiou * nit * by default specials should not be normalized? * update * remove brakpoint * updates * a lot of updates * fixup * fixes revert some changes to match fast * small nits * that makes it cleaner * fix camembert accordingly * update * some lest breaking changes * update * fixup * fix byt5 and whisper mostly * some more fixes, canine's byte vocab * fix gpt2 * fix most of the perceiver tests (4 left) * fix layout lmv3 * fixup * fix copies for gpt2 style * make sure to only warn once * fix perciever and gpt2 tests * some more backward compatibility: also read special tokens map because some ppl use it........////..... * fixup * add else when reading * nits * fresh updates * fix copies * will this make everything faster? * fixes * more fixes * update * more fixes * fixup * is the source of truth right? * sorry camembert for the troubles * current updates * fixup * update led * update * fix regression * fix single word * more model specific fixes * fix t5 tests * fixup * more comments * update * fix nllb * rstrip removed * small fixes * better handle additional_special_tokens and vocab sizes * fixing * styling * fix 4 / 21 * fixup * fix nlbb's tests * some fixes * fix t5 * fixes * style * fix canine tests * damn this is nice * nits * m2m100 nit * fixups * fixes! * fixup * stash * fix merge * revert bad change * fixup * correct order for code Llama * fix speecht5 post merge * styling * revert source of 11 fails * small nits * all changes in one go * fnet hack * fix 2 more tests * update based on main branch of tokenizers * fixup * fix VITS issues * more fixes * fix mgp test * fix camembert issues * oups camembert still has 2 failing tests * mluke fixes * decode fixes * small nits * nits * fix llama and vits * fix camembert * smal nits * more fixes when initialising a fast from a slow and etc * fix one of the last test * fix CPM tokenizer test * fixups * fix pop2piano * fixup *⚠️ Change tokenizers required version⚠️ *⚠️ Change tokenizers required version⚠️ * "tokenizers>=0.14,<0.15", don't forget smaller than * fix musicgen tests and pretraiendtokenizerfast * fix owlvit and all * update t5 * fix 800 red * fix tests * fix the fix of the fix of t5 * styling * documentation nits * cache _added_tokens_encoder * fixups * Nit * fix red tests * one last nit! * make eveything a lot simpler * Now it's over 😉 * few small nits * Apply suggestions from code review Co-authored-by: amyeroberts <[email protected]> * updates that work for now * tests that should no be skipped / changed and fixed next * fixup * i am ashamed * pushe the fix * update * fixups * nits * fix added_tokens_encoder * fix canine test * fix pegasus vocab * fix transfoXL * fixup * whisper needs to be fixed for train new * pegasus nits * more pegasus fixes * minor update * better error message in failed test * fix whisper failing test * fix whisper failing test * fix pegasus * fixup * fix **** pegasus * reset things * remove another file * attempts to fix the strange custome encoder and offset * nits here and there * update * fixup * nit * fix the whisper test * nits nits * Apply suggestions from code review Co-authored-by: amyeroberts <[email protected]> * updates based on review * some small update to potentially remove * nits * import rlu cache * Update src/transformers/tokenization_utils_base.py Co-authored-by: Lysandre Debut <[email protected]> * move warning to `from_pretrained` * update tests results now that the special tokens are always added --------- Co-authored-by: Sylvain Gugger <[email protected]> Co-authored-by: amyeroberts <[email protected]> Co-authored-by: Lysandre Debut <[email protected]>
@ArthurZucker I am not sure to understand the whole scope of this PR, but does it means that maintainer of such model have the responsibility to update the In other words, is
? |
No it’s not! It might right now but the goal is to keep forward compatibility. |
v4.34.0 release did a complete refactor of the tokenizer module, see: huggingface/transformers#23909 Something about the difference is causing vila to produce literally billions of lines of log warning messages to Datadog in prod. I don't know if these warnings are meaningful, but they are expensive.
v4.34.0 release did a complete refactor of the tokenizer module, see: huggingface/transformers#23909 Something about the difference is causing vila to produce literally billions of lines of log warning messages to Datadog in prod. I don't know if these warnings are meaningful, but they are expensive. Example logs: https://app.datadoghq.com/logs?query=service%3Avila-v0%20&cols=host%2Cservice&index=%2A&messageDisplay=inline&refresh_mode=paused&stream_sort=desc&viz=stream&from_ts=1697556761689&to_ts=1697557153857&live=false
Hello @ArthurZucker, about my comment above, with the latest release
Do you want me to open an issue? |
Hi @LoicDagnas, @ArthurZucker is off for this week so won't be able to address until then. Yes, please create a separate issue, linking to this one with details on what's happening and a code snippet to reproduce. |
I answered on the issue but it's fixed and part of the latest release! 🤗 |
What does this PR do?
Adresses a lot of issues related to
add_tokens
, also adds more refine testing to make sure this does not happen again.add_tokens
ignores the arguments if the token is anAddedToken
. reported inAddedToken
's argument are ignored when called inadd_tokens
's method of slow tokenizers #20734 and Adding new tokens to various models changes tokenization of adjacent elements in strings #14770,PreTrainedTokenizer
(slow) strip tokens that are aroundunique_no_split_tokens
#21120, T5Tokenizer Fast and Slow give different results with AddedTokens #16334unique_no_split_token
. Reported in LLaMATokenizerFast works abnormally #23818 , [Bug]? how does the tokenizer encode the special tokens? #23851, Adding custom tokens makes the T5Tokenizer always strip spaces #11531 but also skip_special_tokens has different behavior between slow and fast tokenizer #23250. Also linked to Add UDOP #22940 , should allow us to re-factor the way T5 tokenizes the inputs (convert_token_to_ids should not have a bunch of regex for special tokens) (also mT5 additional_special_tokens seems not work #9747)single_word
inslow
. Reported in Adding new tokens to various models changes tokenization of adjacent elements in strings #14770from_pretrained
callsadded_tokens = tokenizer.sanitize_special_tokens()
which is when the tokens are added to no_unique_split. reported in Two tokenizer initialization methods result in inconsistent segmentation results for special words #23930Fixes #20734, fixes #21120, fixes #16334, fixes #23818 , fixes #23851, fixes #11531 , fixes #9747, fixes #23459 , fixes #14770 , fixes #22935, fixes #23930, fixes #23250, fixes #7901, fixes #19873, fixes #25232, fixes #22414,
Spirit of the refactoring
The main idea is that the
PreTrainedTokenizer
's__init__
function is responsible for adding all theadditional_special tokens
,eos_token
, etc and creating thetoken_trie
that will be used for splitting the tokens.All tokens that are added are stored in their
AddedToken
format in theadded_tokens_decoder
which becomes the only way to interact with them. Theadded_tokens_encoder
cannot be modified, it is just a conversion of theadded_tokens_decoder
. The trie is only created based on theadded_tokens_decoder
. One possible addition is to keepunique_no_split tokens
, but I am currently against.All the added token information now lies in the
tokenizer_config.json
. Nuking thespecial_tokens_map.json
andadded_tokens.json
.Support for
lstrip
,rstrip
andsingle_word
is added. This is only possible because we store theAddedTokens
and not only the strings.Information on which tokens were added is also available for the Fast tokenizers. This is just a representation convenience but was not possible before.
add_special_tokens's
replace_additional_special_tokens
argument now works.Remove some of the available surface functions (self.added_tokens_encoder, self.added_tokens_decoder, self.unique_no_split_tokens, etc) and more to come here especial for special tokens,
🚨🚨 Breaking changes 🚨🚨:
unique_no_split_tokens
attribute removed and not used in the internal logicsanitize_special_tokens()
follows a deprecation cycle and does nothingSPECIAL_TOKENS_ATTRIBUTES
are stored asAddedTokens
and no strings.added_tokens
but will correct mistakes in the saved vocabulary if there are any. (And there are a lot in old format tokenizers)max(set(self.get_vocab().keys()))
accounting for holes in the vocab. Thevocab_size
no longer takes into account the added vocab for most of the tokenizers (as it should not). Mostly breaking for T5tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)])
now takes into accountrstrip
,lstrip
,normalized
information.added_tokens_decoder
holdsAddedToken
, notstrings
.add_tokens()
for both fast and slow will always be updated if the token is already part of the vocab, allowing for custom stripping.AddedToken
haslstrip=True
andrstrip=True
fairseq_ids_to_tokens
attribute removed forBarthez
(was not used)