Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚨🚨 🚨🚨 [Tokenizer] attemp to fix add_token issues🚨🚨 🚨🚨 #23909

Merged
merged 268 commits into from
Sep 18, 2023

Conversation

ArthurZucker
Copy link
Collaborator

@ArthurZucker ArthurZucker commented May 31, 2023

What does this PR do?

Adresses a lot of issues related to add_tokens, also adds more refine testing to make sure this does not happen again.

Fixes #20734, fixes #21120, fixes #16334, fixes #23818 , fixes #23851, fixes #11531 , fixes #9747, fixes #23459 , fixes #14770 , fixes #22935, fixes #23930, fixes #23250, fixes #7901, fixes #19873, fixes #25232, fixes #22414,

Spirit of the refactoring

The main idea is that the PreTrainedTokenizer's __init__ function is responsible for adding all the additional_special tokens, eos_token, etc and creating the token_trie that will be used for splitting the tokens.

  • All tokens that are added are stored in their AddedToken format in the added_tokens_decoder which becomes the only way to interact with them. The added_tokens_encoder cannot be modified, it is just a conversion of the added_tokens_decoder. The trie is only created based on the added_tokens_decoder. One possible addition is to keep unique_no_split tokens, but I am currently against.

  • All the added token information now lies in the tokenizer_config.json. Nuking the special_tokens_map.json and added_tokens.json.

  • Support for lstrip, rstrip and single_word is added. This is only possible because we store the AddedTokens and not only the strings.

  • Information on which tokens were added is also available for the Fast tokenizers. This is just a representation convenience but was not possible before.

  • add_special_tokens's replace_additional_special_tokens argument now works.

  • Remove some of the available surface functions (self.added_tokens_encoder, self.added_tokens_decoder, self.unique_no_split_tokens, etc) and more to come here especial for special tokens,

🚨🚨 Breaking changes 🚨🚨:

  • unique_no_split_tokens attribute removed and not used in the internal logic
  • sanitize_special_tokens() follows a deprecation cycle and does nothing
  • All attributes in SPECIAL_TOKENS_ATTRIBUTES are stored as AddedTokens and no strings.
  • loading a slow from a fast or a fast from a slow will no longer raise and error if the tokens added don't have the correct index. This is because they will always be added following the order of the added_tokens but will correct mistakes in the saved vocabulary if there are any. (And there are a lot in old format tokenizers)
  • the length of a tokenizer is now max(set(self.get_vocab().keys())) accounting for holes in the vocab. The vocab_size no longer takes into account the added vocab for most of the tokenizers (as it should not). Mostly breaking for T5
  • Adding a token using tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)]) now takes into account rstrip, lstrip, normalized information.
  • added_tokens_decoder holds AddedToken, not strings.
  • add_tokens() for both fast and slow will always be updated if the token is already part of the vocab, allowing for custom stripping.
  • initializing a tokenizer form scratch will now add missing special tokens to the vocab.
  • stripping is not always done for special tokens! 🚨 Only if the AddedToken has lstrip=True and rstrip=True
  • fairseq_ids_to_tokens attribute removed for Barthez (was not used)

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 31, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker changed the title [Tokenizer] attemp to fix add_token issues [WIP][Tokenizer] attemp to fix add_token issues May 31, 2023
@ArthurZucker
Copy link
Collaborator Author

ArthurZucker commented May 31, 2023

TODO: the test should be for all the TokenMixin rather than comparing slow and fast as the behaviour should work in any case.

Comment on lines 428 to 429
token = AddedToken(new_tokens[i].content.lower(), single_word = new_tokens[i].single_word, lstrip = new_tokens[i].lstrip, rstrip = new_tokens[i].rstrip, normalized = new_tokens[i].normalized)
else:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This highlights how bad the API is:

  • you can't modify the content of the AddedToken
  • if you only add the content to the unique no split:
    • when decoding you will have a problem,since decoding uses _additional_special_tokens
    • when encoding, the rstrip and lstrip etc logic will be ignored
  • Adding a token does not necessarily update the unique_no_split etc

@ArthurZucker ArthurZucker linked an issue Jun 7, 2023 that may be closed by this pull request
4 tasks
@stephantul
Copy link
Contributor

Is work on this still going on? I'm interested in working on tokenization.

@ArthurZucker
Copy link
Collaborator Author

It is! Had to focus on a new model for a while but this is close to being over

@stephantul
Copy link
Contributor

Cool! So, I think the best way to do this is that I fork your branch, and then do PRs there? Any changes you like would then propagate to this PR? Is that the preferred way of collaborating?

@ArthurZucker
Copy link
Collaborator Author

Haha sorry what I meant is that I’ll take care of this one probably today, tomorrow so no need to dive !
Otherwise forking transformers is always the best solution IMO, then add others as remotes.

@stephantul
Copy link
Contributor

Ah ok, my bad! I totally misunderstood. Good luck with the PR 😸

@NielsRogge NielsRogge mentioned this pull request Sep 22, 2023
6 tasks
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
…23909)

* fix test for bart. Order is correct now let's skip BPEs

* ouf

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* upfate

* update

* update

* updates

* update vocab size test

* Barthez does not use not need the fairseq offset ids

* super calll must be after

* calll super

* move all super init

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small imporvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <[email protected]>

* fiou

* nit

* by default specials should not be normalized?

* update

* remove brakpoint

* updates

* a lot of updates

* fixup

* fixes revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some lest breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perciever and gpt2 tests

* some more backward compatibility: also read special tokens map because some ppl use it........////.....

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nlbb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oups camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* smal nits

* more fixes when initialising a fast from a slow and etc

* fix one of the last test

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* ⚠️ Change tokenizers required version ⚠️

* ⚠️ Change tokenizers required version ⚠️

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and pretraiendtokenizerfast

* fix owlvit and all

* update t5

* fix 800 red

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make eveything a lot simpler

* Now it's over 😉

* few small nits

* Apply suggestions from code review

Co-authored-by: amyeroberts <[email protected]>

* updates that work for now

* tests that should no be skipped / changed and fixed next

* fixup

* i am ashamed

* pushe the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custome encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review

Co-authored-by: amyeroberts <[email protected]>

* updates based on review

* some small update to potentially remove

* nits

* import rlu cache

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <[email protected]>

* move warning to `from_pretrained`

* update tests results now that the special tokens are always added

---------

Co-authored-by: Sylvain Gugger <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Lysandre Debut <[email protected]>
YYYasin19 added a commit to YYYasin19/transformers-feedstock that referenced this pull request Oct 7, 2023
@LoicDagnas
Copy link

@ArthurZucker I am not sure to understand the whole scope of this PR, but does it means that maintainer of such model have the responsibility to update the added_token.json file to still have their tokenizer usable with AutoTokenizer.from_pretrained?

In other words, is AutoTokenizer.from_pretrained("iarfmoose/t5-base-question-generator") supposed to fail now here

?

@ArthurZucker
Copy link
Collaborator Author

No it’s not! It might right now but the goal is to keep forward compatibility.
However long term wise authors should update the repo especially if they want to have the same behavior between fast and slow

@ArthurZucker ArthurZucker deleted the fix-add-tokens branch October 9, 2023 21:23
cmwilhelm pushed a commit to allenai/mmda that referenced this pull request Oct 17, 2023
v4.34.0 release did a complete refactor of the tokenizer
module, see:

huggingface/transformers#23909

Something about the difference is causing vila to
produce literally billions of lines of log warning messages
to Datadog in prod. I don't know if these warnings are
meaningful, but they are expensive.
cmwilhelm added a commit to allenai/mmda that referenced this pull request Oct 17, 2023
v4.34.0 release did a complete refactor of the tokenizer module, see:

huggingface/transformers#23909

Something about the difference is causing vila to
produce literally billions of lines of log warning messages to Datadog in prod. 
I don't know if these warnings are meaningful, but they are expensive.

Example logs:
https://app.datadoghq.com/logs?query=service%3Avila-v0%20&cols=host%2Cservice&index=%2A&messageDisplay=inline&refresh_mode=paused&stream_sort=desc&viz=stream&from_ts=1697556761689&to_ts=1697557153857&live=false
jploski referenced this pull request in ggerganov/llama.cpp Oct 22, 2023
@LoicDagnas
Copy link

Hello @ArthurZucker, about my comment above, with the latest release 4.34.1 it doesn't fail anymore in tokenization_t5.py but here

Do you want me to open an issue?

@amyeroberts
Copy link
Collaborator

Hi @LoicDagnas, @ArthurZucker is off for this week so won't be able to address until then. Yes, please create a separate issue, linking to this one with details on what's happening and a code snippet to reproduce.

@ArthurZucker
Copy link
Collaborator Author

I answered on the issue but it's fixed and part of the latest release! 🤗

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment