
Making transformers work on 0.12. #16537

Closed
wants to merge 1 commit into from

Conversation

Narsil (Contributor) commented Apr 1, 2022

What does this PR do?

tokenizers 0.12 changed the way decoder.decode() works.

Instead of returning a string directly, it now returns a list of strings (the "decoded" parts), which enables decoders to be chained (and hence customized more easily).

The fix simply joins those parts for versions >= 0.12.
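For concreteness, a minimal sketch of the version-gated join (the helper name and exact check here are illustrative, not the literal diff):

```python
from packaging import version

import tokenizers


def _decode_with_decoder(decoder, tokens):
    # Illustrative helper, not the actual transformers code path.
    out = decoder.decode(tokens)
    if version.parse(tokenizers.__version__) >= version.parse("0.12.0"):
        # tokenizers >= 0.12 returns the decoded parts as a list of
        # strings (so decoders can be chained); join them back here.
        return "".join(out)
    # Older versions already return the final string.
    return out
```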

Fixes #16520

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev commented Apr 1, 2022

The documentation is not available anymore as the PR was closed or merged.

Narsil requested review from sgugger and LysandreJik on April 1, 2022 08:06
SaulLu (Contributor) commented Apr 1, 2022

Thank you very much for the proposed fix.

Before reviewing this PR, I would like to see whether it is possible to keep the new feature in tokenizers==0.12.0 that allows chaining decoders, but to add the conversion to string format at the end.

LysandreJik (Member) left a comment

If both @SaulLu and @Narsil agree that this is the correct fix, then I'm ok to merge it like this to make master green.

Note: from the transformers perspective it would obviously be better to have a 0.12.1 release revert the breaking change, as all versions of transformers that preceded this fix will otherwise continue to be broken.

Narsil (Contributor, Author) commented Apr 1, 2022

Summarizing an oral discussion around our options:

TL;DR

Ultimately the balance between 3/ (forward compatibility) and 1/ (the value of the change) should be the biggest factor in the decision between:

  • reverting the change, which prevents us from using a capability that is sometimes needed;
  • keeping this PR's change, which creates a forward incompatibility: transformers<=4.17 becomes incompatible with tokenizers>=0.12.

1/ The change in tokenizers is a good one. On several occasions (most recently CLIP: https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/tokenization_clip_fast.py#L111) the inability to chain decoders had to be worked around, which made for poor dev UX in transformers. It was also a limitation for the BigScience tokenizer (not the latest one, but the one before) and the reason for the change.
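As an illustration of the pattern this enables (a sketch assuming the 0.12.0 list-returning behavior described above; the decoder choice is arbitrary):

```python
from tokenizers import decoders

byte_level = decoders.ByteLevel()
pieces = ["Hello", "Ġworld"]

# Under the 0.12.0 behavior, decode() returns the transformed pieces
# as a list, so another decoder (or any custom step) can consume them
# next; only the final caller joins them into a string.
pieces = byte_level.decode(pieces)
text = "".join(pieces)
print(text)  # "Hello world"
```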

2/ The "".join(...decoders.decode(tokens)) is a bit clunky and not super self evident.
This is how the coders have to operate now, so using them in isolation should work that way in order to make the composition understandable. There are already discussions around convert_tokens_to_string to make it private later, since it causes some issues. Within tokenizers itself, there's no way to access tokens directly anyway, so users shouldn't have to .join in the first place.

A potential less clunky way would be to add Tokenizer.decode_tokens(tokens) within tokenizers to prevent the join in transformers which is indeed clunky. The only issue is that convert_tokens_to_string is already causing issues (mostly around the lines like why is decode not showing what I think it should) with discussions already going on about making it private at least. Enabling such a function in tokenizers might open the same discussions over there. Definitely not a showstopper, but something to think about.

The promoted way Tokenizers.decode(ids) -> str remains unchanged for tokenizers and so far raised less questions.

3/ Forward compatibility.

The main caveat of this proposed change is that earlier versions of transformers will remain broken with the new versions of tokenizers. Reverting is the only reasonable fix for that (but it also means losing the composition capability we need for the decoders).

4/ Use of convert_tokens_to_string in TokenClassification.

It's a legacy thing, kept unchanged so as not to break backward compatibility, but it causes issues of its own: #15785 (comment)

Using offsets instead of decode would help in that situation (we can do it in a non-breaking manner by adding a new key and keeping the old one), as sketched below.
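For example, a rough sketch of the offsets-based approach (the function name is hypothetical, not the actual pipeline output):

```python
# Hypothetical: recover the surface form of an entity span from
# character offsets instead of re-decoding its tokens.
def entity_text(sentence: str, offsets: list) -> str:
    start = offsets[0][0]   # start of the first token in the span
    end = offsets[-1][1]    # end of the last token in the span
    return sentence[start:end]


print(entity_text("Hugging Face is in NYC", [(19, 22)]))  # "NYC"
```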

5/ Tokenizer tests.
The transformers tokenization tests skip that function entirely, which is why the breakage went unnoticed. We're going to update the tests to include at least a return-type check for that function.
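Something along these lines (a sketch of the kind of return-type check meant here, not the actual test suite):

```python
def test_convert_tokens_to_string_returns_str(tokenizer):
    # Guards against decoder backends returning a list of parts
    # instead of the joined string.
    tokens = tokenizer.tokenize("Hello world")
    result = tokenizer.convert_tokens_to_string(tokens)
    assert isinstance(result, str)
```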

LysandreJik (Member) commented

Would it be possible to consider a deprecation cycle for the tokenizer change, for example with an opt-in flag for the new behavior? Doing this would keep all previous versions working with 0.12.0 while providing support for the new behavior.

This would allow us to prepare for the breaking change and ship at least a few versions that support it before dropping the current behavior.
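A hypothetical sketch of such an opt-in flag (none of these names exist in tokenizers; this only illustrates the suggested deprecation cycle):

```python
class ChainableDecoder:
    """Toy stand-in for a tokenizers decoder."""

    def _decode_pieces(self, tokens):
        # Stand-in for the real per-decoder transformation.
        return [t.replace("Ġ", " ") for t in tokens]

    def decode(self, tokens, return_pieces=False):
        pieces = self._decode_pieces(tokens)
        if return_pieces:
            return pieces          # new, chainable behavior (opt-in)
        return "".join(pieces)     # current behavior stays the default


print(ChainableDecoder().decode(["Hello", "Ġworld"]))  # "Hello world"
```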

Narsil (Contributor, Author) commented Apr 1, 2022

OK, 3/ (forward compatibility) is too important to break.

We will revert in 0.12.1 and find a cleaner solution for 0.12.2 that doesn't break things in this manner.
