
Making transformers work on 0.12. #16537

Closed
wants to merge 1 commit into from

Conversation

Narsil (Contributor) commented Apr 1, 2022

What does this PR do?

tokenizers 0.12 changed the way decoder.decode() works.

Instead of returning a string directly, it now returns a list of strings (the "decoded" parts), which enables decoders to be chained (and hence customized more easily).

The fix simply joins those parts for versions >= 0.12.
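For concreteness, a minimal sketch of the version-gated join (the helper name and exact check here are illustrative, not the literal diff):

```python
from packaging import version

import tokenizers


def _decode_with_decoder(decoder, tokens):
    # Illustrative helper, not the actual transformers code path.
    out = decoder.decode(tokens)
    if version.parse(tokenizers.__version__) >= version.parse("0.12.0"):
        # tokenizers >= 0.12 returns the decoded parts as a list of
        # strings (so decoders can be chained); join them back here.
        return "".join(out)
    # Older versions already return the final string.
    return out
```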

Fixes #16520

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev commented Apr 1, 2022

The documentation is not available anymore as the PR was closed or merged.

Narsil requested review from sgugger and LysandreJik on April 1, 2022 08:06
SaulLu (Contributor) commented Apr 1, 2022

Thank you very much for the proposed fix.

Before reviewing this PR, I would like to see whether it is possible to keep the new feature in tokenizers==0.12.0 that allows chaining decoders, but to add the conversion to string format at the end.

LysandreJik (Member) left a comment

If both @SaulLu and @Narsil agree that this is the correct fix, then I'm ok to merge it like this to make master green.

Note: from the transformers perspective it would obviously be better to have a 0.12.1 release revert the breaking change, as all versions of transformers that preceded this fix will otherwise continue to be broken.

Narsil (Contributor, Author) commented Apr 1, 2022

Summarizing an oral discussion around our options:

TL;DR

Ultimately the balance between 3/ (forward compatibility) and 1/ (the value of the change) should be the biggest factor in the decision between:

  • reverting the change, which prevents us from using a capability that is sometimes needed;
  • keeping this PR's change, which creates a forward incompatibility: transformers<=4.17 becomes incompatible with tokenizers>=0.12.

1/ The change in tokenizers is a good one. On several occasions (most recently CLIP: https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/tokenization_clip_fast.py#L111) the inability to chain decoders had to be worked around, which made for poor dev UX in transformers. It was also a limitation for the BigScience tokenizer (not the latest one, but the one before) and the reason for the change.
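As an illustration of the pattern this enables (a sketch assuming the 0.12.0 list-returning behavior described above; the decoder choice is arbitrary):

```python
from tokenizers import decoders

byte_level = decoders.ByteLevel()
pieces = ["Hello", "Ġworld"]

# Under the 0.12.0 behavior, decode() returns the transformed pieces
# as a list, so another decoder (or any custom step) can consume them
# next; only the final caller joins them into a string.
pieces = byte_level.decode(pieces)
text = "".join(pieces)
print(text)  # "Hello world"
```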

2/ The "".join(...decoders.decode(tokens)) is a bit clunky and not super self evident.
This is how the coders have to operate now, so using them in isolation should work that way in order to make the composition understandable. There are already discussions around convert_tokens_to_string to make it private later, since it causes some issues. Within tokenizers itself, there's no way to access tokens directly anyway, so users shouldn't have to .join in the first place.

A potential less clunky way would be to add Tokenizer.decode_tokens(tokens) within tokenizers to prevent the join in transformers which is indeed clunky. The only issue is that convert_tokens_to_string is already causing issues (mostly around the lines like why is decode not showing what I think it should) with discussions already going on about making it private at least. Enabling such a function in tokenizers might open the same discussions over there. Definitely not a showstopper, but something to think about.

The promoted way Tokenizers.decode(ids) -> str remains unchanged for tokenizers and so far raised less questions.

3/ Forward compatibility.

The main caveat of this proposed change is that earlier versions of transformers will remain broken with the new versions of tokenizers. Reverting is the only reasonable fix for that (but it also means losing the composition capability we need for the decoders).

4/ Use of convert_tokens_to_string in TokenClassification.

It's a legacy thing, kept unchanged so as not to break backward compatibility, but it causes issues of its own: #15785 (comment)

Using offsets instead of decode would help in that situation (we can do it in a non-breaking manner by adding a new key and keeping the old one), as sketched below.
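For example, a rough sketch of the offsets-based approach (the function name is hypothetical, not the actual pipeline output):

```python
# Hypothetical: recover the surface form of an entity span from
# character offsets instead of re-decoding its tokens.
def entity_text(sentence: str, offsets: list) -> str:
    start = offsets[0][0]   # start of the first token in the span
    end = offsets[-1][1]    # end of the last token in the span
    return sentence[start:end]


print(entity_text("Hugging Face is in NYC", [(19, 22)]))  # "NYC"
```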

5/ Tokenizer tests.
The transformers tokenization tests skip that function entirely, which is why the breakage went unnoticed. We're going to update the tests to include at least a return-type check for that function.
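Something along these lines (a sketch of the kind of return-type check meant here, not the actual test suite):

```python
def test_convert_tokens_to_string_returns_str(tokenizer):
    # Guards against decoder backends returning a list of parts
    # instead of the joined string.
    tokens = tokenizer.tokenize("Hello world")
    result = tokenizer.convert_tokens_to_string(tokens)
    assert isinstance(result, str)
```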

LysandreJik (Member) commented

Would it be possible to consider a deprecation cycle for the tokenizer change, for example with an opt-in flag for the new behavior? Doing this would keep all previous versions working with 0.12.0 while providing support for the new behavior.

This would allow us to prepare for the breaking change and ship at least a few versions that support it before dropping the current behavior.
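A hypothetical sketch of such an opt-in flag (none of these names exist in tokenizers; this only illustrates the suggested deprecation cycle):

```python
class ChainableDecoder:
    """Toy stand-in for a tokenizers decoder."""

    def _decode_pieces(self, tokens):
        # Stand-in for the real per-decoder transformation.
        return [t.replace("Ġ", " ") for t in tokens]

    def decode(self, tokens, return_pieces=False):
        pieces = self._decode_pieces(tokens)
        if return_pieces:
            return pieces          # new, chainable behavior (opt-in)
        return "".join(pieces)     # current behavior stays the default


print(ChainableDecoder().decode(["Hello", "Ġworld"]))  # "Hello world"
```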

Narsil (Contributor, Author) commented Apr 1, 2022

OK, 3/ (forward compatibility) is too important to break.

We will revert in 0.12.1 and find a cleaner solution for 0.12.2 that doesn't break things in this manner.
