Token classification pipeline results different with tokenizers==0.11.6 vs tokenizers==0.12.0 #16520

davidmezzetti · 2022-03-31T14:50:49Z

I'm not sure whether this is an issue with transformers, an issue with tokenizers, or expected behavior. When running the token classification pipeline with aggregation_strategy="simple", the results differ with tokenizers==0.12.0: the 'word' field comes back as a list of token strings instead of a single string.

The following code produces different results (both examples use transformers==4.17.0).

from transformers import pipeline
nlp = pipeline("token-classification")
nlp("Hugging Face Inc. is a company based in New York City", aggregation_strategy="simple")

With tokenizers==0.11.6:

[{'entity_group': 'ORG', 'score': 0.99305606, 'word': 'Hugging Face Inc', 'start': 0, 'end': 16}, {'entity_group': 'LOC', 'score': 0.9988098, 'word': 'New York City', 'start': 40, 'end': 53}]

With tokenizers==0.12.0:

[{'entity_group': 'ORG', 'score': 0.99305606, 'word': ['Hu', 'gging', ' Face', ' Inc'], 'start': 0, 'end': 16}, {'entity_group': 'LOC', 'score': 0.9988098, 'word': ['New', ' York', ' City'], 'start': 40, 'end': 53}]
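Until the underlying behavior is resolved, downstream code that expects 'word' to be a string could coerce it defensively. The following is only a minimal sketch of that idea, not a fix from transformers or tokenizers: normalize_entity_words is a hypothetical helper, and it assumes the list pieces already carry their leading spaces (as in the 0.12.0 output above), so plain concatenation reconstructs the text.

```python
from typing import Any, Dict, List


def normalize_entity_words(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: return copies of the entities with 'word' as a single string."""
    normalized = []
    for entity in entities:
        entity = dict(entity)  # shallow copy so the caller's dicts are left untouched
        word = entity.get("word")
        if isinstance(word, (list, tuple)):
            # Assumption: each piece keeps its own leading space, so joining
            # with an empty separator rebuilds the original span.
            entity["word"] = "".join(word).strip()
        normalized.append(entity)
    return normalized


# Output copied from the tokenizers==0.12.0 run reported above.
output_0_12 = [
    {"entity_group": "ORG", "score": 0.99305606,
     "word": ["Hu", "gging", " Face", " Inc"], "start": 0, "end": 16},
    {"entity_group": "LOC", "score": 0.9988098,
     "word": ["New", " York", " City"], "start": 40, "end": 53},
]

normalized = normalize_entity_words(output_0_12)
print([entity["word"] for entity in normalized])
```

Entities whose 'word' is already a string pass through unchanged, so the same call should be safe with either tokenizers version.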


patil-suraj · 2022-03-31T16:20:52Z

cc @SaulLu @Narsil

Narsil · 2022-04-01T07:47:38Z

There's a proposed PR here: #16537

Narsil · 2022-04-01T07:48:09Z

Tokenization tests were run before releasing 0.12 but not the pipeline tests :(

github-actions · 2022-04-30T15:01:42Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

This was referenced Mar 31, 2022

convert_tokens_to_string does not conform to its signature #16525

Closed

Unit tests failing with tokenizers>= 0.12 neuml/txtai#253

Closed

Narsil mentioned this issue Apr 1, 2022

Making transformers work on 0.12. #16537

Closed


SaulLu mentioned this issue Apr 1, 2022

add a test checking the format of convert_tokens_to_string's output #16540

Merged


RobertSamoilescu mentioned this issue Apr 1, 2022

AnchorText Language Model - convert_tokens_to_string not conform to its signature SeldonIO/alibi#621

Open

davidmezzetti mentioned this issue Apr 1, 2022

Issue with aggregation_strategy="max" in NER pipeline #16542

Closed

github-actions bot closed this as completed May 8, 2022
