Predictions for pre-tokenized tokens with Roberta have strange offset_mapping #14305
Comments
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is still relevant and dear to me.
Pinging @SaulLu for advice
First of all, thank you very much for the detailed issue, which makes it very easy to understand your problem. 🤗 To put it in context, the offsets feature comes from the (Rust) Tokenizers library, and I must unfortunately admit that I would need a little more information about the behavior in this library to be able to provide you with a solution to your problem (see the question I asked here). That being said, I strongly suspect that there was also an oversight on our part to adapt the tokenizer stored in the
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@jcklie, some news about your issue: we merged some corrections in the main branch of transformers (this PR) and in the new version of tokenizers (this PR). So, using the main branch of
There is one more case where the returned offsets can be a bit confusing, but we hesitate to make a fix in the tokenizers library because the fix would be quite heavy to implement. Don't hesitate to share your opinion in the issue that explains and discusses this case here. I'll close this issue, but don't hesitate to react to it if you think your problem is not solved.
This is still an issue with roberta-large ...
@ohmeow, I just tested the code snippet below:

```python
from transformers import AutoTokenizer

name = "roberta-large"
text = "17 yo with High blood pressure"

hf_tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True)
inputs = hf_tokenizer(text, return_offsets_mapping=True)

# Print the resulting offset mapping
title = f"{'token':10} | {'offset':10} | corresponding text"
print(title)
print("-" * len(title))
for (start_idx, end_idx), token in zip(inputs["offset_mapping"], hf_tokenizer.convert_ids_to_tokens(inputs["input_ids"])):
    print(f"{token:10} | {f'({start_idx}, {end_idx})':10} | {repr(text[start_idx:end_idx])}")
```

and the result looks good to me:
Do you agree? To understand why my output is different from yours, can you run the command
Yup ... my version of tokenizers was outdated! Sorry to bother you :) Thanks for the follow-up.
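(The exact command asked for above is cut off in the extract. As a hedged aside, not part of the original thread, the installed versions can be checked from Python along these lines:)

```python
# Not from the original thread: a quick way to print the installed library versions,
# since the exact command requested above is truncated in the extract.
import tokenizers
import transformers

print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
```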
Environment info
`transformers` version: 4.12.3
Who can help
Error/Issue is in fast Roberta tokenizers
@LysandreJik
Information
The problem arises when using:
The tasks I am working on is:
I am trying to do POS tagging with a Roberta-based transformer; I base my code on this. The issue arises when I want to map the subword-tokenized predictions back to my tokens.
I followed this guide and it works for BERT-based models, but I do not know exactly how to check whether something is a subword token when using `add_prefix_space`, as the offsets both start with 1 when a token of length 1 is followed by a subword token. I do not know whether this is intended or not, but it makes it hard to align the predictions back to the original tokens, because the rule that the end index of a token and the start index of the following subword token are identical is broken in fast Roberta tokenizers.
In the WNUT example, it says:

> That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100, which means that we do not keep it.

If we now use 1 instead, since a space is added for every token, this rule breaks.
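For illustration only (not code from the issue or the linked example), here is a simplified sketch of that labeling rule, assuming word-relative offsets from `is_split_into_words=True` and one label per word:

```python
# Illustrative only: a simplified version of the rule quoted above, assuming
# word-relative offsets (is_split_into_words=True) and one label per word.
def align_labels(word_labels, offset_mapping):
    aligned, word_idx = [], 0
    for start, end in offset_mapping:
        if start == 0 and end != 0:
            # First subtoken of a word: keep its label.
            aligned.append(word_labels[word_idx])
            word_idx += 1
        else:
            # Special tokens and subword continuations are masked out.
            aligned.append(-100)
    return aligned
```

With `add_prefix_space`, the issue reports that offsets start at 1 because of the prepended space, so the `start == 0` test would mask word-initial tokens as well, which is exactly how the rule breaks.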
To reproduce
Steps to reproduce the behavior:
Use `add_prefix_space` together with `is_split_into_words` (a minimal sketch of this setup is shown below).
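The original reproduction snippet is not included in the extract, so the following is only a sketch of the setup described, with `roberta-base` and the word list chosen as arbitrary examples:

```python
# A minimal sketch (not the issue author's original code) of the described setup:
# a fast Roberta tokenizer with add_prefix_space=True applied to pre-tokenized words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True, add_prefix_space=True)

words = ["I", "like", "headphones"]  # example pre-tokenized input
encoding = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)

for token, offset in zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"]), encoding["offset_mapping"]):
    print(token, offset)
```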
Output
Expected behavior
I would expect the offsets to behave similarly to when not using `add_prefix_space`, i.e. that the automatically added space does not influence the offsets. Is there a better way to align tokens and predictions for Roberta tokenizers than checking whether the first character is a space?