
Fixing roberta type id (everything is zero). #1072

Merged: 3 commits merged into main from fix_roberta_type_id on Sep 26, 2022

Conversation

@Narsil (Collaborator) commented Sep 26, 2022

The original RoBERTa processor actually gives out type_id 0 for every part:

https://github.com/huggingface/tokenizers/blob/python-v0.12.1/tokenizers/src/processors/roberta.rs#L117

This broke one integration test in transformers:

from transformers import AutoTokenizer, LayoutLMForQuestionAnswering
from datasets import load_dataset
import torch

tokenizer = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")

dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
question = "what's his name?"
words = example["words"]
boxes = example["bboxes"]

encoding = tokenizer(
    question.split(), words, is_split_into_words=True, return_token_type_ids=True, return_tensors="pt"
)
print(encoding["input_ids"])
print(encoding.sequence_ids(0))
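# Build one bounding box per token: the word's box for context tokens,
# [1000] * 4 for the SEP token, and [0] * 4 for everything else.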
bbox = []
for i, s, w in zip(encoding.input_ids[0], encoding.sequence_ids(0), encoding.word_ids(0)):
    if s == 1:
        bbox.append(boxes[w])
    elif i == tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        bbox.append([0] * 4)
encoding["bbox"] = torch.tensor([bbox])

word_ids = encoding.word_ids(0)
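# Run the model and map the argmax of the start/end logits back to word
# indices to recover the predicted answer text.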
outputs = model(**encoding)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
data = " ".join(words[start : end + 1])
print(repr(data))
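
As a quick sanity check, the script above can be extended to confirm the behaviour this PR restores: the fast tokenizer should hand the model all-zero token_type_ids, as the slow RobertaTokenizer and tokenizers 0.12.1 do (a minimal sketch reusing the encoding built above):

# With the fix, the fast tokenizer emits type_id 0 for every position,
# including the second sequence of the pair.
print(encoding["token_type_ids"])
assert int(encoding["token_type_ids"].sum()) == 0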

@HuggingFaceDocBuilderDev commented Sep 26, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh commented Sep 26, 2022

Thanks, @Narsil for working on this issue.

First, should this line be changed too?

let pair_type_ids = vec![1; encoding.get_ids().len() + 2];

In 0.12.1, it was

let pair_type_ids = vec![0; encoding.get_ids().len() + 2];

If I understand correctly, roberta_processing is a recently added test in this file, right?

Overall, LGTM for the change, thanks again. We should probably double-check why we used all 0s before (when we get some time).

@Narsil (Collaborator, Author) commented Sep 26, 2022

First, should this line be changed too:

Yes indeed, and even slightly more than that: the encodings are passed in with type_ids set by default to their sequence number (only pairs really use this, so 0 and 1), but it seems RoBERTa used 0 everywhere (the default used to be 0).

If I understand correctly, roberta_processing is a recently added test in this file, right? Overall, LGTM for the change, thanks again. We should probably double check why we used all 0 before (when we get some time).

Yes, there was basically no test, so I added the tests along with the trait modification. That's why I put in the values that seemed logical, not necessarily the ones that used to be there before (I tried to create the test against the 0.12.1 version; I may have done it wrong for this one).
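
A minimal library-level sketch of the expected result (assuming the roberta-base checkpoint and its tokenizer.json are reachable from the Hub):

from tokenizers import Tokenizer

# Load the fast RoBERTa tokenizer and encode a sequence pair.
tok = Tokenizer.from_pretrained("roberta-base")
enc = tok.encode("first sentence", "second sentence")
# The RoBERTa post-processor should assign type_id 0 to every position,
# matching the 0.12.1 behaviour this PR restores.
print(enc.type_ids)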

@Narsil Narsil force-pushed the fix_roberta_type_id branch from 051be5d to dcb532f on September 26, 2022 at 15:23
@ydshieh left a comment

@Narsil (Collaborator, Author) commented Sep 26, 2022

This is still weird for LayoutLM in the sense that there is a token type embedding layer which, if only zero is ever used, is not really useful, if I understand correctly.

That's what I meant when I said I was unsure about whether it was a bug or not (regardless, this fixes it).
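
For reference, the embedding layer in question can be inspected directly (a sketch; the printed shape depends on the checkpoint's config):

from transformers import LayoutLMForQuestionAnswering

model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")
# The table has type_vocab_size rows, but only row 0 is ever looked up
# when all token_type_ids are 0.
print(model.layoutlm.embeddings.token_type_embeddings)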

@ydshieh commented Sep 26, 2022

What I found is class LayoutLMTokenizer(BertTokenizer):, but the impira/layoutlm-document-qa checkpoint has "tokenizer_class": "RobertaTokenizer", so it loads RobertaTokenizer instead of LayoutLMTokenizer.

See
https://huggingface.co/impira/layoutlm-document-qa/blob/main/config.json

This should answer your question, @Narsil. But we are not sure why impira/layoutlm-document-qa has "tokenizer_class": "RobertaTokenizer" - a bug, or the author intended to use that tokenizer class for training, etc.
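
A quick way to confirm which class actually gets loaded (a sketch):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
# Expected to print RobertaTokenizerFast, since the checkpoint's config.json
# sets "tokenizer_class": "RobertaTokenizer".
print(type(tok).__name__)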

@Narsil Narsil merged commit 5f6e978 into main Sep 26, 2022
@julien-c (Member) commented Sep 27, 2022

But we are not sure why impira/layoutlm-document-qa has "tokenizer_class": "RobertaTokenizer" - a bug, or the author intended to use that tokenizer class for training, etc.

You should ask directly in the repo discussions, no @ydshieh?

Update: https://huggingface.co/impira/layoutlm-document-qa/discussions/4

@Narsil Narsil deleted the fix_roberta_type_id branch September 27, 2022 11:40