
Fixing roberta type id (everything is zero). #1072

Merged: 3 commits merged into main from fix_roberta_type_id on Sep 26, 2022

Conversation

@Narsil (Collaborator) commented Sep 26, 2022

The original RoBERTa processor actually gives out type_id 0 for every part:

https://github.com/huggingface/tokenizers/blob/python-v0.12.1/tokenizers/src/processors/roberta.rs#L117

This broke one integration test in transformers:

from transformers import AutoTokenizer, LayoutLMForQuestionAnswering
from datasets import load_dataset
import torch

tokenizer = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")

dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
question = "what's his name?"
words = example["words"]
boxes = example["bboxes"]

encoding = tokenizer(
    question.split(), words, is_split_into_words=True, return_token_type_ids=True, return_tensors="pt"
)
print(encoding["input_ids"])
print(encoding.sequence_ids(0))
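# Build one bounding box per token: the word's box for context tokens,
# [1000] * 4 for the SEP token, and [0] * 4 for everything else.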
bbox = []
for i, s, w in zip(encoding.input_ids[0], encoding.sequence_ids(0), encoding.word_ids(0)):
    if s == 1:
        bbox.append(boxes[w])
    elif i == tokenizer.sep_token_id:
        bbox.append([1000] * 4)
    else:
        bbox.append([0] * 4)
encoding["bbox"] = torch.tensor([bbox])

word_ids = encoding.word_ids(0)
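# Run the model and map the argmax of the start/end logits back to word
# indices to recover the predicted answer text.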
outputs = model(**encoding)
loss = outputs.loss
start_scores = outputs.start_logits
end_scores = outputs.end_logits
start, end = word_ids[start_scores.argmax(-1)], word_ids[end_scores.argmax(-1)]
data = " ".join(words[start : end + 1])
print(repr(data))
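
As a quick sanity check, the script above can be extended to confirm the behaviour this PR restores: the fast tokenizer should hand the model all-zero token_type_ids, as the slow RobertaTokenizer and tokenizers 0.12.1 do (a minimal sketch reusing the encoding built above):

# With the fix, the fast tokenizer emits type_id 0 for every position,
# including the second sequence of the pair.
print(encoding["token_type_ids"])
assert int(encoding["token_type_ids"].sum()) == 0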

@HuggingFaceDocBuilderDev commented Sep 26, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh commented Sep 26, 2022

Thanks, @Narsil for working on this issue.

First, should this line be changed too?

let pair_type_ids = vec![1; encoding.get_ids().len() + 2];

In 0.12.1, it was

let pair_type_ids = vec![0; encoding.get_ids().len() + 2];

If I understand correctly, roberta_processing is a recently added test in this file, right?

Overall, LGTM for the change, thanks again. We should probably double-check why we used all 0s before (when we get some time).

@Narsil (Collaborator, Author) commented Sep 26, 2022

First, should this line be changed too:

Yes indeed, and even slightly more than that: the encodings are passed in with type_ids set by default to their sequence number (only pairs really use this, so 0 and 1), but it seems RoBERTa used 0 everywhere (the default used to be 0).

If I understand correctly, roberta_processing is a recently added test in this file, right? Overall, LGTM for the change, thanks again. We should probably double check why we used all 0 before (when we get some time).

Yes, there was basically no test, so I added the tests along with the trait modification. That's why I put in the values that seemed logical, not necessarily the ones that used to be there before (I tried to create the test against the 0.12.1 version; I may have done it wrong for this one).
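
A minimal library-level sketch of the expected result (assuming the roberta-base checkpoint and its tokenizer.json are reachable from the Hub):

from tokenizers import Tokenizer

# Load the fast RoBERTa tokenizer and encode a sequence pair.
tok = Tokenizer.from_pretrained("roberta-base")
enc = tok.encode("first sentence", "second sentence")
# The RoBERTa post-processor should assign type_id 0 to every position,
# matching the 0.12.1 behaviour this PR restores.
print(enc.type_ids)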

@Narsil Narsil force-pushed the fix_roberta_type_id branch from 051be5d to dcb532f on September 26, 2022 at 15:23
@ydshieh left a comment

@Narsil (Collaborator, Author) commented Sep 26, 2022

This is still weird for LayoutLM in the sense that there is a token type embedding layer which, if only zero is ever used, is not really useful, if I understand correctly.

That's what I meant when I said I was unsure about whether it was a bug or not (regardless, this fixes it).
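
For reference, the embedding layer in question can be inspected directly (a sketch; the printed shape depends on the checkpoint's config):

from transformers import LayoutLMForQuestionAnswering

model = LayoutLMForQuestionAnswering.from_pretrained("impira/layoutlm-document-qa", revision="1e3ebac")
# The table has type_vocab_size rows, but only row 0 is ever looked up
# when all token_type_ids are 0.
print(model.layoutlm.embeddings.token_type_embeddings)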

@ydshieh commented Sep 26, 2022

What I found is class LayoutLMTokenizer(BertTokenizer):, but the impira/layoutlm-document-qa checkpoint has "tokenizer_class": "RobertaTokenizer", so it loads RobertaTokenizer instead of LayoutLMTokenizer.

See
https://huggingface.co/impira/layoutlm-document-qa/blob/main/config.json

This should answer your question, @Narsil. But we are not sure why impira/layoutlm-document-qa has "tokenizer_class": "RobertaTokenizer" - a bug, or the author intended to use that tokenizer class for training, etc.
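
A quick way to confirm which class actually gets loaded (a sketch):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("impira/layoutlm-document-qa", add_prefix_space=True)
# Expected to print RobertaTokenizerFast, since the checkpoint's config.json
# sets "tokenizer_class": "RobertaTokenizer".
print(type(tok).__name__)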

@Narsil Narsil merged commit 5f6e978 into main Sep 26, 2022
@julien-c (Member) commented Sep 27, 2022

But we are not sure why impira/layoutlm-document-qa has "tokenizer_class": "RobertaTokenizer" - a bug, or the author intended to use that tokenizer class for training, etc.

You should ask directly in the repo discussions, no @ydshieh?

Update: https://huggingface.co/impira/layoutlm-document-qa/discussions/4

@Narsil Narsil deleted the fix_roberta_type_id branch September 27, 2022 11:40