
MBartForConditionalGeneration doesn't seem to be able to complete the mask-filling task. #25425

Closed
5i-wanna-be-the-666 opened this issue Aug 10, 2023 · 6 comments

Comments

@5i-wanna-be-the-666

5i-wanna-be-the-666 commented Aug 10, 2023

System Info

transformers version: 4.29.2
Platform: Linux ubt-4090 5.15.0-75-generic
Python version: 3.9.5
PyTorch version (GPU?): 1.12.1+cu113 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker @younesbelkada @patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I followed the official Hugging Face documentation for mask filling, I got the expected output.

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# de_DE is the language symbol id <LID> for German
TXT = "</s> Meine Freunde sind <mask> nett aber sie essen zu viel Kuchen. </s> de_DE"

input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()
['nett', 'sehr', 'ganz', 'nicht', 'so']

But when I changed the text to be filled to Chinese, something unexpected happened.

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# zh_ZH is meant to be the language symbol id <LID> for Chinese
TXT = "</s> 今天<mask>真好,我准备去公园打羽毛球. </s> zh_ZH"

input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()
[',·:.']

After that, I tried to get mBART to restore a sentence with multiple masks, and the results were even worse.

from transformers import MBartTokenizer, DataCollatorForLanguageModeling, MBartForConditionalGeneration

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

TXT_input = "<s>The weather is so nice today, I am going to play badminton in the park</s>en_xx"

inputs = tokenizer([TXT_input], add_special_tokens=False, return_tensors="pt", max_length=32, padding='max_length')

# Randomly mask 15% of the tokens with the MLM collator.
masked_inputs_and_labels = data_collator([inputs])

input_ids = masked_inputs_and_labels['input_ids'][0]
attention_mask = masked_inputs_and_labels['attention_mask'][0]
labels = masked_inputs_and_labels['labels'][0]

masked_inputs = {key: value[0] for key, value in masked_inputs_and_labels.items()}
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
logits = outputs.logits

print(f'after mask: {tokenizer.decode(masked_inputs["input_ids"][0])}')

predictions = outputs.logits.argmax(dim=-1)

print(f'Predicted sentence: {tokenizer.decode(predictions[0])}')
after mask: <s> The weather is so nice today, I am going tosähkö badminton in the park</s> en_xx<pad><pad><pad><pad><pad><pad><pad><mask><pad><pad><pad>
Predicted sentence: <s>นยยยยยนนนนนน badmintonนนนap<s><s><s><s><s><s><s><s><s><s><s><s><s><s>

Excuse me, is there something wrong with my usage? If so, how can I correctly use mBART to fill the mask?

Expected behavior

I would expect mBART to place at least one Chinese token among the five highest-probability predictions, or to restore the masked sentence for me.
For example: ['天气', '心情', ...]
or: Predicted sentence: "The weather is so nice today, I am going to play badminton in the park en_xx"

@ArthurZucker
Collaborator

Hey! It seems that the model you are trying to use was not trained on zh_ZH but on zh_CN. Could you try using that instead? (It might just be the token that needs to be updated.)

For the second script, I don't think you changed the src_lang of the tokenizer, which is not Chinese by default.
I got [',我是早上去'] as an output with:

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# zh_CN is the language symbol id <LID> for Chinese
TXT = "</s> 今天<mask>真好,我准备去公园打羽毛球. </s> zh_CN"

input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()

which is already a lot better 😉
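
For the tokenizer side, here is a minimal sketch of the src_lang route mentioned above, assuming the goal is to let the tokenizer append </s> and the language code itself rather than writing them into the string. Note that this produces `... </s> zh_CN` without the leading `</s>` of the documentation example, so results may differ:

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
# src_lang controls which language code the tokenizer appends after </s>
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25", src_lang="zh_CN")

TXT = "今天<mask>真好,我准备去公园打羽毛球."

# add_special_tokens left at its default so </s> and zh_CN are added automatically
input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)
print(tokenizer.decode(predictions).split())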

@5i-wanna-be-the-666
Author

5i-wanna-be-the-666 commented Aug 10, 2023

Thank you for your reply! I did get good results after changing zh_ZH to zh_CN. The reason I thought it was zh_ZH is that I accidentally misread the documentation.

But how can I solve the problem in the last script? Even when there are multiple mask tokens, I also want to compute the MLM loss for this model.
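
For reference, a minimal sketch of one way such a loss could be computed, assuming mBART's seq2seq denoising setup (corrupted text as encoder input, the original text as labels) rather than BERT-style per-token MLM; the sentence and mask positions are illustrative:

from transformers import MBartTokenizer, MBartForConditionalGeneration

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25", src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

original = "The weather is so nice today, I am going to play badminton in the park"
corrupted = "The weather is so <mask> today, I am going to <mask> badminton in the park"

inputs = tokenizer(corrupted, return_tensors="pt")
# For a denoising loss the labels are simply the uncorrupted token ids;
# the model builds decoder_input_ids from them and returns a cross-entropy loss.
labels = tokenizer(original, return_tensors="pt")["input_ids"]

outputs = model(**inputs, labels=labels)
print(outputs.loss)                                        # loss over the whole target
print(tokenizer.batch_decode(outputs.logits.argmax(-1)))   # greedy reconstruction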

@github-actions

github-actions bot commented Sep 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

Sorry for the late reply. I recommend you have a look at #10222 and search our forum, where this has been answered!
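
For multi-token masks, a minimal sketch of generation-based filling along the lines discussed there, assuming mbart-large-cc25 and that the target language equals the source language; the beam size and max_length are illustrative:

from transformers import MBartTokenizer, MBartForConditionalGeneration

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25", src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

TXT = "The weather is so nice today, I am going to <mask> badminton in the <mask>"
inputs = tokenizer(TXT, return_tensors="pt")

# Let the decoder rewrite the whole sentence; the masked spans get filled in the output.
generated = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    num_beams=5,
    max_length=48,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))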

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Aureole-1210

I also have a problem with this.
I want to use facebook/mbart-large-50-many-to-many-mmt for the mask-filling task, but the output is always strange.
I modified the input format as suggested by the model card at https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt.
My code is as follows:

import torch
from transformers import (
    AutoTokenizer,
    BertForMaskedLM,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    DataCollatorForLanguageModeling,
)

model_name_or_path = 'my_path/mbart-large-50-many-to-many-mmt'
model = MBartForConditionalGeneration.from_pretrained(model_name_or_path)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name_or_path)

tokenizer.src_lang = 'en_XX'

src = "So that such a thing won’t happen <mask>."
encoded_src = tokenizer([src], return_tensors="pt")
input_ids = encoded_src["input_ids"]
src_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

model_outputs = model(**encoded_src)
logits = model_outputs.logits

masked_index = torch.nonzero((input_ids[0] == tokenizer.mask_token_id)).item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

print(tokenizer.convert_ids_to_tokens(predictions))

The output is:
['.', '☎', '↔', '∏', '∴']

When I change my input, it always outputs strange symbols. I think something is wrong.

I am not sure whether this model is simply unsuitable for this task. How should I modify my code to get proper outputs? Thank you so much!
