
MBartForConditionalGeneration doesn't seem to be able to complete the mask-filling task. #25425

Closed
5i-wanna-be-the-666 opened this issue Aug 10, 2023 · 6 comments

Comments

@5i-wanna-be-the-666

5i-wanna-be-the-666 commented Aug 10, 2023

System Info

transformers version: 4.29.2
Platform: Linux ubt-4090 5.15.0-75-generic
Python version: 3.9.5
PyTorch version (GPU?): 1.12.1+cu113 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker @younesbelkada @patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I followed the official Hugging Face documentation for mask filling, I got the expected output.

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# de_DE is the language symbol id <LID> for German
TXT = "</s> Meine Freunde sind <mask> nett aber sie essen zu viel Kuchen. </s> de_DE"

input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()
['nett', 'sehr', 'ganz', 'nicht', 'so']

But when I changed the text to be filled to Chinese, something unexpected happened.

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# zh_ZH is meant to be the language symbol id <LID> for Chinese
TXT = "</s> 今天<mask>真好,我准备去公园打羽毛球. </s> zh_ZH"

input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()
[',·:.']

After that, I tried to get mBART to restore a sentence with multiple masks, and the results were even worse.

from transformers import MBartTokenizer, DataCollatorForLanguageModeling, MBartForConditionalGeneration

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

TXT_input = "<s>The weather is so nice today, I am going to play badminton in the park</s>en_xx"

inputs = tokenizer([TXT_input], add_special_tokens=False, return_tensors="pt", max_length=32, padding='max_length')

# Randomly mask 15% of the tokens with the MLM collator.
masked_inputs_and_labels = data_collator([inputs])

input_ids = masked_inputs_and_labels['input_ids'][0]
attention_mask = masked_inputs_and_labels['attention_mask'][0]
labels = masked_inputs_and_labels['labels'][0]

masked_inputs = {key: value[0] for key, value in masked_inputs_and_labels.items()}
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
logits = outputs.logits

print(f'after mask: {tokenizer.decode(masked_inputs["input_ids"][0])}')

predictions = outputs.logits.argmax(dim=-1)

print(f'Predicted sentence: {tokenizer.decode(predictions[0])}')
after mask: <s> The weather is so nice today, I am going tosähkö badminton in the park</s> en_xx<pad><pad><pad><pad><pad><pad><pad><mask><pad><pad><pad>
Predicted sentence: <s>นยยยยยนนนนนน badmintonนนนap<s><s><s><s><s><s><s><s><s><s><s><s><s><s>

Excuse me, is there something wrong with my usage? If so, how can I correctly use mBART to fill the mask?

Expected behavior

I would expect mBART to place at least one Chinese token among the five highest-probability predictions, or to restore the masked sentence for me.
For example: ['天气', '心情', ...]
or: Predicted sentence: "The weather is so nice today, I am going to play badminton in the park en_xx"

@ArthurZucker
Collaborator

Hey! It seems that the model you are trying to use was not trained on zh_ZH but on zh_CN. Could you try using that instead? (It might just be the token that needs to be updated.)

For the second script, I don't think you changed the src_lang of the tokenizer, which is not Chinese by default.
I got [',我是早上去'] as an output with:

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")

# zh_CN is the language symbol id <LID> for Chinese
TXT = "</s> 今天<mask>真好,我准备去公园打羽毛球. </s> zh_CN"

input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()

which is already a lot better 😉
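
For the tokenizer side, here is a minimal sketch of the src_lang route mentioned above, assuming the goal is to let the tokenizer append </s> and the language code itself rather than writing them into the string. Note that this produces `... </s> zh_CN` without the leading `</s>` of the documentation example, so results may differ:

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
# src_lang controls which language code the tokenizer appends after </s>
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25", src_lang="zh_CN")

TXT = "今天<mask>真好,我准备去公园打羽毛球."

# add_special_tokens left at its default so </s> and zh_CN are added automatically
input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)
print(tokenizer.decode(predictions).split())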

@5i-wanna-be-the-666
Author

5i-wanna-be-the-666 commented Aug 10, 2023

Thank you for your reply! I did get good results after changing zh_ZH to zh_CN. The reason I thought it was zh_ZH is that I accidentally misread the documentation.

But how can I solve the problem in the last script? Even when there are multiple mask tokens, I also want to compute the MLM loss for this model.
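
For reference, a minimal sketch of one way such a loss could be computed, assuming mBART's seq2seq denoising setup (corrupted text as encoder input, the original text as labels) rather than BERT-style per-token MLM; the sentence and mask positions are illustrative:

from transformers import MBartTokenizer, MBartForConditionalGeneration

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25", src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

original = "The weather is so nice today, I am going to play badminton in the park"
corrupted = "The weather is so <mask> today, I am going to <mask> badminton in the park"

inputs = tokenizer(corrupted, return_tensors="pt")
# For a denoising loss the labels are simply the uncorrupted token ids;
# the model builds decoder_input_ids from them and returns a cross-entropy loss.
labels = tokenizer(original, return_tensors="pt")["input_ids"]

outputs = model(**inputs, labels=labels)
print(outputs.loss)                                        # loss over the whole target
print(tokenizer.batch_decode(outputs.logits.argmax(-1)))   # greedy reconstruction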

@github-actions

github-actions bot commented Sep 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

Sorry for the late reply. I recommend you have a look at #10222 and search our forum, where this has been answered!
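
For multi-token masks, a minimal sketch of generation-based filling along the lines discussed there, assuming mbart-large-cc25 and that the target language equals the source language; the beam size and max_length are illustrative:

from transformers import MBartTokenizer, MBartForConditionalGeneration

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25", src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

TXT = "The weather is so nice today, I am going to <mask> badminton in the <mask>"
inputs = tokenizer(TXT, return_tensors="pt")

# Let the decoder rewrite the whole sentence; the masked spans get filled in the output.
generated = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"],
    num_beams=5,
    max_length=48,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))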

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Aureole-1210

I also have a problem with this.
I want to use facebook/mbart-large-50-many-to-many-mmt for the mask-filling task, but the output is always strange.
I modified the input format as suggested by the model card at https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt.
My code is as follows:

import torch
from transformers import (
    AutoTokenizer,
    BertForMaskedLM,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    DataCollatorForLanguageModeling,
)

model_name_or_path = 'my_path/mbart-large-50-many-to-many-mmt'
model = MBartForConditionalGeneration.from_pretrained(model_name_or_path)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name_or_path)

tokenizer.src_lang = 'en_XX'

src = "So that such a thing won’t happen <mask>."
encoded_src = tokenizer([src], return_tensors="pt")
input_ids = encoded_src["input_ids"]
src_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

model_outputs = model(**encoded_src)
logits = model_outputs.logits

masked_index = torch.nonzero((input_ids[0] == tokenizer.mask_token_id)).item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

print(tokenizer.convert_ids_to_tokens(predictions))

The output is:
['.', '☎', '↔', '∏', '∴']

When I change my input, it always outputs strange symbols. I think something is wrong.

I am not sure whether this model is simply unsuitable for this task. How should I modify my code to get proper outputs? Thank you so much!
