
Filling more than 1 masked token at a time #3609

Closed
p-christ opened this issue Apr 3, 2020 · 10 comments

Comments

@p-christ

p-christ commented Apr 3, 2020

I am able to use Hugging Face's mask-filling pipeline to predict 1 masked token in a sentence using the code below:

!pip install -q transformers
from __future__ import print_function
import ipywidgets as widgets
from transformers import pipeline

nlp_fill = pipeline('fill-mask')
nlp_fill("I am going to guess <mask> in this sentence")

But does anyone have an opinion on the best way to do this if I want to predict 2 masked tokens, e.g. if the sentence is instead "I am going to <mask> <mask> in this sentence"?

If I try to put this exact sentence into nlp_fill I get the error "ValueError: only one element tensors can be converted to Python scalars", so it doesn't work automatically.

Any help would be much appreciated!

Stack overflow question link
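(For reference, a minimal workaround sketch that bypasses the pipeline and predicts each mask independently from a single forward pass of the underlying model. distilroberta-base is an assumed checkpoint, chosen only because it uses the <mask> token; note that the two predictions are not conditioned on each other.)

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

text = "I am going to <mask> <mask> in this sentence"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0]  # (seq_len, vocab_size)

# positions of all mask tokens
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

filled_ids = inputs["input_ids"][0].clone()
for pos in mask_positions:
    filled_ids[pos] = logits[pos].argmax()  # top-1 per position, independently

print(tokenizer.decode(filled_ids, skip_special_tokens=True))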

@julien-c
Member

julien-c commented Apr 3, 2020

Indeed, this is not supported right now. We'd welcome a PR though :)

@jowagner

jowagner commented Mar 31, 2021

Before somebody starts on a PR, we need to consider what exactly this should do.

For top_k = 1, most users probably expect a single forward pass, picking the top prediction for each masked token. For greater top_k, however, picking the k best predictions at each mask position independently carries an increasingly high risk of yielding an inconsistent sequence. A beam search over all possible sequences with some overall objective, returning the overall top_k best sequences, would be more desirable, but also more work to implement.

A naive objective could simply multiply the probabilities of each candidate replacement obtained from a single forward pass. However, these probabilities are not conditioned on the specific choices at the other mask positions. What exactly these probabilities mean when there is more than 1 mask token is not clear to me, but I think a reasonable assumption is that the network produces some kind of weighted average of all the probability distributions one would get by fixing the other mask tokens and making a forward pass with just one mask token.
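(As a concrete illustration of that naive objective, a rough sketch, assuming a masked LM `model` and its `tokenizer` are already loaded; the function and parameter names here are made up. It takes the per-position top candidates from one forward pass and ranks full sequences by the product of their probabilities, which, as noted above, are not conditioned on each other.)

import itertools
import torch

def naive_joint_topk(text, model, tokenizer, per_position_k=5, top_k=5):
    # single forward pass; probabilities at each mask are NOT conditioned
    # on what is chosen at the other masks
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits[0].softmax(dim=-1)

    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()

    # top candidates (token id, probability) for every mask position
    candidates = []
    for pos in mask_positions:
        top = torch.topk(probs[pos], per_position_k)
        candidates.append(list(zip(top.indices.tolist(), top.values.tolist())))

    # score every combination by the product of its per-position probabilities
    scored = []
    for combo in itertools.product(*candidates):
        score = 1.0
        filled = inputs["input_ids"][0].clone()
        for pos, (token_id, p) in zip(mask_positions, combo):
            filled[pos] = token_id
            score *= p
        scored.append((score, tokenizer.decode(filled, skip_special_tokens=True)))

    scored.sort(reverse=True)
    return scored[:top_k]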

Therefore, I think one must make multiple forward passes to get the probability of each decision step in the gap-filling process. It is not clear, though, in what order to make decisions. Even in the simplest case of contiguous mask positions we could proceed left-to-right, right-to-left, from both sides simultaneously, start in the middle, or proceed in some other way. The order could also be influenced by the probabilities, e.g. crystallising the most confidently predicted token first.
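(A sketch of the "most confident position first" ordering, with one forward pass per decision. It again assumes a preloaded masked LM and tokenizer; the confidence measure used here is simply the maximum softmax probability, which is only one possible choice.)

import torch

def fill_most_confident_first(text, model, tokenizer):
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

    while (input_ids[0] == tokenizer.mask_token_id).any():
        with torch.no_grad():
            probs = model(input_ids=input_ids).logits[0].softmax(dim=-1)

        mask_positions = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

        # confidence of a position = probability of its best candidate
        best_probs, best_tokens = probs[mask_positions].max(dim=-1)
        chosen = best_probs.argmax()  # most confident remaining mask

        # commit to that token, then re-run the model so the remaining
        # masks are conditioned on this decision
        input_ids[0, mask_positions[chosen]] = best_tokens[chosen]

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)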

It may also be desirable to have a [MASK*] that is expanded to as many tokens as needed. Then one may want a brevity penalty, or to normalise by length, as otherwise the model will prefer short answers because their probability is higher. One may also want a callback to filter candidate substitutions, e.g. for a cloze test one may want to check that the sequence does not start with '##' and that it detokenises to a single word of the target language.
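(For the callback idea, a small sketch of what such a filter could look like. `accept_candidate` is a hypothetical helper, and the '##' check assumes a WordPiece tokenizer such as BERT's; other tokenizers mark subword pieces differently.)

def accept_candidate(token_id, tokenizer):
    # example filter for a cloze test: reject word-internal pieces and
    # anything that does not detokenise to a single alphabetic word
    token = tokenizer.convert_ids_to_tokens(token_id)
    if token.startswith("##"):  # WordPiece continuation piece
        return False
    word = tokenizer.decode([token_id]).strip()
    return word.isalpha() and len(word.split()) == 1

# candidates at a mask position could then be pruned with:
# top_ids = [i for i in top_ids if accept_candidate(i, tokenizer)]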

@LysandreJik
Member

Please see issue #10158 and PR #10222 for an attempt at this.

@naveenjafer

@jowagner has made some very valid points. In fact, these are the same concerns I have had previously about how multiple mask filling even works when done simultaneously. However, there are issues with all of the approaches, and I am not quite sure yet how they could be resolved.

Take, for example, a case where you have 3 mask positions and we follow the method that gives preference to the most confidently predicted token first. There is an intrinsic issue as to what the most confident token would even mean here, given that the other 2 masks are still empty and unfilled. My point being, the probability of the word to be filled in a particular slot is not necessarily indicative of whether that slot SHOULD be the first one to be filled.

Do have a look at https://arxiv.org/abs/2002.03079 and its work on the Blank Language Model. Most of the valuable suggestions you provide here spill over into that paper's territory.

I would be very happy to discuss this further with you, Joachim.

@mitramir55

Hi, I've implemented right-to-left, left-to-right, and random-order mask filling in PyTorch for the top k ids that the model considers most probable, in one of my projects. In this implementation, each time we fill a mask, the model looks at the previously generated sentences and decides what is most probable for the next masked position. So if we have 2 masks in a sentence and set top_k=5, we'll have 25 sentences (5 tokens for the first position, and for each of those 5 sentences another 5 tokens for the second mask). It outputs something like this (I used Persian models for this; I hope you can see how the masks are being filled):
[screenshot: example output with a Persian model showing the masks being filled step by step]
Then, in the next step, we implemented a beam search to choose the most probable sequence among all these sentences.
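(One possible way to implement that final "choose the most probable sequence" step, sketched here with the masked LM's own pseudo-log-likelihood rather than the corpus-specific scorer used in the project: re-mask each token of a finished candidate in turn and sum the log-probabilities the model assigns to the original tokens. This assumes the tokenizer adds one special token at each end of the sequence.)

import torch

def pseudo_log_likelihood(sentence, model, tokenizer):
    # score a fully filled candidate by masking one token at a time
    # and summing the log-probability the model assigns to it
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(input_ids) - 1):  # skip the special tokens
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return total

# best = max(candidate_sentences, key=lambda s: pseudo_log_likelihood(s, model, tokenizer))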

I'd be glad to help Hugging Face with this issue; I can send my code or open a pull request.

@LysandreJik LysandreJik reopened this May 6, 2021
@jowagner

The idea in 80a1136#r605659735 may point to how one can combine the left-to-right and right-to-left directions, or even average over all possible crystallisation orders.

@mitramir55

mitramir55 commented Jun 3, 2021

Hi, this is the function for the different orders of prediction. I hope it helps.
Also, in the beam search section we constructed a dictionary of bi-, tri-, and four-grams from a corpus specific to our work and scored predictions based on those. I won't include that extensive part here, but tell me if it would be useful.

import random

import torch


def predict_seqs_dict(sequence, model, tokenizer, top_k=5, order='right-to-left'):
    """Fill every mask token in `sequence` one position at a time, keeping
    the top_k candidates at each step (top_k ** n sequences for n masks)."""

    ids_main = tokenizer.encode(sequence,
                                return_tensors="pt",
                                add_special_tokens=False)

    ids_ = ids_main.detach().clone()
    position = torch.where(ids_main == tokenizer.mask_token_id)

    # torch.where returns the mask positions in left-to-right order
    positions_list = position[1].numpy().tolist()

    if order == 'right-to-left':
        positions_list.reverse()
    elif order == 'random':
        random.shuffle(positions_list)

    predictions_ids = {}
    predictions_detokenized_sents = {}

    for i in range(len(positions_list)):
        predictions_ids[i] = []
        predictions_detokenized_sents[i] = []

        # first mask position: predict directly from the original input
        if i == 0:
            model_logits = model(ids_main)['logits'][0][positions_list[0]]
            top_k_tokens = torch.topk(model_logits, top_k, dim=0).indices.tolist()

            for j in range(len(top_k_tokens)):
                ids_t_ = ids_.detach().clone()
                ids_t_[0][positions_list[0]] = top_k_tokens[j]
                predictions_ids[i].append(ids_t_)

                # keep both the token ids and the detokenized sentence
                pred = tokenizer.decode(ids_t_[0])
                predictions_detokenized_sents[i].append(pred)

        # later mask positions: continue from every partially filled sequence
        else:
            for pred_ids in predictions_ids[i - 1]:

                # get the logits for the current mask position
                model_logits = model(pred_ids)['logits'][0][positions_list[i]]
                # keep the top_k candidates for this position
                top_k_tokens = torch.topk(model_logits, top_k, dim=0).indices.tolist()

                for top_id in top_k_tokens:
                    ids_t_i = pred_ids.detach().clone()
                    ids_t_i[0][positions_list[i]] = top_id

                    pred = tokenizer.decode(ids_t_i[0])

                    # append the ids and sentence for this candidate
                    predictions_ids[i].append(ids_t_i)
                    predictions_detokenized_sents[i].append(pred)

    return predictions_detokenized_sents
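(A possible usage sketch for the function above; the checkpoint name is illustrative, and any masked LM with its matching tokenizer should work. Note that the function encodes the input without special tokens.)

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

sentence = f"I am going to {tokenizer.mask_token} {tokenizer.mask_token} in this sentence"
results = predict_seqs_dict(sentence, model, tokenizer, top_k=5, order='left-to-right')

# results[i] holds the candidate sentences after the (i+1)-th mask has been
# filled; the last entry contains the fully filled sequences (top_k ** n of them)
for sent in results[len(results) - 1]:
    print(sent)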
   

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jul 7, 2021
@jowagner

jowagner commented Jul 7, 2021

While an external scoring model may produce higher-quality results, such an approach would move quite far away from letting the BERT model make the predictions. For example, consider a user who is evaluating the quality of a BERT model using a cloze test. They don't want issues of the BERT model to be smoothed over / repaired by the external scoring model.

For finding the most confidently predicted token, I don't see why the fact that 3 or more masks may include a mask that has only masked neighbours is a problem. What we need is a measure of confidence that can be derived from the class probability distribution of the MLM head (its softmax layer). BERT gives us a class probability distribution for each masked token. The most confident token is then simply the one for which the confidence measure gives the greatest value.
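(To make that concrete, a small sketch of candidate confidence measures computed from the softmax distribution at one masked position. The names are illustrative, and which measure is most appropriate is exactly the open question.)

import torch

def confidence_measures(logits_at_mask):
    # logits_at_mask: 1-D tensor of vocabulary logits for one masked position
    probs = logits_at_mask.softmax(dim=-1)
    top2 = torch.topk(probs, 2).values
    return {
        "max_prob": top2[0].item(),            # probability of the best candidate
        "margin": (top2[0] - top2[1]).item(),  # gap to the runner-up
        "neg_entropy": (probs * probs.clamp_min(1e-12).log()).sum().item(),  # higher = more peaked
    }

# the "most confident" mask would then be the position whose distribution
# maximises the chosen measure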

I haven't yet found time to read https://arxiv.org/abs/2002.03079

@naveenjafer

@jowagner Just to reconfirm: your proposition was to fill the slots not in an arbitrary left-to-right or right-to-left fashion, but to fill the one that has the highest value in the softmax layer first, and then use that result while regenerating the clozes for the rest of the masks, correct?

The high confidence for a position could be by virtue of there not being any better-suited candidates for that position, rather than an indicator that the model is genuinely most confident about that prediction (and that we should therefore fill it first and use it as the seed to fill the rest in a similar fashion). Right?
