Filling more than 1 masked token at a time #3609
Comments
Indeed, this is not supported right now. We'd welcome a PR though :)
Before somebody starts on a PR, we need to consider what exactly this should do.

A naive objective could simply multiply the probabilities of each candidate replacement obtained from a single forward pass. However, these probabilities are not conditional on the specific choice for the other mask positions. What exactly these probabilities are when there is more than 1 mask token is not clear to me, but I think a reasonable assumption is that the network produces some kind of weighted average of all the probability distributions one would get if one fixes the other mask tokens and makes a forward pass with just one mask token. Therefore, I think one must make multiple forward passes to get the probability of each decision step in the gap-filling process.

It is not clear, though, in what order to make decisions. Even in the simplest case of contiguous mask positions we could proceed left-to-right, right-to-left, from both sides simultaneously, start in the middle, or proceed in some other way. The order could also be influenced by the probabilities, e.g. fixing the most confidently predicted token first.

It may also be desirable to have a [MASK*] that is expanded to multiple tokens as needed. Then, one may want a brevity penalty or to normalise by length, as otherwise the model will prefer short answers because their probability is higher. One may also want a callback to filter candidate substitutions, e.g. for a cloze test one may want to check that the sequence does not start with '##' and that it detokenises to a single word of the target language.
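To make the naive single-pass objective above concrete, here is a minimal sketch (my own, assuming `bert-base-uncased` and the standard `transformers` API; the candidate tokens are arbitrary examples). It multiplies per-position probabilities from one forward pass, which is exactly where the conditioning problem arises:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "I am going to [MASK] [MASK] in this sentence"
inputs = tokenizer(text, return_tensors="pt")
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    probs = model(**inputs).logits[0].softmax(dim=-1)  # (seq_len, vocab_size)

def naive_score(candidate_tokens):
    """Product of per-position probabilities from a single forward pass.
    Note: these probabilities are NOT conditioned on the other choices."""
    score = 1.0
    for pos, tok in zip(mask_positions.tolist(), candidate_tokens):
        tok_id = tokenizer.convert_tokens_to_ids(tok)
        score *= probs[pos, tok_id].item()
    return score

print(naive_score(["fill", "masks"]))
```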
@jowagner has made some very valid points. In fact, these are the same concerns I have had previously about how multiple mask filling even works when done simultaneously. However, there are issues with all of the approaches and I am not quite sure yet how they could be resolved. Take, for example, the case where you have 3 mask positions and we follow the method that first fills the most confidently predicted token. There is an intrinsic issue as to what the most confident token would even mean here, given that the other 2 masks are still empty and not filled. My point being: the probability of a word in a particular slot is not necessarily indicative of whether that slot SHOULD be the first one to be filled. Do have a look at https://arxiv.org/abs/2002.03079 and its work on the Blank Language Model; most of the valuable suggestions you provide here start spilling into that paper's realm. I would be very happy to discuss this further with you, Joachim.
The idea in 80a1136#r605659735 may point to how one can combine the left-to-right and right-to-left directions, or even average over all possible sequences of crystallisation.
Hi, this is the function for different orders of prediction. I hope it helps.
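The snippet referred to here is not reproduced above, so the following is only a rough sketch of such a function, assuming `bert-base-uncased` and my own names `fill_masks` / `order` rather than the original code. It fills one mask per forward pass, in a configurable order:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def fill_masks(text, order="left_to_right"):
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    while (input_ids == tokenizer.mask_token_id).any():
        with torch.no_grad():
            probs = model(input_ids=input_ids).logits[0].softmax(dim=-1)
        mask_positions = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        if order == "left_to_right":
            pos = mask_positions[0]
        elif order == "right_to_left":
            pos = mask_positions[-1]
        elif order == "most_confident_first":
            # pick the mask whose best candidate has the highest probability
            confidences = probs[mask_positions].max(dim=-1).values
            pos = mask_positions[confidences.argmax()]
        else:
            raise ValueError(order)
        input_ids[0, pos] = probs[pos].argmax()  # commit one decision, then re-run
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(fill_masks("I am going to [MASK] [MASK] in this sentence", order="most_confident_first"))
```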
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
While an external scoring model may produce higher-quality results, such an approach would move quite far away from letting the BERT model make the predictions. For example, consider a user who is evaluating the quality of a BERT model using a cloze test. They don't want issues of the BERT model to be smoothed over or repaired by the external scoring model.

For finding the most confidently predicted token, I don't see why the fact that 3 or more masks may include a mask that has only masked neighbours is a problem. What we need is a measure of confidence that can be derived from the class probability distribution of the MLM head (its softmax layer). BERT gives us a class probability distribution for each masked token. The most confident token is then simply the one for which the confidence measure gives the greatest value.

I haven't yet found time to read https://arxiv.org/abs/2002.03079.
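For concreteness, a few confidence measures one could derive from the MLM head's softmax distribution at a masked position (a sketch only; the thread does not prescribe which measure to use):

```python
import torch

def confidence(probs, measure="max_prob"):
    """probs: 1-D tensor, the MLM softmax distribution at one masked position."""
    if measure == "max_prob":
        return probs.max().item()                  # probability of the top candidate
    if measure == "margin":
        top2 = probs.topk(2).values                # gap between best and runner-up
        return (top2[0] - top2[1]).item()
    if measure == "neg_entropy":
        return -(probs * probs.clamp_min(1e-12).log()).sum().item()  # peaked = confident
    raise ValueError(measure)
```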
@jowagner Just to reconfirm: your proposition is to fill the slots not in an arbitrary left-to-right or right-to-left fashion, but to first fill the one that has the highest value in the softmax layer and then use that while regenerating the clozes for the rest of the masks, correct? But couldn't the high confidence at a position simply be due to there being no other suitable candidates for that position, rather than an indicator that the model's prediction there is the one we should fill first and use as the seed for filling the rest in a similar fashion?
I am able to use Hugging Face's mask-filling pipeline to predict 1 masked token in a sentence using the below:
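(The snippet is not shown above; a minimal version, assuming the default `fill-mask` pipeline model, which uses the `<mask>` token, would look roughly like this:)

```python
from transformers import pipeline

nlp_fill = pipeline("fill-mask")  # default model uses the <mask> token
print(nlp_fill("I am going to <mask> in this sentence"))
```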
But does anyone have an opinion on the best way to do this if I want to predict 2 masked tokens? E.g. if the sentence is instead
"I am going to <mask> <mask> in this sentence"
If I try to put this exact sentence into nlp_fill I get the error "ValueError: only one element tensors can be converted to Python scalars", so it doesn't work automatically.
Any help would be much appreciated!
Stack overflow question link
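A possible workaround in the spirit of the discussion above (a sketch assuming `distilroberta-base`, chosen only because its tokenizer uses `<mask>`): run one forward pass yourself and read the top candidates at each masked position, keeping in mind that the two predictions are independent and not conditioned on each other:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

text = "I am going to <mask> <mask> in this sentence"
inputs = tokenizer(text, return_tensors="pt")
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    probs = model(**inputs).logits[0].softmax(dim=-1)

# Top 5 candidates per masked position, predicted independently from one forward pass.
for pos in mask_positions.tolist():
    top = probs[pos].topk(5)
    print(pos, tokenizer.convert_ids_to_tokens(top.indices.tolist()), top.values.tolist())
```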