The change from single mask to multi-mask support for PyTorch #10222
@@ -75,15 +75,9 @@ def __init__(
         self.check_model_type(TF_MODEL_WITH_LM_HEAD_MAPPING if self.framework == "tf" else MODEL_FOR_MASKED_LM_MAPPING)
         self.top_k = top_k

-    def ensure_exactly_one_mask_token(self, masked_index: np.ndarray):
-        numel = np.prod(masked_index.shape)
-        if numel > 1:
-            raise PipelineException(
-                "fill-mask",
-                self.model.base_model_prefix,
-                f"More than one mask_token ({self.tokenizer.mask_token}) is not supported",
-            )
-        elif numel < 1:
+    def ensure_atleast_one_mask_token(self, masked_indices: np.ndarray):
+        numel = np.prod(masked_indices.shape)
+        if numel < 1:
             raise PipelineException(
                 "fill-mask",
                 self.model.base_model_prefix,
@@ -141,12 +135,12 @@ def __call__(self, *args, targets=None, top_k: Optional[int] = None, **kwargs):
         result = []

         if self.framework == "tf":
-            masked_index = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()
+            masked_indices = tf.where(input_ids == self.tokenizer.mask_token_id).numpy()

             # Fill mask pipeline supports only one ${mask_token} per sample
-            self.ensure_exactly_one_mask_token(masked_index)
+            self.ensure_atleast_one_mask_token(masked_indices)

-            logits = outputs[i, masked_index.item(), :]
+            logits = outputs[i, masked_indices.item(), :]
             probs = tf.nn.softmax(logits)
             if targets is None:
                 topk = tf.math.top_k(probs, k=top_k if top_k is not None else self.top_k)
@@ -157,32 +151,40 @@ def __call__(self, *args, targets=None, top_k: Optional[int] = None, **kwargs):
                 values = tf.gather_nd(values, tf.reshape(sort_inds, (-1, 1))).numpy()
                 predictions = target_inds[sort_inds.numpy()]
         else:
-            masked_index = torch.nonzero(input_ids == self.tokenizer.mask_token_id, as_tuple=False)
+            masked_indices = torch.nonzero(input_ids == self.tokenizer.mask_token_id, as_tuple=False)
+            # Fill mask pipeline supports at least one ${mask_token} per sample
+            self.ensure_atleast_one_mask_token(masked_indices.numpy())

-            # Fill mask pipeline supports only one ${mask_token} per sample
-            self.ensure_exactly_one_mask_token(masked_index.numpy())
+            logits_multiple = [outputs[i, index.item(), :] for index in masked_indices]

+            probs_multiple = [logits.softmax(dim=0) for logits in logits_multiple]

-            logits = outputs[i, masked_index.item(), :]
-            probs = logits.softmax(dim=0)
             if targets is None:
-                values, predictions = probs.topk(top_k if top_k is not None else self.top_k)
+                values_all = []
+                predictions_all = []
+                for probs in probs_multiple:
+                    values, predictions = probs.topk(top_k if top_k is not None else self.top_k)
+                    values_all.append(values)
+                    predictions_all.append(predictions)
             else:
+                # pending for when the target tokens are specifically input to the model.
                 values = probs[..., target_inds]
                 sort_inds = list(reversed(values.argsort(dim=-1)))
                 values = values[..., sort_inds]
                 predictions = target_inds[sort_inds]

-            for v, p in zip(values.tolist(), predictions.tolist()):
+            for i, item in enumerate(values_all[0]):
Comment: Try to avoid all indices if possible in your loops. It's usually a code smell (not necessarily, but it's most likely avoidable here).

Comment: I see, yeah, I think I can get rid of this by zipping.
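Read in the context of the loop above, a minimal sketch of what such a zipped version might look like (names such as values_all, predictions_all, masked_indices and input_ids are taken from the diff; the restructuring itself is hypothetical, not the code in this PR):

```python
# Hypothetical sketch (not the PR's code): transpose the per-mask top-k lists so the
# loop walks over candidate rank k directly, without indexing values_all[0] / [i_inner].
scores_per_rank = zip(*[v.tolist() for v in values_all])        # k-th score for every mask
tokens_per_rank = zip(*[p.tolist() for p in predictions_all])   # k-th token id for every mask

for scores_k, token_ids_k in zip(scores_per_rank, tokens_per_rank):
    tokens = input_ids.numpy().copy()
    # write the k-th candidate token into every mask position
    for (masked_index, *_), token_id in zip(masked_indices.tolist(), token_ids_k):
        tokens[masked_index] = token_id
    tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]
    result.append(
        {
            "sequence": self.tokenizer.decode(tokens, skip_special_tokens=True),
            "scores": list(scores_k),
            "tokens": list(token_ids_k),
            "tokens_strs": [self.tokenizer.decode(t) for t in token_ids_k],
        }
    )
```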
                 tokens = input_ids.numpy()
-                tokens[masked_index] = p
                 # Filter padding out:
+                for i_inner, item_inner in enumerate(masked_indices.tolist()):
+                    masked_index = item_inner[0]
+                    tokens[masked_index] = predictions_all[i_inner].tolist()[i]
                 tokens = tokens[np.where(tokens != self.tokenizer.pad_token_id)]
                 result.append(
                     {
                         "sequence": self.tokenizer.decode(tokens, skip_special_tokens=True),
-                        "score": v,
-                        "token": p,
-                        "token_str": self.tokenizer.decode(p),
+                        "scores": [v.tolist()[i] for v in values_all],
Comment: Try to keep this backward compatible if possible. You can simply change the return types based on the number of masks.

Comment: Hmm, but then the issue would be that if, say, I were to iteratively call the pipeline on a corpus with a varying number of masks, the return output types would be different for a single mask vs. multiple masks. I wonder if that will result in a suboptimal API experience for someone looking to try it out.

Comment: Maybe a core maintainer can pitch in on this, @LysandreJik, to confirm, but if your change is not backward compatible, it will need a major release to be included. So if you want to maximize your chances of getting it merged, I think making it backward compatible is important. It's definitely something that can be taken care of later in the PR. We might even directly take care of it, I'm not sure. Just letting you know ;)

Comment: Or the other option would be to keep them as 2 separate pipeline calls, fill-mask and fill-mask-multiple. But that's not as elegant as I would have liked to handle it.

Comment: @LysandreJik What would you suggest we do here?

Comment: Agreed with @Narsil that we should handle backwards compatibility as much as possible here. Having a different return according to the number of masks isn't super clean, but it's definitely way better than breaking existing workflows. Pipelines are, by design, already a bit "magical", in the sense that they return different outputs according to your inputs. If you pass the pipeline a string or a list of strings, the resulting type will be different:

>>> from transformers import pipeline
>>> mask_filler = pipeline("fill-mask")
>>> mask_filler("Where do <mask> live?")
[{'sequence': '<s>Where do you live?</s>', 'score': 0.575425386428833, 'token': 47, 'token_str': 'Ġyou'}, {'sequence': '<s>Where do YOU live?</s>', 'score': 0.1382409781217575, 'token': 10540, 'token_str': 'ĠYOU'}, {'sequence': '<s>Where do they live?</s>', 'score': 0.044609859585762024, 'token': 51, 'token_str': 'Ġthey'}, {'sequence': '<s>Where do we live?</s>', 'score': 0.0327814482152462, 'token': 52, 'token_str': 'Ġwe'}, {'sequence': '<s>Where do millennials live?</s>', 'score': 0.02294538915157318, 'token': 15100, 'token_str': 'Ġmillennials'}]
>>> mask_filler(["Where do <mask> live?", "Where <mask> I live?"])
[[{'sequence': '<s>Where do you live?</s>', 'score': 0.5754269361495972, 'token': 47, 'token_str': 'Ġyou'}, {'sequence': '<s>Where do YOU live?</s>', 'score': 0.1382400244474411, 'token': 10540, 'token_str': 'ĠYOU'}, {'sequence': '<s>Where do they live?</s>', 'score': 0.044610150158405304, 'token': 51, 'token_str': 'Ġthey'}, {'sequence': '<s>Where do we live?</s>', 'score': 0.032781537622213364, 'token': 52, 'token_str': 'Ġwe'}, {'sequence': '<s>Where do millennials live?</s>', 'score': 0.022945256903767586, 'token': 15100, 'token_str': 'Ġmillennials'}], [{'sequence': '<s>Where do I live?</s>', 'score': 0.6903642416000366, 'token': 109, 'token_str': 'Ġdo'}, {'sequence': '<s>Where should I live?</s>', 'score': 0.15854182839393616, 'token': 197, 'token_str': 'Ġshould'}, {'sequence': '<s>Where did I live?</s>', 'score': 0.04364638775587082, 'token': 222, 'token_str': 'Ġdid'}, {'sequence': '<s>Where am I live?</s>', 'score': 0.030600957572460175, 'token': 524, 'token_str': 'Ġam'}, {'sequence': '<s>Where shall I live?</s>', 'score': 0.029068272560834885, 'token': 5658, 'token_str': 'Ġshall'}]]

Returning an array of values when there are several masks in the pipeline wouldn't be opposed to that design.

Comment: @LysandreJik Okay, in that case, the return type for a single mask will stay the same, and for the multiple-mask case it would change (notice that the key and type for token and token_str change). Is this agreeable?

Comment: Other functions in transformers have parameters controlling what goes into the return dictionary, e.g.
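To make that proposal concrete, here is a hypothetical sketch of the two return shapes under discussion, using only the keys visible in the diff ("score"/"token"/"token_str" for a single mask vs. "scores"/"tokens"/"tokens_strs" for several masks); the values are made up for illustration:

```python
# Hypothetical illustration of the proposed return shapes (all values are made up).

# Single mask: unchanged, backward compatible.
single_mask_result = [
    {"sequence": "Where do you live?", "score": 0.57, "token": 47, "token_str": "you"},
    {"sequence": "Where do they live?", "score": 0.04, "token": 51, "token_str": "they"},
]

# Multiple masks: one entry per candidate rank, with per-mask lists
# (note the pluralised keys and list-valued fields).
multi_mask_result = [
    {
        "sequence": "This apple is to pie what rice is to sushi.",
        "scores": [0.21, 0.18],
        "tokens": [8947, 11345],
        "tokens_strs": ["apple", "pie"],
    },
]
```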
+                        "tokens": [p.tolist()[i] for p in predictions_all],
+                        "tokens_strs": [self.tokenizer.decode(p.tolist()[i]) for p in predictions_all],
                     }
                 )

@@ -192,3 +194,6 @@ def __call__(self, *args, targets=None, top_k: Optional[int] = None, **kwargs):
         if len(results) == 1:
             return results[0]
         return results
+
+
+# values = [loc1, loc2]
Comment: This is debatable. Are the propositions single tokens for each mask token, or are they tuples of answers? Consider the following:

This <mask> is to <mask> what rice is to sushi.

Here are the top-3 propositions for the 2 masks:

[apple, rhubarb, Paris]
[pie, France, biking]

With your code, you are going to propose, IIUC:

(apple, pie)
(rhubarb, France)
(Paris, biking)

It's possible (not necessarily though) that the propositions you want to make are more like:

(apple, pie)
(rhubarb, pie)
(Paris, France)

My suggestion at this point is to look at joint probabilities for the tuple suggestions instead of token per token.
Comment: @Narsil This is correct. I have been a little worried about how BERT's masking behaves with multiple masks and how one obtains the joint probability instead of a single token-specific probability. Since it is simultaneously making the prediction for all the masks, it tends to make more mistakes, both grammatically and knowledge-wise. I would be grateful if you could help me understand how one retrieves a joint probability in this case.

This issue gets worse when the masks are situated closer to each other, with BERT typically predicting the same word for both mask slots.
Comment: I don't think you can do correct joint probabilities. The output at every locus is, by design, the marginal, i.e. the sum over all joint probabilities.

What I meant is that BERT cannot output {token1: "either", token2: "or", score: 50%}, {token1: "or", token2: "either", score: 50%}. It has to output {token1: ["either", 50%], ["or", 50%]}, {token2: ["either", 50%], ["or", 50%]}. So you have no way of recovering the first proposed solution, and your best guess can only be (either, either, 25%), (either, or, 25%), (or, either, 25%), (or, or, 25%).

What I was suggesting, as a better guess, was simply treating them as if they were independent.
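As a rough illustration of that independence approximation (a sketch, not code from this PR; the candidate lists and probabilities are made up): score each candidate tuple as the product of its per-position probabilities and rank the tuples.

```python
import itertools

# Hypothetical illustration: rank candidate tuples under the independence approximation.
# top_k_per_mask maps each mask position to its (token, probability) candidates,
# e.g. taken from the per-mask softmax outputs discussed above (values are made up).
top_k_per_mask = [
    [("apple", 0.30), ("rhubarb", 0.20), ("Paris", 0.10)],   # candidates for mask 1
    [("pie", 0.25), ("France", 0.15), ("biking", 0.05)],     # candidates for mask 2
]

# Joint score of a tuple = product of the per-position probabilities.
candidates = []
for combo in itertools.product(*top_k_per_mask):
    tokens = tuple(token for token, _ in combo)
    score = 1.0
    for _, prob in combo:
        score *= prob
    candidates.append((tokens, score))

candidates.sort(key=lambda item: item[1], reverse=True)
print(candidates[:3])
# e.g. [(('apple', 'pie'), 0.075), (('rhubarb', 'pie'), 0.05), (('apple', 'France'), 0.045)]
```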
Comment: This makes sense to me, awesome, I will get on this.
Comment: This will rank ('apple', 'France') and ('Paris', 'pie') higher than ('Paris', 'France'). We need some measure of how happy the transformer is with each candidate sequence. I think we need additional forward passes to measure the effect of each combination. If there is some way of measuring the model's happiness with a candidate sequence, one pass per candidate sequence will suffice. If not, I'd suggest to run … and then multiply the probabilities. We will need some kind of beam search to limit the combinations tested, as the number of forward passes needed will otherwise explode for more masked tokens or a large top_k.

Edit: Actually, depending on the setting, this may run with fewer passes than trying all combinations, e.g. here 6 vs. 9.
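A rough sketch of that iterative idea (an illustration under stated assumptions, not the approach this PR implements): fill the masks one at a time, re-run the model after each substitution, keep only the best partial candidates as a small beam, and score a candidate as the product of the probabilities collected along the way. The model name, beam width and top_k below are arbitrary assumptions.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical sketch: beam-searched, one-mask-at-a-time filling.
model_name = "bert-base-uncased"  # assumption, any masked LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

text = "This [MASK] is to [MASK] what rice is to sushi."
beam_width = 3
top_k = 3

# Each beam entry is (input_ids, cumulative_probability).
input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
beams = [(input_ids, 1.0)]

num_masks = int((input_ids == tokenizer.mask_token_id).sum())
for _ in range(num_masks):
    new_beams = []
    for ids, cum_prob in beams:
        # Fill the first remaining mask in this candidate with its top-k tokens.
        mask_pos = int((ids == tokenizer.mask_token_id).nonzero()[0])
        with torch.no_grad():
            logits = model(ids.unsqueeze(0)).logits[0, mask_pos]
        probs = logits.softmax(dim=-1)
        top_probs, top_tokens = probs.topk(top_k)
        for prob, token in zip(top_probs.tolist(), top_tokens.tolist()):
            filled = ids.clone()
            filled[mask_pos] = token
            new_beams.append((filled, cum_prob * prob))
    # Keep only the best candidates to limit the number of forward passes.
    beams = sorted(new_beams, key=lambda b: b[1], reverse=True)[:beam_width]

for ids, score in beams:
    print(round(score, 4), tokenizer.decode(ids, skip_special_tokens=True))
```

Depending on the beam width and top_k, this can use fewer forward passes than scoring every full combination outright, which seems to be what the edit above is getting at.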