
Multiple Mask support in Pipeline #10158

Open
naveenjafer opened this issue Feb 12, 2021 · 2 comments
Labels: Feature request


@naveenjafer

🚀 Feature request

The fill-mask feature of the pipeline currently supports only a single mask per input. It could be expanded to predict and return results for multiple masks in the same sentence as well.
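
For reference, the existing single-mask usage looks roughly like this (model name and example sentence chosen purely for illustration):

from transformers import pipeline

# The current pipeline expects exactly one mask token per input.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The capital of France is [MASK].")

for prediction in predictions:
    # Each candidate is a dict with sequence, score, token and token_str.
    print(prediction["sequence"], prediction["score"], prediction["token_str"])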

Motivation

There are use cases where one would ideally need model predictions for more than a single mask, for example smarter template filling in outputs returned to users. It could also be used to better study the implicit knowledge that BERT models accumulate during pre-training.

Your contribution

I should be able to raise a PR for this. The output JSON schema would have to be modified slightly, but I can go ahead and complete it unless there is an obvious reason, which has slipped my mind, why only a single [MASK] token should be supported.

@LysandreJik added the Feature request label on Feb 13, 2021
@naveenjafer (Author)

@LysandreJik
The current implementation for a single mask returns the data as a list of entries of the form:

{
   "sequence": "the final sequence with the mask filled in",
   "score": "the softmax score",
   "token": "the token ID used in filling the MASK",
   "token_str": "the token string used in filling the MASK"
}

When returning results for sentences with multiple masks, it is not possible to maintain the same JSON return format. I propose adding a separate pipeline call for this, 'fill-mask-multiple' or something along those lines. The return format I have proceeded with is:

{
   "sequence": "the final sequence with all the masks filled by the model",
   "scores": ["the softmax score of mask 1", "the softmax score of mask 2", ...],
   "tokens": ["the token ID used in filling mask 1", "the token ID used in filling mask 2", ...],
   "token_strs": ["the token string used in filling mask 1", "the token string used in filling mask 2", ...]
}
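
To make the proposal concrete, here is a minimal sketch of how such an output could be assembled directly with a masked LM (this is only an illustration, not the PR code; bert-base-uncased and independent per-position argmax are assumptions):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The [MASK] sat on the [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate every [MASK] position instead of assuming a single one.
mask_positions = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]

result = {"scores": [], "tokens": [], "token_strs": []}
filled_ids = inputs["input_ids"][0].clone()
for pos in mask_positions:
    probs = logits[0, pos].softmax(dim=-1)
    score, token_id = probs.max(dim=-1)
    token_id = int(token_id)
    result["scores"].append(float(score))
    result["tokens"].append(token_id)
    result["token_strs"].append(tokenizer.decode([token_id]))
    filled_ids[pos] = token_id

# One decoded sequence with all masks filled, matching the proposed schema.
result["sequence"] = tokenizer.decode(filled_ids, skip_special_tokens=True)
print(result)

Each mask is predicted independently here (simple argmax per position); whether the pipeline should do something smarter is part of what the PR could discuss.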

Some minor changes will also be made to the input param "targets" so that optional targets can be supplied for each mask.

If having two separate pipelines does not seem like a good idea, we could instead combine both into a single pipeline call, irrespective of whether the input has a single mask or multiple masks. The return JSON type would change in that case, and I am not sure about the impact or how feasible it would be to introduce that in a minor version update.

I would really benefit from some expert advice, since I am somewhat new here.

PS: I have currently implemented the functionality for the PyTorch framework and am working on getting the same done in TensorFlow too.

@LysandreJik (Member)

This change seems okay to me. Since you already have some functionality for PyTorch, do you mind opening a PR (even a draft PR) so that we can play around with it and discuss potential improvements? Thanks! Pinging @Narsil too.
