fill-mask target for full words not enabled? #17374

Closed
i-am-neo opened this issue May 20, 2022 · 6 comments

@i-am-neo

System Info

- `transformers` version: 4.19.2
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.6.0
- PyTorch version (GPU?): 1.11.0+cu113 (False)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@Narsil and @LysandreJik (?)
How can one use RoBERTa (for example, roberta-large) with fill-mask to get full-word candidates and their "full" scores? I'm open to workaround solutions.

My example:
sentence = f"Nitzsch argues against the doctrine of the annihilation of the wicked, regards the teaching of Scripture about eternal {nlp.tokenizer.mask_token} as hypothetical."
Notebook here.

Using pipeline, the output I get is:
The specified target token `damnation` does not exist in the model vocabulary. Replacing with `Ġdamn`.

Thanks.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See notebook above.
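
The notebook itself isn't reproduced here, but a minimal sketch along these lines should trigger the same warning (assuming roberta-large as the model and `targets=["damnation"]`, as the question implies):

```python
# Minimal reproduction sketch (assumes roberta-large, as in the question above;
# not the original notebook).
from transformers import pipeline

nlp = pipeline("fill-mask", model="roberta-large")

sentence = (
    "Nitzsch argues against the doctrine of the annihilation of the wicked, "
    f"regards the teaching of Scripture about eternal {nlp.tokenizer.mask_token} "
    "as hypothetical."
)

# "damnation" is not a single token in RoBERTa's vocabulary, so the pipeline
# warns and replaces it with the closest single token, `Ġdamn`.
print(nlp(sentence, targets=["damnation"]))
```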

Expected behavior

I expect to see "damnation" with its score.
@i-am-neo i-am-neo added the bug label May 20, 2022
@Narsil
Contributor

Narsil commented May 23, 2022

hi @i-am-neo ,

Fill-mask works at the token level, not the word level, so you cannot use targets which are multi-token. Since damnation does not seem to exist directly in your vocabulary, the pipeline uses the closest 1-token element it finds, damn. Unfortunately, you cannot have fill-mask work with a varying number of holes/tokens. You could use 2 masks instead of one, for instance, but then you will need logic to "fuse" those two tokens, which might not correspond to a single word.
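
Purely as an illustration of that last point (not an official feature), here is a sketch of the two-mask approach, assuming a transformers version whose fill-mask pipeline accepts multiple mask tokens and returns one candidate list per mask:

```python
# Sketch only: two masks, with the "fusing" of independent candidates left
# to the caller, as described above.
from transformers import pipeline

nlp = pipeline("fill-mask", model="roberta-large")
mask = nlp.tokenizer.mask_token

sentence = (
    "Nitzsch argues against the doctrine of the annihilation of the wicked, "
    f"regards the teaching of Scripture about eternal {mask}{mask} as hypothetical."
)

first, second = nlp(sentence)  # one list of candidate dicts per mask position
for a in first[:3]:
    for b in second[:3]:
        # Multiplying the two scores treats the positions as independent;
        # it is not a joint probability, and the pieces may not form a word.
        print((a["token_str"] + b["token_str"]).strip(), a["score"] * b["score"])
```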

@i-am-neo
Author

Thanks @Narsil. I had thought so. Are there no plans to allow full words and regexes on your roadmap?

@Narsil
Contributor

Narsil commented May 23, 2022

It's not something that fits the current pipeline model (at least in the default settings).

pipeline aims to make ML models usable without any ML-specific knowledge, BUT without ever hiding the complexities this induces.

In this particular case, fill-mask models do work at a token level, and trying to work at a word level really requires some custom strategies (how many tokens is your word? Do you want to handle multiple sizes of tokens? How do you resolve the multi-token case, since multiple tokens will give you independent token probabilities, not grouped probabilities?).

Since it is a non-trivial problem, we decided not to do it on behalf of users and to give an output that is much closer to what the original model does. If simple strategies can be implemented, maybe we can add them as opt-in parameters, but so far nothing is being worked on as far as I know. PRs are more than welcome.

If you want more background, this PR might be valuable to read (and the linked PRs too): #10222

I would also like to point out zero-shot-classification, which, although not the same pipeline, we have seen used in a similar fashion and which might suit your needs.

Side note: an easy starting solution for regexes is to fetch all tokens in the vocabulary that start with your prefix and use them as targets, for instance targets=[word for word in tokenizer.get_vocab() if word.startswith("X")]. That's not all possible English words, but at least all the elements of the vocabulary that will work.
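
A sketch of how that might look for the sentence in the question (the `Ġ` prefix marks word-initial tokens in RoBERTa's byte-level BPE vocabulary; the `dam` prefix is only an example):

```python
# Sketch of the prefix trick: restrict targets to vocabulary entries that
# start with a given prefix. "dam" is an arbitrary example prefix.
from transformers import pipeline

nlp = pipeline("fill-mask", model="roberta-large")

targets = [
    tok for tok in nlp.tokenizer.get_vocab()
    if tok.startswith("Ġdam")  # word-initial tokens such as "Ġdamn"
]

sentence = (
    "Nitzsch argues against the doctrine of the annihilation of the wicked, "
    f"regards the teaching of Scripture about eternal {nlp.tokenizer.mask_token} "
    "as hypothetical."
)

# Candidates are still single tokens, ranked by their individual scores.
print(nlp(sentence, targets=targets))
```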

@i-am-neo
Author

I hear you @Narsil, it sure is non-trivial.

In my case, I would like a large-enough LM (for example, Roberta-large) to generate word candidates to start with, given some regex as hints/constraints, without knowing in advance what the best candidates are, except for those hints. My thinking is that the candidates the LM generates would more or less already fit into the context given to the model. Multiple candidates would be ranked post-fill by their scores.

Re zero-shot-classification, the trouble is that without knowing in advance what the correct/best candidates are, it's more difficult to work it in.

@Narsil
Contributor

Narsil commented May 24, 2022

In my case, I would like a large-enough LM (for example, Roberta-large) to generate word candidates to start with, given some regex as hints/constraints, without knowing in advance what the best candidates are, except for those hints.

I think there would be a lot of value in being able to do that, but AFAIK there's no simple way to do it with BERT-like models. I think the biggest culprit is that models are trained to give independent probabilities, not joint ones. Solving it might require an entirely new training objective.

This house is <mask> and <mask>:

Disjoint probabilities: (big: 50%, red: 50%) (big: 50%, red: 50%)
Joint probabilities: ((big, red): 50%, (red, big): 50%). (But then (big, big) = 0%, for instance, which is allowed with disjoint probabilities.)
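
A toy numeric illustration of that distinction (the numbers are made up, matching the example above):

```python
# Independent (disjoint) per-mask distributions:
mask1 = {"big": 0.5, "red": 0.5}
mask2 = {"big": 0.5, "red": 0.5}

# Independent scoring happily assigns ("big", "big") probability 0.25 ...
print(mask1["big"] * mask2["big"])  # 0.25

# ... whereas a joint distribution over the pair can rule it out entirely.
joint = {("big", "red"): 0.5, ("red", "big"): 0.5}
print(joint.get(("big", "big"), 0.0))  # 0.0
```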

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
