fill-mask target for full words not enabled? #17374

Closed
i-am-neo opened this issue May 20, 2022 · 6 comments

@i-am-neo

System Info

- `transformers` version: 4.19.2
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.6.0
- PyTorch version (GPU?): 1.11.0+cu113 (False)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@Narsil and @LysandreJik (?)
How can one use RoBERTa (for example, roberta-large) with fill-mask to get full-word candidates and their "full" scores? I'm open to workaround solutions.

My example:
sentence = f"Nitzsch argues against the doctrine of the annihilation of the wicked, regards the teaching of Scripture about eternal {nlp.tokenizer.mask_token} as hypothetical."
Notebook here.

Using pipeline, the output I get is:
The specified target token `damnation` does not exist in the model vocabulary. Replacing with `Ġdamn`.

Thanks.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See notebook above.
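
The notebook itself isn't reproduced here, but a minimal sketch along these lines should trigger the same warning (assuming roberta-large as the model and `targets=["damnation"]`, as the question implies):

```python
# Minimal reproduction sketch (assumes roberta-large, as in the question above;
# not the original notebook).
from transformers import pipeline

nlp = pipeline("fill-mask", model="roberta-large")

sentence = (
    "Nitzsch argues against the doctrine of the annihilation of the wicked, "
    f"regards the teaching of Scripture about eternal {nlp.tokenizer.mask_token} "
    "as hypothetical."
)

# "damnation" is not a single token in RoBERTa's vocabulary, so the pipeline
# warns and replaces it with the closest single token, `Ġdamn`.
print(nlp(sentence, targets=["damnation"]))
```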

Expected behavior

I expect to see "damnation" with its score.
@i-am-neo i-am-neo added the bug label May 20, 2022
@Narsil
Contributor

Narsil commented May 23, 2022

hi @i-am-neo ,

Fill-mask works at the token level, not the word level, so you cannot use targets which are multi-token. Since damnation does not seem to exist directly in your vocabulary, the pipeline uses the closest 1-token element it finds, damn. Unfortunately, you cannot have fill-mask work with a varying number of holes/tokens. You could use 2 masks instead of one, for instance, but then you will need logic to "fuse" those two tokens, which might not correspond to a single word.
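
Purely as an illustration of that last point (not an official feature), here is a sketch of the two-mask approach, assuming a transformers version whose fill-mask pipeline accepts multiple mask tokens and returns one candidate list per mask:

```python
# Sketch only: two masks, with the "fusing" of independent candidates left
# to the caller, as described above.
from transformers import pipeline

nlp = pipeline("fill-mask", model="roberta-large")
mask = nlp.tokenizer.mask_token

sentence = (
    "Nitzsch argues against the doctrine of the annihilation of the wicked, "
    f"regards the teaching of Scripture about eternal {mask}{mask} as hypothetical."
)

first, second = nlp(sentence)  # one list of candidate dicts per mask position
for a in first[:3]:
    for b in second[:3]:
        # Multiplying the two scores treats the positions as independent;
        # it is not a joint probability, and the pieces may not form a word.
        print((a["token_str"] + b["token_str"]).strip(), a["score"] * b["score"])
```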

@i-am-neo
Author

Thanks @Narsil. I had thought so. Are there no plans to allow full words and regexes on your roadmap?

@Narsil
Contributor

Narsil commented May 23, 2022

It's not something that fits the current pipeline model (at least in the default settings).

pipeline aims to make ML models usable without any ML-specific knowledge, BUT without ever hiding the complexities this induces.

In this particular case, fill-mask models do work at a token level, and trying to work at a word level really requires some custom strategies (how many tokens is your word? Do you want to handle multiple sizes of tokens? How do you resolve the multi-token case, since multiple tokens will give you independent token probabilities, not grouped probabilities?).

Since it is a non-trivial problem, we decided not to do it on behalf of users and to give an output that is much closer to what the original model does. If simple strategies can be implemented, maybe we can add them as opt-in parameters, but so far nothing is being worked on as far as I know. PRs are more than welcome.

If you want more background, this PR might be valuable to read (and the linked PRs too): #10222

I would also like to point out zero-shot-classification, which, although not the same pipeline, we have seen used in a similar fashion and which might suit your needs.

Side note: an easy starting solution for regexes is to fetch all tokens in the vocabulary that start with your prefix and use them as targets, for instance targets=[word for word in tokenizer.get_vocab() if word.startswith("X")]. That's not all possible English words, but at least all the elements of the vocabulary that will work.
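
A sketch of how that might look for the sentence in the question (the `Ġ` prefix marks word-initial tokens in RoBERTa's byte-level BPE vocabulary; the `dam` prefix is only an example):

```python
# Sketch of the prefix trick: restrict targets to vocabulary entries that
# start with a given prefix. "dam" is an arbitrary example prefix.
from transformers import pipeline

nlp = pipeline("fill-mask", model="roberta-large")

targets = [
    tok for tok in nlp.tokenizer.get_vocab()
    if tok.startswith("Ġdam")  # word-initial tokens such as "Ġdamn"
]

sentence = (
    "Nitzsch argues against the doctrine of the annihilation of the wicked, "
    f"regards the teaching of Scripture about eternal {nlp.tokenizer.mask_token} "
    "as hypothetical."
)

# Candidates are still single tokens, ranked by their individual scores.
print(nlp(sentence, targets=targets))
```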

@i-am-neo
Author

I hear you @Narsil, it sure is non-trivial.

In my case, I would like a large-enough LM (for example, Roberta-large) to generate word candidates to start with, given some regex as hints/constraints, without knowing in advance what the best candidates are, except for those hints. My thinking is that the candidates the LM generates would more or less already fit into the context given to the model. Multiple candidates would be ranked post-fill by their scores.

Re zero-shot-classification, the trouble is that without knowing in advance what the correct/best candidates are, it's more difficult to work it in.

@Narsil
Contributor

Narsil commented May 24, 2022

In my case, I would like a large-enough LM (for example, Roberta-large) to generate word candidates to start with, given some regex as hints/constraints, without knowing in advance what the best candidates are, except for those hints.

I think there would be a lot of value in being able to do that, but AFAIK there's no simple way to do it with BERT-like models. I think the biggest culprit is that models are trained to give independent probabilities, not joint ones. Solving it might require an entirely new training objective.

This house is <mask> and <mask>:

Disjoint probabilities: (big: 50%, red: 50%) (big: 50%, red: 50%)
Joint probabilities: ((big, red): 50%, (red, big): 50%). (But then (big, big) = 0%, for instance, which is allowed with disjoint probabilities.)
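
A toy numeric illustration of that distinction (the numbers are made up, matching the example above):

```python
# Independent (disjoint) per-mask distributions:
mask1 = {"big": 0.5, "red": 0.5}
mask2 = {"big": 0.5, "red": 0.5}

# Independent scoring happily assigns ("big", "big") probability 0.25 ...
print(mask1["big"] * mask2["big"])  # 0.25

# ... whereas a joint distribution over the pair can rule it out entirely.
joint = {("big", "red"): 0.5, ("red", "big"): 0.5}
print(joint.get(("big", "big"), 0.0))  # 0.0
```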

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
