Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sacrebleu 2.0 dies when encountering an empty reference line #161

Closed
mjpost opened this issue Aug 30, 2021 · 15 comments
Closed

Sacrebleu 2.0 dies when encountering an empty reference line #161

mjpost opened this issue Aug 30, 2021 · 15 comments
Assignees
Labels
Milestone

Comments

@mjpost
Copy link
Owner

mjpost commented Aug 30, 2021

Sacrebleu 2.0.0 dies if there is an empty line in the reference. You can reproduce this with this command:

 $ sacrebleu -t mtedx/test -l el-en --echo ref | sacrebleu -t mtedx/test -l el-en
Traceback (most recent call last):
  File "/home/mattpost/local/bin/sacrebleu", line 8, in <module>
    sys.exit(main())
  File "/home/mattpost/local/lib/python3.8/site-packages/sacrebleu/sacrebleu.py", line 482, in main
    metrics[name] = METRICS[name](**metric_args)
  File "/home/mattpost/local/lib/python3.8/site-packages/sacrebleu/metrics/bleu.py", line 197, in __init__
    self._ref_cache = self._cache_references(references)
  File "/home/mattpost/local/lib/python3.8/site-packages/sacrebleu/metrics/base.py", line 333, in _cache_references
    raise RuntimeError("Empty or `None` reference sentence found.")
RuntimeError: Empty or `None` reference sentence found.
@mjpost
Copy link
Owner Author

mjpost commented Aug 30, 2021

@ozancaglayan I tagged you, but let me know if you can't process this.

@mjpost mjpost changed the title Sacrebleu 2.0 dies with empty reference line Sacrebleu 2.0 dies when encountering an empty reference line Aug 30, 2021
@ozancaglayan
Copy link
Collaborator

ozancaglayan commented Aug 31, 2021

Hi there,

This is a side-effect of variable number of references introduced in #132 . The reason it was not failing with 1.5.x is probably due to this feature being available only through the API call and not the CLI utility. We now globally filter the set of references to reject empty or Nones.

I would suggest fixing the references if possible as an empty line in references does not make much sense, what do you think?

@mjpost
Copy link
Owner Author

mjpost commented Aug 31, 2021

At the very least, this error is way too cryptic. We should catch the exception and report to the user.

However, I think that we should support empty references, in part because there are many in our existing datasets. Without fixing this, those are unusable. Also, it is not clear how to fix them, especially for the average user.

Can’t we just do a check and convert None→ ``?

@ozancaglayan
Copy link
Collaborator

ozancaglayan commented Aug 31, 2021

Honestly, I can't wrap my head around for this problem :) Let me think loudly.

Initially, the '' check wasn't there and it was only filtering out the Nones but if I understand correctly you proposed to add that bit as well ( See 2b5052a#r556098399)

That made sense because explicit Nones could only appear if the user uses the API and injects them deliberately to their list of refs. This would mean that the CLI would not work with variable number of references. We allow multiple references through CLI in two ways (i) with a list of files and (ii) with a tab-separated stream. For (i) you can give 100 ref files and some of their lines can be missing to allow var. number of refs. For (ii), essentially the same thing: some columns would be empty strings. So still no Nones at all to check and convert to ''.

BTW, I am not sure why we throw that exception. It was again introduced in #132 probably for some sanity check @tuetschek ? I can't guess out of my head what happens if we remove that actually.

In the end, all these should be very well documented and tested to not cause ambiguities and weird results in the future.

@mjpost
Copy link
Owner Author

mjpost commented Aug 31, 2021

We should be able to retain the feature of multiple tab-delimited references, no?

For example, consider ref1:

The first sentence.

and reference two:


The second sentence.

(I really wish Markdown supported tables).

Pasted, this would give

This is the first sentence\t
\tThis is the second sentence

Which is fine. To make this not throw an error, we just need those empty fields to be "" instead of None, right? So we would change line 330 from this:

            lines = [x for x in refs if x is not None and x != ""]

to this:

            lines = [x for x in refs if x is not None]

But now I have a vague memory where we had to add the second check for some reason. Do you remember why?

I did test this. It fixes the command above, and still works in situations where we have multiple references:

$ sacrebleu -t wmt17/ms -l zh-en --echo src | ./sacrebleu.py -t wmt17/ms -l zh-en
{
 "name": "BLEU",
 "score": 0.0,
 "signature": "nrefs:3|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "20.5/17.3/6.6/3.0 (BP = 0.000 ratio = 0.066 hyp_len = 3289 ref_len = 50034)",
 "nrefs": "3",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}

Actually, though, this doesn't work if you try to give the references as a tab-delimited stream. How does that work? Here, I dump the three references to disk, then try:

$ sacrebleu -t wmt17/ms -l zh-en --echo ref > refs
sacrebleu -t wmt17/ms -l zh-en --echo src | ./sacrebleu.py refs
{
 "name": "BLEU",
 "score": 0.0,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "20.8/17.3/6.6/3.0 (BP = 0.000 ratio = 0.020 hyp_len = 3289 ref_len = 166361)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}

Note there is just one reference here, even though refs has three, tab-delimited. Maybe I am just forgetting the syntax for multiple tab-delimited refs?

@ozancaglayan
Copy link
Collaborator

ozancaglayan commented Aug 31, 2021

For the last part, to make things even complicated, we have a historical --num-refs param that needs to be provided if the references are given in a tab-delimited file.. If you pass --num-refs 3, you get the same result.

@ozancaglayan
Copy link
Collaborator

ozancaglayan commented Aug 31, 2021

But now I have a vague memory where we had to add the second check for some reason. Do you remember why?

The second check was added there to allow skipping empty reference lines since None reference lines are not something that can be curated through CLI. None placeholders can only be injected through the API use.

Now you could (or I could) say, why do we even filter out empty lines there. Then I can't directly guess if having an empty ref vs. not having it changes BLEU, TER or chrF scores computed at all.

@martinpopel
Copy link
Collaborator

martinpopel commented Aug 31, 2021

I think we should first decide whether we want to support empty reference segments in case of a single reference (i.e. ignoring the use case of variable number of references).

we should support empty references, in part because there are many in our existing datasets.

I can imagine multiple reasons for empty (segments in) references and I am not sure if we want to support all/any of them.

  1. The source sentence is also empty. This could happen e.g.
    a. to mark paragraph boundaries in plain text
    b. in multi-parallel corpora where some translations for some languages are missing, but we want to keep all languages alignable by line numbers
  2. The reference is wrong.
    a. The translation for a given segment is missing (but a correct translation is not empty).
    b. There was a technical error (e.g. a wrong sentence alignment between the source and reference files).
  3. The correct translation of a given segment should be empty (according to the reference translator).
    a. The segment is a "filler word" (oh, ehm...) and the goal of the task is translation and text cleaning.
    b. The correct translation of a given segment should be empty only within a given document context, for example to prevent repeating the same sentence again.

The two empty lines in -t mtedx/test -l el-en --echo ref seem to be the case of 3b (although, the repetition is not exact: You speak it very well. vs. You know it very well.).

SacreBLEU can be used for evaluating document-level translation and translation-like tasks (such as text normalization, phonetic transcription etc.), so I agree there are reasons to support both 3a and 3b (at the cost of making everything a bit more complicated, e.g. not catching 2b).

We don't want to support 2b (sacrebleu should warn users about such wrong references) and even for 2a, I would say it would be better to mark such (temporarily) missing segments explicitly with a special token (MISSING) or exclude such segments from both source and target (esp. if sentence-level translation is expected).

I am not sure about 1a and 1b. It could be a nice feature for some users. For example, I would use 1a, but only if it results in exactly the same BLEU as when filtering out the empty lines (from both src and ref), which is not the case now because of the brevity penalty.

@ozancaglayan
Copy link
Collaborator

ozancaglayan commented Aug 31, 2021

Some tests:

$ cat hyps
The dog bit the man.
It wasn't surprising.
The man had just bitten him.

$ cat ref1
The dog bit the man.
It was not unexpected.
The man bit him first.

$ cat ref1_missing_line

It was not unexpected.
The man bit him first.

$ cat ref2
The dog had bit the man.
No one was surprised.
The man had bitten the dog.
cat hyps | sacrebleu ref1 -f text
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 45.1 70.6/42.9/36.4/37.5 (BP = 1.000 ratio = 1.000 hyp_len = 17 ref_len = 17)
# after relaxing x != ''
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 45.1 70.6/42.9/36.4/37.5 (BP = 1.000 ratio = 1.000 hyp_len = 17 ref_len = 17)

cat hyps | sacrebleu ref2 -f text
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 28.3 70.6/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)
# after relaxing x != ''
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 28.3 70.6/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)

cat hyps | sacrebleu ref1 ref2 -f text
BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 48.5 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)
# after relaxing x != ''
BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 48.5 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)

# Wrong as we did not pass -nr 2
cat hyps | sacrebleu <(paste ref1 ref2) -f text
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 16.8 82.4/50.0/45.5/37.5 (BP = 0.327 ratio = 0.472 hyp_len = 17 ref_len = 36)
# after relaxing x != ''
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 16.8 82.4/50.0/45.5/37.5 (BP = 0.327 ratio = 0.472 hyp_len = 17 ref_len = 36)

cat hyps | sacrebleu <(paste ref1 ref2) -f text -nr 2
BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 48.5 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)
# after relaxing x != ''
BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 48.5 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)

cat hyps | sacrebleu ref1_missing_line
RuntimeError: Empty or `None` reference sentence found.
# after relaxing x != ''
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 7.7 35.3/7.1/4.5/3.1 (BP = 1.000 ratio = 1.545 hyp_len = 17 ref_len = 11)

# do the same with multi refs
cat hyps | sacrebleu ref1_missing_line ref2 -f text
BLEU|nrefs:var|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 29.4 82.4/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)
# after relaxing x != '' ** variable ref counting broke
BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 29.4 82.4/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)

# do the same with TSV
cat hyps | sacrebleu <(paste ref1_missing_line ref2) -f text -nr 2
BLEU|nrefs:var|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 29.4 82.4/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)
# after relaxing x != '' ** variable ref counting broke
BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 29.4 82.4/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)

@tuetschek
Copy link
Contributor

Just a note: I think that this check was introduced so that no one is using empty references by mistake. "Simulating" variable number of references by using empty references could change the brevity penalty.

@ozancaglayan
Copy link
Collaborator

What do we do about this?

@mjpost
Copy link
Owner Author

mjpost commented Nov 16, 2021

I would like to fix it this week and do a 2.1.0 release (with #118 and #174).

I think the reality is that we need to support empty lines. My reasoning is that sacrebleu is a wrapper around existing, externally-provided test sets, and has no power over the correctness of those test sets, nor do its maintainers (i.e., us) have the bandwidth to even determine where a blank line in a reference set belongs in @martinpopel's taxonomy.

@ozancaglayan what do you think?

mjpost added a commit that referenced this issue Nov 16, 2021
@mjpost mjpost added this to the 2.1.0 milestone Nov 16, 2021
@ozancaglayan
Copy link
Collaborator

ozancaglayan commented Nov 16, 2021

  • Removing the None check breaks the variable number of references support through the API.
  • Removing the empty string check breaks the variable number of references support through the CLI.

I think we should at least keep the None check to still support the 1st one. For the second, there's no easy fix so we'll need to update the README and say that 2.1.0 breaks compatibility with 2.0.0 and reverts the variable number of refs support through the CLI and insist that this feature should only be used through the API.

@mjpost
Copy link
Owner Author

mjpost commented Nov 16, 2021

I've been short on time to look into this. This last post was an aid to my understanding. But why do we need None? Can't we just use "" everywhere to represent a missing reference? I don't understand why multi-ref support in both settings is so different and why we need the None check at all.

At the very least, we should document this better. I admit that I have lost my mental model of the codebase as we've added features and especially with the v2.0 refactoring.

@ozancaglayan
Copy link
Collaborator

  • In the past, things were organized in a way that the length of a list of references reflect the number of references in general for that test set. When variable number of references (let me abbreviate it as VR), we needed a way to mark a particular reference in a list of references as missing i.e. by making it "" or None. "" alone would have been sufficient for both the API and the CLI but we probably also added None as a secondary way of marking things. In line 337 of sacrebleu/metrics/base.py, we further check whether the number of references across the test set is unique or can differ. For the latter, we mark it as VR and in the signature we see it as nrefs:var.

  • Since "" and None denote a non-existent reference, when we apply the filtering lines = [line for line in refs if line != "" and line is not None], AND the number of references == 1, we are left with no references and the code raises an Exception, which is what this issue refers to.

  • From the CLI perspective, since we either provide a number of text files to represent the set of references, or a tab-separated file with each column being a particular reference, we still have the notion of number of columns. For example: we can have 10 files given as refs but only a few sentences may have a 10th reference, with all others having a "" in the 10th column. We need to recognize this as VR as well. There's no way to have a None in a text file so the only way of easily marking unavailable references when files or TSV's are given to sacreBLEU utility, was to explicitly make them "". (An alternative way is to define a particular marker such as <None> to fill missing references, which can make things much clean).

  • Now let's relax the filter to lines = [line for line in refs if line is not None]. This will not break the API usage of VR when we document things properly, saying that "you need to put Nones to allow for VR". But this breaks the CLI usage of VR as now the "" columns will be taken as proper reference sentences. This has two consequences:

    • The signature will never be able to tell nrefs:var since all columns are valid/accepted.
    • Metrics will be given ""s as potential references. Honestly I can't tell if a particular metric score will change in this case from an applied and theoretical perspective. Does the chrF sentence score should be 100% if both hypotheses and references are empty? No idea. How does the brevity penalty is affected if we have a reference with length 0? What about TER? (This concern could be ignored as the case of having an empty reference due to a bug in the dataset/test set is already very rare).
  • Now let's relax the filter to lines = [line for line in refs] as in your last PR. This'll break the moment a sacreBLEU API user passes Nones to denote missing references because it was documented that way. So we should keep the None check which does not harm anything if I am not mistaken.

  • So in short, we need a way to mark empty positions in files/streams and that marker should be something different than "" to be able to count the number of refs properly.

    • Ideally, I would change the TSV input of sacreBLEU to a variable-column one, where each line can include multiple fields separated with \t. This'll also allow us to remove the error-prone --num-refs argument which the user should provide explicitly now. This'll fix VR support without depending on "" markers.
    • For the case where we provide ref files as sacrebleu ref1 ref2 ... refN, we can disable VR as it is very cumbersome already for someone to prepare 10 files just because one of the sentences has 10 refs and all others have <10 refs. We can say that in this case VR does not and will not work and always consider 10 refs across the document.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants