-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sacrebleu 2.0 dies when encountering an empty reference line #161
Comments
@ozancaglayan I tagged you, but let me know if you can't process this. |
Hi there, This is a side-effect of variable number of references introduced in #132 . The reason it was not failing with 1.5.x is probably due to this feature being available only through the API call and not the CLI utility. We now globally filter the set of references to reject empty or I would suggest fixing the references if possible as an empty line in references does not make much sense, what do you think? |
At the very least, this error is way too cryptic. We should catch the exception and report to the user. However, I think that we should support empty references, in part because there are many in our existing datasets. Without fixing this, those are unusable. Also, it is not clear how to fix them, especially for the average user. Can’t we just do a check and convert |
Honestly, I can't wrap my head around for this problem :) Let me think loudly. Initially, the That made sense because explicit BTW, I am not sure why we throw that exception. It was again introduced in #132 probably for some sanity check @tuetschek ? I can't guess out of my head what happens if we remove that actually. In the end, all these should be very well documented and tested to not cause ambiguities and weird results in the future. |
We should be able to retain the feature of multiple tab-delimited references, no? For example, consider ref1:
and reference two:
(I really wish Markdown supported tables). Pasted, this would give
Which is fine. To make this not throw an error, we just need those empty fields to be lines = [x for x in refs if x is not None and x != ""] to this: lines = [x for x in refs if x is not None] But now I have a vague memory where we had to add the second check for some reason. Do you remember why? I did test this. It fixes the command above, and still works in situations where we have multiple references: $ sacrebleu -t wmt17/ms -l zh-en --echo src | ./sacrebleu.py -t wmt17/ms -l zh-en
{
"name": "BLEU",
"score": 0.0,
"signature": "nrefs:3|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "20.5/17.3/6.6/3.0 (BP = 0.000 ratio = 0.066 hyp_len = 3289 ref_len = 50034)",
"nrefs": "3",
"case": "mixed",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
} Actually, though, this doesn't work if you try to give the references as a tab-delimited stream. How does that work? Here, I dump the three references to disk, then try: $ sacrebleu -t wmt17/ms -l zh-en --echo ref > refs
sacrebleu -t wmt17/ms -l zh-en --echo src | ./sacrebleu.py refs
{
"name": "BLEU",
"score": 0.0,
"signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "20.8/17.3/6.6/3.0 (BP = 0.000 ratio = 0.020 hyp_len = 3289 ref_len = 166361)",
"nrefs": "1",
"case": "mixed",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
} Note there is just one reference here, even though |
For the last part, to make things even complicated, we have a historical |
The second check was added there to allow skipping empty reference lines since Now you could (or I could) say, why do we even filter out empty lines there. Then I can't directly guess if having an empty ref vs. not having it changes BLEU, TER or chrF scores computed at all. |
I think we should first decide whether we want to support empty reference segments in case of a single reference (i.e. ignoring the use case of variable number of references).
I can imagine multiple reasons for empty (segments in) references and I am not sure if we want to support all/any of them.
The two empty lines in SacreBLEU can be used for evaluating document-level translation and translation-like tasks (such as text normalization, phonetic transcription etc.), so I agree there are reasons to support both 3a and 3b (at the cost of making everything a bit more complicated, e.g. not catching 2b). We don't want to support 2b (sacrebleu should warn users about such wrong references) and even for 2a, I would say it would be better to mark such (temporarily) missing segments explicitly with a special token (MISSING) or exclude such segments from both source and target (esp. if sentence-level translation is expected). I am not sure about 1a and 1b. It could be a nice feature for some users. For example, I would use 1a, but only if it results in exactly the same BLEU as when filtering out the empty lines (from both src and ref) |
Some tests:
|
Just a note: I think that this check was introduced so that no one is using empty references by mistake. "Simulating" variable number of references by using empty references could change the brevity penalty. |
What do we do about this? |
I would like to fix it this week and do a 2.1.0 release (with #118 and #174). I think the reality is that we need to support empty lines. My reasoning is that sacrebleu is a wrapper around existing, externally-provided test sets, and has no power over the correctness of those test sets, nor do its maintainers (i.e., us) have the bandwidth to even determine where a blank line in a reference set belongs in @martinpopel's taxonomy. @ozancaglayan what do you think? |
I think we should at least keep the |
I've been short on time to look into this. This last post was an aid to my understanding. But why do we need At the very least, we should document this better. I admit that I have lost my mental model of the codebase as we've added features and especially with the v2.0 refactoring. |
|
Sacrebleu 2.0.0 dies if there is an empty line in the reference. You can reproduce this with this command:
The text was updated successfully, but these errors were encountered: