
added option for SPM tokenization for sacrebleu #118

Closed

Conversation

@ngoyal2707 (Author)

This adds a tokenizer based on a SentencePiece model trained on Common Crawl data mined for 134 languages, with low-resource languages upsampled.
The goal is to provide unified tokenization across languages.

In our analysis, SPM-tokenized BLEU correlates well with the default tokenization for languages that already have good tokenizers in sacrebleu, and correlates well with chrF or "char"-based tokenization for languages that are not handled well by the default tokenizer (e.g., my, km, lo, yo, mg).
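For reference, a minimal sketch of the intended use, assuming the sentencepiece Python package (the class and model file names here are illustrative, not the exact code in this PR):

import sentencepiece as spm

class TokenizerSPM:
    """Split a line into SentencePiece pieces and rejoin them with spaces for BLEU."""

    def __init__(self, model_path="sacrebleu_tokenizer_spm.model"):
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(model_path)

    def __call__(self, line):
        # BLEU counts whitespace-separated n-grams, so the pieces are joined with spaces.
        return " ".join(self.sp.EncodeAsPieces(line))

# Example: tok = TokenizerSPM(); tok("unified tokenization across languages")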

@mjpost, let me know what analysis I can provide so that we can get this merged into sacrebleu master.

@mjpost (Owner) commented Sep 14, 2020

Hi, this is a nice idea! Using SentencePiece would be a big improvement over the default tokenizer.

I see the need to have a standardized model, but I think we would need to generalize your implementation, such that

  • a number of built-in SPM models could be supported (a sketch follows this list)
  • these models would each have, at minimum, a name and a description of how they were built (and ideally a pointer to a document describing this)
  • the name of the selected model would become part of the signature
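For illustration, a minimal sketch of what such a built-in model registry could look like (the model name, checksum placeholder, and signature format are hypothetical; the URL and training details are taken from this PR):

# Hypothetical registry of built-in SPM models.
SPM_MODELS = {
    "flores-cc-134": {
        "description": "256k-vocab BPE SentencePiece model trained on Common Crawl data in 134 languages",
        "url": "https://dl.fbaipublicfiles.com/fairseq/models/flores/sacrebleu_tokenizer_spm.model",
        "md5": "<md5-of-model-file>",  # placeholder
        "paper": "https://arxiv.org/pdf/1911.02116.pdf",
    },
}

def spm_signature(model_name):
    # The selected model name would become part of the sacreBLEU signature.
    return f"tok:spm-{model_name}"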

What are the details of your provided model?

@ngoyal2707 (Author)

@mjpost Perfect, I agree with all three points you mentioned. Sorry, the current PR was just meant to get your early review.

Details of the SPM model:

  1. Training data: We filter from Common Crawl (12 snapshots for high-resource languages and 60 for low-resource ones) based on a fastText language ID detector, language model thresholds, and some other thresholds. The v1 version of this dataset was used in the XLM-R paper: https://arxiv.org/pdf/1911.02116.pdf
    We selected a total of 100M sentences across 134 languages (I can share the language list with you) using the temperature-based sampling described in the XLM-R paper.

  2. Model training details: The SPM model was trained with the following command:

spm_train \
--input="spm_samples.txt" \
--vocab_size=256000 \
--character_coverage=0.99995 \
--model_type=bpe \
--shuffle_input_sentence=true \
--input_sentence_size=100000000 \
--num_threads=72
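For convenience, the equivalent invocation through the SentencePiece Python API (assuming the sentencepiece package; model_prefix is added here for illustration):

import sentencepiece as spm

# Same settings as the spm_train command above.
spm.SentencePieceTrainer.train(
    input="spm_samples.txt",
    model_prefix="sacrebleu_tokenizer_spm",  # illustrative output name
    vocab_size=256000,
    character_coverage=0.99995,
    model_type="bpe",
    shuffle_input_sentence=True,
    input_sentence_size=100000000,
    num_threads=72,
)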

I have done some analysis internally that shows a very low unknown-token rate for this model across all the languages.

Let me know what would be correct next steps.

@ozancaglayan (Collaborator)

Hi there,

(verbose mode on)

  • Do we care about potential comparability issues across the literature? I can foresee people enabling the SPM switch in sacreBLEU for some reason, and then comparing against all the other SOTA results that did not use that switch. For all this time, the sole objective was to get consistent numbers out of sacreBLEU, and the same tokenizer is used everywhere except for Japanese and Chinese. I was a bit concerned on this front when I saw this PR.

  • In light of the above concern, I think there should be only one holistic SPM/BPE tokenizer here, not many more.

  • (If the above bullet makes sense) If the SPM model is not that big, it could be shipped within the sacreBLEU package.

@mjpost (Owner) commented Oct 16, 2020

On point (1), there are already multiple tokenizers (e.g., -tok intl). I think adding another one isn't too confusing, especially if v13a is the default. And the advantage is that, if a single SPM model were available that, say, covered all WMT languages, it would be more general.

Ideally, though, this variant would have been submitted to the WMT metrics task, or the authors would run their own experiments to see how it affects BLEU's correlations with human judgments. @ngoyal2707, have you done this?

@ngoyal2707 (Author)

As @mjpost said, the goal is to reach a single tokenizer that can cover most languages. The current SPM model is trained on Common Crawl data in 134 languages.
@ozancaglayan The reason to have versioning is that we can't predict whether we will want to expand to more languages. In general, though, I personally don't expect to upgrade the SPM model often.

@mjpost I have done correlation analysis for more than 50 languages against the 13a tokenizer, the char tokenizer, and chrF. We haven't done correlation with human evaluation yet; it's on our list but a bit lower priority.

@mjpost Is it possible to set up a video call to discuss what analysis we would need in order to get this merged into sacrebleu?

url = "https://dl.fbaipublicfiles.com/fairseq/models/flores/sacrebleu_tokenizer_spm.model"
urllib.request.urlretrieve(url, 'sacrebleu_tokenizer_spm.model')
self.sp.Load("sacrebleu_tokenizer_spm.model")

@mjpost (Owner) commented on the snippet above, Oct 16, 2020

A few comments here:

  • This should be saved to ~/.sacrebleu/models/, not the current directory
  • Can you add a checksum test and locking? An early sacrebleu bug was that people running multiple jobs at the same time would each download the test sets and overwrite each other. I think you could refactor download_test_set to have a download_file(uri, local_path, checksum), and then this code could use that function, too (see the sketch after this list)
  • The model file name should be more descriptive, ideally containing the checksum
  • It would be helpful to log a message that the download is happening, since it's likely to take some time (although I see the model is quite small; we could distribute it with the package, per @ozancaglayan's suggestion)
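A minimal sketch of what such a helper might look like (the function names and the write-then-rename strategy are assumptions for illustration, not sacrebleu's actual code; a real version would be factored out of download_test_set):

import hashlib
import logging
import os
import urllib.request

SACREBLEU_MODELS = os.path.join(os.path.expanduser("~"), ".sacrebleu", "models")

def _md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def download_file(uri, local_path, checksum=None):
    """Download uri to local_path unless a file with the expected MD5 is already there."""
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    if os.path.exists(local_path) and (checksum is None or _md5(local_path) == checksum):
        return local_path
    logging.info("Downloading %s to %s", uri, local_path)
    tmp_path = local_path + ".partial"
    urllib.request.urlretrieve(uri, tmp_path)
    if checksum is not None and _md5(tmp_path) != checksum:
        raise RuntimeError(f"Checksum mismatch for {uri}")
    os.replace(tmp_path, local_path)  # atomic rename, so concurrent runs don't see partial files
    return local_path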

@mjpost (Owner)

Hi @ngoyal2707, just pinging you here in case you missed these comments.

@mjpost added this to the 2.0.0 milestone on Feb 25, 2021
@ngoyal2707 (Author)

Hey @mjpost

Curious whether you have any thoughts on merging this. Could you please help with the build failure?

@mjpost (Owner) commented Jun 30, 2021

@ozancaglayan we should include this with #152, can we do that?

@ngoyal2707, can you merge in the master branch to help make the merge simpler?

@ngoyal2707 (Author)

@mjpost I have merged master into my fork: https://github.com/ngoyal2707/sacrebleu

Do you want me to create a new pull request?

@mjpost (Owner) commented Jun 30, 2021

No need to create a new PR. You can update this one just by pushing to your branch (ngoyal2707:adding_spm_tokenized_bleu). So check out your branch locally, merge mjpost/master into it, then push.

@mjpost (Owner) commented Jun 30, 2021

Hmm, actually, I was able to do this by clicking a button in GitHub's interface.

@ngoyal2707 (Author)

Copying from offline communication:

so it should be checksummed, and it'd be nice to have the crc32 hash included in the file name. Then it could be checked locally (there is existing code for this; it would be nice to reuse it). There is also code for downloading; it would be nice to factor that out.

Can you please add a link to the code for the crc32 check and the downloading in sacrebleu?

@mjpost (Owner) commented Jun 30, 2021

Sorry, not crc32 (that's the ACL Anthology), but MD5, in download_test_set.

@mjpost (Owner) commented Jun 30, 2021

Ideally we'd factor out a download_file(remote_path, local_path, checksum) call from download_test_set, and then both that function and your code could use download_file, if that makes sense.
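Applied to the snippet under review, the tokenizer setup could then look roughly like this (download_file is the hypothetical helper sketched earlier, and the MD5 value is a placeholder, not the model's real checksum):

import os
import sentencepiece as spm

SPM_URL = "https://dl.fbaipublicfiles.com/fairseq/models/flores/sacrebleu_tokenizer_spm.model"
SPM_MD5 = "<md5-of-spm-model>"  # placeholder

class TokenizerSPM:
    def __init__(self):
        # Cache under ~/.sacrebleu/models/ with the checksum in the file name.
        local_path = os.path.join(
            os.path.expanduser("~"), ".sacrebleu", "models",
            f"sacrebleu_tokenizer_spm.{SPM_MD5}.model",
        )
        download_file(SPM_URL, local_path, checksum=SPM_MD5)
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(local_path)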

@ozancaglayan (Collaborator)

Hi,

Sure, but the big PR isn't merged to master yet. How will this get merged before the other one? Hopefully I can finalise the other one tomorrow and merge it to master. Can you ACK the other one?

Later, we can test master a bit with this merged as well, prior to the final release.

@mjpost (Owner) commented Jun 30, 2021

I see, sure. We could just merge it after merging #152 but before doing the release.

@mjpost (Owner) commented Jul 23, 2021

Hi, now that the 2.0 branch is merged, do you want to rebase this on our master branch so we can include it with the 2.0 release?

@mjpost (Owner) commented Nov 10, 2021

Superseded by #168.
