
added option for SPM tokenization for sacrebleu #118

Closed

Conversation

@ngoyal2707 (Author)

This adds a tokenizer based on a SentencePiece model trained on Common Crawl data mined for 134 languages, with low-resource languages upsampled.
The goal is to provide unified tokenization across languages.

In our analysis, SPM-tokenized BLEU correlates well with the default tokenization for languages that already have good tokenizers in sacrebleu, and correlates well with chrF or "char"-based tokenization for languages that are not handled well by the default tokenizer (e.g., my, km, lo, yo, mg).
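For reference, a minimal sketch of the intended use, assuming the sentencepiece Python package (the class and model file names here are illustrative, not the exact code in this PR):

import sentencepiece as spm

class TokenizerSPM:
    """Split a line into SentencePiece pieces and rejoin them with spaces for BLEU."""

    def __init__(self, model_path="sacrebleu_tokenizer_spm.model"):
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(model_path)

    def __call__(self, line):
        # BLEU counts whitespace-separated n-grams, so the pieces are joined with spaces.
        return " ".join(self.sp.EncodeAsPieces(line))

# Example: tok = TokenizerSPM(); tok("unified tokenization across languages")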

@mjpost, let me know what analysis I can provide so that we can get this merged into sacrebleu master.

@mjpost (Owner) commented Sep 14, 2020

Hi, this is a nice idea! Using SentencePiece would be a big improvement over the default tokenizer.

I see the need to have a standardized model, but I think we would need to generalize your implementation, such that

  • a number of built-in SPM models could be supported (a sketch follows this list)
  • these models would each have, at minimum, a name and a description of how they were built (and ideally a pointer to a document describing this)
  • the name of the selected model would become part of the signature
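For illustration, a minimal sketch of what such a built-in model registry could look like (the model name, checksum placeholder, and signature format are hypothetical; the URL and training details are taken from this PR):

# Hypothetical registry of built-in SPM models.
SPM_MODELS = {
    "flores-cc-134": {
        "description": "256k-vocab BPE SentencePiece model trained on Common Crawl data in 134 languages",
        "url": "https://dl.fbaipublicfiles.com/fairseq/models/flores/sacrebleu_tokenizer_spm.model",
        "md5": "<md5-of-model-file>",  # placeholder
        "paper": "https://arxiv.org/pdf/1911.02116.pdf",
    },
}

def spm_signature(model_name):
    # The selected model name would become part of the sacreBLEU signature.
    return f"tok:spm-{model_name}"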

What are the details of your provided model?

@ngoyal2707 (Author)

@mjpost Perfect, I agree with all three points you mentioned. Sorry, the current PR was just meant to get your early review.

Details of the SPM model:

  1. Training data: We filter from Common Crawl (12 snapshots for high-resource languages and 60 for low-resource ones) based on a fastText language ID detector, language model thresholds, and some other thresholds. The v1 version of this dataset was used in the XLM-R paper: https://arxiv.org/pdf/1911.02116.pdf
    We selected a total of 100M sentences across 134 languages (I can share the language list with you) using the temperature-based sampling described in the XLM-R paper.

  2. Model training details: The SPM model was trained with the following command:

spm_train \
--input="spm_samples.txt" \
--vocab_size=256000 \
--character_coverage=0.99995 \
--model_type=bpe \
--shuffle_input_sentence=true \
--input_sentence_size=100000000 \
--num_threads=72
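For convenience, the equivalent invocation through the SentencePiece Python API (assuming the sentencepiece package; model_prefix is added here for illustration):

import sentencepiece as spm

# Same settings as the spm_train command above.
spm.SentencePieceTrainer.train(
    input="spm_samples.txt",
    model_prefix="sacrebleu_tokenizer_spm",  # illustrative output name
    vocab_size=256000,
    character_coverage=0.99995,
    model_type="bpe",
    shuffle_input_sentence=True,
    input_sentence_size=100000000,
    num_threads=72,
)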

I have done some analysis internally that shows a very low unknown-token rate for this model across all the languages.

Let me know what would be correct next steps.

@ozancaglayan (Collaborator)

Hi there,

(verbose mode on)

  • Do we care about potential comparability issues across the literature? I can foresee people enabling the SPM switch in sacreBLEU for some reason, and then comparing against all the other SOTA results that did not use that switch. For all this time, the sole objective was to get consistent numbers out of sacreBLEU, and the same tokenizer is used everywhere except for Japanese and Chinese. I was a bit concerned on this front when I saw this PR.

  • In light of the above concern, I think there should be only one holistic SPM/BPE tokenizer here, not many more.

  • (If the above bullet makes sense) If the SPM model is not that big, it could be shipped within the sacreBLEU package.

@mjpost (Owner) commented Oct 16, 2020

On point (1), there are already multiple tokenizers (e.g., -tok intl). I think adding another one isn't too confusing, especially if v13a is the default. And the advantage is that, if a single SPM model were available that, say, covered all WMT languages, it would be more general.

Ideally, though, this variant would have been submitted to the WMT metrics task, or the authors would run their own experiments to see how it affects BLEU's correlations with human judgments. @ngoyal2707, have you done this?

@ngoyal2707 (Author)

As @mjpost said, the goal is to reach a single tokenizer that can cover most languages. The current SPM model is trained on Common Crawl data in 134 languages.
@ozancaglayan The reason to have versioning is that we can't predict whether we will want to expand to more languages. In general, though, I personally don't expect to upgrade the SPM model often.

@mjpost I have done correlation analysis for more than 50 languages against the 13a tokenizer, the char tokenizer, and chrF. We haven't done correlation with human evaluation yet; it's on our list but a bit lower priority.

@mjpost Is it possible to set up a video call to discuss what analysis we would need in order to get this merged into sacrebleu?

url = "https://dl.fbaipublicfiles.com/fairseq/models/flores/sacrebleu_tokenizer_spm.model"
urllib.request.urlretrieve(url, 'sacrebleu_tokenizer_spm.model')
self.sp.Load("sacrebleu_tokenizer_spm.model")

@mjpost (Owner) commented on the snippet above, Oct 16, 2020

A few comments here:

  • This should be saved to ~/.sacrebleu/models/, not the current directory
  • Can you add a checksum test and locking? An early sacrebleu bug was that people running multiple jobs at the same time would each download the test sets and overwrite each other. I think you could refactor download_test_set to have a download_file(uri, local_path, checksum), and then this code could use that function, too (see the sketch after this list)
  • The model file name should be more descriptive, ideally containing the checksum
  • It would be helpful to log a message that the download is happening, since it's likely to take some time (although I see the model is quite small; we could distribute it with the package, per @ozancaglayan's suggestion)
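A minimal sketch of what such a helper might look like (the function names and the write-then-rename strategy are assumptions for illustration, not sacrebleu's actual code; a real version would be factored out of download_test_set):

import hashlib
import logging
import os
import urllib.request

SACREBLEU_MODELS = os.path.join(os.path.expanduser("~"), ".sacrebleu", "models")

def _md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def download_file(uri, local_path, checksum=None):
    """Download uri to local_path unless a file with the expected MD5 is already there."""
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    if os.path.exists(local_path) and (checksum is None or _md5(local_path) == checksum):
        return local_path
    logging.info("Downloading %s to %s", uri, local_path)
    tmp_path = local_path + ".partial"
    urllib.request.urlretrieve(uri, tmp_path)
    if checksum is not None and _md5(tmp_path) != checksum:
        raise RuntimeError(f"Checksum mismatch for {uri}")
    os.replace(tmp_path, local_path)  # atomic rename, so concurrent runs don't see partial files
    return local_path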

@mjpost (Owner)

Hi @ngoyal2707, just pinging you here in case you missed these comments.

@mjpost added this to the 2.0.0 milestone on Feb 25, 2021
@ngoyal2707 (Author)

Hey @mjpost

Curious whether you have any thoughts on merging this. Could you please help with the build failure?

@mjpost (Owner) commented Jun 30, 2021

@ozancaglayan we should include this with #152, can we do that?

@ngoyal2707, can you merge in the master branch to help make the merge simpler?

@ngoyal2707 (Author)

@mjpost I have merged master into my fork: https://github.com/ngoyal2707/sacrebleu

Do you want me to create a new pull request?

@mjpost (Owner) commented Jun 30, 2021

No need to create a new PR. You can update this one just by pushing to your branch (ngoyal2707:adding_spm_tokenized_bleu). So check out your branch locally, merge mjpost/master into it, then push.

@mjpost (Owner) commented Jun 30, 2021

Hmm, actually, I was able to do this by clicking a button in GitHub's interface.

@ngoyal2707 (Author)

Copying from offline communication:

so it should be checksummed, and it'd be nice to have the crc32 hash included in the file name. Then it could be checked locally (there is existing code for this; it would be nice to reuse it). There is also code for downloading; it would be nice to factor that out.

Can you please add a link to the code for the crc32 check and the downloading in sacrebleu?

@mjpost (Owner) commented Jun 30, 2021

Sorry, not crc32 (that's the ACL Anthology), but MD5, in download_test_set.

@mjpost (Owner) commented Jun 30, 2021

Ideally we'd factor out a download_file(remote_path, local_path, checksum) call from download_test_set, and then both that function and your code could use download_file, if that makes sense.
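Applied to the snippet under review, the tokenizer setup could then look roughly like this (download_file is the hypothetical helper sketched earlier, and the MD5 value is a placeholder, not the model's real checksum):

import os
import sentencepiece as spm

SPM_URL = "https://dl.fbaipublicfiles.com/fairseq/models/flores/sacrebleu_tokenizer_spm.model"
SPM_MD5 = "<md5-of-spm-model>"  # placeholder

class TokenizerSPM:
    def __init__(self):
        # Cache under ~/.sacrebleu/models/ with the checksum in the file name.
        local_path = os.path.join(
            os.path.expanduser("~"), ".sacrebleu", "models",
            f"sacrebleu_tokenizer_spm.{SPM_MD5}.model",
        )
        download_file(SPM_URL, local_path, checksum=SPM_MD5)
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(local_path)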

@ozancaglayan (Collaborator)

Hi,

Sure, but the big PR isn't merged to master yet. How will this get merged before the other one? Hopefully I can finalise the other one tomorrow and merge it to master. Can you ACK the other one?

Later, we can test master a bit with this merged as well, prior to the final release.

@mjpost (Owner) commented Jun 30, 2021

I see, sure. We could just merge it after merging #152 but before doing the release.

@mjpost (Owner) commented Jul 23, 2021

Hi, now that the 2.0 branch is merged, do you want to rebase this on our master branch so we can include it with the 2.0 release?

@mjpost (Owner) commented Nov 10, 2021

Superseded by #168.
