added option for SPM tokenization for sacrebleu #118
Conversation
Hi, this is a nice idea! Using SentencePiece would be a big improvement over the default tokenizer. I see the need to have a standardized model, but I think we would need to generalize your implementation first.
What are the details of your provided model?
@mjpost Perfect, I agree with all 3 things you mentioned. Sorry, the current PR was just meant to get your early review. Details of the SPM model: I have done some analysis internally that shows a very low `<unk>` rate for this model across all the languages. Let me know what the correct next steps would be.
Hi there (verbose mode on),

On point (1), there are already multiple tokenizers. Ideally, though, this variant would have been submitted to the WMT metrics task, or the authors would run their own experiments to see how it affects BLEU's correlations with human judgments. @ngoyal2707, have you done this?
As @mjpost said, the goal is to converge on one tokenizer that can cover most languages. The current SPM is trained on Common Crawl mined data covering 134 languages.

@mjpost I have done correlation analysis for more than 50 languages against the 13a tokenizer, the char tokenizer, and chrF. We haven't done the human-eval correlation yet; it's on our list but a bit low priority.

@mjpost Is it possible to set up a VC time to discuss the details of what analysis we would want, to get it merged into sacrebleu?
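For reference, this kind of comparison can be scripted against sacrebleu's public Python API; a minimal sketch, assuming plain-text hypothesis and reference files (the file names below are hypothetical placeholders):

```python
# Score one system output under several tokenizations/metrics, as one input
# to a tokenizer-correlation analysis. File names are placeholders.
import sacrebleu

hyps = open("system_output.txt", encoding="utf-8").read().splitlines()
refs = [open("reference.txt", encoding="utf-8").read().splitlines()]

bleu_13a = sacrebleu.corpus_bleu(hyps, refs, tokenize="13a")
bleu_char = sacrebleu.corpus_bleu(hyps, refs, tokenize="char")
chrf = sacrebleu.corpus_chrf(hyps, refs)

print(bleu_13a.score, bleu_char.score, chrf.score)
```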
```python
# Fetch the SPM model into the current working directory and load it.
url = "https://dl.fbaipublicfiles.com/fairseq/models/flores/sacrebleu_tokenizer_spm.model"
urllib.request.urlretrieve(url, "sacrebleu_tokenizer_spm.model")
self.sp.Load("sacrebleu_tokenizer_spm.model")
```
A few comments here:

- This should be saved to `~/.sacrebleu/models/`, not the current directory.
- Can you add a checksum test and locking? An early sacrebleu bug was people running multiple copies at the same time, which would download the test sets and overwrite each other. I think you could refactor `download_test_set` to have a `download_file(uri, local_path, checksum)`, and then this could use that function, too (see the sketch after this list).
- The model file name should be more descriptive, ideally containing the checksum.
- It would be helpful to log a message that the download is happening, since it's likely to take some time (although I see the model is quite small; we could distribute it with the package, per @ozancaglayan's suggestion).
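A minimal sketch of the suggested `download_file` helper, under the assumptions that MD5 is the checksum (per the later comments) and that a lock file guards concurrent runs; this is illustrative, not sacrebleu's actual implementation:

```python
# Hypothetical download_file helper along the lines suggested above.
# The locking and checksum details are assumptions, not sacrebleu's real code.
import hashlib
import logging
import os
import urllib.request

import portalocker  # third-party file-locking library


def _md5(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def download_file(uri: str, local_path: str, checksum: str) -> str:
    """Download uri to local_path unless a file with a matching MD5 is already there."""
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    # Serialize concurrent downloads so parallel runs don't overwrite each other.
    with portalocker.Lock(local_path + ".lock", timeout=60):
        if not os.path.exists(local_path) or _md5(local_path) != checksum:
            logging.info("Downloading %s to %s", uri, local_path)
            urllib.request.urlretrieve(uri, local_path)
            if _md5(local_path) != checksum:
                raise RuntimeError(f"Checksum mismatch for {local_path}")
    return local_path
```

The SPM download above could then become a one-liner, e.g. `download_file(url, os.path.expanduser("~/.sacrebleu/models/sacrebleu_tokenizer_spm.model"), expected_md5)`.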
Hi @ngoyal2707, just pinging you here in case you missed these comments.
Hey @mjpost, curious whether you have any thoughts about merging this? Can you please help with the build failure?
@ozancaglayan we should include this with #152, can we do that? @ngoyal2707, can you merge `master` into your branch?
@mjpost I have merged in the changes. Do you want me to create a new pull request?
No need to create a new PR. You can update this one just by pushing to your branch (ngoyal2707:adding_spm_tokenized_bleu). So check out your branch locally, merge `master` into it, and push.
Hmm, actually, I was able to do this by clicking a button on GitHub's interface.
Copying from offline communication:
Can you please add a link to the code for the crc32 check and downloading in sacrebleu?
Sorry, not crc32 (that's the ACL Anthology), but MD5, in `download_test_set`.
Ideally we'd factor out a `download_file(uri, local_path, checksum)` helper.
Hi, sure, but the big PR isn't merged to master yet. How is it that this will get merged before the other one? Hopefully I can finalise the other tomorrow and merge it to master. Can you ACK the other one? Later, we can test master a bit with this merged as well, prior to the final release.
I see, sure. We could just merge it after merging #152 but before doing the release.
Hi, now that the 2.0 branch is merged, do you want to rebase this on our `master`?
Superseded by #168. |
Adding a tokenizer based on a SentencePiece model trained on Common Crawl mined data covering 134 languages, with low-resource languages upsampled.
The goal is to provide unified tokenization across languages.
In our analysis, SPM-tokenized BLEU shows good correlation with the default tokenization for languages that have good tokenizers in sacrebleu, and good correlation with chrF or the "char" tokenizer for languages that are not handled well by the default tokenizer (e.g., my, km, lo, yo, mg).
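As a hedged illustration of what SPM tokenization means for BLEU (the wrapper function below is illustrative, not this PR's actual class): each sentence is split into SentencePiece pieces and rejoined with spaces, so that BLEU's n-gram matching operates on the same subword units across languages:

```python
# Illustrative sketch, not the PR's exact implementation.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("sacrebleu_tokenizer_spm.model")  # the model downloaded in the diff above


def spm_tokenize(line: str) -> str:
    # Encode into SentencePiece pieces and rejoin with spaces so that
    # standard BLEU n-gram counting runs over subword tokens.
    return " ".join(sp.EncodeAsPieces(line))


print(spm_tokenize("Hello, world!"))  # exact pieces depend on the model
```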
@mjpost let me know what analysis I can provide so that we can get it merged into sacrebleu master.