Version 2.2 (#200)
Major features:
* Added WMT21 datasets (closes #166)
* Restructured internal storage, expose metadata via `--echo`
* Add a Korean tokenizer (`--tok ko-mecab` #194)
* Allow empty references (fixes #161)

Bugfixes and minor additions:
* Set API tokenizer to None by default (closes #181)
* Added SPM to list of CJK tokenizer recommendations
* Pulled out NoneTokenizer, allow None references in args check (addresses #195)
* Remove colon from filename of parsed files (not permitted on Windows)
* Fix: updated the URL of the mtnt2019 data to the latest
* Added a few missing md5 hashes
* Changed filename of downloaded file
* Use global filename for downloaded tarballs
* In tests, subsample datasets to download instead of going through them all

Co-authored-by: NoUnique <[email protected]>
Co-authored-by: hanbing <[email protected]>
Co-authored-by: Jannis Vamvas <[email protected]>
4 people authored Jul 25, 2022
1 parent 8e7abf5 commit a73315b
Showing 25 changed files with 2,834 additions and 1,084 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/check-build.yml
@@ -6,6 +6,11 @@ on:
env:
PYTHONUTF8: "1"

# only run one at a time per branch
concurrency:
group: check-build-${{ github.ref }}
cancel-in-progress: true

jobs:
check-build:
runs-on: ${{ matrix.os }}
@@ -35,6 +40,7 @@ jobs:
python -m pip install --upgrade pip
pip install pytest
pip install .[ja]
pip install .[ko]
- name: Python pytest test suite
run: python3 -m pytest
- name: CLI bash test suite
23 changes: 21 additions & 2 deletions CHANGELOG.md
@@ -1,13 +1,32 @@
# Release Notes

- 2.2.0 (2022-07-25)
  Features:
  - Added WMT21 datasets (thanks to @BrightXiaoHan)
  - `--echo` now exposes document metadata where available (e.g., docid, genre, origlang)
  - Bugfix: allow empty references (#161)
  - Adds a Korean tokenizer (thanks to @NoUnique)

  Under the hood:
  - Moderate code refactoring
  - Processed files have adopted a more sensible internal naming scheme under ~/.sacrebleu
    (e.g., wmt17_ms.zh-en.src instead of zh-en.zh)
  - Processed file extensions correspond to the values passed to `--echo` (e.g., "src")
  - Now explicitly representing NoneTokenizer
  - Got rid of the ".lock" lockfile for downloading (using the tarball itself)

  Many thanks to @BrightXiaoHan (https://github.com/BrightXiaoHan) for the bulk of
  the code contributions in this release.

- 2.1.0 (2022-05-19)
  Features:
  - Added `-tok spm` for multilingual SPM tokenization (#168)
    (thanks to Naman Goyal and James Cross at Facebook)

  Fixes:
  - Handle potential memory usage issues due to LRU caching in tokenizers (#167)
  - Bugfix: BLEU.corpus_score() now using max_ngram_order (#173)
  - Upgraded ja-mecab to 1.0.5 (#196)

- 2.0.0 (2021-07-18)
  - Build: Add Windows and OS X testing to Travis CI.
48 changes: 36 additions & 12 deletions README.md
@@ -59,6 +59,11 @@ following command instead, to perform a full installation with dependencies:

pip install "sacrebleu[ja]"

In order to install Korean tokenizer support through `pymecab-ko`, you need to run the
following command instead, to perform a full installation with dependencies:

pip install "sacrebleu[ko]"

# Command-line Usage

You can get a list of available test sets with `sacrebleu --list`. Please see [DATASETS.md](DATASETS.md)
@@ -68,17 +73,15 @@ for an up-to-date list of supported datasets.

### Downloading test sets

Downloading happens automatically when you request a test set: if it is not available
locally, it is downloaded and unpacked.

E.g., you can use the following commands to download the source, pass it through your translation system
in `translate.sh`, and then score it:

```
$ sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en
$ cat wmt17.en-de.en | translate.sh | sacrebleu -t wmt17 -l en-de
```

### JSON output
@@ -194,8 +197,8 @@ BLEU related arguments:
Smoothing method: exponential decay, floor (increment zero counts), add-k (increment num/denom by k for n>1), or none. (Default: exp)
--smooth-value BLEU_SMOOTH_VALUE, -sv BLEU_SMOOTH_VALUE
The smoothing value. Only valid for floor and add-k. (Defaults: floor: 0.1, add-k: 1)
--tokenize {none,zh,13a,char,intl,ja-mecab,ko-mecab}, -tok {none,zh,13a,char,intl,ja-mecab,ko-mecab}
Tokenization method to use for BLEU. If not provided, defaults to `zh` for Chinese, `ja-mecab` for Japanese, `ko-mecab` for Korean and `13a` (mteval) otherwise.
--lowercase, -lc If True, enables case-insensitivity. (Default: False)
--force Insist that your tokenized input is actually detokenized.
@@ -220,6 +223,26 @@ TER related arguments (The defaults replicate TERCOM's behavior):
### Version Signatures
As you may have noticed, sacreBLEU generates version strings such as `BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0` for reproducibility reasons. It's strongly recommended to share these signatures in your papers!
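These signature strings have a regular `metric|key:value|…` shape, so they are easy to carry along in experiment logs. A minimal, hypothetical parsing sketch (plain Python, not part of the sacreBLEU API; the field names come from the example above):

```python
# Parse a sacreBLEU version signature like the example above into a dict.
# Field names (nrefs, case, tok, ...) are taken from that example; other
# metrics may emit different fields.

def parse_signature(sig: str) -> dict:
    metric, *fields = sig.split("|")
    parsed = {"metric": metric}
    for field in fields:
        key, _, value = field.partition(":")
        parsed[key] = value
    return parsed

sig = "BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0"
info = parse_signature(sig)
print(info["metric"], info["tok"], info["version"])  # BLEU 13a 2.0.0
```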

### Outputting other metadata

sacreBLEU knows about metadata for some test sets, and you can output it like this:

```
$ sacrebleu -t wmt21 -l en-de --echo src docid ref | head -n2
Couple MACED at California dog park for not wearing face masks while having lunch (VIDEO) - RT USA News rt.com.131279 Paar in Hundepark in Kalifornien mit Pfefferspray besprüht, weil es beim Mittagessen keine Masken trug (VIDEO) - RT USA News
There's mask-shaming and then there's full on assault. rt.com.131279 Masken-Shaming ist eine Sache, Körperverletzung eine andere.
```

If multiple fields are requested, they are output as tab-separated columns (a TSV).

To see the available fields, add `--echo asdf` (or some other garbage data):

```
$ sacrebleu -t wmt21 -l en-de --echo asdf
sacreBLEU: No such field asdf in test set wmt21 for language pair en-de.
sacreBLEU: available fields for wmt21/en-de: src, ref:A, ref, docid, origlang
```
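Because multi-field `--echo` output is plain TSV, it is easy to post-process with standard tooling. A small illustrative sketch (plain Python; the column order simply follows the fields requested on the command line):

```python
# Group `sacrebleu ... --echo src docid ref` output lines by document id.
# Columns arrive in the order the fields were requested with --echo.

def group_by_docid(lines):
    docs = {}
    for line in lines:
        src, docid, ref = line.rstrip("\n").split("\t")
        docs.setdefault(docid, []).append((src, ref))
    return docs

# Two segments from the same document, as in the wmt21 example above:
tsv = [
    "Couple MACED at California dog park ...\trt.com.131279\tPaar in Hundepark ...\n",
    "There's mask-shaming ...\trt.com.131279\tMasken-Shaming ...\n",
]
docs = group_by_docid(tsv)
print(len(docs["rt.com.131279"]))  # 2
```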

## Translationese Support

If you are interested in the translationese effect, you can evaluate BLEU on a subset of sentences
@@ -247,11 +270,12 @@ but it expects that you pass through the entire translated test set.
- `intl` applies international tokenization and mimics the `mteval-v14` script from Moses
- `zh` separates out **Chinese** characters and tokenizes the non-Chinese parts using `13a` tokenizer
- `ja-mecab` tokenizes **Japanese** inputs using the [MeCab](https://pypi.org/project/mecab-python3) morphological analyzer
- `ko-mecab` tokenizes **Korean** inputs using the [MeCab-ko](https://pypi.org/project/mecab-ko) morphological analyzer
- `spm` uses the SentencePiece model built from the Flores-101 dataset (https://github.com/facebookresearch/flores#list-of-languages). Note: the canonical .spm file will be automatically fetched if not found locally.
- You can switch tokenizers using the `--tokenize` flag of sacreBLEU. Alternatively, if you provide language-pair strings
using `--language-pair/-l`, `zh`, `ja-mecab` and `ko-mecab` tokenizers will be used if the target language is `zh` or `ja` or `ko`, respectively.
- **Note that** there's no automatic language detection from the hypotheses so you need to make sure that you are correctly
selecting the tokenizer for **Japanese**, **Korean** and **Chinese**.
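The default-selection rule described above amounts to a small lookup on the target language. The sketch below only mirrors the documented behavior; it is not sacreBLEU's internal code:

```python
# Map a target language code to the default tokenizer, mirroring the
# documented rule: zh -> zh, ja -> ja-mecab, ko -> ko-mecab, else 13a.
DEFAULT_TOKENIZERS = {"zh": "zh", "ja": "ja-mecab", "ko": "ko-mecab"}

def default_tokenizer(langpair: str) -> str:
    """Pick the default tokenizer from an `-l src-tgt` language pair."""
    target = langpair.split("-")[-1]
    return DEFAULT_TOKENIZERS.get(target, "13a")

print(default_tokenizer("en-ja"))  # ja-mecab
print(default_tokenizer("en-de"))  # 13a
```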


Default 13a tokenizer will produce poor results for Japanese:
2 changes: 1 addition & 1 deletion sacrebleu/__init__.py
@@ -14,7 +14,7 @@
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

__version__ = '2.2.0'
__description__ = 'Hassle-free computation of shareable, comparable, and reproducible BLEU, chrF, and TER scores'


7 changes: 7 additions & 0 deletions sacrebleu/compat.py
@@ -15,6 +15,8 @@ def corpus_bleu(hypotheses: Sequence[str],
tokenize=BLEU.TOKENIZER_DEFAULT,
use_effective_order=False) -> BLEUScore:
"""Computes BLEU for a corpus against a single (or multiple) reference(s).
This is the main CLI entry point for computing BLEU between a system output
and a reference sentence.
:param hypotheses: A sequence of hypothesis strings.
:param references: A sequence of reference documents with document being
@@ -42,11 +44,16 @@ def raw_corpus_bleu(hypotheses: Sequence[str],
This convenience function assumes a particular set of arguments i.e.
it disables tokenization and applies a `floor` smoothing with value `0.1`.
This convenience call does not apply any tokenization at all,
neither to the system output nor the reference. It just computes
BLEU on the "raw corpus" (hence the name).
:param hypotheses: A sequence of hypothesis strings.
:param references: A sequence of reference documents with document being
defined as a sequence of reference strings.
:param smooth_value: The smoothing value for `floor`. If not given, the default of 0.1 is used.
:return: Returns a `BLEUScore` object.
"""
return corpus_bleu(
hypotheses, references, smooth_method='floor',