Version 2.2 (#200)
Major features:
* Added WMT21 datasets (closes #166)
* Restructured internal storage, expose metadata via `--echo`
* Add a Korean tokenizer (`--tok ko-mecab` #194)
* Allow empty references (fixes #161)

Bugfixes and minor additions:
* Set API tokenizer to None by default (closes #181)
* Added SPM to list of CJK tokenizer recommendations
* Pulled out NoneTokenizer, allow None references in args check (addresses #195)
* Remove colon from filename of parsed files (not permitted on Windows)
* Fix: updated the URL of the mtnt2019 data to the latest
* Added a few missing md5 hashes
* Changed filename of downloaded file
* Use global filename for downloaded tarballs
* In tests, subsample datasets to download instead of going through them all

Co-authored-by: NoUnique <[email protected]>
Co-authored-by: hanbing <[email protected]>
Co-authored-by: Jannis Vamvas <[email protected]>
4 people authored Jul 25, 2022
1 parent 8e7abf5 commit a73315b
Showing 25 changed files with 2,834 additions and 1,084 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/check-build.yml
@@ -6,6 +6,11 @@ on:
env:
PYTHONUTF8: "1"

# only run one at a time per branch
concurrency:
group: check-build-${{ github.ref }}
cancel-in-progress: true

jobs:
check-build:
runs-on: ${{ matrix.os }}
@@ -35,6 +40,7 @@ jobs:
python -m pip install --upgrade pip
pip install pytest
pip install .[ja]
pip install .[ko]
- name: Python pytest test suite
run: python3 -m pytest
- name: CLI bash test suite
23 changes: 21 additions & 2 deletions CHANGELOG.md
@@ -1,13 +1,32 @@
# Release Notes

- 2.2.0 (2022-07-25)
  Features:
  - Added WMT21 datasets (thanks to @BrightXiaoHan)
  - `--echo` now exposes document metadata where available (e.g., docid, genre, origlang)
  - Bugfix: allow empty references (#161)
  - Adds a Korean tokenizer (thanks to @NoUnique)

  Under the hood:
  - Moderate code refactoring
  - Processed files have adopted a more sensible internal naming scheme under ~/.sacrebleu
    (e.g., wmt17_ms.zh-en.src instead of zh-en.zh)
  - Processed file extensions correspond to the values passed to `--echo` (e.g., "src")
  - Now explicitly representing NoneTokenizer
  - Got rid of the ".lock" lockfile for downloading (using the tarball itself)

  Many thanks to @BrightXiaoHan (https://github.com/BrightXiaoHan) for the bulk of
  the code contributions in this release.

- 2.1.0 (2022-05-19)
  Features:
  - Added `-tok spm` for multilingual SPM tokenization (#168)
    (thanks to Naman Goyal and James Cross at Facebook)

  Fixes:
  - Handle potential memory usage issues due to LRU caching in tokenizers (#167)
  - Bugfix: BLEU.corpus_score() now using max_ngram_order (#173)
  - Upgraded ja-mecab to 1.0.5 (#196)

- 2.0.0 (2021-07-18)
  - Build: Add Windows and OS X testing to Travis CI.
48 changes: 36 additions & 12 deletions README.md
@@ -59,6 +59,11 @@ following command instead, to perform a full installation with dependencies:

pip install "sacrebleu[ja]"

In order to install Korean tokenizer support through `pymecab-ko`, you need to run the
following command instead, to perform a full installation with dependencies:

pip install "sacrebleu[ko]"

# Command-line Usage

You can get a list of available test sets with `sacrebleu --list`. Please see [DATASETS.md](DATASETS.md)
@@ -68,17 +73,15 @@ for an up-to-date list of supported datasets.

### Downloading test sets

Downloading happens automatically when you request a test set: if it is not available
locally, it is downloaded and unpacked.

E.g., you can use the following commands to download the source, pass it through your translation system
in `translate.sh`, and then score it:

```
$ sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en
$ cat wmt17.en-de.en | translate.sh | sacrebleu -t wmt17 -l en-de
```

### JSON output
@@ -194,8 +197,8 @@ BLEU related arguments:
Smoothing method: exponential decay, floor (increment zero counts), add-k (increment num/denom by k for n>1), or none. (Default: exp)
--smooth-value BLEU_SMOOTH_VALUE, -sv BLEU_SMOOTH_VALUE
The smoothing value. Only valid for floor and add-k. (Defaults: floor: 0.1, add-k: 1)
--tokenize {none,zh,13a,char,intl,ja-mecab,ko-mecab}, -tok {none,zh,13a,char,intl,ja-mecab,ko-mecab}
Tokenization method to use for BLEU. If not provided, defaults to `zh` for Chinese, `ja-mecab` for Japanese, `ko-mecab` for Korean and `13a` (mteval) otherwise.
--lowercase, -lc If True, enables case-insensitivity. (Default: False)
--force Insist that your tokenized input is actually detokenized.
@@ -220,6 +223,26 @@ TER related arguments (The defaults replicate TERCOM's behavior):
### Version Signatures
As you may have noticed, sacreBLEU generates version strings such as `BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0` for reproducibility reasons. It's strongly recommended to share these signatures in your papers!
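These signature strings have a regular `metric|key:value|…` shape, so they are easy to carry along in experiment logs. A minimal, hypothetical parsing sketch (plain Python, not part of the sacreBLEU API; the field names come from the example above):

```python
# Parse a sacreBLEU version signature like the example above into a dict.
# Field names (nrefs, case, tok, ...) are taken from that example; other
# metrics may emit different fields.

def parse_signature(sig: str) -> dict:
    metric, *fields = sig.split("|")
    parsed = {"metric": metric}
    for field in fields:
        key, _, value = field.partition(":")
        parsed[key] = value
    return parsed

sig = "BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0"
info = parse_signature(sig)
print(info["metric"], info["tok"], info["version"])  # BLEU 13a 2.0.0
```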

### Outputting other metadata

sacreBLEU knows about metadata for some test sets, and you can output it like this:

```
$ sacrebleu -t wmt21 -l en-de --echo src docid ref | head -n2
Couple MACED at California dog park for not wearing face masks while having lunch (VIDEO) - RT USA News rt.com.131279 Paar in Hundepark in Kalifornien mit Pfefferspray besprüht, weil es beim Mittagessen keine Masken trug (VIDEO) - RT USA News
There's mask-shaming and then there's full on assault. rt.com.131279 Masken-Shaming ist eine Sache, Körperverletzung eine andere.
```

If multiple fields are requested, they are output as tab-separated columns (a TSV).

To see the available fields, add `--echo asdf` (or some other garbage data):

```
$ sacrebleu -t wmt21 -l en-de --echo asdf
sacreBLEU: No such field asdf in test set wmt21 for language pair en-de.
sacreBLEU: available fields for wmt21/en-de: src, ref:A, ref, docid, origlang
```
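Because multi-field `--echo` output is plain TSV, it is easy to post-process with standard tooling. A small illustrative sketch (plain Python; the column order simply follows the fields requested on the command line):

```python
# Group `sacrebleu ... --echo src docid ref` output lines by document id.
# Columns arrive in the order the fields were requested with --echo.

def group_by_docid(lines):
    docs = {}
    for line in lines:
        src, docid, ref = line.rstrip("\n").split("\t")
        docs.setdefault(docid, []).append((src, ref))
    return docs

# Two segments from the same document, as in the wmt21 example above:
tsv = [
    "Couple MACED at California dog park ...\trt.com.131279\tPaar in Hundepark ...\n",
    "There's mask-shaming ...\trt.com.131279\tMasken-Shaming ...\n",
]
docs = group_by_docid(tsv)
print(len(docs["rt.com.131279"]))  # 2
```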

## Translationese Support

If you are interested in the translationese effect, you can evaluate BLEU on a subset of sentences
@@ -247,11 +270,12 @@ but it expects that you pass through the entire translated test set.
- `intl` applies international tokenization and mimics the `mteval-v14` script from Moses
- `zh` separates out **Chinese** characters and tokenizes the non-Chinese parts using `13a` tokenizer
- `ja-mecab` tokenizes **Japanese** inputs using the [MeCab](https://pypi.org/project/mecab-python3) morphological analyzer
- `ko-mecab` tokenizes **Korean** inputs using the [MeCab-ko](https://pypi.org/project/mecab-ko) morphological analyzer
- `spm` uses the SentencePiece model built from the Flores-101 dataset (https://github.com/facebookresearch/flores#list-of-languages). Note: the canonical .spm file will be automatically fetched if not found locally.
- You can switch tokenizers using the `--tokenize` flag of sacreBLEU. Alternatively, if you provide language-pair strings
using `--language-pair/-l`, `zh`, `ja-mecab` and `ko-mecab` tokenizers will be used if the target language is `zh` or `ja` or `ko`, respectively.
- **Note that** there's no automatic language detection from the hypotheses so you need to make sure that you are correctly
selecting the tokenizer for **Japanese**, **Korean** and **Chinese**.
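The default-selection rule described above amounts to a small lookup on the target language. The sketch below only mirrors the documented behavior; it is not sacreBLEU's internal code:

```python
# Map a target language code to the default tokenizer, mirroring the
# documented rule: zh -> zh, ja -> ja-mecab, ko -> ko-mecab, else 13a.
DEFAULT_TOKENIZERS = {"zh": "zh", "ja": "ja-mecab", "ko": "ko-mecab"}

def default_tokenizer(langpair: str) -> str:
    """Pick the default tokenizer from an `-l src-tgt` language pair."""
    target = langpair.split("-")[-1]
    return DEFAULT_TOKENIZERS.get(target, "13a")

print(default_tokenizer("en-ja"))  # ja-mecab
print(default_tokenizer("en-de"))  # 13a
```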


Default 13a tokenizer will produce poor results for Japanese:
2 changes: 1 addition & 1 deletion sacrebleu/__init__.py
@@ -14,7 +14,7 @@
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

__version__ = '2.2.0'
__description__ = 'Hassle-free computation of shareable, comparable, and reproducible BLEU, chrF, and TER scores'


7 changes: 7 additions & 0 deletions sacrebleu/compat.py
@@ -15,6 +15,8 @@ def corpus_bleu(hypotheses: Sequence[str],
tokenize=BLEU.TOKENIZER_DEFAULT,
use_effective_order=False) -> BLEUScore:
"""Computes BLEU for a corpus against a single (or multiple) reference(s).
This is the main CLI entry point for computing BLEU between a system output
and a reference sentence.
:param hypotheses: A sequence of hypothesis strings.
:param references: A sequence of reference documents with document being
@@ -42,11 +44,16 @@ def raw_corpus_bleu(hypotheses: Sequence[str],
This convenience function assumes a particular set of arguments i.e.
it disables tokenization and applies a `floor` smoothing with value `0.1`.
This convenience call does not apply any tokenization at all,
neither to the system output nor the reference. It just computes
BLEU on the "raw corpus" (hence the name).
:param hypotheses: A sequence of hypothesis strings.
:param references: A sequence of reference documents with document being
defined as a sequence of reference strings.
:param smooth_value: The smoothing value for `floor`. If not given, the default of 0.1 is used.
:return: Returns a `BLEUScore` object.
"""
return corpus_bleu(
hypotheses, references, smooth_method='floor',