This repository contains the code used to train and evaluate our High Coverage Translation System, as described in our paper "Training and Inference Methods for High-Coverage Neural Machine Translation", published at ACL 2020.
The pretrained model we worked from was the JParaCrawl English-to-Japanese Transformer 'base' model (and associated sentencepiece models) from
Model finetuning was conducted in the jparacrawl-finetune
submodule, branched from
The shell scripts for training our various models were the jparacrawl-finetune/
and jparacrawl-finetune/staple_*.sh
scripts. These were adapted from the original scripts in the jparacrawl-finetune
We did not use duolingo-sharedtask-2020/
from the shared task starter code.
Models for the neural filtering model can be found in filtering.
Our Duolingo train/dev/test gold data can be found at duolingo-sharedtask-2020/staple-2020-train/en_ja/*.
We created these split files from the original
provided for the shared task (excluded from the repo due to excessive file size).
We used the 'official splits' JESC data from
Before training or evaluation, the raw sentences in these files were encoded into subword sequences using the aforementioned sentencepiece models from JParaCrawl.
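
The encoding step described above can be sketched roughly as follows. This is an illustration rather than the exact code in this repository: `load_sentencepiece_encoder`, `encode_lines`, and `toy_encode` are hypothetical names, `model_path` is a placeholder for wherever the JParaCrawl sentencepiece model is stored locally, and the third-party `sentencepiece` package is assumed to be installed when the real encoder is used.

```python
def load_sentencepiece_encoder(model_path):
    """Return a function mapping a raw sentence to a list of subword pieces.

    Assumes the third-party `sentencepiece` package is installed;
    `model_path` is a placeholder, not a path from this repository.
    """
    # Imported lazily so the helpers below also work without the package.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor(model_file=model_path)
    return lambda sentence: processor.encode(sentence, out_type=str)


def encode_lines(lines, encode):
    """Encode each raw sentence and join its pieces with single spaces."""
    return [" ".join(encode(line.rstrip("\n"))) for line in lines]


def toy_encode(sentence):
    """Stand-in encoder for demonstration: one piece per whitespace-separated word."""
    return ["\u2581" + word for word in sentence.split()]
```

With a real model, `encode_lines(open(path, encoding="utf-8"), load_sentencepiece_encoder(model_path))` would produce one space-separated subword sequence per input sentence; `toy_encode` only exists so the sketch can be tried without a model file.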
Experiments for the neural filtering model can be found in filtering.
We adapted the scoring pipeline provided in the shared task starter code.
We edited duolingo-sharedtask-2020/
to decode our subword translation outputs using our sentencepiece model.
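
For reference, decoding a space-separated sentencepiece sequence back to plain text can be done at the string level, as in the sketch below. It may not match how the edited script does it (which could instead call the sentencepiece model directly), and `decode_pieces` is a hypothetical name.

```python
def decode_pieces(line):
    """Turn a space-separated sentencepiece sequence back into plain text."""
    # Spaces only separate pieces, so they are dropped first;
    # U+2581 marks where the original spaces were.
    return line.replace(" ", "").replace("\u2581", " ").strip()
```

For example, `decode_pieces("\u2581I \u2581li ke \u2581tea")` yields `"I like tea"`.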
We edited duolingo-sharedtask-2020/
to print descriptive statistics for the number of candidates output per prompt.
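
A minimal sketch of such per-prompt statistics is given below. It assumes the system output has already been parsed into a dictionary mapping each prompt ID to its list of candidate translations; that data structure and the name `candidate_count_stats` are assumptions for illustration, not the format used by the starter code.

```python
import statistics


def candidate_count_stats(candidates_by_prompt):
    """Summarise how many candidates were produced for each prompt."""
    counts = [len(cands) for cands in candidates_by_prompt.values()]
    if not counts:
        raise ValueError("no prompts given")
    return {
        "prompts": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
    }
```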
We created duolingo-sharedtask-2020/
to conduct qualitative error analysis between the outputs of different models.
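
One simple way to compare two models' outputs is to list, for each prompt, the candidates produced by only one of them. The sketch below illustrates that idea and is not the analysis script itself; `compare_outputs` and the dictionary-of-lists input format are hypothetical.

```python
def compare_outputs(outputs_a, outputs_b):
    """For each prompt, split candidates into those unique to A, unique to B, and shared."""
    report = {}
    for prompt in sorted(set(outputs_a) | set(outputs_b)):
        a = set(outputs_a.get(prompt, []))
        b = set(outputs_b.get(prompt, []))
        report[prompt] = {
            "only_a": sorted(a - b),
            "only_b": sorted(b - a),
            "both": sorted(a & b),
        }
    return report
```

Reading through the `only_a` and `only_b` lists side by side gives a quick view of where two models diverge on the same prompt.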
To create model inputs for the Duolingo blind dev and test sets, we created duolingo-sharedtask-2020/
and duolingo-sharedtask-2020/
files, adapted from duolingo-sharedtask-2020/