
Self-Modifying State Modeling for Simultaneous Machine Translation

Source code for the ACL 2024 main conference paper "Self-Modifying State Modeling for Simultaneous Machine Translation".

Our model is implemented on top of the open-source toolkit Fairseq and the open-source code of ITST.

Requirements and Installation

  • Python >= 3.7.10

  • torch >= 1.13.0

  • sacrebleu == 1.5.0

  • Install Fairseq with the following commands:

    # clone the SM2 repository (which contains the adapted Fairseq)
    # and install it in editable mode
    git clone https://github.com/EurekaForNLP/SM2.git
    cd SM2
    pip install --editable ./

Quick Start

Data Processing
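
Since the code builds on Fairseq and ITST, data is expected in binarized Fairseq format. As a rough sketch, tokenized and BPE-segmented parallel text can be binarized with the standard Fairseq pipeline; the language pair, file prefixes, and destination directory below are placeholders for your own dataset:

    # Sketch: binarize tokenized/BPE data with the standard Fairseq tool.
    # Language pair and paths are placeholders.
    fairseq-preprocess \
        --source-lang de --target-lang en \
        --trainpref data/train --validpref data/valid --testpref data/test \
        --destdir data-bin/deen \
        --workers 16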

Training

Use train_sm2.sh to train SM$^2$. Note the following (a sketch of a typical invocation appears after this list):

  • Use --arch transformer_with_sm2_unidirectional for SM$^2$ with a unidirectional encoder.
  • If your device supports bf16, adding --bf16 is recommended.
  • If the source and target languages share embeddings, use --share-all-embeddings.
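
As a sketch, and assuming train_sm2.sh wraps fairseq-train, a typical invocation might look like the following. The optimizer, criterion, and other hyperparameters are illustrative placeholders drawn from common Fairseq setups, not the paper's exact configuration; consult train_sm2.sh for the authoritative settings.

    # Sketch of a training invocation; hyperparameters and paths are
    # illustrative placeholders, not the repository's exact configuration.
    fairseq-train data-bin/deen \
        --arch transformer_with_sm2_unidirectional \
        --share-all-embeddings \
        --bf16 \
        --optimizer adam --adam-betas '(0.9, 0.98)' \
        --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 8192 \
        --save-dir checkpoints/sm2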

Inference

Use test_sm2.sh to run simultaneous-translation inference with --batch-size=1 and --beam=1; a sketch of the underlying command follows.
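
Assuming test_sm2.sh wraps Fairseq's generation entry point, the core of the command is roughly as follows. The checkpoint path is a placeholder, and the repository's simultaneous-decoding logic may add flags not shown here.

    # Sketch only: checkpoint path is a placeholder, and SM^2-specific
    # simultaneous-decoding options are omitted.
    fairseq-generate data-bin/deen \
        --path checkpoints/sm2/checkpoint_best.pt \
        --batch-size 1 --beam 1 \
        --remove-bpe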