System submitted to the IWSLT 2022 offline speech translation task by the UPC Machine Translation group.
The paper is available here.
This paper describes the submissions of the UPC Machine Translation group to the IWSLT 2022 Offline Speech Translation and Speech-to-Speech Translation tracks. The offline task involves translating English speech to German, Japanese and Chinese text. Our Speech Translation systems are trained end-to-end and are based on large pretrained speech and text models. We use an efficient fine-tuning technique that trains only specific layers of our system, and explore the use of adapter modules for the non-trainable layers. We further investigate the suitability of different speech encoders (wav2vec 2.0, HuBERT) for our models and the impact of knowledge distillation from the Machine Translation model that we use for the decoder (mBART). For segmenting the IWSLT test sets we fine-tune a pretrained audio segmentation model and achieve improvements of 5 BLEU compared to the given segmentation. Our best single model uses HuBERT and parallel adapters and achieves 29.42 BLEU at English-German MuST-C tst-COMMON and 26.77 at IWSLT 2020 test. By ensembling many models, we further increase translation quality to 30.83 BLEU and 27.78 accordingly. Furthermore, our submission for English-Japanese achieves 15.85 and English-Chinese obtains 25.63 BLEU on the MuST-C tst-COMMON sets. Finally, we extend our system to perform English-German Speech-to-Speech Translation with a pretrained Text-to-Speech model.
title = "Pretrained Speech Encoders and Efficient Fine-tuning Methods for Speech Translation: {UPC} at {IWSLT} 2022",
author = "Tsiamas, Ioannis and
G{\'a}llego, Gerard I. and
Escolano, Carlos and
Fonollosa, Jos{\'e} and
Costa-juss{\`a}, Marta R.",
booktitle = "Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)",
month = may,
year = "2022",
address = "Dublin, Ireland (in-person and online)",
publisher = "Association for Computational Linguistics",
url = "",
pages = "265--276",
- Environment Setup
- Pretrained Models
- Data
- Knowledge Distillation
- Training
- MuST-C Evaluation
- IWSLT Evaluation
Set the environment variables:
export IWSLT_ROOT=... # where to clone this repo
export FAIRSEQ_ROOT=... # where to clone fairseq
Clone this repository to $IWSLT_ROOT
git clone --recursive ${IWSLT_ROOT}
Create a conda environment using the environment.yml file, activate it and install Fairseq:
conda env create -f ${IWSLT_ROOT}/environment.yml && \
conda activate iwslt22 && \
pip install --editable ${IWSLT_ROOT}/fairseq/
Install NVIDIA's apex library for faster training with fp16 precision:
git clone && cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
In this project we use pre-trained speech encoders and text decoders.
Download the HuBERT, wav2vec 2.0 and mBART models to $MODELS_ROOT:
export MODELS_ROOT=...
mkdir -p ${MODELS_ROOT}/{wav2vec,hubert,mbart}
wget -P ${MODELS_ROOT}/wav2vec
wget -P ${MODELS_ROOT}/hubert
wget -O - | \
tar -xz --strip-components 1 -C ${MODELS_ROOT}/mbart
Set the data environment variables:
export MUSTC_ROOT=... # where to download MuST-C v2
export CV_ROOT=... # where to download the CommonVoice corpus 8.0
export EUROPARL_ROOT=... # where to download Europarl-ST
export IWSLT_TST_ROOT=... # where to download the IWSLT test sets
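Since every later step reads these variables, it can help to confirm they are set before starting any download. The following is a minimal sketch and not part of the repository; the helper name `check_vars` is hypothetical.

```shell
# Hypothetical helper (not part of the repo): print the names of the given
# variables that are unset or empty, and return non-zero if any is missing.
check_vars() {
    local name value missing=0
    for name in "$@"; do
        # look up the value of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the variables defined above:
# check_vars MUSTC_ROOT CV_ROOT EUROPARL_ROOT IWSLT_TST_ROOT
```

An empty output means all listed variables are defined in the current shell.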
Download MuST-C v2 en-de, en-ja and en-zh to $MUSTC_ROOT
The dataset is available here. Press the button "click here to download the corpus" and select version V2.
Download the Common Voice version 8 and the CoVoST tsvs (en-de, en-ja, en-zh) to $CV_ROOT
mkdir -p ${COVOST_ROOT}/{en-de,en-ja,en-zh}
wget -P ${COVOST_ROOT}
wget -P ${COVOST_ROOT}/en-de
wget -P ${COVOST_ROOT}/en-zh
wget -P ${COVOST_ROOT}/en-ja
Download Europarl-ST v1.1 to $EUROPARL_ROOT
mkdir -p ${EUROPARL_ROOT}
wget -O - | tar -xz --strip-components 1 -C ${EUROPARL_ROOT}
Download the IWSLT data (tst.2019,tst.2020,tst.2021,tst.2022):
mkdir -p $IWSLT_TST_ROOT
for year in {2019,2020,2021}; do
    tar -xvf IWSLT-SLT.tst${year}.en-de.tgz -C ${IWSLT_TST_ROOT}
    rm IWSLT-SLT.tst${year}.en-de.tgz
done
for tgt_lang in {de,ja,zh}; do
    tar -xvf IWSLT-SLT.tst2022.en-${tgt_lang}.tgz -C ${IWSLT_TST_ROOT}
    rm IWSLT-SLT.tst2022.en-${tgt_lang}.tgz
    # get the file order for this pair
    cut -d' ' -f1 ${IWSLT_TST_ROOT}/IWSLT.tst2022/IWSLT.TED.tst2022.en-${tgt_lang}.en.video_url > ${IWSLT_TST_ROOT}/IWSLT.tst2022/FILER_ORDER.en-${tgt_lang}
done
Convert the Common Voice clips to 16kHz and mono:
(We only need to convert the ones in the train, dev and test splits)
mkdir -p ${CV_ROOT}/en/clips_mono_16k
for split in {train,dev,test}; do
    cat ${COVOST_ROOT}/${split}.tsv | cut -f2 | parallel -j $(nproc) ffmpeg -i ${CV_ROOT}/en/clips/{} \
    -ac 1 -ar 16000 -hide_banner -loglevel error ${CV_ROOT}/en/clips_mono_16k/{.}.wav
done
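In the command above, GNU parallel replaces `{}` with the input clip name and `{.}` with the same name minus its extension, so each clip is written as a `.wav` file with the same stem. The sketch below mirrors that mapping with plain parameter expansion; the function name `clip_to_wav` and the example filename are hypothetical.

```shell
# Hypothetical helper: map a clip filename to the name of its converted .wav
# file, mirroring what the {.} placeholder of GNU parallel does.
clip_to_wav() {
    # ${1%.*} removes the shortest suffix that starts with a dot (the extension)
    printf '%s.wav\n' "${1%.*}"
}

# Example with a made-up clip name:
# clip_to_wav common_voice_en_123.mp3
```

In the ffmpeg call, `-ac 1` downmixes to one channel and `-ar 16000` resamples to 16 kHz.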
Prepare the tsvs for the MuST-C, Europarl-ST and CoVoST data:
We do this process for both the ASR and ST tasks and for all language pairs.
We only prepare the tsvs and do not learn a vocabulary since we will reuse the one from mBART50.
# MuST-C (en-de,en-zh,en-ja)
for task in {asr,st}; do
    python ${IWSLT_ROOT}/scripts/data_prep/ \
    --data-root ${MUSTC_ROOT} --task $task --use-audio-input --only-manifest --append-lang-id
done
# Europarl-ST (en-de)
for task in {asr,st}; do
    python ${IWSLT_ROOT}/scripts/data_prep/ \
    -d ${EUROPARL_ROOT} --lang-pair en-de --task $task --use-audio-input --only-manifest --append-lang-id
done
# CoVoST (en-de,en-zh,en-ja)
for tgt_lang in {de,zh-CN,ja}; do
    for task in {asr,st}; do
        python ${IWSLT_ROOT}/scripts/data_prep/ \
        -d $COVOST_ROOT -s en -t $tgt_lang --append-lang-id
    done
done
Do ASR inference on the "train" sets using a pre-trained wav2vec 2.0 model and save the results at $FILTERING_ROOT:
export FILTERING_ROOT=...
# MuST-C
for tgt_lang in {de,ja,zh}; do
    python ${IWSLT_ROOT}/scripts/filtering/ \
    --tsv_path ${MUSTC_ROOT}/en-${tgt_lang}/train_asr.tsv -o ${FILTERING_ROOT}/MUSTC_v2.0/en
done
# Europarl-ST
for split in {train,dev,test}; do
    python ${IWSLT_ROOT}/scripts/filtering/ \
    --tsv_path ${EUROPARL_ROOT}/en/en-de_${split}_asr.tsv -o ${FILTERING_ROOT}/EuroparlST/en
done
# CoVoST
for split in {train,dev,test}; do
    for tgt_lang in {de,ja,zh}; do
        python ${IWSLT_ROOT}/scripts/filtering/ \
        --tsv_path ${COVOST_ROOT}/en-${tgt_lang}/${split}_asr.tsv -o ${FILTERING_ROOT}/CoVoST/en
    done
done
Apply ASR-based and text-based filtering to create clean versions of the train sets:
# MuST-C
for tgt_lang in {de,ja,zh}; do
    python ${IWSLT_ROOT}/scripts/filtering/ \
    -tsv ${MUSTC_ROOT}/en-${tgt_lang}/train_st.tsv \
    -p ${FILTERING_ROOT}/MUSTC_v2.0/en/train_asr_wer_results.json \
    -o ${MUSTC_ROOT}/en-${tgt_lang} \
    -par -wer 0.75
done
# Europarl-ST
for split in {train,dev,test}; do
    python ${IWSLT_ROOT}/scripts/filtering/ \
    -tsv ${EUROPARL_ROOT}/en/en-de_${split}_st.tsv \
    -p ${FILTERING_ROOT}/EuroparlST/en/en-de_${split}_asr_wer_results.json \
    -o ${EUROPARL_ROOT}/en \
    -par -wer 0.75
done
# CoVoST
for tgt_lang in {de,ja,zh}; do
    for split in {train,dev,test}; do
        python ${IWSLT_ROOT}/scripts/filtering/ \
        -tsv ${COVOST_ROOT}/en-${tgt_lang}/${split}_st.tsv \
        -p ${FILTERING_ROOT}/CoVoST/en/${split}_asr_wer_results.json \
        -o ${COVOST_ROOT}/en-${tgt_lang} \
        -par -wer 0.75
    done
done
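The idea behind the `-wer 0.75` threshold can be illustrated with a small sketch. It assumes that examples whose ASR word error rate exceeds the threshold are dropped, and it reads a hypothetical two-column TSV (`id<TAB>wer`) rather than the JSON file produced by the repository scripts, whose exact format is not shown here. The function name `filter_by_wer` is also hypothetical.

```shell
# Hypothetical illustration of WER-based filtering (not the repo script):
# read "id<TAB>wer" rows from stdin and print the ids whose WER does not
# exceed the threshold given as the first argument.
filter_by_wer() {
    awk -F '\t' -v max="$1" '$2 + 0 <= max + 0 { print $1 }'
}

# Example with made-up ids and WER values:
# printf 'utt1\t0.10\nutt2\t0.90\n' | filter_by_wer 0.75
```

A high WER between the ASR output and the transcript usually signals misaligned or noisy audio, which is why such examples are candidates for removal.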
Set up the path:
export DATA_ROOT=...
mkdir -p ${DATA_ROOT}/{en-de,en-zh,en-ja}
Make symbolic links:
# from MuST-C
for tgt_lang in {de,ja,zh}; do
    for task in {asr,st}; do
        ln -s ${MUSTC_ROOT}/en-${tgt_lang}/train_${task}_filtered.tsv ${DATA_ROOT}/en-${tgt_lang}/train_${task}_mustc.tsv
        ln -s ${MUSTC_ROOT}/en-${tgt_lang}/dev_${task}.tsv ${DATA_ROOT}/en-${tgt_lang}/dev_${task}_mustc.tsv
        ln -s ${MUSTC_ROOT}/en-${tgt_lang}/tst-COMMON_${task}.tsv ${DATA_ROOT}/en-${tgt_lang}/tst-COMMON_${task}_mustc.tsv
    done
done
# from Europarl-ST
for split in {train,dev,test}; do
    for task in {asr,st}; do
        if [[ $split != train ]]; then
            ln -s ${EUROPARL_ROOT}/en/en-de_${split}_${task}_filtered.tsv ${DATA_ROOT}/en-de/train_${split}_${task}_europarl.tsv
        else
            ln -s ${EUROPARL_ROOT}/en/en-de_${split}_${task}_filtered.tsv ${DATA_ROOT}/en-de/${split}_${task}_europarl.tsv
        fi
    done
done
# from CoVoST
for tgt_lang in {de,ja,zh}; do
    for split in {train,dev,test}; do
        for task in {asr,st}; do
            if [[ $split != train ]]; then
                ln -s ${COVOST_ROOT}/en-${tgt_lang}/${split}_${task}_filtered.tsv ${DATA_ROOT}/en-${tgt_lang}/train_${split}_${task}_covost.tsv
            else
                ln -s ${COVOST_ROOT}/en-${tgt_lang}/${split}_${task}_filtered.tsv ${DATA_ROOT}/en-${tgt_lang}/${split}_${task}_covost.tsv
            fi
        done
    done
done
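The naming of the Europarl-ST and CoVoST links can be summarised with a short sketch. It assumes the `if` test above is meant to choose between the two link names, so that the dev and test splits of these corpora get a `train_` prefix and are matched by the `train*` patterns used later. The function name `link_name` is hypothetical.

```shell
# Hypothetical helper: compute the link name for a given split, task and
# dataset tag, assuming non-train splits are prefixed with "train_".
link_name() {
    # $1: split (train, dev or test), $2: task (asr or st), $3: dataset tag
    if [ "$1" != "train" ]; then
        printf 'train_%s_%s_%s.tsv\n' "$1" "$2" "$3"
    else
        printf '%s_%s_%s.tsv\n' "$1" "$2" "$3"
    fi
}

# Examples:
# link_name dev st europarl
# link_name train asr covost
```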
We are using knowledge distillation for en-de with mBART50 as the teacher.
Extract the top-k probabilities offline before training and save them at $KD_ROOT
export KD_ROOT=...
for asr_tsv_file in ${DATA_ROOT}/en-de/train*asr*.tsv; do
    st_tsv_file=$(echo "$asr_tsv_file" | sed "s/_asr_/_st_/g")
    kd_subdir=$(basename "$st_tsv_file" .tsv)
    python ${IWSLT_ROOT}/knowledge_distillation/ \
    -asr "$asr_tsv_file" -st "$st_tsv_file" -o ${KD_ROOT}/en-de/${kd_subdir}
done
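The loop above derives the ST tsv path and the output sub-directory from each ASR tsv path. The sketch below isolates that derivation so it can be checked on a single path; the function name `kd_subdir_for` and the example path are hypothetical.

```shell
# Hypothetical helper: given the path of an ASR tsv, print the name of the
# sub-directory the loop would use for the top-k outputs of the matching ST tsv.
kd_subdir_for() {
    local st_tsv_file
    # swap the task tag; note that sed rewrites every "_asr_" in the path,
    # including any that appear in directory names
    st_tsv_file=$(echo "$1" | sed "s/_asr_/_st_/g")
    # drop the directory part and the .tsv extension
    basename "$st_tsv_file" .tsv
}

# Example with a made-up data root:
# kd_subdir_for /data/en-de/train_asr_mustc.tsv
```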
Set up the path to save the training outputs:
export SAVE_DIR=...
All our experiments can be found at ${IWSLT_ROOT}/config
To train an experiment called EXP_NAME, run the following command:
EXP_NAME=... # one of the available experiments
base_update_freq=... # the update_freq of the experiment when training on a single GPU
# adjust the update_freq according to the number of available GPUs
n_gpus=$(nvidia-smi --list-gpus | wc -l)
fairseq-hydra-train \
--config-dir ${IWSLT_ROOT}/config/ \
--config-name ${EXP_NAME}.yaml \
dataset.num_workers=$(($(nproc) / 2)) \
optimization.update_freq=[$(( $base_update_freq / $n_gpus ))]
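Fairseq accumulates gradients over `update_freq` batches before each optimizer step, so the effective batch size grows with both the number of GPUs and `update_freq`. Dividing a single-GPU value by the number of GPUs keeps the effective batch size constant, provided the division is exact. The sketch below makes that check explicit; the function name `scaled_update_freq` and the value 24 in the example are hypothetical.

```shell
# Hypothetical helper: divide a single-GPU update_freq by the number of GPUs,
# failing when the division is not exact (integer arithmetic would otherwise
# silently round down and shrink the effective batch size).
scaled_update_freq() {
    # $1: update_freq for one GPU, $2: number of GPUs
    if [ "$2" -le 0 ] || [ $(( $1 % $2 )) -ne 0 ]; then
        echo "update_freq $1 is not divisible by $2 GPUs" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# Example with a made-up single-GPU value of 24 on 4 GPUs:
# scaled_update_freq 24 4
```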
To generate the translations for the MuST-C dev or tst-COMMON sets run the following command:
EXP_NAME=... # one of the trained experiments
CKPT_NAME=... # the name of a .pt file
SUBSET=... # dev_mustc or tst-COMMON_mustc
TGT_LANG=... # de, zh or ja
To generate translations for the IWSLT test sets, we first have to segment the audio files.
We are using SHAS. Clone the SHAS repo at $SHAS_ROOT
git clone ${SHAS_ROOT}
Create an environment for the segmentation:
conda env create -f ${SHAS_ROOT}/environment.yml
Download the Multilingual checkpoint for the Segmentation Frame Classifier at $SHAS_ROOT/
Segment the wav files of the IWSLT test sets with the multilingual classifier and the pDAC algorithm with max-segment-length of 16 and inference-times of 3, which were found to be optimal. Save the segmentation yaml at $path_to_custom_segmentation_yaml
conda activate shas
SUBSET=... # IWSLT.tst2019, IWSLT.tst2020, IWSLT.tst2021 or IWSLT.tst2022
python ${SHAS_ROOT}/src/supervised_hybrid/ \
-wavs ${IWSLT_TST_ROOT}/${SUBSET}/wavs \
-ckpt ${SHAS_ROOT}/ \
-yaml $path_to_custom_segmentation_yaml \
-max 16 -n 3
To evaluate translations from a custom segmentation, we use mwerSegmenter to align the hypotheses with the references.
Download mwerSegmenter at ${MWERSEGMENTER_ROOT} and follow the instructions in ${MWERSEGMENTER_ROOT}/README to install it:
tar -zxvf mwerSegmenter.tar.gz -C ${MWERSEGMENTER_ROOT}
rm -r mwerSegmenter.tar.gz
We also need a python2 environment to run it:
conda create -n snakes27 python=2.7
Generate translations on the created segmentation and calculate the BLEU scores if the $SUBSET is IWSLT.tst2019 or IWSLT.tst2020:
${IWSLT_ROOT}/scripts/segmentation/ \
${SAVE_DIR}/${EXP_NAME}/ckpts/${CKPT_NAME} \