# Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

This is the official code for the paper "Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision". We release the code, models, and data for the whole Whistle pipeline; they can be found in the respective experimental folders. Below are brief instructions for each folder and a summary of the experimental results.

## data

`data` contains `metainfo.json`, the file used for efficient data management across the experiments.
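For a quick look at what the file holds, here is a minimal Python sketch that only loads the JSON and lists its top-level entries; the path follows the folder layout above, and no assumption is made about the internal schema.

```python
import json

# Load the dataset metadata shipped in the data folder.
with open("data/metainfo.json", encoding="utf-8") as f:
    metainfo = json.load(f)

# List the top-level entries (e.g. per-language records) without
# assuming any particular field layout.
for key, value in metainfo.items():
    print(key, type(value).__name__)
```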

## lang-process

All of our ASR models are trained with processed data covering 12 languages (the 10 seen languages of CV-lang10 and 2 unseen languages), sourced from the publicly available Common Voice 11.0. The data processing for each language is detailed in lang-process.md. For convenience, we adopt ISO-639-1 codes as language IDs; the 12 languages and their training hours are as follows.

| Serial number | Language   | Language ID | Training hours |
|---------------|------------|-------------|----------------|
| 1             | English    | en          | 2227.3         |
| 2             | Spanish    | es          | 382.3          |
| 3             | French     | fr          | 823.4          |
| 4             | Italian    | it          | 271.5          |
| 5             | Kyrgyz     | ky          | 32.7           |
| 6             | Dutch      | nl          | 70.2           |
| 7             | Russian    | ru          | 149.8          |
| 8             | Swedish    | sv-SE       | 29.8           |
| 9             | Turkish    | tr          | 61.5           |
| 10            | Tatar      | tt          | 20.8           |
| 11            | Polish     | pl          | 130            |
| 12            | Indonesian | id          | 20.8           |

## local

`local` contains `data_prep.md`, which describes how to prepare the data and generate a pronunciation lexicon for each language. It also includes some useful tools for debugging our experiments.
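For illustration only, the following sketch turns a few words into lexicon-style entries with the `phonemizer` package and its espeak backend; this is an assumed stand-in for demonstration, not necessarily the G2P toolchain described in `data_prep.md`, and the word list is hypothetical.

```python
from phonemizer import phonemize
from phonemizer.separator import Separator

# Hypothetical Dutch ("nl") word list; in practice the vocabulary comes
# from the prepared Common Voice transcripts.
words = ["goedemorgen", "wereld"]

# Emit lexicon-style lines of the form "<word>\t<phone1> <phone2> ...".
for word in words:
    phones = phonemize(
        word,
        language="nl",
        backend="espeak",
        separator=Separator(phone=" ", word=""),
        strip=True,
    )
    print(f"{word}\t{phones}")
```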

## exp

`exp` contains the configuration files and the detailed training process of our models.

## Experiment setup

We adopt the Conformer architecture with CTC to train our models. Three training strategies are compared: monolingual, multilingual, and crosslingual training.
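As a rough, non-authoritative sketch of this setup (not the repo's actual training code: the layer sizes, feature dimensions, and phoneme-set size below are placeholders, and torchaudio's Conformer is used for brevity):

```python
import torch
import torchaudio

num_phonemes = 60  # assumed size of the shared phoneme inventory (illustrative)

# Conformer acoustic encoder over 80-dim log-Mel features.
encoder = torchaudio.models.Conformer(
    input_dim=80,
    num_heads=4,
    ffn_dim=1024,
    num_layers=12,
    depthwise_conv_kernel_size=31,
)
# Linear projection to phoneme posteriors (+1 for the CTC blank).
classifier = torch.nn.Linear(80, num_phonemes + 1)
ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy batch: (batch, frames, mels) features and phoneme-ID targets.
feats = torch.randn(2, 400, 80)
feat_lens = torch.tensor([400, 350])
targets = torch.randint(1, num_phonemes + 1, (2, 30))
target_lens = torch.tensor([30, 25])

enc_out, enc_lens = encoder(feats, feat_lens)        # (batch, frames, 80)
log_probs = classifier(enc_out).log_softmax(dim=-1)  # (batch, frames, num_phonemes + 1)
loss = ctc_loss(log_probs.transpose(0, 1), targets, enc_lens, target_lens)
```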

10 monolingual phoneme-based ASR models are trained separately on the training set of each language and evaluated on the corresponding test set without fine-tuning. For Indonesian and Polish, the training data is divided into three scales: 1 hour, 10 hours, and full; phoneme-based and subword-based models are trained separately at each scale.

3 phoneme-based multilingual models of different sizes are trained: small (90 MB), medium (218 MB), and large (543 MB). A small subword-based model and a small wav2vec-based model are also trained for comparison. The multilingual ASR models are trained on the CV-lang10 data and evaluated on the test set of each language without fine-tuning.

To compare different multilingual pre-trained models for crosslingual speech recognition, we conduct phoneme-based and subword-based crosslingual fine-tuning on the unseen languages. All crosslingual models are fine-tuned from the small pretrained multilingual phoneme-based model, the subword-based model, or the wav2vec-based model, using the same fine-tuning strategy. The fine-tuned models are evaluated on the datasets of the 2 unseen languages.
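A hedged sketch of this fine-tuning step, reusing the `encoder`/`classifier` names from the sketch above; the checkpoint path, state-dict keys, and target unit count are hypothetical:

```python
import torch

# Load a pretrained multilingual checkpoint (path and keys are hypothetical).
ckpt = torch.load("exp/multilingual_small/checkpoint.pt", map_location="cpu")
encoder.load_state_dict(ckpt["encoder"])

# Re-initialise only the output head for the unseen language's units
# (e.g. Polish phonemes or subwords); +1 for the CTC blank.
num_target_units = 35  # illustrative
classifier = torch.nn.Linear(80, num_target_units + 1)

# Fine-tune the encoder and the new head with the same CTC objective as before.
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4
)
```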

## Results

### Phoneme-based monolingual models and multilingual pretrained models (PER%)

| Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Monolingual phoneme | 90 MB | 7.92 | 2.47 | 4.93 | 2.87 | 2.23 | 5.89 | 2.72 | 16.11 | 6.00 | 10.54 | 6.16 |
| Multilingual phoneme small | 90 MB | 8.02 | 3.37 | 5.68 | 4.04 | 8.29 | 5.77 | 6.05 | 18.07 | 8.32 | 8.53 | 7.61 |
| Multilingual phoneme medium | 218 MB | 6.70 | 2.63 | 4.53 | 3.12 | 5.95 | 3.95 | 4.61 | 14.81 | 6.04 | 8.47 | 6.08 |
| Multilingual phoneme large | 543 MB | 5.42 | 1.96 | 3.52 | 2.25 | 4.06 | 2.64 | 2.97 | 11.33 | 4.04 | 5.97 | 4.43 |

### Phoneme-based monolingual models and multilingual pretrained models (WER%)

| Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Monolingual phoneme | 90 MB | 10.59 | 7.91 | 15.58 | 9.26 | 1.03 | 8.84 | 1.62 | 8.37 | 8.46 | 9.75 | 8.14 |
| Multilingual subword small | 92 MB | 12.00 | 9.82 | 12.40 | 9.98 | 3.29 | 9.67 | 3.31 | 9.95 | 9.11 | 13.56 | 9.30 |
| Multilingual phoneme small | 90 MB | 10.76 | 8.68 | 16.01 | 9.98 | 1.02 | 7.32 | 1.59 | 6.714 | 7.63 | 7.30 | 7.64 |
| Multilingual phoneme medium | 218 MB | 9.83 | 7.82 | 14.94 | 9.04 | 0.91 | 6.57 | 1.65 | 5.65 | 7.27 | 7.37 | 7.10 |
| Multilingual phoneme large | 543 MB | 8.80 | 7.02 | 14.02 | 8.16 | 0.94 | 6.22 | 1.46 | 5.06 | 7.05 | 6.92 | 6.56 |
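The PER and WER values here and in the tables below are standard edit-distance error rates. As a minimal scoring sketch (using the `jiwer` package purely for illustration, not the repo's evaluation script; the reference/hypothesis strings are made up):

```python
import jiwer

# WER over word sequences.
ref = "dit is een test"
hyp = "dit is de test"
print("WER:", jiwer.wer(ref, hyp))  # 0.25: one substitution out of four words

# PER computed the same way over space-separated phoneme symbols.
ref_phones = "d ɪ t ɪ s ə n t ɛ s t"
hyp_phones = "d ɪ t ɪ s d ə t ɛ s t"
print("PER:", jiwer.wer(ref_phones, hyp_phones))
```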

### Phoneme-based crosslingual fine-tuning on Polish (WER%)

| Model | 10 minutes | 1 hour | 10 hours | 130 hours (full) |
|---|---|---|---|---|
| Monolingual phoneme | - | 99.98 | 13.86 | 4.97 |
| Wav2vec (En) phoneme FT | - | 11.09 | 6.75 | 4.57 |
| Wav2vec (10 lang) phoneme FT | - | 7.94 | 5.65 | 4.44 |
| Phoneme PT and phoneme FT | 11.0 | 6.95 | 5.27 | 4.30 |

### Subword-based crosslingual fine-tuning on Polish (WER%)

| Model | 10 minutes | 1 hour | 10 hours | 130 hours (full) |
|---|---|---|---|---|
| Monolingual subword | - | 98.38 | 59.43 | 7.12 |
| Wav2vec (En) subword FT | - | 100 | 7.08 | 3.85 |
| Wav2vec (10 lang) subword FT | - | 100 | 5.71 | 3.45 |
| Subword PT and subword FT | 52.52 | 9.16 | 4.89 | 3.76 |
| Phoneme PT and subword FT | 81.62 | 8.63 | 4.83 | 3.82 |

### Phoneme-based crosslingual fine-tuning on Indonesian (WER%)

| Model | 10 minutes | 1 hour | 10 hours | 20 hours (full) |
|---|---|---|---|---|
| Monolingual phoneme | - | 100 | 7.71 | 3.28 |
| Wav2vec (En) phoneme FT | - | 6.73 | 3.31 | 2.83 |
| Wav2vec (10 lang) phoneme FT | - | 3.75 | 2.79 | 2.47 |
| Phoneme PT and phoneme FT | 6.85 | 3.27 | 2.54 | 2.43 |

### Subword-based crosslingual fine-tuning on Indonesian (WER%)

| Model | 10 minutes | 1 hour | 10 hours | 20 hours (full) |
|---|---|---|---|---|
| Monolingual subword | - | 96.42 | 49.67 | 10.85 |
| Wav2vec (En) subword FT | - | 100 | 5.28 | 3.59 |
| Wav2vec (10 lang) subword FT | - | 99.97 | 4.52 | 3.15 |
| Subword PT and subword FT | 87.75 | 23.56 | 3.91 | 3.07 |
| Phoneme PT and subword FT | 98.65 | 24.57 | 3.59 | 2.92 |