We trained a phoneme-based ASR model for each language in cv-lang10. All models share the same architecture, a Conformer network with 14 encoder blocks. The number of phonemes and the training, development, and test hours for each language are listed in the following table.
Language | Language ID | # of phonemes | Train hours | Dev hours | Test hours |
---|---|---|---|---|---|
English | en | 39 | 2227.3 | 27.2 | 27.0 |
Spanish | es | 32 | 382.3 | 26.0 | 26.5 |
French | fr | 33 | 823.4 | 25.0 | 25.4 |
Italian | it | 30 | 271.5 | 24.7 | 26.0 |
Kirghiz | ky | 32 | 32.7 | 2.1 | 2.2 |
Dutch | nl | 39 | 70.2 | 13.8 | 13.9 |
Russian | ru | 32 | 149.8 | 14.6 | 15.0 |
Swedish | sv-SE | 33 | 29.8 | 5.5 | 6.2 |
Turkish | tr | 41 | 61.5 | 10.1 | 11.4 |
Tatar | tt | 31 | 20.8 | 3.0 | 5.7 |

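- %PER

  Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
  ---|---|---|---|---|---|---|---|---|---|---|---|---|
  Mono. phoneme | 90 MB | 7.39 | 2.47 | 4.93 | 2.87 | 2.23 | 4.60 | 2.72 | 18.69 | 6.00 | 10.54 | 6.11 |

- %WER with 4-gram LM

  Model | Model size | en | es | fr | it | ky | nl | ru | sv-SE | tr | tt | Avg. |
  ---|---|---|---|---|---|---|---|---|---|---|---|---|
  Mono. phoneme | 90 MB | 10.59 | 7.91 | 15.58 | 9.26 | 1.03 | 8.84 | 1.62 | 8.37 | 8.46 | 9.75 | 8.14 |
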
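
For readers unfamiliar with the metrics, %PER and %WER are conventionally defined as the total edit distance between hypothesis and reference, divided by the number of reference tokens, where the tokens are phonemes for PER and words for WER. The following is a minimal sketch of that standard definition. The function names `edit_distance` and `error_rate` are hypothetical, and the scoring script and text normalization actually used for the tables above are not shown in this document and may differ.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] is the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]) -> float:
    """Corpus-level error rate in percent: total edits / total reference tokens."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_tokens = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_tokens


# Toy phoneme sequences (illustrative only, not taken from the data set).
refs = [["h", "ə", "l", "oʊ"], ["w", "ɜː", "l", "d"]]
hyps = [["h", "ɛ", "l", "oʊ"], ["w", "ɜː", "l"]]
print(error_rate(refs, hyps))  # 1 substitution + 1 deletion over 8 phonemes -> 25.0
```

Passing word sequences instead of phoneme sequences to the same function gives a WER-style figure.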
For the ablation study, the training data are divided into three scales to simulate different resource scenarios: 1 hour, 10 hours, and the full data. Both phoneme-based and subword-based models are trained at each of these three scales.
Language | Language ID | # of phonemes | # of subwords | Train hours | Dev hours | Test hours |
---|---|---|---|---|---|---|
Indonesian | id | 35 | 500 | 20.8 | 3.7 | 4.1 |
Polish | pl | 35 | 500 | 129.9 | 11.4 | 11.5 |
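
The 1-hour and 10-hour training scales described above could be built by drawing utterances until a duration budget is reached. The sketch below is one possible way to do this and is an assumption: this document does not state how the subsets were actually selected. The function `sample_subset`, the `durations` mapping, and the utterance IDs are all hypothetical.

```python
import random
from typing import Dict, List


def sample_subset(durations: Dict[str, float], target_hours: float, seed: int = 0) -> List[str]:
    """Randomly pick utterances until their total duration reaches target_hours.

    durations maps an utterance ID to its length in seconds (assumed format).
    """
    utt_ids = sorted(durations)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(utt_ids)
    target_seconds = target_hours * 3600.0
    subset: List[str] = []
    total = 0.0
    for utt_id in utt_ids:
        if total >= target_seconds:
            break
        subset.append(utt_id)
        total += durations[utt_id]
    return subset


# Toy corpus: 10,000 utterances of 5 seconds each (about 13.9 hours).
durations = {f"utt{i:05d}": 5.0 for i in range(10000)}

one_hour = sample_subset(durations, 1.0)
ten_hours = sample_subset(durations, 10.0)
full_data = sorted(durations)

print(len(one_hour), len(ten_hours), len(full_data))  # 720 7200 10000
```

Because both calls use the same seed, the 1-hour subset is a prefix of the 10-hour subset, so the smaller scale is nested inside the larger one. Whether the original experiments used nested subsets is not stated here.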