Author: Ma, Te ([email protected])
10 hours of Polish data were used to fine-tune the pretrained unsupervised multilingual ASR model for cv-lang10, Wav2vec-lang10, in phoneme form. The training data were randomly selected from the 130-hour Polish dataset sourced from the publicly available Common Voice 11.0.
The script `run.sh` contains the overall model training process.
- The data preparation has been implemented in the monolingual experiments for Polish. Run the script `subset.sh` to randomly select any number of hours of data (see the sketch below).
- The detailed model parameters are given in `config.json` and `hyper-p.json`. Dataset paths should be added to `metainfo.json` for efficient management of datasets.
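A minimal sketch of what the random subset selection amounts to, assuming a hypothetical manifest of (utterance id, duration) pairs; the manifest format and function name are illustrative placeholders, and `subset.sh` in the repository is the authoritative tool:

```python
import random

def select_subset(utterances, target_hours, seed=0):
    """Randomly pick utterances until roughly `target_hours` of audio is collected.

    `utterances` is a list of (utterance_id, duration_seconds) pairs -- a
    hypothetical manifest format, not necessarily what subset.sh consumes.
    """
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)

    picked, total = [], 0.0
    budget = target_hours * 3600.0
    for utt_id, dur in shuffled:
        if total >= budget:
            break
        picked.append(utt_id)
        total += dur
    return picked, total / 3600.0

if __name__ == "__main__":
    # Toy manifest: 50 utterances of 4-10 seconds each.
    manifest = [(f"utt{i:03d}", 4 + (i % 7)) for i in range(50)]
    ids, hours = select_subset(manifest, target_hours=0.05)
    print(f"selected {len(ids)} utterances, {hours:.3f} h")
```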
- The training of this model utilized 2 NVIDIA GeForce RTX 3090 GPUs and took 2 hours.
  - # of parameters (million): 90.20
  - GPU info
    - NVIDIA GeForce RTX 3090
    - # of GPUs: 2
- For the fine-tuning experiment, the output layer of the pretrained model needs to be matched to the corresponding language before fine-tuning. We train the tokenizer for Polish and run the script `unpack_mulingual_param.py` to implement this (a sketch of the idea follows the tokenizer step below). Then configure the parameter `init_model` in `hyper-p.json`.
- To train the tokenizer:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h --sta 1 --sto 1`
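A minimal PyTorch sketch of the output-layer matching idea, assuming the multilingual and Polish phoneme inventories are available as ordered unit lists: rows for phonemes shared with the pretraining inventory are copied, and phonemes unseen during pretraining are freshly initialized. The parameter keys and unit lists below are illustrative placeholders, not the actual names handled by `unpack_mulingual_param.py`:

```python
import torch

def match_output_layer(state_dict, src_units, tgt_units,
                       weight_key="linear.weight", bias_key="linear.bias"):
    """Rebuild the output projection for the target phoneme set.

    Shared units reuse the pretrained rows; units unseen during multilingual
    pretraining are randomly initialized. Key names and unit lists are
    placeholders for illustration only.
    """
    w, b = state_dict[weight_key], state_dict[bias_key]
    hidden = w.shape[1]
    new_w = torch.empty(len(tgt_units), hidden).normal_(std=0.02)
    new_b = torch.zeros(len(tgt_units))
    src_index = {u: i for i, u in enumerate(src_units)}
    for j, unit in enumerate(tgt_units):
        if unit in src_index:          # phoneme seen during pretraining
            new_w[j] = w[src_index[unit]]
            new_b[j] = b[src_index[unit]]
    state_dict[weight_key], state_dict[bias_key] = new_w, new_b
    return state_dict

if __name__ == "__main__":
    # Toy checkpoint: 5 multilingual units, hidden size 8; 4 Polish units (3 shared).
    sd = {"linear.weight": torch.randn(5, 8), "linear.bias": torch.randn(5)}
    sd = match_output_layer(sd, ["<blk>", "a", "b", "ts", "x"], ["<blk>", "a", "ts", "ɕ"])
    print(sd["linear.weight"].shape)  # torch.Size([4, 8])
```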
- To fine-tune the model:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h --sta 2 --sto 3`
- To plot the training curves:
  `python utils/plot_tb.py exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/log/tensorboard/file -o exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/monitor.png`

  Monitor figure: `exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/monitor.png`
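`utils/plot_tb.py` is the recipe's plotting utility; the snippet below is only a hedged sketch of how a scalar curve can be read from a TensorBoard event directory and plotted with matplotlib. The scalar tag name `train/loss` is an assumption; list the tags actually logged with `acc.Tags()`:

```python
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def plot_scalar(event_dir, tag, out_png):
    """Read one scalar series from TensorBoard event files and save a curve."""
    acc = EventAccumulator(event_dir)
    acc.Reload()                      # parse the event files
    events = acc.Scalars(tag)         # each event carries .step and .value
    plt.figure()
    plt.plot([e.step for e in events], [e.value for e in events])
    plt.xlabel("step")
    plt.ylabel(tag)
    plt.savefig(out_png)

# Example call (tag name is an assumption; inspect acc.Tags() for the real ones):
# plot_scalar("exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/log/tensorboard/file",
#             "train/loss", "monitor.png")
```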
- To decode with CTC and calculate the %PER:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 4 --sto 4`

  `test_pl_raw %SER 84.76 | %PER 12.65 [ 37776 / 298549, 8510 ins, 11363 del, 17903 sub ]`
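The bracketed numbers follow the usual edit-distance accounting: %PER = 100 × (ins + del + sub) / #reference phonemes, here 100 × 37776 / 298549 ≈ 12.65. A self-contained sketch of that computation (ties between equally good alignments are broken arbitrarily):

```python
def edit_stats(ref, hyp):
    """Levenshtein alignment of two token lists, returning the error counts
    behind %PER/%WER: (error_rate, substitutions, insertions, deletions)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, ins, dels) aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1)            # delete ref token
    for j in range(1, m + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2] + 1, c[3])            # insert hyp token
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            hit = ref[i - 1] == hyp[j - 1]
            a = dp[i - 1][j - 1]
            sub = (a[0] + (not hit), a[1] + (not hit), a[2], a[3])
            b = dp[i][j - 1]
            ins = (b[0] + 1, b[1], b[2] + 1, b[3])
            d = dp[i - 1][j]
            dele = (d[0] + 1, d[1], d[2], d[3] + 1)
            dp[i][j] = min(sub, ins, dele)
    cost, subs, ins, dels = dp[n][m]
    return 100.0 * cost / max(n, 1), subs, ins, dels

# One substitution out of four reference phonemes -> 25% PER.
print(edit_stats(list("kɔtɨ"), list("kɔdɨ")))  # (25.0, 1, 0, 0)
```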
- Before FST decoding, we need to train a language model for each language, the same as in the monolingual ASR experiment. The configuration files `config.json` and `hyper-p.json` are in the `lm` directory of the corresponding language in the monolingual ASR experiment. Note the distinction between the profiles for training the ASR model and the profiles for training the language model: they have the same names but are in different directories.
- To train a language model:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 5 --sto 5`
- To decode with FST and calculate the %WER:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 6`

  `test_pl_raw_ac1.0_lm1.6_wip0.0.hyp %SER 14.03 | %WER 5.65 [ 3362 / 59464, 247 ins, 687 del, 2428 sub ]`
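The suffix `ac1.0_lm1.6_wip0.0` records the decoding hyperparameters: acoustic scale 1.0, language-model weight 1.6, and word insertion penalty 0.0. As a hedged sketch (the exact sign and weighting convention in this recipe may differ), hypotheses in FST decoding are typically ranked by a combined log-score of the following form:

```python
def combined_score(ac_logprob, lm_logprob, num_words,
                   ac_scale=1.0, lm_weight=1.6, word_ins_penalty=0.0):
    """Illustrative WFST-decoding path score: scaled acoustic plus weighted
    language-model log-probabilities plus a per-word insertion penalty."""
    return ac_scale * ac_logprob + lm_weight * lm_logprob + word_ins_penalty * num_words

# Hypothetical hypothesis with acoustic log-prob -120.0, LM log-prob -35.0, 6 words:
print(combined_score(-120.0, -35.0, 6))  # -176.0
```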
- The files used to fine-tune this model and the fine-tuned model are available in the following table.

| Word list | Checkpoint model | Language model | Tensorboard log |
| --- | --- | --- | --- |
| wordlist_pl.txt | Wav2vec-lang10_ft_phoneme_10h_best-3.pt | lm_pl_4gram.arpa | tb_Wav2vec-lang10_ft_phoneme_10h |