Author: Ma, Te ([email protected])
10 hours of Polish data were used to fine-tune the pretrained unsupervised multilingual ASR model for cv-lang10, Wav2vec-lang10, in phoneme form. The training data were randomly selected from the 130-hour Polish dataset sourced from the publicly available Common Voice 11.0.
The script `run.sh` contains the overall model training process.
- The data preparation has been implemented in the monolingual experiments for Polish. Run the script `subset.sh` to randomly select any number of hours of data (see the sketch below).
- The detailed model parameters are given in `config.json` and `hyper-p.json`. Dataset paths should be added to `metainfo.json` for efficient management of datasets.
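A minimal sketch of what the random subset selection amounts to, assuming a hypothetical manifest of (utterance id, duration) pairs; the manifest format and function name are illustrative placeholders, and `subset.sh` in the repository is the authoritative tool:

```python
import random

def select_subset(utterances, target_hours, seed=0):
    """Randomly pick utterances until roughly `target_hours` of audio is collected.

    `utterances` is a list of (utterance_id, duration_seconds) pairs -- a
    hypothetical manifest format, not necessarily what subset.sh consumes.
    """
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)

    picked, total = [], 0.0
    budget = target_hours * 3600.0
    for utt_id, dur in shuffled:
        if total >= budget:
            break
        picked.append(utt_id)
        total += dur
    return picked, total / 3600.0

if __name__ == "__main__":
    # Toy manifest: 50 utterances of 4-10 seconds each.
    manifest = [(f"utt{i:03d}", 4 + (i % 7)) for i in range(50)]
    ids, hours = select_subset(manifest, target_hours=0.05)
    print(f"selected {len(ids)} utterances, {hours:.3f} h")
```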
- The training of this model utilized 2 NVIDIA GeForce RTX 3090 GPUs and took 2 hours.
  - # of parameters (million): 90.20
  - GPU info
    - NVIDIA GeForce RTX 3090
    - # of GPUs: 2
- For the fine-tuning experiment, the output layer of the pretrained model needs to be matched to the corresponding language before fine-tuning. We train the tokenizer for Polish and run the script `unpack_mulingual_param.py` to implement this (a sketch of the idea follows the tokenizer step below). Then configure the parameter `init_model` in `hyper-p.json`.
- To train the tokenizer:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h --sta 1 --sto 1`
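A minimal PyTorch sketch of the output-layer matching idea, assuming the multilingual and Polish phoneme inventories are available as ordered unit lists: rows for phonemes shared with the pretraining inventory are copied, and phonemes unseen during pretraining are freshly initialized. The parameter keys and unit lists below are illustrative placeholders, not the actual names handled by `unpack_mulingual_param.py`:

```python
import torch

def match_output_layer(state_dict, src_units, tgt_units,
                       weight_key="linear.weight", bias_key="linear.bias"):
    """Rebuild the output projection for the target phoneme set.

    Shared units reuse the pretrained rows; units unseen during multilingual
    pretraining are randomly initialized. Key names and unit lists are
    placeholders for illustration only.
    """
    w, b = state_dict[weight_key], state_dict[bias_key]
    hidden = w.shape[1]
    new_w = torch.empty(len(tgt_units), hidden).normal_(std=0.02)
    new_b = torch.zeros(len(tgt_units))
    src_index = {u: i for i, u in enumerate(src_units)}
    for j, unit in enumerate(tgt_units):
        if unit in src_index:          # phoneme seen during pretraining
            new_w[j] = w[src_index[unit]]
            new_b[j] = b[src_index[unit]]
    state_dict[weight_key], state_dict[bias_key] = new_w, new_b
    return state_dict

if __name__ == "__main__":
    # Toy checkpoint: 5 multilingual units, hidden size 8; 4 Polish units (3 shared).
    sd = {"linear.weight": torch.randn(5, 8), "linear.bias": torch.randn(5)}
    sd = match_output_layer(sd, ["<blk>", "a", "b", "ts", "x"], ["<blk>", "a", "ts", "ɕ"])
    print(sd["linear.weight"].shape)  # torch.Size([4, 8])
```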
- To fine-tune the model:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h --sta 2 --sto 3`
- To plot the training curves:
  `python utils/plot_tb.py exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/log/tensorboard/file -o exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/monitor.png`

  Monitor figure: `exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/monitor.png`
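`utils/plot_tb.py` is the recipe's plotting utility; the snippet below is only a hedged sketch of how a scalar curve can be read from a TensorBoard event directory and plotted with matplotlib. The scalar tag name `train/loss` is an assumption; list the tags actually logged with `acc.Tags()`:

```python
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def plot_scalar(event_dir, tag, out_png):
    """Read one scalar series from TensorBoard event files and save a curve."""
    acc = EventAccumulator(event_dir)
    acc.Reload()                      # parse the event files
    events = acc.Scalars(tag)         # each event carries .step and .value
    plt.figure()
    plt.plot([e.step for e in events], [e.value for e in events])
    plt.xlabel("step")
    plt.ylabel(tag)
    plt.savefig(out_png)

# Example call (tag name is an assumption; inspect acc.Tags() for the real ones):
# plot_scalar("exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/log/tensorboard/file",
#             "train/loss", "monitor.png")
```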
- To decode with CTC and calculate the %PER:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 4 --sto 4`

  `test_pl_raw %SER 84.76 | %PER 12.65 [ 37776 / 298549, 8510 ins, 11363 del, 17903 sub ]`
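The bracketed numbers follow the usual edit-distance accounting: %PER = 100 × (ins + del + sub) / #reference phonemes, here 100 × 37776 / 298549 ≈ 12.65. A self-contained sketch of that computation (ties between equally good alignments are broken arbitrarily):

```python
def edit_stats(ref, hyp):
    """Levenshtein alignment of two token lists, returning the error counts
    behind %PER/%WER: (error_rate, substitutions, insertions, deletions)."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, ins, dels) aligning ref[:i] with hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        c = dp[i - 1][0]
        dp[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1)            # delete ref token
    for j in range(1, m + 1):
        c = dp[0][j - 1]
        dp[0][j] = (c[0] + 1, c[1], c[2] + 1, c[3])            # insert hyp token
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            hit = ref[i - 1] == hyp[j - 1]
            a = dp[i - 1][j - 1]
            sub = (a[0] + (not hit), a[1] + (not hit), a[2], a[3])
            b = dp[i][j - 1]
            ins = (b[0] + 1, b[1], b[2] + 1, b[3])
            d = dp[i - 1][j]
            dele = (d[0] + 1, d[1], d[2], d[3] + 1)
            dp[i][j] = min(sub, ins, dele)
    cost, subs, ins, dels = dp[n][m]
    return 100.0 * cost / max(n, 1), subs, ins, dels

# One substitution out of four reference phonemes -> 25% PER.
print(edit_stats(list("kɔtɨ"), list("kɔdɨ")))  # (25.0, 1, 0, 0)
```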
- Before FST decoding, we need to train a language model for each language, the same as in the monolingual ASR experiment. The configuration files `config.json` and `hyper-p.json` are in the `lm` directory of the corresponding language in the monolingual ASR experiment. Note the distinction between the profiles for training the ASR model and the profiles for training the language model: they have the same names but are in different directories.
- To train a language model:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 5 --sto 5`
- To decode with FST and calculate the %WER:
  `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 6`

  `test_pl_raw_ac1.0_lm1.6_wip0.0.hyp %SER 14.03 | %WER 5.65 [ 3362 / 59464, 247 ins, 687 del, 2428 sub ]`
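The suffix `ac1.0_lm1.6_wip0.0` records the decoding hyperparameters: acoustic scale 1.0, language-model weight 1.6, and word insertion penalty 0.0. As a hedged sketch (the exact sign and weighting convention in this recipe may differ), hypotheses in FST decoding are typically ranked by a combined log-score of the following form:

```python
def combined_score(ac_logprob, lm_logprob, num_words,
                   ac_scale=1.0, lm_weight=1.6, word_ins_penalty=0.0):
    """Illustrative WFST-decoding path score: scaled acoustic plus weighted
    language-model log-probabilities plus a per-word insertion penalty."""
    return ac_scale * ac_logprob + lm_weight * lm_logprob + word_ins_penalty * num_words

# Hypothetical hypothesis with acoustic log-prob -120.0, LM log-prob -35.0, 6 words:
print(combined_score(-120.0, -35.0, 6))  # -176.0
```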
- The files used to fine-tune this model and the fine-tuned model are available in the following table.

| Word list | Checkpoint model | Language model | Tensorboard log |
| --- | --- | --- | --- |
| wordlist_pl.txt | Wav2vec-lang10_ft_phoneme_10h_best-3.pt | lm_pl_4gram.arpa | tb_Wav2vec-lang10_ft_phoneme_10h |