Fine-tuning Wav2vec-lang10 model in phoneme form with Polish 10 hours data

Author: Ma, Te ([email protected])

Basic info

10 hours of Polish data were used to fine-tune Wav2vec-lang10, the pretrained unsupervised multilingual ASR model of cv-lang10, in phoneme form. The training set was randomly selected from the 130-hour Polish dataset sourced from the publicly available Common Voice 11.0.
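The subset selection itself is straightforward: draw utterances at random until their total duration reaches roughly 10 hours. Below is a minimal sketch; the manifest format and file names are hypothetical, not this recipe's actual data-preparation scripts.

```python
# Hypothetical helper: randomly draw ~10 hours of utterances from a duration
# manifest (utt-id <tab> duration-in-seconds). Illustration only; the recipe's
# own data-preparation scripts may do this differently.
import random

def select_subset(manifest_path, out_path, target_hours=10.0, seed=0):
    with open(manifest_path, encoding="utf-8") as f:
        utts = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    random.Random(seed).shuffle(utts)

    picked, total = [], 0.0
    for utt_id, dur in utts:
        if total >= target_hours * 3600:
            break
        picked.append(utt_id)
        total += float(dur)

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(picked) + "\n")
    print(f"selected {len(picked)} utterances, {total / 3600:.2f} hours")

# Example (hypothetical paths):
# select_subset("data/pl/train_130h.tsv", "data/pl/train_10h.list")
```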

Training process

The script run.sh contains the overall model training process; each stage can be run selectively with the --sta (start stage) and --sto (stop stage) options used below.

Stage 0: Data preparation

Stage 1 to 3: Model training

  • The training of this model utilized 2 NVIDIA GeForce RTX 3090 GPUs and took 2 hours.

    • # of parameters (million): 90.20
    • GPU info
      • NVIDIA GeForce RTX 3090
      • # of GPUs: 2
  • For the fine-tuning experiment, the output layer of the pretrained model needs to be matched to the target language before fine-tuning. We train a tokenizer for Polish and run the script unpack_mulingual_param.py to do this, then configure the parameter init_model in hyper-p.json (see the sketch after this list).

  • To train tokenizer:

      `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h --sta 1 --sto 1`
    
  • To fine-tune the model:

      `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h --sta 2 --sto 3`
    
  • To plot the training curves:

      `python utils/plot_tb.py exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/log/tensorboard/file -o exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/monitor.png`
    
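As a minimal sketch of configuring init_model (mentioned in the first bullet above), the fine-tuning config can be pointed at the unpacked checkpoint as shown here. The key layout inside hyper-p.json and the checkpoint path are assumptions; check the actual files in this experiment directory.

```python
# Minimal sketch: point the fine-tuning config at the unpacked pretrained
# checkpoint. The nesting of the "init_model" key inside hyper-p.json is an
# assumption; inspect the real file before editing it this way.
import json

hyper_p = "exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/hyper-p.json"
init_ckpt = "path/to/unpacked_wav2vec-lang10_pl.pt"  # hypothetical path

with open(hyper_p, encoding="utf-8") as f:
    cfg = json.load(f)

cfg.setdefault("train", {})["init_model"] = init_ckpt  # assumed key location

with open(hyper_p, "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=4, ensure_ascii=False)
```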
Monitor figure

[tb-plot: training curves for this experiment, generated as monitor.png by the plotting command above]
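If you want to inspect the curves programmatically instead of through utils/plot_tb.py, the TensorBoard event files can be read directly. This is only a sketch; the scalar tag name is an assumption, so list the available tags first.

```python
# Sketch: read training scalars straight from the TensorBoard event files.
# The tag name "train/loss" is an assumption; print ea.Tags() to see what
# this recipe actually logs.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

logdir = "exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/log/tensorboard/file"
ea = EventAccumulator(logdir)
ea.Reload()

print("available scalar tags:", ea.Tags()["scalars"])
for event in ea.Scalars("train/loss"):  # assumed tag name
    print(event.step, event.value)
```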

Stage 4: CTC decoding

  • To decode with CTC and calculate the %PER:

      `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 4 --sto 4`
    
    %PER
    test_pl_raw     %SER 84.76 | %PER 12.65 [ 37776 / 298549, 8510 ins, 11363 del, 17903 sub ]
    
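For reference, the bracketed numbers follow the usual edit-distance convention: 298549 is the total count of reference phoneme tokens, and the insertion/deletion/substitution counts come from the best Levenshtein alignment. The code below is not the toolkit's scorer, just a self-contained illustration of how %PER and those counts are obtained.

```python
# Illustration of how %PER and the ins/del/sub counts arise from a Levenshtein
# alignment (standard dynamic programming; not the toolkit's own scorer).
def align_errors(ref, hyp):
    """Align two token lists; return (substitutions, deletions, insertions)."""
    n, m = len(ref), len(hyp)
    dp = [[(0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (0, i, 0)              # only deletions remain
    for j in range(1, m + 1):
        dp[0][j] = (0, 0, j)              # only insertions remain
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                continue
            candidates = [
                (dp[i - 1][j - 1], (1, 0, 0)),   # substitution
                (dp[i - 1][j],     (0, 1, 0)),   # deletion
                (dp[i][j - 1],     (0, 0, 1)),   # insertion
            ]
            prev, step = min(candidates, key=lambda c: sum(c[0]))
            dp[i][j] = tuple(p + s for p, s in zip(prev, step))
    return dp[n][m]

# Toy example with phoneme-like tokens
ref = "a b c d".split()
hyp = "a x c d e".split()
sub, dele, ins = align_errors(ref, hyp)
total = sub + dele + ins
print(f"%PER {100.0 * total / len(ref):.2f} "
      f"[ {total} / {len(ref)}, {ins} ins, {dele} del, {sub} sub ]")
```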

Stage 5 to 7: FST decoding

  • Before FST decoding, we need to train a language model for each language, the same as in the Monolingual ASR experiment. The configuration files config.json and hyper-p.json are in the lm directory under the corresponding language directory of the monolingual ASR experiment. Note the distinction between the configuration files for training the ASR model and those for training the language model: they share the same names but live in different directories. (A sketch of querying the resulting 4-gram LM follows this list.)

  • To train a language model:

      `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 5 --sto 5`
    
  • To decode with FST and calculate the %WER:

      `bash run.sh pl exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/ --sta 6`
    
    %WER with 4-gram LM
    test_pl_raw_ac1.0_lm1.6_wip0.0.hyp      %SER 14.03 | %WER 5.65 [ 3362 / 59464, 247 ins, 687 del, 2428 sub ]
    
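For intuition about what the 4-gram LM trained in stage 5 contributes, the sketch below queries such a model with the kenlm Python bindings. Whether this recipe exports a KenLM-readable ARPA/binary file, and its path, are assumptions; adapt it to the files your run actually produces.

```python
# Sketch only: scoring Polish text with a word-level 4-gram LM via kenlm.
# The LM path is hypothetical; point it at the ARPA/binary file produced by
# your stage-5 run (if it is in a KenLM-readable format).
import kenlm

lm_path = "exp/Crosslingual/pl/Wav2vec-lang10_ft_phoneme_10h/lm/4gram.arpa"  # hypothetical
model = kenlm.Model(lm_path)

sentence = "to jest przykładowe zdanie"  # "this is an example sentence"
print("log10 P(sentence):", model.score(sentence, bos=True, eos=True))
print("perplexity:", model.perplexity(sentence))
```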

Resources