Monolingual phoneme-based ASR model for Polish

Basic info

This model is built on the Conformer architecture and trained with CTC (Connectionist Temporal Classification). The training dataset consists of 130 hours of Polish speech from the publicly available Common Voice 11.0 corpus.
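As background on CTC, the sketch below shows the standard collapse rule that maps a frame-level label sequence to an output sequence: consecutive repeats are merged first, then blank symbols are dropped. It is an illustration only, not the decoder used by run.sh; the blank token name `<blk>` and the example frames are assumptions made for this example.

```python
from itertools import groupby

BLANK = "<blk>"  # hypothetical blank symbol; the real token name depends on the tokenizer


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapse rule: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Made-up frame-level labels for illustration.
frames = ["<blk>", "t", "t", "<blk>", "a", "a", "a", "<blk>", "k", "k", "<blk>"]
print(ctc_collapse(frames))

# A blank between two identical labels keeps them as two separate outputs.
print(ctc_collapse(["n", "n", "<blk>", "n"]))
```

The second call shows why the blank symbol is needed: without it, a genuinely repeated phoneme could not be distinguished from one phoneme spanning several frames.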

Training process

The script run.sh contains the overall model training process.

Stage 0: Data preparation

  • Follow the steps in data_prep.md and run data_prep.sh to prepare the dataset and pronunciation lexicon for a given language. The second and fourth stages of data_prep.sh involve language-specific processing, which is described in lang_process.md.
  • Model parameters are specified in config.json and hyper-p.json. Dataset paths should be added to metainfo.json for efficient dataset management.

Stage 1 to 3: Model training

  • Training used 5 NVIDIA GeForce RTX 3090 GPUs and took 10.6 hours.

    • # of parameters (million): 89.98
    • GPU info
      • NVIDIA GeForce RTX 3090
      • # of GPUs: 5
  • To train the model:

      `bash run.sh pl exp/Monolingual/pl/Mono._phoneme_130h --sta 1 --sto 3`
    
  • To plot the training curves:

      `python utils/plot_tb.py exp/Monolingual/pl/Mono._phoneme_130h/log/tensorboard/file -o exp/Monolingual/pl/Mono._phoneme_130h/monitor.png`
    
Monitor figure: tb-plot (image of the training curves)

Stage 4: CTC decoding

  • To decode with CTC and calculate the %PER:

      `bash run.sh pl exp/Monolingual/pl/Mono._phoneme_130h --sta 4 --sto 4`
    
    %PER
    test_pl %SER 25.08 | %PER 2.82 [ 8418 / 298549, 1480 ins, 3383 del, 3555 sub ]
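The reported %PER can be checked from the counts in the brackets, assuming the usual definition of an edit-distance error rate: insertions, deletions, and substitutions summed and divided by the number of reference phonemes. The function name `error_rate` is hypothetical and used only for this sketch.

```python
def error_rate(ins, dele, sub, ref_len):
    """Return (total errors, error rate in percent) from edit-operation counts."""
    errors = ins + dele + sub
    return errors, 100.0 * errors / ref_len


# Counts copied from the test_pl result line above.
errors, per = error_rate(ins=1480, dele=3383, sub=3555, ref_len=298549)
print(f"{errors} errors, %PER {per:.2f}")  # prints "8418 errors, %PER 2.82"
```

The total of 8418 errors and the rounded rate of 2.82 both agree with the reported line.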
    
    

Stage 5 to 7: FST decoding

  • FST decoding requires a language model, which is trained with its own config.json and hyper-p.json. Note that the configuration files for training the ASR model and those for training the language model have the same names but are located in different directories.

  • To decode with FST and calculate the %WER:

      `bash run.sh pl exp/Monolingual/pl/Mono._phoneme_130h --sta 5`
    
    %WER
    test_pl_ac1.0_lm1.6_wip0.0.hyp  %SER 12.50 | %WER 4.97 [ 2958 / 59464, 301 ins, 557 del, 2100 sub ]
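When comparing several decoding runs, it can help to parse result lines of this shape into structured fields. The following is a minimal sketch that assumes every result line follows exactly the layout shown above; the regular expression and the name `parse_result` are written for this example and are not part of the repository.

```python
import re

# Result line copied from the FST decoding output above.
LINE = ("test_pl_ac1.0_lm1.6_wip0.0.hyp  %SER 12.50 | %WER 4.97 "
        "[ 2958 / 59464, 301 ins, 557 del, 2100 sub ]")

# Assumed layout: <name> %SER <x> | %<METRIC> <y> [ <err> / <ref>, <i> ins, <d> del, <s> sub ]
PATTERN = re.compile(
    r"(?P<name>\S+)\s+%SER\s+(?P<ser>[\d.]+)\s+\|\s+%(?P<metric>[A-Z]+)\s+(?P<rate>[\d.]+)\s+"
    r"\[\s*(?P<err>\d+)\s*/\s*(?P<ref>\d+),\s*(?P<ins>\d+)\s+ins,"
    r"\s*(?P<dele>\d+)\s+del,\s*(?P<sub>\d+)\s+sub\s*\]"
)


def parse_result(line):
    """Parse one result line into a dict of typed fields."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized result line: {line!r}")
    fields = match.groupdict()
    result = {"name": fields["name"], "metric": fields["metric"],
              "ser": float(fields["ser"]), "rate": float(fields["rate"])}
    for key in ("err", "ref", "ins", "dele", "sub"):
        result[key] = int(fields[key])
    return result


result = parse_result(LINE)
# Consistency checks: the error total and the rate should match the counts.
counts_ok = result["ins"] + result["dele"] + result["sub"] == result["err"]
rate_ok = abs(100.0 * result["err"] / result["ref"] - result["rate"]) < 0.005
print(result["metric"], result["rate"], counts_ok, rate_ok)
```

The same pattern also matches the %PER line from Stage 4, since only the metric name differs.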
    
    
    

Resources