
Authors:
Fanhua Song
Lukuan Dong [email protected]

[English][中文]

> [!NOTE]
> This is the official code for the paper "Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training".

# Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training

Mainstream automatic speech recognition (ASR) technologies typically require hundreds to thousands of hours of labeled speech data. For low-resource speech recognition, three multilingual pre-training approaches are commonly used: phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training. Iu Mien, the main ethnic language of the Yao people in China, is a low-resource language with very limited labeled speech data. Using less than 10 hours of transcribed Iu Mien speech, we study and compare the effectiveness of these three approaches for Iu Mien speech recognition. Our experiments are based on three recently released pre-trained models, one per approach, all pre-trained on 4096 hours of data in 10 languages from the CommonVoice dataset (CV-Lang10). We find that phoneme-based supervision achieves better performance than subword-based supervision and self-supervision, demonstrating higher data efficiency. The Whistle model, obtained through weakly-supervised phoneme-based multilingual pre-training, achieves the best results on the test set.

## Data

Iu Mien Corpus (Iu Mien is a variety of the Yao language, close to Jinxiu Yao; 9,754 entries in total, approximately 9.7 hours of labeled audio data)

The text of this dataset comes from the Bible and is written in the standard Iu Mien script scheme.

The standard Iu Mien script scheme (referenced from the Iu Mien Wikipedia page) uses only the 26 basic Latin letters as its basic writing units, from which it constructs 30 initials, 128 finals, and 8 tones.

It is noteworthy that Iu Mien has 8 tones and that the tone of a word is written explicitly: the last letter of a spelling generally indicates the tone, except for the first tone, which is left unmarked. Finals with plosive codas can only carry the two entering (checked) tones, while the other finals carry only the non-entering tones. The entering tones were originally written with the dedicated tone letters q and r, which have since been replaced by the tone letters of the closest non-entering tones (v and c). A small parsing sketch follows the text example below.

Example of Iu Mien text:

```
Psalm_9_39_189040_195800	ninh mbuo nyei zaux yaac zuqc ninh mbuo ganh zaeng nyei mungz nzenc jienv
Psalm_9_41_199200_201880	ninh baengh fim nyei siemv zuiz
Psalm_9_43_208440_213920	orqv mienh se la kuqv tin hungh nyei fingx fingx mienh
Psalm_9_44_213920_217480	zungv zuqc mingh yiem yiemh gen
Psalm_9_45_217480_223040	ninh maiv zeiz yietc liuz la kuqv dangx donx nyei mienh
Psalm_9_47_229960_235040	o ziouv aah daaih maah maiv dungx bun baamh mienh duqv hingh
```

## Data processing

### Download data

```bash
git clone https://github.com/mightmay/MightLJSpeech.git MightLJSPeech
```

### Data splitting

Split the data into training, validation, and test sets in an 8:1:1 ratio. From the validation and test sets, remove any utterance whose audio and corresponding text are exact duplicates of an entry in the training set.

```bash
python local/split_data.py
python local/fliter_data.py
```
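For reference, the sketch below shows what an 8:1:1 split with training-set deduplication can look like. It is only an illustration under assumed data structures; the actual logic lives in local/split_data.py and local/fliter_data.py, and this sketch matches duplicates on the transcript as a simplification.

```python
import random

# Illustrative sketch; see local/split_data.py and local/fliter_data.py for
# the real implementation. Each entry is an (utterance_id, transcript) pair.
def split_8_1_1(entries, seed=0):
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n = len(entries)
    train = entries[: int(0.8 * n)]
    dev = entries[int(0.8 * n): int(0.9 * n)]
    test = entries[int(0.9 * n):]
    # Remove dev/test utterances whose transcript already occurs in train
    # (a simplification of the audio-and-text duplicate check).
    seen = {text for _, text in train}
    dev = [e for e in dev if e[1] not in seen]
    test = [e for e in test if e[1] not in seen]
    return train, dev, test
```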

The training, validation, and test splits used in our experiments can be found in exp_data/.

### Speech feature extraction

Since the Whistle pretrained models used in this experiment were trained on FBank features extracted with Kaldi, we also extract FBank features with Kaldi so that our features match the pretrained models.

```bash
bash local/data_kaldi.sh
python utils/data/resolvedata.py
```
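For orientation, Kaldi-compatible FBank features can also be computed with torchaudio, as sketched below. The file name and the choice of 80 Mel bins are assumptions for illustration, not necessarily the configuration used by local/data_kaldi.sh.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Sketch: Kaldi-compatible FBank features for one utterance.
# "sample.wav" and num_mel_bins=80 are illustrative assumptions.
waveform, sr = torchaudio.load("sample.wav")
feats = kaldi.fbank(
    waveform,
    sample_frequency=sr,
    num_mel_bins=80,    # FBank dimensionality
    frame_length=25.0,  # window length in milliseconds
    frame_shift=10.0,   # hop length in milliseconds
)
print(feats.shape)  # (num_frames, num_mel_bins)
```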

For the experiments with the self-supervised wav2vec 2.0 model, the input is not FBank features but the raw audio signal sampled at 16 kHz, so process the data with the following script instead.

```bash
bash local/audio2ark.sh
python utils/data/resolvedata.py
```
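If a recording is not already 16 kHz mono, it can be converted along the lines of the sketch below; the file names are placeholders.

```python
import torchaudio
import torchaudio.functional as F

# Sketch: convert an arbitrary audio file to 16 kHz mono for wav2vec 2.0.
waveform, sr = torchaudio.load("input.wav")     # placeholder file name
if waveform.size(0) > 1:                        # downmix multi-channel audio
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:                                 # resample to 16 kHz
    waveform = F.resample(waveform, sr, 16000)
torchaudio.save("input_16k.wav", waveform, 16000)
```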

## Lexicon

Referring to the table of correspondences between the Iu Mien script and IPA pronunciations on the Iu Mien Wikipedia page, we generate a lexicon from the training set. Words are segmented into script units with a longest-match rule, and their IPA pronunciations are obtained by looking the units up in the correspondence table (lexicon.txt). The pronunciation lexicon used in our experiments is provided in exp_dict/.

```bash
mkdir dict
python local/get_wordlist.py
python local/get_lexicon.py
```
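A minimal sketch of the longest-match segmentation idea is given below. The tiny mapping table is a made-up placeholder; the real script-to-IPA table is in lexicon.txt.

```python
# Illustrative longest-match grapheme-to-IPA lookup. The entries below are
# placeholders, not real Iu Mien mappings; the real table is in lexicon.txt.
TABLE = {"mb": "b", "uo": "uo", "ny": "ɲ", "ei": "ei"}

def word_to_ipa(word, table):
    """Segment a word greedily, always taking the longest matching unit."""
    out, i, max_len = [], 0, max(map(len, table))
    while i < len(word):
        for length in range(min(max_len, len(word) - i), 0, -1):
            unit = word[i:i + length]
            if unit in table:
                out.append(table[unit])
                i += length
                break
        else:  # no unit matched: skip one character (real code could raise)
            i += 1
    return out

print(word_to_ipa("mbuo", TABLE))  # ['b', 'uo']
```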

## Language model

In speech recognition, a language model is commonly used during decoding to reduce the error rate of the final recognition result. In this experiment, we use a 4-gram word-level language model.

Use the following command to train a language model.

```bash
bash exp/decode_lm/run.history.sh
```
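As a point of reference, a 4-gram word LM of this kind can be trained with KenLM as sketched below. This is an assumption about tooling, not necessarily what exp/decode_lm/run.history.sh does, and the file names are placeholders.

```bash
# Sketch with KenLM (https://github.com/kpu/kenlm); file names are placeholders.
lmplz -o 4 < train_text.txt > 4gram.arpa   # train a 4-gram word LM
build_binary 4gram.arpa 4gram.bin          # optional: compile for fast loading
```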

## Experiment

### How to run exp

Please refer to the corresponding `exp/.../run.history.sh` script.

### Results

We take the average of three independent runs as the final result.

(1) BPE modeling, trained from scratch on Iu Mien data.

| Model | Unit | LM setting | Test (%) | Note |
| --- | --- | --- | --- | --- |
| Mono-subword | bpe500 | no LM | 9.71 | |
| Mono-subword | bpe500 | 4-gram word LM | 6.87 | FST decoding |

(2) BPE modeling, fine-tuned on Iu Mien data from the subword-based pretrained model on CV-Lang10.

| Model | Unit | LM setting | Test (%) | Note |
| --- | --- | --- | --- | --- |
| Mul10-sub-PT-sub-FT | bpe500 | no LM | 4.33 | |
| Mul10-sub-PT-sub-FT | bpe500 | 4-gram word LM | 3.46 | FST decoding |

(3) BPE modeling, fine-tuned on Iu Mien data from the Wav2vec2-cv10 pretrained model.

| Model | Unit | Metric | LM setting | Test (%) | Note |
| --- | --- | --- | --- | --- | --- |
| Wav2vec2-cv10-sub-FT | bpe500 | WER | no LM | 3.76 | |
| Wav2vec2-cv10-sub-FT | bpe500 | WER | 4-gram word LM | 3.06 | |

(4) BPE modeling, fine-tuned on Iu Mien data from the Whistle-small pretrained model.

| Model | Unit | LM setting | Test (%) | Note |
| --- | --- | --- | --- | --- |
| Whistle-sub-FT | bpe500 | no LM | 3.30 | |
| Whistle-sub-FT | bpe500 | 4-gram word LM | 2.95 | FST decoding |

(5) Phone modeling, trained from scratch on Iu Mien data.

| Model | Unit | Metric | LM setting | Test (%) | Note |
| --- | --- | --- | --- | --- | --- |
| Mono-phoneme | phone | PER | no LM | 4.22 | |
| Mono-phoneme | phone | WER | 4-gram word LM | 4.69 | |

(6) Phone modeling, fine-tuned on Iu Mien data from the Whistle-small pretrained model.

| Model | Unit | Metric | LM setting | Test (%) | Note |
| --- | --- | --- | --- | --- | --- |
| Whistle-phoneme-FT | phone | PER | no LM | 2.41 | |
| Whistle-phoneme-FT | phone | WER | 4-gram word LM | 2.71 | |

(7) Phone modeling, fine-tuned on Iu Mien data from the Wav2vec2-cv10 pretrained model.

| Model | Unit | Metric | LM setting | Test (%) | Note |
| --- | --- | --- | --- | --- | --- |
| Wav2vec2-cv10-phoneme-FT | phone | PER | no LM | 2.53 | |
| Wav2vec2-cv10-phoneme-FT | phone | WER | 4-gram word LM | 2.76 | |