
Add hotwords #11

Merged: 4 commits into Slyne:master, May 18, 2023

Conversation

FieldsMedal

No description provided.

scorer = Scorer(0.5, 0.5, lm_path, vocab_list)
batchhotwords = BatchHotwords()
# In the first bad case, there is a big difference in scoring between the optimal path and the other paths.
hot_words = {'极点': 5, '换一': -3.40282e+38, '首歌': -100, '换歌': 3.40282e+38}
Contributor

Could you explain more about why you set these words here? Why do you use a positive or negative score? What's the bad case? What are the results before and after setting it?

FieldsMedal (Author) May 8, 2023

In test_zh.py, batch_log_ctc_probs contains test data for two audio clips.

1) For the first audio, the decoding result without hotwords is ('换一首歌', '', '换一', '一', '换', '换首歌', '一首歌', '一歌', '首歌', '歌'). I want to use hotwords to change the decoding result so that '换歌' ranks first, so I give '换歌' a large positive weight of 3.40282e+38 and reduce the weights of other words such as '换一' and '首歌'. The decoding result with hotwords is ('换歌', '换一首歌', '', '换一', '一', '换', '一首歌', '一歌', '首歌', '歌').
2) For the second audio, the decoding result without hotwords is ('几点了', '极点了', '几点啊', '几点啦', '几点儿', '几点吧', '急点了', '几点晚', '极点啊', '极点啦'). After setting '极点': 5, the decoding result is ('极点了', '几点了', '极点啊', '极点啦', '极点儿', '极点吧', '极点晚', '极点呀', '几点啊', '几点啦').
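For context, here is a minimal sketch of how such a weight dictionary could be passed into the batch decoder. The module name swig_decoders, the HotWordsScorer constructor, and the keyword arguments of ctc_beam_search_decoder_batch below are assumptions based on this discussion, not the exact signatures used in test_zh.py:

```python
# Hypothetical sketch; lm_path, vocab_list and batch_log_ctc_probs are the
# placeholders from the test_zh.py setup referenced above.
from swig_decoders import Scorer, HotWordsScorer, ctc_beam_search_decoder_batch

scorer = Scorer(0.5, 0.5, lm_path, vocab_list)            # alpha, beta, KenLM path, vocabulary
hot_words = {'换歌': 3.40282e+38, '换一': -3.40282e+38,    # boost '换歌', suppress '换一'
             '首歌': -100, '极点': 5}
hotwords_scorer = HotWordsScorer(hot_words, vocab_list)    # assumed constructor

results = ctc_beam_search_decoder_batch(
    batch_log_ctc_probs,        # log CTC posteriors for the two test utterances
    vocab_list,
    beam_size=10,
    num_processes=2,
    ext_scoring_func=scorer,
    hotwords_scorer=hotwords_scorer,   # parameter name taken from a later comment in this PR
)
```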

FieldsMedal (Author) commented May 8, 2023

We tested hotwords on speech_asr_aishell1_hotwords_testsets.

  1. Acoustic model: a small Conformer model for AIShell

  2. Hotwords weight: hotwords.tar.gz

  3. Test method: please refer to the README of this repository (TODO)

Latency and CER

| model (FP16)                | Latency (s) | CER   |
|-----------------------------|-------------|-------|
| offline model               | 5.5921      | 13.85 |
| offline model with hotwords | 5.6401      | 12.16 |

offline model: https://github.com/wenet-e2e/wenet/tree/main/runtime/gpu/model_repo

offline model with hotwords: (TODO)

Decoding result

| Label | hotwords | pred w/o hotwords | pred w/ hotwords |
|---|---|---|---|
| 以及拥有陈露的女单项目 | 陈露 | 以及拥有陈鹭的女单项目 | 以及拥有陈露的女单项目 |
| 庞清和佟健终于可以放心地考虑退役的事情了 | 庞清、佟健 | 庞青董建终于可以放心地考虑退役的事情了 | 庞清佟健终于可以放心地考虑退役的事情了 |
| 赵继宏老板电器做厨电已经三十多年了 | 赵继宏 | 赵继红老板电器做厨店已经三十多年了 | 赵继宏老板电器做厨电已经三十多年了 |

yuekaizhang (Contributor) commented May 8, 2023

Thanks. Would you mind uploading the decoding results with and without hotwords somewhere? (Maybe a Hugging Face repo for the hotwords weights, ngram file, decoding results, and other essentials would be a good choice.)

Also, at https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary they use F1 score, recall, and precision to evaluate hotwords. Can we get these stats?

I am also interested in the general test set performance. Would you mind testing the WER on the normal AISHELL test set?
https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py#L37-L43

FieldsMedal (Author) commented May 9, 2023

> Thanks. Would you mind uploading the decoding results with and without hotwords somewhere? (Maybe a Hugging Face repo for the hotwords weights, ngram file, decoding results, and other essentials would be a good choice.)

Hotwords weights and ngram file: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/models
Decoding results: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/tree/main/results

  • The current ngram order is 4, so only hotwords of length <= 4 are supported. If you want to configure longer hotwords, you can use a higher-order ngram, but that will also increase the decoding time (see the sketch below).
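As an illustration of this length cap, here is a plain-Python sketch (not code from this PR) showing why a window of `order` tokens can never match a longer hotword, following the suffix-combination scheme described later in this thread; the 5-character hotword is made up:

```python
# The decoder only sees the last `order` decoded characters, so candidate
# strings built from that window can never cover a 5-character hotword.
order = 4
history = ['北', '京', '欢', '迎', '你']           # hypothetical 5-character hotword
window = history[-order:]                          # only the last 4 characters are visible
candidates = [''.join(window[i:]) for i in range(len(window) - 1)]
print(candidates)   # ['京欢迎你', '欢迎你', '迎你']: '北京欢迎你' is never formed
```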

> Also, at https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary they use F1 score, recall, and precision to evaluate hotwords. Can we get these stats?

Hotwords results on speech_asr_aishell1_hotwords_testsets:

| model (FP16)               | Latency (s) | CER   | Recall | Precision | F1-score |
|----------------------------|-------------|-------|--------|-----------|----------|
| offline model w/o hotwords | 5.5921      | 13.85 | 0.27   | 0.99      | 0.43     |
| offline model w/ hotwords  | 5.6401      | 12.16 | 0.45   | 0.97      | 0.62     |

> I am also interested in the general test set performance. Would you mind testing the WER on the normal AISHELL test set? https://github.com/yuekaizhang/Triton-ASR-Client/blob/main/client.py#L37-L43

Hotwords results on the AISHELL-1 test dataset:

| model (FP16)                 | RTF     | CER    |
|------------------------------|---------|--------|
| offline model w/o hotwords   | 0.00437 | 4.6805 |
| offline model w/ hotwords    | 0.00435 | 4.5831 |
| streaming model w/o hotwords | 0.01231 | 5.2777 |
| streaming model w/ hotwords  | 0.01142 | 5.1926 |

Test environment

  • CPU: 40-core Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
  • GPU: NVIDIA GeForce RTX 2080 Ti

yuekaizhang (Contributor) commented May 9, 2023


Many thanks. The results look nice. I was wondering whether, for both the w/ and w/o hotwords cases, you used this as the default external LM: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/blob/main/models/init_kenlm.arpa

Also, is the pretrained model from here: https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/s0#u2-conformer-result? It looks like the WER with WFST decoding + attention rescoring for offline and chunk16 is 4.4 & 4.75, and pure attention rescoring without any ngram is 4.63 & 5.05. I am not sure what the results would look like if you used the AISHELL train set to build the ARPA. I thought they used this 3-gram ARPA here: https://huggingface.co/yuekai/aishell1_tlg_essentials/blob/main/3-gram.unpruned.arpa

FieldsMedal (Author) commented May 9, 2023

> Many thanks. The results look nice. I was wondering whether, for both the w/ and w/o hotwords cases, you used this as the default external LM: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/blob/main/models/init_kenlm.arpa

> Also, is the pretrained model from here: https://github.com/wenet-e2e/wenet/tree/main/examples/aishell/s0#u2-conformer-result? It looks like the WER with WFST decoding + attention rescoring for offline and chunk16 is 4.4 & 4.75, and pure attention rescoring without any ngram is 4.63 & 5.05. I am not sure what the results would look like if you used the AISHELL train set to build the ARPA. I thought they used this 3-gram ARPA here: https://huggingface.co/yuekai/aishell1_tlg_essentials/blob/main/3-gram.unpruned.arpa

  1. init_kenlm.arpa is used to initialize the Scorer, because our hotwords boosting depends on Scorer::make_ngram; any language model trained with KenLM is fine. When decoding with hotwords, put the unique hotwords of each recording into batchhotwords; when decoding without hotwords, put None into batchhotwords. Whether the hotwords score is added is controlled by this. If the user also wants to add the ngram score, set use_ngram_score to true (see the sketch below).
  2. Our pretrained model is from https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.md, trained on the AISHELL datasets. Our results on the AISHELL-1 test dataset were obtained with the FP16 ONNX model, with ctc_weight 0.3 and reverse_weight 0.3. These settings may have some impact.
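A pure-Python sketch of the two switches described in point 1 (illustrative only, not the library's code): the hotwords bonus and the ngram LM score are added independently, so the ARPA file can be loaded purely to reuse make_ngram while use_ngram_score stays False.

```python
def extra_score(candidate_words, hotwords, use_ngram_score, ngram_log_prob):
    """Illustrative extra score for one prefix extension at frame t.

    candidate_words: strings built from the last `order` decoded tokens (via make_ngram)
    hotwords:        {hotword: weight} for this utterance, or None to disable boosting
    ngram_log_prob:  log p(w_t | history) from the KenLM model
    """
    score = 0.0
    if hotwords is not None:                  # None means: decode this utterance without hotwords
        for word in candidate_words:
            score += hotwords.get(word, 0.0)  # add the configured weight on a match
    if use_ngram_score:                       # the LM score is an independent switch
        score += ngram_log_prob
    return score

# Utterance 1 boosts '极点'; utterance 2 passes None, i.e. no boosting.
print(extra_score(['打听极点', '听极点', '极点'], {'极点': 5}, False, -2.3))  # 5.0
print(extra_score(['打听极点', '听极点', '极点'], None, False, -2.3))          # 0.0
```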

for (size_t index = 0; index < ngram.size(); index++) {
  std::string word = "";
  // character-based language model: combine Chinese characters into words
  if (ext_scorer->is_character_based()) {
yuekaizhang (Contributor) May 9, 2023

I don't fully understand why we need an external ARPA file to launch the hotword scorer when we set use_ngram_score=False. If ext_scorer->is_character_based() is the reason, how about setting an independent flag?

FieldsMedal (Author)

  1. First, consider how the ngram score is added when decoding at frame t. 1) The first step is to get the words or Chinese characters in a fixed window (the window size is the language model order). With a character-based language model, we take the Chinese characters at steps t, t-1, t-2, t-3 and get [a, b, c, d]; with a word-level language model, we take the words at steps t, t-1, t-2, t-3 and get [word1, word2, word3, word4]. This step is implemented by make_ngram. 2) The next step is to compute the conditional probability p(d | a, b, c) or p(word4 | word1, word2, word3). This step is implemented by get_log_cond_prob.
  2. Second, consider hotwords boosting at frame t. 1) The first step is to find the characters at time steps t, t-1, t-2, t-3, t-4, ...; this can be obtained directly with make_ngram, which is why hotwords boosting needs an external ARPA file to launch ext_scorer. When ext_scorer is a character-based language model, we combine the Chinese characters into words: [a, b, c, d] -> [abcd, bcd, cd]. When ext_scorer is a word-level language model, we get [word1, word2, word3, word4]. If one of these words is in the hotwords dictionary, we add the hotwords score (see the sketch below).
  3. When we set use_ngram_score=False, conditional probability scores are not calculated; the external ARPA file is only used to launch ext_scorer, and hotwords boosting relies on make_ngram. We could also write a function similar to make_ngram in hotwords.cpp, but our hotwords implementation directly reuses the already implemented make_ngram.
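A short Python sketch of the combination step in point 2 (the real logic lives in the C++ code around Scorer::make_ngram; this just mirrors the [a, b, c, d] -> [abcd, bcd, cd] example above):

```python
def candidate_hotwords(ngram, character_based):
    """Build the strings that are looked up in the hotwords dictionary."""
    if character_based:
        # character-based LM: join trailing characters into progressively shorter words
        return [''.join(ngram[i:]) for i in range(len(ngram) - 1)]
    # word-level LM: the window already holds whole words
    return list(ngram)

print(candidate_hotwords(['换', '一', '首', '歌'], True))               # ['换一首歌', '一首歌', '首歌']
print(candidate_hotwords(['word1', 'word2', 'word3', 'word4'], False))  # unchanged
```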

yuekaizhang (Contributor) May 10, 2023

I see. So, if we use a 4-gram ARPA as ext_scorer, we can't handle hotwords longer than 4 characters, like this one: https://huggingface.co/58AILab/wenet_u2pp_aishell1_with_hotwords/blob/main/models/hotwords.yaml#L4, right?

What do you think about copying that make_ngram into the HotWord class? That way we could set the hotword ngram max_length as well as is_character_or_not.

It would be a little confusing if we set lm_path but our final decoding results don't use the ngram LM.

FieldsMedal (Author)

  1. If a 4-gram ARPA is used as ext_scorer, only hotwords of at most 4 Chinese characters are supported.
  2. I plan to copy make_ngram into the HotWordsBoosting class and expose the hotwords length parameter.

FieldsMedal (Author)

In the latest commit, we separated hotwords scoring from language model scoring. The latest hotwords results on speech_asr_aishell1_hotwords_testsets and AISHELL-1 Test dataset can be found here.

FieldsMedal (Author)

In the latest commit, we changed batch_hotwords_scorer to hotwords_scorer. If you have free time, please help review this PR.

Slyne (Owner) commented May 18, 2023

@FieldsMedal Thanks!
One more question:
The output of test_zh.py for "Test hotwords boosting with word-level language models during ctc prefix beam search" is

INFO:root:Test hotwords boosting with word-level language models during ctc prefix beam search
INFO:root:('', '一', '换', '一首', '极点晚', '几点啦', '极点', '几点', '', '几', '晚', '极')

I'm not sure whether the above result is expected.

=================
Update:
Should be fine. It is the user's responsibility to ensure the vocabulary contains space_id.

Slyne merged commit 03259fd into Slyne:master on May 18, 2023
Slyne (Owner) commented May 18, 2023

Thanks again!
Really great feature! @FieldsMedal
