Facing error while training a character ngram model using kenlm #435

Open
fkhan98 opened this issue Jul 20, 2023 · 0 comments

fkhan98 commented Jul 20, 2023

I want to train a character n-gram model for the Bangla language. I have preprocessed my corpus so that it looks like this small sample:

| অ প ে ক ্ ষ া | ক র ত ে ন | উ প ভ ো গ | ক র ত ে ন | ত া র | উ জ ্ জ ্ ব ল | উ প স ্ থ ি ত ি | এ ই | স র ক া র | ল ু ট ে র া | ত ো ষ ণ ক া র ী ঃ | র ু ম ি ন | ফ া র হ া ন া | হ ্ য া ঁ | আ প ন া র | র ে জ ি স ্ ট ্ র ে শ ন | ফ র ্ ম | স ম ্ প ন ্ ন | ক র া র | প র | প র ি ব র ্ ত ন | ক র া | স ম ্ ভ ব | স া ধ া র ণ ত | ন ি শ ্ চ ি ত ক র ণ | এ ব ং | চ া ল া ন |......

Here I have appended all the meaningful sentences in my dataset, one after another, on a single line of a .txt file. The characters of each word are separated by spaces, and word boundaries are marked with a |. The training data is around 7 GB, which is large for a text corpus.
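For reference, the preprocessing described above could look roughly like the sketch below. This is not the exact script used for the corpus; the function names `to_char_level` and `build_corpus_line` are hypothetical, and the sketch assumes that "character" means a Unicode code point (so vowel signs and the hasanta become separate tokens, as in the sample) and that sentences are joined with a single `|`.

```python
def to_char_level(sentence: str, boundary: str = "|") -> str:
    """Space-separate the code points of each word; put `boundary` between words."""
    words = sentence.split()
    # Iterating over a str yields Unicode code points, so dependent vowel
    # signs and the hasanta end up as separate tokens.
    spelled = [" ".join(word) for word in words]
    return f" {boundary} ".join(spelled)


def build_corpus_line(sentences, boundary: str = "|") -> str:
    """Join all sentences into one line, with `boundary` around every word."""
    # Assumption: empty sentences are skipped and sentences are separated
    # by the same boundary token as words.
    parts = [to_char_level(s, boundary) for s in sentences if s.strip()]
    return f"{boundary} " + f" {boundary} ".join(parts) + f" {boundary}"


if __name__ == "__main__":
    print(build_corpus_line(["করতেন এই", "সরকার"]))
```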

I want to train a 6-gram model using the command:
./kenlm/build/bin/lmplz -o 6 --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
A sample of what the path_to_my_preprocessed_text_corpus.txt file looks like is shown above.

Running this command gives the following error:

=== 1/5 Counting and sorting n-grams ===
Reading /home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/our_data/train_processed_char_level_git_data_proper_nouns_ai4bharat.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Unigram tokens 2832827518 types 64
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:768 2:6642523648 3:12454731776 4:19927572480 5:29061042176 6:39855144960
/home/fahim/codes/wav2vec2/wav2vec2_grapheme/beam_search_LM/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 2 because we didn't observe any 1-grams with adjusted count 1; Is this small or artificial data?
Try deduplicating the input. To override this error for e.g. a class-based model, rerun with --discount_fallback

Aborted (core dumped)

But when I run the same command with --discount_fallback added, the error no longer occurs and training starts. The command with --discount_fallback is:
./kenlm/build/bin/lmplz -o 6 --discount_fallback --memory 80% < "path_to_my_preprocessed_text_corpus.txt" > "./saved_lm/6gram_model.arpa"
My questions are: why does this happen? And if I train with --discount_fallback, will there be anything wrong with the resulting model?
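To make the condition in the error message concrete, here is a small sketch of how I understand it. It assumes that the "adjusted count" of a 1-gram in a higher-order Kneser-Ney model is the number of distinct tokens seen directly before it, and it ignores any special handling of sentence-boundary tokens. It is a simplified illustration on toy data, not KenLM's actual code, and the helper names are made up.

```python
from collections import Counter, defaultdict


def unigram_adjusted_counts(tokens):
    """Adjusted count of a token = number of distinct tokens seen directly before it."""
    left_contexts = defaultdict(set)
    for prev, cur in zip(tokens, tokens[1:]):
        left_contexts[cur].add(prev)
    return {tok: len(ctx) for tok, ctx in left_contexts.items()}


def count_of_counts(adjusted, max_count=4):
    """n[j] = number of token types whose adjusted count is exactly j."""
    freq = Counter(adjusted.values())
    return {j: freq.get(j, 0) for j in range(1, max_count + 1)}


# Toy character-level stream with a tiny vocabulary, in the same format as the corpus.
tokens = "| a b | b a | a a | b b |".split()
adjusted = unigram_adjusted_counts(tokens)
n = count_of_counts(adjusted)

print(adjusted)   # every type follows at least two different types
print(n[1] == 0)  # the `s.n[j] == 0` situation from the error, for j = 1
```

With only 64 types in about 2.8 billion tokens, it seems plausible that every character follows at least two different characters, so no 1-gram would have adjusted count 1. This only illustrates the counting condition named in the error; it says nothing about the quality of a model trained with --discount_fallback.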
