This project looks into digitizing Nguni languages and increasing their digital footprint.
- The goal of this project is to come up with:
  - A language model: a probability distribution over sequences of words
  - A large, comprehensive dataset
  - A dictionary-like resource that spells out the use and meaning of each word or phrase in different contexts: how it is used and misused, correct spelling and common misspellings, pronunciation and alternative pronunciations, how it has evolved over time, its origin, etc.
  - A text-to-speech (TTS) model (low priority)
  - Everything documented in Nguni languages (low priority)
  - Text analysis (low priority)
- The focus is only on South African Nguni languages (hence the name of the project), excluding Mozambican and Zimbabwean languages
  - isiZulu and isiXhosa as a starting point
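
The "language model" goal above, a probability distribution over sequences of words, can be written out explicitly. The first line is the standard chain-rule factorization; the second is a bigram approximation, which is an assumption chosen here for illustration and not something the project has committed to. Here $w_0$ stands for a hypothetical start-of-sentence marker.

```latex
% Chain rule: exact factorization of the probability of a word sequence
\[
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
\]

% Bigram approximation: condition each word only on the previous word
\[
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
\]
```

Under this view, keyboard autocomplete amounts to suggesting the word $w_i$ with the highest conditional probability given the words typed so far.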
Live in a world where I can
- voice type in isiXhosa/isiZulu
- get keyboard autocomplete in isiXhosa/isiZulu
- and finally stop seeing computers put a squiggly line under my name
The bigger picture is to bring Nguni culture and heritage to the modern world, and open doors to a wide range of possibilities, such as
- Closing the literacy and computer literacy gap by allowing everyone and anyone to access modern tools using their native languages
- Making it possible to learn and teach in isiXhosa/isiZulu
- Using isiXhosa/isiZulu to communicate at any level
- Preserving and protecting culture and heritage
To build the language model, the steps are
- Collect Language Data: gather isiXhosa/isiZulu language data to train the language model. This can include isiXhosa/isiZulu books, articles, news, and other text sources. Publicly available datasets such as the South African National Corpus can also be used.
- Preprocess and Clean the Data: remove unwanted characters, stray punctuation, and other non-text elements.
- Train a Language Model: use a deep learning framework such as PyTorch or TensorFlow (with Keras).
- Fine-tune the Model: to improve the accuracy of the isiXhosa/isiZulu language model, it may need fine-tuning. This involves training the model further on a smaller, more specific dataset to improve its performance on a particular task.
- Test and Evaluate the Model: finally, test and evaluate the model on held-out data to check that its results are accurate. Track perplexity for the language model itself, and accuracy and F1 score for downstream tasks.
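
The clean, train, and evaluate steps above can be sketched end to end on a toy scale. This is a minimal sketch under stated assumptions, not the project's actual pipeline: it swaps the neural model (PyTorch/TensorFlow) for a count-based bigram model with add-one smoothing so that it runs with the standard library only. The function names, the `<s>`/`</s>`/`<unk>` markers, and the few isiZulu greetings used as data are all illustrative.

```python
import math
import re
from collections import Counter

def tokenize_sentences(raw):
    """Split raw text into cleaned, tokenized sentences."""
    sentences = []
    for part in re.split(r"[.!?]+", raw):
        # Keep letters, apostrophes and hyphens; drop digits and other symbols.
        # Lowercasing is a simplification: it loses capitals such as in "isiZulu".
        tokens = re.sub(r"[^\w\s'-]|[\d_]", " ", part.lower()).split()
        if tokens:
            sentences.append(["<s>"] + tokens + ["</s>"])
    return sentences

def train_bigram(sentences):
    """Count contexts and bigrams; returns (context_counts, bigram_counts, vocab)."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<unk>"}  # hypothetical placeholder for unseen words
    for s in sentences:
        vocab.update(s)
        contexts.update(s[:-1])        # every token that has a successor
        bigrams.update(zip(s, s[1:]))  # adjacent word pairs
    return contexts, bigrams, vocab

def bigram_prob(model, prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    contexts, bigrams, vocab = model
    prev = prev if prev in vocab else "<unk>"
    word = word if word in vocab else "<unk>"
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def perplexity(model, sentences):
    """exp of the average negative log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for s in sentences:
        for prev, word in zip(s, s[1:]):
            log_sum += math.log(bigram_prob(model, prev, word))
            n += 1
    if n == 0:
        raise ValueError("no tokens to evaluate")
    return math.exp(-log_sum / n)

# Toy training text (illustrative only; a real corpus would be far larger).
train_text = "Sawubona, unjani? Ngiyaphila, ngiyabonga. Sawubona, ngiyaphila."
model = train_bigram(tokenize_sentences(train_text))

seen = perplexity(model, tokenize_sentences("Sawubona, unjani?"))
unseen = perplexity(model, tokenize_sentences("Unjani, sawubona."))
print(f"perplexity (word order seen in training): {seen:.2f}")
print(f"perplexity (word order not seen in training): {unseen:.2f}")
```

Lower perplexity means the model finds the text less surprising, so the sentence whose word order appeared in training should score lower than the reordered one. A neural model would replace `train_bigram` and `bigram_prob`, while the cleaning and perplexity steps would keep the same shape.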