nguni

Looking into digitizing the Nguni languages and increasing their digital footprint.

Goals

  • The goal of this project is to produce:
    • A language model - a probability distribution over sequences of words (see the sketch after this list)
    • A large, comprehensive dataset
    • A dictionary-like resource that spells out the use and meaning of each word or phrase in different contexts - detailing how it is used and misused, its proper spelling and common misspellings, its pronunciation and alternative pronunciations, how it has evolved over time, its origin, etc.
    • A text-to-speech (TTS) model (low priority)
    • All of the above documented in Nguni languages (low priority)
    • Text analysis (low priority)
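
As a minimal illustration of what "a probability distribution over sequences of words" means, the sketch below builds a tiny bigram model and scores a sentence. The toy corpus and the add-one smoothing are assumptions made for the demo, not design decisions for this project.

```python
from collections import Counter
import math

# Toy corpus; the real model would be trained on a large isiZulu/isiXhosa corpus.
corpus = [
    "ngiyakuthanda wena".split(),
    "ngiyabonga kakhulu".split(),
    "ngiyakuthanda kakhulu".split(),
]

# Count unigrams and bigrams, padding each sentence with start/end markers.
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def sentence_log_prob(sent):
    """log P(sentence) = sum of log P(w_i | w_{i-1}), with add-one smoothing."""
    tokens = ["<s>"] + sent + ["</s>"]
    vocab_size = len(unigrams)
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return logp

# Higher (less negative) log-probability means the model finds the sentence more likely.
print(sentence_log_prob("ngiyakuthanda kakhulu".split()))
```

A real model would be trained on the corpus described under "What it takes" below and would more likely be neural than count-based, but the object it produces is the same: a score for any sequence of words.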

Scope

  • Focus only on the South African Nguni languages (hence the name of the project), excluding the Nguni languages of Mozambique and Zimbabwe
  • Start with isiZulu and isiXhosa

Vision

Live in a world where I can

  • voice type in isiXhosa/isiZulu
  • get keyboard autocomplete in isiXhosa/isiZulu
  • and finally stop computers putting a squiggly line under my name

The bigger picture is to bring Nguni culture and heritage into the modern world and open the door to a wide range of possibilities, such as:

  • Closing the literacy and computer-literacy gap by allowing anyone and everyone to access modern tools in their native language

  • Making it possible to learn and teach in isiXhosa/isiZulu

  • Using isiXhosa/isiZulu to communicate at any level

  • Preserving and protecting culture and heritage

What it takes

  • Collect Language Data: gather isiXhosa/isiZulu language data to train your language model. This can include Xhosa/Zulu books, articles, news, and other text sources. You can also use publicly available datasets such as the South African National Corpus.

  • Preprocess and Clean the Data: This involves removing unwanted characters, punctuation, and other non-text elements (see the first sketch after this list).

  • Train a Language Model: using a framework such as PyTorch, TensorFlow, or Keras (see the training sketch after this list).

  • Fine-tune the Model: To improve the accuracy of your Xhosa/Zulu language model, you may need to fine-tune it. This involves training the model further on a smaller, more specific dataset to improve its performance on a particular task (see the note after the training sketch).

  • Test and Evaluate the Model: Finally, test and evaluate the model to ensure it produces accurate results, watching metrics such as perplexity, accuracy, and F1 score (see the evaluation sketch after this list).
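
A minimal sketch of what the collection and cleaning steps might look like, assuming a hypothetical data/raw/ directory of collected plain-text files; the cleaning rules here (lower-casing, dropping digits and punctuation, collapsing whitespace) are illustrative and would need refining against real Nguni orthography.

```python
import re
from pathlib import Path

RAW_DIR = Path("data/raw")      # hypothetical folder of collected .txt files
CLEAN_DIR = Path("data/clean")  # output folder for cleaned text
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

def clean_text(text: str) -> str:
    """Lower-case, strip digits/punctuation, and collapse whitespace."""
    text = text.lower()
    # isiZulu/isiXhosa use the Latin script, so keep letters and apostrophes.
    text = re.sub(r"[^a-z'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

for path in RAW_DIR.glob("*.txt"):
    cleaned = clean_text(path.read_text(encoding="utf-8"))
    (CLEAN_DIR / path.name).write_text(cleaned, encoding="utf-8")
```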
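
A minimal PyTorch sketch of the training step: a small word-level LSTM trained to predict the next word. The corpus path, model sizes, and step count are placeholders, and the snippet assumes the cleaned corpus holds more than one training sequence.

```python
import torch
import torch.nn as nn
from pathlib import Path

# Hypothetical cleaned corpus produced by the preprocessing step.
text = Path("data/clean/corpus.txt").read_text(encoding="utf-8").split()
vocab = sorted(set(text))
stoi = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([stoi[w] for w in text])

class WordLSTM(nn.Module):
    def __init__(self, vocab_size, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h)

model = WordLSTM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
seq_len = 32

for step in range(1000):
    # Sample random contiguous sequences; targets are the inputs shifted by one word.
    starts = torch.randint(0, len(ids) - seq_len - 1, (16,)).tolist()
    x = torch.stack([ids[s:s + seq_len] for s in starts])
    y = torch.stack([ids[s + 1:s + seq_len + 1] for s in starts])
    loss = loss_fn(model(x).reshape(-1, len(vocab)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "zulu_lm.pt")
```

Fine-tuning then amounts to reloading zulu_lm.pt, swapping in a smaller, task- or domain-specific corpus (for example, news text only), and running the same loop with a lower learning rate such as 1e-4.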
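
For evaluation, perplexity falls directly out of the average cross-entropy on held-out text: it is the exponential of the per-word loss, and lower is better. The sketch below reuses the WordLSTM model and stoi vocabulary from the training sketch and assumes a hypothetical held-out file.

```python
import math
import torch
import torch.nn as nn
from pathlib import Path

@torch.no_grad()
def perplexity(model: nn.Module, ids: torch.Tensor, seq_len: int = 32) -> float:
    """exp(average per-word cross-entropy) over held-out text."""
    model.eval()
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    total_loss, total_words = 0.0, 0
    for start in range(0, len(ids) - seq_len - 1, seq_len):
        x = ids[start:start + seq_len].unsqueeze(0)
        y = ids[start + 1:start + seq_len + 1].unsqueeze(0)
        logits = model(x)
        total_loss += loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1)).item()
        total_words += y.numel()
    return math.exp(total_loss / total_words)

# Usage, with model and stoi from the training sketch and a hypothetical held-out file.
heldout = Path("data/clean/heldout.txt").read_text(encoding="utf-8").split()
heldout_ids = torch.tensor([stoi[w] for w in heldout if w in stoi])  # drop OOV words in this toy setup
print(perplexity(model, heldout_ids))
```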

Contributions
