Adding new Language (Pali), and CrowdSourcing the Translation #268
-
To include a new language in a future release of a multilingual translation model, the necessary (but not sufficient) conditions are (1) to have some parallel training data for it, and (2) to have a gold-standard parallel dataset for evaluation. Both should have permissive licenses. Currently, the main parallel dataset for evaluating translation is FLORES-200, based on texts from Wikimedia, along with its spoken version, FLEURS. For speech data, one of the best ways of crowdsourcing is to contribute your spoken utterances to https://commonvoice.mozilla.org.
Personally, I do not think Pali will be added soon. Currently, the most multilingual publicly available translation model, NLLB, does not include even more popular extinct languages, such as Latin or Sanskrit. And if its set of languages is expanded, I expect that languages with living native speakers will be prioritized. Nevertheless, I think that if someone collects a large enough parallel corpus of Pali with some modern higher-resourced language (there is already at least one parallel corpus, https://github.com/topics/pali-tripitaka), it is possible to fine-tune a dedicated version of NLLB specifically for translating to or from this language.
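As a rough illustration of the two data conditions above, here is a minimal Python sketch that cleans a list of aligned sentence pairs and sets aside a reproducible held-out portion for evaluation. The function name `split_parallel_corpus`, its parameters, and the toy placeholder strings are all assumptions made for this example; they are not part of NLLB or FLORES-200. A random held-out split is also only a stand-in: a real gold-standard evaluation set would need professionally checked translations.

```python
import random


def split_parallel_corpus(pairs, eval_size, seed=0):
    """Clean aligned (source, target) pairs and split them into train and eval sets."""
    # Drop pairs with an empty side and exact duplicates, keeping first-seen order.
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        pair = (src.strip(), tgt.strip())
        if pair[0] and pair[1] and pair not in seen:
            seen.add(pair)
            cleaned.append(pair)
    if eval_size >= len(cleaned):
        raise ValueError("eval_size must be smaller than the number of usable pairs")
    # Shuffle with a fixed seed so the held-out set is reproducible.
    rng = random.Random(seed)
    rng.shuffle(cleaned)
    return cleaned[eval_size:], cleaned[:eval_size]


# Hypothetical toy data: placeholder strings, not real Pali text.
toy_pairs = [(f"pali sentence {i}", f"english sentence {i}") for i in range(10)]
toy_pairs.append(("pali sentence 0", "english sentence 0"))  # exact duplicate
toy_pairs.append(("", "english only"))  # missing source side
train, held_out = split_parallel_corpus(toy_pairs, eval_size=2)
print(len(train), len(held_out))
```

Keeping the evaluation pairs strictly separate from the training pairs matters here, because any overlap would inflate the scores of a fine-tuned model.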
-
How can old languages such as Pali be added to the model?
The Buddha spoke Pali, and it is a widely studied and documented language.
It has rich scriptures about mindfulness and staying in the present.
Giving ordinary people access to the Buddha's direct teachings can help them become more mindful and happier.
Is it possible to crowdsource such community projects?