Inflection-64 Convert dictionary-parser to consume Wikidata #65
consonant-end=[^ ˈ'ˌ/\\[\\]iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ][ ˈ'ˌ/\\[\\]]*$
consonant-start=^[ ˈ'ˌ/\\[\\]]*[^ ˈ'ˌ/\\[\\]iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ]
vowel-end=[iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ][ ˈ'ˌ/\\[\\]]*$
vowel-start=^[ ˈ'ˌ/\\[\\]]*[iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ]
This is how we extract phonetic information for the four most common types. Other properties may offer other ways to get phonetic information, but I don't know what those types or formats are. This approach can likely be extended to the phonetic information needed for Turkish (front-round, front-unround, back-round, back-unround, hard-consonant, ...), Korean (rieul-end), and probably others. It's not extracted by default; it is extracted only when the `--add-sound` option is used, which is important for English, French, Italian and others.
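To illustrate how patterns like the four above could be applied to an IPA string, here is a minimal sketch. It is not the tool's actual code: the class name `PhoneticClassifier`, the `classify` method, and the reduced vowel set are hypothetical and chosen only for illustration. Unicode escapes are used so the sketch compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the dictionary-parser itself.
public class PhoneticClassifier {
    // Characters skipped at either edge: space, primary stress (U+02C8), apostrophe,
    // secondary stress (U+02CC), slash, and square brackets.
    private static final String SKIP = " \u02C8'\u02CC/\\[\\]";

    // Reduced vowel set (assumption, for brevity): i u e o a, schwa (U+0259),
    // ash (U+00E6), open-mid e (U+025B), open o (U+0254).
    // The patterns in the diff list many more vowels.
    private static final String VOWELS = "iueoa\u0259\u00E6\u025B\u0254";

    private static final Pattern VOWEL_START =
        Pattern.compile("^[" + SKIP + "]*[" + VOWELS + "]");
    private static final Pattern VOWEL_END =
        Pattern.compile("[" + VOWELS + "][" + SKIP + "]*$");
    private static final Pattern CONSONANT_START =
        Pattern.compile("^[" + SKIP + "]*[^" + SKIP + VOWELS + "]");
    private static final Pattern CONSONANT_END =
        Pattern.compile("[^" + SKIP + VOWELS + "][" + SKIP + "]*$");

    // Returns the phonetic tags whose pattern matches the given IPA transcription.
    public static List<String> classify(String ipa) {
        List<String> tags = new ArrayList<>();
        if (VOWEL_START.matcher(ipa).find()) tags.add("vowel-start");
        if (CONSONANT_START.matcher(ipa).find()) tags.add("consonant-start");
        if (VOWEL_END.matcher(ipa).find()) tags.add("vowel-end");
        if (CONSONANT_END.matcher(ipa).find()) tags.add("consonant-end");
        return tags;
    }

    public static void main(String[] args) {
        // "/ˈæpəl/" written with escapes
        System.out.println(classify("/\u02C8\u00E6p\u0259l/")); // [vowel-start, consonant-end]
        // "[baˈnana]" written with escapes
        System.out.println(classify("[ba\u02C8nana]"));          // [consonant-start, vowel-end]
    }
}
```

Because stress marks and delimiters are skipped at both edges, a transcription is tagged by its first and last actual sound rather than by its surrounding notation.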
Left some nits (comments/docs).
Resolves #64
These changes are a first attempt at getting the dictionary-parser to consume data from Wikidata. The tool collects the required data and converts it into the format used by the lexical dictionaries. I tested the generated data for all of the supported languages. While the data is in the correct format, some issues remain. Those language-specific problems were filed as separate issues to be addressed individually. The tests for this tool have not been converted yet. This is a starting point.
Java is needed to run this tool.
Some currently known issues include: