Inflection-64 Convert dictionary-parser to consume Wikidata #65

Merged
grhoten merged 4 commits into unicode-org:main on Jan 23, 2025

Conversation

grhoten
Member

@grhoten grhoten commented Jan 22, 2025

Resolves #64

This is a first attempt at getting the dictionary-parser to consume data from Wikidata. The tool collects the required data and converts it into the format used by the lexical dictionaries. I tested the generated data for all of the supported languages. The data is in the correct format, but some issues remain; the language-specific problems have been filed as separate issues to be addressed individually. The tests for this tool have not been converted yet. This is a starting point.

Java is needed to run this tool.

Some currently known issues include:

  1. Creating a correct inflection table for L7083 (theater/theatre) and L14678 (axe/ax) needs to be refined.
  2. Ignoring abbreviations and other properties or inflections likely needs to be fixed.
  3. Detecting and handling the bag of words entries needs to be tested.
  4. Tests were not converted over to the new format.
  5. Some unknown properties generated by this tool likely need one of the following:
    • Verify that it's actually a grammeme, and not some sort of topic of interest that should be moved to another field in Wikidata. These are frequently typos that require the data to be fixed.
    • Verify that it's not a phrase that should normally be handled by individual words. These are normally ignored lemmas.
    • Verify that it's meaningful for inflection (e.g. makes a state unique for inflection or for grammatical agreement). If it's unhelpful, it should be an ignored property or inflection.

Comment on lines +2 to +5
consonant-end=[^ ˈ'ˌ/\\[\\]iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ][ ˈ'ˌ/\\[\\]]*$
consonant-start=^[ ˈ'ˌ/\\[\\]]*[^ ˈ'ˌ/\\[\\]iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ]
vowel-end=[iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ][ ˈ'ˌ/\\[\\]]*$
vowel-start=^[ ˈ'ˌ/\\[\\]]*[iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ]
Member Author

@grhoten grhoten Jan 22, 2025

This is how we extract phonetic information for the 4 most common types: whether the pronunciation starts or ends with a consonant or a vowel. There may be other ways to get phonetic information from other properties, but I don't know what those types or formats are. This approach can likely be extended to the phonetic information needed for Turkish (front-round, front-unround, back-round, back-unround, hard-consonant, ...), Korean (rieul-end), and probably others. Phonetic information is not extracted by default. It is only extracted when the --add-sound option is used, which is important for English, French, Italian and others.
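
To illustrate how these four patterns behave, here is a minimal Java sketch that applies them to an IPA transcription. The property names and character sets are copied from the lines under review; the class name `SoundSketch`, the `classify` helper, and the use of `Matcher.find()` are assumptions made for illustration, not the tool's actual code.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the dictionary-parser.
final class SoundSketch {
    // Vowel symbols copied from the properties under review.
    private static final String VOWELS = "iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ";
    // Spaces, stress marks, and transcription delimiters, escaped for use inside [...].
    private static final String SKIP = " ˈ'ˌ/\\[\\]";

    // Same four patterns as in the properties file, in the same order.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("consonant-end", Pattern.compile("[^" + SKIP + VOWELS + "][" + SKIP + "]*$"));
        PATTERNS.put("consonant-start", Pattern.compile("^[" + SKIP + "]*[^" + SKIP + VOWELS + "]"));
        PATTERNS.put("vowel-end", Pattern.compile("[" + VOWELS + "][" + SKIP + "]*$"));
        PATTERNS.put("vowel-start", Pattern.compile("^[" + SKIP + "]*[" + VOWELS + "]"));
    }

    // Returns the names of all patterns found in the transcription.
    static Set<String> classify(String ipa) {
        Set<String> labels = new LinkedHashSet<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(ipa).find()) {
                labels.add(entry.getKey());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(classify("/aks/")); // [consonant-end, vowel-start]
        System.out.println(classify("[a]"));   // [vowel-end, vowel-start]
    }
}
```

Under these patterns, any symbol that is neither a listed vowel nor a delimiter counts as a consonant, so each transcription gets at most one start label and one end label.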

Contributor

@nciric nciric left a comment

Left some nits (comments/docs).

@grhoten grhoten merged commit 7d9c2c1 into unicode-org:main Jan 23, 2025
1 check passed