Inflection-64 Convert dictionary-parser to consume Wikidata #65

grhoten · 2025-01-22T17:42:30Z

Resolves #64

These changes are a first attempt at getting the dictionary-parser to consume data from Wikidata. The tool collects the necessary data and converts it into the format required by the lexical dictionaries. I tested the generated data for all of the supported languages. The data is in the correct format, but some issues remain. The language-specific problems were filed as separate issues to be addressed individually. The tests for this tool have not been converted yet. This is a starting point.

Java is needed to run this tool.

Some currently known issues include:

- Creating a correct inflection table for L7083 (theater/theatre) and L14678 (axe/ax) needs to be refined.
- Ignoring abbreviations and other properties or inflections likely needs to be fixed.
- Detecting and handling the bag of words entries needs to be tested.
- Tests were not converted over to the new format.
- Some unknown properties generated by this tool likely need one of the following:
  - Verify that it's actually a grammeme, and not some sort of topic of interest that should be moved to another field in Wikidata. These are frequently typos that require the data to be fixed.
  - Verify that it's not a phrase that should normally be handled by individual words. These are normally ignored lemmas.
  - Verify that it's meaningful for inflection (e.g. makes a state unique for inflection or for grammatical agreement). If it's unhelpful, it should be an ignored property or inflection.

grhoten · 2025-01-22T17:48:25Z

inflection/tools/dictionary-parser/src/main/resources/org/unicode/wikidata/P898.properties

+consonant-end=[^ ˈ'ˌ/\\[\\]iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ][ ˈ'ˌ/\\[\\]]*$
+consonant-start=^[ ˈ'ˌ/\\[\\]]*[^ ˈ'ˌ/\\[\\]iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ]
+vowel-end=[iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ][ ˈ'ˌ/\\[\\]]*$
+vowel-start=^[ ˈ'ˌ/\\[\\]]*[iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ]


This is how we extract phonetic information for the 4 most common types. There may be other ways to get the phonetic information from other properties, but I don't know what those types or formats are. It's likely that this approach can also be extended to the phonetic information needed for Turkish (front-round, front-unround, back-round, back-unround, hard-consonant, ...), Korean (rieul-end), and probably others. Phonetic information is not extracted by default. It is extracted only when the `--add-sound` option is used, which is important for English, French, Italian, and others.
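To illustrate how the four patterns above behave, here is a minimal sketch that applies them to an IPA transcription. The character classes are copied from the P898.properties lines in this diff (with the properties-file escaping `\\[` reduced to the regex `\[`). The class name `SoundClassifier`, the `classify` method, and the sample transcriptions are hypothetical and only for illustration. This PR does not show how the tool itself loads or applies the patterns, so treat this as an assumption about the intent, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the dictionary-parser sources.
final class SoundClassifier {
    // Characters skipped at either edge of a transcription:
    // space, stress marks, apostrophe, slashes, and square brackets.
    private static final String SKIP = " ˈ'ˌ/\\[\\]";
    // Vowel symbols, copied from the P898.properties patterns.
    private static final String VOWELS = "iyɨʉɯuɪʏʊeøɘɵɤoe̞ø̞əɤ̞o̞ɛœɜɞʌɔæɐaɶäɑɒ";

    // Insertion order is kept so results are deterministic.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("consonant-end",
                Pattern.compile("[^" + SKIP + VOWELS + "][" + SKIP + "]*$"));
        PATTERNS.put("consonant-start",
                Pattern.compile("^[" + SKIP + "]*[^" + SKIP + VOWELS + "]"));
        PATTERNS.put("vowel-end",
                Pattern.compile("[" + VOWELS + "][" + SKIP + "]*$"));
        PATTERNS.put("vowel-start",
                Pattern.compile("^[" + SKIP + "]*[" + VOWELS + "]"));
    }

    // Returns the names of every pattern found in the transcription.
    static List<String> classify(String ipa) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(ipa).find()) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        System.out.println(classify("/ˈæpəl/")); // [consonant-end, vowel-start]
        System.out.println(classify("[kæt]"));   // [consonant-end, consonant-start]
    }
}
```

The negated classes include the skipped characters as well as the vowels. Without that, a leading `/` or stress mark would itself count as a "consonant", and every delimited transcription would be reported as consonant-start.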

nciric

Left some nits (comments/docs).

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/Inflection.java

...ction/tools/dictionary-parser/src/main/java/org/unicode/wikidata/ClaimsJsonDeserializer.java

inflection/tools/dictionary-parser/README.md

inflection/tools/dictionary-parser/src/main/java/org/unicode/wikidata/Inflection.java

Inflection-64 Convert dictionary-parser to consume Wikidata

8101076

grhoten mentioned this pull request Jan 22, 2025

Create extract-wikidata.py #45

Open

nciric approved these changes Jan 22, 2025

grhoten added 3 commits January 22, 2025 14:21

Inflection-64 Convert dictionary-parser to consume Wikidata

4b09549

Inflection-64 Convert dictionary-parser to consume Wikidata

c570160

Inflection-64 Convert dictionary-parser to consume Wikidata

78a3ec3

nciric approved these changes Jan 22, 2025

grhoten merged commit 7d9c2c1 into unicode-org:main Jan 23, 2025
1 check passed
