New version of JMnedict (the proper name dictionary) #41
Very old versions of JMdict and unofficial versions are unlikely to have the publication date entry at the end of the file.
Only English-language senses in JMdict contain part-of-speech tags. This info is displayed to users in definition tags and also used for deinflecting verbs and adjectives during term lookups. The old version of Yomichan-Import took the PoS tags from the final sense in the English version of an entry and applied them to every sense of every other language. For example, 川・かわ has two senses in English JMdict: a noun sense and a suffix sense. Therefore every sense of 川・かわ in every other language was tagged as a suffix. Instead, I suggest gathering all distinct PoS tags from each English entry and applying them all to each non-English sense. Every non-English sense of 川・かわ will therefore be tagged as both a noun and suffix.
That is doable, but it's a tradeoff between utility and bloat. Adding kana-to-kanji lookups doubles the size of the term database, and I'm not sure if that functionality is actually useful. I made a version like this last year if you'd like to try installing it and see for yourself: FooSoft/yomichan#2111 (comment). I've been using the version without the kana-to-kanji terms for about six months now and never found myself wishing for that functionality.
another issue I just noticed is if the reading is removed, freq dicts with readings (ex bccwj, maybe a yomichan change could allow for clean/compacted jmnedict entries while still allowing for kana searches and freq dicts with readings. (might even be some overlap with the changes described in this thread to allow for cleaner / more compact viewing of kanji/kana combinations) tangentially related: I keep forgetting that modes other than removing everything but grouped mode would also streamline development / testing / troubleshooting, since you'd have 1 less dimension of modes to worry about. maybe this could use its own thread on the yomichan repo... thanks for reading, let me know your thoughts on this!
Looking good!
@FooSoft, thanks again for your time. @Thermospore, it is indeed an issue that JMnedict contains no frequency information. For example, 若槻 might be read 「わかつき」 the vast majority of the time, but this isn't evident by looking at JMnedict. I actually mentioned this to the JMdict editors last year, although I didn't have any good solutions at the time. You made a good point that the BCCWJ frequency list could be used for this purpose. I just proposed this idea to the editors, and Dr. Breen agrees that it sounds promising. If and when this frequency information is adapted and added to JMnedict, I can update the Yomichan dictionary to include standard expression + reading terms for names that are included in the BCCWJ list. This will allow frequency lists, pitch accent lists, flashcards, etc., to function normally.
This pull request is to redesign the format of the JMnedict dictionary for Yomichan. It also includes a fix for a part-of-speech tag problem in non-English versions of JMdict.
New version of JMnedict
Related issue: FooSoft/yomichan#2111
Unlike the new version of JMdict, this redesign does not add new information or use any of Yomichan's new structured content features. It simply redesigns how the information is presented to users.
JMnedict contains a daunting number of entries that surpasses even JMdict. There are generally two types of entries in the file: (1) specific names of people, companies, events, etc., and (2) generic names such as given names and surnames. The latter category far outnumbers the former.
While the entries for specific names often provide useful information and context for a given term, the entries for generic names do not. The glossaries for generic names simply transliterate the term into Latin characters. So for example, the JMnedict entry for おおたに【大谷】 simply contains the gloss "Ootani" along with "place" and "surname" tags.
The problem is that JMnedict contains 44 generic name entries for the kanji 大. This means that any time a Yomichan user searches for a word beginning with 大, Yomichan will also retrieve all 44 generic name entries for 大. This clutters the search results with a large amount of low-quality information.
My suggestion is that we discard all glosses in generic entries with kanji forms. This way we can merge all generic entries sharing the same kanji form into single Yomichan entries.
Example: 尚三郎 (readings are moved to the glossaries for generic kanji terms)
Example: 山海経 (specific name entries retain glosses)
Example: 大谷海岸駅 (all 44 generic 大 entries merged into one)
Example: 林佳樹 (gloss is technically a transliteration but is retained because it has a space)
Example: じゅりあん (glosses are retained because they are not all transliterations)
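To make the merging idea concrete, here is a minimal Python sketch of it. This is not the code in this pull request. The entry layout (`kanji`, `reading`, `glosses`, `generic`) and the function name `merge_generic_entries` are hypothetical, and the sketch assumes that something upstream has already decided which entries count as generic (for example, entries whose glosses are all plain transliterations).

```python
def merge_generic_entries(entries):
    """Merge generic name entries that share a kanji form.

    Each entry is a dict with hypothetical keys:
      kanji   - kanji form ("" for kana-only entries)
      reading - kana reading
      glosses - list of gloss strings
      generic - True if the glosses are judged to be mere transliterations
    """
    readings_by_kanji = {}  # kanji form -> readings, in first-seen order
    output = []

    for entry in entries:
        if entry["generic"] and entry["kanji"]:
            # Discard the glosses; remember only the reading.
            readings = readings_by_kanji.setdefault(entry["kanji"], [])
            if entry["reading"] not in readings:
                readings.append(entry["reading"])
        else:
            # Specific names and kana-only entries pass through unchanged.
            output.append(entry)

    # One merged entry per kanji form, with the readings as the glossary.
    for kanji, readings in readings_by_kanji.items():
        output.append(
            {"kanji": kanji, "reading": "", "glosses": readings, "generic": True}
        )
    return output
```

With this approach, the 44 generic 大 entries would collapse into a single entry whose glossary lists the readings, while entries like 山海経 keep their original glosses.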
JMdict: missing part-of-speech tags
I noticed that non-English versions of the new JMdict dictionaries did not have part-of-speech tags, unlike the old versions.
Only English-language senses in JMdict contain part-of-speech tags. The old version of Yomichan-Import took the PoS tags from the final sense in the English version of an entry and applied them to every sense of every other language. For example, 川・かわ has two senses in English JMdict: a noun sense and a suffix sense. Therefore every sense of 川・かわ in every other language was tagged as a suffix.
Instead, I suggest gathering all distinct part-of-speech tags from each English entry and applying them all to each non-English sense. Every non-English sense of 川・かわ will therefore be tagged as both a noun and suffix. This still isn't ideal, but I think this is at least an improvement on the previous setup.
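As a rough illustration of the suggestion above (again, a sketch and not the code in this pull request), the following Python snippet gathers the distinct part-of-speech tags from all English senses of an entry and copies them onto each non-English sense. The sense layout (a dict with a `pos` list) and both function names are assumptions made for the example, and the `n` and `suf` tags in the usage note stand in for the noun and suffix senses of 川・かわ.

```python
def collect_pos_tags(english_senses):
    """Return the distinct PoS tags across all English senses, in first-seen order."""
    tags = []
    for sense in english_senses:
        for tag in sense.get("pos", []):
            if tag not in tags:
                tags.append(tag)
    return tags


def tag_non_english_senses(english_senses, other_senses):
    """Return copies of the non-English senses, each carrying every English PoS tag."""
    tags = collect_pos_tags(english_senses)
    # Copy the list for each sense so later edits to one sense don't affect another.
    return [{**sense, "pos": list(tags)} for sense in other_senses]
```

For an entry whose English senses are tagged `["n"]` and `["suf"]`, every non-English sense would receive both tags, rather than only the tags of the final English sense as in the old behavior.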
Test Dictionary Builds
jmnedict.zip (2023-02-02)
jmdict_russian.zip (2023-02-02)