This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

New version of JMnedict (the proper name dictionary) #41

Merged
merged 10 commits into from
Feb 5, 2023


stephenmk
Contributor

@stephenmk stephenmk commented Feb 2, 2023

This pull request is to redesign the format of the JMnedict dictionary for Yomichan. It also includes a fix for a part-of-speech tag problem in non-English versions of JMdict.

New version of JMnedict

Related issue: FooSoft/yomichan#2111

Unlike the new version of JMdict, this redesign does not add new information or use any of Yomichan's new structured content features. It simply redesigns how the information is presented to users.

JMnedict contains a daunting number of entries that surpasses even JMdict. There are generally two types of entries in the file: (1) specific names of people, companies, events, etc., and (2) generic names such as given names and surnames. The latter category far outnumbers the former.

While the entries for specific names often provide useful information and context for a given term, the entries for generic names do not. The glossaries for generic names simply transliterate the term into Latin characters. So for example, the JMnedict entry for おおたに【大谷】 simply contains the gloss "Ootani" along with "place" and "surname" tags.

The problem is that JMnedict contains 44 generic name entries for the kanji 大. This means that any time a Yomichan user searches for a word beginning with 大, Yomichan will also retrieve all 44 generic name entries for 大. This clutters the search results with a large amount of low-quality information.

My suggestion is that we discard all glosses in generic entries with kanji forms. This way we can merge all generic entries sharing the same kanji form into single Yomichan entries.
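A minimal sketch of the merge step, for illustration only. The `nameEntry` and `mergedTerm` types and the `mergeGeneric` function are hypothetical names, not the ones used in jmnedict.go, and the sample readings are illustrative rather than taken from the dictionary file. The idea is to group generic entries by kanji form and carry the readings along so they can be shown in the glossary instead of the transliterated glosses.

```go
package main

import (
	"fmt"
	"strings"
)

// nameEntry is a hypothetical, simplified stand-in for one generic
// JMnedict entry: a kanji form and a single kana reading.
type nameEntry struct {
	Kanji   string
	Reading string
}

// mergedTerm is one output term: a kanji form plus every distinct
// reading found for it across the generic entries.
type mergedTerm struct {
	Kanji    string
	Readings []string
}

// mergeGeneric groups entries by kanji form, skips duplicate
// (kanji, reading) pairs, and keeps first-seen order.
func mergeGeneric(entries []nameEntry) []mergedTerm {
	index := map[string]int{}    // kanji form -> position in merged
	seen := map[nameEntry]bool{} // pairs that were already added
	var merged []mergedTerm
	for _, e := range entries {
		if seen[e] {
			continue
		}
		seen[e] = true
		i, ok := index[e.Kanji]
		if !ok {
			i = len(merged)
			index[e.Kanji] = i
			merged = append(merged, mergedTerm{Kanji: e.Kanji})
		}
		merged[i].Readings = append(merged[i].Readings, e.Reading)
	}
	return merged
}

func main() {
	// Illustrative sample data, not copied from JMnedict.
	entries := []nameEntry{
		{"大谷", "おおたに"},
		{"大谷", "おおや"},
		{"大", "おお"},
		{"大", "だい"},
		{"大谷", "おおたに"}, // duplicate pair, ignored
	}
	for _, t := range mergeGeneric(entries) {
		fmt.Printf("%s: %s\n", t.Kanji, strings.Join(t.Readings, "、"))
	}
}
```

With this shape, a lookup that matches 大 returns one term listing its readings rather than one term per reading.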

Example: 尚三郎 (readings are moved to the glossaries for generic kanji terms)

[image: 尚三郎]

Example: 山海経 (specific name entries retain glosses)

[image: 山海経]

Example: 大谷海岸駅 (all 44 generic 大 entries merged into one)

[image: 大谷海岸駅]

Example: 林佳樹 (gloss is technically a transliteration but is retained because it has a space)

[image: 林佳樹]

Example: じゅりあん (glosses are retained because they are not all transliterations)

[image: julian]
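The last two examples imply a rule for deciding when glosses can be dropped. Here is a rough sketch of one way to express it, under loud assumptions: the reading is assumed to have been romanized already by some other step, the comparison ignores romanization variants (long vowels, ō vs. oo, and so on), and the function names and sample glosses are made up for illustration. This is not necessarily how jmnedict.go decides.

```go
package main

import (
	"fmt"
	"strings"
)

// isTransliteration reports whether gloss is just the reading written in
// Latin letters. romaji is the reading as romanized by a separate
// (hypothetical) step. A gloss containing a space is never treated as a
// plain transliteration, so it is kept.
func isTransliteration(gloss, romaji string) bool {
	if strings.Contains(gloss, " ") {
		return false
	}
	return strings.EqualFold(gloss, romaji)
}

// allTransliterations reports whether every gloss of an entry is a plain
// transliteration. Only then would the glosses be candidates for removal.
func allTransliterations(glosses []string, romaji string) bool {
	if len(glosses) == 0 {
		return false
	}
	for _, g := range glosses {
		if !isTransliteration(g, romaji) {
			return false
		}
	}
	return true
}

func main() {
	// Sample glosses are assumptions for illustration.
	fmt.Println(allTransliterations([]string{"Ootani"}, "ootani"))                  // true
	fmt.Println(allTransliterations([]string{"Hayashi Yoshiki"}, "hayashiyoshiki")) // false: has a space
	fmt.Println(allTransliterations([]string{"Jurian", "Julian"}, "jurian"))        // false: not all match
}
```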

JMdict: missing part-of-speech tags

I noticed that non-English versions of the new JMdict dictionaries did not have part-of-speech tags, unlike the old versions.

Only English-language senses in JMdict contain part-of-speech tags. The old version of Yomichan-Import took the PoS tags from the final sense in the English version of an entry and applied them to every sense of every other language. For example, 川・かわ has two senses in English JMdict: a noun sense and a suffix sense. Therefore every sense of 川・かわ in every other language was tagged as a suffix.

Instead, I suggest gathering all distinct part-of-speech tags from each English entry and applying them all to each non-English sense. Every non-English sense of 川・かわ will therefore be tagged as both a noun and a suffix. This still isn't ideal, but I think it is at least an improvement on the previous setup.
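To make the proposed behavior concrete, here is a small sketch. The `sense` type, the language codes, and the function names are simplified assumptions for illustration and do not mirror the actual structures in jmdict.go. It collects each PoS tag from the English senses once, then attaches that whole set to every non-English sense.

```go
package main

import (
	"fmt"
	"strings"
)

// sense is a hypothetical, simplified JMdict sense. Only English senses
// carry PoS tags in the source data.
type sense struct {
	Lang string   // e.g. "eng", "ger"
	PoS  []string // e.g. "n", "suf"
}

// distinctEnglishPoS returns every PoS tag found on the English senses,
// each tag once, in first-seen order.
func distinctEnglishPoS(senses []sense) []string {
	seen := map[string]bool{}
	var tags []string
	for _, s := range senses {
		if s.Lang != "eng" {
			continue
		}
		for _, p := range s.PoS {
			if !seen[p] {
				seen[p] = true
				tags = append(tags, p)
			}
		}
	}
	return tags
}

// applyEnglishPoS returns a copy of senses in which every non-English
// sense carries the full set of distinct English PoS tags.
func applyEnglishPoS(senses []sense) []sense {
	union := distinctEnglishPoS(senses)
	out := make([]sense, len(senses))
	for i, s := range senses {
		out[i] = s
		if s.Lang != "eng" {
			out[i].PoS = append([]string(nil), union...)
		}
	}
	return out
}

func main() {
	// 川・かわ: a noun sense and a suffix sense in English, one German sense.
	kawa := []sense{
		{Lang: "eng", PoS: []string{"n"}},
		{Lang: "eng", PoS: []string{"suf"}},
		{Lang: "ger"},
	}
	for _, s := range applyEnglishPoS(kawa) {
		fmt.Printf("%s: %s\n", s.Lang, strings.Join(s.PoS, ", "))
	}
}
```

Under the old behavior described above, the German sense would have received only the tags of the final English sense (`suf`). Here it receives both `n` and `suf`.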

Test Dictionary Builds

Very old versions of JMdict and unofficial versions are unlikely to
have the publication date entry at the end of the file.
Only English-language senses in JMdict contain part-of-speech tags.
This info is displayed to users in definition tags and also used
for deinflecting verbs and adjectives during term lookups.

The old version of Yomichan-Import took the PoS tags from the final
sense in the English version of an entry and applied them to every
sense of every other language. For example, 川・かわ has two senses in
English JMdict: a noun sense and a suffix sense. Therefore every sense
of 川・かわ in every other language was tagged as a suffix.

Instead, I suggest gathering all distinct PoS tags from each English
entry and applying them all to each non-English sense. Every
non-English sense of 川・かわ will therefore be tagged as both a noun
and a suffix.
@Thermospore

nice! yea currently I have jmnedict in its own profile with a different key to trigger it, cos it clutters things up. I'll have to try this out

one potential problem I see is that you can't do a kana -> kanji search for some entries. ex if you heard "おおやかいがん" and looked it up, this entry wouldn't show up
[image]

hopefully your ime or even just google could help you out in cases like this, but it is a bit of a regression

@stephenmk
Contributor Author

stephenmk commented Feb 3, 2023

That is doable, but it's a tradeoff between utility and bloat. Adding kana-to-kanji lookups doubles the size of the term database, and I'm not sure if that functionality is actually useful.

I made a version like this last year if you'd like to try installing it and see for yourself: FooSoft/yomichan#2111 (comment)

Example: よしたけ

[image: yoshitake]

I've been using the version without the kana-to-kanji terms for about six months now and never found myself wishing for that functionality.

@Thermospore

another issue I just noticed is if the reading is removed, freq dicts with readings (ex bccwj, B長 in my screenshot) wouldn't function anymore
[image]

maybe a yomichan change could allow for clean/compacted jmnedict entries while still allowing for kana searches and freq dicts with readings. (might even be some overlap with the changes described in this thread to allow for cleaner / more compact viewing of kanji/kana combinations)

tangentially related: I keep forgetting that modes other than group term-reading pairs exist... is there any reason not to use it? It might be better to just remove the other modes from yomichan and focus on improving grouped mode, instead of trying to finagle grouped mode-esque functionality into the other modes from the dictionary creation end

removing everything but grouped mode would also streamline development / testing / troubleshooting, since you'd have 1 less dimension of modes to worry about. maybe this could use its own thread on the yomichan repo...

thanks for reading, let me know your thoughts on this!

@FooSoft
Owner

FooSoft commented Feb 5, 2023

Looking good!

@FooSoft FooSoft merged commit f4da17e into FooSoft:master Feb 5, 2023
@stephenmk
Contributor Author

@FooSoft, thanks again for your time.

@Thermospore, it is indeed an issue that JMnedict contains no frequency information. For example, 若槻 might be read 「わかつき」 the vast majority of the time, but this isn't evident from JMnedict itself. I actually mentioned this to the JMdict editors last year, although I didn't have any good solutions at the time. You made a good point that the BCCWJ frequency list could be used for this purpose. I just proposed this idea to the editors, and Dr. Breen agrees that it sounds promising.

If and when this frequency information is adapted and added to JMnedict, I can update the Yomichan dictionary to include standard expression + reading terms for names that are included in the BCCWJ list. This will allow frequency lists, pitch accent lists, flashcards, etc., to function normally.

@Thermospore

> If and when this frequency information is adapted and added to JMnedict, I can update the Yomichan dictionary to include standard expression + reading terms for names that are included in the BCCWJ list. This will allow frequency lists, pitch accent lists, flashcards, etc., to function normally.

thanks for the response, sure that sounds like a good stopgap

next week when I have time, I'll make a thread on the yomichan repo about grouping modes, which would address the core of the issue

basically, I think grouped mode should be default (and various improvements / changes made), and have the other modes just be discontinued / hidden in advanced settings

probably 99% of people using a non grouped mode are just using it because it is default, or because of a feature it has which could just be implemented in grouped mode

the other modes are just holding things back, I think. I don't think grouped mode functionality should have to be finagled into all the modes, from the dictionary end:
[image]

it should all just be one mode
