JMdict and JMdict Forms Do Not Have Valid Revision Dates #9

Closed
MarvNC opened this issue Sep 25, 2023 · 4 comments · Fixed by #10

MarvNC commented Sep 25, 2023

[Screenshot: "Welcome to Yomibaba!" page in Google Chrome, 2023-09-25]
This happens on a fresh compile of both dictionaries. It is seen in the .zip files distributed in Aquafina-water-bottle/jmdict-english-yomichan and MarvNC/jmdict-yomitan.

@stephenmk

The code expects the JMdict date entry to be the final entry in the file. A couple of months ago they started including a small selection of JMnedict (name entries) in the JMdict file, so the date entry is no longer the final entry.

Instead of looking for the final entry, I guess you'd want to find the entry with the sequence number equal to 9999999. Or find the entry with the expression JMdict.

https://github.com/themoeway/yomitan-import/blob/73b35ff03a78de0c5bb9881eb1d99af121746dab/jmdict.go#L65-L83
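A rough sketch of the sequence-number approach, in case it helps. This is not the actual yomitan-import code: the `entry` type, the `findRevisionDate` helper, and the assumption that the date appears as a `YYYY-MM-DD` string inside one of the date entry's glosses are all made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// entry is a hypothetical, simplified stand-in for a parsed JMdict entry.
type entry struct {
	Sequence int
	Glosses  []string
}

// dateEntrySequence is the sequence number of the date entry mentioned above.
const dateEntrySequence = 9999999

// datePattern assumes the date is written as YYYY-MM-DD somewhere in a gloss.
var datePattern = regexp.MustCompile(`\d{4}-\d{2}-\d{2}`)

// findRevisionDate scans every entry for the date entry instead of
// assuming that it is the final entry in the file.
func findRevisionDate(entries []entry) (string, bool) {
	for _, e := range entries {
		if e.Sequence != dateEntrySequence {
			continue
		}
		for _, g := range e.Glosses {
			if d := datePattern.FindString(g); d != "" {
				return d, true
			}
		}
	}
	return "", false
}

func main() {
	// Illustrative data only: the date entry is followed by another entry,
	// as happens now that name entries are included in the file.
	entries := []entry{
		{Sequence: 1000000, Glosses: []string{"example"}},
		{Sequence: 9999999, Glosses: []string{"Creation Date: 2023-09-25"}},
		{Sequence: 5000000, Glosses: []string{"trailing name entry"}},
	}
	if d, ok := findRevisionDate(entries); ok {
		fmt.Println(d) // prints 2023-09-25
	}
}
```

Matching on the expression JMdict instead would work the same way, with the comparison done on the headword rather than the sequence number.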

Also, I've been slowly working on adding JMdict to jitenbot. The mdict (MDX/MDD) version is pretty much finished. Eventually I plan to get it working with yomichan too, but that's a pain because yomichan's format is so much more limited. So while it may be many months in the future until everything is ported over to jitenbot (including the name dictionary), you may want to reconsider spending too much time on yomitan-import.

[Image: sujou]


MarvNC commented Sep 25, 2023

Wow that looks awesome, are there any significant improvements planned for the Yomichan version?

And yeah, just hoping to fix the rev version issue for now.


stephenmk commented Sep 25, 2023

are there any significant improvements planned for the Yomichan version?

There will be only one yomichan JSON "term" per JMdict entry per headword. In my current version there's one JSON term per JMdict sense multiplied by the number of headwords, which results in an astronomical number of terms [1]. It's possible that merging the JSON terms like this may result in faster validation times when importing the dictionary file, although I won't know until I try [2].

This will solve the "Merging of terms from separate entries" problem that I wrote about in this pull request.
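To make the difference in term counts concrete, here is a minimal sketch. The `jmEntry`, `sense`, and `term` types and both helper functions are hypothetical and far simpler than yomichan's real term format; the data is placeholder text, not real JMdict content.

```go
package main

import (
	"fmt"
	"strings"
)

// sense, jmEntry, and term are hypothetical, simplified types for illustration.
type sense struct {
	Glosses []string
}

type jmEntry struct {
	Headwords []string
	Senses    []sense
}

type term struct {
	Expression  string
	Definitions []string
}

// termsPerSense emits one term per sense per headword (the current design).
func termsPerSense(e jmEntry) []term {
	var terms []term
	for _, h := range e.Headwords {
		for _, s := range e.Senses {
			terms = append(terms, term{
				Expression:  h,
				Definitions: []string{strings.Join(s.Glosses, "; ")},
			})
		}
	}
	return terms
}

// termsPerEntry emits one term per headword, with every sense of the
// entry collected into that single term (the planned design).
func termsPerEntry(e jmEntry) []term {
	var terms []term
	for _, h := range e.Headwords {
		t := term{Expression: h}
		for _, s := range e.Senses {
			t.Definitions = append(t.Definitions, strings.Join(s.Glosses, "; "))
		}
		terms = append(terms, t)
	}
	return terms
}

func main() {
	e := jmEntry{
		Headwords: []string{"headword-1", "headword-2"},
		Senses: []sense{
			{Glosses: []string{"gloss a"}},
			{Glosses: []string{"gloss b"}},
			{Glosses: []string{"gloss c"}},
		},
	}
	// 2 headwords x 3 senses = 6 terms, versus 2 terms after merging.
	fmt.Println(len(termsPerSense(e)), len(termsPerEntry(e))) // prints 6 2
}
```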

Since this design means I'll no longer be able to use yomichan's term tags to display part-of-speech and other miscellaneous information, I'm going to use embedded image files to display the information instead. In some ways this is an improvement, because yomichan's term tags do not display this information in the correct order. Most people probably don't know this, but the order of these tags can be important to understanding JMdict entries. If the "adj-no" tag is the first tag, for example, it means that the word is mostly used as 〜の and the definition glosses will be written as adjectives (rather than nouns, adverbs, etc). Sometimes these definition glosses can be interpreted differently (English has plenty of words that can be both nouns and verbs), so the tags are there to resolve that ambiguity.

Using embedded images also means we'll be able to avoid the emoji problem that lots of people have with chrome-based browsers. I'll also be able to use embedded images instead of weird symbols (🅁, ⚠, ⛬, etc.) in the forms table. Since embedded images support hover-text in yomichan, users will be able to hover over and see additional information if they don't understand the symbols at first.

I'm also now grouping the senses by their part-of-speech tags. So if three senses in a row are all "noun" glosses, then they'll be grouped together under a single noun tag rather than displaying the noun tag on each sense.
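The grouping could look something like the sketch below. It is only an illustration, not jitenbot's code: the `sense` and `senseGroup` types and the `groupByPOS` function are invented here, and the tag comparison is order-sensitive on the assumption that tag order carries meaning.

```go
package main

import "fmt"

// sense and senseGroup are hypothetical types for illustration.
type sense struct {
	POS   []string // part-of-speech tags, in their original order
	Gloss string
}

type senseGroup struct {
	POS     []string
	Glosses []string
}

// sameTags reports whether two tag lists are identical, including order.
func sameTags(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// groupByPOS merges consecutive senses that share the same tag list,
// so the tags are shown once per group instead of once per sense.
func groupByPOS(senses []sense) []senseGroup {
	var groups []senseGroup
	for _, s := range senses {
		last := len(groups) - 1
		if last >= 0 && sameTags(groups[last].POS, s.POS) {
			groups[last].Glosses = append(groups[last].Glosses, s.Gloss)
			continue
		}
		groups = append(groups, senseGroup{POS: s.POS, Glosses: []string{s.Gloss}})
	}
	return groups
}

func main() {
	// Placeholder senses: three noun senses in a row, then one adj-no sense.
	senses := []sense{
		{POS: []string{"n"}, Gloss: "gloss a"},
		{POS: []string{"n"}, Gloss: "gloss b"},
		{POS: []string{"n"}, Gloss: "gloss c"},
		{POS: []string{"adj-no"}, Gloss: "gloss d"},
	}
	for _, g := range groupByPOS(senses) {
		fmt.Println(g.POS, len(g.Glosses))
	}
}
```

Only senses that are adjacent get merged, so a noun sense appearing after the adj-no sense would start a new group rather than joining the first one.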

[Image: chuuchou]

I'm also using furigana in all cross-referenced words now. I want to add furigana to the example sentences as well by using a variety of different resources, but we'll see how well that goes.

[Image: kigo]

Footnotes

  1. I designed the current version that way because that's how the original one worked, and some small yomichan features (e.g. term tags for parts of speech and miscellaneous info) relied upon each JMdict sense being a separate JSON term. Now that I have a better understanding of how this stuff works, I feel more comfortable breaking with that tradition.

  2. It will be nice if it does import faster, but that's not my main goal. This isn't really an issue to be solved on the dictionary side; the JSON validation process in yomichan badly needs to be optimized.


MarvNC commented Sep 25, 2023

Oh wow, didn't know about the tag order issue. Looks like some great improvements with the image tags and grouping, looking forward to seeing this release for jitenbot!
