-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug in Kobo prefix generation #219
Comments
I implemented the prefix algorithm the best I could based on this page. 1- Two examples in that page seem to conflict with the explained algorithm! I marked them with 2- Adding support for alternates would be very very tricky, based on this note: Note: Kobo will only look in the file matching the word’s prefix, so if a variant has a different prefix, it must be duplicated into each matching file (note that duplicate words aren’t an issue). Unless we duplicate the definitions as different entries! We don't support alternates now (if there are, they would be joined with "|" as a single word) For now we need to test this code. |
Great! I'll do some testing on the results once the WTFs are solved. The two problematic cases
get their value due to the rule
The corresponding part in the go code is if !unicode.IsLetter(pfx[0]) || !unicode.IsLetter(pfx[1]) {
return "11" // if either of the first 2 chars are letters, return "11"
} One reasonable way to do that in python seems to be https://stackoverflow.com/a/6314634/114926. |
Okay that fixes the first WTF. Look at the last 2 rows in Examples! |
The second-last example has two spaces before, the last one has one. |
Also, FYI, Kobo dictionaries support having multiple entries for a single word, so if a headword has multiple distinct definitions (e.g. acronym vs normal word), you can have multiple entries in the file, all of which will show. |
No, both have just one space
|
And a few other things you might want to know:
|
Oh, I know I why. Whitespace collapsing. I'll need to fix that later. For now, look at https://github.com/geek1011/dictutil/blob/master/kobodict/util_test.go for test cases, as it includes nearly all possibilities and covers all codepaths. You can also use https://github.com/geek1011/kobo-mods/tree/master/dictword-test to generate new ones if you're so inclined. |
You may also be interested in https://github.com/geek1011/dictutil/blob/5b406d091dc94b720e3f4a3bee0b3f3b6a08bf88/kobodict/util.go#L94 if you want to see an implementation closer to what was in libnickel based on the reversed assembly for reference (and if you ever decide to look at libnickel yourself). |
@ilius, one more subtle bug: pyglossary/pyglossary/plugins/ebook_kobo.py Line 137 in 592c047
To have full support for how libnickel matches words, you need a space before the closing |
We don't need to lowercase the whole string if we only need first two characters. |
This should be ready for testing I guess. |
On a side note about speed, I have benchmarks of the different functions for generating prefixes. I do agree that the way libnickel does isn't the most efficient, and that's why I made my own implementation of it. |
For parsing dictfiles, here is the dictutil implementation: https://github.com/geek1011/dictutil/blob/master/dictgen/dictfile.go. It will probably be around the same length in Python. |
@ilius, I've taken a quick look at all the Kobo code, and everything looks good. |
And, if you're planning on implementing the dictfile format:
|
We ignore the files (data entries) right now. |
That would be a good idea, since nickel has a history of crashing with bad image references. If possible, a better idea might be to replace it with placeholder text from either the |
According to https://www.mobileread.com/forums/showpost.php?p=4007869&postcount=2, the prefix generation for kobo dictionaries does not work properly in all cases.
I didn't look into the details yet, but I would prefer to fix this in pyglossary instead of postprocessing the dictionaries to work around this issue.
The text was updated successfully, but these errors were encountered: