Bug in Kobo prefix generation #219

karlb · 2020-07-04T20:08:51Z

According to https://www.mobileread.com/forums/showpost.php?p=4007869&postcount=2, the prefix generation for kobo dictionaries does not work properly in all cases.

I didn't look into the details yet, but I would prefer to fix this in pyglossary instead of postprocessing the dictionaries to work around this issue.

ilius · 2020-07-05T04:24:59Z

I implemented the prefix algorithm the best I could based on this page.
Also added all test cases in that page.
But there are two problems.

1- Two examples in that page seem to conflict with the explained algorithm! I marked them with WTF in my test.

2- Adding support for alternates would be very very tricky, based on this note:

Note: Kobo will only look in the file matching the word’s prefix, so if a variant has a different prefix, it must be duplicated into each matching file (note that duplicate words aren’t an issue).

Unless we duplicate the definitions as different entries!

We don't support alternates now (if there are, they would be joined with "|" as a single word)

For now we need to test this code.

…group_by_prefix_length param, #219

karlb · 2020-07-05T07:53:40Z

Great! I'll do some testing on the results once the WTFs are solved.

The two problematic cases

self.case(" 123", "11")  # WTF?
self.case(" 未", "11")  # WTF?

get their value due to the rule

If either of the first two characters are not in the Unicode Letter character class, return “11”.

The corresponding part in the go code is

		if !unicode.IsLetter(pfx[0]) || !unicode.IsLetter(pfx[1]) {
			return "11" // if either of the first 2 chars are letters, return "11"
		}

One reasonable way to do that in python seems to be https://stackoverflow.com/a/6314634/114926.

ilius · 2020-07-05T09:48:14Z

Okay that fixes the first WTF.

Look at the last 2 rows in Examples!
Same word, different prefixes!
Typo?

pgaskin · 2020-07-05T14:44:55Z

The second-last example has two spaces before, the last one has one.

pgaskin · 2020-07-05T14:45:54Z

Also, FYI, Kobo dictionaries support having multiple entries for a single word, so if a headword has multiple distinct definitions (e.g. acronym vs normal word), you can have multiple entries in the file, all of which will show.

ilius · 2020-07-05T14:50:22Z

No, both have just one space

<td>”<code class="language-plaintext highlighter-rouge"> 未</code>”</td>

pgaskin · 2020-07-05T14:52:51Z

And a few other things you might want to know:

Kobo supports images in dictionaries, but these have a lot of gotchas (see https://pgaskin.net/dictutil/dicthtml/format.html). Basically, the best way to do it is to encode the images as a base64 data URL after shrinking it and making it grayscale (if it's JPG, this is as simple as only keeping the Y channel).
The new prefix generation algorithm is only for 4.7.10364+ (see https://pgaskin.net/dictutil/dicthtml/v1v2.html), so you may want to have two formats, as some people still use 3.19.5761. I haven't looked into the first format yet, but I might in the future.
One way to support most of these features as an output format without having to implement them yourself would be to support my dictfile format as an output (see https://pgaskin.net/dictutil/dictgen/). It's a simple intermediate format designed specifically to represent all supported features of Kobo dictionaries.
If want to be able to read from Kobo dictionaries, you can implement a dictfile input, which can then be used with dictzip-decompile (see https://pgaskin.net/dictutil/examples/dictzip-decompile.html).

pgaskin · 2020-07-05T14:54:43Z

No, both have just one space

Oh, I know I why. Whitespace collapsing. I'll need to fix that later.

For now, look at https://github.com/geek1011/dictutil/blob/master/kobodict/util_test.go for test cases, as it includes nearly all possibilities and covers all codepaths.

You can also use https://github.com/geek1011/kobo-mods/tree/master/dictword-test to generate new ones if you're so inclined.

…ithub.com/geek1011/dictutil

pgaskin · 2020-07-05T15:26:07Z

You may also be interested in https://github.com/geek1011/dictutil/blob/5b406d091dc94b720e3f4a3bee0b3f3b6a08bf88/kobodict/util.go#L94 if you want to see an implementation closer to what was in libnickel based on the reversed assembly for reference (and if you ever decide to look at libnickel yourself).

pgaskin · 2020-07-05T15:48:27Z

@ilius, one more subtle bug:

pyglossary/pyglossary/plugins/ebook_kobo.py

Line 137 in 592c047

htmlContents += f"<w><a name=\"{word}\"/><div><b>{word}</b>"\

To have full support for how libnickel matches words, you need a space before the closing /> for the a tag. See https://pgaskin.net/dictutil/dicthtml/format.html.

ilius · 2020-07-05T17:02:15Z

We don't need to lowercase the whole string if we only need first two characters.
I rather stick to a faster implementation in Python.
Specially since get_prefix is also used for sorting (before grouping).

ilius · 2020-07-05T17:05:41Z

This should be ready for testing I guess.
We can follow up Dictfile format in a new issue.

pgaskin · 2020-07-05T17:08:56Z

On a side note about speed, I have benchmarks of the different functions for generating prefixes.

I do agree that the way libnickel does isn't the most efficient, and that's why I made my own implementation of it.

pgaskin · 2020-07-05T17:13:10Z

For parsing dictfiles, here is the dictutil implementation: https://github.com/geek1011/dictutil/blob/master/dictgen/dictfile.go. It will probably be around the same length in Python.

karlb · 2020-07-05T17:21:41Z

I didn't get to any fancy cases yet, but the case folding works fine on my kobo, now. Considering the tests, I will assume that this is solved until I encounter a counter example. Thanks @geek1011 and @ilius!

pgaskin · 2020-07-05T17:25:26Z

@ilius, I've taken a quick look at all the Kobo code, and everything looks good.

pgaskin · 2020-07-05T17:30:17Z

And, if you're planning on implementing the dictfile format:

My parsing code is here.
You can find some test cases here, but note that they don't cover all edge cases.
It will allow you to use the data from https://pgaskin.net/dictutil/examples/webster1913-convert.html, which I believe is the only parser for the Gutenberg Webster's 1913 dictionary which parses all words correctly, preserves the structure, and doesn't discard content.
It would be a good idea to warn about images somehow, if they aren't supported by PyGlossary.

ilius · 2020-07-05T18:05:17Z

We ignore the files (data entries) right now.
But <img ... tags are left in definitions if glossary has them.
Should we remove this tag?

pgaskin · 2020-07-05T18:06:18Z

That would be a good idea, since nickel has a history of crashing with bad image references.

If possible, a better idea might be to replace it with placeholder text from either the alt attribute or the basename of the href so users aren't left wondering where the image went.

…a warning about data entries, #219

ilius added a commit that referenced this issue Jul 5, 2020

Kobo: fix get_prefix algorithm and sorting order, #219

92ca961

ilius added a commit that referenced this issue Jul 5, 2020

Kobo: move get_prefix and sort_key funcs to Writer class, and remove …

b9c1ac3

…group_by_prefix_length param, #219

ilius added a commit that referenced this issue Jul 5, 2020

Kobo: get_prefix: add check for Unicode Letter category, #219

3268e23

ilius added a commit that referenced this issue Jul 5, 2020

Kobo: fix pasted typo in comment, #219

f35e5ca

ilius added a commit that referenced this issue Jul 5, 2020

ebook_kobo_test.py: fix and enable failing case, #219

88c8207

ilius added a commit that referenced this issue Jul 5, 2020

Kobo: #219: get_prefix: check for null byte, add tests from https://g…

592c047

…ithub.com/geek1011/dictutil

ilius added a commit that referenced this issue Jul 5, 2020

Kobo: #219: add a space before />

b121b4b

karlb closed this as completed Jul 5, 2020

ilius added a commit that referenced this issue Jul 6, 2020

Kobo: replace '<img src=...' tags with '[Image: name.bmp]', and show …

4565850

…a warning about data entries, #219

ilius added the Bug label Oct 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in Kobo prefix generation #219

Bug in Kobo prefix generation #219

karlb commented Jul 4, 2020

ilius commented Jul 5, 2020 •

edited

Loading

karlb commented Jul 5, 2020

ilius commented Jul 5, 2020

pgaskin commented Jul 5, 2020

pgaskin commented Jul 5, 2020

ilius commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020

pgaskin commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020 •

edited

Loading

ilius commented Jul 5, 2020

ilius commented Jul 5, 2020

pgaskin commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020

karlb commented Jul 5, 2020

pgaskin commented Jul 5, 2020

pgaskin commented Jul 5, 2020

ilius commented Jul 5, 2020

pgaskin commented Jul 5, 2020 •

edited

Loading

Bug in Kobo prefix generation #219

Bug in Kobo prefix generation #219

Comments

karlb commented Jul 4, 2020

ilius commented Jul 5, 2020 • edited Loading

karlb commented Jul 5, 2020

ilius commented Jul 5, 2020

pgaskin commented Jul 5, 2020

pgaskin commented Jul 5, 2020

ilius commented Jul 5, 2020 • edited Loading

pgaskin commented Jul 5, 2020

pgaskin commented Jul 5, 2020 • edited Loading

pgaskin commented Jul 5, 2020 • edited Loading

pgaskin commented Jul 5, 2020 • edited Loading

ilius commented Jul 5, 2020

ilius commented Jul 5, 2020

pgaskin commented Jul 5, 2020 • edited Loading

pgaskin commented Jul 5, 2020

karlb commented Jul 5, 2020

pgaskin commented Jul 5, 2020

pgaskin commented Jul 5, 2020

ilius commented Jul 5, 2020

pgaskin commented Jul 5, 2020 • edited Loading

ilius commented Jul 5, 2020 •

edited

Loading

ilius commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020 •

edited

Loading

pgaskin commented Jul 5, 2020 •

edited

Loading