Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in Kobo prefix generation #219

Closed
karlb opened this issue Jul 4, 2020 · 19 comments
Closed

Bug in Kobo prefix generation #219

karlb opened this issue Jul 4, 2020 · 19 comments
Labels

Comments

@karlb
Copy link
Contributor

karlb commented Jul 4, 2020

According to https://www.mobileread.com/forums/showpost.php?p=4007869&postcount=2, the prefix generation for kobo dictionaries does not work properly in all cases.

I didn't look into the details yet, but I would prefer to fix this in pyglossary instead of postprocessing the dictionaries to work around this issue.

@ilius
Copy link
Owner

ilius commented Jul 5, 2020

I implemented the prefix algorithm the best I could based on this page.
Also added all test cases in that page.
But there are two problems.

1- Two examples in that page seem to conflict with the explained algorithm! I marked them with WTF in my test.

2- Adding support for alternates would be very very tricky, based on this note:

Note: Kobo will only look in the file matching the word’s prefix, so if a variant has a different prefix, it must be duplicated into each matching file (note that duplicate words aren’t an issue).

Unless we duplicate the definitions as different entries!

We don't support alternates now (if there are, they would be joined with "|" as a single word)

For now we need to test this code.

ilius added a commit that referenced this issue Jul 5, 2020
@karlb
Copy link
Contributor Author

karlb commented Jul 5, 2020

Great! I'll do some testing on the results once the WTFs are solved.

The two problematic cases

self.case(" 123", "11")  # WTF?
self.case(" 未", "11")  # WTF?

get their value due to the rule

If either of the first two characters are not in the Unicode Letter character class, return “11”.

The corresponding part in the go code is

		if !unicode.IsLetter(pfx[0]) || !unicode.IsLetter(pfx[1]) {
			return "11" // if either of the first 2 chars are letters, return "11"
		}

One reasonable way to do that in python seems to be https://stackoverflow.com/a/6314634/114926.

@ilius
Copy link
Owner

ilius commented Jul 5, 2020

Okay that fixes the first WTF.

Look at the last 2 rows in Examples!
Same word, different prefixes!
Typo?

ilius added a commit that referenced this issue Jul 5, 2020
@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

The second-last example has two spaces before, the last one has one.

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

Also, FYI, Kobo dictionaries support having multiple entries for a single word, so if a headword has multiple distinct definitions (e.g. acronym vs normal word), you can have multiple entries in the file, all of which will show.

@ilius
Copy link
Owner

ilius commented Jul 5, 2020

No, both have just one space

<td>”<code class="language-plaintext highlighter-rouge"> 未</code>”</td>

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

And a few other things you might want to know:

  • Kobo supports images in dictionaries, but these have a lot of gotchas (see https://pgaskin.net/dictutil/dicthtml/format.html). Basically, the best way to do it is to encode the images as a base64 data URL after shrinking it and making it grayscale (if it's JPG, this is as simple as only keeping the Y channel).
  • The new prefix generation algorithm is only for 4.7.10364+ (see https://pgaskin.net/dictutil/dicthtml/v1v2.html), so you may want to have two formats, as some people still use 3.19.5761. I haven't looked into the first format yet, but I might in the future.
  • One way to support most of these features as an output format without having to implement them yourself would be to support my dictfile format as an output (see https://pgaskin.net/dictutil/dictgen/). It's a simple intermediate format designed specifically to represent all supported features of Kobo dictionaries.
  • If want to be able to read from Kobo dictionaries, you can implement a dictfile input, which can then be used with dictzip-decompile (see https://pgaskin.net/dictutil/examples/dictzip-decompile.html).

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

No, both have just one space

Oh, I know I why. Whitespace collapsing. I'll need to fix that later.

For now, look at https://github.com/geek1011/dictutil/blob/master/kobodict/util_test.go for test cases, as it includes nearly all possibilities and covers all codepaths.

You can also use https://github.com/geek1011/kobo-mods/tree/master/dictword-test to generate new ones if you're so inclined.

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

You may also be interested in https://github.com/geek1011/dictutil/blob/5b406d091dc94b720e3f4a3bee0b3f3b6a08bf88/kobodict/util.go#L94 if you want to see an implementation closer to what was in libnickel based on the reversed assembly for reference (and if you ever decide to look at libnickel yourself).

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

@ilius, one more subtle bug:

htmlContents += f"<w><a name=\"{word}\"/><div><b>{word}</b>"\

To have full support for how libnickel matches words, you need a space before the closing /> for the a tag. See https://pgaskin.net/dictutil/dicthtml/format.html.

ilius added a commit that referenced this issue Jul 5, 2020
@ilius
Copy link
Owner

ilius commented Jul 5, 2020

We don't need to lowercase the whole string if we only need first two characters.
I rather stick to a faster implementation in Python.
Specially since get_prefix is also used for sorting (before grouping).

@ilius
Copy link
Owner

ilius commented Jul 5, 2020

This should be ready for testing I guess.
We can follow up Dictfile format in a new issue.

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

On a side note about speed, I have benchmarks of the different functions for generating prefixes.

I do agree that the way libnickel does isn't the most efficient, and that's why I made my own implementation of it.

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

For parsing dictfiles, here is the dictutil implementation: https://github.com/geek1011/dictutil/blob/master/dictgen/dictfile.go. It will probably be around the same length in Python.

@karlb
Copy link
Contributor Author

karlb commented Jul 5, 2020

I didn't get to any fancy cases yet, but the case folding works fine on my kobo, now. Considering the tests, I will assume that this is solved until I encounter a counter example. Thanks @geek1011 and @ilius!

@karlb karlb closed this as completed Jul 5, 2020
@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

@ilius, I've taken a quick look at all the Kobo code, and everything looks good.

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

And, if you're planning on implementing the dictfile format:

  • My parsing code is here.
  • You can find some test cases here, but note that they don't cover all edge cases.
  • It will allow you to use the data from https://pgaskin.net/dictutil/examples/webster1913-convert.html, which I believe is the only parser for the Gutenberg Webster's 1913 dictionary which parses all words correctly, preserves the structure, and doesn't discard content.
  • It would be a good idea to warn about images somehow, if they aren't supported by PyGlossary.

@ilius
Copy link
Owner

ilius commented Jul 5, 2020

We ignore the files (data entries) right now.
But <img ... tags are left in definitions if glossary has them.
Should we remove this tag?

@pgaskin
Copy link
Contributor

pgaskin commented Jul 5, 2020

That would be a good idea, since nickel has a history of crashing with bad image references.

If possible, a better idea might be to replace it with placeholder text from either the alt attribute or the basename of the href so users aren't left wondering where the image went.

ilius added a commit that referenced this issue Jul 6, 2020
@ilius ilius added the Bug label Oct 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants