-
-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor bug on upper/lowercase variants in title search : the util.ucEveryFirstLetter function does not work properly when accents or other non-ASCII characters are used #179
Comments
That's a really funny one :) |
I thought this one was easy to fix, but it's not. |
@kelson42 Sorry for slow reply. Yes, I'll take a look at that regex. It's also possible that with widespread UTF8 support in JavaScript, the bug has gone away, i.e. the internal definition of a word boundary may no longer have that bug. |
I confirm this bug still exists in Firefox 75.0. |
There is a broader issue with the current regex, which is that I tested an initial solution to the bug, but still using |
If instead we select any character ( |
@Jaifroid What about something like this? |
@Jaifroid We can easily put the punctuations marks in this as well. |
Thanks, that's a good idea, just in case someone writes punctuation in their search instead of a space. |
@Jaifroid Do we need anything else to finally be able to fix that ticket. Would you be able to make a PR? |
@kelson42 No problem I'll make a PR tomorrow morning for testing. |
I've been looking into the punctuation issue, and since we can't use a Unicode property escapes (due to IE11 and Firefox not supporting them), we can instead compile a regex that does the equivalent, as given on this Stackexchange post:
The regex here (last line) is a monster, but regexes are lightning fast, and all this does is match users' (slowly) typed input, so it wouldn't have performance implications. Possibly a useful util regex to have anyway. I can well imagine we may have users searching for Note this regex is taken from the XRegExp library. An alternative would be to add XRegExp as a dependency, which would allow us to use, simply, In any case, I'll experiment in the PR. |
How is this handled in other Kiwix clients? (maybe it's handled in libzim or kiwix-lib) If it can help, I tried to use XRegExp in another branch (3 years ago) : see https://github.com/kiwix/kiwix-js/compare/fix-international-ucEveryFirstLetter In any case, it would be worth adding some unit tests with the implementation (you can see how I did it in the branch above). It helps understand which cases we try to cover, and it also checks non-regression when we refactor it. |
I can certainly add XRegExp as a dependency, @mossroy, if you think it might be worth it for more than this one use-case. For this case, a dev on Stackexchange has extracted the underlying regex that XRegExp compiles when using The advantage of using our own regex here is speed and ensuring light weight of the app. The downside is maintainability, but it seems unlikely we'll be changing it often (and I can add a note on how to update it to the code). I'm not against using XRegExp (it's a great library), if you think it is better to add it. (I would prefer it if we could add it simply with |
We have a title index handled by Xapian. Xapian manages everything (even the search pattern string parsing AFAIK). |
But I think there is still fallback parsing of the title pointer list, no? I saw a comment about that on the Android app GitHub somewhere, as not all ZIMs have Xapian indices, plus that causes a problem with split ZIMs, since Xapian can't handle them. I may be wrong / out-of-date on that. |
@Jaifroid Yes, there is a fallback here: https://github.com/kiwix/kiwix-lib/blob/master/src/reader.cpp#L800 |
For example, if you type "amérique" as a search string, the uppercase on first letter of each word variant is wrong : it looks for AméRique instead of Amérique.
This is because the function util.ucEveryFirstLetter uses a regular expression with word boundary \b, that only works on ASCII characters : it considers the é as a word boundary
The text was updated successfully, but these errors were encountered: