Minor bug on upper/lowercase variants in title search : the util.ucEveryFirstLetter function does not work properly when accents or other non-ASCII characters are used #179

mossroy · 2016-06-23T11:31:41Z

For example, if you type "amérique" as a search string, the uppercase on first letter of each word variant is wrong : it looks for AméRique instead of Amérique.

This is because the function util.ucEveryFirstLetter uses a regular expression with word boundary \b, that only works on ASCII characters : it considers the é as a word boundary

kelson42 · 2016-06-23T12:44:47Z

That's a really funny one :)

It does not work well in every case

mossroy · 2017-04-06T08:23:17Z

I thought this one was easy to fix, but it's not.
As it's a minor issue, I'll postpone it for now.
I've commited my work-in-progress on branch fix-international-ucEveryFirstLetter

kelson42 · 2020-04-23T12:23:17Z

@mossroy @Jaifroid I think this would be fixed in an acceptable manner by detecting the spaces instead of detecting the words. You could use \s for example and any other separator you judge appropriate. Would that be good for you?

Jaifroid · 2020-04-25T08:42:27Z

@kelson42 Sorry for slow reply. Yes, I'll take a look at that regex. It's also possible that with widespread UTF8 support in JavaScript, the bug has gone away, i.e. the internal definition of a word boundary may no longer have that bug.

Jaifroid · 2020-04-25T08:50:17Z

I confirm this bug still exists in Firefox 75.0.

Jaifroid · 2020-04-25T11:50:19Z

There is a broader issue with the current regex, which is that \w only matches [a-zA-Z] in JavaScript, i.e. only works with the ASCII alphabet. It's possibly this could be changed with the /u (Unicode) switch, but that is not supported in IE11.

I tested an initial solution to the bug, but still using \w, and the outcome can be seen in Regex Buddy here (note that I'm using a placeholder U to represent uppercase). As can be seen below, neither á in ágata nor any of the first letters of the Greek words are selected for conversion to uppercase.

Jaifroid · 2020-04-25T11:58:30Z

If instead we select any character (.) after the start of the string or after a space or any number of spaces, as @kelson42 suggested, then we would get what you see below. My only doubts are that this will pick up numbers and punctuation marks with spaces before them. However, I suspect that converting these to uppercase will just return the original value, but haven't tested yet.

kelson42 · 2020-04-25T12:10:19Z

@Jaifroid What about something like this? ([\s-–—_:.]+|^). and this can be easily extended with new separator characters.

kelson42 · 2020-04-25T12:16:24Z

@Jaifroid We can easily put the punctuations marks in this as well.

Jaifroid · 2020-04-25T16:55:17Z

Thanks, that's a good idea, just in case someone writes punctuation in their search instead of a space.

kelson42 · 2020-04-25T17:33:46Z

@Jaifroid Do we need anything else to finally be able to fix that ticket. Would you be able to make a PR?

Jaifroid · 2020-04-25T19:45:27Z

@kelson42 No problem I'll make a PR tomorrow morning for testing.

Jaifroid · 2020-04-26T09:22:48Z

I've been looking into the punctuation issue, and since we can't use a Unicode property escapes (due to IE11 and Firefox not supporting them), we can instead compile a regex that does the equivalent, as given on this Stackexchange post:

    // any kind of punctuation character (including international e.g. Chinese and Spanish punctuation)
    // author: http://www.regular-expressions.info/unicode.html
    // source: https://github.com/slevithan/xregexp/blob/41f4cd3fc0a8540c3c71969a0f81d1f00e9056a9/src/addons/unicode/unicode-categories.js#L142
    // note: XRegExp unicode output taken from http://jsbin.com/uFiNeDOn/3/edit?js,console (see chrome console.log), then converted back to JS escaped unicode here http://rishida.net/tools/conversion/, then tested on http://regexpal.com/
    // suggested by: https://stackoverflow.com/a/7578937
    // added: extra characters like "$", "\uFFE5" [yen symbol], "^", "+", "=" which are not consider punctuation in the XRegExp regex (they are currency or mathmatical characters)
    // added: \u3000-\u303F Chinese Punctuation for good measure
    var regex_characters_to_remove = /[\$\uFFE5\^\+=`~<>{}\[\]|\u3000-\u303F!-#%-\x2A,-/:;\x3F@\x5B-\x5D_\x7B}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E3B\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65]+/g

The regex here (last line) is a monster, but regexes are lightning fast, and all this does is match users' (slowly) typed input, so it wouldn't have performance implications. Possibly a useful util regex to have anyway. I can well imagine we may have users searching for ¿dónde? (película), which we need to transform to ¿Dónde? (Película), so it's true that we need some measure of punctuation support, but we shouldn't assume Western European languages. I have absolutely no idea about Arabic punctuation, for example, or whether Arabic has uppercase, but the regex above claims to deal with all punctuation in all languages.

Note this regex is taken from the XRegExp library. An alternative would be to add XRegExp as a dependency, which would allow us to use, simply, \p{P} (Unicode property escape for the Punctuation class) in all browsers. However, a whole library seems a bit OTT for this use case.

In any case, I'll experiment in the PR.

mossroy · 2020-04-27T09:21:22Z

How is this handled in other Kiwix clients? (maybe it's handled in libzim or kiwix-lib)
When this was first implemented in kiwix-js, we tried to stick with this behavior.

If it can help, I tried to use XRegExp in another branch (3 years ago) : see https://github.com/kiwix/kiwix-js/compare/fix-international-ucEveryFirstLetter
I did not have the time to complete that work at that time.

In any case, it would be worth adding some unit tests with the implementation (you can see how I did it in the branch above). It helps understand which cases we try to cover, and it also checks non-regression when we refactor it.

Jaifroid · 2020-04-27T09:53:28Z

I can certainly add XRegExp as a dependency, @mossroy, if you think it might be worth it for more than this one use-case. For this case, a dev on Stackexchange has extracted the underlying regex that XRegExp compiles when using \p{P} and has added a few more non-punctuation symbols which seem sensible to me (and they exist in Wikipedia searches). I have used this extracted version in #617 .

The advantage of using our own regex here is speed and ensuring light weight of the app. The downside is maintainability, but it seems unlikely we'll be changing it often (and I can add a note on how to update it to the code).

I'm not against using XRegExp (it's a great library), if you think it is better to add it. (I would prefer it if we could add it simply with npm install xregexp and var XRegExp = require('xregexp');!)

kelson42 · 2020-04-27T11:43:30Z

How is this handled in other Kiwix clients?

We have a title index handled by Xapian. Xapian manages everything (even the search pattern string parsing AFAIK).

Jaifroid · 2020-04-27T15:15:55Z

We have a title index handled by Xapian. Xapian manages everything (even the search pattern string parsing AFAIK).

But I think there is still fallback parsing of the title pointer list, no? I saw a comment about that on the Android app GitHub somewhere, as not all ZIMs have Xapian indices, plus that causes a problem with split ZIMs, since Xapian can't handle them. I may be wrong / out-of-date on that.

kelson42 · 2020-04-27T15:22:36Z

@Jaifroid Yes, there is a fallback here: https://github.com/kiwix/kiwix-lib/blob/master/src/reader.cpp#L800

Fixes #179 and #184. A complete solution to cancelling some running binary searches after user has initiated a new search will be dealt with in #637.

mossroy added the bug label Jun 23, 2016

mossroy added this to the v2.0 milestone Jun 23, 2016

mossroy self-assigned this Jun 23, 2016

mossroy modified the milestones: v2.0, v2.0-beta Jun 26, 2016

mossroy added a commit that referenced this issue Apr 4, 2017

Add unit tests to check #179, and initial implementation with XRegExp.

25c34b6

It does not work well in every case

mossroy modified the milestones: v2.1, v2.0 Apr 6, 2017

mossroy removed this from the v2.2 milestone Jan 4, 2018

Jaifroid mentioned this issue Apr 26, 2020

Better handling of case transformations in search #617

Merged

Jaifroid linked a pull request May 10, 2020 that will close this issue

Better handling of case transformations in search #617

Merged

Jaifroid mentioned this issue May 10, 2020

Add XRegExp dependency #618

Closed

Jaifroid closed this as completed in #617 Jul 11, 2020

Jaifroid added a commit that referenced this issue Jul 11, 2020

Better handling of case transformations in search (#617)

823d183

Fixes #179 and #184. A complete solution to cancelling some running binary searches after user has initiated a new search will be dealt with in #637.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minor bug on upper/lowercase variants in title search : the util.ucEveryFirstLetter function does not work properly when accents or other non-ASCII characters are used #179

Minor bug on upper/lowercase variants in title search : the util.ucEveryFirstLetter function does not work properly when accents or other non-ASCII characters are used #179

mossroy commented Jun 23, 2016

kelson42 commented Jun 23, 2016

mossroy commented Apr 6, 2017

kelson42 commented Apr 23, 2020

Jaifroid commented Apr 25, 2020

Jaifroid commented Apr 25, 2020 •

edited

Loading

Jaifroid commented Apr 25, 2020

Jaifroid commented Apr 25, 2020

kelson42 commented Apr 25, 2020 •

edited

Loading

kelson42 commented Apr 25, 2020

Jaifroid commented Apr 25, 2020

kelson42 commented Apr 25, 2020

Jaifroid commented Apr 25, 2020

Jaifroid commented Apr 26, 2020

mossroy commented Apr 27, 2020

Jaifroid commented Apr 27, 2020

kelson42 commented Apr 27, 2020

Jaifroid commented Apr 27, 2020 •

edited

Loading

kelson42 commented Apr 27, 2020

Minor bug on upper/lowercase variants in title search : the util.ucEveryFirstLetter function does not work properly when accents or other non-ASCII characters are used #179

Minor bug on upper/lowercase variants in title search : the util.ucEveryFirstLetter function does not work properly when accents or other non-ASCII characters are used #179

Comments

mossroy commented Jun 23, 2016

kelson42 commented Jun 23, 2016

mossroy commented Apr 6, 2017

kelson42 commented Apr 23, 2020

Jaifroid commented Apr 25, 2020

Jaifroid commented Apr 25, 2020 • edited Loading

Jaifroid commented Apr 25, 2020

Jaifroid commented Apr 25, 2020

kelson42 commented Apr 25, 2020 • edited Loading

kelson42 commented Apr 25, 2020

Jaifroid commented Apr 25, 2020

kelson42 commented Apr 25, 2020

Jaifroid commented Apr 25, 2020

Jaifroid commented Apr 26, 2020

mossroy commented Apr 27, 2020

Jaifroid commented Apr 27, 2020

kelson42 commented Apr 27, 2020

Jaifroid commented Apr 27, 2020 • edited Loading

kelson42 commented Apr 27, 2020

Jaifroid commented Apr 25, 2020 •

edited

Loading

kelson42 commented Apr 25, 2020 •

edited

Loading

Jaifroid commented Apr 27, 2020 •

edited

Loading