Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor bug on upper/lowercase variants in title search : the util.ucEveryFirstLetter function does not work properly when accents or other non-ASCII characters are used #179

Closed
mossroy opened this issue Jun 23, 2016 · 18 comments · Fixed by #617
Assignees
Labels

Comments

@mossroy
Copy link
Contributor

mossroy commented Jun 23, 2016

For example, if you type "amérique" as a search string, the uppercase on first letter of each word variant is wrong : it looks for AméRique instead of Amérique.

This is because the function util.ucEveryFirstLetter uses a regular expression with word boundary \b, that only works on ASCII characters : it considers the é as a word boundary

@mossroy mossroy added the bug label Jun 23, 2016
@mossroy mossroy added this to the v2.0 milestone Jun 23, 2016
@mossroy mossroy self-assigned this Jun 23, 2016
@kelson42
Copy link
Collaborator

That's a really funny one :)

@mossroy mossroy modified the milestones: v2.0, v2.0-beta Jun 26, 2016
mossroy added a commit that referenced this issue Apr 4, 2017
@mossroy
Copy link
Contributor Author

mossroy commented Apr 6, 2017

I thought this one was easy to fix, but it's not.
As it's a minor issue, I'll postpone it for now.
I've commited my work-in-progress on branch fix-international-ucEveryFirstLetter

@mossroy mossroy modified the milestones: v2.1, v2.0 Apr 6, 2017
@mossroy mossroy removed this from the v2.2 milestone Jan 4, 2018
@kelson42
Copy link
Collaborator

@mossroy @Jaifroid I think this would be fixed in an acceptable manner by detecting the spaces instead of detecting the words. You could use \s for example and any other separator you judge appropriate. Would that be good for you?

@Jaifroid
Copy link
Member

@kelson42 Sorry for slow reply. Yes, I'll take a look at that regex. It's also possible that with widespread UTF8 support in JavaScript, the bug has gone away, i.e. the internal definition of a word boundary may no longer have that bug.

@Jaifroid
Copy link
Member

Jaifroid commented Apr 25, 2020

I confirm this bug still exists in Firefox 75.0.

@Jaifroid
Copy link
Member

There is a broader issue with the current regex, which is that \w only matches [a-zA-Z] in JavaScript, i.e. only works with the ASCII alphabet. It's possibly this could be changed with the /u (Unicode) switch, but that is not supported in IE11.

I tested an initial solution to the bug, but still using \w, and the outcome can be seen in Regex Buddy here (note that I'm using a placeholder U to represent uppercase). As can be seen below, neither á in ágata nor any of the first letters of the Greek words are selected for conversion to uppercase.

image

@Jaifroid
Copy link
Member

If instead we select any character (.) after the start of the string or after a space or any number of spaces, as @kelson42 suggested, then we would get what you see below. My only doubts are that this will pick up numbers and punctuation marks with spaces before them. However, I suspect that converting these to uppercase will just return the original value, but haven't tested yet.

image

@kelson42
Copy link
Collaborator

kelson42 commented Apr 25, 2020

@Jaifroid What about something like this? ([\s-–—_:.]+|^). and this can be easily extended with new separator characters.

@kelson42
Copy link
Collaborator

@Jaifroid We can easily put the punctuations marks in this as well.

@Jaifroid
Copy link
Member

Thanks, that's a good idea, just in case someone writes punctuation in their search instead of a space.

@kelson42
Copy link
Collaborator

@Jaifroid Do we need anything else to finally be able to fix that ticket. Would you be able to make a PR?

@Jaifroid
Copy link
Member

@kelson42 No problem I'll make a PR tomorrow morning for testing.

@Jaifroid
Copy link
Member

I've been looking into the punctuation issue, and since we can't use a Unicode property escapes (due to IE11 and Firefox not supporting them), we can instead compile a regex that does the equivalent, as given on this Stackexchange post:

    // any kind of punctuation character (including international e.g. Chinese and Spanish punctuation)
    // author: http://www.regular-expressions.info/unicode.html
    // source: https://github.com/slevithan/xregexp/blob/41f4cd3fc0a8540c3c71969a0f81d1f00e9056a9/src/addons/unicode/unicode-categories.js#L142
    // note: XRegExp unicode output taken from http://jsbin.com/uFiNeDOn/3/edit?js,console (see chrome console.log), then converted back to JS escaped unicode here http://rishida.net/tools/conversion/, then tested on http://regexpal.com/
    // suggested by: https://stackoverflow.com/a/7578937
    // added: extra characters like "$", "\uFFE5" [yen symbol], "^", "+", "=" which are not consider punctuation in the XRegExp regex (they are currency or mathmatical characters)
    // added: \u3000-\u303F Chinese Punctuation for good measure
    var regex_characters_to_remove = /[\$\uFFE5\^\+=`~<>{}\[\]|\u3000-\u303F!-#%-\x2A,-/:;\x3F@\x5B-\x5D_\x7B}\u00A1\u00A7\u00AB\u00B6\u00B7\u00BB\u00BF\u037E\u0387\u055A-\u055F\u0589\u058A\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0609\u060A\u060C\u060D\u061B\u061E\u061F\u066A-\u066D\u06D4\u0700-\u070D\u07F7-\u07F9\u0830-\u083E\u085E\u0964\u0965\u0970\u0AF0\u0DF4\u0E4F\u0E5A\u0E5B\u0F04-\u0F12\u0F14\u0F3A-\u0F3D\u0F85\u0FD0-\u0FD4\u0FD9\u0FDA\u104A-\u104F\u10FB\u1360-\u1368\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DA\u1800-\u180A\u1944\u1945\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B60\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u2010-\u2027\u2030-\u2043\u2045-\u2051\u2053-\u205E\u207D\u207E\u208D\u208E\u2329\u232A\u2768-\u2775\u27C5\u27C6\u27E6-\u27EF\u2983-\u2998\u29D8-\u29DB\u29FC\u29FD\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E3B\u3001-\u3003\u3008-\u3011\u3014-\u301F\u3030\u303D\u30A0\u30FB\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAADE\uAADF\uAAF0\uAAF1\uABEB\uFD3E\uFD3F\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE61\uFE63\uFE68\uFE6A\uFE6B\uFF01-\uFF03\uFF05-\uFF0A\uFF0C-\uFF0F\uFF1A\uFF1B\uFF1F\uFF20\uFF3B-\uFF3D\uFF3F\uFF5B\uFF5D\uFF5F-\uFF65]+/g

The regex here (last line) is a monster, but regexes are lightning fast, and all this does is match users' (slowly) typed input, so it wouldn't have performance implications. Possibly a useful util regex to have anyway. I can well imagine we may have users searching for ¿dónde? (película), which we need to transform to ¿Dónde? (Película), so it's true that we need some measure of punctuation support, but we shouldn't assume Western European languages. I have absolutely no idea about Arabic punctuation, for example, or whether Arabic has uppercase, but the regex above claims to deal with all punctuation in all languages.

Note this regex is taken from the XRegExp library. An alternative would be to add XRegExp as a dependency, which would allow us to use, simply, \p{P} (Unicode property escape for the Punctuation class) in all browsers. However, a whole library seems a bit OTT for this use case.

In any case, I'll experiment in the PR.

@mossroy
Copy link
Contributor Author

mossroy commented Apr 27, 2020

How is this handled in other Kiwix clients? (maybe it's handled in libzim or kiwix-lib)
When this was first implemented in kiwix-js, we tried to stick with this behavior.

If it can help, I tried to use XRegExp in another branch (3 years ago) : see https://github.com/kiwix/kiwix-js/compare/fix-international-ucEveryFirstLetter
I did not have the time to complete that work at that time.

In any case, it would be worth adding some unit tests with the implementation (you can see how I did it in the branch above). It helps understand which cases we try to cover, and it also checks non-regression when we refactor it.

@Jaifroid
Copy link
Member

I can certainly add XRegExp as a dependency, @mossroy, if you think it might be worth it for more than this one use-case. For this case, a dev on Stackexchange has extracted the underlying regex that XRegExp compiles when using \p{P} and has added a few more non-punctuation symbols which seem sensible to me (and they exist in Wikipedia searches). I have used this extracted version in #617 .

The advantage of using our own regex here is speed and ensuring light weight of the app. The downside is maintainability, but it seems unlikely we'll be changing it often (and I can add a note on how to update it to the code).

I'm not against using XRegExp (it's a great library), if you think it is better to add it. (I would prefer it if we could add it simply with npm install xregexp and var XRegExp = require('xregexp');!)

@kelson42
Copy link
Collaborator

How is this handled in other Kiwix clients?

We have a title index handled by Xapian. Xapian manages everything (even the search pattern string parsing AFAIK).

@Jaifroid
Copy link
Member

Jaifroid commented Apr 27, 2020

We have a title index handled by Xapian. Xapian manages everything (even the search pattern string parsing AFAIK).

But I think there is still fallback parsing of the title pointer list, no? I saw a comment about that on the Android app GitHub somewhere, as not all ZIMs have Xapian indices, plus that causes a problem with split ZIMs, since Xapian can't handle them. I may be wrong / out-of-date on that.

@kelson42
Copy link
Collaborator

@Jaifroid Jaifroid linked a pull request May 10, 2020 that will close this issue
Jaifroid added a commit that referenced this issue Jul 11, 2020
Fixes #179 and #184. A complete solution to cancelling some running binary searches after user has initiated a new search will be dealt with in #637.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants