Better handling of case transformations in search #617

Jaifroid · 2020-04-26T10:36:20Z

This is an initial go at #179. It handles case transformations of the first letter of search terms after a space or any Unicode punctuation character.

Limitations:

It does not (yet?) produce an array of search strings with every possible combination of upper- and lower-case first-letter transformations. For example, given the search string estéril (Ciru, the algorithm currently only produces these variants:

clearly the variant Estéril (ciru is missing, and in this case, then, the search fails to find the result for Sterile Surgery (Estéril (cirugía)) in the Spanish Wikipedia.
I think the functions lcFirstLetter() and ucFirstLetter() in util.js may be redundant; I guess they were attempts to produce some measure of mixed-case transformations on the string, but they fail in the case mentioned, and it would be better to use a matrix to transform every word's first letters and recombine the strings.

In sum, I don't think this is ready for merging. It is an improvement in behaviour, I think, given that it works on Unicode strings, but it can be improved.. It would still be worth testing in a few scenarios and languages.

kelson42 · 2020-04-26T10:38:15Z

@Jaifroid Thank you. To me this is the perfect candidate for a unit test.

Jaifroid · 2020-04-26T21:33:44Z

The latest commit finds every combination of upper- and lower-case first letters of words in the phrase. It removes the separate functions that did upper- and lower-case transforms in limited cases, and instead splits the string into its component number of "words" (n), ignoring any punctuation at the beginning of a word, and then iterates through 2 ^ n combinations using bitwise & to set the case (upper|lower) of each word's first letter.

So, it can deal with this sample search, for example:

The Unit Tests currently contain a set of simple tests transforming a phrase to all upper-case first letters and all lower-case first letters. I will need to write a different test using the new algorithm. It will fail the current tests. I will write the new test if this is approved by @mossroy .

NB I want to test this thoroughly before removing the do-not-merge label. It is not ready for review.

kelson42 · 2020-04-27T07:09:05Z

@Jaifroid [FYI] Small remark about usual PR workflow. Github proposes a "draft" PR mode which can be chosen at the time your create the PR and since a few weeks even afterward. I recommend to use it in place of the homebrew "do-not-merge" tag. An other traditional/old way to show that the PR is not finished is to put "WIP" (for Work In Progress) at the head of the PR title. Of course, not assigning a reviewer is important too (no assignee for this PR, this is good). All of this is not super important because you mostly work only together with @mossroy, so just FYI. I have put the PR in draft.

Jaifroid · 2020-04-27T07:16:50Z

@kelson42 Thanks, that's useful to know!

Jaifroid · 2020-04-27T08:44:26Z

I have added "whole word" case variations, not for mixed case within a word, but for cases such as searching for Config.sys and finding CONFIG.SYS. The downside of this is that it doubles the number of searches to be matched with a concomitant halving of search performance. On a PC this is not particularly noticeable, but it could be quite slow on mobile. It could be made an option in Configuration.

You can see this in action in the search below, and the first twelve combinations the function returns below that. Duplicates are subsequently removed from this array.

Jaifroid · 2020-04-27T20:42:38Z

Latest commit preserves the search exactly as typed, and then normalizes it by reducing any spacing to a single space, making entire string lowercase, and then calculating the variation matrix from the normalized string. It enables pseudo case-insensitivity while also searching on the case-sensitive string as typed by the user (in case that is important). See screenshot running in FFOS simulator. Case transformations could be an option in Config, because in some contexts it could be slow. In browser/extension contexts performance is fine.

Jaifroid · 2020-04-27T20:57:11Z

I say "pseudo case insensitivity" because of course it cannot transform évoid into éVoid (an actual search term in full English Wikipedia). The latter can only be found by typing case exactly.

Tested and working on:

Firefox 75
FFOS simulator
Internet Explorer 11 (a little slow on full English Wikipedia)
Edge 81 (Chromium)
Edge 44 (Legacy)

mossroy · 2020-04-28T21:10:50Z

Regarding the use of XRegExp vs copying its giant regexp, I'd be in favor of using their js file (only it's not too complicated). If we need to modify/extend it, it would be worth searching/creating issues on their github. They're probably better than us on dealing with unicode regexps, and maintaining them in the long term.
Based on the README of https://github.com/slevithan/xregexp , requirejs is officially supported. They also suggest to use https://www.npmjs.com/package/xregexp/v/4.3.0 (that has support for requirejs built-in, but also includes many addons that we probably don't need). It might be possible afterwards to remove these addons, or simply add the few header/footer lines add requirejs support to the non-"all" version.

Some unit tests are failing on Travis because they're still using the previous API. Don't forget to fix them (and add some that check the issues we're trying to fix).

Regarding the performance impact, I also agree that it's a concern.
Maybe there could be 2 versions (as you suggested) :

a simple and fast version, that only tests the same variants as the kiwix-lib fallback (https://github.com/kiwix/kiwix-lib/blob/8f07689c57356264c1ef145ba82c1d60a97b786e/src/reader.cpp#L788) : original title, ucFirst, lcFirst and toTitle (that seems to correspond to our previous ucEveryFirstLetter function). This version should not introduce any noticeable performance regression compared to version 2.7.0.
a smarter (and more complete) version, that could use more variants on faster devices

The choice between these versions could be a new settings.
IMHO, ideally, when the setting is not set yet, we would use the smarter version, and would measure the time the first search takes. If it's not fast enough, we would display a warning to the user, suggesting to switch to the simpler version. If it's too complicated, it could be a separate enhancement

Regarding the smarter version, I'm a bit concerned by the 2^n variants. It looks overkill for most cases. Maybe a compromise would be that, for each word, we only test :

the originally typed word
an uppercase version
a lowercase version
a version with only the first letter uppercase, and the rest lowercase
Even multiplied by the number of words, it would make a lot less variants than 2^n (especially if we remove duplicates between these versions, which should be frequent)

On the other hand, all this might be considered too complicated, considered that it would be removed if/when we manage compile kiwix-lib with emscripten. Up to you.

Jaifroid · 2020-04-28T21:22:43Z

OK, thanks for the feedback @mossroy . I like these ideas. I might put XRegExp integration as a separate issue. Measuring search time sounds interesting, and could also determine the number of search results returned.

I definitely plan to fix the Unit Tests.

Jaifroid · 2020-04-30T12:44:59Z

Latest commit adds this UI to the Performance Settings panel, and implements the Basic and Full search type. The difference is noticeable in IE11, but not so much in other browsers. This commit does not include any AI to time the search and warn the user. If user hovers over control, further explanation pops up (not sure what happens in mobile yet, only tested in IE11, Firefox 75 and Edge Legacy).

Jaifroid · 2020-05-01T12:38:15Z

Just a quick sample of the kind of difference between the old search (top) and the new "comprehensive" search (bottom) in English Wikipedia searching for "the past is a". As you can see "The Past Is a Foreign Land" (a distinct article) is missed in the old search (together with some variants for the other two distinct articles).

kelson42 · 2020-05-01T13:58:58Z

@Jaifroid ... and I thought first this PR will be quick/short... How wrong I was :)

Jaifroid · 2020-05-02T08:56:12Z

Latest commit returns results dynamically to the user, as they are found, and optimizes for returning common search patterns first. Clicking on a found result stops further search. I think this obviates the need for complex automation (timing searches and prompting user to change settings, or auto-changing settings).

Need to review performance on old/slow systems to see whether the new behaviour obviates the need for the "basic" search type. I suspect it will still be useful, as search can be excruciatingly slow in some contexts.

Will now attempt Unit Tests.

One more sample of a search that returns nothing in "basic" search (and in existing app), and 7 results with "full":

mossroy · 2020-05-02T10:07:25Z

I did not look at the code, but I like your suggestion to dynamically add results, if it might simplify the code and the UI.
Tell me when I can test a pre-version on my Firefox OS device (which is slow enough to give an idea)

Jaifroid · 2020-05-02T10:10:58Z

Code can be tested now. I haven't marked Ready for review because Unit Tests are still failing... Though I'm not sure they're failing on the new tests I've put in, seems to be something else.

Jaifroid · 2020-05-02T11:56:46Z

Is passing Unit Tests now.

Jaifroid · 2020-05-02T14:02:52Z

@mossory, progressive search is currently not very granular, because it reports back on each search variant that has found something, and quite a lot of variants don't of course find anything. It is, however, not difficult to report back for each result found. It looks like the demo gif below. I'm holding back on pushing these changes because I wonder if it is too granular. Thoughts?

Jaifroid · 2020-05-02T14:48:54Z

I broke the tests I fixed. 😒
Have run out of time to fix them today I think. I know what I did, and can fix tomorrow... (It's to do with interim reporting of results...)

Jaifroid · 2020-05-03T08:26:35Z

After further testing, the granular reporting works well on fast systems, and gives more informative feedback to the user on slower systems. So I've pushed it in a single commit 1e10807 which can easily be reversed if necessary.

Jaifroid · 2020-05-03T16:21:42Z

Just for fun... searching with regular expressions! ༼ つ ◕_◕ ༽つ
(But it would be separate issue, because it would require significant refactoring: I just did a quick and very dirty hack to see if it was possible...)

mossroy · 2020-05-03T20:12:54Z

I quickly tested on my Firefox OS device.
It works pretty well.
In terms of UX, I find that the progressive display of search results is a huge improvement!
I agree that it makes the new settings unnecessary. We could also keep the hard-coded 25 results.
If I choose a search result while the search is still in progress, I do not notice a slow-down, which is cool.

However, it sometimes creates some "flickering" on my device.
One reason is #184, for which PR #593 had been started (but not finished nor merged). It is a separate issue.
When there are not a lot of results, some lines of result sometimes appear, then disappear, then re-appear. It's harder to reproduce. I initially thought it was another issue, but maybe it was simply a side effect of #184.

Also, when the search string is long, it can take a long time to search all variants. Avoiding the 2^n variants can be a separate improvement.

I'm very pleased to see you fixed the unit tests, and added new ones.
They could be improved by avoiding to hard-code the position in the array, for the expected string. The expected string could be searched at any position in the array, with something like assert.ok(somethingThatReturnsTrueIfOurExpectedStringIsInTheArray(), "The specific variant should be in the results")

Regarding the regexp search, well... it's fun, but does not seem useful for a regular user ;-)

Jaifroid · 2020-07-09T08:21:16Z

I've added a commit which I consider "merge candidate", which follows the suggestion of comparing with the original prefix to add another cancellation test. Thank you for that. I'll do testing and self review, then will let you know when ready for final check.

Jaifroid

I've added a few copyediting changes, and one "silly" one (the useless lines that need removing). I have a blindspot about that...

www/js/app.js

www/js/lib/zimArchive.js

Jaifroid · 2020-07-09T11:24:51Z

Tests

I'll update this post as these tests are done:

Code tests for cancellation (using console.log)
Firefox OS search and cancellation test
IE11 search and cancellation test
Edge Legacy search and cancellation tests jQuery and SW modes
Edge Chromium 83.0 search and cancellation tests jQuery and SW modes
Firefox 78.0.1 search and cancellation tests jQuery and SW modes

Jaifroid · 2020-07-09T12:59:05Z

@mossroy we need to take a decision on how to proceed with this PR given that there is more work that could be done to optimize it. Let me explain:

Code tests for cancellation using console log show that despite all the cancellation code, all the Promises have already been set up long before the user actually cancels a search by entering a new one. Because the code runs async, it very quickly sets up Promises for searches on all the variants, before the user starts typing a new search. The Promises carry on searching. We save some CPU by returning early before this line, and we successfully suppress the display of any results returned from the Promises. But some work is still being done, and a number of dirEntries for old searches are returned (and suppressed) even after a new search has started.

This is very similar to #426 (Prevent redundant extraction of assets after navigation), which we haven't yet fixed either. I see three ways to deal with this:

Wait for me to find a way to return early from binary search without looking up any dirEntries and using minimal CPU on the Promises already set up -- I can't work on this now till the weekend;
Or, decide that releasing with the working code as it is a good enough improvement for now, and add an issue like Prevent redundant extraction of assets after navigation #426 to work on optimizing the cancellation later, maybe together with Prevent redundant extraction of assets after navigation #426;
Or do a simple bugfix release without this PR (which I think would be a shame, but it is now urgent to do a bugfix release given users are not able to read latest ZIMs without errors due to MIME type table issue).

To be clear, the code for this PR is working well, searches are cancelled effectively without extraneous results displayed, and the new search starts returning results fast, but CPU is still being consumed on redundant binary searches in Promises that are hard to cancel, for a while after the new search results are coming in. Nevertheless, IMHO the UX is still a big improvement.

Please advise on best course of action @mossroy .

mossroy · 2020-07-09T18:25:16Z

I agree with you that this PR is already a big UX improvement.
Let's do things step by step, I vote for your second option : finalize testing and merging this PR, release a 2.8, and create another issue for a possible improvement in the future

Jaifroid · 2020-07-09T20:33:26Z

In doing IE11 testing I found and fixed a small issue: a user could relaunch the same search while it is already in progress by pressing the search button or pressing enter. I have added 8a7d740 to prevent this if the prefix has not changed and the search status is NOT 'cancelled' or 'complete' (i.e. it is in progress). A user can now press enter, e.g. by mistake or from habit, or click the search icon, without it relaunching or interrupting the search in progress. However, if search is complete and user wants to search the same terms again, then a new search on the same prefix will launch, as this would be a deliberate action by the user.

Jaifroid · 2020-07-09T21:12:31Z

OK @mossroy I've made all the manual tests indicated in #617 (comment) and have fixed the bug mentioned in previous comment. I've added a separate issue #637 for optimizing cancellation.

If you're happy for me to squash and merge, please approve the review. I'm of course happy to fix anything further as necessary.

Jaifroid · 2020-07-10T07:21:47Z

I'm trying something out I discovered this morning. Please wait...

Jaifroid · 2020-07-10T07:39:12Z

OK, sorry for last-minute commits. Having slept on it, a simple way of returning cancelled searches early from binary search came to me, and it works, preventing a lot of redundant CPU use. But it only works if we test on the globalstate.search variable rather than on the local copy of search. But this involves reinstating the globalstate variable in tests, or else they fail.

I'd be happy to revert those two commits and deal with them under #637 if you prefer, but having tested carefully with lots of console log statements, I'm confident they go a long way towards fixing #637 (but I'll keep it open for more testing subsequently).

Jaifroid · 2020-07-10T12:12:08Z

This is just for information (and to document the issue for #637):

Yesterday I thought I was going mad... But this Stackexchange post explains the weirdness we're seeing here:

JavaScript Pointer Reference craziness

So it seems that in fact the global search object and the search passed to the functions point to the same underlying object until the global search object is reassigned to a different object. At that point, the search object passed to the functions continue to point to the old object, and the global search object points to a new object.

It explains why I was seeing the passed searches linked to the global search, but then suddenly they weren't. We can actually exploit this, but I propose to do it for #637 and #426 at a later date, not to delay this PR anymore (which is working well now).

mossroy

Thanks for all this work, and for the explanation on reference tricks.

We're almost there, but the globalstate variable can not be used in the ZIM backend.
It should be declared in app.js (not in init.js), and should not be declared in unit tests.

If it prevents an additional improvement, never mind : we'll take time to implement it properly afterwards, and I prefer to revert it for now. Sorry

www/index.html

www/js/app.js

mossroy · 2020-07-11T10:21:56Z

www/js/lib/zimArchive.js

-            });
+            that.findDirEntriesWithPrefixCaseSensitive(prefix, resultSize - dirEntries.length, localPrefix, 
+                function (newDirEntries, interim) {
+                    if (globalstate.search.status === 'cancelled' || globalstate.search.prefix !== localPrefix) 


Using globalstate variable here breaks the software layers we have.
Using it in app.js is no problem (if it's where the variable is declared), but it should not be used in other files (except if passed as a parameter, of course).

The ZIM backend should not depend on external global variables. A drawback can be seen in the unit tests : you should be able to test each call independently, without being forced to declare the global variable (which might have side effects if the unit tests are ran in parallel, for example)

Jaifroid · 2020-07-11T11:40:01Z

OK, @mossroy I understand, and hadn't realized that I was breaking the architecture by introducing a global variable into the backend. I think this is completely solvable, but isn't a "quick fix". Therefore, can I just check your timeline? If we are going to try to release this weekend, I'll revert the last two commits, and then work properly on this in #637. If you don't aim to release till next week, I could try to work on it more here. But I don't want to rush things, and I'm quite anxious (as you may have noticed) to get a release out.

This reverts commit 115569b.

This reverts commit 01a75d2.

mossroy · 2020-07-11T12:07:06Z

I see you just reverted your last 2 commits. I can review/test a bit more this version, right now.

Jaifroid · 2020-07-11T12:08:36Z

I'm just adding the other two things you mentioned above (change description and regex). Then I'll be done.

Jaifroid · 2020-07-11T12:14:23Z

OK, I've reverted and have done the two minor points in last review. Let me know if there's anything else for a "quick fix".

mossroy

Tested quickly on my Firefox OS device, and on with Chromium and Firefox on desktop (jQuery and SW modes, including within browser extensions)
Search is working much smoother than before, and I did not see any regression

mossroy · 2020-07-11T12:38:49Z

www/js/init.js

+ * 
+ * @type Object
+ */
+var globalstate = {};


This variable can probably be moved to app.js, but it's minor and can be done later

Thanks @mossroy . It would feel safer to me to move this as part of #637 when I'll have time to test a unified solution to cancelling running binary searches (and possibly also cancelling fetching of assets on pages we have navigated away from). So I'll move a summary of issues to be dealt with to #637 after squashing and merging this PR.

Sure, that's perfect

mossroy · 2020-07-11T13:02:42Z

You can merge and I might work on a 2.8 release in the next hour

Jaifroid added bug do-not-merge Sample code labels Apr 26, 2020

Jaifroid added this to the v2.8 milestone Apr 26, 2020

Jaifroid self-assigned this Apr 26, 2020

kelson42 marked this pull request as draft April 27, 2020 07:09

Jaifroid removed the do-not-merge Sample code label Apr 27, 2020

Jaifroid mentioned this pull request Apr 27, 2020

Minor bug on upper/lowercase variants in title search : the util.ucEveryFirstLetter function does not work properly when accents or other non-ASCII characters are used #179

Closed

Jaifroid marked this pull request as ready for review May 2, 2020 11:56

Jaifroid requested a review from mossroy May 2, 2020 11:56

Jaifroid added 2 commits July 8, 2020 21:42

Remove dummy search object

fa177a6

Keep and test original search prefix for more efficient cancelling

c7a82fb

Jaifroid commented Jul 9, 2020

View reviewed changes

www/js/app.js Outdated Show resolved Hide resolved

www/js/app.js Outdated Show resolved Hide resolved

www/js/app.js Outdated Show resolved Hide resolved

www/js/lib/zimArchive.js Outdated Show resolved Hide resolved

Minor updates from self-review

d056501

Jaifroid mentioned this pull request Jul 9, 2020

Optimize title search so that cancelled binary search Promises return at the earliest possible opportunity #637

Closed

Prevent relaunching the same search if it is in progress

8a7d740

Add early return and check globalstate.search instead of local

01a75d2

Fix tests by reinstating globalstate variable

115569b

mossroy suggested changes Jul 11, 2020

View reviewed changes

Jaifroid added 2 commits July 11, 2020 13:01

Revert "Fix tests by reinstating globalstate variable"

8ff4bd3

This reverts commit 115569b.

Revert "Add early return and check globalstate.search instead of local"

46e5ac3

This reverts commit 01a75d2.

Fix description of slider and regex test

5b36e4d

mossroy approved these changes Jul 11, 2020

View reviewed changes

Jaifroid merged commit 823d183 into master Jul 11, 2020

Jaifroid mentioned this pull request Sep 8, 2020

Enable quick and complete cancellation of running searches #642

Merged

kelson42 deleted the Better-handling-of-uppercase-and-lowercase-search-transformations branch September 19, 2020 12:58

Better handling of case transformations in search #617

Better handling of case transformations in search #617

Conversation

Jaifroid commented Apr 26, 2020

kelson42 commented Apr 26, 2020

Jaifroid commented Apr 26, 2020

kelson42 commented Apr 27, 2020

Jaifroid commented Apr 27, 2020

Jaifroid commented Apr 27, 2020 • edited Loading

Jaifroid commented Apr 27, 2020

Jaifroid commented Apr 27, 2020

mossroy commented Apr 28, 2020

Jaifroid commented Apr 28, 2020

Jaifroid commented Apr 30, 2020

Jaifroid commented May 1, 2020

kelson42 commented May 1, 2020

Jaifroid commented May 2, 2020 • edited Loading

mossroy commented May 2, 2020

Jaifroid commented May 2, 2020

Jaifroid commented May 2, 2020

Jaifroid commented May 2, 2020 • edited Loading

Jaifroid commented May 2, 2020

Jaifroid commented May 3, 2020

Jaifroid commented May 3, 2020

mossroy commented May 3, 2020

Jaifroid commented Jul 9, 2020

Jaifroid left a comment

Choose a reason for hiding this comment

Jaifroid commented Jul 9, 2020 • edited Loading

Tests

Jaifroid commented Jul 9, 2020 • edited Loading

mossroy commented Jul 9, 2020

Jaifroid commented Jul 9, 2020

Jaifroid commented Jul 9, 2020

Jaifroid commented Jul 10, 2020

Jaifroid commented Jul 10, 2020

Jaifroid commented Jul 10, 2020

mossroy left a comment

Choose a reason for hiding this comment

mossroy Jul 11, 2020

Choose a reason for hiding this comment

Jaifroid commented Jul 11, 2020

mossroy commented Jul 11, 2020

Jaifroid commented Jul 11, 2020

Jaifroid commented Jul 11, 2020

mossroy left a comment

Choose a reason for hiding this comment

mossroy Jul 11, 2020

Choose a reason for hiding this comment

Jaifroid Jul 11, 2020

Choose a reason for hiding this comment

mossroy Jul 11, 2020

Choose a reason for hiding this comment

mossroy commented Jul 11, 2020

Jaifroid commented Apr 27, 2020 •

edited

Loading

Jaifroid commented May 2, 2020 •

edited

Loading

Jaifroid commented May 2, 2020 •

edited

Loading

Jaifroid commented Jul 9, 2020 •

edited

Loading

Jaifroid commented Jul 9, 2020 •

edited

Loading