Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better handling of case transformations in search #617

Conversation

Jaifroid
Copy link
Member

This is an initial go at #179. It handles case transformations of the first letter of search terms after a space or any Unicode punctuation character.

Limitations:

  • It does not (yet?) produce an array of search strings with every possible combination of upper- and lower-case first-letter transformations. For example, given the search string estéril (Ciru, the algorithm currently only produces these variants:
    image
    clearly the variant Estéril (ciru is missing, and in this case, then, the search fails to find the result for Sterile Surgery (Estéril (cirugía)) in the Spanish Wikipedia.
  • I think the functions lcFirstLetter() and ucFirstLetter() in util.js may be redundant; I guess they were attempts to produce some measure of mixed-case transformations on the string, but they fail in the case mentioned, and it would be better to use a matrix to transform every word's first letters and recombine the strings.

In sum, I don't think this is ready for merging. It is an improvement in behaviour, I think, given that it works on Unicode strings, but it can be improved.. It would still be worth testing in a few scenarios and languages.

@Jaifroid Jaifroid added bug do-not-merge Sample code labels Apr 26, 2020
@Jaifroid Jaifroid added this to the v2.8 milestone Apr 26, 2020
@Jaifroid Jaifroid self-assigned this Apr 26, 2020
@kelson42
Copy link
Collaborator

@Jaifroid Thank you. To me this is the perfect candidate for a unit test.

@Jaifroid
Copy link
Member Author

The latest commit finds every combination of upper- and lower-case first letters of words in the phrase. It removes the separate functions that did upper- and lower-case transforms in limited cases, and instead splits the string into its component number of "words" (n), ignoring any punctuation at the beginning of a word, and then iterates through 2 ^ n combinations using bitwise & to set the case (upper|lower) of each word's first letter.

So, it can deal with this sample search, for example:

image

The Unit Tests currently contain a set of simple tests transforming a phrase to all upper-case first letters and all lower-case first letters. I will need to write a different test using the new algorithm. It will fail the current tests. I will write the new test if this is approved by @mossroy .

NB I want to test this thoroughly before removing the do-not-merge label. It is not ready for review.

@kelson42
Copy link
Collaborator

@Jaifroid [FYI] Small remark about usual PR workflow. Github proposes a "draft" PR mode which can be chosen at the time your create the PR and since a few weeks even afterward. I recommend to use it in place of the homebrew "do-not-merge" tag. An other traditional/old way to show that the PR is not finished is to put "WIP" (for Work In Progress) at the head of the PR title. Of course, not assigning a reviewer is important too (no assignee for this PR, this is good). All of this is not super important because you mostly work only together with @mossroy, so just FYI. I have put the PR in draft.

@kelson42 kelson42 marked this pull request as draft April 27, 2020 07:09
@Jaifroid Jaifroid removed the do-not-merge Sample code label Apr 27, 2020
@Jaifroid
Copy link
Member Author

@kelson42 Thanks, that's useful to know!

@Jaifroid
Copy link
Member Author

Jaifroid commented Apr 27, 2020

I have added "whole word" case variations, not for mixed case within a word, but for cases such as searching for Config.sys and finding CONFIG.SYS. The downside of this is that it doubles the number of searches to be matched with a concomitant halving of search performance. On a PC this is not particularly noticeable, but it could be quite slow on mobile. It could be made an option in Configuration.

You can see this in action in the search below, and the first twelve combinations the function returns below that. Duplicates are subsequently removed from this array.

image

image

@Jaifroid
Copy link
Member Author

Latest commit preserves the search exactly as typed, and then normalizes it by reducing any spacing to a single space, making entire string lowercase, and then calculating the variation matrix from the normalized string. It enables pseudo case-insensitivity while also searching on the case-sensitive string as typed by the user (in case that is important). See screenshot running in FFOS simulator. Case transformations could be an option in Config, because in some contexts it could be slow. In browser/extension contexts performance is fine.

image

@Jaifroid
Copy link
Member Author

I say "pseudo case insensitivity" because of course it cannot transform évoid into éVoid (an actual search term in full English Wikipedia). The latter can only be found by typing case exactly.

Tested and working on:

  • Firefox 75
  • FFOS simulator
  • Internet Explorer 11 (a little slow on full English Wikipedia)
  • Edge 81 (Chromium)
  • Edge 44 (Legacy)

@mossroy
Copy link
Contributor

mossroy commented Apr 28, 2020

Regarding the use of XRegExp vs copying its giant regexp, I'd be in favor of using their js file (only it's not too complicated). If we need to modify/extend it, it would be worth searching/creating issues on their github. They're probably better than us on dealing with unicode regexps, and maintaining them in the long term.
Based on the README of https://github.com/slevithan/xregexp , requirejs is officially supported. They also suggest to use https://www.npmjs.com/package/xregexp/v/4.3.0 (that has support for requirejs built-in, but also includes many addons that we probably don't need). It might be possible afterwards to remove these addons, or simply add the few header/footer lines add requirejs support to the non-"all" version.

Some unit tests are failing on Travis because they're still using the previous API. Don't forget to fix them (and add some that check the issues we're trying to fix).

Regarding the performance impact, I also agree that it's a concern.
Maybe there could be 2 versions (as you suggested) :

The choice between these versions could be a new settings.
IMHO, ideally, when the setting is not set yet, we would use the smarter version, and would measure the time the first search takes. If it's not fast enough, we would display a warning to the user, suggesting to switch to the simpler version. If it's too complicated, it could be a separate enhancement

Regarding the smarter version, I'm a bit concerned by the 2^n variants. It looks overkill for most cases. Maybe a compromise would be that, for each word, we only test :

  • the originally typed word
  • an uppercase version
  • a lowercase version
  • a version with only the first letter uppercase, and the rest lowercase
    Even multiplied by the number of words, it would make a lot less variants than 2^n (especially if we remove duplicates between these versions, which should be frequent)

On the other hand, all this might be considered too complicated, considered that it would be removed if/when we manage compile kiwix-lib with emscripten. Up to you.

@Jaifroid
Copy link
Member Author

OK, thanks for the feedback @mossroy . I like these ideas. I might put XRegExp integration as a separate issue. Measuring search time sounds interesting, and could also determine the number of search results returned.

I definitely plan to fix the Unit Tests.

@Jaifroid
Copy link
Member Author

Latest commit adds this UI to the Performance Settings panel, and implements the Basic and Full search type. The difference is noticeable in IE11, but not so much in other browsers. This commit does not include any AI to time the search and warn the user. If user hovers over control, further explanation pops up (not sure what happens in mobile yet, only tested in IE11, Firefox 75 and Edge Legacy).

image

@Jaifroid
Copy link
Member Author

Jaifroid commented May 1, 2020

Just a quick sample of the kind of difference between the old search (top) and the new "comprehensive" search (bottom) in English Wikipedia searching for "the past is a". As you can see "The Past Is a Foreign Land" (a distinct article) is missed in the old search (together with some variants for the other two distinct articles).

image

image

@kelson42
Copy link
Collaborator

kelson42 commented May 1, 2020

@Jaifroid ... and I thought first this PR will be quick/short... How wrong I was :)

@Jaifroid
Copy link
Member Author

Jaifroid commented May 2, 2020

Latest commit returns results dynamically to the user, as they are found, and optimizes for returning common search patterns first. Clicking on a found result stops further search. I think this obviates the need for complex automation (timing searches and prompting user to change settings, or auto-changing settings).

Need to review performance on old/slow systems to see whether the new behaviour obviates the need for the "basic" search type. I suspect it will still be useful, as search can be excruciatingly slow in some contexts.

Will now attempt Unit Tests.

One more sample of a search that returns nothing in "basic" search (and in existing app), and 7 results with "full":

image

@mossroy
Copy link
Contributor

mossroy commented May 2, 2020

I did not look at the code, but I like your suggestion to dynamically add results, if it might simplify the code and the UI.
Tell me when I can test a pre-version on my Firefox OS device (which is slow enough to give an idea)

@Jaifroid
Copy link
Member Author

Jaifroid commented May 2, 2020

Code can be tested now. I haven't marked Ready for review because Unit Tests are still failing... Though I'm not sure they're failing on the new tests I've put in, seems to be something else.

@Jaifroid Jaifroid marked this pull request as ready for review May 2, 2020 11:56
@Jaifroid Jaifroid requested a review from mossroy May 2, 2020 11:56
@Jaifroid
Copy link
Member Author

Jaifroid commented May 2, 2020

Is passing Unit Tests now.

@Jaifroid
Copy link
Member Author

Jaifroid commented May 2, 2020

@mossory, progressive search is currently not very granular, because it reports back on each search variant that has found something, and quite a lot of variants don't of course find anything. It is, however, not difficult to report back for each result found. It looks like the demo gif below. I'm holding back on pushing these changes because I wonder if it is too granular. Thoughts?

Progressive Search Demo

@Jaifroid
Copy link
Member Author

Jaifroid commented May 2, 2020

I broke the tests I fixed. 😒
Have run out of time to fix them today I think. I know what I did, and can fix tomorrow... (It's to do with interim reporting of results...)

@Jaifroid
Copy link
Member Author

Jaifroid commented May 3, 2020

After further testing, the granular reporting works well on fast systems, and gives more informative feedback to the user on slower systems. So I've pushed it in a single commit 1e10807 which can easily be reversed if necessary.

@Jaifroid
Copy link
Member Author

Jaifroid commented May 3, 2020

Just for fun... searching with regular expressions! ༼ つ ◕_◕ ༽つ
(But it would be separate issue, because it would require significant refactoring: I just did a quick and very dirty hack to see if it was possible...)

image

@mossroy
Copy link
Contributor

mossroy commented May 3, 2020

I quickly tested on my Firefox OS device.
It works pretty well.
In terms of UX, I find that the progressive display of search results is a huge improvement!
I agree that it makes the new settings unnecessary. We could also keep the hard-coded 25 results.
If I choose a search result while the search is still in progress, I do not notice a slow-down, which is cool.

However, it sometimes creates some "flickering" on my device.
One reason is #184, for which PR #593 had been started (but not finished nor merged). It is a separate issue.
When there are not a lot of results, some lines of result sometimes appear, then disappear, then re-appear. It's harder to reproduce. I initially thought it was another issue, but maybe it was simply a side effect of #184.

Also, when the search string is long, it can take a long time to search all variants. Avoiding the 2^n variants can be a separate improvement.

I'm very pleased to see you fixed the unit tests, and added new ones.
They could be improved by avoiding to hard-code the position in the array, for the expected string. The expected string could be searched at any position in the array, with something like assert.ok(somethingThatReturnsTrueIfOurExpectedStringIsInTheArray(), "The specific variant should be in the results")

Regarding the regexp search, well... it's fun, but does not seem useful for a regular user ;-)

@Jaifroid
Copy link
Member Author

Jaifroid commented Jul 9, 2020

I've added a commit which I consider "merge candidate", which follows the suggestion of comparing with the original prefix to add another cancellation test. Thank you for that. I'll do testing and self review, then will let you know when ready for final check.

Copy link
Member Author

@Jaifroid Jaifroid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few copyediting changes, and one "silly" one (the useless lines that need removing). I have a blindspot about that...

www/js/app.js Outdated Show resolved Hide resolved
www/js/app.js Outdated Show resolved Hide resolved
www/js/app.js Outdated Show resolved Hide resolved
www/js/lib/zimArchive.js Outdated Show resolved Hide resolved
@Jaifroid
Copy link
Member Author

Jaifroid commented Jul 9, 2020

Tests

I'll update this post as these tests are done:

  • Code tests for cancellation (using console.log)
  • Firefox OS search and cancellation test
  • IE11 search and cancellation test
  • Edge Legacy search and cancellation tests jQuery and SW modes
  • Edge Chromium 83.0 search and cancellation tests jQuery and SW modes
  • Firefox 78.0.1 search and cancellation tests jQuery and SW modes

@Jaifroid
Copy link
Member Author

Jaifroid commented Jul 9, 2020

@mossroy we need to take a decision on how to proceed with this PR given that there is more work that could be done to optimize it. Let me explain:

Code tests for cancellation using console log show that despite all the cancellation code, all the Promises have already been set up long before the user actually cancels a search by entering a new one. Because the code runs async, it very quickly sets up Promises for searches on all the variants, before the user starts typing a new search. The Promises carry on searching. We save some CPU by returning early before this line, and we successfully suppress the display of any results returned from the Promises. But some work is still being done, and a number of dirEntries for old searches are returned (and suppressed) even after a new search has started.

This is very similar to #426 (Prevent redundant extraction of assets after navigation), which we haven't yet fixed either. I see three ways to deal with this:

  • Wait for me to find a way to return early from binary search without looking up any dirEntries and using minimal CPU on the Promises already set up -- I can't work on this now till the weekend;
  • Or, decide that releasing with the working code as it is a good enough improvement for now, and add an issue like Prevent redundant extraction of assets after navigation #426 to work on optimizing the cancellation later, maybe together with Prevent redundant extraction of assets after navigation #426;
  • Or do a simple bugfix release without this PR (which I think would be a shame, but it is now urgent to do a bugfix release given users are not able to read latest ZIMs without errors due to MIME type table issue).

To be clear, the code for this PR is working well, searches are cancelled effectively without extraneous results displayed, and the new search starts returning results fast, but CPU is still being consumed on redundant binary searches in Promises that are hard to cancel, for a while after the new search results are coming in. Nevertheless, IMHO the UX is still a big improvement.

Please advise on best course of action @mossroy .

@mossroy
Copy link
Contributor

mossroy commented Jul 9, 2020

I agree with you that this PR is already a big UX improvement.
Let's do things step by step, I vote for your second option : finalize testing and merging this PR, release a 2.8, and create another issue for a possible improvement in the future

@Jaifroid
Copy link
Member Author

Jaifroid commented Jul 9, 2020

In doing IE11 testing I found and fixed a small issue: a user could relaunch the same search while it is already in progress by pressing the search button or pressing enter. I have added 8a7d740 to prevent this if the prefix has not changed and the search status is NOT 'cancelled' or 'complete' (i.e. it is in progress). A user can now press enter, e.g. by mistake or from habit, or click the search icon, without it relaunching or interrupting the search in progress. However, if search is complete and user wants to search the same terms again, then a new search on the same prefix will launch, as this would be a deliberate action by the user.

@Jaifroid
Copy link
Member Author

Jaifroid commented Jul 9, 2020

OK @mossroy I've made all the manual tests indicated in #617 (comment) and have fixed the bug mentioned in previous comment. I've added a separate issue #637 for optimizing cancellation.

If you're happy for me to squash and merge, please approve the review. I'm of course happy to fix anything further as necessary.

@Jaifroid
Copy link
Member Author

I'm trying something out I discovered this morning. Please wait...

@Jaifroid
Copy link
Member Author

OK, sorry for last-minute commits. Having slept on it, a simple way of returning cancelled searches early from binary search came to me, and it works, preventing a lot of redundant CPU use. But it only works if we test on the globalstate.search variable rather than on the local copy of search. But this involves reinstating the globalstate variable in tests, or else they fail.

I'd be happy to revert those two commits and deal with them under #637 if you prefer, but having tested carefully with lots of console log statements, I'm confident they go a long way towards fixing #637 (but I'll keep it open for more testing subsequently).

@Jaifroid
Copy link
Member Author

This is just for information (and to document the issue for #637):

Yesterday I thought I was going mad... But this Stackexchange post explains the weirdness we're seeing here:

JavaScript Pointer Reference craziness

So it seems that in fact the global search object and the search passed to the functions point to the same underlying object until the global search object is reassigned to a different object. At that point, the search object passed to the functions continue to point to the old object, and the global search object points to a new object.

It explains why I was seeing the passed searches linked to the global search, but then suddenly they weren't. We can actually exploit this, but I propose to do it for #637 and #426 at a later date, not to delay this PR anymore (which is working well now).

Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all this work, and for the explanation on reference tricks.

We're almost there, but the globalstate variable can not be used in the ZIM backend.
It should be declared in app.js (not in init.js), and should not be declared in unit tests.

If it prevents an additional improvement, never mind : we'll take time to implement it properly afterwards, and I prefer to revert it for now. Sorry

www/index.html Outdated Show resolved Hide resolved
www/js/app.js Outdated Show resolved Hide resolved
});
that.findDirEntriesWithPrefixCaseSensitive(prefix, resultSize - dirEntries.length, localPrefix,
function (newDirEntries, interim) {
if (globalstate.search.status === 'cancelled' || globalstate.search.prefix !== localPrefix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using globalstate variable here breaks the software layers we have.
Using it in app.js is no problem (if it's where the variable is declared), but it should not be used in other files (except if passed as a parameter, of course).

The ZIM backend should not depend on external global variables. A drawback can be seen in the unit tests : you should be able to test each call independently, without being forced to declare the global variable (which might have side effects if the unit tests are ran in parallel, for example)

@Jaifroid
Copy link
Member Author

OK, @mossroy I understand, and hadn't realized that I was breaking the architecture by introducing a global variable into the backend. I think this is completely solvable, but isn't a "quick fix". Therefore, can I just check your timeline? If we are going to try to release this weekend, I'll revert the last two commits, and then work properly on this in #637. If you don't aim to release till next week, I could try to work on it more here. But I don't want to rush things, and I'm quite anxious (as you may have noticed) to get a release out.

@mossroy
Copy link
Contributor

mossroy commented Jul 11, 2020

I see you just reverted your last 2 commits. I can review/test a bit more this version, right now.

@Jaifroid
Copy link
Member Author

I'm just adding the other two things you mentioned above (change description and regex). Then I'll be done.

@Jaifroid
Copy link
Member Author

OK, I've reverted and have done the two minor points in last review. Let me know if there's anything else for a "quick fix".

Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested quickly on my Firefox OS device, and on with Chromium and Firefox on desktop (jQuery and SW modes, including within browser extensions)
Search is working much smoother than before, and I did not see any regression

*
* @type Object
*/
var globalstate = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable can probably be moved to app.js, but it's minor and can be done later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mossroy . It would feel safer to me to move this as part of #637 when I'll have time to test a unified solution to cancelling running binary searches (and possibly also cancelling fetching of assets on pages we have navigated away from). So I'll move a summary of issues to be dealt with to #637 after squashing and merging this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's perfect

@mossroy
Copy link
Contributor

mossroy commented Jul 11, 2020

You can merge and I might work on a 2.8 release in the next hour

@Jaifroid Jaifroid merged commit 823d183 into master Jul 11, 2020
@kelson42 kelson42 deleted the Better-handling-of-uppercase-and-lowercase-search-transformations branch September 19, 2020 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment