-
-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better handling of case transformations in search #617
Better handling of case transformations in search #617
Conversation
@Jaifroid Thank you. To me this is the perfect candidate for a unit test. |
The latest commit finds every combination of upper- and lower-case first letters of words in the phrase. It removes the separate functions that did upper- and lower-case transforms in limited cases, and instead splits the string into its component number of "words" ( So, it can deal with this sample search, for example: The Unit Tests currently contain a set of simple tests transforming a phrase to all upper-case first letters and all lower-case first letters. I will need to write a different test using the new algorithm. It will fail the current tests. I will write the new test if this is approved by @mossroy . NB I want to test this thoroughly before removing the do-not-merge label. It is not ready for review. |
@Jaifroid [FYI] Small remark about usual PR workflow. Github proposes a "draft" PR mode which can be chosen at the time your create the PR and since a few weeks even afterward. I recommend to use it in place of the homebrew "do-not-merge" tag. An other traditional/old way to show that the PR is not finished is to put "WIP" (for Work In Progress) at the head of the PR title. Of course, not assigning a reviewer is important too (no assignee for this PR, this is good). All of this is not super important because you mostly work only together with @mossroy, so just FYI. I have put the PR in draft. |
@kelson42 Thanks, that's useful to know! |
I have added "whole word" case variations, not for mixed case within a word, but for cases such as searching for You can see this in action in the search below, and the first twelve combinations the function returns below that. Duplicates are subsequently removed from this array. |
Latest commit preserves the search exactly as typed, and then normalizes it by reducing any spacing to a single space, making entire string lowercase, and then calculating the variation matrix from the normalized string. It enables pseudo case-insensitivity while also searching on the case-sensitive string as typed by the user (in case that is important). See screenshot running in FFOS simulator. Case transformations could be an option in Config, because in some contexts it could be slow. In browser/extension contexts performance is fine. |
I say "pseudo case insensitivity" because of course it cannot transform Tested and working on:
|
Regarding the use of XRegExp vs copying its giant regexp, I'd be in favor of using their js file (only it's not too complicated). If we need to modify/extend it, it would be worth searching/creating issues on their github. They're probably better than us on dealing with unicode regexps, and maintaining them in the long term. Some unit tests are failing on Travis because they're still using the previous API. Don't forget to fix them (and add some that check the issues we're trying to fix). Regarding the performance impact, I also agree that it's a concern.
The choice between these versions could be a new settings. Regarding the smarter version, I'm a bit concerned by the 2^n variants. It looks overkill for most cases. Maybe a compromise would be that, for each word, we only test :
On the other hand, all this might be considered too complicated, considered that it would be removed if/when we manage compile kiwix-lib with emscripten. Up to you. |
OK, thanks for the feedback @mossroy . I like these ideas. I might put XRegExp integration as a separate issue. Measuring search time sounds interesting, and could also determine the number of search results returned. I definitely plan to fix the Unit Tests. |
Latest commit adds this UI to the Performance Settings panel, and implements the Basic and Full search type. The difference is noticeable in IE11, but not so much in other browsers. This commit does not include any AI to time the search and warn the user. If user hovers over control, further explanation pops up (not sure what happens in mobile yet, only tested in IE11, Firefox 75 and Edge Legacy). |
Just a quick sample of the kind of difference between the old search (top) and the new "comprehensive" search (bottom) in English Wikipedia searching for "the past is a". As you can see "The Past Is a Foreign Land" (a distinct article) is missed in the old search (together with some variants for the other two distinct articles). |
@Jaifroid ... and I thought first this PR will be quick/short... How wrong I was :) |
Latest commit returns results dynamically to the user, as they are found, and optimizes for returning common search patterns first. Clicking on a found result stops further search. I think this obviates the need for complex automation (timing searches and prompting user to change settings, or auto-changing settings). Need to review performance on old/slow systems to see whether the new behaviour obviates the need for the "basic" search type. I suspect it will still be useful, as search can be excruciatingly slow in some contexts. Will now attempt Unit Tests. One more sample of a search that returns nothing in "basic" search (and in existing app), and 7 results with "full": |
I did not look at the code, but I like your suggestion to dynamically add results, if it might simplify the code and the UI. |
Code can be tested now. I haven't marked Ready for review because Unit Tests are still failing... Though I'm not sure they're failing on the new tests I've put in, seems to be something else. |
Is passing Unit Tests now. |
@mossory, progressive search is currently not very granular, because it reports back on each search variant that has found something, and quite a lot of variants don't of course find anything. It is, however, not difficult to report back for each result found. It looks like the demo gif below. I'm holding back on pushing these changes because I wonder if it is too granular. Thoughts? |
I broke the tests I fixed. 😒 |
After further testing, the granular reporting works well on fast systems, and gives more informative feedback to the user on slower systems. So I've pushed it in a single commit 1e10807 which can easily be reversed if necessary. |
I quickly tested on my Firefox OS device. However, it sometimes creates some "flickering" on my device. Also, when the search string is long, it can take a long time to search all variants. Avoiding the 2^n variants can be a separate improvement. I'm very pleased to see you fixed the unit tests, and added new ones. Regarding the regexp search, well... it's fun, but does not seem useful for a regular user ;-) |
I've added a commit which I consider "merge candidate", which follows the suggestion of comparing with the original prefix to add another cancellation test. Thank you for that. I'll do testing and self review, then will let you know when ready for final check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a few copyediting changes, and one "silly" one (the useless lines that need removing). I have a blindspot about that...
TestsI'll update this post as these tests are done:
|
@mossroy we need to take a decision on how to proceed with this PR given that there is more work that could be done to optimize it. Let me explain: Code tests for cancellation using console log show that despite all the cancellation code, all the Promises have already been set up long before the user actually cancels a search by entering a new one. Because the code runs async, it very quickly sets up Promises for searches on all the variants, before the user starts typing a new search. The Promises carry on searching. We save some CPU by returning early before this line, and we successfully suppress the display of any results returned from the Promises. But some work is still being done, and a number of dirEntries for old searches are returned (and suppressed) even after a new search has started. This is very similar to #426 (Prevent redundant extraction of assets after navigation), which we haven't yet fixed either. I see three ways to deal with this:
To be clear, the code for this PR is working well, searches are cancelled effectively without extraneous results displayed, and the new search starts returning results fast, but CPU is still being consumed on redundant binary searches in Promises that are hard to cancel, for a while after the new search results are coming in. Nevertheless, IMHO the UX is still a big improvement. Please advise on best course of action @mossroy . |
I agree with you that this PR is already a big UX improvement. |
In doing IE11 testing I found and fixed a small issue: a user could relaunch the same search while it is already in progress by pressing the search button or pressing enter. I have added 8a7d740 to prevent this if the prefix has not changed and the search status is NOT 'cancelled' or 'complete' (i.e. it is in progress). A user can now press enter, e.g. by mistake or from habit, or click the search icon, without it relaunching or interrupting the search in progress. However, if search is complete and user wants to search the same terms again, then a new search on the same prefix will launch, as this would be a deliberate action by the user. |
OK @mossroy I've made all the manual tests indicated in #617 (comment) and have fixed the bug mentioned in previous comment. I've added a separate issue #637 for optimizing cancellation. If you're happy for me to squash and merge, please approve the review. I'm of course happy to fix anything further as necessary. |
I'm trying something out I discovered this morning. Please wait... |
OK, sorry for last-minute commits. Having slept on it, a simple way of returning cancelled searches early from binary search came to me, and it works, preventing a lot of redundant CPU use. But it only works if we test on the globalstate.search variable rather than on the local copy of search. But this involves reinstating the globalstate variable in tests, or else they fail. I'd be happy to revert those two commits and deal with them under #637 if you prefer, but having tested carefully with lots of console log statements, I'm confident they go a long way towards fixing #637 (but I'll keep it open for more testing subsequently). |
This is just for information (and to document the issue for #637): Yesterday I thought I was going mad... But this Stackexchange post explains the weirdness we're seeing here: JavaScript Pointer Reference craziness So it seems that in fact the global search object and the search passed to the functions point to the same underlying object until the global search object is reassigned to a different object. At that point, the search object passed to the functions continue to point to the old object, and the global search object points to a new object. It explains why I was seeing the passed searches linked to the global search, but then suddenly they weren't. We can actually exploit this, but I propose to do it for #637 and #426 at a later date, not to delay this PR anymore (which is working well now). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all this work, and for the explanation on reference tricks.
We're almost there, but the globalstate variable can not be used in the ZIM backend.
It should be declared in app.js (not in init.js), and should not be declared in unit tests.
If it prevents an additional improvement, never mind : we'll take time to implement it properly afterwards, and I prefer to revert it for now. Sorry
www/js/lib/zimArchive.js
Outdated
}); | ||
that.findDirEntriesWithPrefixCaseSensitive(prefix, resultSize - dirEntries.length, localPrefix, | ||
function (newDirEntries, interim) { | ||
if (globalstate.search.status === 'cancelled' || globalstate.search.prefix !== localPrefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using globalstate variable here breaks the software layers we have.
Using it in app.js is no problem (if it's where the variable is declared), but it should not be used in other files (except if passed as a parameter, of course).
The ZIM backend should not depend on external global variables. A drawback can be seen in the unit tests : you should be able to test each call independently, without being forced to declare the global variable (which might have side effects if the unit tests are ran in parallel, for example)
OK, @mossroy I understand, and hadn't realized that I was breaking the architecture by introducing a global variable into the backend. I think this is completely solvable, but isn't a "quick fix". Therefore, can I just check your timeline? If we are going to try to release this weekend, I'll revert the last two commits, and then work properly on this in #637. If you don't aim to release till next week, I could try to work on it more here. But I don't want to rush things, and I'm quite anxious (as you may have noticed) to get a release out. |
I see you just reverted your last 2 commits. I can review/test a bit more this version, right now. |
I'm just adding the other two things you mentioned above (change description and regex). Then I'll be done. |
OK, I've reverted and have done the two minor points in last review. Let me know if there's anything else for a "quick fix". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested quickly on my Firefox OS device, and on with Chromium and Firefox on desktop (jQuery and SW modes, including within browser extensions)
Search is working much smoother than before, and I did not see any regression
* | ||
* @type Object | ||
*/ | ||
var globalstate = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable can probably be moved to app.js, but it's minor and can be done later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mossroy . It would feel safer to me to move this as part of #637 when I'll have time to test a unified solution to cancelling running binary searches (and possibly also cancelling fetching of assets on pages we have navigated away from). So I'll move a summary of issues to be dealt with to #637 after squashing and merging this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's perfect
You can merge and I might work on a 2.8 release in the next hour |
This is an initial go at #179. It handles case transformations of the first letter of search terms after a space or any Unicode punctuation character.
Limitations:
estéril (Ciru
, the algorithm currently only produces these variants:clearly the variant
Estéril (ciru
is missing, and in this case, then, the search fails to find the result for Sterile Surgery (Estéril (cirugía)
) in the Spanish Wikipedia.lcFirstLetter()
anducFirstLetter()
inutil.js
may be redundant; I guess they were attempts to produce some measure of mixed-case transformations on the string, but they fail in the case mentioned, and it would be better to use a matrix to transform every word's first letters and recombine the strings.In sum, I don't think this is ready for merging. It is an improvement in behaviour, I think, given that it works on Unicode strings, but it can be improved.. It would still be worth testing in a few scenarios and languages.