-
-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove false assumption about position of assets in ZIM #443
Conversation
This PR needs further investigation. Although it finds articles in |
I made a quick index dump routine and tried it on The problem here is that the routine can only guess the direction of travel based on comparing the search term with the current dirEntry.title. We can't really use the url, becuase although it would work with this ZIM, in other ZIMs the URL can be "1809387/index.html". It means we need a more sophisticated search algorithm, so that we can know whether to keep searching "up" or "down" the list of dirEntries when we encounter one with a blank title. @mossroy Any ideas or hints from your experience? |
I've added a possible fix. The original binary search works by jumping to the middle of the alphabetical window of dirEntries. It then determines the direction for the next jump, and jumps to the middle of the top half or the bottom half of the remaining items. It repeats this as it zeros in on the matching item. However, this fails if it keeps jumping to dirEntries that have no title, but are still ordered in the dirEntry index. In these cases, we must remember the previous direction of travel, or "vector", and nudge the window up or down by a single element, rather than jumping to the mid point. This allows the algorithm to find the next available dirEntry with a title, so that it can compare it to the search term. Once it can make the comparison, it can then resume the faster jumping method. I've put this logic into a simple fix, and it seems to solve the issue with the Geneva TEDx ZIM. It needs testing on a wider range of ZIMs. |
Maybe it would be worth having a look at the source code of https://github.com/openzim/libzim to see how they handle this binary search? |
I've had a look in this directory, and what I can find there relating to search is: search.cpp, search_iterator.cpp, and levenshtein.cpp. The latter intrigued me, and it seems to be a string-matching concept (used in agrep and diff), and seems to have to do with the ordering of suggestions in search results, but we don't have anything like that. I suspect it's from a search library. The other files also look very different from our search algorithm, and they seem to mesh in with the database search for indexed ZIMs. What we're using, it seems, and what I described above without realizing it, is a Binary search algorithm [binary chop or half-interval search] -- sorry, I'm sure you know that, but giving it a name is new to me!! All I've done is, when the search lands on an item without a title (within the A namespace), I nudge the search window up or down one item at a time, until it encounters a dirEntry with a title, so that it can make the binary "decision" (effectively, go half-way up, or go half-way down the window of remaining items). It then resumes the previous search pattern. From my testing, it seems to overcome the problem. For example, latest |
This is very low-level : we need to be careful. In libzim, it seems to be in function findxByTitle of fileimpl.cpp : https://github.com/openzim/libzim/blob/684c355a6c56208c3d9e0be7990c4923b7ed8c50/src/fileimpl.cpp#L214
In kiwix-js, we currently make the binary search on the whole title index, not inside the offsets that correspond to the namespace. @mgautierfr : can we safely assume this is "only" an optimization? Their implementation really looks similar to the one of your first commit in this branch. They don't need to remember the "direction of travel". NB : the search* files you were mentioning are about some full-text search using xapian : it's a different feature. |
OK @mossroy , thanks for the explanation and for identifying the corresponding code in the reference implementation. So, the relevant binary decision code is this:
So, first it checks that the article namespace (A) is less than (returns -1) or greater than ( returns +1) the namespace of the dirEntry. If the namespaces match, then it returns the Boolean (?) result of So what does it return when the dirEntry has no title?? It is the latter situation we have to deal with. Even in the A namespace, some dirEntries have no title, so how do we get a meaningful comparison, so we can know whether we have undershot or overshot the search term? I don't know if there's some more logic in According to the Wikipedia article, binary search only works on arrays / lists that are sorted. We're trying to deal with an array/list that isn't correctly sorted by title. I've tested the problematic ZIMs on Kiwix desktop, and they work. They don't work in Kiwix JS, and I don't see how they would work with the reference code above either, unless There are only two fixes I can think of:
If you look at the index dump above, you'll see that if we land in one of the EMPTY TITLE dirEntries, doing a linear search above or below will get us to a usable title that will allow the binary search to continue effectively. It's low-level, but the logic is, well, "binary" :-) . It's also really easy to debug just by pausing on the decision tree. |
The libzim is not only the reference implementation, it's also the code ran by kiwix-desktop (and most kiwix clients). So there must be a reason why it works. I think we need to find this reason. I first thought it might be because of the namespace offsets : if the direntries with no title are in a namespace different from 'A' and the namespaces are sorted in the ZIM file, the libzim code will never reach them, and should work. But, in your screenshot above, I see that all these direntries are .vtt subtitles files, and they are in 'A' namespace : IMHO they should not, as they're not articles. I opened openzim/ted#8 for that (but it's a different issue) I think I found were it comes from : in So this looks like the first fix you were mentioning : slightly modify the binary search to compare with the URL if the title is empty. |
Good find @mossroy ! Using the URL is the only other possible choice, as we have to use something, or the comparison fails and destroys the logic of the binary search. The reason I didn't choose that method was because of the issue of some URLs having a radically different string from their corresponding titles (above all in Stackexchange ZIMs, where there are lots of redirect-style dirEntries, taking us from URLs with text to URLs with mostly numbers). And subdirectories might confuse things. But if this issue only arises with subtitle files that at are deliberately named to be close in sort order to the corresponding video file, then maybe I was being too cautious. It should be easy to modify the PR to use the new logic. I'll have a go tomorrow. |
@mossroy I've now modified the PR to use the URL as comparator. It appears to work. Below is a screenshot of the result of pressing space (' ') with the Geneva TEDx ZIM. It corresponds to the dump above. Other ZIMs now seem readable, though I haven't tested widely yet. One question is whether we should fill in the blanks in the search results seen below with something like the URL in square brackets. It might be disconcerting to users if they encounter blank search results? |
Yes, the URL should be displayed in search results if the title is empty, like in kiwix-desktop. |
Maybe adding a universal getTitle function should be another issue. Can you add it? In the meantime, I think it's important to fix this bug asap. I've filled in the blank search results. In the process, I discovered some issues with populateListOfArticles, like the fact that the message "First x articles below (refine your search)" is never displayed because the calling function doesn't pass the variable maxArticles (and it's only called from one place). We have that variable anyway as a constant, so I used that instead. I cleaned up the JSDoc. I'll add a couple of comments in the code for when you review it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good.
I did not test a lot, only with latest Firefox, with a few different ZIM files, and did not see any issue.
Please test a bit more before squashing/rebasing/merging.
NB : On another subject, I'll try to answer on the mwoffliner and gutenberg github issues, about JS support on kiwix-js, probably by the end of the day. It might take more time for me to work on the other topics. I should have some time to work on kiwix-js on December 11th.
df039f6
to
0d5edf3
Compare
0d5edf3
to
6d74d82
Compare
This is a proposed fix for #440 . It will probably slow down searches in some ZIMs with lots of dirEntries, such as full Wikipedia, so it needs testing on a range of ZIMs to be sure there are no unintended effects.