Remove false assumption about position of assets in ZIM #443

Jaifroid · 2018-11-24T18:23:01Z

This is a proposed fix for #440 . It will probably slow down searches in some ZIMs with lots of dirEntries, such as full Wikipedia, so it needs testing on a range of ZIMs to be sure there are no unintended effects.

Jaifroid · 2018-11-25T17:46:15Z

This PR needs further investigation. Although it finds articles in ted_en_global_issues_2018-07.zim (where the previous algorithm found none), it fails to find any articles at all in tedxgeneva-2014_fr_all_2015-03.zim. Interestingly, the latter seems to have gaps, or "holes", in its index: dirEntries in the A namespace that have no title. Maybe we need to check by url instead?

Jaifroid · 2018-11-25T19:54:27Z

I made a quick index dump routine and tried it on tedxgeneva-2014_fr_all_2015-03.zim. The dirEntries in A namespace that have no titles are subtitle files. Unfortunately, they cause the scan algorithm to fail at return prefix <= dirEntry.title ? -1 : 1;. This was partially handled by the code I removed in this PR, but that code also stalled if there were any empty dirEntry.title fields "above" the current search index.

The problem here is that the routine can only guess the direction of travel based on comparing the search term with the current dirEntry.title. We can't really use the url, becuase although it would work with this ZIM, in other ZIMs the URL can be "1809387/index.html".

It means we need a more sophisticated search algorithm, so that we can know whether to keep searching "up" or "down" the list of dirEntries when we encounter one with a blank title.

@mossroy Any ideas or hints from your experience?

Jaifroid · 2018-11-26T00:46:29Z

I've added a possible fix.

The original binary search works by jumping to the middle of the alphabetical window of dirEntries. It then determines the direction for the next jump, and jumps to the middle of the top half or the bottom half of the remaining items. It repeats this as it zeros in on the matching item.

However, this fails if it keeps jumping to dirEntries that have no title, but are still ordered in the dirEntry index. In these cases, we must remember the previous direction of travel, or "vector", and nudge the window up or down by a single element, rather than jumping to the mid point. This allows the algorithm to find the next available dirEntry with a title, so that it can compare it to the search term. Once it can make the comparison, it can then resume the faster jumping method.

I've put this logic into a simple fix, and it seems to solve the issue with the Geneva TEDx ZIM.

It needs testing on a wider range of ZIMs.

mossroy · 2018-11-26T19:31:33Z

Maybe it would be worth having a look at the source code of https://github.com/openzim/libzim to see how they handle this binary search?
Even if it's in a different language, the search algorithm should be the same

Jaifroid · 2018-11-26T20:12:57Z

I've had a look in this directory, and what I can find there relating to search is: search.cpp, search_iterator.cpp, and levenshtein.cpp. The latter intrigued me, and it seems to be a string-matching concept (used in agrep and diff), and seems to have to do with the ordering of suggestions in search results, but we don't have anything like that. I suspect it's from a search library. The other files also look very different from our search algorithm, and they seem to mesh in with the database search for indexed ZIMs.

What we're using, it seems, and what I described above without realizing it, is a Binary search algorithm [binary chop or half-interval search] -- sorry, I'm sure you know that, but giving it a name is new to me!!

All I've done is, when the search lands on an item without a title (within the A namespace), I nudge the search window up or down one item at a time, until it encounters a dirEntry with a title, so that it can make the binary "decision" (effectively, go half-way up, or go half-way down the window of remaining items). It then resumes the previous search pattern.

From my testing, it seems to overcome the problem. For example, latest dirtybiology_fr_all_2018-06.zim and ted_en_global_issues_2018-07.zim, and others mentioned in #440 are now searchable. In current master, the search is unusable (mostly) in several of these types of ZIM.

mossroy · 2018-11-27T12:58:16Z

This is very low-level : we need to be careful.
Instead of inventing our own implementation, I think we need to implement the same algorithm as in libzim and/or ask for help from the people behind libzim.

In libzim, it seems to be in function findxByTitle of fileimpl.cpp : https://github.com/openzim/libzim/blob/684c355a6c56208c3d9e0be7990c4923b7ed8c50/src/fileimpl.cpp#L214
They seem to do the following :

find the begin and end offsets of the namespace (and use a cache for performance)
then, inside these offsets, implement a binary search with a comparison by namespace, then by title

In kiwix-js, we currently make the binary search on the whole title index, not inside the offsets that correspond to the namespace. @mgautierfr : can we safely assume this is "only" an optimization?

Their implementation really looks similar to the one of your first commit in this branch. They don't need to remember the "direction of travel".
Did you compare with a recent kiwix-desktop?

NB : the search* files you were mentioning are about some full-text search using xapian : it's a different feature.

Jaifroid · 2018-11-27T14:56:04Z

OK @mossroy , thanks for the explanation and for identifying the corresponding code in the reference implementation. So, the relevant binary decision code is this:

      int c = ns < d->getNamespace() ? -1
            : ns > d->getNamespace() ? 1
            : title.compare(d->getTitle());

      if (c < 0)
        u = p;
      else if (c > 0)
        l = p;
      else
      {
        return std::pair<bool, article_index_t>(true, article_index_t(p));
      }

So, first it checks that the article namespace (A) is less than (returns -1) or greater than ( returns +1) the namespace of the dirEntry.

If the namespaces match, then it returns the Boolean (?) result of title.compare(d->getTitle()). Looking at this, it must be the case that this comparison produces 0 when the title matches (part of?) the search term, because it uses a value of 0 to return the result.

So what does it return when the dirEntry has no title??

It is the latter situation we have to deal with. Even in the A namespace, some dirEntries have no title, so how do we get a meaningful comparison, so we can know whether we have undershot or overshot the search term?

I don't know if there's some more logic in title.compare, but this looks like a reference implementation that doesn't deal with these anomalous cases.

According to the Wikipedia article, binary search only works on arrays / lists that are sorted. We're trying to deal with an array/list that isn't correctly sorted by title.

I've tested the problematic ZIMs on Kiwix desktop, and they work. They don't work in Kiwix JS, and I don't see how they would work with the reference code above either, unless title.compare is somehow skipping over those empty titles (that you can clearly see in my screenshot of an index dump above).

There are only two fixes I can think of:

Use the url instead of the title in such cases, but we'd have to find the right part of the url, and that may not be standard in every ZIM; or
Effectively do what I've proposed, which is initiate a linear search up or down until we find a dirEntry with a title we can use in order to carry on with the binary search.

If you look at the index dump above, you'll see that if we land in one of the EMPTY TITLE dirEntries, doing a linear search above or below will get us to a usable title that will allow the binary search to continue effectively.

It's low-level, but the logic is, well, "binary" :-) . It's also really easy to debug just by pausing on the decision tree.

mossroy · 2018-11-27T22:11:22Z

The libzim is not only the reference implementation, it's also the code ran by kiwix-desktop (and most kiwix clients). So there must be a reason why it works. I think we need to find this reason.

I first thought it might be because of the namespace offsets : if the direntries with no title are in a namespace different from 'A' and the namespaces are sorted in the ZIM file, the libzim code will never reach them, and should work.

But, in your screenshot above, I see that all these direntries are .vtt subtitles files, and they are in 'A' namespace : IMHO they should not, as they're not articles. I opened openzim/ted#8 for that (but it's a different issue)

I think I found were it comes from : in
https://github.com/openzim/libzim/blob/684c355a6c56208c3d9e0be7990c4923b7ed8c50/src/_dirent.h#L84 , when you use getTitle() of a DirEntry, it gives the URL if the title is empty.

So this looks like the first fix you were mentioning : slightly modify the binary search to compare with the URL if the title is empty.
I did not like that idea at first, because I found it risky. But if it's what is done in libzim, I suppose it's safe.

Jaifroid · 2018-11-27T22:25:09Z

Good find @mossroy ! Using the URL is the only other possible choice, as we have to use something, or the comparison fails and destroys the logic of the binary search.

The reason I didn't choose that method was because of the issue of some URLs having a radically different string from their corresponding titles (above all in Stackexchange ZIMs, where there are lots of redirect-style dirEntries, taking us from URLs with text to URLs with mostly numbers). And subdirectories might confuse things. But if this issue only arises with subtitle files that at are deliberately named to be close in sort order to the corresponding video file, then maybe I was being too cautious.

It should be easy to modify the PR to use the new logic. I'll have a go tomorrow.

Jaifroid · 2018-11-28T08:34:01Z

@mossroy I've now modified the PR to use the URL as comparator. It appears to work. Below is a screenshot of the result of pressing space (' ') with the Geneva TEDx ZIM. It corresponds to the dump above. Other ZIMs now seem readable, though I haven't tested widely yet. One question is whether we should fill in the blanks in the search results seen below with something like the URL in square brackets. It might be disconcerting to users if they encounter blank search results?

Jaifroid · 2018-11-28T08:58:34Z

This is what it would look like without blanks in the search results (ignore styling differences):

mossroy · 2018-11-28T10:23:44Z

Yes, the URL should be displayed in search results if the title is empty, like in kiwix-desktop.
I'm wondering if it might be worth doing the same thing as in libzim : create a getTitle() function in the DirEntry class, that handles the fallback to URL, and that would be used in the binary search, the displayed search results, and anywhere it would be necessary.

Jaifroid · 2018-11-28T11:07:04Z

Maybe adding a universal getTitle function should be another issue. Can you add it? In the meantime, I think it's important to fix this bug asap. I've filled in the blank search results. In the process, I discovered some issues with populateListOfArticles, like the fact that the message "First x articles below (refine your search)" is never displayed because the calling function doesn't pass the variable maxArticles (and it's only called from one place). We have that variable anyway as a constant, so I used that instead. I cleaned up the JSDoc. I'll add a couple of comments in the code for when you review it.

www/js/app.js

www/js/lib/zimArchive.js

mossroy

The code looks good.
I did not test a lot, only with latest Firefox, with a few different ZIM files, and did not see any issue.
Please test a bit more before squashing/rebasing/merging.

NB : On another subject, I'll try to answer on the mwoffliner and gutenberg github issues, about JS support on kiwix-js, probably by the end of the day. It might take more time for me to work on the other topics. I should have some time to work on kiwix-js on December 11th.

Jaifroid added the bug label Nov 24, 2018

Jaifroid self-assigned this Nov 24, 2018

Jaifroid requested a review from mossroy November 24, 2018 18:23

Jaifroid commented Nov 28, 2018

View reviewed changes

www/js/app.js Show resolved Hide resolved

Jaifroid commented Nov 28, 2018

View reviewed changes

www/js/app.js Show resolved Hide resolved

Jaifroid commented Nov 28, 2018

View reviewed changes

www/js/lib/zimArchive.js Show resolved Hide resolved

mossroy reviewed Nov 28, 2018

View reviewed changes

www/js/lib/zimArchive.js Outdated Show resolved Hide resolved

mossroy reviewed Nov 28, 2018

View reviewed changes

www/js/lib/zimArchive.js Outdated Show resolved Hide resolved

mossroy approved these changes Nov 28, 2018

View reviewed changes

Jaifroid force-pushed the Fix-incomplete-dirEntry-search branch from df039f6 to 0d5edf3 Compare November 30, 2018 10:59

Fix incomplete dirEntry search

6d74d82

Jaifroid force-pushed the Fix-incomplete-dirEntry-search branch from 0d5edf3 to 6d74d82 Compare November 30, 2018 11:12

Jaifroid merged commit b2b91f4 into master Nov 30, 2018

Jaifroid deleted the Fix-incomplete-dirEntry-search branch November 30, 2018 11:19

Jaifroid mentioned this pull request Nov 30, 2018

Enable media playback in JQuery mode #441

Merged

This was referenced Feb 23, 2019

[TEST] New ZIM files without file extension in the URL #475

Closed

Stop putting article URL if no title is provided openzim/libzim#208

Closed

mossroy mentioned this pull request Apr 17, 2019

Title should be based on the DOM node <title> openzim/mwoffliner#605

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove false assumption about position of assets in ZIM #443

Remove false assumption about position of assets in ZIM #443

Jaifroid commented Nov 24, 2018

Jaifroid commented Nov 25, 2018

Jaifroid commented Nov 25, 2018

Jaifroid commented Nov 26, 2018

mossroy commented Nov 26, 2018

Jaifroid commented Nov 26, 2018

mossroy commented Nov 27, 2018

Jaifroid commented Nov 27, 2018 •

edited

Loading

mossroy commented Nov 27, 2018

Jaifroid commented Nov 27, 2018

Jaifroid commented Nov 28, 2018

Jaifroid commented Nov 28, 2018

mossroy commented Nov 28, 2018

Jaifroid commented Nov 28, 2018

mossroy left a comment

Remove false assumption about position of assets in ZIM #443

Remove false assumption about position of assets in ZIM #443

Conversation

Jaifroid commented Nov 24, 2018

Jaifroid commented Nov 25, 2018

Jaifroid commented Nov 25, 2018

Jaifroid commented Nov 26, 2018

mossroy commented Nov 26, 2018

Jaifroid commented Nov 26, 2018

mossroy commented Nov 27, 2018

Jaifroid commented Nov 27, 2018 • edited Loading

mossroy commented Nov 27, 2018

Jaifroid commented Nov 27, 2018

Jaifroid commented Nov 28, 2018

Jaifroid commented Nov 28, 2018

mossroy commented Nov 28, 2018

Jaifroid commented Nov 28, 2018

mossroy left a comment

Choose a reason for hiding this comment

Jaifroid commented Nov 27, 2018 •

edited

Loading