Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove false assumption about position of assets in ZIM #443

Merged
merged 1 commit into from
Nov 30, 2018

Conversation

Jaifroid
Copy link
Member

This is a proposed fix for #440 . It will probably slow down searches in some ZIMs with lots of dirEntries, such as full Wikipedia, so it needs testing on a range of ZIMs to be sure there are no unintended effects.

@Jaifroid Jaifroid added the bug label Nov 24, 2018
@Jaifroid Jaifroid self-assigned this Nov 24, 2018
@Jaifroid Jaifroid requested a review from mossroy November 24, 2018 18:23
@Jaifroid
Copy link
Member Author

This PR needs further investigation. Although it finds articles in ted_en_global_issues_2018-07.zim (where the previous algorithm found none), it fails to find any articles at all in tedxgeneva-2014_fr_all_2015-03.zim. Interestingly, the latter seems to have gaps, or "holes", in its index: dirEntries in the A namespace that have no title. Maybe we need to check by url instead?

@Jaifroid
Copy link
Member Author

I made a quick index dump routine and tried it on tedxgeneva-2014_fr_all_2015-03.zim. The dirEntries in A namespace that have no titles are subtitle files. Unfortunately, they cause the scan algorithm to fail at return prefix <= dirEntry.title ? -1 : 1;. This was partially handled by the code I removed in this PR, but that code also stalled if there were any empty dirEntry.title fields "above" the current search index.

The problem here is that the routine can only guess the direction of travel based on comparing the search term with the current dirEntry.title. We can't really use the url, becuase although it would work with this ZIM, in other ZIMs the URL can be "1809387/index.html".

It means we need a more sophisticated search algorithm, so that we can know whether to keep searching "up" or "down" the list of dirEntries when we encounter one with a blank title.

@mossroy Any ideas or hints from your experience?

image

@Jaifroid
Copy link
Member Author

I've added a possible fix.

The original binary search works by jumping to the middle of the alphabetical window of dirEntries. It then determines the direction for the next jump, and jumps to the middle of the top half or the bottom half of the remaining items. It repeats this as it zeros in on the matching item.

However, this fails if it keeps jumping to dirEntries that have no title, but are still ordered in the dirEntry index. In these cases, we must remember the previous direction of travel, or "vector", and nudge the window up or down by a single element, rather than jumping to the mid point. This allows the algorithm to find the next available dirEntry with a title, so that it can compare it to the search term. Once it can make the comparison, it can then resume the faster jumping method.

I've put this logic into a simple fix, and it seems to solve the issue with the Geneva TEDx ZIM.

It needs testing on a wider range of ZIMs.

@mossroy
Copy link
Contributor

mossroy commented Nov 26, 2018

Maybe it would be worth having a look at the source code of https://github.com/openzim/libzim to see how they handle this binary search?
Even if it's in a different language, the search algorithm should be the same

@Jaifroid
Copy link
Member Author

I've had a look in this directory, and what I can find there relating to search is: search.cpp, search_iterator.cpp, and levenshtein.cpp. The latter intrigued me, and it seems to be a string-matching concept (used in agrep and diff), and seems to have to do with the ordering of suggestions in search results, but we don't have anything like that. I suspect it's from a search library. The other files also look very different from our search algorithm, and they seem to mesh in with the database search for indexed ZIMs.

What we're using, it seems, and what I described above without realizing it, is a Binary search algorithm [binary chop or half-interval search] -- sorry, I'm sure you know that, but giving it a name is new to me!!

All I've done is, when the search lands on an item without a title (within the A namespace), I nudge the search window up or down one item at a time, until it encounters a dirEntry with a title, so that it can make the binary "decision" (effectively, go half-way up, or go half-way down the window of remaining items). It then resumes the previous search pattern.

From my testing, it seems to overcome the problem. For example, latest dirtybiology_fr_all_2018-06.zim and ted_en_global_issues_2018-07.zim, and others mentioned in #440 are now searchable. In current master, the search is unusable (mostly) in several of these types of ZIM.

@mossroy
Copy link
Contributor

mossroy commented Nov 27, 2018

This is very low-level : we need to be careful.
Instead of inventing our own implementation, I think we need to implement the same algorithm as in libzim and/or ask for help from the people behind libzim.

In libzim, it seems to be in function findxByTitle of fileimpl.cpp : https://github.com/openzim/libzim/blob/684c355a6c56208c3d9e0be7990c4923b7ed8c50/src/fileimpl.cpp#L214
They seem to do the following :

  • find the begin and end offsets of the namespace (and use a cache for performance)
  • then, inside these offsets, implement a binary search with a comparison by namespace, then by title

In kiwix-js, we currently make the binary search on the whole title index, not inside the offsets that correspond to the namespace. @mgautierfr : can we safely assume this is "only" an optimization?

Their implementation really looks similar to the one of your first commit in this branch. They don't need to remember the "direction of travel".
Did you compare with a recent kiwix-desktop?

NB : the search* files you were mentioning are about some full-text search using xapian : it's a different feature.

@Jaifroid
Copy link
Member Author

Jaifroid commented Nov 27, 2018

OK @mossroy , thanks for the explanation and for identifying the corresponding code in the reference implementation. So, the relevant binary decision code is this:

      int c = ns < d->getNamespace() ? -1
            : ns > d->getNamespace() ? 1
            : title.compare(d->getTitle());

      if (c < 0)
        u = p;
      else if (c > 0)
        l = p;
      else
      {
        return std::pair<bool, article_index_t>(true, article_index_t(p));
      }

So, first it checks that the article namespace (A) is less than (returns -1) or greater than ( returns +1) the namespace of the dirEntry.

If the namespaces match, then it returns the Boolean (?) result of title.compare(d->getTitle()). Looking at this, it must be the case that this comparison produces 0 when the title matches (part of?) the search term, because it uses a value of 0 to return the result.

So what does it return when the dirEntry has no title??

It is the latter situation we have to deal with. Even in the A namespace, some dirEntries have no title, so how do we get a meaningful comparison, so we can know whether we have undershot or overshot the search term?

I don't know if there's some more logic in title.compare, but this looks like a reference implementation that doesn't deal with these anomalous cases.

According to the Wikipedia article, binary search only works on arrays / lists that are sorted. We're trying to deal with an array/list that isn't correctly sorted by title.

I've tested the problematic ZIMs on Kiwix desktop, and they work. They don't work in Kiwix JS, and I don't see how they would work with the reference code above either, unless title.compare is somehow skipping over those empty titles (that you can clearly see in my screenshot of an index dump above).

There are only two fixes I can think of:

  1. Use the url instead of the title in such cases, but we'd have to find the right part of the url, and that may not be standard in every ZIM; or
  2. Effectively do what I've proposed, which is initiate a linear search up or down until we find a dirEntry with a title we can use in order to carry on with the binary search.

If you look at the index dump above, you'll see that if we land in one of the EMPTY TITLE dirEntries, doing a linear search above or below will get us to a usable title that will allow the binary search to continue effectively.

It's low-level, but the logic is, well, "binary" :-) . It's also really easy to debug just by pausing on the decision tree.

@mossroy
Copy link
Contributor

mossroy commented Nov 27, 2018

The libzim is not only the reference implementation, it's also the code ran by kiwix-desktop (and most kiwix clients). So there must be a reason why it works. I think we need to find this reason.

I first thought it might be because of the namespace offsets : if the direntries with no title are in a namespace different from 'A' and the namespaces are sorted in the ZIM file, the libzim code will never reach them, and should work.

But, in your screenshot above, I see that all these direntries are .vtt subtitles files, and they are in 'A' namespace : IMHO they should not, as they're not articles. I opened openzim/ted#8 for that (but it's a different issue)

I think I found were it comes from : in
https://github.com/openzim/libzim/blob/684c355a6c56208c3d9e0be7990c4923b7ed8c50/src/_dirent.h#L84 , when you use getTitle() of a DirEntry, it gives the URL if the title is empty.

So this looks like the first fix you were mentioning : slightly modify the binary search to compare with the URL if the title is empty.
I did not like that idea at first, because I found it risky. But if it's what is done in libzim, I suppose it's safe.

@Jaifroid
Copy link
Member Author

Good find @mossroy ! Using the URL is the only other possible choice, as we have to use something, or the comparison fails and destroys the logic of the binary search.

The reason I didn't choose that method was because of the issue of some URLs having a radically different string from their corresponding titles (above all in Stackexchange ZIMs, where there are lots of redirect-style dirEntries, taking us from URLs with text to URLs with mostly numbers). And subdirectories might confuse things. But if this issue only arises with subtitle files that at are deliberately named to be close in sort order to the corresponding video file, then maybe I was being too cautious.

It should be easy to modify the PR to use the new logic. I'll have a go tomorrow.

@Jaifroid
Copy link
Member Author

@mossroy I've now modified the PR to use the URL as comparator. It appears to work. Below is a screenshot of the result of pressing space (' ') with the Geneva TEDx ZIM. It corresponds to the dump above. Other ZIMs now seem readable, though I haven't tested widely yet. One question is whether we should fill in the blanks in the search results seen below with something like the URL in square brackets. It might be disconcerting to users if they encounter blank search results?

image

@Jaifroid
Copy link
Member Author

This is what it would look like without blanks in the search results (ignore styling differences):

image

@mossroy
Copy link
Contributor

mossroy commented Nov 28, 2018

Yes, the URL should be displayed in search results if the title is empty, like in kiwix-desktop.
I'm wondering if it might be worth doing the same thing as in libzim : create a getTitle() function in the DirEntry class, that handles the fallback to URL, and that would be used in the binary search, the displayed search results, and anywhere it would be necessary.

@Jaifroid
Copy link
Member Author

Maybe adding a universal getTitle function should be another issue. Can you add it? In the meantime, I think it's important to fix this bug asap. I've filled in the blank search results. In the process, I discovered some issues with populateListOfArticles, like the fact that the message "First x articles below (refine your search)" is never displayed because the calling function doesn't pass the variable maxArticles (and it's only called from one place). We have that variable anyway as a constant, so I used that instead. I cleaned up the JSDoc. I'll add a couple of comments in the code for when you review it.

www/js/app.js Show resolved Hide resolved
www/js/app.js Show resolved Hide resolved
www/js/lib/zimArchive.js Outdated Show resolved Hide resolved
www/js/lib/zimArchive.js Outdated Show resolved Hide resolved
Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good.
I did not test a lot, only with latest Firefox, with a few different ZIM files, and did not see any issue.
Please test a bit more before squashing/rebasing/merging.

NB : On another subject, I'll try to answer on the mwoffliner and gutenberg github issues, about JS support on kiwix-js, probably by the end of the day. It might take more time for me to work on the other topics. I should have some time to work on kiwix-js on December 11th.

@Jaifroid Jaifroid force-pushed the Fix-incomplete-dirEntry-search branch from df039f6 to 0d5edf3 Compare November 30, 2018 10:59
@Jaifroid Jaifroid force-pushed the Fix-incomplete-dirEntry-search branch from 0d5edf3 to 6d74d82 Compare November 30, 2018 11:12
@Jaifroid Jaifroid merged commit b2b91f4 into master Nov 30, 2018
@Jaifroid Jaifroid deleted the Fix-incomplete-dirEntry-search branch November 30, 2018 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants