Support reading ZIM archives with no namespace #597

Closed
Jaifroid opened this issue Mar 4, 2020 · 29 comments
@Jaifroid
Member

Jaifroid commented Mar 4, 2020

See comment here: #230 (comment) (and that issue more broadly) and the issue for this on Libzim: openzim/libzim#15. If we haven't achieved #514 by the time this becomes a reality, we will need to do a fair amount of adjustment to the back end. It should not be difficult, just a bit tedious... :-)

@Jaifroid Jaifroid self-assigned this Mar 4, 2020
@Jaifroid Jaifroid added this to the v2.9 milestone Mar 4, 2020
@Rajat379

Rajat379 commented Dec 9, 2020

I am keen to work on this issue. Should I start straight away?

@Jaifroid
Member Author

Jaifroid commented Dec 9, 2020

@Rajat379 This is a rather specialized issue: unless you are very experienced with JS development and with the format of ZIM archives (and the changes happening to that format), I suggest you choose an easier issue to tackle as a first contribution.

@Jaifroid
Member Author

Jaifroid commented Dec 9, 2020

@kelson42 wrote in #684:

First test ZIM files have been created and are available at:
http://tmp.kiwix.org/nons_zims/

Almost all Kiwix ports work with them so far, but not Kiwix-JS unfortunately.

@Jaifroid
Member Author

@kelson42 What is the new ZIM format specification we should work with here? I've checked https://wiki.openzim.org/wiki/ZIM_file_format#Namespaces and it doesn't have any updated information. Looking at the Ray Charles sample you provided, it seems the landing page (at least) is in Namespace - which is why it is not showing in Kiwix JS: we have code that checks if the landing page is an article, and rejects it if it is not in namespace A. While I can change that to accept Namespace - as well, it would be useful to know whether - is fixed as part of a stable specification, and any other details of the new specification that are relevant for adapting the code.

@kelson42
Collaborator

kelson42 commented Dec 13, 2020

@Jaifroid @mgautierfr Could you please talk together on this?

@kelson42
Collaborator

kelson42 commented Dec 13, 2020

@Jaifroid The current roadmap looks like this:

  • until the end of December: polishing the last details and the documentation of libzim
  • mid-January: release of the new libzim
  • mid-February: first scrapers adapted
  • March: first official ZIM releases without namespaces

Hope in the meantime we will be able to fix all related bugs in Kiwix JS and other ports.

@Jaifroid
Member Author

Thanks, that's good to know. I don't anticipate big difficulties adapting the code. The main issue, from previous discussions, is likely to be slower binary search if we have to search through one huge namespace instead of one focused on articles. But we now have a block cache merged in 3.1 which should help with that.

@mgautierfr
Member

@Jaifroid, I have updated the spec on the wiki (https://wiki.openzim.org/wiki/ZIM_file_format).

libzim itself (in master) doesn't fully follow the spec yet. It doesn't handle the well-known entries (W namespace) or the listings (in the X namespace), so the ZIM files in http://tmp.kiwix.org/nons_zims don't contain those entries. We will not release libzim until it implements the spec correctly.

The main changes to remember are:

  • All "user" content is written in the C namespace (instead of A, I, J, -).
  • We have new entries (generated by libzim itself) in the W and X namespaces to help locate entries.
  • libzim itself hides the namespace from the user and provides a specific API to access content outside the C namespace. Other implementations are free to do as they wish (but I recommend doing the same).

If you have any questions, feel free to ask them here or directly to me on slack.

@Jaifroid
Member Author

Thank you, @mgautierfr , I'll study that and get back to you if I have any questions.

@Jaifroid
Member Author

@mgautierfr I do have an immediate observation: deprecation of titlePtrList would put Kiwix JS (and associated readers) out of business overnight. We cannot read Xapian-encoded databases, as there is no JS implementation of Xapian. Therefore, could I urge you and @kelson42 to think very carefully before making such a breaking change to the specification? We would definitely need help, either to port Xapian via Emscripten, or else to port libzim via Emscripten (the latter implies the former). I would argue that until that is achieved, there should be no breaking change to the ZIM format!

@kelson42
Collaborator

@Jaifroid We are aware of the problem you raised. The titlePtrList will stay available as long as necessary for backward compatibility. That said, its deprecation does not mean we expect all readers to deal with a Xapian index just to provide a simple title suggestion system. We are already working on a replacement with a better approach, and it should be easy for Kiwix JS to add support for it; see openzim/libzim#397.

@mgautierfr
Member

As @kelson42 said, we will not drop titlePtrList.
My modification to the spec is maybe a bit too "aggressive", but as you said it is a compatibility break, so we will not remove it (at least not for a long time, and not without a change in the major version).

The titleList is still present; what is considered obsolete is how to locate it.
We introduced two new entries in the X namespace: listing/titleOrdered/v0 and listing/titleOrdered/v1 (https://wiki.openzim.org/wiki/Search_indexes).

The content of listing/titleOrdered/v0 is exactly the same as the content of the titleList.
The content will be put in an uncompressed cluster, so you can use the same mechanism as for any other entry to locate the listing. In fact, once libzim implements openzim/libzim#397, titlePtrList will point to the content of listing/titleOrdered/v0.

listing/titleOrdered/v1 has almost the same content as the titleList. The only difference is the number of entries: v1 contains only the "main entries" (from the user's point of view, not the resources), whereas v0 contains all entries. Readers should prefer v1 to v0, as this is the only way to know whether an entry is an article or a resource.
The only way to locate this list is to use listing/titleOrdered/v1; there is no direct pointer in the header. For consistency, we say that using titlePtrList is obsolete, but it will remain valid.
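
A minimal sketch of how a JavaScript reader might consume these listings, assuming each listing is a flat sequence of 4-byte little-endian entry indices (the same layout as the classic title pointer list). The function names `decodeTitleListing` and `pickTitleListingPath` are hypothetical and are not part of Kiwix JS or libzim:

```javascript
// Assumption: a listing is a packed array of unsigned 32-bit little-endian
// entry indices, ordered by title.
function decodeTitleListing(bytes) {
  if (bytes.byteLength % 4 !== 0) {
    throw new RangeError('Listing length must be a multiple of 4 bytes');
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const indices = [];
  for (let offset = 0; offset < bytes.byteLength; offset += 4) {
    indices.push(view.getUint32(offset, true)); // true = little-endian
  }
  return indices;
}

// Prefer v1 (main entries only), then v0 (all entries). Returning null means
// "fall back to the title pointer position stored in the header".
function pickTitleListingPath(availablePaths) {
  const preferred = ['listing/titleOrdered/v1', 'listing/titleOrdered/v0'];
  return preferred.find((path) => availablePaths.has(path)) || null;
}
```

Keeping the header-based lookup as the last fallback lets the same reader open older ZIM files that have no X/listing entries at all.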

@Jaifroid
Member Author

@mgautierfr Thank you for the reassuring explanation. I'm no expert in the correct terminology, but what you describe sounds like a "deprecated" API, that is still available/supported but that may be removed at some future time. If I've understood correctly, then that may be a better term to use in the spec than "obsolete".

@mgautierfr
Member

It is a matter of terminology, not an exact science :)

I didn't want to use "deprecated". "Deprecated" sounds like you should not use (read) it and should use another way/method/field instead. It also sounds like the feature may disappear in the future. (The last sentence of the paragraph in the spec said that, but it was wrong; I have already removed it.)

But that is not the case for titlePtrList. Readers still have to read it (to support old ZIM files), and we will not remove it without breaking compatibility (at which point we will probably change the header format to simply drop the slot).

titlePtrList is the old way to locate the titleList, but it is not deprecated: it will not be removed, and you should still use it.

@kelson42
Collaborator

As a reminder, we are just a few PRs away from releasing libzim 7.0.0.

@Jaifroid
Member Author

OK, thanks for the reminder @kelson42. Can I just check whether the ZIM archives in this directory:

http://tmp.kiwix.org/nons_zims/

still correspond to the new spec (at least enough to use as test ZIMs)?

An immediate change we could make would be to change all references to the /A/ namespace to /C/ (in our code). We would then have to check for any code that has the /I/ or /J/ namespaces hard-coded. One complication is that we need to keep all the old code for backward compatibility.

@kelson42
Collaborator

@Jaifroid These ZIM files still have the title index at the old/current location. @mgautierfr Would you please be able to refresh them?

@Jaifroid
Member Author

Jaifroid commented Jan 25, 2021

@kelson42 But, re-reading the discussion above, it seems that titlePtrList will still be accessible the old way for a while yet, so perhaps we can concentrate on removing the assumption that articles are found only in A/, and then we can work separately on referencing the titlePtrList from X/listing/titleOrdered/v0 instead of its binary position.

To summarize, I'm proposing we do two separate PRs:

  1. Support C/ namespace + MIME type as a way of identifying articles in addition to the old A/ namespace (and similar for images);
  2. Get titlePtrList from the X/ namespace instead of finding it from its binary position.
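
For the first PR, the article test might look roughly like the sketch below. The `dirEntry` shape (`namespace`, `mimetype`) and the helper names are assumptions made for illustration, not the actual Kiwix JS code:

```javascript
// Hypothetical dirEntry shape: { namespace: 'C', mimetype: 'text/html' }.
function isArticle(dirEntry) {
  // Old scheme: everything in A/ is treated as an article.
  if (dirEntry.namespace === 'A') return true;
  // New scheme: C/ holds all user content, so the MIME type has to decide.
  return dirEntry.namespace === 'C' && /^text\/html\b/i.test(dirEntry.mimetype || '');
}

function isImage(dirEntry) {
  // Old scheme: I/ is assumed here to hold images.
  if (dirEntry.namespace === 'I') return true;
  return dirEntry.namespace === 'C' && /^image\//i.test(dirEntry.mimetype || '');
}
```

Checking the old namespace first keeps existing archives on the code path they already use, so the MIME test only applies to C/ entries.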

Does this sound like a good way to proceed?

@Jaifroid
Member Author

Jaifroid commented Jan 25, 2021

I've added a hacky PR #698 which can read wikipedia_en_ray_charles_2018-09_nons.zim in jQuery mode.

But I have a query for @mgautierfr: the landing article of this ZIM is in namespace - instead of W or C. Is this just an older version of the spec before W was settled on for the Well Known page type?

EDIT: I think the answer to my query is above, i.e. that the test ZIMs do not yet handle the W namespace, so I guess this is why - is used instead for the landing page. It would be good to know if we need to support - as a fallback.

@Jaifroid
Member Author

Jaifroid commented Jan 26, 2021

Sorry to bombard you with questions, @mgautierfr, but do you have any suggestion from your experience on how to deal with the title search issue shown in the screenshot below? This is from the Ray Charles test ZIM, which is small. Extrapolated to a large ZIM, this would be unmanageable. To elaborate: because namespace C contains "everything", when using case-sensitive binary search on the title list we get a lot of unwanted results. While I can of course filter these out by testing that the MIME type is 'text/html', on a large ZIM I would need to search forward through potentially tens of thousands of unwanted results to find a single legitimate result. Each result requires a dirEntry lookup by title index.

image
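
The cost described above can be reproduced with a toy model. The sketch below uses a hypothetical in-memory array in place of real dirEntry lookups and counts how many entries must be inspected to find the article matches for a prefix; the data and function names are invented for illustration:

```javascript
// Toy stand-in for a title-ordered list of dirEntries (hypothetical data).
const entries = [
  { title: 'Quincy Jones', mimetype: 'text/html' },
  { title: 'Ray Charles', mimetype: 'text/html' },
  { title: 'Ray_1.jpg', mimetype: 'image/jpeg' },
  { title: 'Ray_2.jpg', mimetype: 'image/jpeg' },
  { title: 'Ray_style.css', mimetype: 'text/css' },
  { title: 'Soul music', mimetype: 'text/html' }
];

// Binary search for the first entry whose title is >= prefix.
function lowerBound(list, prefix) {
  let lo = 0;
  let hi = list.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (list[mid].title < prefix) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Scan forward from the first match, keeping only HTML entries.
// "lookups" counts every entry inspected, wanted or not.
function findArticlesByPrefix(list, prefix, maxResults) {
  const results = [];
  let lookups = 0;
  for (let i = lowerBound(list, prefix); i < list.length; i++) {
    if (!list[i].title.startsWith(prefix) || results.length >= maxResults) break;
    lookups++;
    if (list[i].mimetype === 'text/html') results.push(list[i].title);
  }
  return { results, lookups };
}

const outcome = findArticlesByPrefix(entries, 'Ray', 5);
console.log(outcome.results, outcome.lookups);
```

In this toy list, the prefix 'Ray' yields one article after four entries are inspected. In a real archive each inspection is a dirEntry read, which is where the cost comes from.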

@mgautierfr
Member

But I have a query for @mgautierfr: the landing article of this ZIM is in namespace - instead of W or C. Is this just an older version of the spec before W was settled on for the Well Known page type?

Yes, they are "old" ZIM files, created before there was a spec for the W namespace.
But you should not have to worry about this. The main page is also accessible through the mainPage entry in the header.
In any case, you shouldn't try to read -/mainPage.
And be aware that in previous ZIM files the header points directly to the main page, whereas in new ZIM files the header points to a redirect entry that points to the main page.
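
A rough sketch of resolving the main page from the header in a way that works for both old files (direct pointer) and new files (pointer to a redirect). The `getDirEntry` callback, the `redirectIndex` property, and the hop limit are assumptions for illustration, not the real Kiwix JS or libzim API:

```javascript
// Hypothetical dirEntry shape: { url, redirectIndex }, where redirectIndex is
// undefined for content entries. getDirEntry(index) is an assumed lookup.
function resolveMainPage(mainPageIndex, getDirEntry, maxHops = 8) {
  let entry = getDirEntry(mainPageIndex);
  for (let hops = 0; entry && entry.redirectIndex !== undefined; hops++) {
    // Guard against redirect loops in a malformed archive.
    if (hops >= maxHops) throw new Error('Too many redirects resolving main page');
    entry = getDirEntry(entry.redirectIndex);
  }
  return entry;
}
```

Because the loop simply follows redirects until it reaches a content entry, the caller does not need to know which generation of ZIM file it is reading.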

but do you have any suggestion from your experience on how to deal with the title search issue shown in the screenshot below?

Use X/listing/titleOrdered/v1.
(They are not in the current zim files in http://tmp.kiwix.org/nons_zims, I will recreate them)

@Jaifroid
Member Author

Use X/listing/titleOrdered/v1.
(They are not in the current zim files in http://tmp.kiwix.org/nons_zims, I will recreate them)

But is that a Xapian compressed index, or is it a standard titlePtrList? (We have no way to decode Xapian.)
Thanks for the quick reply.

@mgautierfr
Member

It is the same format as titlePtrList but with fewer entries (https://wiki.openzim.org/wiki/Search_indexes#Title_index_v1)

@kelson42
Collaborator

@mgautierfr Things are a bit complex already, could you please refresh quickly the ZIM file http://tmp.kiwix.org/nons_zims?

@mgautierfr
Member

Things are a bit complex already, could you please refresh quickly the ZIM file http://tmp.kiwix.org/nons_zims?

Done

@Jaifroid
Member Author

Thank you @mgautierfr!

I'm running into some inconsistencies in 3dprinting.stackexchange.com_en_all_2018-03_nons.zim. This ZIM has subdirectories, but on some pages we seem to have references to assets with no namespace, and on other pages we have old-style namespaces creeping in.

For an example of the latter, see the page with ZIM URL C/tag/surface/1.html. The HTML of this page begins:

<!DOCTYPE html>
  <html>
    <head>
      <meta charset="utf-8" />
      <title>surface</title>
....  SNIP ....
      <link rel="stylesheet" href="../../../-/static/bootstrap/css/bootstrap.min.css">
      <link rel="stylesheet" href="../../../-/static/bootstrap/css/bootstrap-theme.min.css">
      <link rel="stylesheet" href="../../../-/static/main.css"/>

The assets listed here with namespace - cannot be found in the ZIM.

Assets on other pages for example ZIM URL C/question/11.html are written correctly, e.g.:

../static/bootstrap/css/bootstrap.min.css

Relative to its page, this yields the correct asset URL:

C/static/bootstrap/css/bootstrap.min.css

Is this a local problem with the ZIM writer for this ZIM type?
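
The path arithmetic in this comment can be checked mechanically. The sketch below resolves an href against the ZIM URL of the page that contains it, using the standard URL class with a placeholder origin; `resolveZimUrl` and the `zim.invalid` host are hypothetical names chosen for illustration:

```javascript
// Resolve a relative href against the ZIM URL of the containing page.
// The origin is a throwaway placeholder; only the path is of interest.
// Note: the returned path stays percent-encoded if the href was.
function resolveZimUrl(pageUrl, href) {
  const base = new URL(pageUrl, 'http://zim.invalid/');
  return new URL(href, base).pathname.slice(1); // drop the leading "/"
}

console.log(resolveZimUrl('C/question/11.html', '../static/bootstrap/css/bootstrap.min.css'));
console.log(resolveZimUrl('C/tag/surface/1.html', '../../../-/static/main.css'));
```

The first call resolves inside C/, while the second climbs out of C/ altogether and lands in the old - namespace, which matches the missing assets reported above.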

@mgautierfr
Member

This is an issue with zim-recreate (https://github.com/openzim/zim-tools/blob/master/src/zimrecreate.cpp#L75-L89).
Its URL rewriting is pretty limited. "Real" new ZIM files should not have this issue.

@Jaifroid
Member Author

Jaifroid commented Feb 6, 2021

This issue is partially implemented by 18c51f1.

What remains is to support version 1 of titlePtrList, accessed from X/listing/titleOrdered/v1.

@Jaifroid
Member Author

I have opened #708 in order to continue work on adding support for no-namespace ZIMs. To summarize, these are now readable in Kiwix JS, but we are not yet using the new versions of the titlePtrList. I'm closing this issue, and discussion should move to #708.
