[SPEC] TitlePtrList should not contains all entries #397

Closed
mgautierfr opened this issue Aug 10, 2020 · 1 comment · Fixed by #487

@mgautierfr
Collaborator

Currently the TitlePtrList contains all the entries in the zim file, using the NS(Title|Url) key to sort the entries.

  • Entries are sorted by namespace (NS) first.
  • Then, by title (using the url if the title is empty).

Most of the time this is "ok", as only A entries have a title. Because we sort by namespace first, entries with a title are sorted correctly (with no urls interleaved) and the other entries (I/J) are sorted by url.
As we search by title only in the A namespace (suggestion system), this system mostly works.
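
To make the current ordering concrete, here is a minimal sketch of the NS(Title|Url) key. The `Entry` type and the sample entries are made up for illustration, and Python string comparison stands in for whatever byte comparison the real implementation uses.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    namespace: str
    url: str
    title: str  # empty string when the entry has no title


def title_sort_key(entry: Entry):
    # Namespace first, then the title, falling back to the url when the title is empty.
    return (entry.namespace, entry.title or entry.url)


# Hypothetical entries, listed in entry-index order.
entries = [
    Entry("I", "logo.png", ""),
    Entry("A", "zebra.html", "Zebra"),
    Entry("J", "app.js", ""),
    Entry("A", "apple.html", "Apple"),
]

# The list holds one entry index per entry, ordered by the key above.
title_ptr_list = sorted(range(len(entries)), key=lambda i: title_sort_key(entries[i]))
```

The list has as many elements as there are entries, and the titled A entries come out contiguous only because the namespace is the first component of the key.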

But:

  • The TitlePtrList is bigger than needed (there is no need to sort entries without a title by title).
  • As we plan to "remove" the namespace, the sort will mix titles and urls in a single ordering (while still technically working).
  • A search by namespace makes no sense with a single namespace, so it will be difficult to run the suggestion search on only the entries with a title.

In fact, from the beginning, the TitlePtrList should have:

  • Contained only entries with a title.
  • Been sorted by title only (no namespace).

However, this breaks compatibility, as:

  • The size of the TitlePtrList will no longer be ArticleCount, so the header has to be changed to store the size of the TitlePtrList.
  • The ordering criterion changes, so implementations doing a binary search on the TitlePtrList must use a different comparison function.
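
A rough sketch of what a title-only list would look like, and why both points above follow. The sample data, `title_ptr_count` and `find_by_title` are hypothetical names used only for this illustration.

```python
from bisect import bisect_left

# Hypothetical (url, title) pairs, in entry-index order; no namespace anymore.
entries = [
    ("logo.png", ""),
    ("zebra.html", "Zebra"),
    ("app.js", ""),
    ("apple.html", "Apple"),
]

# Keep only entries with a title, sorted by title alone.
title_ptr_list = sorted(
    (i for i, (_url, title) in enumerate(entries) if title),
    key=lambda i: entries[i][1],
)

# This count differs from the number of entries, so a reader cannot derive it
# from the entry count and it would have to be stored somewhere.
title_ptr_count = len(title_ptr_list)

# Keys in list order; the comparison is on the title only.
sorted_titles = [entries[i][1] for i in title_ptr_list]


def find_by_title(title):
    """Binary search on the title alone; returns an entry index or None."""
    pos = bisect_left(sorted_titles, title)
    if pos < title_ptr_count and sorted_titles[pos] == title:
        return title_ptr_list[pos]
    return None
```

A reader that still compares on (namespace, title or url) would search this list with the wrong key, which is the compatibility break.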

I have no plan to change this right now, but we may need to do it at some point. This ticket is here to help us keep it in mind.

@mgautierfr
Collaborator Author

Thinking about this, I came up with a solution.

The idea is not to change the TitlePtrList but to add one or more new indexes.

We already have indexes that are not pointed to by the zim header. We use plain entries at well-known paths to find them: X/title/xapian and X/fulltext/xapian.
And we already have other specific entries at well-known paths (-/favicon, -/mainpage).
We could extend this principle to add new index data.

  • Indexes would be stored at the path X/index/<indexname>/vN.
  • The content would be a list of entry ids (4 bytes per id, as for the TitlePtrList).
  • The index could contain fewer (or more) ids than the number of entries.
  • The number of ids can be inferred from the size of the full index (size / 4).
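
As a sketch of the proposed layout: the content is just packed 4-byte ids, and the count falls out of the blob size. The function names and the index name `title` are hypothetical, and unsigned little-endian integers are an assumption here, chosen to match how the rest of the zim format stores integers.

```python
import struct

ID_SIZE = 4  # bytes per entry id, as for the TitlePtrList


def index_path(name: str, version: int) -> str:
    # Well-known path of an index entry, following the X/index/<indexname>/vN scheme.
    return f"X/index/{name}/v{version}"


def encode_index(entry_ids) -> bytes:
    # Assumption: ids are unsigned 32-bit little-endian integers.
    return struct.pack(f"<{len(entry_ids)}I", *entry_ids)


def decode_index(blob: bytes):
    # The number of ids is not stored; it is inferred from the content size.
    if len(blob) % ID_SIZE:
        raise ValueError("index size is not a multiple of 4 bytes")
    count = len(blob) // ID_SIZE
    return list(struct.unpack(f"<{count}I", blob))


blob = encode_index([3, 1, 7])
ids = decode_index(blob)
```

No header field is needed: a reader looks up the entry at the well-known path and divides its size by 4.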

This way:

  • We don't break the zim format (we add new data, we don't change existing data).
  • We can have multiple indexes (main entries sorted by title, main entries sorted by url, main entries sorted case-insensitively, ...).
  • We will be future-proof (we start with v1 and, if we need to change the format, we move to v2).

This would come at the cost of a (slightly) bigger zim file.
For wikipedia_en_all_maxi_2020_08, which contains 20_219_212 entries including 6_140_943 articles,
an index of the articles would add about 23 MB of data (out of a total size of 92 GB).
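
The size estimate is simply the article count times 4 bytes per id. A quick check, where treating the total as 92 GiB is an assumption about the unit:

```python
ARTICLE_COUNT = 6_140_943
ID_SIZE = 4  # bytes per entry id

index_bytes = ARTICLE_COUNT * ID_SIZE  # 24_563_772 bytes
index_mib = index_bytes / 2**20        # a bit over 23 MiB

# Assumption: the 92 in the text is read as 92 GiB.
total_bytes = 92 * 2**30
overhead_ratio = index_bytes / total_bytes  # well under 0.1% of the file
```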

Random entries

Using the new getHints method of the libzim creator, the user code could set which items are main items and which are not.
Only main items would go in the new index entry, and random entries (and suggestions) would be taken only from them.
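
A small sketch of the reader side, assuming the creator has already recorded a main/not-main flag per entry. The `entry_is_main` list stands in for that information and is not the real getHints interface.

```python
import random

# Hypothetical per-entry flag, indexed by entry id.
entry_is_main = [False, True, False, True, True]

# The new index would only list the ids of main entries.
main_index = [entry_id for entry_id, is_main in enumerate(entry_is_main) if is_main]


def random_main_entry(rng=random):
    # Pick a random entry among main entries only, never a resource such as an image.
    return rng.choice(main_index)
```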

Suggestions (case insensitive search).

In kiwix-lib, if the zim file has no xapian title index, we search by title. But as we want to be case insensitive and the TitlePtrList is case sensitive, we do four searches with different case variations.
By having a case-insensitive index in the zim file itself, we would only have to do one search.
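
A minimal sketch of a single prefix search over a case-insensitive index. The titles are made up, and `casefold()` is only one possible definition of "case insensitive"; the real index would need to pin down the exact folding rule.

```python
from bisect import bisect_left

# Hypothetical titles, indexed by entry id.
titles = ["apple pie", "Apple", "BANANA", "banana split", "Cherry"]

# Case-insensitive index: entry ids sorted by folded title.
ci_index = sorted(range(len(titles)), key=lambda i: titles[i].casefold())
ci_keys = [titles[i].casefold() for i in ci_index]


def suggest(prefix):
    """One binary search, then a scan over the contiguous run of matches."""
    needle = prefix.casefold()
    pos = bisect_left(ci_keys, needle)
    results = []
    while pos < len(ci_keys) and ci_keys[pos].startswith(needle):
        results.append(titles[ci_index[pos]])
        pos += 1
    return results
```

Whatever case the user types, the query is folded once and searched once, instead of trying several case variations against a case-sensitive list.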

Efficient order

Instead of generating the "efficient order" at read time (https://github.com/openzim/libzim/blob/master/src/fileimpl.cpp#L351-L371), we could store a list of entries in this order in the zim file itself.

Categories

When we implement categories or entry extradata (see #325), we could add an index listing all entries in a specific category, or in an order defined by specific extradata.
