Creating indexes #290

sharun-s · 2017-07-18T15:40:38Z

@kelson42 @mossroy can you point me at code/docs/links on how the url/title index is created?
Basically if I want to create my own custom index, within the ZIM or externally, I am just trying to understand the steps involved.

I remember seeing something about an indexer executable but I am unable to find it and am not sure in which repo I saw it.

Also @mossroy you had mentioned the geoindex in evopedia archives couldn't be reused here. What was the issue there?

kelson42 · 2017-07-18T21:42:05Z

For the spec, have a look at www.openzim.org, and otherwise at the libzim and zimwriterfs repositories in the "openzim" org on GitHub.

sharun-s · 2017-07-19T00:04:53Z

Ooh forgot about the openzim repos. Thx.

mossroy · 2017-07-19T19:54:24Z

From what I remember, @peter-x told me that the structure of the geoindex in Evopedia and ZIM archives was a bit different (I don't know to what extent). So it would not be possible to reuse the JavaScript code that was reading it for Evopedia.

kelson42 · 2017-07-19T20:17:51Z

@mossroy @sharun-s There is, for now, no geoindex in ZIM files. @peter-x has written a piece of C++ (based on the Evopedia approach) to add one. The pull request is here: https://github.com/wikimedia/openzim/pull/1. We want to sort out this 18-month-old topic within the next 2-3 months. That said, this solution is in competition with Xapian, which also seems to offer a geoindex. As I said, we will make a decision soon.

sharun-s · 2017-07-20T18:53:41Z

hmm...interesting. Good to have this reference.

sharun-s · 2017-07-20T19:08:00Z

I have no idea how Wikipedia search works, but if they publish their indexes maybe we can compress + reuse them?
I don't know enough about the subject, but intuitively it feels like, to support (one day) the kind of functionality the Wikipedia search bar provides, openzim/kiwix would have to replicate a lot of types of indexes. Reusing what Wikipedia has already built, whether that's the indexes, the search front end, or both, may be much simpler.

sharun-s · 2018-01-16T02:10:42Z

For anyone interested in the search feature of kiwix-js: creating and querying new kinds of indexes is not too straightforward currently. I have added a basic demo of how to do this with sqlite3. It's bare bones, but it shows how indexes can be built and queried using SQLite.

The idea is to import the entire direntry table of a ZIM file into SQLite, then use SQLite's powerful indexing and query features. That is easier than rebuilding these pieces in JS, one by one, to support different types of indexes and queries. It is probably much more efficient too, as SQLite is highly optimized for this stuff.
If interested, try it out and let me know if you have any questions, suggestions etc. Thanks!
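
A minimal sketch of the idea described above, not the demo's actual code. It assumes a table named `direntry` with columns for the namespace, URL, title, cluster number and blob number of each directory entry; the table name, column names and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical sample rows standing in for a ZIM file's directory entries:
# (namespace, url, title, cluster, blob)
SAMPLE_DIRENTRIES = [
    ("A", "Paris.html", "Paris", 12, 0),
    ("A", "Paris_Metro.html", "Paris Metro", 12, 1),
    ("A", "London.html", "London", 13, 0),
    ("I", "m/Paris_skyline.jpg", "", 40, 3),
]


def build_index(rows):
    """Load directory entries into an in-memory SQLite table and index them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE direntry ("
        "namespace TEXT, url TEXT, title TEXT, cluster INTEGER, blob INTEGER)"
    )
    conn.executemany("INSERT INTO direntry VALUES (?, ?, ?, ?, ?)", rows)
    # A B-tree index over (namespace, title) serves title prefix lookups.
    conn.execute("CREATE INDEX idx_title ON direntry (namespace, title)")
    conn.commit()
    return conn


def titles_with_prefix(conn, prefix):
    """Return (title, cluster, blob) for articles whose title starts with prefix."""
    # Expressed as a range so SQLite can walk the index instead of scanning.
    upper = prefix + "\uffff"
    cur = conn.execute(
        "SELECT title, cluster, blob FROM direntry "
        "WHERE namespace = 'A' AND title >= ? AND title < ? "
        "ORDER BY title",
        (prefix, upper),
    )
    return cur.fetchall()


conn = build_index(SAMPLE_DIRENTRIES)
print(titles_with_prefix(conn, "Par"))
```

Each result row carries the cluster and blob numbers, which is what a reader would need to locate the article content in the archive.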

Jaifroid · 2018-01-16T06:58:47Z

Wow @sharun-s , great to see some work going forward on indexing!
I've always thought some element of SQL would be more efficient than reinventing the wheel on extracting data from indices. It would be amazing if we had some way to make a ZIM act as a giant, relational database...

kelson42 · 2018-01-16T07:28:38Z

@sharun-s What kind of feature do you want to provide (based on this index)? ZIM files now have full-text and geo indexes based on Xapian. They should be used.

sharun-s · 2018-01-16T08:23:56Z

> It would be amazing if we had some way to make a ZIM act as a giant, relational database

@Jaifroid this is the claim the folks at SQLite make about any archive: that the db engine (with all its bells and whistles to index/parse/query etc.) comes with the db as a single file.

@kelson42 I was finding that adding queries in kiwix-js gets a bit slow going as they grow in complexity. Things like searching only image URLs containing 'paris', or searching just the "javascript"-tagged articles for a particular phrase in the Stack Overflow dump. Each new kind of query involves writing indexing code + query engine code.

That work doesn't go away by using SQLite, but it's just easier to import the direntry table, add a custom column or two, create appropriate indexes (which are probably going to be faster since they are B-tree based vs our binary tree impl) and then run an SQL query over the table. The output is the cluster+blob. And in URL mode it's possible to directly access the result without going through the file selector process.

The difference is a couple of hours of work vs a couple of weeks. That said, it's very much a hack, and it is going to take a while to figure out where the issues with this approach are.
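
To make the two example queries concrete, here is a rough sketch under assumptions: the `direntry` table layout, the `tag` column, and the sample rows (including `question/101.html`) are hypothetical and only illustrate adding a custom column, indexing it, and getting cluster+blob back from SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE direntry ("
    "namespace TEXT, url TEXT, title TEXT, cluster INTEGER, blob INTEGER)"
)
# Hypothetical directory entries: two images and two articles.
conn.executemany(
    "INSERT INTO direntry VALUES (?, ?, ?, ?, ?)",
    [
        ("I", "m/Paris_skyline.jpg", "", 40, 3),
        ("I", "m/Berlin_wall.jpg", "", 40, 4),
        ("A", "Paris.html", "Paris", 12, 0),
        ("A", "question/101.html", "Closures in loops", 77, 2),
    ],
)

# Add a custom column (here: a tag), fill it in, and index it.
conn.execute("ALTER TABLE direntry ADD COLUMN tag TEXT")
conn.execute(
    "UPDATE direntry SET tag = 'javascript' WHERE url = 'question/101.html'"
)
conn.execute("CREATE INDEX idx_tag ON direntry (tag)")
conn.commit()


def image_urls_containing(conn, word):
    """Image entries whose URL contains word (LIKE ignores ASCII case by default)."""
    # A leading-wildcard LIKE scans the table; the B-tree index cannot help here.
    cur = conn.execute(
        "SELECT url, cluster, blob FROM direntry "
        "WHERE namespace = 'I' AND url LIKE ? ORDER BY url",
        ("%" + word + "%",),
    )
    return cur.fetchall()


def entries_tagged(conn, tag):
    """Entries carrying the given tag; an equality lookup that idx_tag can serve."""
    cur = conn.execute(
        "SELECT title, cluster, blob FROM direntry WHERE tag = ? ORDER BY title",
        (tag,),
    )
    return cur.fetchall()


print(image_urls_containing(conn, "paris"))
print(entries_tagged(conn, "javascript"))
```

Searching for a phrase inside the tagged articles would still need the article text itself, which this sketch does not cover.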

mossroy · 2018-01-16T13:05:48Z

If I understood correctly, you use some javascript to read the ZIM file dirEntry content, and generate a CSV file. Then you import this CSV content in a sqlite database.
Afterwards, you can make SQL queries on the sqlite database (from command-line), and use the result to start a browser with the right parameters to open the articles corresponding to the result of the SQL query (provided that the ZIM file is bundled with the javascript source code).
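
The CSV-to-SQLite step of that pipeline could look roughly like the sketch below. The CSV header, the column names and the sample rows are assumptions for illustration; the real export may be laid out differently.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export of directory entries, one row per entry.
CSV_TEXT = (
    "namespace,url,title,cluster,blob\n"
    "A,Paris.html,Paris,12,0\n"
    "A,London.html,London,13,0\n"
)


def import_csv(csv_file, conn):
    """Copy directory entries from a CSV file object into SQLite; return row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS direntry ("
        "namespace TEXT, url TEXT, title TEXT, cluster INTEGER, blob INTEGER)"
    )
    rows = [
        (r["namespace"], r["url"], r["title"], int(r["cluster"]), int(r["blob"]))
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO direntry VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
imported = import_csv(io.StringIO(CSV_TEXT), conn)
lookup = conn.execute(
    "SELECT url, cluster, blob FROM direntry WHERE title = ?", ("London",)
).fetchall()
print(imported, lookup)
```

With a file on disk, `open(path, newline="")` would take the place of `io.StringIO`.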

It's true that an SQL database makes it easy to run powerful and fast queries.
But I suppose the fact that SQLite is run outside the browser means it cannot be used in other contexts (browser extensions, mobile apps, etc.)?

kelson42 · 2018-01-16T18:19:44Z

@sharun-s To my opinion, this is not the role of the reader to somehow index the content or implement complex searching code. The necessary data should be easily and efficiently available from the ZIM file itself. And it is already there for a few things you talk about.

A new ticket should be opened describing the feature you propose to achieve from the user perspective. It is still unclear to me. Once this is clear to everyone, we can discuss a way to achieve it. I have the feeling we are now talking about technical details without having a clear/consensual view of what exactly needs to be achieved.

sharun-s · 2018-01-17T04:03:38Z

@mossroy you got it. One point: creating the table/index is a one-time affair (a million rows takes about 8 minutes, ~65 MB file, single-threaded). After that it's mainly writing new queries. There are multiple sqlite.js projects out there already. I did some basic testing and was surprised at the speed. I haven't thought much about the JS integration part; I'm just focusing on the indexing/queries right now.

@kelson42 The issue is, when dealing with Gigabytes of content, people are sooner or later going to ask questions about Search. And what is the point of having a 50GB dump of data sitting on disk, if it can only be searched 3 ways? Who is going to add the search features and how becomes the question.

The Readers right now are doing a great job reading and rendering content. The archive format does a great job minimizing space used. Improving Search seem the next phase. I agree building StackOverflow.com level search to handle the stackoverflow dump or wikipedia.com searchbar functionality to handle the wikipedia dump directly into this project is a bridge too far.

But the fundamental advantage projects like this or Zeal/Dash have over a website is that we don't have to build search to handle a million queries a second, but just one query at a time. And that changes the search problem we have to deal with. People are already well conditioned to think search=online. Modern average hardware and projects like this are showing the opposite is possible.

Jaifroid · 2018-01-17T08:28:38Z

> To my opinion, this is not the role of the reader to somehow index the content or implement complex searching code.

However standard Kiwix (the x86 executable) does offer to index a ZIM file on opening, and what @sharun-s is exploring is not a different principle.

There would be an issue (which exists in the current x86 executable) with the inordinate amount of time the indexing takes for something like full Wikipedia -- I tried it once when Wikipedia was at something like 20 GB, and it took hours and hours, which is I guess why we now have some pre-indexed versions (though the index is still external to the ZIM file). We couldn't really leave a JavaScript application indexing away for hours -- its execution would be stopped, and it can't be done on mobile due to background restrictions and battery use.

Is there a roadmap on indexing? I guess the advent of mobile has significantly changed the playing field.

It's all very interesting! I didn't know about Zeal or Dash.

kelson42 · 2018-01-17T08:49:38Z

@sharun-s After reading your comment I still do not know what you want to achieve exactly. You write "Improving Search seem the next phase." OK... but what needs to be improved exactly? Where are the bugs, the feature requests you are talking about? You write "StackOverflow.com level search"... but what is that? "wikipedia.com searchbar functionality": what is that too?

A lot is already in place in the ZIM files for searches, fulltext search in content, in titles or even geo-coordinates. Other ZIM files provide javascript based more sophisticated filtering solutions in javascript. None of them are currently working in kiwix-js.

BTW, a few days ago I uploaded guidelines on how to report a bug/feature request: https://github.com/kiwix/overview/blob/master/REPORT_BUG.md. These guidelines should be followed.

Jaifroid · 2018-01-17T09:13:43Z

> Other ZIM files provide javascript based more sophisticated filtering solutions in javascript. None of them are currently working in kiwix-js.

@kelson42 Is this documented anywhere? If it's in JavaScript, we can hook into it, I think! All I could find is a link to a not-yet-written page on the ZIM Index Format in the specs on this page:
http://www.openzim.org/wiki/ZIM_file_format. I'd be really interested in any JavaScript-based filtering API, or any other way of hooking into such filtering.

kelson42 · 2018-01-17T09:15:58Z

@Jaifroid The Gutenberg ZIM file works that way to filter by languages/authors/etc.: https://github.com/openzim/gutenberg.

sharun-s · 2018-01-17T12:39:21Z

> what needs to be improved exactly?

@kelson42 Here is the page detailing the Wikipedia search bar functionality: https://www.mediawiki.org/wiki/Help:CirrusSearch. To get any one paragraph of that stuff implemented in kiwix-js, creating indexes is step 1. Which is why I opened this 'creating indexes' issue back in July to figure it out :)

kelson42 · 2018-01-17T12:58:56Z

@sharun-s Hmm this is ambitious... you really want to implement all the features? Or do you want to focus on one part first?

kelson42 · 2019-07-22T06:58:36Z

@mossroy @Jaifroid I really think this ticket should be closed. It is not the role of the reader to make indexes.

Jaifroid · 2019-07-22T09:10:19Z

Yes, I agree. It would be good to be able to read the index in the ZIM files, but for the moment that depends on #116.

mossroy · 2019-07-22T17:34:19Z

@kelson42 I agree with all the issues you recently proposed to close. Thanks for this cleaning session!

kelson42 added the question label Jul 18, 2017

kelson42 closed this as completed Jul 19, 2017

mossroy modified the milestone: v2.2 Jul 19, 2017

sharun-s reopened this Jan 16, 2018

sharun-s mentioned this issue Jun 24, 2018

Document Elasticsearch index integration sharun-s/kiwix-html5#3

Open

Jaifroid closed this as completed Jul 22, 2019
