Creating indexes #290

Closed
sharun-s opened this issue Jul 18, 2017 · 22 comments

@sharun-s
Contributor

sharun-s commented Jul 18, 2017

@kelson42 @mossroy can you point me at code/docs/links on how the url/title index is created?
Basically if I want to create my own custom index, within the ZIM or externally, I am just trying to understand the steps involved.

I remember seeing something about an indexer executable but I am unable to find it and am not sure in which repo I saw it.

Also @mossroy you had mentioned the geoindex in evopedia archives couldn't be reused here. What was the issue there?

@kelson42
Collaborator

kelson42 commented Jul 18, 2017

For the spec, have a look at www.openzim.org; otherwise, look at the libzim and zimwriterfs repositories in the "openzim" org on GitHub.

@sharun-s
Contributor Author

Ooh forgot about the openzim repos. Thx.

@mossroy
Contributor

mossroy commented Jul 19, 2017

From what I remember, @peter-x told me that the structure of the geoindex in Evopedia archives and in ZIM archives was a bit different (I don't know to what extent). So it would not be possible to reuse the JavaScript code that read it for Evopedia.

@mossroy mossroy modified the milestone: v2.2 Jul 19, 2017
@kelson42
Collaborator

kelson42 commented Jul 19, 2017

@mossroy @sharun-s There is, for now, no geoindex in ZIM files. @peter-x has written a piece of C++ (based on the Evopedia approach) to add one. The pull request is here: https://github.com/wikimedia/openzim/pull/1. We want to sort out this 18-month-old topic within the next 2-3 months. That said, this solution is in competition with Xapian, which also seems to offer a geoindex. As I said, we will make a decision soon.

@sharun-s
Contributor Author

hmm...interesting. Good to have this reference.

@sharun-s
Contributor Author

I have no idea how Wikipedia search works, but if they publish their indexes, maybe we can compress and reuse them?
I don't know enough about the subject, but intuitively it feels like supporting (one day) the kind of functionality the Wikipedia search bar provides would require openzim/kiwix to replicate many types of indexes. Reusing what Wikipedia has already built, be it the indexes or the search front end as well, may be much simpler.

@sharun-s
Contributor Author

For anyone interested in the search feature of kiwix-js: creating and querying new kinds of indexes is currently not straightforward. I have added a basic demo of how to do this with sqlite3. It's bare-bones, but it shows how indexes can be built and queried using SQLite.

The idea is to import the entire direntry table of a ZIM file into SQLite, then use SQLite's powerful indexing and query features. This is easier than rebuilding these pieces in JS, one by one, to support different types of indexes and queries. It is probably much more efficient too, as SQLite is highly optimized for this kind of work.
If interested, try it out and let me know if you have any questions, suggestions etc. Thanks!
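
A minimal sketch of the idea, in Python with the standard `sqlite3` module. The table layout (`namespace`, `url`, `title`, `cluster_no`, `blob_no`), the sample rows, and the function names are assumptions made for illustration; they are not taken from the demo itself.

```python
import sqlite3

# Hypothetical direntry rows: (namespace, url, title, cluster_no, blob_no).
SAMPLE_DIRENTRIES = [
    ("A", "Paris.html", "Paris", 12, 3),
    ("A", "Paris_Commune.html", "Paris Commune", 12, 7),
    ("A", "London.html", "London", 4, 1),
    ("I", "paris_map.png", "", 30, 0),
]

def build_db(rows):
    """Load direntry rows into an in-memory SQLite table and index the titles."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE direntry ("
        "namespace TEXT, url TEXT, title TEXT, "
        "cluster_no INTEGER, blob_no INTEGER)"
    )
    conn.executemany("INSERT INTO direntry VALUES (?, ?, ?, ?, ?)", rows)
    # B-tree index that serves the range query used for prefix search below.
    conn.execute("CREATE INDEX idx_title ON direntry (namespace, title)")
    conn.commit()
    return conn

def title_prefix_search(conn, prefix, namespace="A"):
    """Return (title, cluster_no, blob_no) for titles starting with prefix."""
    cur = conn.execute(
        "SELECT title, cluster_no, blob_no FROM direntry "
        "WHERE namespace = ? AND title >= ? AND title < ? "
        "ORDER BY title",
        (namespace, prefix, prefix + "\uffff"),
    )
    return cur.fetchall()

conn = build_db(SAMPLE_DIRENTRIES)
results = title_prefix_search(conn, "Paris")
print(results)
```

The prefix search is written as a range condition rather than `LIKE` so that SQLite can answer it from the index; each result row carries the cluster and blob numbers a reader would need to fetch the content.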

@sharun-s sharun-s reopened this Jan 16, 2018
@Jaifroid
Member

Wow @sharun-s , great to see some work going forward on indexing!
I've always thought some element of SQL would be more efficient than reinventing the wheel on extracting data from indices. It would be amazing if we had some way to make a ZIM act as a giant, relational database...

@kelson42
Collaborator

@sharun-s What kind of feature do you want to provide (based on this index)? ZIM files now have full-text and geo indexes built with Xapian. They should be used.

@sharun-s
Contributor Author

sharun-s commented Jan 16, 2018

It would be amazing if we had some way to make a ZIM act as a giant, relational database

@Jaifroid this is the claim the folks at SQLite make with respect to any archive: that the db engine (with all its bells and whistles to index/parse/query etc.) comes with the db as a single file.

@kelson42 I was finding that adding queries in kiwix-js gets slow going as they grow in complexity: things like searching only image URLs containing 'paris', or searching only the "javascript"-tagged articles in the Stack Overflow dump for a particular phrase. Each new kind of query involves writing indexing code plus query engine code.

That work doesn't go away by using SQLite, but it's just easier to import the direntry table, add a custom column or two, create appropriate indexes (which are probably going to be faster since they are B-tree based vs. our binary tree impl), and then run an SQL query over the table. The output is the cluster + blob. And in URL mode it's possible to access the result directly without going through the file selector process.

The difference is a couple of hours of work vs. a couple of weeks. That said, it's very much a hack, and it is going to take a while to figure out where the issues with this approach are.
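
As a rough illustration of the "custom column plus query" step, here is a sketch of the 'paris' image query in Python with the standard `sqlite3` module. The schema, the `url_lower` column, the sample rows, and the use of namespace `I` for images are all assumptions for the example, not the actual demo code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE direntry ("
    "namespace TEXT, url TEXT, title TEXT, "
    "cluster_no INTEGER, blob_no INTEGER)"
)
# Hypothetical rows standing in for an imported direntry table.
conn.executemany(
    "INSERT INTO direntry VALUES (?, ?, ?, ?, ?)",
    [
        ("I", "m/paris_map.png", "", 30, 0),
        ("I", "m/Eiffel_Tower_Paris.jpg", "", 30, 1),
        ("I", "m/london_eye.jpg", "", 31, 0),
        ("A", "Paris.html", "Paris", 12, 3),
    ],
)

# Custom column added after the import: a lower-cased copy of the URL,
# so that matching does not depend on the case used in file names.
conn.execute("ALTER TABLE direntry ADD COLUMN url_lower TEXT")
conn.execute("UPDATE direntry SET url_lower = lower(url)")
conn.execute("CREATE INDEX idx_ns_url ON direntry (namespace, url_lower)")

def find_images(conn, word):
    """Return (url, cluster_no, blob_no) for image URLs containing word."""
    cur = conn.execute(
        "SELECT url, cluster_no, blob_no FROM direntry "
        "WHERE namespace = 'I' AND instr(url_lower, ?) > 0 "
        "ORDER BY url",
        (word.lower(),),
    )
    return cur.fetchall()

paris_images = find_images(conn, "paris")
print(paris_images)
```

Note that a substring match like this can only use the index to narrow the search to the image namespace; the "contains" test itself is still a scan over those rows. Whether that is fast enough on a full-size archive is exactly the kind of issue that would need measuring.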

@mossroy
Contributor

mossroy commented Jan 16, 2018

If I understood correctly, you use some JavaScript to read the ZIM file's dirEntry content and generate a CSV file. Then you import this CSV content into a SQLite database.
Afterwards, you can run SQL queries on the SQLite database (from the command line) and use the result to start a browser with the right parameters to open the articles matching the query (provided that the ZIM file is bundled with the JavaScript source code).

It's true that a SQL database makes it easy to run powerful and fast queries.
But I suppose the fact that SQLite is run outside the browser means it cannot be used in other contexts (browser extensions, mobile apps, etc.)?
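
To make the pipeline concrete, the CSV-to-SQLite step and the lookup that follows could look roughly like this in Python. This is only a sketch: the CSV header names, the table layout, and the `namespace/url` path format returned at the end are assumptions, and an in-memory string stands in for the CSV file that the JavaScript step would produce.

```python
import csv
import io
import sqlite3

# Stand-in for the CSV file the JavaScript export step would write.
# The header names are assumptions, not the demo's real column names.
CSV_TEXT = """namespace,url,title,cluster_no,blob_no
A,Paris.html,Paris,12,3
A,London.html,London,4,1
"""

def import_csv(conn, csv_file):
    """Create the direntry table, load every CSV row, return the row count."""
    conn.execute(
        "CREATE TABLE direntry ("
        "namespace TEXT, url TEXT, title TEXT, "
        "cluster_no INTEGER, blob_no INTEGER)"
    )
    rows = [
        (r["namespace"], r["url"], r["title"],
         int(r["cluster_no"]), int(r["blob_no"]))
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO direntry VALUES (?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_title ON direntry (title)")
    conn.commit()
    return len(rows)

def article_path(conn, title):
    """Return 'namespace/url' for an exact title match, or None if absent."""
    row = conn.execute(
        "SELECT namespace, url FROM direntry WHERE title = ?", (title,)
    ).fetchone()
    return None if row is None else f"{row[0]}/{row[1]}"

conn = sqlite3.connect(":memory:")
imported = import_csv(conn, io.StringIO(CSV_TEXT))
path = article_path(conn, "Paris")
print(imported, path)
```

The returned path is the piece of the query result that would be handed to the browser; how it is actually passed along depends on the reader and is not shown here.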

@kelson42
Collaborator

@sharun-s In my opinion, it is not the role of the reader to somehow index the content or implement complex searching code. The necessary data should be easily and efficiently available from the ZIM file itself. And it is already there for a few of the things you talk about.

A new ticket should be opened describing the feature you propose, from the user's perspective. It is still unclear to me. Once this is clear to everyone, we can discuss a way to achieve it. I have the feeling that we are now talking about technical details without a clear, shared view of what exactly needs to be achieved.

@sharun-s
Contributor Author

@mossroy you got it. One point: creating the table/index is a one-time affair (a million rows takes about 8 minutes, ~65 MB file, single-threaded). After that, it's mainly writing new queries. There are multiple sqlite.js projects out there already. I did some basic testing and was surprised at the speed. I haven't thought much about the JS integration part; I'm just focusing on the indexing/queries right now.

@kelson42 The issue is that, when dealing with gigabytes of content, people are sooner or later going to ask questions about search. And what is the point of having a 50 GB dump of data sitting on disk if it can only be searched three ways? The question becomes who is going to add the search features, and how.

The readers are right now doing a great job reading and rendering content. The archive format does a great job minimizing the space used. Improving Search seems the next phase. I agree that building StackOverflow.com level search to handle the Stack Overflow dump, or wikipedia.com searchbar functionality to handle the Wikipedia dump, directly into this project is a bridge too far.

But the fundamental advantage that projects like this, or Zeal/Dash, have over a website is that we don't have to build search to handle a million queries a second, just one query at a time. And that changes the search problem we have to deal with. People are already well conditioned to think search = online. Modern average hardware and projects like this are showing that the opposite is possible.

@Jaifroid
Member

Jaifroid commented Jan 17, 2018

In my opinion, it is not the role of the reader to somehow index the content or implement complex searching code.

However, standard Kiwix (the x86 executable) does offer to index a ZIM file on opening, and what @sharun-s is exploring is not a different principle.

There would be an issue (which exists in the current x86 executable) with the inordinate amount of time the indexing takes for something like full Wikipedia. I tried it once when Wikipedia was at something like 20 GB, and it took hours and hours, which I guess is why we now have some pre-indexed versions (though the index is still external to the ZIM file). We couldn't really leave a JavaScript application indexing away for hours: its execution would be stopped, and it can't be done on mobile due to background restrictions and battery use.

Is there a roadmap on indexing? I guess the advent of mobile has significantly changed the playing field.

It's all very interesting! I didn't know about Zeal or Dash.

@kelson42
Collaborator

@sharun-s After reading your comment, I still do not know what exactly you want to achieve. You write "Improving Search seems the next phase." OK... but what needs to be improved exactly? Where are the bugs, the feature requests you are talking about? You write "StackOverflow.com level search"... but what is that? "wikipedia.com searchbar functionality": what is that too?

A lot is already in place in the ZIM files for searches: full-text search in content, in titles, or even by geo-coordinates. Other ZIM files provide more sophisticated JavaScript-based filtering solutions. None of them currently work in kiwix-js.

BTW, a few days ago I uploaded guidelines on how to report a bug or feature request: https://github.com/kiwix/overview/blob/master/REPORT_BUG.md. These guidelines should be followed.

@Jaifroid
Member

Other ZIM files provide more sophisticated JavaScript-based filtering solutions. None of them currently work in kiwix-js.

@kelson42 Is this documented anywhere? If it's in JavaScript, we can hook into it, I think! All I could find is a link to a not-yet-written page on the ZIM Index Format in the specs on this page:
http://www.openzim.org/wiki/ZIM_file_format. I'd be really interested in any JavaScript-based filtering API, or any other way of hooking into such filtering.

@kelson42
Collaborator

@Jaifroid The Gutenberg ZIM file works that way, filtering by language, author, etc. https://github.com/openzim/gutenberg.

@sharun-s
Contributor Author

sharun-s commented Jan 17, 2018

what needs to be improved exactly?

@kelson42 The Wikipedia search bar functionality is detailed on this page: https://www.mediawiki.org/wiki/Help:CirrusSearch
To get any one paragraph of that implemented in kiwix-js, creating indexes is step 1, which is why I opened this 'creating indexes' issue back in July: to figure it out :)

@kelson42
Collaborator

kelson42 commented Jan 17, 2018

@sharun-s Hmm this is ambitious... you really want to implement all the features? Or do you want to focus on one part first?

@kelson42
Collaborator

@mossroy @Jaifroid I really think this ticket should be closed. It is not the role of the reader to make indexes.

@Jaifroid
Member

Yes, I agree. It would be good to be able to read the index in the ZIM files, but for the moment that depends on #116 .

@mossroy
Contributor

mossroy commented Jul 22, 2019

@kelson42 I agree with all the issues you recently proposed to close. Thanks for this cleaning session!


4 participants