Creating indexes #290
For the spec, have a look at www.openzim.org, and otherwise at libzim and zimwriterfs in the "openzim" org on GitHub.
Ooh forgot about the openzim repos. Thx.
From what I remember, @peter-x told me that the structure of the geoindex in Evopedia and in ZIM archives was a bit different (I don't know to what extent). So it would not be possible to reuse the JavaScript code that was reading it for Evopedia.
@mossroy @sharun-s There is, for now, no geoindex in ZIM files. @peter-x has made a piece of C++ (based on the Evopedia approach) to add one. The pull request is here: https://github.com/wikimedia/openzim/pull/1. We want to sort out that 18-month-old topic within the next 2-3 months. That said, this solution is in competition with Xapian, which also seems to propose a geoindex. As said, we will make a decision soon.
hmm... interesting. Good to have this reference.
I have no idea how wikipedia search works, but if they publish their indexes maybe we can compress+reuse them? |
For anyone interested in the search feature of kiwix-js - creating and querying new kinds of indexes is not too straightforward currently. I have added a basic demo of how to do this with sqlite3. It's bare bones, but it shows how indexes can be built and queried using sqlite. The idea is to import the entire dirEntry table of a ZIM file into sqlite, then use sqlite's powerful indexing and query features. It is easier than rebuilding these pieces in js, one by one, to support different types of indexes and queries. It is probably much more efficient too, as sqlite is highly optimized for this kind of work.
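The actual demo mentioned above is not included in this thread; to make the idea concrete, here is a rough sketch in Python's stdlib `sqlite3`. The column set is illustrative (ZIM directory entries carry a namespace, URL, title, mimetype, and cluster/blob pointers, but the names below are not the exact libzim field names):

```python
import sqlite3

# In-memory stand-in for the imported dirEntry table of a ZIM file.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE direntry (
        namespace TEXT,
        url       TEXT,
        title     TEXT,
        mimetype  TEXT,
        cluster   INTEGER,
        blob      INTEGER
    )
""")
# b-tree indexes over the columns we expect to query on.
con.execute("CREATE INDEX idx_title ON direntry(title)")
con.execute("CREATE INDEX idx_url ON direntry(url)")

# A couple of fake rows standing in for entries read out of a ZIM file.
rows = [
    ("A", "Paris.html", "Paris", "text/html", 12, 3),
    ("I", "paris_map.png", "Paris map", "image/png", 40, 7),
]
con.executemany("INSERT INTO direntry VALUES (?, ?, ?, ?, ?, ?)", rows)

# The query result is the (cluster, blob) pair needed to fetch content.
cluster, blob = con.execute(
    "SELECT cluster, blob FROM direntry WHERE title = ?", ("Paris",)
).fetchone()
print(cluster, blob)  # 12 3
```

The same shape would apply with a JavaScript sqlite binding inside kiwix-js; Python is used here only because `sqlite3` ships in its standard library.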
Wow @sharun-s , great to see some work going forward on indexing!
@sharun-s What kind of feature do you want to provide (based on this index)? ZIM files now have fulltext and geo indexes using Xapian technologies. They should be used.
@Jaifroid this is the claim the folks at sqlite make with regard to any archive: that the db engine (with all its bells and whistles to index/parse/query etc.) comes with the db as a single file.

@kelson42 I was finding adding queries in kiwix-js to be a bit slow going as they grow in complexity. Things like: search only image URLs containing 'paris', or search just the "javascript"-tagged articles for a particular phrase in the stackoverflow dump. It involves writing indexing code + query engine code for each new kind of query. That work doesn't go away by using sqlite, but it's just easier to import the dirEntry table, add a custom column or two, create appropriate indexes (which are probably going to be faster, since they are b-tree based vs our binary tree implementation) and then run an SQL query over the table. The output is the cluster+blob. And in URL mode it's possible to just directly access the result without going through the fileselector process. The difference is a couple of hours of work vs a couple of weeks. That said, it's very much a hack, and it is going to take a while to figure out where the issues with this approach are.
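The kinds of filters described above ("only image URLs containing 'paris'") become one-line SQL once the dirEntry table is imported. A minimal sketch, again in Python's stdlib `sqlite3` with hypothetical column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE direntry "
    "(namespace TEXT, url TEXT, title TEXT, mimetype TEXT, "
    " cluster INTEGER, blob INTEGER)"
)
con.executemany("INSERT INTO direntry VALUES (?, ?, ?, ?, ?, ?)", [
    ("A", "Paris.html", "Paris", "text/html", 12, 3),
    ("I", "paris_eiffel.jpg", "Eiffel Tower", "image/jpeg", 40, 7),
    ("I", "london_bridge.jpg", "London Bridge", "image/jpeg", 41, 2),
])

# "Search only image URLs containing 'paris'": a mimetype filter plus a
# LIKE pattern, instead of hand-written scan-and-match code in JavaScript.
hits = con.execute(
    "SELECT url FROM direntry "
    "WHERE mimetype LIKE 'image/%' AND url LIKE '%paris%'"
).fetchall()
print(hits)  # [('paris_eiffel.jpg',)]
```

Each new query type is then a new SQL string rather than new indexing plus query-engine code.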
If I understood correctly, you use some JavaScript to read the ZIM file dirEntry content and generate a CSV file, then you import this CSV content into a sqlite database. It's true that a SQL database makes it easy to run powerful and fast queries.
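The CSV round-trip described above can be sketched like this, assuming a hypothetical dump with namespace/url/title/cluster/blob columns (the real export would come from JavaScript reading the ZIM directory listing):

```python
import csv
import io
import sqlite3

# Hypothetical CSV dump of dirEntry rows, standing in for a real export.
csv_text = """namespace,url,title,cluster,blob
A,Paris.html,Paris,12,3
A,Lyon.html,Lyon,13,0
"""

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE direntry "
    "(namespace TEXT, url TEXT, title TEXT, cluster INTEGER, blob INTEGER)"
)

# DictReader yields one dict per CSV row; named placeholders map the
# dict keys onto the table columns.
reader = csv.DictReader(io.StringIO(csv_text))
con.executemany(
    "INSERT INTO direntry VALUES (:namespace, :url, :title, :cluster, :blob)",
    reader,
)
count = con.execute("SELECT COUNT(*) FROM direntry").fetchone()[0]
print(count)  # 2
```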
@sharun-s In my opinion, it is not the role of the reader to somehow index the content or implement complex searching code. The necessary data should be easily and efficiently available from the ZIM file itself, and it is already there for a few of the things you talk about. A new ticket should be opened describing the feature you propose to achieve, from the user perspective; it is still unclear to me. After this is clear to everyone, then we can discuss a way to achieve it. I have the feeling we are now talking about technical details without having a clear/consensual view of what exactly needs to be achieved.
@mossroy you got it. One point - creating the table/index is a one-time affair (a million rows takes about 8 minutes and produces a ~65 MB file, single-threaded). After that it's mainly writing new queries. There are multiple sqlite.js projects out there already. I did some basic testing and was surprised at the speed. The JS integration part I haven't thought about much; I'm just focusing on the indexing/queries right now.

@kelson42 The issue is, when dealing with gigabytes of content, people are sooner or later going to ask questions about search. And what is the point of having a 50GB dump of data sitting on disk if it can only be searched 3 ways? Who is going to add the search features, and how, becomes the question. The readers right now are doing a great job reading and rendering content. The archive format does a great job minimizing space used. Improving search seems like the next phase. I agree that building StackOverflow.com-level search to handle the stackoverflow dump, or wikipedia.com searchbar functionality to handle the wikipedia dump, directly into this project is a bridge too far. But the fundamental advantage projects like this or Zeal/Dash have over a website is that we don't have to build search to handle a million queries a second, just one query at a time. And that changes the search problem we have to deal with. People are already well conditioned to think search=online. Modern average hardware and projects like this are showing the opposite is possible.
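On the one-time cost of building the table: the usual sqlite bulk-load pattern is to insert all rows inside a single transaction and create the indexes only after the load, which is cheaper than maintaining the b-tree row by row. A minimal sketch (the row generator is a stand-in for real dirEntry data, and actual timings will vary):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE direntry (url TEXT, title TEXT, cluster INTEGER, blob INTEGER)"
)

# Synthetic rows standing in for a large dirEntry export.
rows = (("url%d" % i, "title%d" % i, i // 100, i % 100)
        for i in range(100_000))

# One transaction for the whole load ...
with con:
    con.executemany("INSERT INTO direntry VALUES (?, ?, ?, ?)", rows)
# ... and the index built only once the data is in place.
con.execute("CREATE INDEX idx_title ON direntry(title)")

total = con.execute("SELECT COUNT(*) FROM direntry").fetchone()[0]
print(total)  # 100000
```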
However, standard Kiwix (the x86 executable) does offer to index a ZIM file on opening, and what @sharun-s is exploring is not a different principle. There would be an issue (which exists in the current x86 executable) with the inordinate amount of time the indexing takes for something like full Wikipedia -- I tried it once when Wikipedia was at something like 20 GB, and it took hours and hours, which I guess is why we now have some pre-indexed versions (though the index is still external to the ZIM file). We couldn't really leave a JavaScript application indexing away for hours -- its execution would be stopped, and it can't be done on mobile due to background restrictions and battery use. Is there a roadmap on indexing? I guess the advent of mobile has significantly changed the playing field. It's all very interesting! I didn't know about Zeal or Dash.
@sharun-s After reading your comment I still do not know what you want to achieve exactly. You write "Improving Search seem the next phase"... OK, but what needs to be improved exactly? Where are the bugs and feature requests you are talking about? You write "StackOverflow.com level search"... but what is that? "wikipedia.com searchbar functionality" -- what is that too? A lot is already in place in the ZIM files for searches: fulltext search in content, in titles, or even over geo-coordinates. Other ZIM files provide more sophisticated JavaScript-based filtering solutions. None of them are currently working in kiwix-js. BTW, a few days ago I uploaded guidelines about how to report a bug/feature request: https://github.com/kiwix/overview/blob/master/REPORT_BUG.md. These guidelines should be followed.
@kelson42 Is this documented anywhere? If it's in JavaScript, we can hook into it, I think! All I could find is a link to a not-yet-written page on the ZIM Index Format in the specs on this page: |
@Jaifroid The gutenberg ZIM file works that way to filter by languages/authors/etc. https://github.com/openzim/gutenberg. |
@kelson42 Here is the page detailing wikipedia searchbar functionality: https://www.mediawiki.org/wiki/Help:CirrusSearch. To get any one paragraph of that stuff implemented in kiwix-js, creating indexes is step 1. Which is why I opened this 'creating indexes' issue back in July to figure it out :)
@sharun-s Hmm this is ambitious... you really want to implement all the features? Or do you want to focus on one part first? |
Yes, I agree. It would be good to be able to read the index in the ZIM files, but for the moment that depends on #116 . |
@kelson42 I agree with all the issues you recently proposed to close. Thanks for this cleaning session! |
@kelson42 @mossroy can you point me at code/docs/links on how the url/title index is created?
Basically if I want to create my own custom index, within the ZIM or externally, I am just trying to understand the steps involved.
I remember seeing something about an indexer executable but I am unable to find it and am not sure in which repo I saw it.
Also @mossroy you had mentioned the geoindex in evopedia archives couldn't be reused here. What was the issue there?
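For context on the existing url/title lookup: ZIM files expose URL-sorted and title-sorted pointer lists over the directory entries, and readers locate an entry by binary search over those lists. A minimal in-memory sketch of that lookup (the Python list here stands in for file-backed dirEntry probes; a real reader would seek into the ZIM file and decode one dirEntry per probe):

```python
import bisect

# Stand-in for the title-sorted pointer list of a ZIM file.
titles = sorted(["Berlin", "London", "Lyon", "Paris", "Rome"])

def find_title(sorted_titles, wanted):
    """Binary search over a sorted title list; returns the index or None."""
    i = bisect.bisect_left(sorted_titles, wanted)
    if i < len(sorted_titles) and sorted_titles[i] == wanted:
        return i
    return None

print(find_title(titles, "Paris"))   # 3
print(find_title(titles, "Madrid"))  # None
```

A custom external index (sqlite or otherwise) essentially replaces this per-probe binary search with whatever lookup structure the index engine provides.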