-
-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for zstd decompression #611
Comments
Hmm. This depends entirely on whether a decompressor has been written in JavaScript, or whether it can be compiled to webassembly. If not, and if we continue to be unable to compile libzim to webassembly, this would be one big breaking change that could make Kiwix JS obsolete overnight! |
@kelson42 That looks like very good news! They specifically say it is an Emscripten version "for Node and the Web", which sounds like what we need. |
Maybe, but it might be not so easy. I was supposed to work on #514 at the hackathon, that would supersede this issue. The fact that the hackathon has been canceled (for good covid-19 reasons) changed the plans. |
I know it does not help, but I'm quite sorry to inform you so lately, but this is something I had in the back of my mind for quite a bit of time and the move is ad the end relatively easy for the C++ backed software... But, it is true, for JS, this is challenging. |
Here is a demo file with zstd coimpression http://tmp.kiwix.org/foo-zstd.zim |
Thank you! I've downloaded it. @kelson42 just to be clear, is it the case that the cluster information byte referenced in https://wiki.openzim.org/wiki/ZIM_file_format will now have a value of 5 if the cluster is compressed with the zstd codec, but that otherwise nothing else has changed with the file format? |
@Jaifroid abslutely |
I spent a bit of time trying to integrate the zstd decompressor. I got as far as attaching a webpacked bundle produced with browserify, which runs and loads the decompressor into memory. I very roughly attempted to add a new decompressor function to zimfile.js for when the But I don't really understand how data is chunked into the streaming decompressor, so I'm no doubt tying things up wrong at this stage, and I get the Emscripten memory error in screenshot below. I don't think this is real, I'm simply calling the decompressor too many times on tiny chunks, and need to stitch the chunks together. Anyway, I just wanted to log progress here. EDIT: Branch is: https://github.com/kiwix/kiwix-js/tree/Add-zstd-decompression-support |
@Jaifroid At least you seem on a promising path :) |
I've made a bit of progress, but am hitting a problem I'm not sure how to solve. As I understand it, I need to read the entire compressed cluster (apart from the first informational byte) into memory as a Uint8Array and then decompress that Uint8Array, and after that we need to slice out the needed blobs. So, I think I've managed to read the full cluster, though it's a bit difficult to know with this very small test file that only has numbers for article titles, and no idea what data are in those articles. However, the issue I'm hitting is that the Simple API, which appears to load correctly and produces no errors, returns When inspecting the zstandard code, there is an instruction to use the Streaming API instead of the Simple API in this circumstance:
Without clear documentation of the API used to encode the ZIM file, I'm a bit lost here. Experimentally, I've tried the Steaming API, but that causes: It doesn't help that there seems to be no way to access the zstandard documentation. I can open https://github.com/facebook/zstd , but the standard is documented on http://www.zstd.net/ which is inaccessible at the moment. |
I've added the example app provided in https://github.com/yoshihitoh/zstd-codec (after compiling the bundle) to our repo as a proof of concept. It's in the directory To test in browser, switch to branch You should see the browser decompressing a compressed bmp and a compressed json formatted book, using the Simple API. Note that it does not work with IE11. It might be possible to make it work with IE11 by adding a babel step in compiling the bundle, since a number of the bundled files use ECMA 2015 / ES6, which IE11 does not support. All this would be easier if we had achieved #554! |
It looks like we can compile our own version from the latest zstd release. See this text on https://github.com/facebook/zstd/tree/v1.4.4/contrib/single_file_decoder:
|
Maybe we need to recompile the zstd-bundle.js library with different memory settings. |
What is the status here? And maybe the timeline? This is still the plan to release the first Zstd encode ZIM filed during the summer and it would be really preferable to have releases of Kiwix-JS with its support before. |
@kelson42 I've been really caught up with examining and planning for next academic year under social distancing conditions, and haven't been able to progress things. But it's a top priority once the frenzy dies down -- because lack of ZSTD is a cliff-edge for Kiwix JS if we're not careful. It looks possible, and shouldn't be a big job, as in principle it's just a case of substituting one decoder for another in the compressed byte stream, and stream decoding is supported. The problem I encountered is that the relevant Kiwix JS code is very heavily object-oriented, within within which I find it quite difficult to find entry and exit points. |
Documentation on how to decompress a stream with Zstandard:
From http://facebook.github.io/zstd/zstd_manual.html#Chapter7 . |
How to build emscripten version: https://dev.classmethod.jp/articles/emscripten-zstandard-on-browser/ (Japanese, but use auto translate). |
This page mentions a very simple way to compile zstandard, together with good tutorial on WASM and exposing C++ functions: https://marcoselvatici.github.io/WASM_tutorial/ |
First attempt to compile our own https://github.com/kiwix/kiwix-js/tree/Add-zstd-decompression-support/emscripten/zstandard I used this
These values try to map onto our existing API as far as possible, but using the ZSTD functions mentioned in the documentation above. I compiled it only as asm for now (not wasm). It compiled with no errors. As far as I've understood the process, we would need to call That's the theory. None of it tested yet! I just wanted to see if I could compile that with no errors... Let's see how we can wire this up. |
I have a real case example of zim file here: http://tmp.kiwix.org/wikipedia_bm_all_2020-07.zstd.zim |
I've re-compiled a test build with these options (some previous options have changed format, e.g.
|
The two Emscripten modules In fact xzdec.js can be loaded the same way, so that has been added to
NB Above is not optimized or minified. But that is a good thing for testing, as the resulting file has lots of useful comments and instructions. We should ensure to keep an unoptimized/unminified version in the source code. |
A quick note on progress:
TODO:
|
@Jaifroid What is the timeline for a release with zstd support? Plan is to start publishing the first ZIM with zstd at the end of the month. |
@kelson42 Slowdown caused by my having to learn about writing Emscripten bindings in C++ in order to simplify the passing of memory pointers between JS and C++, which are otherwise really difficult to debug. Also slowed down by a holiday! However, I'm optimistic that I can push this forward positively over the next couple of weeks, as I now understand a lot better what I'm doing (and in particular what a proper JS to C++ interface looks like). |
Due to out-of-memory errors, I had to recompile The main issue I'm encountering at the moment is that the Managing the interface between JS and C++ with only being able to pass raw memory pointers is a real pain.... I might as well be writing assembly... 😮 |
@Jaifroid Thank you very much for the update. If you have to write assembly then something else is probably wrong IMO. |
@kelson42 It was a slightly exasperated metaphor! I'm not actually writing assembly, I'm just not at all used to dealing with memory pointers and malloc, and such things, but this is necessary to set up the memory structures so that the asm/wasm (= WebAssembly!) version of zstdec.js can communicate with JS... |
After a lot of trial and error, and recompiling with slightly larger memory values (140,247,040), I've managed to get |
First decoded result with ZSTD using |
Success decoding @kelson42 Do we have any more known good ZSTD ZIMs I can work with yet? I just want to be sure that the issue isn't with the encoding of the ZIM, since the small proof-of-concept test ZIM does work with my code. |
And it's very fast!!! |
We can make a ticket. @mossroy do you think adding zstandard warrants a major number increment? I don't know if ZSTD qualifies as a major change even if it's only the backend and arguably not an API change. |
I also repeat my congratulations to you @Jaifroid ! Regarding the version number : it's not crucial, we can do what we want. |
OK, thank you both! |
We just about to add support of zstd compression in the libzim:
openzim/libzim#84
The reason is that zst can decompress around x6 time faster than xz (all the other aspects are similar to xz).
This will be shipped in the next libzim release within a week.
I guess all libzim based ports (basically everything except Kiwix JS) will have it included before summer.
As a consequence I expect to start to create ZIM file using zstd during the summer.
This is why I create this ticket to ask if zstd could be supported soon by Kiwix JS?
The text was updated successfully, but these errors were encountered: