Add support for zstd decompression #611

kelson42 · 2020-03-31T16:29:10Z

We just about to add support of zstd compression in the libzim:
openzim/libzim#84

The reason is that zst can decompress around x6 time faster than xz (all the other aspects are similar to xz).

This will be shipped in the next libzim release within a week.

I guess all libzim based ports (basically everything except Kiwix JS) will have it included before summer.

As a consequence I expect to start to create ZIM file using zstd during the summer.

This is why I create this ticket to ask if zstd could be supported soon by Kiwix JS?

Jaifroid · 2020-03-31T17:27:13Z

Hmm. This depends entirely on whether a decompressor has been written in JavaScript, or whether it can be compiled to webassembly. If not, and if we continue to be unable to compile libzim to webassembly, this would be one big breaking change that could make Kiwix JS obsolete overnight!

kelson42 · 2020-03-31T17:30:10Z

@Jaifroid https://www.npmjs.com/package/zstd-codec

Jaifroid · 2020-03-31T17:36:17Z

@kelson42 That looks like very good news! They specifically say it is an Emscripten version "for Node and the Web", which sounds like what we need.

mossroy · 2020-04-01T19:02:39Z

Maybe, but it might be not so easy.
I don't see in this library an API identical to the one we use for XZ format (see https://github.com/kiwix/kiwix-js/blob/master/www/js/lib/xzdec_wrapper.js#L59). But maybe it can be adapted.

I was supposed to work on #514 at the hackathon, that would supersede this issue. The fact that the hackathon has been canceled (for good covid-19 reasons) changed the plans.
Will the hackathon be re-scheduled later this year? Or simply canceled, and the next one would be in 2021? Maybe it's too early to say...
In any case, I'd prefer to work on #514 (at next hackathon, with the help of the other mates), instead of manually adding zstd support.

kelson42 · 2020-04-01T19:10:44Z

I know it does not help, but I'm quite sorry to inform you so lately, but this is something I had in the back of my mind for quite a bit of time and the move is ad the end relatively easy for the C++ backed software... But, it is true, for JS, this is challenging.

kelson42 · 2020-04-04T15:23:25Z

Here is a demo file with zstd coimpression http://tmp.kiwix.org/foo-zstd.zim

Jaifroid · 2020-04-04T16:03:07Z

Thank you! I've downloaded it. @kelson42 just to be clear, is it the case that the cluster information byte referenced in https://wiki.openzim.org/wiki/ZIM_file_format will now have a value of 5 if the cluster is compressed with the zstd codec, but that otherwise nothing else has changed with the file format?

kelson42 · 2020-04-04T16:09:34Z

@Jaifroid abslutely

Jaifroid · 2020-04-10T18:31:03Z

I spent a bit of time trying to integrate the zstd decompressor. I got as far as attaching a webpacked bundle produced with browserify, which runs and loads the decompressor into memory.

I very roughly attempted to add a new decompressor function to zimfile.js for when the compressionType === 5. It is not properly integrated in that I need to wrap the new zstd.Streaming() loader in a promise, so for now we have to reload the article twice to ensure the decompressor is in memory.

But I don't really understand how data is chunked into the streaming decompressor, so I'm no doubt tying things up wrong at this stage, and I get the Emscripten memory error in screenshot below.

I don't think this is real, I'm simply calling the decompressor too many times on tiny chunks, and need to stitch the chunks together. Anyway, I just wanted to log progress here.

EDIT: Branch is: https://github.com/kiwix/kiwix-js/tree/Add-zstd-decompression-support

kelson42 · 2020-04-11T07:36:17Z

@Jaifroid At least you seem on a promising path :)

Jaifroid · 2020-04-11T12:42:57Z

I've made a bit of progress, but am hitting a problem I'm not sure how to solve. As I understand it, I need to read the entire compressed cluster (apart from the first informational byte) into memory as a Uint8Array and then decompress that Uint8Array, and after that we need to slice out the needed blobs.

So, I think I've managed to read the full cluster, though it's a bit difficult to know with this very small test file that only has numbers for article titles, and no idea what data are in those articles. However, the issue I'm hitting is that the Simple API, which appears to load correctly and produces no errors, returns null when fed the cluster as a Uint8Array:

When inspecting the zstandard code, there is an instruction to use the Streaming API instead of the Simple API in this circumstance:

// use streaming-api, to support data without frameContentSize.

Without clear documentation of the API used to encode the ZIM file, I'm a bit lost here. Experimentally, I've tried the Steaming API, but that causes:

It doesn't help that there seems to be no way to access the zstandard documentation. I can open https://github.com/facebook/zstd , but the standard is documented on http://www.zstd.net/ which is inaccessible at the moment.

Jaifroid · 2020-04-11T16:48:33Z

I've added the example app provided in https://github.com/yoshihitoh/zstd-codec (after compiling the bundle) to our repo as a proof of concept. It's in the directory zstandard/dist/ .

To test in browser, switch to branch Add-zstd-decompression-support and open zstandard/dist/index.html in a broswer, either from the file:// protocol if browser supports it (Edge Legacy or Firefox), or serve it from localhost.

You should see the browser decompressing a compressed bmp and a compressed json formatted book, using the Simple API. Note that it does not work with IE11. It might be possible to make it work with IE11 by adding a babel step in compiling the bundle, since a number of the bundled files use ECMA 2015 / ES6, which IE11 does not support.

All this would be easier if we had achieved #554!

Jaifroid · 2020-04-12T17:05:40Z

It looks like we can compile our own version from the latest zstd release. See this text on https://github.com/facebook/zstd/tree/v1.4.4/contrib/single_file_decoder:

all it now takes to support decompressing Zstd is the addition of a single file, two if using the header, with no configuration or further build steps. The library is small, adding, for example, 26kB to an Emscripten compiled WebAssembly project.

mossroy · 2020-04-13T09:47:28Z

Maybe we need to recompile the zstd-bundle.js library with different memory settings.
For xzdec, we used -s TOTAL_MEMORY=83886080. They seem to be compiling it inside Docker, but do not give instructions

kelson42 · 2020-06-23T11:09:32Z

What is the status here? And maybe the timeline? This is still the plan to release the first Zstd encode ZIM filed during the summer and it would be really preferable to have releases of Kiwix-JS with its support before.

Jaifroid · 2020-06-23T13:04:56Z

@kelson42 I've been really caught up with examining and planning for next academic year under social distancing conditions, and haven't been able to progress things. But it's a top priority once the frenzy dies down -- because lack of ZSTD is a cliff-edge for Kiwix JS if we're not careful. It looks possible, and shouldn't be a big job, as in principle it's just a case of substituting one decoder for another in the compressed byte stream, and stream decoding is supported. The problem I encountered is that the relevant Kiwix JS code is very heavily object-oriented, within within which I find it quite difficult to find entry and exit points.

Jaifroid · 2020-07-06T14:03:40Z

Documentation on how to decompress a stream with Zstandard:

Streaming decompression - HowTo
  A ZSTD_DStream object is required to track streaming operations.
  Use ZSTD_createDStream() and ZSTD_freeDStream() to create/release resources.
  ZSTD_DStream objects can be re-used multiple times.

  Use ZSTD_initDStream() to start a new decompression operation.
 @return : recommended first input size
  Alternatively, use advanced API to set specific properties.

  Use ZSTD_decompressStream() repetitively to consume your input.
  The function will update both `pos` fields.
  If `input.pos < input.size`, some input has not been consumed.
  It's up to the caller to present again remaining data.
  The function tries to flush all data decoded immediately, respecting output buffer size.
  If `output.pos < output.size`, decoder has flushed everything it could.
  But if `output.pos == output.size`, there might be some data left within internal buffers.,
  In which case, call ZSTD_decompressStream() again to flush whatever remains in the buffer.
  Note : with no additional input provided, amount of data flushed is necessarily <= ZSTD_BLOCKSIZE_MAX.
 @return : 0 when a frame is completely decoded and fully flushed,
        or an error code, which can be tested using ZSTD_isError(),
        or any other value > 0, which means there is still some decoding or flushing to do to complete current frame :
                                the return value is a suggested next input size (just a hint for better latency)
                                that will never request more than the remaining frame size.

From http://facebook.github.io/zstd/zstd_manual.html#Chapter7 .

Jaifroid · 2020-07-06T14:19:57Z

How to build emscripten version: https://dev.classmethod.jp/articles/emscripten-zstandard-on-browser/ (Japanese, but use auto translate).

Jaifroid · 2020-07-06T15:12:34Z

This page mentions a very simple way to compile zstandard, together with good tutorial on WASM and exposing C++ functions: https://marcoselvatici.github.io/WASM_tutorial/

Jaifroid · 2020-07-06T17:40:05Z

First attempt to compile our own zstddec.js with Emscripten is here:

https://github.com/kiwix/kiwix-js/tree/Add-zstd-decompression-support/emscripten/zstandard

I used this compile.sh script:

emcc --memory-init-file 0 -O3 -s WASM=0 -s MALLOC="emmalloc" -s TOTAL_MEMORY=83886080 -s NO_FILESYSTEM=1 
-s AGGRESSIVE_VARIABLE_ELIMINATION=1 -s DOUBLE_MODE=0 -s NO_DYNAMIC_EXECUTION=1 -s LEGACY_VM_SUPPORT=1 
-s EXPORTED_FUNCTIONS="['_ZSTD_createDStream', '_ZSTD_initDStream', '_ZSTD_decompressStream', '_ZSTD_freeDStream']" 
-DXZ_USE_CRC64=1 -DXZ_INTERNAL_CRC64=1 *.c -o zstdec.js

These values try to map onto our existing API as far as possible, but using the ZSTD functions mentioned in the documentation above. I compiled it only as asm for now (not wasm). It compiled with no errors.

As far as I've understood the process, we would need to call _ZSTD_createDStream to get a control object from which we can read input.pos, input.size, output.pos, output.size. However, I don't know if we'll need to make some custom functions for reading those, or whether the JavaScript object would be updated by the asm. Then we need to use _ZSTD_initDStream which should return a suggested initial number of bytes to send to the input stream. Then call _ZSTD_decompressStream with bytes, checking if output.pos < output.size to call _ZSTD_decompressStream again, and so on until the input stream is finished and output stream finished. Then we call _ZSTD_freeDStream to finish and release the decompressor resources.

That's the theory. None of it tested yet! I just wanted to see if I could compile that with no errors... Let's see how we can wire this up.

kelson42 · 2020-07-17T13:46:45Z

I have a real case example of zim file here: http://tmp.kiwix.org/wikipedia_bm_all_2020-07.zstd.zim

Jaifroid · 2020-07-22T11:42:36Z

I've re-compiled a test build with these options (some previous options have changed format, e.g. NO_DYNAMIC_EXECUTION=1 has now become DYNAMIC_EXECUTION=0, and it seems we need to cwrap or ccall exported functions if we don't want to handle data typing ourselves. The resulting zstdec.js is a lot bigger because it is a test build without optimization -O3 for now, and therefore the code is un-minified (and not so aggressive function elimination).

emcc --memory-init-file 0 -s WASM=0 -s MALLOC="emmalloc" -s TOTAL_MEMORY=83886080 -s FILESYSTEM=0 -s DOUBLE_MODE=0 -s DYNAMIC_EXECUTION=0 -s MIN_IE_VERSION=11 -s EXPORTED_FUNCTIONS="['_ZSTD_createDStream', '_ZSTD_initDStream', '_ZSTD_decompressStream', '_ZSTD_isError', '_ZSTD_freeDStream']" -s EXPORTED_RUNTIME_METHODS="['ccall', 'cwrap']" *.c -o zstdec.js

Jaifroid · 2020-07-23T09:58:34Z

The two Emscripten modules xzdec.js and zstdec.js were conflicting with each other because when compiled with standard options, they both emit a global Module object. I have managed to modularize zstdec.js using the new compile options below. It can now be loaded with requireJS instead of attaching in index.html.

In fact xzdec.js can be loaded the same way, so that has been added to xzdec_wrapper.js, just that xzdec continues to emit a global Module object (that no longer conflicts with zstdec's ZD Promise function). We should re-compile xzdec.js with similar options as below to avoid having it always loaded in memory when we are only dealing with a zstd-encoded ZIM. The relevant options are EXPORT_NAME and MODULARIZE:

emcc --memory-init-file 0 -s WASM=0 -s MALLOC="emmalloc" -s TOTAL_MEMORY=83886080 -s FILESYSTEM=0 -s DOUBLE_MODE=0 -s DYNAMIC_EXECUTION=0 -s MIN_IE_VERSION=11 -s EXPORT_NAME="ZD" -s MODULARIZE=1 -s EXPORTED_FUNCTIONS="['_ZSTD_createDStream', '_ZSTD_initDStream', '_ZSTD_decompressStream', '_ZSTD_isError', '_ZSTD_freeDStream']" -s EXPORTED_RUNTIME_METHODS="['cwrap']" *.c -o zstdec.js

NB Above is not optimized or minified. But that is a good thing for testing, as the resulting file has lots of useful comments and instructions. We should ensure to keep an unoptimized/unminified version in the source code.

Jaifroid · 2020-07-26T21:19:00Z

A quick note on progress:

I have set up the control structures
I have set these in the HEAP32 w/asm memory and received pointers to the structures
I have allocated the inBuffer data stream and the outBuffer data stream, receiving pointers to their location in w/asm memory
I have transferred a data slice to the w/asm inBuffer using HEAP8.set() (consider using HEAPU8.set(), because we are transferring a Uint8Array)
I have run _ZSTD_decompressStream, passing it pointers to the control object, the inBuffer structure, the outBuffer structure
It returns with a positive number, which means that it has partially decoded the stream;
According to the API, the number represents a hint of the number of bytes to be injected into the inBuffer.

TODO:

Read back the pos values for inBuffer and outBuffer from w/asm memory
Check that I am always setting HEAP32 memory correctly, taking into account BYTES_PER_ELEMENT when writing arrays and dividing the pointer value by BYTES_PER_ELEMENT (the logic is tricky)
Copy new decompressed bytes and add to JS copy of outBuffer
If inBuffer still contains unprocessed bytes, call decompressStream repeatedly until inBuffer is consumed, and copy new bytes in outBuffer to JS
Or read another slice if we have not reached the end of the cluster
If we have found the cluster termination then call _ZSTD_freeDStream() and de-allocate all assigned memory buffers and control structures. Do same if cluster termination not found, but all input bytes have been consumed (probably this is an error)
Consider possibility of simplifying by using ZSTD_decompressStream_simpleArgs(). This is described in source code like this: "Same as ZSTD_decompressStream(), but using only integral types as arguments. This can be helpful for binders from dynamic languages which have trouble handling structures containing memory pointers."

kelson42 · 2020-08-09T07:20:55Z

@Jaifroid What is the timeline for a release with zstd support? Plan is to start publishing the first ZIM with zstd at the end of the month.

Jaifroid · 2020-08-10T05:25:08Z

@kelson42 Slowdown caused by my having to learn about writing Emscripten bindings in C++ in order to simplify the passing of memory pointers between JS and C++, which are otherwise really difficult to debug. Also slowed down by a holiday!

However, I'm optimistic that I can push this forward positively over the next couple of weeks, as I now understand a lot better what I'm doing (and in particular what a proper JS to C++ interface looks like).

Jaifroid · 2020-08-28T16:52:14Z

Due to out-of-memory errors, I had to recompile zstdec.js with a memory stage of 139919360 bytes (~140 Mb), which seems a rather large memory requirement to me, especially for mobile (given that xzdec.js could manage with ~ 90 Mb). However, we're now able, in principle, to decompress a cluster size of 1024 * 5 bytes without an out-of-memory message.

The main issue I'm encountering at the moment is that the output.pos field of the output control structure (at a known memory location in zd.HEAP32) is not being advanced with the amount of data decompressed, and I don't know if this is because of a reading error, or because a frame has not yet been fully consumed. At least it is indicating it needs more data to complete the frame, which is where I'm at for now.

Managing the interface between JS and C++ with only being able to pass raw memory pointers is a real pain.... I might as well be writing assembly... 😮

kelson42 · 2020-08-28T17:03:02Z

@Jaifroid Thank you very much for the update. If you have to write assembly then something else is probably wrong IMO.

Jaifroid · 2020-08-28T17:28:56Z

@kelson42 It was a slightly exasperated metaphor! I'm not actually writing assembly, I'm just not at all used to dealing with memory pointers and malloc, and such things, but this is necessary to set up the memory structures so that the asm/wasm (= WebAssembly!) version of zstdec.js can communicate with JS...

Jaifroid · 2020-08-30T16:30:18Z

After a lot of trial and error, and recompiling with slightly larger memory values (140,247,040), I've managed to get zstdec to advance the outbuffer position field in response to a larger input. This is good news because it means I've managed to set up all the data structures correctly, and have entered enough valid data to decode a whole frame and get meaningful output.

Jaifroid · 2020-08-30T17:40:08Z

First decoded result with ZSTD using wikipedia_bm_all_2020-07.zstd.zim (screenshot). This isn't as bad as it looks 🙄 because we can clearly see that it's decoded some stylesheet data (these data are usually encoded in the ZIM). It's actually positive because it shows the underlying asm decoder is working fine, but that I've made an error somewhere in setting offsets.

Jaifroid · 2020-08-31T11:04:36Z

Success decoding foo-zstd.zim! I can pick any of the short articles (which are all numeric) and get the result in screenshot below (according to article number). This is a good proof of concept. I am still getting the strange result above with wikipedia_bm_all_2020-07.zstd.zim despite some refining since yesterday, but it is probably still something wrong with my calculation of offsets when I have to provide more data to the inBuffer or empty the outBuffer.

@kelson42 Do we have any more known good ZSTD ZIMs I can work with yet? I just want to be sure that the issue isn't with the encoding of the ZIM, since the small proof-of-concept test ZIM does work with my code.

Jaifroid · 2020-08-31T21:08:46Z

Success!!! 😀

Jaifroid · 2020-08-31T21:10:33Z

Jaifroid · 2020-08-31T21:10:45Z

And it's very fast!!!

Closes #611.

Jaifroid · 2020-09-29T10:47:30Z

@kelson42 Now this issue has been closed by merging the PR, I think it's safe to start publishing zstandard ZIM archives. We can make a new release once the first ones start coming out, or earlier if @mossroy has time (though it might be an idea to merge #642 before making a release).

kelson42 · 2020-09-29T11:11:34Z

@Jaifroid Thank you very much again or the amount and the quality of the work here. I believe it would be good to merge #642 before making a new release. @mossroy Hopefully this is doable this week. How long would be then take to make a new release (do we need a ticket)?

Jaifroid · 2020-09-29T11:15:23Z

We can make a ticket. @mossroy do you think adding zstandard warrants a major number increment? I don't know if ZSTD qualifies as a major change even if it's only the backend and arguably not an API change.

mossroy · 2020-09-29T19:55:04Z

I also repeat my congratulations to you @Jaifroid !
In the short term, I'm overwhelmed by too many things to review/test #642
If you agree, let's focus on what is strictly necessary for now : releasing a new version able to handle the latest ZIM files.

Regarding the version number : it's not crucial, we can do what we want.
I think it might be a version 3.0.0 : although it does not change anything in the frontend, it is a significant technical change in the backend

Jaifroid · 2020-09-29T21:39:44Z

OK, thank you both!
No problem about #642, it's not that important, and not noticeable to most users.
So, I'll make a ticket to release 3.0.0!

kelson42 added enhancement question labels Mar 31, 2020

kelson42 assigned mossroy Mar 31, 2020

kelson42 pinned this issue Mar 31, 2020

kelson42 assigned Jaifroid Mar 31, 2020

Jaifroid mentioned this issue Apr 4, 2020

Add support the Zstandard (aka zstd) library for compression openzim/libzim#84

Closed

kelson42 unpinned this issue Apr 7, 2020

Jaifroid changed the title ~~Add support of zstd decompression~~ Add support for zstd decompression May 11, 2020

Jaifroid mentioned this issue Jul 3, 2020

Can Kiwix-JS deal with an online ZIM file available through HTTP? #595

Closed

Jaifroid removed the question label Jul 7, 2020

Jaifroid added the help wanted label Jul 7, 2020

Jaifroid linked a pull request Sep 1, 2020 that will close this issue

Add zstd decompression support #648

Merged

Jaifroid mentioned this issue Sep 1, 2020

Add zstd decompression support #648

Merged

Jaifroid mentioned this issue Sep 27, 2020

Add QUnit tests running against a zstandard version of Ray Charles #651

Closed

Jaifroid closed this as completed in #648 Sep 29, 2020

Jaifroid added a commit that referenced this issue Sep 29, 2020

Add zstd decompression support #611 (#648)

0c53010

Closes #611.

Add support for zstd decompression #611

Add support for zstd decompression #611

Comments

kelson42 commented Mar 31, 2020

Jaifroid commented Mar 31, 2020

kelson42 commented Mar 31, 2020

Jaifroid commented Mar 31, 2020

mossroy commented Apr 1, 2020

kelson42 commented Apr 1, 2020

kelson42 commented Apr 4, 2020

Jaifroid commented Apr 4, 2020

kelson42 commented Apr 4, 2020

Jaifroid commented Apr 10, 2020 • edited Loading

kelson42 commented Apr 11, 2020

Jaifroid commented Apr 11, 2020 • edited Loading

Jaifroid commented Apr 11, 2020 • edited Loading

Jaifroid commented Apr 12, 2020

mossroy commented Apr 13, 2020

kelson42 commented Jun 23, 2020

Jaifroid commented Jun 23, 2020

Jaifroid commented Jul 6, 2020

Jaifroid commented Jul 6, 2020

Jaifroid commented Jul 6, 2020

Jaifroid commented Jul 6, 2020 • edited Loading

kelson42 commented Jul 17, 2020

Jaifroid commented Jul 22, 2020 • edited Loading

Jaifroid commented Jul 23, 2020

Jaifroid commented Jul 26, 2020 • edited Loading

kelson42 commented Aug 9, 2020

Jaifroid commented Aug 10, 2020

Jaifroid commented Aug 28, 2020

kelson42 commented Aug 28, 2020

Jaifroid commented Aug 28, 2020

Jaifroid commented Aug 30, 2020

Jaifroid commented Aug 30, 2020

Jaifroid commented Aug 31, 2020 • edited Loading

Jaifroid commented Aug 31, 2020

Jaifroid commented Aug 31, 2020

Jaifroid commented Aug 31, 2020

Jaifroid commented Sep 29, 2020 • edited Loading

kelson42 commented Sep 29, 2020

Jaifroid commented Sep 29, 2020

mossroy commented Sep 29, 2020

Jaifroid commented Sep 29, 2020

Jaifroid commented Apr 10, 2020 •

edited

Loading

Jaifroid commented Apr 11, 2020 •

edited

Loading

Jaifroid commented Apr 11, 2020 •

edited

Loading

Jaifroid commented Jul 6, 2020 •

edited

Loading

Jaifroid commented Jul 22, 2020 •

edited

Loading

Jaifroid commented Jul 26, 2020 •

edited

Loading

Jaifroid commented Aug 31, 2020 •

edited

Loading

Jaifroid commented Sep 29, 2020 •

edited

Loading