Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for zstd decompression #611

Closed
kelson42 opened this issue Mar 31, 2020 · 40 comments · Fixed by #648
Closed

Add support for zstd decompression #611

kelson42 opened this issue Mar 31, 2020 · 40 comments · Fixed by #648

Comments

@kelson42
Copy link
Collaborator

We just about to add support of zstd compression in the libzim:
openzim/libzim#84

The reason is that zst can decompress around x6 time faster than xz (all the other aspects are similar to xz).

This will be shipped in the next libzim release within a week.

I guess all libzim based ports (basically everything except Kiwix JS) will have it included before summer.

As a consequence I expect to start to create ZIM file using zstd during the summer.

This is why I create this ticket to ask if zstd could be supported soon by Kiwix JS?

@Jaifroid
Copy link
Member

Hmm. This depends entirely on whether a decompressor has been written in JavaScript, or whether it can be compiled to webassembly. If not, and if we continue to be unable to compile libzim to webassembly, this would be one big breaking change that could make Kiwix JS obsolete overnight!

@kelson42
Copy link
Collaborator Author

@Jaifroid
Copy link
Member

@kelson42 That looks like very good news! They specifically say it is an Emscripten version "for Node and the Web", which sounds like what we need.

@mossroy
Copy link
Contributor

mossroy commented Apr 1, 2020

Maybe, but it might be not so easy.
I don't see in this library an API identical to the one we use for XZ format (see https://github.com/kiwix/kiwix-js/blob/master/www/js/lib/xzdec_wrapper.js#L59). But maybe it can be adapted.

I was supposed to work on #514 at the hackathon, that would supersede this issue. The fact that the hackathon has been canceled (for good covid-19 reasons) changed the plans.
Will the hackathon be re-scheduled later this year? Or simply canceled, and the next one would be in 2021? Maybe it's too early to say...
In any case, I'd prefer to work on #514 (at next hackathon, with the help of the other mates), instead of manually adding zstd support.

@kelson42
Copy link
Collaborator Author

kelson42 commented Apr 1, 2020

I know it does not help, but I'm quite sorry to inform you so lately, but this is something I had in the back of my mind for quite a bit of time and the move is ad the end relatively easy for the C++ backed software... But, it is true, for JS, this is challenging.

@kelson42
Copy link
Collaborator Author

kelson42 commented Apr 4, 2020

Here is a demo file with zstd coimpression http://tmp.kiwix.org/foo-zstd.zim

@Jaifroid
Copy link
Member

Jaifroid commented Apr 4, 2020

Thank you! I've downloaded it. @kelson42 just to be clear, is it the case that the cluster information byte referenced in https://wiki.openzim.org/wiki/ZIM_file_format will now have a value of 5 if the cluster is compressed with the zstd codec, but that otherwise nothing else has changed with the file format?

@kelson42
Copy link
Collaborator Author

kelson42 commented Apr 4, 2020

@Jaifroid abslutely

@kelson42 kelson42 unpinned this issue Apr 7, 2020
@Jaifroid
Copy link
Member

Jaifroid commented Apr 10, 2020

I spent a bit of time trying to integrate the zstd decompressor. I got as far as attaching a webpacked bundle produced with browserify, which runs and loads the decompressor into memory.

I very roughly attempted to add a new decompressor function to zimfile.js for when the compressionType === 5. It is not properly integrated in that I need to wrap the new zstd.Streaming() loader in a promise, so for now we have to reload the article twice to ensure the decompressor is in memory.

But I don't really understand how data is chunked into the streaming decompressor, so I'm no doubt tying things up wrong at this stage, and I get the Emscripten memory error in screenshot below.

image

I don't think this is real, I'm simply calling the decompressor too many times on tiny chunks, and need to stitch the chunks together. Anyway, I just wanted to log progress here.

EDIT: Branch is: https://github.com/kiwix/kiwix-js/tree/Add-zstd-decompression-support

@kelson42
Copy link
Collaborator Author

@Jaifroid At least you seem on a promising path :)

@Jaifroid
Copy link
Member

Jaifroid commented Apr 11, 2020

I've made a bit of progress, but am hitting a problem I'm not sure how to solve. As I understand it, I need to read the entire compressed cluster (apart from the first informational byte) into memory as a Uint8Array and then decompress that Uint8Array, and after that we need to slice out the needed blobs.

So, I think I've managed to read the full cluster, though it's a bit difficult to know with this very small test file that only has numbers for article titles, and no idea what data are in those articles. However, the issue I'm hitting is that the Simple API, which appears to load correctly and produces no errors, returns null when fed the cluster as a Uint8Array:

image

When inspecting the zstandard code, there is an instruction to use the Streaming API instead of the Simple API in this circumstance:

// use streaming-api, to support data without frameContentSize.

Without clear documentation of the API used to encode the ZIM file, I'm a bit lost here. Experimentally, I've tried the Steaming API, but that causes:

image

It doesn't help that there seems to be no way to access the zstandard documentation. I can open https://github.com/facebook/zstd , but the standard is documented on http://www.zstd.net/ which is inaccessible at the moment.

@Jaifroid
Copy link
Member

Jaifroid commented Apr 11, 2020

I've added the example app provided in https://github.com/yoshihitoh/zstd-codec (after compiling the bundle) to our repo as a proof of concept. It's in the directory zstandard/dist/ .

To test in browser, switch to branch Add-zstd-decompression-support and open zstandard/dist/index.html in a broswer, either from the file:// protocol if browser supports it (Edge Legacy or Firefox), or serve it from localhost.

You should see the browser decompressing a compressed bmp and a compressed json formatted book, using the Simple API. Note that it does not work with IE11. It might be possible to make it work with IE11 by adding a babel step in compiling the bundle, since a number of the bundled files use ECMA 2015 / ES6, which IE11 does not support.

All this would be easier if we had achieved #554!

@Jaifroid
Copy link
Member

It looks like we can compile our own version from the latest zstd release. See this text on https://github.com/facebook/zstd/tree/v1.4.4/contrib/single_file_decoder:

all it now takes to support decompressing Zstd is the addition of a single file, two if using the header, with no configuration or further build steps. The library is small, adding, for example, 26kB to an Emscripten compiled WebAssembly project.

@mossroy
Copy link
Contributor

mossroy commented Apr 13, 2020

Maybe we need to recompile the zstd-bundle.js library with different memory settings.
For xzdec, we used -s TOTAL_MEMORY=83886080. They seem to be compiling it inside Docker, but do not give instructions

@Jaifroid Jaifroid changed the title Add support of zstd decompression Add support for zstd decompression May 11, 2020
@kelson42
Copy link
Collaborator Author

What is the status here? And maybe the timeline? This is still the plan to release the first Zstd encode ZIM filed during the summer and it would be really preferable to have releases of Kiwix-JS with its support before.

@Jaifroid
Copy link
Member

@kelson42 I've been really caught up with examining and planning for next academic year under social distancing conditions, and haven't been able to progress things. But it's a top priority once the frenzy dies down -- because lack of ZSTD is a cliff-edge for Kiwix JS if we're not careful. It looks possible, and shouldn't be a big job, as in principle it's just a case of substituting one decoder for another in the compressed byte stream, and stream decoding is supported. The problem I encountered is that the relevant Kiwix JS code is very heavily object-oriented, within within which I find it quite difficult to find entry and exit points.

@Jaifroid
Copy link
Member

Jaifroid commented Jul 6, 2020

Documentation on how to decompress a stream with Zstandard:

Streaming decompression - HowTo
  A ZSTD_DStream object is required to track streaming operations.
  Use ZSTD_createDStream() and ZSTD_freeDStream() to create/release resources.
  ZSTD_DStream objects can be re-used multiple times.

  Use ZSTD_initDStream() to start a new decompression operation.
 @return : recommended first input size
  Alternatively, use advanced API to set specific properties.

  Use ZSTD_decompressStream() repetitively to consume your input.
  The function will update both `pos` fields.
  If `input.pos < input.size`, some input has not been consumed.
  It's up to the caller to present again remaining data.
  The function tries to flush all data decoded immediately, respecting output buffer size.
  If `output.pos < output.size`, decoder has flushed everything it could.
  But if `output.pos == output.size`, there might be some data left within internal buffers.,
  In which case, call ZSTD_decompressStream() again to flush whatever remains in the buffer.
  Note : with no additional input provided, amount of data flushed is necessarily <= ZSTD_BLOCKSIZE_MAX.
 @return : 0 when a frame is completely decoded and fully flushed,
        or an error code, which can be tested using ZSTD_isError(),
        or any other value > 0, which means there is still some decoding or flushing to do to complete current frame :
                                the return value is a suggested next input size (just a hint for better latency)
                                that will never request more than the remaining frame size.

From http://facebook.github.io/zstd/zstd_manual.html#Chapter7 .

@Jaifroid
Copy link
Member

Jaifroid commented Jul 6, 2020

How to build emscripten version: https://dev.classmethod.jp/articles/emscripten-zstandard-on-browser/ (Japanese, but use auto translate).

@Jaifroid
Copy link
Member

Jaifroid commented Jul 6, 2020

This page mentions a very simple way to compile zstandard, together with good tutorial on WASM and exposing C++ functions: https://marcoselvatici.github.io/WASM_tutorial/

@Jaifroid
Copy link
Member

Jaifroid commented Jul 6, 2020

First attempt to compile our own zstddec.js with Emscripten is here:

https://github.com/kiwix/kiwix-js/tree/Add-zstd-decompression-support/emscripten/zstandard

I used this compile.sh script:

emcc --memory-init-file 0 -O3 -s WASM=0 -s MALLOC="emmalloc" -s TOTAL_MEMORY=83886080 -s NO_FILESYSTEM=1 
-s AGGRESSIVE_VARIABLE_ELIMINATION=1 -s DOUBLE_MODE=0 -s NO_DYNAMIC_EXECUTION=1 -s LEGACY_VM_SUPPORT=1 
-s EXPORTED_FUNCTIONS="['_ZSTD_createDStream', '_ZSTD_initDStream', '_ZSTD_decompressStream', '_ZSTD_freeDStream']" 
-DXZ_USE_CRC64=1 -DXZ_INTERNAL_CRC64=1 *.c -o zstdec.js

These values try to map onto our existing API as far as possible, but using the ZSTD functions mentioned in the documentation above. I compiled it only as asm for now (not wasm). It compiled with no errors.

As far as I've understood the process, we would need to call _ZSTD_createDStream to get a control object from which we can read input.pos, input.size, output.pos, output.size. However, I don't know if we'll need to make some custom functions for reading those, or whether the JavaScript object would be updated by the asm. Then we need to use _ZSTD_initDStream which should return a suggested initial number of bytes to send to the input stream. Then call _ZSTD_decompressStream with bytes, checking if output.pos < output.size to call _ZSTD_decompressStream again, and so on until the input stream is finished and output stream finished. Then we call _ZSTD_freeDStream to finish and release the decompressor resources.

That's the theory. None of it tested yet! I just wanted to see if I could compile that with no errors... Let's see how we can wire this up.

@Jaifroid Jaifroid removed the question label Jul 7, 2020
@kelson42
Copy link
Collaborator Author

I have a real case example of zim file here: http://tmp.kiwix.org/wikipedia_bm_all_2020-07.zstd.zim

@Jaifroid
Copy link
Member

Jaifroid commented Jul 22, 2020

I've re-compiled a test build with these options (some previous options have changed format, e.g. NO_DYNAMIC_EXECUTION=1 has now become DYNAMIC_EXECUTION=0, and it seems we need to cwrap or ccall exported functions if we don't want to handle data typing ourselves. The resulting zstdec.js is a lot bigger because it is a test build without optimization -O3 for now, and therefore the code is un-minified (and not so aggressive function elimination).

emcc --memory-init-file 0 -s WASM=0 -s MALLOC="emmalloc" -s TOTAL_MEMORY=83886080 -s FILESYSTEM=0 -s DOUBLE_MODE=0 -s DYNAMIC_EXECUTION=0 -s MIN_IE_VERSION=11 -s EXPORTED_FUNCTIONS="['_ZSTD_createDStream', '_ZSTD_initDStream', '_ZSTD_decompressStream', '_ZSTD_isError', '_ZSTD_freeDStream']" -s EXPORTED_RUNTIME_METHODS="['ccall', 'cwrap']" *.c -o zstdec.js

@Jaifroid
Copy link
Member

The two Emscripten modules xzdec.js and zstdec.js were conflicting with each other because when compiled with standard options, they both emit a global Module object. I have managed to modularize zstdec.js using the new compile options below. It can now be loaded with requireJS instead of attaching in index.html.

In fact xzdec.js can be loaded the same way, so that has been added to xzdec_wrapper.js, just that xzdec continues to emit a global Module object (that no longer conflicts with zstdec's ZD Promise function). We should re-compile xzdec.js with similar options as below to avoid having it always loaded in memory when we are only dealing with a zstd-encoded ZIM. The relevant options are EXPORT_NAME and MODULARIZE:

emcc --memory-init-file 0 -s WASM=0 -s MALLOC="emmalloc" -s TOTAL_MEMORY=83886080 -s FILESYSTEM=0 -s DOUBLE_MODE=0 -s DYNAMIC_EXECUTION=0 -s MIN_IE_VERSION=11 -s EXPORT_NAME="ZD" -s MODULARIZE=1 -s EXPORTED_FUNCTIONS="['_ZSTD_createDStream', '_ZSTD_initDStream', '_ZSTD_decompressStream', '_ZSTD_isError', '_ZSTD_freeDStream']" -s EXPORTED_RUNTIME_METHODS="['cwrap']" *.c -o zstdec.js

NB Above is not optimized or minified. But that is a good thing for testing, as the resulting file has lots of useful comments and instructions. We should ensure to keep an unoptimized/unminified version in the source code.

@Jaifroid
Copy link
Member

Jaifroid commented Jul 26, 2020

A quick note on progress:

  • I have set up the control structures
  • I have set these in the HEAP32 w/asm memory and received pointers to the structures
  • I have allocated the inBuffer data stream and the outBuffer data stream, receiving pointers to their location in w/asm memory
  • I have transferred a data slice to the w/asm inBuffer using HEAP8.set() (consider using HEAPU8.set(), because we are transferring a Uint8Array)
  • I have run _ZSTD_decompressStream, passing it pointers to the control object, the inBuffer structure, the outBuffer structure
  • It returns with a positive number, which means that it has partially decoded the stream;
  • According to the API, the number represents a hint of the number of bytes to be injected into the inBuffer.

TODO:

  • Read back the pos values for inBuffer and outBuffer from w/asm memory
  • Check that I am always setting HEAP32 memory correctly, taking into account BYTES_PER_ELEMENT when writing arrays and dividing the pointer value by BYTES_PER_ELEMENT (the logic is tricky)
  • Copy new decompressed bytes and add to JS copy of outBuffer
  • If inBuffer still contains unprocessed bytes, call decompressStream repeatedly until inBuffer is consumed, and copy new bytes in outBuffer to JS
  • Or read another slice if we have not reached the end of the cluster
  • If we have found the cluster termination then call _ZSTD_freeDStream() and de-allocate all assigned memory buffers and control structures. Do same if cluster termination not found, but all input bytes have been consumed (probably this is an error)
  • Consider possibility of simplifying by using ZSTD_decompressStream_simpleArgs(). This is described in source code like this: "Same as ZSTD_decompressStream(), but using only integral types as arguments. This can be helpful for binders from dynamic languages which have trouble handling structures containing memory pointers."

@kelson42
Copy link
Collaborator Author

kelson42 commented Aug 9, 2020

@Jaifroid What is the timeline for a release with zstd support? Plan is to start publishing the first ZIM with zstd at the end of the month.

@Jaifroid
Copy link
Member

@kelson42 Slowdown caused by my having to learn about writing Emscripten bindings in C++ in order to simplify the passing of memory pointers between JS and C++, which are otherwise really difficult to debug. Also slowed down by a holiday!

However, I'm optimistic that I can push this forward positively over the next couple of weeks, as I now understand a lot better what I'm doing (and in particular what a proper JS to C++ interface looks like).

@Jaifroid
Copy link
Member

Due to out-of-memory errors, I had to recompile zstdec.js with a memory stage of 139919360 bytes (~140 Mb), which seems a rather large memory requirement to me, especially for mobile (given that xzdec.js could manage with ~ 90 Mb). However, we're now able, in principle, to decompress a cluster size of 1024 * 5 bytes without an out-of-memory message.

The main issue I'm encountering at the moment is that the output.pos field of the output control structure (at a known memory location in zd.HEAP32) is not being advanced with the amount of data decompressed, and I don't know if this is because of a reading error, or because a frame has not yet been fully consumed. At least it is indicating it needs more data to complete the frame, which is where I'm at for now.

Managing the interface between JS and C++ with only being able to pass raw memory pointers is a real pain.... I might as well be writing assembly... 😮

@kelson42
Copy link
Collaborator Author

@Jaifroid Thank you very much for the update. If you have to write assembly then something else is probably wrong IMO.

@Jaifroid
Copy link
Member

@kelson42 It was a slightly exasperated metaphor! I'm not actually writing assembly, I'm just not at all used to dealing with memory pointers and malloc, and such things, but this is necessary to set up the memory structures so that the asm/wasm (= WebAssembly!) version of zstdec.js can communicate with JS...

@Jaifroid
Copy link
Member

After a lot of trial and error, and recompiling with slightly larger memory values (140,247,040), I've managed to get zstdec to advance the outbuffer position field in response to a larger input. This is good news because it means I've managed to set up all the data structures correctly, and have entered enough valid data to decode a whole frame and get meaningful output.

@Jaifroid
Copy link
Member

First decoded result with ZSTD using wikipedia_bm_all_2020-07.zstd.zim (screenshot). This isn't as bad as it looks 🙄 because we can clearly see that it's decoded some stylesheet data (these data are usually encoded in the ZIM). It's actually positive because it shows the underlying asm decoder is working fine, but that I've made an error somewhere in setting offsets.

image

@Jaifroid
Copy link
Member

Jaifroid commented Aug 31, 2020

Success decoding foo-zstd.zim! I can pick any of the short articles (which are all numeric) and get the result in screenshot below (according to article number). This is a good proof of concept. I am still getting the strange result above with wikipedia_bm_all_2020-07.zstd.zim despite some refining since yesterday, but it is probably still something wrong with my calculation of offsets when I have to provide more data to the inBuffer or empty the outBuffer.

@kelson42 Do we have any more known good ZSTD ZIMs I can work with yet? I just want to be sure that the issue isn't with the encoding of the ZIM, since the small proof-of-concept test ZIM does work with my code.

image

@Jaifroid
Copy link
Member

Success!!! 😀

image

@Jaifroid
Copy link
Member

image

@Jaifroid
Copy link
Member

And it's very fast!!!

@Jaifroid
Copy link
Member

Jaifroid commented Sep 29, 2020

@kelson42 Now this issue has been closed by merging the PR, I think it's safe to start publishing zstandard ZIM archives. We can make a new release once the first ones start coming out, or earlier if @mossroy has time (though it might be an idea to merge #642 before making a release).

@kelson42
Copy link
Collaborator Author

@Jaifroid Thank you very much again or the amount and the quality of the work here. I believe it would be good to merge #642 before making a new release. @mossroy Hopefully this is doable this week. How long would be then take to make a new release (do we need a ticket)?

@Jaifroid
Copy link
Member

We can make a ticket. @mossroy do you think adding zstandard warrants a major number increment? I don't know if ZSTD qualifies as a major change even if it's only the backend and arguably not an API change.

@mossroy
Copy link
Contributor

mossroy commented Sep 29, 2020

I also repeat my congratulations to you @Jaifroid !
In the short term, I'm overwhelmed by too many things to review/test #642
If you agree, let's focus on what is strictly necessary for now : releasing a new version able to handle the latest ZIM files.

Regarding the version number : it's not crucial, we can do what we want.
I think it might be a version 3.0.0 : although it does not change anything in the frontend, it is a significant technical change in the backend

@Jaifroid
Copy link
Member

OK, thank you both!
No problem about #642, it's not that important, and not noticeable to most users.
So, I'll make a ticket to release 3.0.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants