Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for .zst for component tarballs in channels #2488

Closed
kinnison opened this issue Sep 15, 2020 · 26 comments · Fixed by #2676
Closed

Add support for .zst for component tarballs in channels #2488

kinnison opened this issue Sep 15, 2020 · 26 comments · Fixed by #2676

Comments

@kinnison
Copy link
Contributor

kinnison commented Sep 15, 2020

Describe the problem you are trying to solve

As per #1858 and other places, it would be good to support zstd compression for component packages in channels. This would allow for smaller, faster-to-decompress components which is a benefit all round.

Describe the solution you'd like

Support for .zst as a file extension and zstd decompression as part of Rustup.

Notes

This ought mostly to be constrained to the src/dist/ tree, particularly the manifestation.rs and component/package.rs files. Some test support will also be needed.

Since zstd generally compresses better and decompresses more quickly than xz, we should prefer .zst over .xz where it is present in the manifest.

Once implemented, the compiler team could start to produce channels with zstd compressed content

@joshtriplett
Copy link
Member

For reference, on my laptop:

$ du -hs rust-1.46.0-x86_64-unknown-linux-gnu.tar*
842M	rust-1.46.0-x86_64-unknown-linux-gnu.tar
211M	rust-1.46.0-x86_64-unknown-linux-gnu.tar.gz
122M	rust-1.46.0-x86_64-unknown-linux-gnu.tar.xz
99M	rust-1.46.0-x86_64-unknown-linux-gnu.tar.zst
$ time gzip -dc rust-1.46.0-x86_64-unknown-linux-gnu.tar.gz >/dev/null

real	0m4.890s
user	0m4.853s
sys	0m0.036s
$ time xz -dc rust-1.46.0-x86_64-unknown-linux-gnu.tar.xz >/dev/null

real	0m9.218s
user	0m9.160s
sys	0m0.056s
$ time zstd -qdc rust-1.46.0-x86_64-unknown-linux-gnu.tar.zst >/dev/null

real	0m0.839s
user	0m0.811s
sys	0m0.028s

@Mark-Simulacrum
Copy link
Member

@joshtriplett Could you also run compression timing? If zstd is significantly slower we probably can't afford it.

@joshtriplett
Copy link
Member

@Mark-Simulacrum Depends heavily on compression level and what tradeoff we want to make. How much time does the current xz compression use?

@Mark-Simulacrum
Copy link
Member

I don't think we have timings, so it's hard to say. Does decompression time not correlate with compression time much then? (Beyond "disk reading is slower with more data")?

I'd guess a good way to try and estimate things would be for someone to read https://github.com/rust-lang/rust-installer/blob/d66f476b4d5e7fdf1ec215c9ac16c923dc292324/src/tarballer.rs#L49-L56, lower that into either xz command line arguments or write up a small Rust program that we can test things out with. Then we'd download all the artifacts for a single commit from S3 (I can help there, though the rustup manifest is an easy place to start), and write up a small-ish table of compression/decompression times for various settings on each of the tools. Then we can try and select our best trade-off; CI time these days is mostly plentiful on dist builders, and if we do this during release promotion we can definitely take somewhat longer.

@joshtriplett
Copy link
Member

joshtriplett commented Sep 16, 2020

Does decompression time not correlate with compression time much then? (Beyond "disk reading is slower with more data")?

No, it doesn't. Compression can take anywhere from "faster than gzip" to "several minutes", with corresponding improvements in data size; decompression of all of those is proportional to data size, never compression time. Taking several minutes would be worth it for stable release tarballs, while nightly/CI versions could scale that back a little and aim for taking the same amount of time in CI that we currently do.

@kinnison
Copy link
Contributor Author

@joshtriplett @Mark-Simulacrum Was there consensus on this in the end? Is this still on the cards?

@Mark-Simulacrum
Copy link
Member

I think we should gather some more data timing and size wise, but I expect that we should indeed support zstd compression (and perhaps even make that our canonical format, instead of xz, in the next several cycles after it rolls out).

That said, I would like for us to try to figure out a plan for limiting ourselves to maybe 2-3 long-term supported compression formats for cost / time reasons; probably gzip needs to stick around for compatibility (though I don't think I've encountered xz lacking but gzip supporting servers myself) but I'm not sure about xz vs zstd.

cc @rust-lang/release

@kinnison
Copy link
Contributor Author

It's not that much pain for rustup to support all the formats, though we don't have to generate them all for each channel release.

With that said, I'll consider the work to add zstd support to rustup in the near future.

@Mark-Simulacrum
Copy link
Member

Yeah, I'm mostly worried about storing 2-3x as much data (plus the time to recompress things) if we're going to accumulate new algorithms over time, not implementation complexity of each piece.

@kinnison
Copy link
Contributor Author

I think that if zstd looks promising, the right thing will be to swap xz for it. gzip is a good idea to retain long-term for compatibility I guess.

@joshtriplett
Copy link
Member

joshtriplett commented Oct 18, 2020 via email

@rbtcollins
Copy link
Contributor

rbtcollins commented Oct 18, 2020 via email

@joshtriplett
Copy link
Member

joshtriplett commented Oct 18, 2020 via email

@joshtriplett
Copy link
Member

The nice thing is that once the client has zst support, the server-side can decide how aggressively to compress for each tarball. For instance, if stable artifacts get more downloads, we could spend more resources compressing them.

@kinnison
Copy link
Contributor Author

Currently the channel manifest toml only has gz and bz2 support, and not in an entirely extensible way, I think that we need to define how the channel toml will represent the available compression formats, and thence what is considered 'acceptable' in terms of available formats. E.g. would a channel manifest with only .zst be okay? For that to be the case a lot of tooling may need adjustment.

I've given this some thought already and will continue to do so, and may bring it up for wider discussion at the next dev-tools sync.

@kinnison
Copy link
Contributor Author

True, not bz2, braino on my part :D

@kinnison
Copy link
Contributor Author

@joshtriplett If you use a master build of rustup then there's now .zst support there though I'd consider it entirely rough-cut for now. If you'd prefer to wait until it's in a released version before experimentation that's fine, but now you should be able to have a play around with a theoretically .zst-only toolchain channel manifest if you wanted. Just use zst_url and zst_hash as you might expect.

I won't be advertising this beyond a changelog entry along the lines of Experimental support for ZStd compressed dist artifacts for now.

@kinnison
Copy link
Contributor Author

@89z I didn't say that any of the channels had been updated, merely that the in-development branch of rustup could now support it if Josh wanted to experiment. Don't expect this any time soon.

@joshtriplett
Copy link
Member

@kinnison Thank you!

@joshtriplett
Copy link
Member

I'm working on patches to try this in the Rust distribution process. Right now, those patches are waiting on gyscos/zstd-rs#117 to go into a released version of the zstd crate, because Rust tarballs benefit greatly from zstd's long-distance matching mode.

@kinnison
Copy link
Contributor Author

kinnison commented May 2, 2021

I'm working on patches to try this in the Rust distribution process.

Great to hear - if you need any extra work on our side to enable your experiments, please let me know. I fully expect that we'll not have done everything right first time :D

@rbtcollins
Copy link
Contributor

rbtcollins commented May 2, 2021 via email

@joshtriplett
Copy link
Member

@rbtcollins I'd expect zst to typically take less memory on decompression, and much less decompression time (by a factor of 10). And gzip isn't going away.

@pietroalbini
Copy link
Member

Just a note, I'm not sure whether it will be feasible to provide three compression formats. Our releases are already huge and we store them forever, so adding more duplication could be an issue. The infra team would need to discuss what we're going to serve to users.

@rbtcollins
Copy link
Contributor

rbtcollins commented May 3, 2021 via email

@joshtriplett
Copy link
Member

@pietroalbini I'm not expecting that we should supply three. I'm hoping that we move from gz/xz to gz/zst. I'm working on a proposal for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants