Tarballs we deliver should be compressed better #21724
Apparently our s3 distribution includes docs, which my I downloaded the tarball we ship and recompressed it for a fairer comparison:
|
The tarball (2015-03-03) has since increased to 138MiB in size. Recompressing that to |
Do we know of statistics of how many systems can decompress a |
As I and others discussed on IRC:
On linux systems some very common packages (with defaultish configure options) depend on liblzma5/xz (gdb and systemd come to mind as examples) so it is very likely it will be available on a standard linux system. |
OS X does support xz by default in its 'tar' command (which is bsdtar - not sure exactly when support was introduced, but I think it was in 10.9) and Archive Utility (apparently newly in 10.10). This works via a library rather than the xz command line utility, which is not provided. |
Are you sure it is not just shelling out to the xz command? Does 'tar xJf EDIT: never mind, apparently bsdtar just links against liblzma5 and uses it directly. This is very nice, I should consider dropping GNU tar and using bsdtar myself. Either way, this means that using xz will benefit most users of both Linux and OS X.
|
The 7-Zip Utility for Windows can decompress xz archives, according to Wikipedia. That's the only archiver I use on Windows. It's free and open source, but it is third-party. However, isn't the installer the preferred way of getting Rust on Windows? That's what I use. I don't know how the installer does decompression but xz support will probably have to be implemented for it. |
In addition to 7-Zip (which invented the underlying LZMA2 compression format) XZ is also supported by all the other major tools not already mentioned:
Edit: I need to stop forgetting to double-check my memory of what download pages are offering before posting. I've trimmed out some irrelevant bits. |
Strong +1 to @alexcrichton 's suggestion that we simply provide both. It costs us relatively nothing to construct both artifacts; is there any serious cost to provide them both (e.g. are we worried about connection charges or storage space on our servers?) |
An update since the metadata reform. Now gzipped tarball is 107MB. xzipped is 78MB. Still an easy 30MB win. EDIT: docless xz: 75MB, docless gzip: 100MB. |
Is this still an issue today? |
Recompressing gzip from https://static.rust-lang.org/dist/rust-1.4.0-x86_64-unknown-linux-gnu.tar.gz to xz still goes from 97MB to 75MB, a win of 22MB. |
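For anyone who wants to repeat this kind of measurement, here is a minimal sketch of the recompression step using only the Python standard library. The function name `recompress_gz_to_xz` and the file paths are made up for illustration; the preset of 6 mirrors the default level of the `xz` command line tool, and the numbers quoted in this thread were not produced with this script.

```python
import gzip
import lzma
import shutil


def recompress_gz_to_xz(gz_path, xz_path, preset=6):
    """Stream-decompress a .tar.gz and re-encode the same tar bytes as .tar.xz.

    The tar payload itself is untouched; only the outer compression changes.
    """
    with gzip.open(gz_path, "rb") as src, \
            lzma.open(xz_path, "wb", preset=preset) as dst:
        # Copy in 1 MiB chunks so the whole archive never sits in memory.
        shutil.copyfileobj(src, dst, length=1024 * 1024)
```

Calling `recompress_gz_to_xz("rust.tar.gz", "rust.tar.xz")` on a locally downloaded tarball and comparing the two file sizes with `os.path.getsize` gives the before and after figures.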
|
FYI, nowadays even Busybox supports |
We now have switched to the full Rust solution for distribution, so we can easily switch from tar.gz to tar.xz in that case. |
I recently had a chance to live on data-capped tether for 2 weeks and it hurt me very hard when the new stage0 compiler got in. It took me a considerable amount of time to download the new compiler and put a noticeable dent into my data allowance. Both of those would have been much more bearable with |
Please provide .xz downloads for the source tarballs. I am a packager for Mageia Linux and downloading the tar.gz and then uploading it to our tarballs server over my slow ADSL upstream is time-consuming. I tried to compress the tar.gz tarball using
That's a 34% saving. |
I'd like to make this happen but it's quite complex to do. I think the basic way to do it is to recompress all the tarballs in one batch job at the same time during final manifest generation. It would be great to do it in a way that isn't conflated with other parts of the build infrastructure, so that it can be developed and tested independently of buildbot. Unfortunately the way the entire set of artifacts is put together is quite complex. I tried to write up a design that somebody else could implement but got pretty discouraged. But some requirements I think
I do want to redesign the entire release build process, and it might be easier to make this happen as part of a redesign. |
Compressing just the source tarball can probably be done relatively easily by modifying the build system with a |
If we're counting calories, stripping There are also a few spots of debuginfo, but that saves less. |
OK, I think we're now well poised to do this! Specifically I believe the steps would look like:
|
I did some experiments with the compression by also tuning the order in which files are included in the archive and it looks like we might get further improvements. This is basically achieved by storing duplicate files one after the other, so that the stream compressor can encode them more efficiently.
I would like to work on this issue, but I might not be able to do so until the end of the month. If somebody else starts implementing the new system, please ping here :) |
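A rough sketch of the reordering idea, for illustration only. It assumes that "reverse name sorting" means sorting members by their file name read backwards, so that files sharing a suffix (and, for hashed library names, the same hash) land next to each other in the tar stream. The helper names `reverse_name_key` and `build_tar_xz` are hypothetical and are not part of rust-installer.

```python
import io
import lzma
import tarfile


def reverse_name_key(name):
    """Sort key: the name reversed, so equal suffixes/extensions cluster."""
    return name[::-1]


def build_tar_xz(files, key=None, preset=9):
    """Pack {name: bytes} into an in-memory .tar.xz.

    Members are added in sorted(files, key=key) order; pass
    key=reverse_name_key to try the reordering described above.
    """
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name in sorted(files, key=key):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return lzma.compress(raw.getvalue(), preset=preset)
```

Comparing `len(build_tar_xz(files))` with `len(build_tar_xz(files, key=reverse_name_key))` on a real file set shows whether the ordering helps for that input; the gain depends on how much duplicated content there is and how far apart it would otherwise sit.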
@ranma42 holy cow I had no idea we could get such a drastic improvement by reordering files, that's awesome! FWIW the tarball creation itself is likely buried in rust-installer which may be difficult to modify but not impossible! Eventually I'd love to completely rewrite the rust-installer repo in Rust itself (e.g. |
I've tried to get better results than @ranma42 with both brotli and zstd at maximum compression settings (
For the rust-src-nightly.tar.gz they were behind as well, just not as far:
Also note that brotli takes far longer to compress than any other algo. In the second diagram, you can see that reverse sorting in fact has a tiny negative impact for source code. I too would suggest going with xz at level 9 with reverse ordering, as a) it's far more widespread than zstd/brotli and b) the possibly better decompression speed of zstd/brotli is an unimportant advantage. |
I wonder if we can improve the sorting by either using a similarity hash and order by hash value, or even use a distance metric and Floyd-Warshall to find out the cheapest path through all files. Then again that's probably overdoing it. |
@llogiq the "reverse name sorting" trick is a cheap approximation of that, because it clusters files with the same extension. In the case of rust object files, it is effectively also sorting them by their hash, ensuring that identical libraries are adjacent in the list. If we want to squeeze the tarball further, I would suggest investigating the biggest files in the release:
|
Perhaps we should set up stripped binaries after all, as the savings are substantial. It may allow some people to use Rust who currently cannot afford it. |
The difference between fully stripped and not stripped when decompressed is 120MB. Difference when compressed (for sorted files) is 8MB. |
Being bold, we could also think of every single function as one "file", reorder those using similarity hashes (or floyd-warshall, although I guess the number would be too high for pure floyd-warshall), and provide a self extracting archive or something. That would solve the "cargo links everything statically" problem. |
Just in case I have tested other options with
[1] All dictionary compression schemes require a certain amount of previously decoded data. In gzip this is not significant (~64K) but for costlier options of |
I followed the first steps suggested by @alexcrichton without encountering any significant issue. |
@ranma42 oh @brson and I discussed this a long time ago actually and we were both on board with just adding a new key to the manifest. Right now all artifacts have |
oh and similar to |
@alexcrichton a dash in the field name will prevent Given the proposed approach, I assume that there are no plans to add other formats in the future. Another option might be to add a |
Oh so the serde version of toml takes care of that just fine (via serde attributes) and the old rustc-serialize version actually handled it as well (translating deserializing into a rust field named I think we're definitely open to new formats in the future, we'd just add more keys. We could support a generic container (like a list) for the formats but it didn't really seem to give much benefit over just listing everything manually. Downloaders will basically always look for an exact one and otherwise fall back to tarballs. |
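The "look for an exact key, otherwise fall back" behaviour could look roughly like the sketch below. All key names here (`url`, `hash`, `xz_url`, `xz_hash`) and the function `pick_download` are placeholders chosen for illustration; the thread does not spell out the final manifest field names, and this is not the actual downloader code.

```python
def pick_download(target_entry):
    """Return (url, hash) for one manifest entry.

    Prefers the xz artifact when both of its (hypothetical) keys are
    present and non-empty, and otherwise falls back to the gzip tarball.
    """
    if target_entry.get("xz_url") and target_entry.get("xz_hash"):
        return target_entry["xz_url"], target_entry["xz_hash"]
    return target_entry["url"], target_entry["hash"]
```

Because older manifests simply lack the extra keys, a downloader written this way keeps working against them, and adding another format later only means checking one more pair of keys.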
I implemented the changes required to get the xz url and hash here, but I keep getting the |
Oh ideally we'd switch to serde, but I wouldn't really worry about it, it's not that important. Due to bootstrapping using serde in the compiler is difficult right now, unfortunately. |
Then I will leave the manifest fields as |
Generate XZ-compressed tarballs Integrate the new `rust-installer` and extend manifests with keys for xz-compressed tarballs. One of the steps required for #21724
Add support for XZ-compressed packages When XZ-compressed packages are available, prefer them in place of the GZip-compressed ones as they can provide significant savings in terms of download size. This should be the last step towards fixing rust-lang/rust#21724
The next version of rustup should include rust-lang/rustup#1100, hence it should use XZ by default (if available). |
And rustup has now shipped! |
Today’s rust-nightly-x86_64-unknown-linux-gnu.tar.gz is 125MiB in size. I did a
make dist-tar-bins
which output the same tarball, but only 88MiB in size. This is 70% of whatever we publish to s3. I took the liberty to also test:
- xz (the default level, -6) → 69MiB (55% original);
- xz -9 → 59MiB (47% original, but has high memory requirements to decompress);
- bz2 → 82MiB (65% original);
- lzma → 69MiB, but took longer than xz.

I strongly propose to either migrate to a more modern compression algorithm (xz) or at least investigate why gzip does such a bad job on the build bots.
cc @brson
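A comparison like the one above can be approximated with the compressors in the Python standard library. This is only a sketch: the function name `compressed_sizes` is made up, the labels are loose equivalents of the command line levels, and the in-process codecs will not necessarily match the sizes of the `gzip`, `bzip2` and `xz` tools byte for byte.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data):
    """Return the compressed size in bytes of `data` under several codecs."""
    return {
        "gzip -9": len(gzip.compress(data, compresslevel=9)),
        "bz2 -9": len(bz2.compress(data, compresslevel=9)),
        "xz -6": len(lzma.compress(data, preset=6)),
        "xz -9": len(lzma.compress(data, preset=9)),
    }
```

Feeding it the bytes of an uncompressed tarball, for example `compressed_sizes(open("rust.tar", "rb").read())`, and dividing each value by the gzip size gives percentages in the same spirit as the list above. For very large archives this reads everything into memory, so a streaming variant would be preferable.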