Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

research area: parallel zip creation #2158

Open
cosmicexplorer opened this issue Jun 19, 2023 · 18 comments
Open

research area: parallel zip creation #2158

cosmicexplorer opened this issue Jun 19, 2023 · 18 comments

Comments

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Jun 19, 2023

Problem

I was creating a pex file to resolve dependencies while playing around with fawkes (https://github.com/shawn-shan/fawkes), and like most modern ML projects, it contains many large binary dependencies. This meant that while resolution (with the 2020 pip resolver) was relatively fast, the creation of the zip itself took about a minute after that (without any progress indicator), which led me to prefer a venv when iterating on the fawkes source code.

Use Case Discussion

I understand pex's supported use cases revolve much more around robust reproducible deployment scenarios, where taking a single minute to zip up a massive bundle of dependencies is more than acceptable, and where making use of the battle-tested stdlib zipfile.ZipFile is extremely important to ensure pex files can be executed as well as inspected on all platforms and by all applications. However, for use cases like the one described above, where the pex is going to be created strictly for the local platform, I think it would be really convenient to avoid having to set up a stateful venv.

Alternatives

I could just create a pex file for the dependencies, and use that to launch the python process that runs from source code, and indeed that is what we decided on to implement pantsbuild/pants#8793, which worked perfectly for the Twitter ML infra team's jupyter notebooks. But (assuming this is actually possible) I would still personally find a feature that zips up pex files much faster to be useful for a lot of "I really just wanna hack something together" scenarios where I don't necessarily want to have to set up a two-phase build process like that.

Implementation Strategy

After going through the proposal of pypa/pip#8448 to hack the zip format to minimize wheel downloads, which was implemented much more thoughtfully in pypa/pip#8467 as LazyZipOverHTTP, and then realizing I had overpromised the potential speedup at pypa/pip#7049 (comment), I am wary of just assuming that hacking around with the zip format will necessarily improve performance over the battle-tested synchronous stdlib implementation.

However, it seems plausible that the process of compressing individual entries could be parallelized. pigz describes their methodology at https://github.com/madler/pigz/blob/cb8a432c91a1dbaee896cd1ad90be62e5d82d452/pigz.c#L279-L340, and there is a codebase named fastzip that does this in python using threads, with a great discussion of performance bottlenecks and a few notes on compatibility issues. The Apache commons library appears to have implemented this too with ParallelScatterZipCreator.

End-User Interface

Due to the likely compatibility issues (both with executing the parallel method at all, as well as consuming the resulting zip file), it seems best to put this behind a flag, and probably to explicitly call it experimental (I like the way pip does --use-feature=fast-deps, for example), and possibly even to print out a warning to stderr when executing or processing any pex files created this way. To enable the warning message, or in case we want to do any other special-case processing of pex files created this way, we could put a key in PEX-INFO's .build_properties, or perhaps add an empty sentinel file named .parallel-zip to the output.

@jsirois
Copy link
Member

jsirois commented Jun 19, 2023

Have you explored --layout {loose,packed} or --no-compress using traditional pex or else pex3 venv create ... to directly create an ~immutable venv with no pip installed?

@cosmicexplorer
Copy link
Contributor Author

No, thanks so much! Will be playing around with pex3 for a bit now.........

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Jun 20, 2023

I am confident that either of the methods you described above are a preferable solution for the use case I described.

I will probably still be playing around with parallel zip creation just to see if it demonstrates any significant perf improvement at all without sacrificing much compatibility. It would be neat to demonstrate an X% speedup if there were zero compatibility issues, and less useful if it produces a wonky zip. I would also like this to work on all platforms.

Serializing a few thoughts:

  1. I was trying to figure out a way to make the process recursive instead of needing to synchronize adding individual entries to the central directory record, but I'm not sure that's possible without getting into extensions of "4.5 Extensible data fields" from APPNOTE.TXT, which I do not want to do (I hallucinated that zip files contained other non-central directory records).
  2. The perf improvement from parallelizing file reads + compression cpu seems likely to be much more significant compared to the bookkeeping needed to maintain the central directory record contents, so synchronizing on adding individual records should not be that much of a concern.
    • If I truly wanted to avoid synchronizing, I could look into atomic operations from a language like Rust, but that wouldn't be very useful for pex.
  3. Naively parallelizing individual entry compression would introduce an unstable entry order. For pex's use case, I would very much prefer the result to be deterministic, even if I'm not sure if/where pex relies on the hash of the actual zip vs the pex_hash (I know that pants relies on the hash itself, however).
    • Maintaining a sorted list or b-tree of some sort would address this, but would require holding all of the compressed zip contents in that data structure in memory before writing it out. Using an mmap'd file or a temporary directory to store the data structure could address this, but that seems complex.

Given the above, choosing to "naively parallelize" reading & compressing individual file entries from the chroot, then synchronizing on adding individual compressed entries to the zip output file (which would then have a nondeterministic file entry ordering, but would hopefully not require extensive hacks to zipfile.ZipFile) seems like a reasonable prototype to get an initial idea of any perf improvements.

It might also be interesting to try "virtualizing" Chroot entries instead of hard linking. But that sounds significantly more difficult and much less likely to improve performance vs just letting the filesystem do that bookkeeping.

As of now, I do not expect this investigation to be extremely useful for the pex project, and will prioritize it accordingly. But it's a fun idea.

@cosmicexplorer
Copy link
Contributor Author

Hmmmm: the particular parallelization/synchronization operations necessary for this make it seem like a good fit for pants, actually, especially since pants is much more able to control the build inputs and output than pex itself can afford to. I think I'll leave this open because I can still see a potential for a perf improvement without introducing too many new code paths, for use cases where the output hash doesn't matter (particularly because pex provides a .pex_hash in PEX-INFO), but I will keep that in mind for future ideas like this.

@jsirois
Copy link
Member

jsirois commented Jun 20, 2023

Since this would only be used by the Pex CLI (build-time), a rust implementation is actually how I'd approach this. It's the right language for the job, the problem really has nothing to do with Pex (it has much wider applicability), and there are great tools for turning a crate into an easily consumable Python platform-specific distribution.

@jsirois
Copy link
Member

jsirois commented Jun 20, 2023

In terms of interest from PEX, the resulting zip would definitely need to be vanilla consumable via the zipfile module / the Python zipimport facility. In terms of feature flags, that's really a non-issue. Pex follows the rule of not breaking consumers; so no new feature is ever default unless it's totally transparent to the user.

@cosmicexplorer
Copy link
Contributor Author

Since this would only be used by the Pex CLI (build-time), a rust implementation is actually how I'd approach this.

This makes perfect sense, thanks! Totally hadn't considered this.

@cosmicexplorer
Copy link
Contributor Author

Created project at https://github.com/cosmicexplorer/medusa-zip, will update here if the prototype demonstrates any value.

@cosmicexplorer
Copy link
Contributor Author

The project has two features: crawling for files in parallel, and converting a crawl list into a zip in parallel. Given that pex currently expects to have all the Chroot's files known before the .zip() operation via recording in self.filesets, I'm going to first try the least invasive change (swapping out just the zip operation to call into my rust code) before revisiting the assumption that a Chroot must be a physical directory which contains every single entry within the directory.

I'm probably going to try exposing these two features as subcommands and executing medusa-zip as a subprocess within pex to avoid having to figure out an FFI right now. For right now, the file crawling command won't be used.

In terms of interest from PEX, the resulting zip would definitely need to be vanilla consumable via the zipfile module / the Python zipimport facility.

After deciding that asyncifying and parallelizing the read/compress process was likely to contribute the strongest performance improvement, I was able to avoid any modifications to the rust zip crate, especially after seeing they've put in the effort to provide fuzz testing. I do not believe that the current prototype introduces additional compatibility concerns on the zip format-level or the zipimport/zipfile-level.

@cosmicexplorer
Copy link
Contributor Author

Ok, so even just limiting the parallelism to reading source files + compressing their contents, we have produced quite a significant speedup. The current pex diff is at https://github.com/pantsbuild/pex/compare/main...cosmicexplorer:pex:medusa-zip?expand=1, and the pex command I'm testing it against went from 2m40s to 1m16s, which is 2.1x as fast (and that's the entire pex command, including resolution and preparing the chroot beforehand). As mentioned, I didn't get into any zip file hacks, so I would be surprised if this produced compatibility issues with zipimport or zipfile.

TODO

  1. Currently there is no throttling at all, which means the medusa-zip process consumes about 5GB of memory while executing and easily overruns most open file limits. I think I'm going to institute a configurable file handle limit as well as an allocated memory limit.
  2. There's another set of speedups I think we can get by "virtualizing" inclusion of dists without writing anything to the chroot, avoiding walking their file trees or copying all of their recursive contents to the chroot, which is currently performed via ._add_dist_dir(): https://github.com/cosmicexplorer/pex/blob/5f687bd4e85afbd2ff06e98f5badd40eaef2c810/pex/pex_builder.py#L436-L447. We would then be able to perform a parallel crawl of all those dists in their location in the pex unpacked wheel cache.
  3. There's also a synchronous crawl we do of all the chroot's files at once in order to filter against paths in iter_files(), which performs a syscall for each file entry. I think we should be able to perform this checking incrementally as we add files (this affects the non-medusa code path as well).

I had assumed we would need to cut at the API surface of PEXBuilder, since Chroot seemed like it had a lot of reliance on a physical backing directory, but that doesn't seem to be the case at all. I hope that we can expose the parallel zip as an optional argument to Chroot, and then pipe that through to PEXBuilder and the pex CLI.

@jsirois
Copy link
Member

jsirois commented Jul 6, 2023

I think maybe you missed copy mode symlink. That's why the chroot setup is fast today.

@cosmicexplorer
Copy link
Contributor Author

I think maybe you missed copy mode symlink. That's why the chroot setup is fast today.

Yes, but I was confused as to why it only checks for self._copy_mode == CopyMode.SYMLINK and not CopyMode.LINK as well (was going to raise an issue asking this question). However, especially seeing that ._add_dist_dir() also applies the dist_name as a label, this "virtual" inclusion may well already be performed.

@jsirois
Copy link
Member

jsirois commented Jul 7, 2023

If you do a cursory search for LINK I think you can avoid raising an issue, all three modes are handled, and LINK gives a hard link farm.

@cosmicexplorer
Copy link
Contributor Author

It's true that all three modes are handled, but it just seemed counterintuitive from a passing glance that CopyMode.LINK would need to traverse the contents of the dist, which is why I was pointing it out as a potential performance optimization. After making that change earlier, I found out which tests it would break, so if I want to pursue that I can peruse those later.

@cosmicexplorer
Copy link
Contributor Author

I have been making more fun changes to zip file creation:

  • I have converted the code in medusa-zip to perform explicit pipelining with tokio::sync::mpsc::channel, but the first attempt was approximately on par with the sync python impl (about 2m40s), which makes some sense, since only a single file is processed at a time. It also still steadily accumulates many gigabytes of memory until releasing it all at once at the end.
  • After converting the channel to an unbounded_channel and using .ready_chunks() to perform fixed-size sets of file i/o at a time, I recovered some of the initial performance (about 2m20s), but not all of it, and memory usage still grows unbounded.

This might have been discouraging, but after diving into the zip format again earlier today, I believe that we can extend the current method (creating many in-memory single-file zips and copying over data) to produce chunked intermediate zips (which can be created in parallel, written to file, then copied over to the final output zip with essentially a massive io::copy()). This was my original hope, but it wasn't clear to me how the bookkeeping within the zip module worked previously. It's now clear to me how zip files were created to allow appending, and I believe we will be able to take advantage of this.

To reiterate:

  1. Split up the input file list into chunks of size N.
  2. Per chunk, read input files and convert them into zip entries in parallel, then write those entries in order to each intermediate zip.
  3. Merge the contents of each intermediate zip in order.

For pex's use case, we can even do one better, creating and caching an "intermediate zip" for each dist itself, so we can avoid traversing the directory hierarchy for dists at all and instead largely just io::copy() the already-compressed data wholesale.

We still need to determine whether merging intermediate zips works the way I think it does, but I am pretty sure that we can essentially rip the central directory headers from the end of the source zip file, copy-paste the entire body of the source zip into the current zip, then append the new central directory headers into our existing central directory headers after writing the rest of the file. You'll note this can be formulated recursively, although I'm not sure whether that's helpful yet.

In order to do this, I'm going to implement:

  1. A merge operation in the medusa-zip CLI (this will require a relatively small modification to the upstream zip crate).
  2. Make pex generate a cached intermediate zip for each cached dist.
  3. Apply the merge operation to produce an output zip from the Chroots source files as well as cached dist zips.

@cosmicexplorer
Copy link
Contributor Author

Ok so that seems to work great!

Results

  1. You can indeed merge zip files with a single big io::copy().
  2. This means we can produce several zip files in parallel and merge them to achieve ~the same speedup (1m20s vs 2m40s for the entire pex invocation) we got with the first version, but without overrunning open file handles!
  3. I also implemented another small optimization to compress small files (<= 1000 bytes) in-memory, but keep larger files as open file handles, as well as limiting the number of files we have "in flight" that haven't been written into an intermediate zip yet. This keeps memory usage extremely minimal, and appears to have little impact on speed.

TODO: caching intermediate zips for 3rdparty dists

So this alone would be great (and would also minimize the changes to pex), but as detailed above, I would like to additionally make use of the zip merging breakthrough to add further caching to this process:

  1. Create and cache an "intermediate zip" for each fingerprinted external dist added to the PEXBuilder.
    • This would use medusa-zip to crawl and then generate a zip file for the dist which has entries at the exact same paths that they would be in any output pex (e.g. at .deps/Keras_Preprocessing-1.1.2-py2.py3-none-any.whl/...).
    • I have added a regex-based --ignores option to medusa-zip crawl which allows us to perform the same filtering of pyc temporary files that we do within pex.
  2. Avoid crawling any external dist directories, and instead merge their cached intermediate zips into the output zip.
    • This might produce a slightly different order of entries in the final zip than before. Currently, medusa-zip's output exactly matches pex's output via zipfile. I do not believe this will cause any compatibility issues.

With the caching of intermediate zip files for 3rdparty dists, I suspect that the final zip production can be made into a matter of several seconds, as it completely avoids even traversing the filesystem for the myriad 3rdparty dists after they are first installed and cached. For pex files with vendored code, or simply quite a lot of first-party code, we wouldn't try to perform any caching, but simply make use of the normal faster parallel zipping process (which internally uses the zip merge operation).

Changes to pex code

Current pex changes are visible at: main...cosmicexplorer:pex:medusa-zip, currently at +46/-18.

As of right now, before implementing the caching of intermediate zips for 3rdparty dists, we only modify pex.common.Chroot#zip() to:

  1. Use the existing iter_files() to crawl all the chroot contents.
  2. Generate a JSON object mapping file paths to their location in the output zip.
  3. Pass that json object to medusa-zip zip over stdin, and wait for it to complete.

My goals with the caching of intermediate zips for 3rdparty dists are:

  1. Avoid traversing the installation directory for dists at all when they are cached.
    • Should be able to do this by adding an experimental_parallel_zip option to PEXBuilder() and then modifying ._add_dist_dir() to return early but add a note to Chroot that the dist will be found elsewhere.
    • If self._copy_mode == CopyMode.SYMLINK, we don't have to change ._add_dist_dir() at all, so we may just want to enforce that for all zips created with the parallel method.
  2. For each dist without a cached intermediate zip, crawl its unpacked install dir with medusa-zip crawl, modify the names so they appear within the right subdirectory, then generate the intermediate zip with medusa-zip zip.
    • Hopefully we can perform this as part of process of installing a dist (editing somewhere in resolver.py, I think), so that it can make use of the python-level subprocess parallelism we've already worked hard to provide.
  3. When .zip() is called, the chroot should:
    a. Avoid iterating over any filesets added with the label of a 3rdparty dist.
    b. Generate the input JSON from those constrained filesets to tell medusa-zip zip which files to read.
    c. Execute medusa-zip zip to produce a "source-only" zip of those non-3rdparty files.
    d. Execute medusa-zip merge to merge in the contents of the 3rdparty dists.

Compatibility

If merging is performed in this way, all of the .deps/ entries will come after all the source files. I don't expect this will break anything (in fact, it makes the output of unzip -l easier to read), but it will require adding a small workaround to ensure that directory entries are listed appropriately (the .deps/ dir will need to be created at the very end of the "source-only" zip).

@cosmicexplorer
Copy link
Contributor Author

Ok, this all seems to work great and the resulting zip is generated in several hundred milliseconds, which rocks. I've been transferring more functionality from .zip_entry_for_file(), like making sure the modification times are taken from the file on disk. (https://github.com/cosmicexplorer/medusa-zip)

Really notably, I've actually switched the default mechanism for zipping up source files to run entirely synchronously, so I can focus on the compatibility part. The parallel approach is absolutely still useful, but the speedup I care about is coming from merging those intermediate zips for all 3rdparty dists.

Once the file perms and mtime are done correctly, I'm going to fix up the generation of intermediate zips for dists so that it's done as part of the parallel InstallRequest jobs (the caching is currently performed when ._add_dist_dir() is called). (It would probably be better to make the generation of the intermediate zip into its own parallel job at some point, though.) After that, I think it'll be ready for review.

I'm not sure exactly how we would incorporate a rust tool into the pex build; I looked at the great tooling we have for vendored code, but that seems very python-specific. @jsirois what format would be most useful for pex to consume a rust tool with? It works great as a subprocess.

@cosmicexplorer
Copy link
Contributor Author

See #2175 for an initial implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants