-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor executors #77
Conversation
I like this general idea! My main concern is that this would preclude the option to avoid using intermediate arrays in favor of executor-native groupby operations, e.g., like what is sketched out in #36 for Beam. In principle, avoiding the intermediate copy could be up to twice as fast, if an executor like Beam or Spark manages to hold all the intermediate values in memory instead of dumping to disk. |
Good point Stephan. I think there is a simple resolution; we make the CopySpec -> Pipeline translation optional and allow executors to natively work on CopySpecs if they prefer. Stand by for an update to implement this. |
Sounds good to me! |
Codecov Report
@@ Coverage Diff @@
## master #77 +/- ##
==========================================
- Coverage 97.76% 97.44% -0.32%
==========================================
Files 10 12 +2
Lines 447 509 +62
Branches 89 93 +4
==========================================
+ Hits 437 496 +59
- Misses 5 7 +2
- Partials 5 6 +1
Continue to review full report at Codecov.
|
For context, the tests are currently failing due to the pre-commit mypy failure described above. That's the only blocker here. |
9bbaa60
to
83e1fc7
Compare
I just put this through its paces on google cloud, and I'm satisfied it is working ok for real world use cases. I'm going to merge. It would be great if other users (maybe @rsignell-usgs?) could take this for a spin by installing rechunker from github master. |
Overview
This is a major refactor of the internals of rechunker. I have changed the interface between the executors and the rechunking plan by adding some new types. The executors all now accept something called a
ParallelPipelines
object. The hierarchy of types looks like thisStages contain a single function that is mapped across many inputs (e.g. a single copy operation). MultiStagePipelienes contain multiple Stages (e.g. copy source to intermediate, then intermediate to target). ParallelPipelines contain several MultiStagePipelines that can be executed in parallel.
The
rechunk
function now contains a line calledpipelines = specs_to_pipelines(copy_spec)
which translates a list ofCopySpec
s to aParallelPipelines
. This function contains all of the logic about how to execute a copy operation. All the executors needs to do is know how to execute a genericParallelPipelines
.Motivation
The abstractions we have created in rechunker, which allow many different distributed execution engines to be used for the same computation, are very useful and cool. The underlying motivation for this refactor was to make the executors more general, such that they can be used in other projects (e.g. Pangeo Forge). This is accomplished by decoupling the details of the rechunking operation as currently implemented from the
Executor
class.Pros
Beyond the motivation above, this approach has a couple of major benefits:
consolidate_reads
causes error for large array #75)Cons
But there are some downsides:
Todo
At this point I would love to get a preliminary review from anyone who is interested.