Refactor executors #77

rabernat · 2020-12-31T23:02:54Z

Overview

This is a major refactor of rechunker's internals. I have changed the interface between the executors and the rechunking plan by adding some new types. All of the executors now accept a ParallelPipelines object. The hierarchy of types looks like this:

ParallelPipelines = Iterable[MultiStagePipeline]
MultiStagePipeline = Iterable[Stage]
class Stage(NamedTuple):
    func: Callable
    map_args: Optional[Iterable] = None

Stages contain a single function that is mapped across many inputs (e.g. a single copy operation). MultiStagePipelines contain multiple Stages (e.g. copy source to intermediate, then intermediate to target). ParallelPipelines contain several MultiStagePipelines that can be executed in parallel.
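As a rough illustration of how these types nest, the sketch below builds a two-stage pipeline by hand and runs it serially. The dictionaries and the two copy functions are made up for this example and only stand in for real array stores. Treating `map_args=None` as "call the function once with no arguments" is an assumption, not something stated above.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Stage(NamedTuple):
    func: Callable
    map_args: Optional[Iterable] = None


# Hypothetical in-memory "stores" standing in for source/intermediate/target arrays.
source = {0: "a", 1: "b", 2: "c"}
intermediate = {}
target = {}


def copy_source_to_intermediate(key):
    intermediate[key] = source[key]


def copy_intermediate_to_target(key):
    target[key] = intermediate[key]


# A MultiStagePipeline is just an iterable of Stages, run in order.
pipeline = [
    Stage(copy_source_to_intermediate, map_args=list(source)),
    Stage(copy_intermediate_to_target, map_args=list(source)),
]

# ParallelPipelines is an iterable of independent MultiStagePipelines.
pipelines = [pipeline]

# Naive serial execution, only to show the nesting of the three levels.
for multi_stage in pipelines:
    for stage in multi_stage:
        if stage.map_args is None:
            stage.func()  # assumption: no map_args means a single call
        else:
            for arg in stage.map_args:
                stage.func(arg)
```

After the loop finishes, `target` holds the same items as `source`, having passed through `intermediate` on the way.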

The rechunk function now contains the line pipelines = specs_to_pipelines(copy_spec), which translates a list of CopySpecs into a ParallelPipelines object. specs_to_pipelines contains all of the logic about how to execute a copy operation. All the executors need to do is know how to execute a generic ParallelPipelines.
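To make "execute a generic ParallelPipelines" concrete, here is a minimal sketch of what such an executor could look like using only the standard library. It is not the code in this PR: `run_pipeline` and `execute` are hypothetical names, and the choice to run each pipeline as one task in a thread pool is just one possible strategy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, NamedTuple, Optional


class Stage(NamedTuple):
    func: Callable
    map_args: Optional[Iterable] = None


def run_pipeline(pipeline: Iterable[Stage]) -> None:
    """Run the stages of one pipeline strictly in order."""
    for stage in pipeline:
        if stage.map_args is None:
            stage.func()  # assumption: no map_args means a single call
        else:
            for arg in stage.map_args:
                stage.func(arg)


def execute(pipelines: Iterable[Iterable[Stage]], max_workers: int = 4) -> None:
    """Run independent pipelines concurrently, one pool task per pipeline."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_pipeline, p) for p in pipelines]
        for future in futures:
            future.result()  # re-raise any exception from a worker


# Two toy pipelines that only record what they did.
log = []
pipelines = [
    [
        Stage(lambda i: log.append(("A", "copy", i)), [0, 1]),
        Stage(lambda: log.append(("A", "done"))),
    ],
    [
        Stage(lambda i: log.append(("B", "copy", i)), [0, 1]),
        Stage(lambda: log.append(("B", "done"))),
    ],
]
execute(pipelines)
```

Entries from pipelines A and B may interleave in `log`, but within each pipeline the "copy" entries always come before "done", because stages inside a pipeline run sequentially.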

Motivation

The abstractions we have created in rechunker, which allow many different distributed execution engines to be used for the same computation, are very useful and cool. The underlying motivation for this refactor was to make the executors more general, such that they can be used in other projects (e.g. Pangeo Forge). This is accomplished by decoupling the details of the rechunking operation as currently implemented from the Executor class.

Pros

Beyond the motivation above, this approach has a couple of major benefits:

Simplify Executor code
Allow all Executors to accept Dask arrays 🎉
Allow read consolidation for Dask array sources (see e.g. Hardcoded consolidate_reads causes error for large array #75)
Enable future enhancements (e.g. add additional stages, like delete intermediate, consolidate metadata, etc.)

Cons

But there are some downsides:

Additional code complexity (more levels of abstraction)
Possible performance impacts (have not done benchmarking comparison yet)
Possible impacts on graph size / complexity (have not checked yet)

Todo

At this point I would love to get a preliminary review from anyone who is interested.

shoyer · 2021-01-01T07:29:35Z

I like this general idea! My main concern is that this would preclude the option to avoid using intermediate arrays in favor of executor-native groupby operations, e.g., like what is sketched out in #36 for Beam. In principle, avoiding the intermediate copy could be up to twice as fast, if an executor like Beam or Spark manages to hold all the intermediate values in memory instead of dumping to disk.

rabernat · 2021-01-02T15:44:21Z

Good point Stephan. I think there is a simple resolution; we make the CopySpec -> Pipeline translation optional and allow executors to natively work on CopySpecs if they prefer. Stand by for an update to implement this.
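One way the optional translation could be structured is sketched below. Everything here is hypothetical and invented for illustration: `ToyCopySpec`, `toy_specs_to_pipelines`, both executor classes and the `run` dispatcher are not rechunker's actual classes or API. The point is only that an executor which understands specs natively can bypass the pipeline translation entirely.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Stage(NamedTuple):
    func: Callable
    map_args: Optional[Iterable] = None


class ToyCopySpec(NamedTuple):
    """Hypothetical, heavily simplified stand-in for a CopySpec."""
    source: dict
    target: dict


def toy_specs_to_pipelines(specs):
    """Translate each spec into a one-stage pipeline (illustrative only)."""
    def make_copy(spec):
        def copy(key):
            spec.target[key] = spec.source[key]
        return copy

    return [[Stage(make_copy(spec), list(spec.source))] for spec in specs]


class PipelineBasedExecutor:
    """Hypothetical executor that only understands generic pipelines."""

    def execute_pipelines(self, pipelines):
        for pipeline in pipelines:
            for stage in pipeline:
                if stage.map_args is None:
                    stage.func()
                else:
                    for arg in stage.map_args:
                        stage.func(arg)


class NativeExecutor:
    """Hypothetical executor that consumes specs directly,
    e.g. to use an engine-native groupby instead of an intermediate."""

    def execute_specs(self, specs):
        for spec in specs:
            spec.target.update(spec.source)


def run(executor, specs):
    # Skip the translation when the executor can handle specs natively.
    if hasattr(executor, "execute_specs"):
        executor.execute_specs(specs)
    else:
        executor.execute_pipelines(toy_specs_to_pipelines(specs))


# Both executors produce the same result from the same kind of spec.
spec_a = ToyCopySpec(source={0: "x", 1: "y"}, target={})
spec_b = ToyCopySpec(source={0: "x", 1: "y"}, target={})
run(PipelineBasedExecutor(), [spec_a])
run(NativeExecutor(), [spec_b])
```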

shoyer · 2021-01-02T19:34:21Z

> Good point Stephan. I think there is a simple resolution; we make the CopySpec -> Pipeline translation optional and allow executors to natively work on CopySpecs if they prefer. Stand by for an update to implement this.

Sounds good to me!

codecov · 2021-01-07T04:55:32Z

Codecov Report

Merging #77 (a5a3a29) into master (c59f303) will decrease coverage by 0.31%.
The diff coverage is 98.13%.

@@            Coverage Diff             @@
##           master      #77      +/-   ##
==========================================
- Coverage   97.76%   97.44%   -0.32%     
==========================================
  Files          10       12       +2     
  Lines         447      509      +62     
  Branches       89       93       +4     
==========================================
+ Hits          437      496      +59     
- Misses          5        7       +2     
- Partials        5        6       +1

Impacted Files	Coverage Δ
rechunker/executors/util.py	`100.00% <ø> (ø)`
rechunker/executors/dask.py	`94.54% <94.00%> (-5.46%)`	⬇️
rechunker/algorithm.py	`82.45% <100.00%> (ø)`
rechunker/api.py	`100.00% <100.00%> (ø)`
rechunker/compat.py	`100.00% <100.00%> (ø)`
rechunker/executors/__init__.py	`100.00% <100.00%> (ø)`
rechunker/executors/beam.py	`100.00% <100.00%> (ø)`
rechunker/executors/prefect.py	`100.00% <100.00%> (ø)`
rechunker/executors/python.py	`100.00% <100.00%> (ø)`
rechunker/executors/pywren.py	`100.00% <100.00%> (ø)`
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c59f303...a5a3a29.

rechunker/executors/pipeline.py

rabernat · 2021-01-14T20:24:53Z

For context, the tests are currently failing due to the pre-commit mypy failure described above. That's the only blocker here.

rabernat · 2021-01-27T03:37:57Z

I just put this through its paces on Google Cloud, and I'm satisfied it is working OK for real-world use cases. I'm going to merge. It would be great if other users (maybe @rsignell-usgs?) could take this for a spin by installing rechunker from GitHub master.

rabernat requested review from shoyer and TomAugspurger December 31, 2020 23:02

rabernat marked this pull request as draft December 31, 2020 23:03

rabernat mentioned this pull request Jan 4, 2021

Total refactor pangeo-forge/pangeo-forge-recipes#27

Merged

rabernat commented Jan 7, 2021

rabernat added 17 commits January 15, 2021 11:50

RLS: v0.3.4

733cd35

add new types

73c46a0

python executor working

14caeb6

got prefect working

81d1054

made dask work with zarr arrays; now have to do dask inputs

9731e42

removed limitations on executor; still looking for source of dask failures

9c03338

works! now need to remove comments

c8b759a

cleanup

1c9fba4

fix pre-commit

eea4298

add forgotten file

7b302e3

wip

c71e9e8

dask, prefect, python executors refactored

f2bda56

isort

fdd9aec

isort

879b928

remove debugging

de2c5a3

Merge branch 'master' into refactor-executors

d493544

fix type hints

83e1fc7

rabernat force-pushed the refactor-executors branch from 9bbaa60 to 83e1fc7 Compare January 16, 2021 15:49

rabernat added 2 commits January 17, 2021 17:54

rearrange modules

7ec612d

add dedicated pipeline tests

479908f

rabernat added 2 commits January 17, 2021 21:00

refactor pre-commit ci

0ac460f

try pre-commit without installing env

36f4db8

rabernat marked this pull request as ready for review January 18, 2021 14:41

rabernat added 2 commits January 18, 2021 09:42

remove python 3.9 from CI

9ac9791

found silent prefect bug

a5a3a29

rabernat mentioned this pull request Jan 24, 2021

Fix API doc pangeo-forge/pangeo-forge-recipes#46

Merged

davidbrochart mentioned this pull request Jan 26, 2021

Example pipeline for IMERG pangeo-forge/staged-recipes#5

Open

rabernat merged commit 6cc0f26 into pangeo-data:master Jan 27, 2021

tomwhite mentioned this pull request Apr 20, 2021

Rechunker 0.4 requires Dask and Prefect #82

Closed

rabernat mentioned this pull request Jun 9, 2021

cannot use rechunker starting from 0.4.0 #92

Open

alxmrs mentioned this pull request Jul 21, 2021

Support Beam as an executor for recipe pipelines pangeo-forge/pangeo-forge-recipes#169

Closed
