Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the data model for excutors #38

Merged
merged 1 commit into from
Aug 4, 2020

Conversation

shoyer
Copy link
Collaborator

@shoyer shoyer commented Jul 31, 2020

I originally thought that it made the most sense to think of rechunking as a chain of chunked array copies.

After going through the exercise of trying to write an Executor that doesn't use temporary arrays (#36), I realize now that this was overly generic. Sometimes it might make sense to implement a rechunking this way, but Rechunker's algorithm adds more structure than that. There is always a "push/pull" structure that splits read chunks into intermediate chunks, and then combines intermediate chunks into write chunks.

This structure wasn't evident from current data model and I needed it for my executor without temporaries, so this PR changes CopySpec to directly keep track of "read/intermediate/write" steps.

I've used this new representation in api.py and the dask executor. It's still convenient sometimes to use the "chain of copies" representation, so I've added the utility function split_into_direct_copies() for converting to this representation in rechunker.executors.util.

I originally thought that it made the most sense to think of rechunking as a
chain of chunked array copies.

After going through the exercise of trying to write an Executor that doesn't
use temporary arrays (pangeo-data#36), I
realize now that this was overly generic. *Sometimes* it might make sense to
implement a rechunking this way, but Rechunker's algorithm adds more structure
than that. There is always a "push/pull" structure that splits read chunks into
intermediate chunks, and then combines intermediate chunks into write chunks.

This structure wasn't evident from current data model and I needed it for my
executor without temporaries, so this PR changes CopySpec to directly keep
track of "read/intermediate/write" steps.

It's still convenient sometimes to use the "chain of copies" representation,
so I've added the utility function `split_into_direct_copies()` for converting
to this representation in `rechunker.executors.util`.
@codecov
Copy link

codecov bot commented Jul 31, 2020

Codecov Report

Merging #38 into master will increase coverage by 0.16%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
+ Coverage   93.39%   93.55%   +0.16%     
==========================================
  Files           7        7              
  Lines         318      326       +8     
  Branches       65       66       +1     
==========================================
+ Hits          297      305       +8     
  Misses         10       10              
  Partials       11       11              
Impacted Files Coverage Δ
rechunker/api.py 90.75% <100.00%> (+0.23%) ⬆️
rechunker/executors/beam.py 100.00% <100.00%> (ø)
rechunker/executors/dask.py 100.00% <100.00%> (ø)
rechunker/executors/python.py 100.00% <100.00%> (ø)
rechunker/executors/util.py 100.00% <100.00%> (ø)
rechunker/types.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 22e6ca0...278e270. Read the comment docs.

Copy link
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks. Planning to merge later today if there's no objections.

read : ArrayProxy
Read proxy with an ``array`` attribute that supports ``__getitem__``.
intermediate : ArrayProxy
Intermediate proxy with either an ``array`` that is either ``None``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. I was confused about how direct copies would be represented.

@TomAugspurger TomAugspurger merged commit 685a8b2 into pangeo-data:master Aug 4, 2020
@TomAugspurger
Copy link
Member

Thanks @shoyer!

@shoyer shoyer deleted the better-data-model branch August 4, 2020 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants