-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor the data model for excutors #38
Conversation
I originally thought that it made the most sense to think of rechunking as a chain of chunked array copies. After going through the exercise of trying to write an Executor that doesn't use temporary arrays (pangeo-data#36), I realize now that this was overly generic. *Sometimes* it might make sense to implement a rechunking this way, but Rechunker's algorithm adds more structure than that. There is always a "push/pull" structure that splits read chunks into intermediate chunks, and then combines intermediate chunks into write chunks. This structure wasn't evident from current data model and I needed it for my executor without temporaries, so this PR changes CopySpec to directly keep track of "read/intermediate/write" steps. It's still convenient sometimes to use the "chain of copies" representation, so I've added the utility function `split_into_direct_copies()` for converting to this representation in `rechunker.executors.util`.
Codecov Report
@@ Coverage Diff @@
## master #38 +/- ##
==========================================
+ Coverage 93.39% 93.55% +0.16%
==========================================
Files 7 7
Lines 318 326 +8
Branches 65 66 +1
==========================================
+ Hits 297 305 +8
Misses 10 10
Partials 11 11
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good thanks. Planning to merge later today if there's no objections.
read : ArrayProxy | ||
Read proxy with an ``array`` attribute that supports ``__getitem__``. | ||
intermediate : ArrayProxy | ||
Intermediate proxy with either an ``array`` that is either ``None`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation. I was confused about how direct copies would be represented.
Thanks @shoyer! |
I originally thought that it made the most sense to think of rechunking as a chain of chunked array copies.
After going through the exercise of trying to write an Executor that doesn't use temporary arrays (#36), I realize now that this was overly generic. Sometimes it might make sense to implement a rechunking this way, but Rechunker's algorithm adds more structure than that. There is always a "push/pull" structure that splits read chunks into intermediate chunks, and then combines intermediate chunks into write chunks.
This structure wasn't evident from current data model and I needed it for my executor without temporaries, so this PR changes CopySpec to directly keep track of "read/intermediate/write" steps.
I've used this new representation in
api.py
and the dask executor. It's still convenient sometimes to use the "chain of copies" representation, so I've added the utility functionsplit_into_direct_copies()
for converting to this representation inrechunker.executors.util
.