Adding support for MergeDims and Split Variables to FilePatternToChunks transform. #39
Conversation
if not self._max_sizes:
    self._max_sizes = dataset.sizes
I missed this before, but why is this caching needed/useful? In general, I would guess it's probably a bad idea to make stateful Beam transforms, since that breaks one of the underlying assumptions of Beam's data model.
Thanks, that's helpful to know that stateful transforms are discouraged; I'll keep that in mind for the future.
This caching was the simplest way I could think of to calculate the correct offsets for the keys. See this code-block for context.
When I calculated the offsets from the current dataset's sizes, the last offsets always came out wrong (please see the code-block linked above). Caching the first dataset's sizes and letting the 0-indexed dim.index handle the rest was the simplest way I could find to get the right starting offset.
From what I can tell, this data is safe to cache. Those are, however, famous last words in parallel programming.
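A rough sketch of what the caching buys, assuming `dim` carries `.index` and `.name` as in the pattern keys; the offset arithmetic below is illustrative, not the exact PR code:

```python
# Illustrative sketch only. The last chunk along a concat dimension may be
# shorter than the others, so offsets must be derived from the sizes of the
# first dataset seen, not from the current one.
if not self._max_sizes:
    self._max_sizes = dataset.sizes  # cache the first dataset's sizes

# dim.index is the 0-based position of this file along the concat dimension,
# so the starting offset is simply index * (size of one full chunk).
offset = dim.index * self._max_sizes[dim.name]
```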
Co-authored-by: Stephan Hoyer <[email protected]>
I'm testing this PR end-to-end with a script that uses this file pattern for a GRIB 2 dataset (uses the
Traces in the logs show that threads are acquiring a lock, though it's unclear whether this is simply a large dataset taking a while to process. log 1
log 2
This seems to me like an instance of pydata/xarray#4591. Right now, I'm going to experiment with changing the scheduler to use a single thread in the
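For reference, a minimal way to pin dask to single-threaded execution while debugging; the `dataset` and `indexers` names are placeholders, not code from this PR:

```python
import dask

# "synchronous" runs every dask task in the calling thread, which removes
# thread contention from the picture while investigating the suspected lock.
with dask.config.set(scheduler="synchronous"):
    chunk = dataset.isel(indexers).load()
```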
I should mention: the Dataflow diagnostics for the report above are showing unresponsive threads, which makes a deadlock scenario more plausible.
Deadlocking seems plausible, but this is different from pydata/xarray#4591, which describes a serialization failure.
…f with fsspec). This avoids contention issues with dask, fsspec, and xarray (including a possible deadlock).
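A hedged sketch of that approach, assuming the commit refers to opening each file with fsspec before handing it to xarray (the URL and engine below are placeholders):

```python
import fsspec
import xarray as xr

# Open the remote file with fsspec up front and pass the file object to
# xarray, so dask workers never have to open remote files concurrently
# themselves.
with fsspec.open("gs://bucket/path/to/file.nc", mode="rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    ds.load()  # materialize while the file handle is still open
```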
This has gone stale.
Fixes #29 and #38.