support chunks in open_groups and open_datatree #9660

Merged: 23 commits, Oct 24, 2024

keewis (Collaborator) commented Oct 22, 2024

In trying to support chunks the way we do for open_dataset I had to add a lot of parameters to the top-level open_datatree and open_groups functions.

I'm also still looking for the equivalent of _protect_dataset_variables_inplace. Finally, _dataset_from_backend_dataset has so far been the place to call set_close, whereas #9651 moved this into the backends for datatree.

No tests yet, and I also want to improve the docstrings before merging.

(the chunked array methods – chunk, load, compute, and persist – should be a separate PR)

  • complete docstrings
  • Tests added

cc @TomNicholas, @shoyer

TomNicholas added the topic-backends, topic-DataTree (Related to the implementation of a DataTree class), and topic-chunked-arrays (Managing different chunked backends, e.g. dask) labels Oct 22, 2024

TomNicholas (Member) left a comment:

That was quick, @keewis!

sjperkins commented:

Thanks for doing this @keewis!

}
)

# ds.set_close(backend_ds._close)

Member commented:

backend_tree should have been created using datatree_from_dict_with_io_cleanup, so one way to handle this could be just to copy over the _close attribute from every node of backend_tree?

keewis (Collaborator, Author) replied:

the question is, do we even need that here? I copied this from open_dataset where this is explicitly set, but since datatree_from_dict_with_io_cleanup does this already we might be able to just remove it?

The only reason why I kept the commented-out line is to discuss whether the shift in paradigm (have the backend set _close vs. do it for all backends the same way) is intentional, and if we should do the same for open_dataset.

Member replied:

I agree it would be nice to remove this, I'm just worried that mapping over each .dataset might not properly propagate ._close (does it? should it?)

keewis (Collaborator, Author) replied:

It does not (I think), so I'm explicitly copying it over. So far that doesn't appear to cause anything to break.
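
To illustrate the point discussed in this thread: rebuilding a tree by mapping a function over each node's data produces fresh nodes, so any per-node closer has to be copied across explicitly. The following is a minimal pure-Python sketch of that idea, not xarray's implementation. `Node`, `walk`, `map_data`, and `copy_closers` are hypothetical names invented for illustration; only `_close` and `set_close` echo names used in the discussion.

```python
class Node:
    """Hypothetical stand-in for a tree node; this is not xarray's DataTree."""

    def __init__(self, data, children=None, close=None):
        self.data = data
        self.children = dict(children or {})
        self._close = close

    def set_close(self, close):
        self._close = close

    def close(self):
        # Call the closer at most once, then forget it.
        if self._close is not None:
            self._close()
        self._close = None


def walk(node, path="/"):
    """Yield (path, node) pairs for the node and all of its descendants."""
    yield path, node
    for name, child in node.children.items():
        yield from walk(child, f"{path.rstrip('/')}/{name}")


def map_data(node, func):
    """Rebuild the tree with transformed data; closers are NOT carried over."""
    return Node(
        func(node.data),
        {name: map_data(child, func) for name, child in node.children.items()},
    )


def copy_closers(source, target):
    """Copy the closer of every source node to the target node at the same path."""
    targets = dict(walk(target))
    for path, node in walk(source):
        targets[path].set_close(node._close)


closed = []  # records which closers have run
backend_tree = Node(
    [1, 2],
    {"child": Node([3], close=lambda: closed.append("/child"))},
    close=lambda: closed.append("/"),
)

# Mapping over the data yields new nodes that have no closer attached.
tree = map_data(backend_tree, lambda values: [v * 2 for v in values])
lost = [path for path, node in walk(tree) if node._close is None]

# Copying the closers over restores the cleanup behaviour on the new tree.
copy_closers(backend_tree, tree)
for _, node in walk(tree):
    node.close()
```

In this toy model, `lost` lists every path of the mapped tree, which is why the explicit copy step is needed before the new tree can release the resources held by the original one.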

keewis (Collaborator, Author) commented Oct 23, 2024

I've copied the docstring of open_dataset over to open_groups and open_datatree, and changed _datatree_from_backend_datatree to both copy over _close and protect the data against modifications. This should therefore be ready for another round of reviews and possibly merging.
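
As a rough illustration of what "protect the data against modifications" can mean, here is a minimal copy-on-write sketch in plain Python. It assumes the protection works by copying the underlying values on the first write, so that the data owned by the backend is never changed in place. `CopyOnWriteList` is a hypothetical class for illustration and does not reproduce xarray's internals.

```python
class CopyOnWriteList:
    """Hypothetical copy-on-write wrapper; not xarray's implementation."""

    def __init__(self, values):
        self._values = values  # shared with the owner until the first write
        self._copied = False

    def __getitem__(self, index):
        return self._values[index]

    def __setitem__(self, index, value):
        # Take a private copy before the first modification only.
        if not self._copied:
            self._values = list(self._values)
            self._copied = True
        self._values[index] = value

    def __len__(self):
        return len(self._values)


backend_values = [1.0, 2.0, 3.0]  # stands in for data owned by an open file
protected = CopyOnWriteList(backend_values)

# Writing through the wrapper changes the private copy, not the original.
protected[0] = 99.0
```

Reads stay cheap because no copy is made until something is actually written, and the original values remain untouched afterwards.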

TomNicholas (Member) left a comment:

Looks great!

@keewis keewis mentioned this pull request Oct 24, 2024
2 tasks
TomNicholas added the plan to merge (Final call for comments) label Oct 24, 2024

TomNicholas (Member) commented:

I'm happy to merge this, @keewis?

aladinor added a commit to aladinor/xarray that referenced this pull request Oct 24, 2024
keewis (Collaborator, Author) commented Oct 24, 2024

> I'm happy to merge this @keewis ?

me too! Feel free to go ahead and merge.

TomNicholas merged commit 521b087 into pydata:main Oct 24, 2024
35 checks passed
keewis deleted the open_datatree-dask branch October 24, 2024 17:40
TomNicholas added a commit that referenced this pull request Oct 24, 2024
* adding draft for fixing behaviour for group parameter

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* new trial

* new trial

* fixing duplicate paths and path in the root group

* removing yield str(gpath)

* implementing the proposed solution to hdf5 and netcdf backends

* adding changes to whats-new.rst

* removing encoding['source_group'] line to avoid conflicts with PR #9660

* adding test

* adding test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adding assert subgroup_tree.root.parent is None

* modifying tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update xarray/tests/test_backends_datatree.py

Co-authored-by: Justus Magin

* applying suggested changes

* updating test

* adding Justus and Alfonso to the list of contributors to the DataTree entry

* adding Justus and Alfonso to the list of contributors to the DataTree entry

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tom Nicholas
Co-authored-by: Justus Magin
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 29, 2024
* main:
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 3, 2024
* main: (85 commits)
  Refactor out utility functions from to_zarr (pydata#9695)
  Use the same function to floatize coords in polyfit and polyval (pydata#9691)
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  ...
Labels: plan to merge (Final call for comments), topic-backends, topic-chunked-arrays (Managing different chunked backends, e.g. dask), topic-DataTree (Related to the implementation of a DataTree class)
5 participants