Compatibility for zarr-python 3.x #9552
This set of changes should be backwards compatible and work with zarr-python 2.x (so reading and writing zarr v2 data).
I'll work through zarr-python 3.x now. I think we might want to parametrize most of these tests by zarr_version=[2, 3] to confirm that we can read / write zarr v2 data with zarr-python 3.x.
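A minimal sketch of how the list of formats to parametrize over could be derived from the installed zarr-python version. The helper names `zarr_major_version` and `formats_to_test` are hypothetical and not part of xarray or zarr; the only assumption is that zarr-python 2.x handles format 2 only, while 3.x handles both.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def zarr_major_version() -> int | None:
    """Major version of the installed zarr-python, or None if it is not installed."""
    try:
        return int(version("zarr").split(".")[0])
    except PackageNotFoundError:
        return None


def formats_to_test(major: int | None) -> list[int]:
    """Hypothetical helper: zarr formats worth exercising for a given zarr-python major."""
    if major is None:
        return []  # nothing to test without zarr-python
    # Assumption: 2.x reads/writes format 2 only; 3.x reads/writes both.
    return [2, 3] if major >= 3 else [2]


ZARR_FORMATS = formats_to_test(zarr_major_version())
```

A list like `ZARR_FORMATS` could then be passed to `pytest.mark.parametrize`, so the same test body runs once per format.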
xarray/backends/zarr.py
if _zarr_v3() and zarr_array.metadata.zarr_format == 3:
    encoding["codec_pipeline"] = [
        x.to_dict() for x in zarr_array.metadata.codecs
Maybe this instead?
x.to_dict() for x in zarr_array.metadata.codecs
zarr_array.metadata.to_dict()["codecs"]
A bit wasteful since everything has to be serialized, but presumably zarr knows better how to serialize the codec pipeline than we do here?
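To make the trade-off concrete, here is a sketch with stand-in classes. `FakeCodec` and `FakeArrayMetadata` are hypothetical and only mimic the `to_dict()` shape discussed above; they are not zarr-python classes. In this toy the two approaches agree; the suggestion is aimed at the case where the library's own serializer does more than a per-codec `to_dict()` would.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCodec:
    # Hypothetical stand-in for a codec object.
    name: str
    configuration: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {"name": self.name, "configuration": dict(self.configuration)}


@dataclass
class FakeArrayMetadata:
    # Hypothetical stand-in for array metadata holding a codec pipeline.
    zarr_format: int
    codecs: tuple

    def to_dict(self) -> dict:
        # The metadata object owns the serialization of its codecs.
        return {
            "zarr_format": self.zarr_format,
            "codecs": [c.to_dict() for c in self.codecs],
        }


meta = FakeArrayMetadata(
    zarr_format=3,
    codecs=(FakeCodec("bytes", {"endian": "little"}), FakeCodec("zstd", {"level": 0})),
)

# Approach 1: serialize each codec ourselves.
per_codec = [c.to_dict() for c in meta.codecs]

# Approach 2: serialize everything, then pick out the codecs.
via_metadata = meta.to_dict()["codecs"]
```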
* removed open_consolidated workarounds
* removed _store_version check
* pass through zarr_version
- skip write_empty_chunks on 3.x
- update patch targets
Great progress here @TomAugspurger. I'm impressed by how little you've changed in the backend itself and I'm noting the pain around testing (I felt some of that w/ dask as well).
I just pushed a commit reverting the changes to avoid values equal to the I think this is ready to go once CI finishes. I expect upstream-ci to fail on the
There's one typing failure we might want to address:
I'll do some reading about how best to handle type annotations when the proper type depends on the version of a dependency. Edit: a complication here is that this is in
I don't see why the typing of
Good catch, this affects both. I was hoping something like this would work:
from pathlib import Path
try:
    from zarr.storage import StoreLike as _StoreLike
except ImportError:
    _StoreLike = str | Path
StoreLike = type[_StoreLike]
def f(x: StoreLike) -> StoreLike:
    return x
but mypy doesn't like that
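For reference, a runtime-only sketch of the same fallback idea. Two things here are assumptions rather than anything settled in this thread: the alias is used directly (since `type[X]` annotates the class object itself rather than an instance of it), and the fallback covers plain paths only. It shows that the try/except alias is unproblematic at runtime; it makes no claim about satisfying mypy.

```python
from __future__ import annotations

from pathlib import Path
from typing import Union

try:
    # Importable on zarr-python 3.x, per the snippet above.
    from zarr.storage import StoreLike
except ImportError:
    # Fallback alias (an assumption): plain string or Path locations only.
    StoreLike = Union[str, Path]


def f(x: StoreLike) -> StoreLike:
    # With the __future__ import, annotations are not evaluated at call time,
    # so whichever alias was bound above only matters to static checkers.
    return x
```

Because the annotation is never evaluated when `f` is called, the function behaves the same whichever branch bound the alias.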
My 2 cents... we should not get hung up on this right now. a) there are plenty of other failures in the upstream-dev-mypy check unrelated to this PR and b) it's probably not worth hacking something in here when there are bigger issues with the upstream zarr implementation to sort out.
Thanks @TomAugspurger et al. This looks good. I have some minor comments, which I can address later today.
zarr.consolidate_metadata(self.zarr_group.store)
kwargs = {}
if _zarr_v3():
    # https://github.com/zarr-developers/zarr-python/pull/2113#issuecomment-2386718323
Can this be removed at some point in the future? If so, it would be good to add a TODO
I'll look more closely later, but for now I think this will be required, following a deliberate change in zarr v3 consolidated metadata.
With v2 metadata, I think that consolidation happened at the store level, and was all-or-nothing. If you have two Groups with Arrays, the consolidated metadata will be placed at the store root, and will contain everything:
# zarr v2
In [1]: import json, xarray as xr
In [2]: store = {}
In [3]: a = xr.tutorial.load_dataset("air_temperature")
In [4]: b = xr.tutorial.load_dataset("rasm")
In [5]: a.to_zarr(store=store, group="A")
/Users/tom/gh/zarr-developers/zarr-v2/.direnv/python-3.10/lib/python3.10/site-packages/xarray/core/dataset.py:2562: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
return to_zarr( # type: ignore[call-overload,misc]
Out[5]: <xarray.backends.zarr.ZarrStore at 0x11113edc0>
In [6]: b.to_zarr(store=store, group="B")
Out[6]: <xarray.backends.zarr.ZarrStore at 0x10cab2440>
In [7]: list(json.loads(store['.zmetadata'])['metadata'])
Out[7]: # contains nodes from both A and B
['.zgroup',
'A/.zattrs',
'A/.zgroup',
'A/air/.zarray',
'A/air/.zattrs',
'A/lat/.zarray',
'A/lat/.zattrs',
'A/lon/.zarray',
'A/lon/.zattrs',
'A/time/.zarray',
'A/time/.zattrs',
'B/.zattrs',
'B/.zgroup',
'B/Tair/.zarray',
'B/Tair/.zattrs',
'B/time/.zarray',
'B/time/.zattrs',
'B/xc/.zarray',
'B/xc/.zattrs',
'B/yc/.zarray',
'B/yc/.zattrs']
With v3, consolidated metadata is scoped to a Group, so we can provide the group we want to consolidate (the zarr-python API does support "consolidate everything in the store at the root", but I don't think we want that because you'd need to open it at the root when reading, and I think it's kinda weird for ds.to_zarr(group="A") to be reading / writing stuff outside of the A prefix).
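The difference in scope can be illustrated with a plain dict standing in for the store. This is a sketch only: `consolidate_v2` imitates the store-level, all-or-nothing behaviour described above, and `keys_under_group` is a hypothetical helper showing the subset a group-scoped consolidation would cover. Neither is zarr-python API.

```python
from __future__ import annotations

import json

META_NAMES = (".zgroup", ".zarray", ".zattrs")


def consolidate_v2(store: dict[str, bytes]) -> None:
    """Imitate v2 consolidation: gather every metadata document under one root key."""
    metadata = {
        key: json.loads(value)
        for key, value in store.items()
        if key.rsplit("/", 1)[-1] in META_NAMES
    }
    store[".zmetadata"] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": metadata}
    ).encode()


def keys_under_group(store: dict[str, bytes], group: str) -> list[str]:
    """Hypothetical helper: the consolidated keys that live under one group prefix."""
    consolidated = json.loads(store[".zmetadata"])["metadata"]
    prefix = group.strip("/") + "/"
    return sorted(k for k in consolidated if k.startswith(prefix))


store = {
    ".zgroup": b'{"zarr_format": 2}',
    "A/.zgroup": b'{"zarr_format": 2}',
    "A/air/.zarray": b'{"shape": [4]}',
    "A/air/0": b"\x00",  # chunk data, ignored by consolidation
    "B/.zgroup": b'{"zarr_format": 2}',
    "B/Tair/.zarray": b'{"shape": [2]}',
}
consolidate_v2(store)

all_keys = sorted(json.loads(store[".zmetadata"])["metadata"])  # nodes from A and B
only_a = keys_under_group(store, "A")  # what a group-scoped view of A would hold
```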
Potentially it would make sense to have two versions of consolidated metadata:
- Everything at a specific group/node level
- Everything in a group and all of its subgroups (i.e., for DataTree)
Agreed. zarr-developers/zarr-specs#309 has some discussion on adding a depth field to the spec for consolidated metadata. That's currently implicitly depth=None, which is everything below a group. depth=0 or 1 would be just the immediate children. That's not standardized or implemented anywhere yet, but the current implementation is forwards compatible and it shouldn't be a ton of effort.
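A sketch of what such a depth filter might look like. Since nothing is standardized, the semantics here are an assumption: `depth=None` keeps everything below the group, and `depth=1` is taken to mean immediate children only (the thread leaves open whether that would be spelled 0 or 1). `within_depth` is a hypothetical helper.

```python
from __future__ import annotations


def within_depth(path: str, depth: int | None) -> bool:
    """True if `path` (relative to the consolidating group) is within `depth` levels."""
    if depth is None:
        return True  # current implicit behaviour: everything below the group
    return len(path.strip("/").split("/")) <= depth


# Node paths relative to the group being consolidated (made-up names).
nodes = ["air", "lat", "sub", "sub/x", "sub/deeper/y"]

immediate = [p for p in nodes if within_depth(p, 1)]
everything = [p for p in nodes if within_depth(p, None)]
```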
* main:
  Fix multiple grouping with missing groups (pydata#9650)
  flox: Properly propagate multiindex (pydata#9649)
  Update Datatree html repr to indicate inheritance (pydata#9633)
  Re-implement map_over_datasets using group_subtrees (pydata#9636)
  fix zarr intersphinx (pydata#9652)
  Replace black and blackdoc with ruff-format (pydata#9506)
  Fix error and missing code cell in io.rst (pydata#9641)
  Support alternative names for the root node in DataTree.from_dict (pydata#9638)
  Updates to DataTree.equals and DataTree.identical (pydata#9627)
  DOC: Clarify error message in open_dataarray (pydata#9637)
  Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
  Type check datatree tests (pydata#9632)
  Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
  Bug fixes for DataTree indexing and aggregation (pydata#9626)
  Add inherit=False option to DataTree.copy() (pydata#9628)
  docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
  Migration guide for users of old datatree repo (pydata#9598)
  Reimplement Datatree typed ops (pydata#9619)
Let's get this in by the end of the week.
* main:
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
👏 Thanks all! Especially @TomAugspurger for doing the lion's share of the work here.
* main:
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
* main: (85 commits)
  Refactor out utility functions from to_zarr (pydata#9695)
  Use the same function to floatize coords in polyfit and polyval (pydata#9691)
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  ...
This PR begins the process of adding compatibility with zarr-python 3.x. It's intended to be run against zarr-python v3 + the open PRs referenced in #9515.
All of the zarr test cases should be parameterized by zarr_format=[2, 3] with zarr-python 3.x to exercise reading and writing both formats.
This is currently passing with zarr-python==2.18.3.
zarr-python 3.x has about 61 failures, all of which are related to data types that aren't yet implemented in zarr-python 3.x.
I'll also note that #5475 is going to become a larger issue once people start writing Zarr-V3 datasets.
_FillValue really the same as zarr's fill_value? #5475
whats-new.rst