Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compatibility for zarr-python 3.x #9552

Merged
merged 93 commits into from
Oct 23, 2024
Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Sep 27, 2024

This PR begins the process of adding compatibility with zarr-python 3.x. It's intended to be run against zarr-python v3 + the open PRs referenced in #9515.

All of the zarr test cases should be parameterized by zarr_format=[2, 3] with zarr-python 3.x to exercise reading and writing both formats.

This is currently passing with zarr-python==2.18.3. zarr-python 3.x has about 61 failures, all of which are related to data types that aren't yet implemented in zarr-python 3.x.

I'll also note that #5475 is going to become a larger issue once people start writing Zarr-V3 datasets.

@TomAugspurger TomAugspurger mentioned this pull request Sep 27, 2024
4 tasks
@TomAugspurger TomAugspurger force-pushed the fix/zarr-v3 branch 2 times, most recently from 1ed4ef1 to bb2bb6c Compare September 30, 2024 14:04
Copy link
Contributor Author

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This set of changes should be backwards compatible and work with zarr-python 2.x (so reading and writing zarr v2 data).

I'll work through zarr-python 3.x now. I think we might want to parametrize most of these tests by zarr_version=[2, 3] to confirm that we can read / write zarr v2 data with zarr-python 3.x

xarray/backends/zarr.py Show resolved Hide resolved
xarray/backends/zarr.py Show resolved Hide resolved

if _zarr_v3() and zarr_array.metadata.zarr_format == 3:
encoding["codec_pipeline"] = [
x.to_dict() for x in zarr_array.metadata.codecs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this instead?

Suggested change
x.to_dict() for x in zarr_array.metadata.codecs
zarr_array.metadata.to_dict()["codecs"]

A bit wasteful since everything has to be serialized, but presumably zarr knows better how to serialize the codec pipeline than we do here?

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress here @TomAugspurger. I'm impressed by how little you've changed in the backend itself and I'm noting the pain around testing (I felt some of that w/ dask as well).

xarray/backends/zarr.py Show resolved Hide resolved
xarray/backends/zarr.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

I just pushed a commit reverting the changes to avoid values equal to the fill_value in the test cases, which aren't needed after Ryan's changes to fill value handling.

I think this is ready to go once CI finishes. I expect upstream-ci to fail on the xarray/tests/test_groupby.py::test_gappy_resample_reductions tests, but those should be unrelated.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 15, 2024

There's one typing failure we might want to address:

xarray/tests/test_backends_datatree.py: note: In member "test_to_zarr_zip_store" of class "TestZarrDatatreeIO":
xarray/tests/test_backends_datatree.py:289: error: Argument 1 to "open_datatree" has incompatible type "ZipStore"; expected "str | PathLike[Any] | BufferedIOBase | AbstractDataStore"  [arg-type]

I'll do some reading about how best to handle type annotations when the proper type depends on the version of a dependency.

Edit: a complication here is that this is in open_datatree, which I think supports other backends like NetCDF too. It's not immediately clear to me whether the signature can of open_datatree needs to match for every implementation. This can maybe be addressed once we fully support datatree with zarr-python 3?

@TomNicholas
Copy link
Member

TomNicholas commented Oct 16, 2024

Edit: a complication here is that this is in open_datatree, which I think supports other backends like NetCDF too. It's not immediately clear to me whether the signature can of open_datatree needs to match for every implementation. This can maybe be addressed once we fully support datatree with zarr-python 3?

I don't see why the typing of open_datatree(<zarr-store>) would need to be any different to that of open_dataset(<zarr-store>). But I think it's fine to ignore this for now as we know we need to come back to it in another PR anyway.

@TomAugspurger
Copy link
Contributor Author

Good catch, this affects both.

I was hoping something like this would work:

from pathlib import Path

try:
    from zarr.storage import StoreLike as _StoreLike

except ImportError:
    _StoreLike = str | Path

StoreLike = type[_StoreLike]


def f(x: StoreLike) -> StoreLike:
    return x

but mypy doesn't like that

test.py:7: error: Cannot assign multiple types to name "_StoreLike" without an explicit "Type[...]" annotation  [misc]
Found 1 error in 1 file (checked 1 source file)

@jhamman
Copy link
Member

jhamman commented Oct 18, 2024

but mypy doesn't like that

my 2 cents... we should not get hung up on this right now. a) there are plenty of other failures in the upstream-dev-mypy check unrelated to this PR and b) its probably not worth hacking something in here when there are bigger issues with the upstream zarr implementation to sort out.

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @TomAugspurger et al. This looks good. I have some minor comments, which I can address later today.

xarray/backends/zarr.py Outdated Show resolved Hide resolved
zarr.consolidate_metadata(self.zarr_group.store)
kwargs = {}
if _zarr_v3():
# https://github.com/zarr-developers/zarr-python/pull/2113#issuecomment-2386718323
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed at some point in the future? If so, it would be good to add a TODO

Copy link
Contributor Author

@TomAugspurger TomAugspurger Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look more closely later, but for now I think this will be required, following a deliberate change in zarr v3 consolidated metadata.

With v2 metadata, I think that consolidated happened at the store-level, and was all-or-nothing. If you have two Groups with Arrays, the consolidated metadata will be placed at the store root, and will contain everything:

# zarr v2

In [1]: import json, xarray as xr

In [2]: store = {}

In [3]: a = xr.tutorial.load_dataset("air_temperature")

In [4]: b = xr.tutorial.load_dataset("rasm")

In [5]: a.to_zarr(store=store, group="A")
/Users/tom/gh/zarr-developers/zarr-v2/.direnv/python-3.10/lib/python3.10/site-packages/xarray/core/dataset.py:2562: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore[call-overload,misc]
Out[5]: <xarray.backends.zarr.ZarrStore at 0x11113edc0>

In [6]: b.to_zarr(store=store, group="B")
Out[6]: <xarray.backends.zarr.ZarrStore at 0x10cab2440>

In [7]: list(json.loads(store['.zmetadata'])['metadata'])
Out[7]:  # contains nodes from both A and B
['.zgroup',
 'A/.zattrs',
 'A/.zgroup',
 'A/air/.zarray',
 'A/air/.zattrs',
 'A/lat/.zarray',
 'A/lat/.zattrs',
 'A/lon/.zarray',
 'A/lon/.zattrs',
 'A/time/.zarray',
 'A/time/.zattrs',
 'B/.zattrs',
 'B/.zgroup',
 'B/Tair/.zarray',
 'B/Tair/.zattrs',
 'B/time/.zarray',
 'B/time/.zattrs',
 'B/xc/.zarray',
 'B/xc/.zattrs',
 'B/yc/.zarray',
 'B/yc/.zattrs']

With v3, consolidated metadata is scoped to a Group, so we can provide the group we want to consolidated (the zarr-python API does support "consolidate everything in the store at the root", but I don't think we want that because you'd need to open it at the root when reading, and I think it's kinda where for ds.to_zarr(group="A") to be reading / writing stuff outside of the A prefix).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially it would make sense to have two versions of consolidated metadata:

  • Everything at a specific group/node level
  • Everything in a group and all of its subgroups (i.e., for DataTree)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. zarr-developers/zarr-specs#309 has some discussion on adding a depth field to the spec for consolidated metadata. That's currently implicitly depth=None, which is everything below a group. depth=0 or 1 would be just the immediate children. That's not standardized or implemented anywhere yet, but the current implementation is forwards compatible and it shouldn't be a ton of effort.

xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Show resolved Hide resolved
xarray/tests/test_backends.py Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Show resolved Hide resolved
dcherian and others added 4 commits October 21, 2024 14:49
* main:
  Fix multiple grouping with missing groups (pydata#9650)
  flox: Properly propagate multiindex (pydata#9649)
  Update Datatree html repr to indicate inheritance (pydata#9633)
  Re-implement map_over_datasets using group_subtrees (pydata#9636)
  fix zarr intersphinx (pydata#9652)
  Replace black and blackdoc with ruff-format (pydata#9506)
  Fix error and missing code cell in io.rst (pydata#9641)
  Support alternative names for the root node in DataTree.from_dict (pydata#9638)
  Updates to DataTree.equals and DataTree.identical (pydata#9627)
  DOC: Clarify error message in open_dataarray (pydata#9637)
  Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
  Type check datatree tests (pydata#9632)
  Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
  Bug fixes for DataTree indexing and aggregation (pydata#9626)
  Add inherit=False option to DataTree.copy() (pydata#9628)
  docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
  Migration guide for users of old datatree repo (pydata#9598)
  Reimplement Datatree typed ops (pydata#9619)
@dcherian dcherian added the plan to merge Final call for comments label Oct 21, 2024
@dcherian
Copy link
Contributor

Let's get this in by the end of the week.

* main:
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
@dcherian dcherian merged commit b133fdc into pydata:main Oct 23, 2024
27 of 29 checks passed
@jhamman
Copy link
Member

jhamman commented Oct 23, 2024

👏 Thanks all! Especially @TomAugspurger for doing the lion's share of the work here.

@TomAugspurger TomAugspurger deleted the fix/zarr-v3 branch October 23, 2024 19:51
@TomAugspurger TomAugspurger restored the fix/zarr-v3 branch October 23, 2024 19:51
dcherian added a commit to dcherian/xarray that referenced this pull request Oct 29, 2024
* main:
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 3, 2024
* main: (85 commits)
  Refactor out utility functions from to_zarr (pydata#9695)
  Use the same function to floatize coords in polyfit and polyval (pydata#9691)
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-upstream Run upstream CI topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is _FillValue really the same as zarr's fill_value?
7 participants