zarr-python v3 compatibility #516
base: main
Conversation
There's much less change here than I might have thought.
kerchunk/hdf.py (outdated)
@@ -496,6 +516,8 @@ def _translator(
                if h5obj.fletcher32:
                    logging.info("Discarding fletcher32 checksum")
                    v["size"] -= 4
                key = str.removeprefix(h5obj.name, "/") + "/" + ".".join(map(str, k))
This is the same as what `_chunk_key` did? Maybe make it a function with a comment saying it's a copy/reimplementation. By the way, is `h5obj.name` not actually a string, so you could have done `h5obj.name.removeprefix()`?
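Something along these lines, as a sketch only (the helper name `_v3_chunk_key` and its exact placement are hypothetical, not code from this PR):

```python
def _v3_chunk_key(h5obj, chunk_index):
    """Build a store key like "depth/0.0" from an h5py object and a chunk index.

    Mirrors what _chunk_key does elsewhere; kept as a separate helper so the
    copy/reimplementation is documented in one place.
    """
    # h5obj.name is a plain str such as "/depth", so the method form would
    # also work: h5obj.name.removeprefix("/")
    prefix = str.removeprefix(h5obj.name, "/")
    return prefix + "/" + ".".join(map(str, chunk_index))
```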
kerchunk/hdf.py (outdated)
shape=h5obj.shape,
dtype=dt or h5obj.dtype,
chunks=h5obj.chunks or False,
fill_value=fill,
compression=None,
compressor=None,
So here, you could reintroduce the compressor:

compressor = filters[-1]
filters = filters[:-1]

but obviously it depends on whether there are indeed any filters at all. It would still need back-compat, since filters-only datasets definitely exist.
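A rough sketch of that split with a naive guard (the function name is made up, and treating the last filter as the compression stage is an assumption, as the comment above notes):

```python
def split_compressor(filters):
    """Sketch: promote the final stage of the filter pipeline to the zarr
    "compressor" slot, keeping the rest as filters.

    Note: for filters-only datasets (no compression stage at all) this would
    wrongly promote an ordinary filter, so real code still needs back-compat
    handling.
    """
    if filters:
        return filters[:-1], filters[-1]
    return filters, None
```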
Yeah, the big issue is that v3 cares about what type of operation it is and v2 doesn't, so moving them around doesn't necessarily fix that bug.
So there needs to be a change upstream?
Yes, this: zarr-developers/zarr-python#2325
kerchunk/hdf.py (outdated)
for k, v in self.store_dict.items():
    if isinstance(v, zarr.core.buffer.cpu.Buffer):
        key = str.removeprefix(k, "/")
        new_keys[key] = v.to_bytes()
        keys_to_remove.append(k)
for k in keys_to_remove:
    del self.store_dict[k]
This is the hacky bit and could use some explanation. Even when requesting "v2", zarr makes Buffer objects, and the keys are also wrong?
Yeah, so two issues here:
- the keys we get from hdf are, for example, `/depth/.zarray`, when they need to be `depth/.zarray`
- we can't jsonify buffers, which is how the internal `MemoryStore` in v3 stores its data, so we need to convert the buffers to bytes to be serialized
OK - would appreciate comments on the code saying this.
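For illustration, a commented version of that loop, recast as a hypothetical standalone helper (same logic as the diff above; the function name and the explicit submodule import are assumptions, not code from this PR):

```python
import zarr.core.buffer.cpu


def _buffers_to_bytes(store_dict):
    """Sketch: normalise zarr-python 3's in-memory store so it can be json-serialized."""
    new_keys = {}
    keys_to_remove = []
    for k, v in store_dict.items():
        # Even when v2-format output is requested, zarr-python 3 holds chunk
        # and metadata values as Buffer objects, which json.dumps cannot handle.
        if isinstance(v, zarr.core.buffer.cpu.Buffer):
            # The store keys also carry a leading "/" (e.g. "/depth/.zarray"),
            # whereas kerchunk references expect "depth/.zarray".
            key = str.removeprefix(k, "/")
            new_keys[key] = v.to_bytes()
            keys_to_remove.append(k)
    for k in keys_to_remove:
        del store_dict[k]
    return new_keys
```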
Also worth noting... zarr 3 doesn't support numcodecs codecs out of the box. There is a PR to help with this, zarr-developers/numcodecs#524 (see also zarr-developers/numcodecs#597 for an updated version), but it would mean a change to codec names, which causes an incompatibility. For the initial icechunk examples we handle this in virtualizarr, but long term it probably belongs here to work standalone.
When zarr-developers/zarr-python#2425 goes in, it should unblock this to work fully with zarr-python v3. We will also need to create both numcodecs and zarr v3 codec versions of all the custom kerchunk codecs, so that a given dataset can be loaded in either v2 or v3 contexts (say, if you kerchunk a grib file, then want to convert those references to an icechunk store).
Is that a subclass in each case, specifying that it is to be bytes->bytes?
Sorry, the numcodecs versions exist as they are today. Yes, the v3 version would basically be subclassing zarr.abc.codec using the numcodecs implementations of the kerchunk codecs. Although the grib decoder should really be bytes-to-array, I think.
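Purely as an illustration of that idea, not code from this PR: a sketch assuming the zarr-python 3.x codec ABC (`zarr.abc.codec.BytesBytesCodec` with async `_decode_single`/`_encode_single` and `compute_encoded_size`) and delegating to a numcodecs-registered codec; the class name `WrappedKerchunkCodec` is made up. A real GRIB port would likely subclass an array-bytes codec instead, since its decode step is bytes-to-array.

```python
from dataclasses import dataclass, field

import numcodecs
from zarr.abc.codec import BytesBytesCodec


@dataclass(frozen=True)
class WrappedKerchunkCodec(BytesBytesCodec):  # hypothetical name
    is_fixed_size = False
    codec_id: str = "zlib"                    # numcodecs id to delegate to
    codec_config: dict = field(default_factory=dict)

    def _codec(self):
        # Look the wrapped codec up via the numcodecs registry, so the same
        # pattern covers kerchunk's registered codecs too.
        return numcodecs.get_codec({"id": self.codec_id, **self.codec_config})

    async def _decode_single(self, chunk_bytes, chunk_spec):
        # Assumes the wrapped codec's decode() returns bytes-like data.
        raw = self._codec().decode(chunk_bytes.to_bytes())
        return chunk_spec.prototype.buffer.from_bytes(raw)

    async def _encode_single(self, chunk_bytes, chunk_spec):
        raw = self._codec().encode(chunk_bytes.to_bytes())
        return chunk_spec.prototype.buffer.from_bytes(raw)

    def compute_encoded_size(self, input_byte_length, chunk_spec):
        # Encoded size is not generally known up front for a compressor.
        raise NotImplementedError
```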
This is required upstream: fsspec/filesystem_spec#1734
Also of note: to get this to work with zarr 3, we pass an fsspec
Grib works as long as it is read in zarr 2 format. To be used with zarr 3, the codec needs to be ported over to the new abc.Codec class from zarr.
implement zarr3 support for grib
State of the union, post-merge with @moradology and after installing the latest fsspec and zarr 3.
A lot of these are the same errors over and over again. The hardest part will be maintaining compat for the zarr-python 2 library, which I am not sure should be a goal of this PR @martindurant
Thanks for working on this compatibility! I really appreciate all the efforts you've put into this already. My understanding is that there's still a bit of work to do. Even though we all know that Matt's super trustworthy, some concerns were raised in the VirtualiZarr coordination meeting yesterday: the dependency conflicts between icechunk, virtualizarr, zarr, and kerchunk seem more urgent with Zarr v3.0 out, and it's sketchy to recommend installing a library from a branch off a fork. I wanted to float the idea of either moving Matt's PR to a branch within fsspec/kerchunk or adding a fsspec/kerchunk@zarr-v3 branch to mitigate these concerns until zarr-python v3 compatibility is completed. What do y'all think?
Makes sense to me. Perhaps a similar target branch should be added across the impacted repos (assuming that doesn't increase the cognitive load too much).
It is still the case that kerchunk can be used to generate v2 output with zarr-3 installed, no? Did I misunderstand something? Please invite me to the coordination meeting, so we can decide who does what and when.
I see I am wrong, and this branch cannot be expected to work with v2 zarr. So the plan is for all future kerchunking to happen in v3 (but perhaps still produce v2 output, so that people who don't upgrade can still use it)?
That is correct.
@martindurant the coordination meeting info is at https://zarr.dev/community-calls/. The event is hosted on Zarr's community calendar rather than a regular calendar invite, so it's not tied to any single individual/institution.
OK, so I can work on this if you have no time, @mpiannucci. I will go with the plan that kerchunk will require zarr>=3 but produce v2 output (because we don't actually use any new functionality), and for combining and zarr-to-zarr, the input has to be in v2 format.
That sounds great, @martindurant. I do not have the capacity to code on it at the moment, but I can help with any questions, or if you get stuck, to keep things moving.
So far the only tested file type is HDF5. That's the only module that currently works with the new zarr-python in some way.