Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feat): full v2 compat via python fallback #84

Merged
merged 28 commits into from
Feb 11, 2025
Merged

Conversation

ilan-gold
Copy link
Owner

@ilan-gold ilan-gold commented Jan 30, 2025

Just a draft but moving in the direction of full v2 support by falling back to zarr-python and also fixing some serialization issues. I think this mechanism, even if it is obviated by development on the rust side, is not bad to have as it allows us to keep up a bit more with new features/bugs etc.

I got rid of the full-chunk fetching (i.e., for some simple integer indexing cases) in rust because it was actually slower than just doing it in python. This also lets us be a bit more clear about what we support and don't because it is simpler now as either "optimized rust" or "pure python."

All data types (V, S, O etc) are now split off and i/o for them is done on an instantiated zarr-python pipeline (as opposed to our rust one).

Furthermore, default fill values have been added for all data types in python for our pipeline (as None is valid at hte top-level in zarr-python) as well as improved fill value handling in rust which should prepare us a bit better for variable-length decoding in pure-rust.

@@ -0,0 +1,346 @@
import json
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is completely taken from the zarr codebase with minimal changes

Comment on lines +220 to +226
@pytest.mark.parametrize(
"array_order", ["C", pytest.param("F", marks=[pytest.mark.xfail])]
)
@pytest.mark.parametrize("data_order", ["C", "F"])
@pytest.mark.parametrize(
"memory_order", ["C", pytest.param("F", marks=[pytest.mark.xfail])]
)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we fall back to python for F-arrays or just fail?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could handle F array order easily if zarr_python gave us all of the array metadata when constructing the codec pipeline

// FIXME: The array order, dimensionality, data type, and endianness are needed to exhaustively support all Zarr V2 data that zarrs can handle.
// However, CodecPipeline.from_codecs does not supply this information, and CodecPipeline.evolve_from_array_spec is seemingly never called.
let metadata = zarrs::metadata::v2_to_v3::codec_metadata_v2_to_v3(
ArrayMetadataV2Order::C,

python/zarrs/pipeline.py Outdated Show resolved Hide resolved
@LDeakin
Copy link
Collaborator

LDeakin commented Feb 3, 2025

I've added a few FIXMEs for variable-length data and fixed the codec handling. All the V2 codec metadata logic could just be pulled from zarrs itself instead though. Just need to isolate the codec part of array_metadata_v2_to_v3 Done in LDeakin/zarrs#141.

@ilan-gold ilan-gold marked this pull request as ready for review February 4, 2025 12:41
@ilan-gold ilan-gold requested a review from LDeakin February 4, 2025 12:43
ilan-gold and others added 3 commits February 4, 2025 16:32
There is a lot of additional logic already taken care of by `zarrs`, like handling multiple versions of codec metadata.
fill_value_bytes = fill_value.call_method0("tobytes")?.extract()?;
} else if let Ok(fill_value_downcast) = fill_value.downcast::<PyInt>() {
let fill_value_usize: usize = fill_value_downcast.extract()?;
if fill_value_usize == (0 as usize) && dtype == "object" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out this varies between zarr-python versions, but it is at least consistent with zarr-python 3, See zarr-developers/zarr-python#2792 (comment)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry meant to reply to this earlier - this condition is still kosher, though, right? I mean, the behavior is correct, that this is implicitly the v2 case that we have been discussing elsewhere and this sort of thing is not allowed in v3?

Just want to make sure I'm clear before merging since I feat I may be a bit lost in the sauce.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "string" / 0 fill value workaround was unreachable before. I've adjusted it now to match zarr-python 2.x.x behaviour given the discussion in zarr-developers/zarr-python#2792

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh the workaround can't be called anyway since we reject variable-sized data types from the pipeline. It'd be handy if zarr-python resolves zarr-developers/zarr-python#2792 so that we get a "0" instead of a 0 and no workaround is needed on our side.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It all makes sense now. zarr-developers/zarr-python#2792 (comment). Reverted back to the 0 -> "" fill value workaround.

@LDeakin
Copy link
Collaborator

LDeakin commented Feb 6, 2025

In 26ee516 I switched to using all the V2 to V3 codec metadata handling in zarrs. This is more complicated than it looks, because there are so many variants of codec metadata that have been produced over the life of zarr-python / numcodecs. For example, I have Zarr V2/V3 specific handling of zstd for multiple numcodecs versions:

I hope in the future that this responsibility of mapping interim/experimental codec metadata to standardised metadata can be taken on by zarr-python instead. Maybe it already can? I haven't really looked.

@ilan-gold
Copy link
Owner Author

@LDeakin Thanks for this. I had looked into that function as well when writing this PR, but didn't want to ask you to make it public given the amount of APIs you already have.

I hope in the future that this responsibility of mapping interim/experimental codec metadata to standardised metadata can be taken on by zarr-python instead. Maybe it already can? I haven't really looked.

Right, so I had opened an issue in numcodecs kind of about this: zarr-developers/numcodecs#676 The reason we need to call get_config is because these V2 codecs haven't been standardized/given v3-compatible json interchange formats. I wanted to look into that after this PR and I understood more what was going on.

@d-v-b
Copy link

d-v-b commented Feb 7, 2025

see zarr-developers/numcodecs#686. we need to fix this.

@ilan-gold
Copy link
Owner Author

Ah amazing @d-v-b I will close my little issue then. Thanks!

@ilan-gold ilan-gold merged commit 1d4e3cb into main Feb 11, 2025
17 checks passed
@ilan-gold ilan-gold deleted the ig/python_fallback branch February 11, 2025 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants