Add Icechunk Support #256

Merged on Oct 22, 2024 (90 commits)
7b00e41
move vds_with_manifest_arrays fixture up
TomNicholas Sep 27, 2024
c82221c
sketch implementation
TomNicholas Sep 27, 2024
d29362b
test that we can create an icechunk store
TomNicholas Sep 27, 2024
2aa3cb5
fixture to create icechunk filestore in temporary directory
TomNicholas Sep 27, 2024
f2c095c
get the async fixture working properly
TomNicholas Sep 27, 2024
6abe32d
split into more functions
TomNicholas Sep 27, 2024
93080b3
change mode
TomNicholas Sep 27, 2024
bebf370
try creating zarr group and arrays explicitly
TomNicholas Sep 27, 2024
833e5f0
create root group from store
TomNicholas Sep 28, 2024
9853140
todos
TomNicholas Sep 28, 2024
030a96e
do away with the async pytest fixtures/functions
TomNicholas Sep 28, 2024
90ca5cf
successfully writes root group attrs
TomNicholas Sep 28, 2024
b138dde
check array metadata is correct
TomNicholas Sep 28, 2024
6631102
try to write array attributes
TomNicholas Sep 28, 2024
e92b56c
sketch test for checking virtual references have been set correctly
TomNicholas Sep 28, 2024
2c8c0ee
test setting single virtual ref
TomNicholas Sep 30, 2024
a2ce1ed
use async properly
TomNicholas Sep 30, 2024
9393995
better separation of handling of loadable variables
TomNicholas Oct 1, 2024
956e324
fix chunk key format
TomNicholas Oct 1, 2024
2d7d5f6
use require_array
TomNicholas Oct 1, 2024
8726e23
check that store supports writes
TomNicholas Oct 1, 2024
387f345
removed outdated note about awaiting
TomNicholas Oct 1, 2024
b2a0700
fix incorrect chunk key in test
TomNicholas Oct 2, 2024
4ffb55e
absolute path
TomNicholas Oct 2, 2024
f929fcb
convert to file URI before handing to icechunk
TomNicholas Oct 2, 2024
e9c1287
test that without encoding we can definitely read one chunk
TomNicholas Oct 2, 2024
2fe3012
Work on encoding test
mpiannucci Oct 2, 2024
33d8ce8
Merge remote-tracking branch 'origin/icechunk' into matt/icechunk-enc…
mpiannucci Oct 2, 2024
8aa6034
Update test to match
mpiannucci Oct 2, 2024
aa2d415
Quick comment
mpiannucci Oct 2, 2024
7e4e2ce
more comprehensive
mpiannucci Oct 2, 2024
9a03245
add attrtirbute encoding
mpiannucci Oct 3, 2024
9676485
Merge pull request #2 from earth-mover/matt/icechunk-encoding
TomNicholas Oct 4, 2024
bbaf3ba
Fix array dimensions
mpiannucci Oct 10, 2024
31945cd
Merge pull request #3 from earth-mover/matt/array-dims
mpiannucci Oct 11, 2024
49daa7e
Fix v3 codec pipeline
mpiannucci Oct 11, 2024
756ff92
Put xarray dep back
mpiannucci Oct 11, 2024
8c7242e
Handle codecs, but get bad results
mpiannucci Oct 12, 2024
666b676
Gzip an d zlib are not directly working
mpiannucci Oct 12, 2024
9076ad7
Get up working with numcodecs zarr 3 codecs
mpiannucci Oct 13, 2024
7a160fd
Update codec pipeline
mpiannucci Oct 14, 2024
286a9b5
Merge pull request #4 from earth-mover/matt/v3-codecs
mpiannucci Oct 15, 2024
8f1f96e
oUdpate to latest icechunk using sync api
mpiannucci Oct 15, 2024
b59060d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 15, 2024
4f3bafa
Merge remote-tracking branch 'origin/main' into icechunk
mpiannucci Oct 21, 2024
26db575
Merge remote-tracking branch 'mpiannucci/icechunk' into icechunk
mpiannucci Oct 21, 2024
01d261c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2024
52e53f9
Some type stuff
mpiannucci Oct 21, 2024
d10de6b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2024
d0b6bfb
Update zarr and icechunk tests, fix zarr v3 metadata
mpiannucci Oct 21, 2024
b7dc5f5
Update import we dont need
mpiannucci Oct 21, 2024
1e8ba7d
Update kerhcunk tests
mpiannucci Oct 21, 2024
6b5b8b5
Check for v3 metadata import in zarr test
mpiannucci Oct 21, 2024
00ecae3
More tests
mpiannucci Oct 21, 2024
14ee6f9
type checker
mpiannucci Oct 21, 2024
23d98de
types
mpiannucci Oct 21, 2024
c633f13
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2024
308a508
More types
mpiannucci Oct 21, 2024
60cb43e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2024
d72e9d8
ooops
mpiannucci Oct 21, 2024
0e0e5ac
Merge remote-tracking branch 'mpiannucci/icechunk' into icechunk
mpiannucci Oct 21, 2024
3873fde
One left
mpiannucci Oct 21, 2024
19bc9ae
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2024
4b6aefc
Finally done being dumb
mpiannucci Oct 21, 2024
9fa9b38
Merge remote-tracking branch 'mpiannucci/icechunk' into icechunk
mpiannucci Oct 21, 2024
4d5c46a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2024
5cb3d21
Support loadables without tests
mpiannucci Oct 21, 2024
3e31f21
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 21, 2024
e105b78
Add test for multiple chunks to check order
mpiannucci Oct 21, 2024
ea52003
Add loadable varaible test
mpiannucci Oct 22, 2024
54d9bea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
303e5cb
Add accessor, simple docs
mpiannucci Oct 22, 2024
8838a1d
Update icechunk.py
mpiannucci Oct 22, 2024
dd6c118
Update accessor.py
mpiannucci Oct 22, 2024
85de689
Fix attributes when loadables are available
mpiannucci Oct 22, 2024
9b471ec
Merge remote-tracking branch 'mpiannucci/icechunk' into icechunk
mpiannucci Oct 22, 2024
8cd5237
Protect zarr import
mpiannucci Oct 22, 2024
542a953
Fix import errors in icechunk writer
mpiannucci Oct 22, 2024
305b0c6
More protection
mpiannucci Oct 22, 2024
2f8e270
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
7ce9f67
i am bad at this
mpiannucci Oct 22, 2024
8b4863e
Add xarray roundtrip asserts
mpiannucci Oct 22, 2024
b072535
Add icechunk to api.rst
mpiannucci Oct 22, 2024
45ae850
Update virtualizarr/tests/test_writers/test_icechunk.py
mpiannucci Oct 22, 2024
36eaad1
More test improvements, update realeses.rst
mpiannucci Oct 22, 2024
9f4e978
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
3b496f5
tmore testing
mpiannucci Oct 22, 2024
1e580a5
Merge remote-tracking branch 'mpiannucci/icechunk' into icechunk
mpiannucci Oct 22, 2024
117479c
Figure out tests for real this time
mpiannucci Oct 22, 2024
46c41d0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2024
2 changes: 1 addition & 1 deletion ci/upstream.yml
@@ -24,7 +24,7 @@ dependencies:
   - fsspec
   - pip
   - pip:
-    - zarr==3.0.0b1 # beta release of zarr-python v3
+    - icechunk # Installs zarr v3 as dependency
     - git+https://github.com/pydata/xarray@zarr-v3 # zarr-v3 compatibility branch
     - git+https://github.com/zarr-developers/numcodecs@zarr3-codecs # zarr-v3 compatibility branch
     # - git+https://github.com/fsspec/kerchunk@main # kerchunk is currently incompatible with zarr-python v3 (https://github.com/fsspec/kerchunk/pull/516)
15 changes: 15 additions & 0 deletions conftest.py
@@ -1,6 +1,8 @@
import h5py
import numpy as np
import pytest
import xarray as xr
from xarray.core.variable import Variable


def pytest_addoption(parser):
@@ -96,3 +98,16 @@ def hdf5_scalar(tmpdir):
    dataset = f.create_dataset("scalar", data=0.1, dtype="float32")
    dataset.attrs["scalar"] = "true"
    return filepath


@pytest.fixture
def simple_netcdf4(tmpdir):
    filepath = f"{tmpdir}/simple.nc"

    arr = np.arange(12, dtype=np.dtype("int32")).reshape(3, 4)
    var = Variable(data=arr, dims=["x", "y"])
    ds = xr.Dataset({"foo": var})

    ds.to_netcdf(filepath)

    return filepath
1 change: 1 addition & 0 deletions docs/api.rst
@@ -39,6 +39,7 @@ Serialization

    VirtualiZarrDatasetAccessor.to_kerchunk
    VirtualiZarrDatasetAccessor.to_zarr
    VirtualiZarrDatasetAccessor.to_icechunk


Rewriting
3 changes: 3 additions & 0 deletions docs/releases.rst
@@ -31,6 +31,9 @@ New Features
- Support empty files (:pull:`260`)
  By `Justus Magin <https://github.com/keewis>`_.

- Can write virtual datasets to Icechunk stores using `virtualize.to_icechunk` (:pull:`256`)
  By `Matt Iannucci <https://github.com/mpiannucci>`_.

Breaking changes
~~~~~~~~~~~~~~~~

17 changes: 17 additions & 0 deletions docs/usage.md
@@ -396,6 +396,23 @@ combined_ds = xr.open_dataset('combined.parq', engine="kerchunk")

By default, references are placed in a separate parquet file when the total number of references exceeds `record_size`. If a particular variable references fewer than `categorical_threshold` unique URLs, the URLs are stored as a categorical variable.

### Writing to an Icechunk Store

We can also write these references out as an [IcechunkStore](https://icechunk.io/). `Icechunk` is an open-source, cloud-native transactional tensor storage engine that is compatible with Zarr version 3. To export our virtual dataset to an `Icechunk` store, we use the {py:meth}`ds.virtualize.to_icechunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_icechunk>` accessor method.

```python
# create an icechunk store
from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig

storage = StorageConfig.filesystem('combined')
store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='us-east-1'),
))

combined_vds.virtualize.to_icechunk(store)
```

See the [Icechunk documentation](https://icechunk.io/icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr) for more details.

### Writing as Zarr

Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the [proposed "Chunk Manifest" ZEP](https://github.com/zarr-developers/zarr-specs/issues/287). To do this we simply use the {py:meth}`ds.virtualize.to_zarr <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_zarr>` accessor method.
18 changes: 18 additions & 0 deletions virtualizarr/accessor.py
@@ -1,5 +1,6 @@
from pathlib import Path
from typing import (
    TYPE_CHECKING,
    Callable,
    Literal,
    overload,
@@ -12,6 +13,9 @@
from virtualizarr.writers.kerchunk import dataset_to_kerchunk_refs
from virtualizarr.writers.zarr import dataset_to_zarr

if TYPE_CHECKING:
    from icechunk import IcechunkStore  # type: ignore[import-not-found]


@register_dataset_accessor("virtualize")
class VirtualiZarrDatasetAccessor:
@@ -39,6 +43,20 @@ def to_zarr(self, storepath: str) -> None:
        """
        dataset_to_zarr(self.ds, storepath)

    def to_icechunk(self, store: "IcechunkStore") -> None:
        """
        Write an xarray dataset to an Icechunk store.

        Any variables backed by ManifestArray objects will be written as virtual references; any other variables will be loaded into memory before their binary chunk data is written into the store.

        Parameters
        ----------
        store: IcechunkStore
        """
        from virtualizarr.writers.icechunk import dataset_to_icechunk

        dataset_to_icechunk(self.ds, store)

    @overload
    def to_kerchunk(
        self, filepath: None, format: Literal["dict"]
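
Note on the accessor change: the icechunk import is guarded by `TYPE_CHECKING` at module level and the writer is imported inside `to_icechunk`, so the optional dependency is only needed when the method is actually called. A minimal sketch of that pattern follows, using the standard-library `decimal` module as a stand-in for the optional package; `to_decimal` is a hypothetical function for illustration and is not part of VirtualiZarr.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime.
    from decimal import Decimal  # stand-in for an optional dependency


def to_decimal(text: str) -> "Decimal":
    # Deferred import: the dependency is required only when this is called,
    # so importing the enclosing module never fails if it is missing.
    from decimal import Decimal

    return Decimal(text)
```

The string annotation `"Decimal"` keeps the signature valid at runtime even though the name is not imported at module level.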
2 changes: 2 additions & 0 deletions virtualizarr/readers/zarr_v3.py
@@ -150,5 +150,7 @@ def _configurable_to_num_codec_config(configurable: dict) -> dict:
    """
    configurable_copy = configurable.copy()
    codec_id = configurable_copy.pop("name")
    if codec_id.startswith("numcodecs."):
        codec_id = codec_id[len("numcodecs.") :]
    configuration = configurable_copy.pop("configuration")
    return numcodecs.get_codec({"id": codec_id, **configuration}).get_config()
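
Note on the hunk above: the two added lines drop a leading `numcodecs.` from the codec name before the codec is looked up. A standalone sketch of just that string handling, with no numcodecs dependency; `strip_numcodecs_prefix` is a hypothetical helper name used for illustration and does not exist in VirtualiZarr.

```python
PREFIX = "numcodecs."


def strip_numcodecs_prefix(codec_id: str) -> str:
    # Remove the prefix only when it is present at the start of the name;
    # any other codec name is returned unchanged.
    if codec_id.startswith(PREFIX):
        return codec_id[len(PREFIX):]
    return codec_id
```

Under this sketch a name such as `numcodecs.zlib` becomes `zlib`, while a bare `zlib` passes through untouched.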
2 changes: 1 addition & 1 deletion virtualizarr/tests/test_integration.py
@@ -27,7 +27,7 @@ def test_kerchunk_roundtrip_in_memory_no_concat():
             chunks=(2, 2),
             compressor=None,
             filters=None,
-            fill_value=np.nan,
+            fill_value=None,
             order="C",
         ),
         chunkmanifest=manifest,
2 changes: 1 addition & 1 deletion virtualizarr/tests/test_manifests/test_array.py
@@ -47,7 +47,7 @@ def test_create_manifestarray_from_kerchunk_refs(self):
         assert marr.chunks == (2, 3)
         assert marr.dtype == np.dtype("int64")
         assert marr.zarray.compressor is None
-        assert marr.zarray.fill_value is np.nan
+        assert marr.zarray.fill_value == 0
         assert marr.zarray.filters is None
         assert marr.zarray.order == "C"

2 changes: 1 addition & 1 deletion virtualizarr/tests/test_readers/test_kerchunk.py
@@ -37,7 +37,7 @@ def test_dataset_from_df_refs():

     assert da.data.zarray.compressor is None
     assert da.data.zarray.filters is None
-    assert da.data.zarray.fill_value is np.nan
+    assert da.data.zarray.fill_value == 0
     assert da.data.zarray.order == "C"

     assert da.data.manifest.dict() == {
27 changes: 27 additions & 0 deletions virtualizarr/tests/test_writers/conftest.py
@@ -0,0 +1,27 @@
import numpy as np
import pytest
from xarray import Dataset
from xarray.core.variable import Variable

from virtualizarr.manifests import ChunkManifest, ManifestArray


@pytest.fixture
def vds_with_manifest_arrays() -> Dataset:
    arr = ManifestArray(
        chunkmanifest=ChunkManifest(
            entries={"0.0": dict(path="/test.nc", offset=6144, length=48)}
        ),
        zarray=dict(
            shape=(2, 3),
            dtype=np.dtype("<i8"),
            chunks=(2, 3),
            compressor={"id": "zlib", "level": 1},
            filters=None,
            fill_value=0,
            order="C",
            zarr_format=3,
        ),
    )
    var = Variable(dims=["x", "y"], data=arr, attrs={"units": "km"})
    return Dataset({"a": var}, attrs={"something": 0})
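
Note on the fixture above: its only chunk is keyed as `"0.0"`, that is, dot-separated grid indices, whereas Zarr v3's default chunk key encoding names the same chunk `c/0/0`. A minimal sketch of that conversion, assuming dot-separated manifest keys; the helper names `manifest_key_to_index` and `index_to_v3_key` are hypothetical and are not taken from this PR's writer code.

```python
def manifest_key_to_index(key: str) -> tuple[int, ...]:
    # "0.0" -> (0, 0): one integer per array dimension.
    return tuple(int(part) for part in key.split("."))


def index_to_v3_key(index: tuple[int, ...]) -> str:
    # Zarr v3 default chunk key encoding: "c" prefix, "/" separator.
    return "/".join(["c", *(str(i) for i in index)])
```

For the fixture's single entry, `index_to_v3_key(manifest_key_to_index("0.0"))` gives the key of the chunk at grid position (0, 0).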