Virtual Dataset Workflow Tracking Issue #197
Comments
This is so awesome, thank you for open sourcing your work and the impressive documentation/issue tracking! Just wanted to share the snippet below that works for me, since there have been some changes on those branches since this code was posted. In particular, only
import xarray as xr
from virtualizarr import open_virtual_dataset
from virtualizarr.writers.icechunk import dataset_to_icechunk
url = 's3://met-office-atmospheric-model-data/global-deterministic-10km/20221001T0000Z/20221001T0000Z-PT0000H00M-CAPE_mixed_layer_lowest_500m.nc'
so = dict(anon=True, default_fill_cache=False, default_cache_type="none")
# create xarray dataset
ds = open_virtual_dataset(url, reader_options={'storage_options': so}, indexes={})
# create an icechunk store
from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig
storage = StorageConfig.filesystem(str('ukmet'))
store = IcechunkStore.create(storage=storage, mode="w", config=StoreConfig(
    virtual_ref_config=VirtualRefConfig.s3_anonymous(region='eu-west-2'),
))
# use virtualizarr to write the dataset to icechunk
dataset_to_icechunk(ds, store)
# commit to save progress
store.commit(message="Initial commit")
# open it back up
ds = xr.open_zarr(store, zarr_version=3, consolidated=False)
# plot!
ds.atmosphere_convective_available_potential_energy.plot()
Thanks @maxrjones !! I updated the code sample up top to match, just to make sure it's all on the same page.
Icechunk support was merged to VirtualiZarr main (zarr-developers/VirtualiZarr#256)! I updated the top post with the latest instructions. Edit: And released!! https://virtualizarr.readthedocs.io/en/latest/generated/virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk.html#virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk
I listed out a current breakdown of the work to be done in kerchunk here if anyone is interested in helping to drive this effort forward!
I wonder, do we have examples of supermassive iced datasets yet, with millions of references? I wanted to see how the msgpack format stacks up against kerchunk's parquet format, particularly the ability to only load partitions of the reference data.
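For anyone following along, here is a rough sketch of what "only loading partitions of the reference data" can mean in practice. It is a toy illustration using the standard library only, not kerchunk's parquet layout and not icechunk's msgpack manifests. The JSON file naming, `RECORDS_PER_PARTITION`, `write_partitioned_refs`, `LazyRefs`, and the example bucket URL are all hypothetical names made up for this sketch.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical partition size; real formats choose their own.
RECORDS_PER_PARTITION = 4


def write_partitioned_refs(refs, directory):
    """Write (path, offset, length) references into fixed-size partition files."""
    directory = Path(directory)
    for start in range(0, len(refs), RECORDS_PER_PARTITION):
        part = refs[start:start + RECORDS_PER_PARTITION]
        part_file = directory / f"refs.{start // RECORDS_PER_PARTITION}.json"
        part_file.write_text(json.dumps(part))


class LazyRefs:
    """Look up chunk references, reading only the partition that holds the chunk."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.loaded = {}  # partition number -> list of references

    def __getitem__(self, chunk_index):
        part_no, row = divmod(chunk_index, RECORDS_PER_PARTITION)
        if part_no not in self.loaded:
            # Only this one partition file is read from disk.
            part_file = self.directory / f"refs.{part_no}.json"
            self.loaded[part_no] = json.loads(part_file.read_text())
        path, offset, length = self.loaded[part_no][row]
        return path, offset, length


# Ten made-up references to consecutive 100-byte chunks in one file.
refs = [("s3://example-bucket/data.nc", 1000 + 100 * i, 100) for i in range(10)]
tmpdir = tempfile.mkdtemp()
write_partitioned_refs(refs, tmpdir)

lazy = LazyRefs(tmpdir)
ref = lazy[9]  # touches only the partition containing chunk 9
```

After the lookup, `lazy.loaded` holds a single partition, so the memory cost scales with the partitions actually touched rather than with the total number of references. Whether a given on-disk format can do this depends on whether it can be read piecewise, which is the comparison being asked about.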
numcodecs 0.14.0 is out with included support for zarr 3 codecs. The last piece of this puzzle is getting kerchunk fully working with zarr 3 stores, which is a work in progress.
Great! Would you mind submitting a PR to VirtualiZarr to change this dependency?
I tried 100 million virtual references in #401, which kind of already works. (Which is surprising given how no effort has gone into optimizing anything yet!)
(This was done in zarr-developers/VirtualiZarr#301)
In order to create and use virtual datasets with Python, users will want to use kerchunk and virtualizarr. These are just starting down the path to zarr 3 and icechunk compatibility. This issue will be used to track progress and relevant PRs:
zarr-python v3 compatibility: fsspec/kerchunk#516
All of this can be installed with pip. However, we need to install in three steps for now to avoid version conflicts:
This assumes also having fsspec and s3fs installed:
With all of this installed, HDF5 virtual datasets currently work like this:
Updated 11/13/2024