Rename paths in manifest (#152)
* tests

* rename paths inside manifest using np.vectorize

* add method on ManifestArray too

* add method on xarray accessor

* rewrite to not use apply_ufunc

* fix tests for accessor method

* add test for skipping over numpy arrays in virtual dataset

* release notes

* add rename_paths to API docs page

* document rename_paths method on usage page

* extra line
TomNicholas authored Jun 23, 2024
1 parent 9ac42d3 commit 7fca86b
Showing 8 changed files with 273 additions and 7 deletions.
11 changes: 11 additions & 0 deletions docs/api.rst
@@ -41,6 +41,17 @@ Serialization
VirtualiZarrDatasetAccessor.to_zarr


Rewriting
=============

.. currentmodule:: virtualizarr.xarray
.. autosummary::
:nosignatures:
:toctree: generated/

VirtualiZarrDatasetAccessor.rename_paths


Array API
=========

2 changes: 2 additions & 0 deletions docs/releases.rst
@@ -9,6 +9,8 @@ v0.2 (unreleased)
New Features
~~~~~~~~~~~~

- Added a ``.rename_paths`` convenience method to rename paths in a manifest according to a function.
(:pull:`152`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Breaking changes
~~~~~~~~~~~~~~~~
33 changes: 30 additions & 3 deletions docs/usage.md
@@ -38,7 +38,7 @@ In future we would like for it to be possible to just use `xr.open_dataset`, e.g.
but this requires some [upstream changes](https://github.com/TomNicholas/VirtualiZarr/issues/35) in xarray.
```

Printing this "virtual dataset" shows that although it is an instance of `xarray.Dataset`, unlike a typical xarray dataset, it does not contain numpy or dask arrays, but instead it wraps {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects.
Printing this "virtual dataset" shows that although it is an instance of `xarray.Dataset`, unlike a typical xarray dataset, it does not contain numpy or dask arrays, but instead it wraps {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects. (We will use the term "virtual dataset" to refer to any `xarray.Dataset` which contains one or more {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects.)

```python
vds
@@ -170,7 +170,7 @@ You also cannot currently index into a `ManifestArray`, as arbitrary indexing wo

## Virtual Datasets as Zarr Groups

The full Zarr model (for a single group) includes multiple arrays, array names, named dimensions, and arbitrary dictionary-like attrs on each array. Whilst the duck-typed `ManifestArray` cannot store all of this information, an `xarray.Dataset` wrapping multiple `ManifestArray`s maps really nicely to the Zarr model. This is what the virtual dataset we opened represents - all the information in one entire Zarr group, but held as references to on-disk chunks instead of in-memory arrays.
The full Zarr model (for a single group) includes multiple arrays, array names, named dimensions, and arbitrary dictionary-like attrs on each array. Whilst the duck-typed `ManifestArray` cannot store all of this information, an `xarray.Dataset` wrapping multiple `ManifestArray`s maps neatly to the Zarr model. This is what the virtual dataset we opened represents - all the information in one entire Zarr group, but held as references to on-disk chunks instead of as in-memory arrays.

The problem of combining many legacy format files (e.g. netCDF files) into one virtual Zarr store therefore becomes just a matter of opening each file using `open_virtual_dataset` and using [xarray's various combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html) to combine them into one aggregate virtual dataset.

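For instance, a minimal sketch of concatenating two single-file virtual datasets along a `time` dimension (the filenames, the dimension name, and the concatenation keywords below are illustrative assumptions, not part of this change):

```python
import xarray as xr

from virtualizarr import open_virtual_dataset

# Open each legacy file as its own virtual dataset (hypothetical filenames).
vds1 = open_virtual_dataset("air_day1.nc", indexes={})
vds2 = open_virtual_dataset("air_day2.nc", indexes={})

# Concatenate the chunk references along "time" without loading or comparing coordinate values.
combined_vds = xr.concat(
    [vds1, vds2], dim="time", coords="minimal", compat="override"
)
```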
@@ -317,7 +317,7 @@ Loading variables can be useful in a few scenarios:

Once we've combined references to all the chunks of all our legacy files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.

### Writing to Kerchunk's format and reading via fsspec
### Writing to Kerchunk's format and reading data via fsspec

The [kerchunk library](https://github.com/fsspec/kerchunk) has its own [specification](https://fsspec.github.io/kerchunk/spec.html) for how byte range references should be serialized (either as a JSON or parquet file).

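As a rough sketch of that workflow (the filename `combined.json` and the read-side options are illustrative assumptions):

```python
import fsspec
import xarray as xr

# Serialize the combined references to a kerchunk-format JSON file.
combined_vds.virtualize.to_kerchunk("combined.json", format="json")

# Later, open the references lazily via fsspec's reference filesystem and the zarr engine.
fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)
```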
@@ -377,3 +377,30 @@ Currently there are not yet any zarr v3 readers which understand the chunk manifest
This store can however be read by {py:func}`~virtualizarr.xarray.open_virtual_dataset`, by passing `filetype="zarr_v3"`.
```
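A minimal sketch of that round trip (the store path `air.zarr` is just an illustrative name):

```python
from virtualizarr import open_virtual_dataset

# Write the manifest-backed dataset to a Zarr v3 store on disk, then re-open it as a virtual dataset.
vds.virtualize.to_zarr("air.zarr")
roundtripped_vds = open_virtual_dataset("air.zarr", filetype="zarr_v3", indexes={})
```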

## Rewriting existing manifests

Sometimes it can be useful to rewrite the contents of an already-generated manifest or virtual dataset.

### Rewriting file paths

You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`ds.virtualize.rename_paths <virtualizarr.xarray.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.

For example, you may want to rename the file paths using a function, to reflect having moved the referenced files from local storage to an S3 bucket.

```python
def local_to_s3_url(old_local_path: str) -> str:
    from pathlib import Path

    new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"

    filename = Path(old_local_path).name
    return new_s3_bucket_url + filename
```
```python
renamed_vds = vds.virtualize.rename_paths(local_to_s3_url)
renamed_vds['air'].data.manifest.dict()
```
```
{'0.0.0': {'path': 'http://s3.amazonaws.com/my_bucket/air.nc', 'offset': 15419, 'length': 7738000}}
```
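`rename_paths` also accepts a plain string instead of a function, in which case every chunk in every manifest is pointed at that single new path. A minimal sketch (the URL is illustrative):

```python
renamed_vds = vds.virtualize.rename_paths("http://s3.amazonaws.com/my_bucket/air.nc")
```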
42 changes: 41 additions & 1 deletion virtualizarr/manifests/array.py
@@ -1,5 +1,5 @@
import warnings
from typing import Any, Union
from typing import Any, Callable, Union

import numpy as np

@@ -224,6 +224,46 @@ def __getitem__(
        else:
            raise NotImplementedError(f"Doesn't support slicing with {indexer}")

    def rename_paths(
        self,
        new: str | Callable[[str], str],
    ) -> "ManifestArray":
        """
        Rename paths to chunks in this array's manifest.

        Accepts either a string, in which case this new path will be used for all chunks, or
        a function which accepts the old path and returns the new path.

        Parameters
        ----------
        new
            New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

        Returns
        -------
        ManifestArray

        Examples
        --------
        Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

        >>> def local_to_s3_url(old_local_path: str) -> str:
        ...     from pathlib import Path
        ...
        ...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
        ...
        ...     filename = Path(old_local_path).name
        ...     return new_s3_bucket_url + filename
        >>> marr.rename_paths(local_to_s3_url)

        See Also
        --------
        ChunkManifest.rename_paths
        """
        renamed_manifest = self.manifest.rename_paths(new)
        return ManifestArray(zarray=self.zarray, chunkmanifest=renamed_manifest)


def _possibly_expand_trailing_ellipsis(key, ndim: int):
    if key[-1] == ...:
55 changes: 54 additions & 1 deletion virtualizarr/manifests/manifest.py
@@ -1,7 +1,7 @@
import json
import re
from collections.abc import Iterable, Iterator
from typing import Any, NewType, Tuple, Union, cast
from typing import Any, Callable, NewType, Tuple, Union, cast

import numpy as np
from pydantic import BaseModel, ConfigDict
@@ -290,6 +290,59 @@ def _from_kerchunk_chunk_dict(cls, kerchunk_chunk_dict) -> "ChunkManifest":
        }
        return ChunkManifest(entries=cast(ChunkDict, chunkentries))

    def rename_paths(
        self,
        new: str | Callable[[str], str],
    ) -> "ChunkManifest":
        """
        Rename paths to chunks in this manifest.

        Accepts either a string, in which case this new path will be used for all chunks, or
        a function which accepts the old path and returns the new path.

        Parameters
        ----------
        new
            New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

        Returns
        -------
        manifest

        Examples
        --------
        Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

        >>> def local_to_s3_url(old_local_path: str) -> str:
        ...     from pathlib import Path
        ...
        ...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
        ...
        ...     filename = Path(old_local_path).name
        ...     return new_s3_bucket_url + filename
        >>> manifest.rename_paths(local_to_s3_url)

        See Also
        --------
        ManifestArray.rename_paths
        """
        if isinstance(new, str):
            renamed_paths = np.full_like(self._paths, fill_value=new)
        elif callable(new):
            vectorized_rename_fn = np.vectorize(new, otypes=[np.dtypes.StringDType()])  # type: ignore[attr-defined]
            renamed_paths = vectorized_rename_fn(self._paths)
        else:
            raise TypeError(
                f"Argument 'new' must be either a string or a callable that accepts and returns strings, but got type {type(new)}"
            )

        return self.from_arrays(
            paths=renamed_paths,
            offsets=self._offsets,
            lengths=self._lengths,
        )


def split(key: ChunkKey) -> Tuple[int, ...]:
    return tuple(int(i) for i in key.split("."))
41 changes: 41 additions & 0 deletions virtualizarr/tests/test_manifests/test_manifest.py
@@ -87,3 +87,44 @@ class TestSerializeManifest:
    def test_serialize_manifest_to_zarr(self): ...

    def test_deserialize_manifest_from_zarr(self): ...


class TestRenamePaths:
    def test_rename_to_str(self):
        chunks = {
            "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        }
        manifest = ChunkManifest(entries=chunks)

        renamed = manifest.rename_paths("s3://bucket/bar.nc")
        assert renamed.dict() == {
            "0.0.0": {"path": "s3://bucket/bar.nc", "offset": 100, "length": 100},
        }

    def test_rename_using_function(self):
        chunks = {
            "0.0.0": {"path": "foo.nc", "offset": 100, "length": 100},
        }
        manifest = ChunkManifest(entries=chunks)

        def local_to_s3_url(old_local_path: str) -> str:
            from pathlib import Path

            new_s3_bucket_url = "s3://bucket/"

            filename = Path(old_local_path).name
            return str(new_s3_bucket_url + filename)

        renamed = manifest.rename_paths(local_to_s3_url)
        assert renamed.dict() == {
            "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        }

    def test_invalid_type(self):
        chunks = {
            "0.0.0": {"path": "foo.nc", "offset": 100, "length": 100},
        }
        manifest = ChunkManifest(entries=chunks)

        with pytest.raises(TypeError):
            manifest.rename_paths(["file1.nc", "file2.nc"])
44 changes: 44 additions & 0 deletions virtualizarr/tests/test_xarray.py
@@ -320,3 +320,44 @@ def test_open_virtual_dataset_passes_expected_args(
"reader_options": reader_options,
}
mock_read_kerchunk.assert_called_once_with(**args)


class TestRenamePaths:
    def test_rename_to_str(self, netcdf4_file):
        vds = open_virtual_dataset(netcdf4_file, indexes={})
        renamed_vds = vds.virtualize.rename_paths("s3://bucket/air.nc")
        assert (
            renamed_vds["air"].data.manifest.dict()["0.0.0"]["path"]
            == "s3://bucket/air.nc"
        )

    def test_rename_using_function(self, netcdf4_file):
        vds = open_virtual_dataset(netcdf4_file, indexes={})

        def local_to_s3_url(old_local_path: str) -> str:
            from pathlib import Path

            new_s3_bucket_url = "s3://bucket/"

            filename = Path(old_local_path).name
            return str(new_s3_bucket_url + filename)

        renamed_vds = vds.virtualize.rename_paths(local_to_s3_url)
        assert (
            renamed_vds["air"].data.manifest.dict()["0.0.0"]["path"]
            == "s3://bucket/air.nc"
        )

    def test_invalid_type(self, netcdf4_file):
        vds = open_virtual_dataset(netcdf4_file, indexes={})

        with pytest.raises(TypeError):
            vds.virtualize.rename_paths(["file1.nc", "file2.nc"])

    def test_mixture_of_manifestarrays_and_numpy_arrays(self, netcdf4_file):
        vds = open_virtual_dataset(
            netcdf4_file, indexes={}, loadable_variables=["lat", "lon"]
        )
        renamed_vds = vds.virtualize.rename_paths("s3://bucket/air.nc")
        assert (
            renamed_vds["air"].data.manifest.dict()["0.0.0"]["path"]
            == "s3://bucket/air.nc"
        )
        assert isinstance(renamed_vds["lat"].data, np.ndarray)
52 changes: 50 additions & 2 deletions virtualizarr/xarray.py
@@ -1,6 +1,7 @@
from collections.abc import Iterable, Mapping, MutableMapping
from pathlib import Path
from typing import (
    Callable,
    Literal,
    Optional,
    overload,
@@ -343,8 +344,8 @@ class VirtualiZarrDatasetAccessor:
    Methods on this object are called via `ds.virtualize.{method}`.
    """

    def __init__(self, ds):
        self.ds = ds
    def __init__(self, ds: xr.Dataset):
        self.ds: xr.Dataset = ds

    def to_zarr(self, storepath: str) -> None:
        """
@@ -438,3 +439,50 @@ def to_kerchunk(
            return None
        else:
            raise ValueError(f"Unrecognized output format: {format}")

    def rename_paths(
        self,
        new: str | Callable[[str], str],
    ) -> xr.Dataset:
        """
        Rename paths to chunks in every ManifestArray in this dataset.

        Accepts either a string, in which case this new path will be used for all chunks, or
        a function which accepts the old path and returns the new path.

        Parameters
        ----------
        new
            New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

        Returns
        -------
        Dataset

        Examples
        --------
        Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

        >>> def local_to_s3_url(old_local_path: str) -> str:
        ...     from pathlib import Path
        ...
        ...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
        ...
        ...     filename = Path(old_local_path).name
        ...     return new_s3_bucket_url + filename
        >>> ds.virtualize.rename_paths(local_to_s3_url)

        See Also
        --------
        ManifestArray.rename_paths
        ChunkManifest.rename_paths
        """
        new_ds = self.ds.copy()
        for var_name in new_ds.variables:
            data = new_ds[var_name].data
            # Only rename paths for variables backed by ManifestArrays;
            # loaded (numpy-backed) variables are left untouched.
            if isinstance(data, ManifestArray):
                new_ds[var_name].data = data.rename_paths(new=new)

        return new_ds
