Rename paths in manifest (#152)
* tests

* rename paths inside manifest using np.vectorize

* add method on ManifestArray too

* add method on xarray accessor

* rewrite to not use apply_ufunc

* fix tests for accessor method

* add test for skipping over numpy arrays in virtual dataset

* release notes

* add rename_paths to API docs page

* document rename_paths method on usage page

* extra line
TomNicholas authored Jun 23, 2024
1 parent 9ac42d3 commit 7fca86b
Showing 8 changed files with 273 additions and 7 deletions.
11 changes: 11 additions & 0 deletions docs/api.rst
@@ -41,6 +41,17 @@ Serialization
VirtualiZarrDatasetAccessor.to_zarr


Rewriting
=============

.. currentmodule:: virtualizarr.xarray
.. autosummary::
:nosignatures:
:toctree: generated/

VirtualiZarrDatasetAccessor.rename_paths


Array API
=========

2 changes: 2 additions & 0 deletions docs/releases.rst
@@ -9,6 +9,8 @@ v0.2 (unreleased)
New Features
~~~~~~~~~~~~

- Added a ``.rename_paths`` convenience method to rename paths in a manifest according to a function.
(:pull:`152`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Breaking changes
~~~~~~~~~~~~~~~~
33 changes: 30 additions & 3 deletions docs/usage.md
@@ -38,7 +38,7 @@ In future we would like for it to be possible to just use `xr.open_dataset`, e.g.
but this requires some [upstream changes](https://github.com/TomNicholas/VirtualiZarr/issues/35) in xarray.
```

Printing this "virtual dataset" shows that although it is an instance of `xarray.Dataset`, unlike a typical xarray dataset, it does not contain numpy or dask arrays, but instead it wraps {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects.
Printing this "virtual dataset" shows that although it is an instance of `xarray.Dataset`, unlike a typical xarray dataset, it does not contain numpy or dask arrays, but instead it wraps {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects. (We will use the term "virtual dataset" to refer to any `xarray.Dataset` which contains one or more {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects.)

```python
vds
@@ -170,7 +170,7 @@ You also cannot currently index into a `ManifestArray`, as arbitrary indexing wo

## Virtual Datasets as Zarr Groups

The full Zarr model (for a single group) includes multiple arrays, array names, named dimensions, and arbitrary dictionary-like attrs on each array. Whilst the duck-typed `ManifestArray` cannot store all of this information, an `xarray.Dataset` wrapping multiple `ManifestArray`s maps really nicely to the Zarr model. This is what the virtual dataset we opened represents - all the information in one entire Zarr group, but held as references to on-disk chunks instead of in-memory arrays.
The full Zarr model (for a single group) includes multiple arrays, array names, named dimensions, and arbitrary dictionary-like attrs on each array. Whilst the duck-typed `ManifestArray` cannot store all of this information, an `xarray.Dataset` wrapping multiple `ManifestArray`s maps neatly to the Zarr model. This is what the virtual dataset we opened represents - all the information in one entire Zarr group, but held as references to on-disk chunks instead of as in-memory arrays.

The problem of combining many legacy format files (e.g. netCDF files) into one virtual Zarr store therefore becomes just a matter of opening each file using `open_virtual_dataset` and using [xarray's various combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html) to combine them into one aggregate virtual dataset.

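For instance, a minimal sketch of concatenating two single-file virtual datasets along a `time` dimension (the filenames, the dimension name, and the concatenation keywords below are illustrative assumptions, not part of this change):

```python
import xarray as xr

from virtualizarr import open_virtual_dataset

# Open each legacy file as its own virtual dataset (hypothetical filenames).
vds1 = open_virtual_dataset("air_day1.nc", indexes={})
vds2 = open_virtual_dataset("air_day2.nc", indexes={})

# Concatenate the chunk references along "time" without loading or comparing coordinate values.
combined_vds = xr.concat(
    [vds1, vds2], dim="time", coords="minimal", compat="override"
)
```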
@@ -317,7 +317,7 @@ Loading variables can be useful in a few scenarios:

Once we've combined references to all the chunks of all our legacy files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.

### Writing to Kerchunk's format and reading via fsspec
### Writing to Kerchunk's format and reading data via fsspec

The [kerchunk library](https://github.com/fsspec/kerchunk) has its own [specification](https://fsspec.github.io/kerchunk/spec.html) for how byte range references should be serialized (either as a JSON or parquet file).

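As a rough sketch of that workflow (the filename `combined.json` and the read-side options are illustrative assumptions):

```python
import fsspec
import xarray as xr

# Serialize the combined references to a kerchunk-format JSON file.
combined_vds.virtualize.to_kerchunk("combined.json", format="json")

# Later, open the references lazily via fsspec's reference filesystem and the zarr engine.
fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)
```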
@@ -377,3 +377,30 @@ Currently there are not yet any zarr v3 readers which understand the chunk manifest
This store can however be read by {py:func}`~virtualizarr.xarray.open_virtual_dataset`, by passing `filetype="zarr_v3"`.
```
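A minimal sketch of that round trip (the store path `air.zarr` is just an illustrative name):

```python
from virtualizarr import open_virtual_dataset

# Write the manifest-backed dataset to a Zarr v3 store on disk, then re-open it as a virtual dataset.
vds.virtualize.to_zarr("air.zarr")
roundtripped_vds = open_virtual_dataset("air.zarr", filetype="zarr_v3", indexes={})
```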

## Rewriting existing manifests

Sometimes it can be useful to rewrite the contents of an already-generated manifest or virtual dataset.

### Rewriting file paths

You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`ds.virtualize.rename_paths <virtualizarr.xarray.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.

For example, you may want to rename the file paths using a function, to reflect having moved the referenced files from local storage to an S3 bucket.

```python
def local_to_s3_url(old_local_path: str) -> str:
    from pathlib import Path

    new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"

    filename = Path(old_local_path).name
    return new_s3_bucket_url + filename
```
```python
renamed_vds = vds.virtualize.rename_paths(local_to_s3_url)
renamed_vds['air'].data.manifest.dict()
```
```
{'0.0.0': {'path': 'http://s3.amazonaws.com/my_bucket/air.nc', 'offset': 15419, 'length': 7738000}}
```
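`rename_paths` also accepts a plain string instead of a function, in which case every chunk in every manifest is pointed at that single new path. A minimal sketch (the URL is illustrative):

```python
renamed_vds = vds.virtualize.rename_paths("http://s3.amazonaws.com/my_bucket/air.nc")
```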
42 changes: 41 additions & 1 deletion virtualizarr/manifests/array.py
@@ -1,5 +1,5 @@
import warnings
from typing import Any, Union
from typing import Any, Callable, Union

import numpy as np

@@ -224,6 +224,46 @@ def __getitem__(
        else:
            raise NotImplementedError(f"Doesn't support slicing with {indexer}")

    def rename_paths(
        self,
        new: str | Callable[[str], str],
    ) -> "ManifestArray":
        """
        Rename paths to chunks in this array's manifest.

        Accepts either a string, in which case this new path will be used for all chunks, or
        a function which accepts the old path and returns the new path.

        Parameters
        ----------
        new
            New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

        Returns
        -------
        ManifestArray

        Examples
        --------
        Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

        >>> def local_to_s3_url(old_local_path: str) -> str:
        ...     from pathlib import Path
        ...
        ...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
        ...
        ...     filename = Path(old_local_path).name
        ...     return new_s3_bucket_url + filename
        >>> marr.rename_paths(local_to_s3_url)

        See Also
        --------
        ChunkManifest.rename_paths
        """
        renamed_manifest = self.manifest.rename_paths(new)
        return ManifestArray(zarray=self.zarray, chunkmanifest=renamed_manifest)


def _possibly_expand_trailing_ellipsis(key, ndim: int):
    if key[-1] == ...:
55 changes: 54 additions & 1 deletion virtualizarr/manifests/manifest.py
@@ -1,7 +1,7 @@
import json
import re
from collections.abc import Iterable, Iterator
from typing import Any, NewType, Tuple, Union, cast
from typing import Any, Callable, NewType, Tuple, Union, cast

import numpy as np
from pydantic import BaseModel, ConfigDict
@@ -290,6 +290,59 @@ def _from_kerchunk_chunk_dict(cls, kerchunk_chunk_dict) -> "ChunkManifest":
        }
        return ChunkManifest(entries=cast(ChunkDict, chunkentries))

    def rename_paths(
        self,
        new: str | Callable[[str], str],
    ) -> "ChunkManifest":
        """
        Rename paths to chunks in this manifest.

        Accepts either a string, in which case this new path will be used for all chunks, or
        a function which accepts the old path and returns the new path.

        Parameters
        ----------
        new
            New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

        Returns
        -------
        manifest

        Examples
        --------
        Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

        >>> def local_to_s3_url(old_local_path: str) -> str:
        ...     from pathlib import Path
        ...
        ...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
        ...
        ...     filename = Path(old_local_path).name
        ...     return new_s3_bucket_url + filename
        >>> manifest.rename_paths(local_to_s3_url)

        See Also
        --------
        ManifestArray.rename_paths
        """
        if isinstance(new, str):
            renamed_paths = np.full_like(self._paths, fill_value=new)
        elif callable(new):
            vectorized_rename_fn = np.vectorize(new, otypes=[np.dtypes.StringDType()])  # type: ignore[attr-defined]
            renamed_paths = vectorized_rename_fn(self._paths)
        else:
            raise TypeError(
                f"Argument 'new' must be either a string or a callable that accepts and returns strings, but got type {type(new)}"
            )

        return self.from_arrays(
            paths=renamed_paths,
            offsets=self._offsets,
            lengths=self._lengths,
        )


def split(key: ChunkKey) -> Tuple[int, ...]:
    return tuple(int(i) for i in key.split("."))
41 changes: 41 additions & 0 deletions virtualizarr/tests/test_manifests/test_manifest.py
@@ -87,3 +87,44 @@ class TestSerializeManifest:
    def test_serialize_manifest_to_zarr(self): ...

    def test_deserialize_manifest_from_zarr(self): ...


class TestRenamePaths:
    def test_rename_to_str(self):
        chunks = {
            "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        }
        manifest = ChunkManifest(entries=chunks)

        renamed = manifest.rename_paths("s3://bucket/bar.nc")
        assert renamed.dict() == {
            "0.0.0": {"path": "s3://bucket/bar.nc", "offset": 100, "length": 100},
        }

    def test_rename_using_function(self):
        chunks = {
            "0.0.0": {"path": "foo.nc", "offset": 100, "length": 100},
        }
        manifest = ChunkManifest(entries=chunks)

        def local_to_s3_url(old_local_path: str) -> str:
            from pathlib import Path

            new_s3_bucket_url = "s3://bucket/"

            filename = Path(old_local_path).name
            return str(new_s3_bucket_url + filename)

        renamed = manifest.rename_paths(local_to_s3_url)
        assert renamed.dict() == {
            "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        }

    def test_invalid_type(self):
        chunks = {
            "0.0.0": {"path": "foo.nc", "offset": 100, "length": 100},
        }
        manifest = ChunkManifest(entries=chunks)

        with pytest.raises(TypeError):
            manifest.rename_paths(["file1.nc", "file2.nc"])
44 changes: 44 additions & 0 deletions virtualizarr/tests/test_xarray.py
@@ -320,3 +320,44 @@ def test_open_virtual_dataset_passes_expected_args(
"reader_options": reader_options,
}
mock_read_kerchunk.assert_called_once_with(**args)


class TestRenamePaths:
    def test_rename_to_str(self, netcdf4_file):
        vds = open_virtual_dataset(netcdf4_file, indexes={})
        renamed_vds = vds.virtualize.rename_paths("s3://bucket/air.nc")
        assert (
            renamed_vds["air"].data.manifest.dict()["0.0.0"]["path"]
            == "s3://bucket/air.nc"
        )

    def test_rename_using_function(self, netcdf4_file):
        vds = open_virtual_dataset(netcdf4_file, indexes={})

        def local_to_s3_url(old_local_path: str) -> str:
            from pathlib import Path

            new_s3_bucket_url = "s3://bucket/"

            filename = Path(old_local_path).name
            return str(new_s3_bucket_url + filename)

        renamed_vds = vds.virtualize.rename_paths(local_to_s3_url)
        assert (
            renamed_vds["air"].data.manifest.dict()["0.0.0"]["path"]
            == "s3://bucket/air.nc"
        )

    def test_invalid_type(self, netcdf4_file):
        vds = open_virtual_dataset(netcdf4_file, indexes={})

        with pytest.raises(TypeError):
            vds.virtualize.rename_paths(["file1.nc", "file2.nc"])

    def test_mixture_of_manifestarrays_and_numpy_arrays(self, netcdf4_file):
        vds = open_virtual_dataset(
            netcdf4_file, indexes={}, loadable_variables=["lat", "lon"]
        )
        renamed_vds = vds.virtualize.rename_paths("s3://bucket/air.nc")
        assert (
            renamed_vds["air"].data.manifest.dict()["0.0.0"]["path"]
            == "s3://bucket/air.nc"
        )
        assert isinstance(renamed_vds["lat"].data, np.ndarray)
52 changes: 50 additions & 2 deletions virtualizarr/xarray.py
@@ -1,6 +1,7 @@
from collections.abc import Iterable, Mapping, MutableMapping
from pathlib import Path
from typing import (
    Callable,
    Literal,
    Optional,
    overload,
@@ -343,8 +344,8 @@ class VirtualiZarrDatasetAccessor:
    Methods on this object are called via `ds.virtualize.{method}`.
    """

    def __init__(self, ds):
        self.ds = ds
    def __init__(self, ds: xr.Dataset):
        self.ds: xr.Dataset = ds

    def to_zarr(self, storepath: str) -> None:
        """
@@ -438,3 +439,50 @@ def to_kerchunk(
            return None
        else:
            raise ValueError(f"Unrecognized output format: {format}")

    def rename_paths(
        self,
        new: str | Callable[[str], str],
    ) -> xr.Dataset:
        """
        Rename paths to chunks in every ManifestArray in this dataset.

        Accepts either a string, in which case this new path will be used for all chunks, or
        a function which accepts the old path and returns the new path.

        Parameters
        ----------
        new
            New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

        Returns
        -------
        Dataset

        Examples
        --------
        Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

        >>> def local_to_s3_url(old_local_path: str) -> str:
        ...     from pathlib import Path
        ...
        ...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
        ...
        ...     filename = Path(old_local_path).name
        ...     return new_s3_bucket_url + filename
        >>> ds.virtualize.rename_paths(local_to_s3_url)

        See Also
        --------
        ManifestArray.rename_paths
        ChunkManifest.rename_paths
        """
        new_ds = self.ds.copy()
        for var_name in new_ds.variables:
            data = new_ds[var_name].data
            # Only rename paths for variables backed by ManifestArrays;
            # loaded (numpy-backed) variables are left untouched.
            if isinstance(data, ManifestArray):
                new_ds[var_name].data = data.rename_paths(new=new)

        return new_ds
