Kerchunk doesn't translate HDF5 hard links #459

Closed
ljwoods2 opened this issue Jun 7, 2024 · 3 comments · Fixed by #463

Comments

@ljwoods2
Contributor

ljwoods2 commented Jun 7, 2024

I'm using kerchunk to translate HDF5 files in the H5MD format so that they can be read with zarr.
Kerchunk doesn't translate hard links: a dataset shows up in the resulting zarr hierarchy only at the path it was last assigned to in h5py, and all other hard links to it are not visible.

Here's an example. First, I create an HDF5 file on the local filesystem:

import h5py
import numpy as np

file_name = "example.h5"
with h5py.File(file_name, "w") as f:
    data = np.arange(10)
    f.create_dataset("original_dataset", data=data)

    # Assigning an existing object to a new name creates a hard link to it
    f["linked_dataset"] = f["/original_dataset"]

with h5py.File(file_name, "r") as f:
    print(f["original_dataset"])
    print(f["linked_dataset"])

This gives the output:

<HDF5 dataset "original_dataset": shape (10,), type "<i8">
<HDF5 dataset "linked_dataset": shape (10,), type "<i8">

However, when I do:

import kerchunk.hdf
import fsspec
import zarr

with fsspec.open(file_name) as inf:
    h5chunks = kerchunk.hdf.SingleHdf5ToZarr(
        inf, file_name, inline_threshold=100
    )
    fo = h5chunks.translate()

fs = fsspec.filesystem(
    "reference",
    fo=fo,
    skip_instance_cache=True,
)

z = zarr.open_group(fs.get_mapper(""), mode="r")
print(z.tree())

The tree looks like this:

/
 └── linked_dataset (10,) int64

This may be intentional behavior, since zarr does not support linking datasets the way HDF5 does. Is it possible to recreate the links in the JSON metadata created by kerchunk.hdf.SingleHdf5ToZarr to give the expected behavior? Please let me know if I'm missing anything!

For reference, I am using Kerchunk 0.2.5 and Python 3.11.9

@martindurant
Member

You are quite right, this is not handled, essentially because there is no obvious way to represent such links in zarr. With zarr V3, there has been talk about implementing links, but I'm not sure where that conversation stands.

For kerchunk, we can recognise links, and some special cases (single strings, IIRC) are handled. The question is what to do with the general case: should the metadata and references simply be duplicated?
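
To make the "duplicate the metadata and references" option concrete, here is a rough sketch of it as a post-processing step on a reference dict. It is only an illustration: `alias_refs` is a hypothetical helper and not part of kerchunk, the `.zarray` metadata is abbreviated, and the byte offset and length are invented. It assumes the version 1 reference layout, where the keys under "refs" are zarr store paths.

```python
import json


def alias_refs(fo, source, alias):
    """Return a copy of `fo` in which every key under `source/` also exists under `alias/`.

    Hypothetical helper, not part of kerchunk.
    """
    src_prefix = source.strip("/") + "/"
    dst_prefix = alias.strip("/") + "/"
    refs = dict(fo["refs"])  # shallow copy; the input dict is left untouched
    for key, value in fo["refs"].items():
        if key.startswith(src_prefix):
            refs[dst_prefix + key[len(src_prefix):]] = value
    return {**fo, "refs": refs}


# Illustrative reference set; metadata is abbreviated, offset and length are invented.
fo = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "linked_dataset/.zarray": json.dumps(
            {"shape": [10], "chunks": [10], "dtype": "<i8"}
        ),
        "linked_dataset/0": ["example.h5", 2048, 80],
    },
}

patched = alias_refs(fo, "linked_dataset", "original_dataset")
print(sorted(patched["refs"]))
```

Both names then point at the same byte range, so no data is copied; the cost is one extra set of keys per link.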

@ljwoods2
Contributor Author

ljwoods2 commented Jun 7, 2024

Got it, sounds like this is already a known limitation. Would some kind of warning or note in the docs be appropriate? I'd be happy to open a quick PR.

Here is a quick and dirty patch I'm using in my project to support hard links in H5MD; it simply duplicates the metadata:

Change in kerchunk.hdf.SingleHdf5ToZarr._translator

    def _translator(
        self,
        name: str,
        h5obj: Union[
            h5py.Dataset, h5py.Group, h5py.SoftLink, h5py.HardLink, h5py.ExternalLink
        ],
    ):
        """Produce Zarr metadata for all groups and datasets in the HDF5 file."""
        try:  # method must not raise exception
            kwargs = {}

            if isinstance(h5obj, (h5py.SoftLink, h5py.HardLink)):
                h5obj = self._h5f[name]

Change in kerchunk.hdf.SingleHdf5ToZarr.translate

    def translate(self, preserve_links=False):
        """Translate content of one HDF5 file into Zarr storage format.

        This method is the main entry point to execute the workflow, and
        returns a "reference" structure to be used with zarr/kerchunk

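        No data is copied out of the HDF5 file.

        Parameters
        ----------
        preserve_links : bool
            If True, visit links as well (via ``visititems_links``, which
            requires h5py>=3.11.0) so that linked objects are translated too.

        Returns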
        -------
        dict
            Dictionary containing reference structure.
        """
        lggr.debug("Translation begins")
        self._transfer_attrs(self._h5f, self._zroot)

        self._preserve_links = preserve_links
        if self._preserve_links:
            self._h5f.visititems_links(self._translator)
        else:
            self._h5f.visititems(self._translator)

This does require h5py>=3.11.0, since that is the release in which h5py.Group.visititems_links was introduced.
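
Given that version requirement, the optional path could detect the method at runtime and fail with a clear message on older h5py releases. The following is a minimal sketch of that idea, not the patch itself: `walk` and the `_OldGroup` stand-in are hypothetical names, and the stand-in exists only so the snippet runs without an HDF5 file.

```python
def walk(group, callback, preserve_links=False):
    """Visit `group` with `callback`, including links only when asked to.

    Hypothetical helper: `group` is anything exposing h5py.Group's
    `visititems` and, on h5py>=3.11.0, `visititems_links`.
    """
    if not preserve_links:
        return group.visititems(callback)
    visit_links = getattr(group, "visititems_links", None)
    if visit_links is None:
        raise RuntimeError("preserve_links=True requires h5py>=3.11.0")
    return visit_links(callback)


class _OldGroup:
    """Stand-in for a group from an h5py release without visititems_links."""

    def visititems(self, callback):
        return callback("original_dataset", object())


seen = []
walk(_OldGroup(), lambda name, obj: seen.append(name))  # default path still works

try:
    walk(_OldGroup(), lambda name, obj: seen.append(name), preserve_links=True)
except RuntimeError as err:
    message = str(err)
```

Checking for the attribute rather than parsing the version string keeps the default path usable on any h5py release.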

@martindurant
Member

If it's optional, I think this would be a useful addition.
