Caterva inside Zarr #713

Open · rabernat opened this issue Mar 28, 2021 · 6 comments
Labels: enhancement (New features or improvements)

rabernat commented Mar 28, 2021

I've been reading about Caterva and have chatted a few times about it with @FrancescAlted. Caterva clearly has some overlap with Zarr, but I think it would be great if we could find some points for collaboration. A key difference is that Caterva stores everything in a single file, so it is aimed at "not-so-big data". By combining Zarr with Caterva, we may get the best of both worlds.

The specific idea would be to encode a Zarr chunk as a Caterva array. This would allow us to leverage Caterva's efficient sub-slicing for partial chunk reads.

Does this make sense? I think so. @FrancescAlted suggests this explicitly in these slides https://www.blosc.org/docs/Caterva-HDF5-Workshop.pdf.

The path forward would be to create a numcodecs codec for Caterva.
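
To make the target concrete, here is a rough sketch of what such a codec could look like. The Codec base class and register_codec are real numcodecs API, and the Caterva call mirrors the one used later in this thread; the encode path is left open because, as discussed further down, python-caterva does not currently expose the compressed bytes.

# Rough sketch only: the numcodecs Codec/register_codec API is real, but the
# Caterva round trip is assumed, and encode() is exactly the open question
# discussed below (getting the compressed frame out of caterva).
import numpy as np
import caterva as cat
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class CatervaCodec(Codec):
    """Hypothetical Caterva codec for numcodecs (illustrative only)."""

    codec_id = "caterva"

    def __init__(self, shape, itemsize, chunks, blocks):
        self.shape = tuple(shape)
        self.itemsize = itemsize
        self.chunks = tuple(chunks)
        self.blocks = tuple(blocks)

    def encode(self, buf):
        # Build a Caterva array from the raw chunk bytes...
        arr = cat.from_buffer(
            np.ascontiguousarray(buf).tobytes(),
            shape=self.shape,
            itemsize=self.itemsize,
            chunks=self.chunks,
            blocks=self.blocks,
        )
        # ...but there is currently no public accessor for the compressed
        # frame, which is the gap identified later in this thread.
        raise NotImplementedError("need access to Caterva's compressed frame")

    def decode(self, buf, out=None):
        # Would rebuild the Caterva array from the compressed frame and
        # return the decompressed bytes (this API is also assumed).
        raise NotImplementedError


register_codec(CatervaCodec)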

FrancescAlted commented Mar 29, 2021

Definitely. We have designed Caterva as a multidimensional building block with the intention that other libraries can leverage it, so it makes total sense (and we would be very happy) if Zarr can do so. Just a couple of remarks:

  1. Caterva supports persistence either as a single file or as a directory (i.e. à la Zarr). This is a consequence of the recent implementation of sparse frames in the C-Blosc2 library (we actually blogged about it: https://www.blosc.org/posts/introducing-sparse-frames/).

  2. Caterva brings many more features than filters and codecs. It is meant to become a full-fledged container for binary data, and in particular it implements two-level chunking that allows for finer granularity when slicing (https://github.com/Blosc/cat4py/blob/master/notebooks/slicing-performance.ipynb). A small sketch covering both points follows this list.
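
As a rough illustration of both points, a minimal sketch using the python-caterva calls that appear elsewhere in this thread; the keyword names (chunks, blocks, filename) are assumptions and should be verified against the python-caterva documentation.

# Minimal sketch, assuming the python-caterva keywords used elsewhere in
# this thread; verify against the python-caterva docs.
import numpy as np
import caterva as cat

data = np.random.rand(1000, 1000)

c = cat.from_buffer(
    data.tobytes(),
    shape=data.shape,
    itemsize=data.dtype.itemsize,
    chunks=(500, 500),        # first-level partition: the unit of compression
    blocks=(50, 50),          # second-level partition: finer slicing granularity
    filename="example.cat",   # persist as a single file (point 1 above)
)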

Finally, Caterva has a well-established roadmap that we will be trying to follow: https://github.com/Blosc/Caterva/blob/master/ROADMAP.rst. If you think that Zarr can benefit from any of these planned features, we will be glad to accept contributions (in any form: suggestions, code, grants).

jakirkham commented

cc @joshmoore @shoyer (in case you find this interesting ;)

rabernat commented

I started playing with this today. As a first step, I am just trying to implement encoding / decoding of numpy data into caterva, as needed by numcodecs.

But immediately I hit a roadblock. I can't figure out how to get the encoded bytes / buffer out of caterva. For example, to encode an array, I am doing:

import caterva as cat
import numpy as np

data = np.random.rand(10000, 10000)
c = cat.from_buffer(
    data.tobytes(),
    shape=data.shape,
    itemsize=data.dtype.itemsize,
    chunks=(1000, 10000),
    blocks=(100, 100)
)

# encoded_data = ?

The c.to_buffer() method returns the uncompressed data. I could persist the caterva data to disk, e.g. by passing filename='some/string/path', but this is not what numcodecs needs: it just wants the encoded bytes. As far as I can tell, caterva does not expose this in its public API.
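
For contrast, this is the bytes-in/bytes-out contract numcodecs expects, shown here with the existing numcodecs Blosc codec; a Caterva codec would need to return the compressed frame from encode() in the same way.

# The contract a numcodecs codec has to satisfy, illustrated with the
# existing Blosc codec (real numcodecs API).
import numpy as np
from numcodecs import Blosc

codec = Blosc(cname="zstd", clevel=5)
data = np.random.rand(100, 100)

encoded = codec.encode(data)     # compressed bytes -- the piece missing from caterva's public API
decoded = codec.decode(encoded)  # raw bytes back
roundtrip = np.frombuffer(decoded, dtype=data.dtype).reshape(data.shape)
assert np.array_equal(roundtrip, data)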

Am I missing something?

FrancescAlted commented

AFAIK we have not yet implemented an accessor to the compressed data in python-caterva, but even if we had, I am afraid you couldn't immediately leverage it, because Caterva uses C-Blosc2 frames to store the compressed data, plus a metalayer for the dimensionality. The frames then contain the C-Blosc2 chunks. It goes like this:

[image: diagram of a Caterva frame, showing the metalayer and the C-Blosc2 chunks it contains]

You can find more info on the Caterva metalayer here: https://caterva.readthedocs.io/en/latest/getting_started/overview.html.

In case you still want to access the raw Caterva buffers, you can do that using the C API. First, in order to avoid copies, you need to create a contiguous buffer by setting caterva_storage_properties_blosc_t.sequencial to true; then you can access that buffer with blosc2_schunk_to_buffer(cat_array->sc, ...).

rabernat commented

Thanks for the tips, Francesc. It sounds like we will probably have to create a Cython wrapper for Caterva in numcodecs, similar to what we currently do with Blosc.

Understanding how to best leverage Caterva for Zarr is going to be a bit trickier than I hoped, because the Numcodecs API only defines decompress_partial for a single contiguous byte range:

https://github.com/zarr-developers/numcodecs/blob/98c9e08fc7895dae4d5f9d2abf7b3e405f407402/numcodecs/blosc.pyx#L566-L569

Which we use in Zarr python here:

zarr-python/zarr/core.py, lines 1961 to 1965 in adc430a:

if (
    all([x is not None for x in [start, nitems]])
    and self._compressor.codec_id == "blosc"
) and hasattr(self._compressor, "decode_partial"):
    chunk = self._compressor.decode_partial(cdata, start, nitems)

The implementation is basically hard-coded to Blosc:

Notes
-----
An array is flattened when compressed with blosc, so this iterator takes
the wanted selection of an array and determines the wanted coordinates
of the flattened, compressed data to be read and then decompressed. The
decompressed data is then placed in a temporary empty array of size
`Array._chunks` at the indices yielded as partial_out_selection.
Once all the slices yielded by this iterator have been read, decompressed
and written to the temporary array, the wanted slice of the chunk can be
indexed from the temporary array and written to the out_selection slice
of the out array.

In order to leverage the ND-slicing capabilities of Caterva, we would need to further refactor the interface between Numcodecs and Zarr.
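
Purely as a strawman for that refactor, the interface might grow an N-dimensional partial-decode hook along these lines; the method name and signature are hypothetical, not existing numcodecs API.

# Strawman only: decode_partial_nd is NOT an existing numcodecs method. It
# sketches the kind of hook Zarr would need in order to hand an ND chunk
# selection straight to Caterva instead of a flat byte range.
from numcodecs.abc import Codec


class NDPartialCodec(Codec):
    """Illustrative codec interface with an ND partial-decode hook."""

    codec_id = "nd-partial-example"

    def encode(self, buf):
        raise NotImplementedError

    def decode(self, buf, out=None):
        raise NotImplementedError

    def decode_partial_nd(self, cdata, selection, out=None):
        """Decode only the region of the chunk given by ``selection``, a tuple
        of slices relative to the chunk shape. A Caterva-backed implementation
        could map this onto its two-level block slicing, so only the blocks
        intersecting the selection are decompressed."""
        raise NotImplementedError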

jakirkham commented

Raised an issue ( fsspec/filesystem_spec#766 ) about supporting range queries in fsspec. That seems relevant here, but I'm still thinking through exactly how we would use it.
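
For a sense of what such a range query looks like from Python, here is a small sketch using fsspec's cat_file with start/end offsets; the file and offsets are made up, and support for the offsets depends on the fsspec version and backend.

# Sketch of a byte-range read via fsspec; start/end support in cat_file
# depends on the fsspec version and the underlying filesystem.
import fsspec

# Write a dummy "chunk" so the example is self-contained; in practice this
# would be an object in a Zarr store (e.g. a Caterva-encoded chunk).
with open("chunk.bin", "wb") as f:
    f.write(bytes(range(256)) * 64)

fs = fsspec.filesystem("file")
# Fetch only bytes [4096, 8192) -- e.g. the block covering the requested
# slice -- instead of reading the whole object.
partial = fs.cat_file("chunk.bin", start=4096, end=8192)
print(len(partial))  # 4096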
