Caterva inside Zarr #713

Open · rabernat opened this issue Mar 28, 2021 · 6 comments
Labels: enhancement (New features or improvements)

rabernat commented Mar 28, 2021

I've been reading about Caterva and have chatted a few times about it with @FrancescAlted. Caterva clearly has some overlap with Zarr, but I think it would be great if we could find some points for collaboration. A key difference is that Caterva stores everything in a single file, so it is aimed at "not-so-big data". By combining Zarr with Caterva, we may get the best of both worlds.

The specific idea would be to encode a Zarr chunk as a Caterva array. This would allow us to leverage Caterva's efficient sub-slicing for partial chunk reads.

Does this make sense? I think so. @FrancescAlted suggests this explicitly in these slides https://www.blosc.org/docs/Caterva-HDF5-Workshop.pdf.

The path forward would be to create a numcodecs codec for Caterva.
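
To make the target concrete, here is a rough sketch of what such a codec could look like. The Codec base class and register_codec are real numcodecs API, and the Caterva call mirrors the one used later in this thread; the encode path is left open because, as discussed further down, python-caterva does not currently expose the compressed bytes.

# Rough sketch only: the numcodecs Codec/register_codec API is real, but the
# Caterva round trip is assumed, and encode() is exactly the open question
# discussed below (getting the compressed frame out of caterva).
import numpy as np
import caterva as cat
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class CatervaCodec(Codec):
    """Hypothetical Caterva codec for numcodecs (illustrative only)."""

    codec_id = "caterva"

    def __init__(self, shape, itemsize, chunks, blocks):
        self.shape = tuple(shape)
        self.itemsize = itemsize
        self.chunks = tuple(chunks)
        self.blocks = tuple(blocks)

    def encode(self, buf):
        # Build a Caterva array from the raw chunk bytes...
        arr = cat.from_buffer(
            np.ascontiguousarray(buf).tobytes(),
            shape=self.shape,
            itemsize=self.itemsize,
            chunks=self.chunks,
            blocks=self.blocks,
        )
        # ...but there is currently no public accessor for the compressed
        # frame, which is the gap identified later in this thread.
        raise NotImplementedError("need access to Caterva's compressed frame")

    def decode(self, buf, out=None):
        # Would rebuild the Caterva array from the compressed frame and
        # return the decompressed bytes (this API is also assumed).
        raise NotImplementedError


register_codec(CatervaCodec)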

FrancescAlted commented Mar 29, 2021

Definitely. We have designed Caterva as a multidimensional building block with the intention that other libraries can leverage it, so it makes total sense (and we would be very happy) if Zarr can do so. Just a couple of remarks:

  1. Caterva supports persistence either as a single file or as a directory (i.e. à la Zarr). This is a consequence of the recent implementation of sparse frames in the C-Blosc2 library (we actually blogged about it: https://www.blosc.org/posts/introducing-sparse-frames/).

  2. Caterva brings many more features than filters and codecs. It is meant to become a full-fledged container for binary data, and in particular it implements two-level chunking that allows for finer granularity when slicing (https://github.com/Blosc/cat4py/blob/master/notebooks/slicing-performance.ipynb). A small sketch covering both points follows this list.
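
As a rough illustration of both points, a minimal sketch using the python-caterva calls that appear elsewhere in this thread; the keyword names (chunks, blocks, filename) are assumptions and should be verified against the python-caterva documentation.

# Minimal sketch, assuming the python-caterva keywords used elsewhere in
# this thread; verify against the python-caterva docs.
import numpy as np
import caterva as cat

data = np.random.rand(1000, 1000)

c = cat.from_buffer(
    data.tobytes(),
    shape=data.shape,
    itemsize=data.dtype.itemsize,
    chunks=(500, 500),        # first-level partition: the unit of compression
    blocks=(50, 50),          # second-level partition: finer slicing granularity
    filename="example.cat",   # persist as a single file (point 1 above)
)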

Finally, Caterva has a well-established roadmap that we will be trying to follow: https://github.com/Blosc/Caterva/blob/master/ROADMAP.rst. If you think that Zarr can benefit from any of these planned features, we will be glad to accept contributions (in any form: suggestions, code, grants).

jakirkham commented

cc @joshmoore @shoyer (in case you find this interesting ;)

rabernat commented

I started playing with this today. As a first step, I am just trying to implement encoding / decoding of numpy data into caterva, as needed by numcodecs.

But immediately I hit a roadblock. I can't figure out how to get the encoded bytes / buffer out of caterva. For example, to encode an array, I am doing:

import caterva as cat
import numpy as np

data = np.random.rand(10000, 10000)
c = cat.from_buffer(
    data.tobytes(),
    shape=data.shape,
    itemsize=data.dtype.itemsize,
    chunks=(1000, 10000),
    blocks=(100, 100)
)

# encoded_data = ?

The c.to_buffer() method returns the uncompressed data. I could persist the caterva data to disk, e.g. by passing filename='some/string/path', but this is not what numcodecs needs: it just wants the encoded bytes. As far as I can tell, caterva does not expose this in its public API.
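
For contrast, this is the bytes-in/bytes-out contract numcodecs expects, shown here with the existing numcodecs Blosc codec; a Caterva codec would need to return the compressed frame from encode() in the same way.

# The contract a numcodecs codec has to satisfy, illustrated with the
# existing Blosc codec (real numcodecs API).
import numpy as np
from numcodecs import Blosc

codec = Blosc(cname="zstd", clevel=5)
data = np.random.rand(100, 100)

encoded = codec.encode(data)     # compressed bytes -- the piece missing from caterva's public API
decoded = codec.decode(encoded)  # raw bytes back
roundtrip = np.frombuffer(decoded, dtype=data.dtype).reshape(data.shape)
assert np.array_equal(roundtrip, data)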

Am I missing something?

FrancescAlted commented

AFAIK we have not yet implemented an accessor to the compressed data in python-caterva, but even if we had, I am afraid you couldn't immediately leverage it, because Caterva uses C-Blosc2 frames to store the compressed data, plus a metalayer for the dimensionality. The frames then contain the C-Blosc2 chunks. It goes like this:

[image: diagram of a Caterva frame, showing the metalayer and the C-Blosc2 chunks it contains]

You can find more info on the Caterva metalayer here: https://caterva.readthedocs.io/en/latest/getting_started/overview.html.

In case you still want to access the raw Caterva buffers, you can do that using the C API. First, in order to avoid copies, you need to create a contiguous buffer by setting caterva_storage_properties_blosc_t.sequencial to true; then you can access that buffer with blosc2_schunk_to_buffer(cat_array->sc, ...).

rabernat commented

Thanks for the tips, Francesc. It sounds like we will probably have to create a Cython wrapper for Caterva in numcodecs, similar to what we currently do with Blosc.

Understanding how to best leverage Caterva for Zarr is going to be a bit trickier than I hoped, because the Numcodecs API only defines decompress_partial for a single contiguous byte range:

https://github.com/zarr-developers/numcodecs/blob/98c9e08fc7895dae4d5f9d2abf7b3e405f407402/numcodecs/blosc.pyx#L566-L569

Which we use in Zarr python here:

zarr-python/zarr/core.py, lines 1961 to 1965 in adc430a:

if (
    all([x is not None for x in [start, nitems]])
    and self._compressor.codec_id == "blosc"
) and hasattr(self._compressor, "decode_partial"):
    chunk = self._compressor.decode_partial(cdata, start, nitems)

The implementation is basically hard-coded to Blosc:

Notes
-----
An array is flattened when compressed with blosc, so this iterator takes
the wanted selection of an array and determines the wanted coordinates
of the flattened, compressed data to be read and then decompressed. The
decompressed data is then placed in a temporary empty array of size
`Array._chunks` at the indices yielded as partial_out_selection.
Once all the slices yielded by this iterator have been read, decompressed
and written to the temporary array, the wanted slice of the chunk can be
indexed from the temporary array and written to the out_selection slice
of the out array.

In order to leverage the ND-slicing capabilities of Caterva, we would need to further refactor the interface between Numcodecs and Zarr.
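
Purely as a strawman for that refactor, the interface might grow an N-dimensional partial-decode hook along these lines; the method name and signature are hypothetical, not existing numcodecs API.

# Strawman only: decode_partial_nd is NOT an existing numcodecs method. It
# sketches the kind of hook Zarr would need in order to hand an ND chunk
# selection straight to Caterva instead of a flat byte range.
from numcodecs.abc import Codec


class NDPartialCodec(Codec):
    """Illustrative codec interface with an ND partial-decode hook."""

    codec_id = "nd-partial-example"

    def encode(self, buf):
        raise NotImplementedError

    def decode(self, buf, out=None):
        raise NotImplementedError

    def decode_partial_nd(self, cdata, selection, out=None):
        """Decode only the region of the chunk given by ``selection``, a tuple
        of slices relative to the chunk shape. A Caterva-backed implementation
        could map this onto its two-level block slicing, so only the blocks
        intersecting the selection are decompressed."""
        raise NotImplementedError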

jakirkham commented

Raised an issue ( fsspec/filesystem_spec#766 ) about supporting range queries in fsspec. That seems relevant here, but I'm still thinking through exactly how we would use it.
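
For a sense of what such a range query looks like from Python, here is a small sketch using fsspec's cat_file with start/end offsets; the file and offsets are made up, and support for the offsets depends on the fsspec version and backend.

# Sketch of a byte-range read via fsspec; start/end support in cat_file
# depends on the fsspec version and the underlying filesystem.
import fsspec

# Write a dummy "chunk" so the example is self-contained; in practice this
# would be an object in a Zarr store (e.g. a Caterva-encoded chunk).
with open("chunk.bin", "wb") as f:
    f.write(bytes(range(256)) * 64)

fs = fsspec.filesystem("file")
# Fetch only bytes [4096, 8192) -- e.g. the block covering the requested
# slice -- instead of reading the whole object.
partial = fs.cat_file("chunk.bin", start=4096, end=8192)
print(len(partial))  # 4096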
