Caterva inside Zarr #713
Definitely. We have designed Caterva as a multidimensional building block with the intention that other libraries can leverage it, so I think it makes total sense (and we would be very happy) if Zarr can do so. Just a couple of remarks:
Finally, Caterva has a well-established roadmap that we will be trying to follow: https://github.com/Blosc/Caterva/blob/master/ROADMAP.rst. If you think that Zarr can benefit from any of these planned features, we will be glad to accept contributions (in any form: suggestions, code, or grants).
cc @joshmoore @shoyer (in case you find this interesting ;)
I started playing with this today. As a first step, I am just trying to implement encoding / decoding of numpy data into caterva, as needed by numcodecs. But immediately I hit a roadblock: I can't figure out how to get the encoded bytes / buffer out of caterva. For example, to encode an array, I am doing:
import caterva as cat
import numpy as np
data = np.random.rand(10000, 10000)
c = cat.from_buffer(
    data.tobytes(),
    shape=data.shape,
    itemsize=data.dtype.itemsize,
    chunks=(1000, 10000),
    blocks=(100, 100)
)
# encoded_data = ?
Am I missing something?
AFAIK we have not yet implemented an accessor to the compressed data in python-caterva, but even if we had, I am afraid that you couldn't immediately leverage it, because Caterva uses C-Blosc2 frames to store the compressed data, plus a metalayer for dimensionality. The frames, in turn, contain the C-Blosc2 chunks. It goes like this:
You can find more info on the Caterva metalayer here: https://caterva.readthedocs.io/en/latest/getting_started/overview.html.
In case you still want to access raw Caterva buffers, you can do that using the C API. First, and in order to avoid copies, you need to create a contiguous buffer by setting the caterva_storage_properties_blosc_t.sequencial to
Thanks for the tips, Francesc. It sounds like we will probably have to create a Cython wrapper for Caterva in numcodecs, similar to what we currently do with Blosc.
Understanding how to best leverage Caterva for Zarr is going to be a bit trickier than I hoped, because the Numcodecs API only defines
Which we use in Zarr python here: Lines 1961 to 1965 in adc430a
The implementation is basically hard-coded to Blosc: Lines 874 to 884 in adc430a
In order to leverage the ND-slicing capabilities of Caterva, we would need to further refactor the interface between Numcodecs and Zarr.
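To make the kind of refactor discussed here more concrete, below is a rough sketch (not the actual Numcodecs or Zarr code) of a reader that uses partial decoding when a codec offers it and falls back to a full decode otherwise. Everything in it is an assumption made for illustration: the `decode_selection` method, the `read_range` helper, the `BlockedZlibCodec` class and its byte layout are hypothetical, zlib stands in for Blosc/Caterva, and the selection is a plain 1-D byte range rather than an ND slice.

```python
import struct
import zlib


class ZlibCodec:
    """Plain codec exposing only whole-buffer encode/decode."""

    def encode(self, buf):
        return zlib.compress(bytes(buf))

    def decode(self, buf):
        return zlib.decompress(bytes(buf))


class BlockedZlibCodec:
    """Hypothetical codec that can also decode just a byte range."""

    def __init__(self, block_size=4):
        self.block_size = block_size

    def encode(self, buf):
        data = bytes(buf)
        parts = [zlib.compress(data[i:i + self.block_size])
                 for i in range(0, len(data), self.block_size)]
        # Header: number of blocks, then the compressed length of each block.
        header = struct.pack(f"<I{len(parts)}I", len(parts), *map(len, parts))
        return header + b"".join(parts)

    def _index(self, buf):
        # Rebuild (offset, size) pairs for every compressed block.
        (n,) = struct.unpack_from("<I", buf, 0)
        sizes = struct.unpack_from(f"<{n}I", buf, 4)
        index, pos = [], 4 + 4 * n
        for size in sizes:
            index.append((pos, size))
            pos += size
        return index

    def decode(self, buf):
        buf = bytes(buf)
        return b"".join(zlib.decompress(buf[o:o + s]) for o, s in self._index(buf))

    def decode_selection(self, buf, start, stop):
        """Decode only the blocks overlapping [start, stop) of the original data.

        Assumes 0 <= start <= stop <= length of the original data.
        """
        buf = bytes(buf)
        if stop <= start:
            return b""
        index = self._index(buf)
        first = start // self.block_size
        last = (stop - 1) // self.block_size
        raw = b"".join(zlib.decompress(buf[o:o + s]) for o, s in index[first:last + 1])
        base = first * self.block_size
        return raw[start - base:stop - base]


def read_range(codec, encoded, start, stop):
    """Use partial decoding when the codec offers it, else decode everything."""
    partial = getattr(codec, "decode_selection", None)
    if partial is not None:
        return partial(encoded, start, stop)
    return codec.decode(encoded)[start:stop]
```

With this shape, the calling code does not need to know which codec it holds: `read_range(codec, encoded, 5, 11)` returns the same bytes for both classes, but only the blocked codec avoids decompressing the whole buffer.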
Raised issue (fsspec/filesystem_spec#766) about supporting range queries in
I've been reading about Caterva and have chatted a few times about it with @FrancescAlted. Caterva clearly has some overlap with Zarr, but I think it would be great if we could find some points for collaboration. A key difference is that Caterva stores everything in a single file, and is consequently aimed at "not-so-big data". By combining Zarr with Caterva, we may get the best of both worlds.
The specific idea would be to encode a Zarr chunk as a Caterva array. This would allow us to leverage Caterva's efficient sub-slicing for partial chunk reads.
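As a toy illustration of why a second level of partitioning inside a chunk helps with partial reads, here is a minimal sketch that compresses a 2-D chunk tile by tile and then reads a sub-region by decompressing only the tiles that overlap it. It is not Caterva's actual format or API: the `encode_blocks` and `read_region` names, the dict-of-tiles layout, and the use of zlib in place of Blosc2 are all assumptions made for the example.

```python
import struct
import zlib


def encode_blocks(rows, block_shape):
    """Compress a 2-D chunk (list of equal-length rows of floats) tile by tile.

    Returns a dict mapping (block_row, block_col) -> compressed bytes.
    """
    bh, bw = block_shape
    n_rows, n_cols = len(rows), len(rows[0])
    blocks = {}
    for r0 in range(0, n_rows, bh):
        for c0 in range(0, n_cols, bw):
            # Flatten the tile in row-major order; edge tiles may be smaller.
            tile = [v for row in rows[r0:r0 + bh] for v in row[c0:c0 + bw]]
            raw = struct.pack(f"<{len(tile)}d", *tile)
            blocks[(r0 // bh, c0 // bw)] = zlib.compress(raw)
    return blocks


def read_region(blocks, chunk_shape, block_shape, row_slice, col_slice):
    """Read a rectangular region, decompressing only the overlapping tiles.

    Only unit-step slices are handled. Returns (region, tiles_decompressed).
    """
    bh, bw = block_shape
    n_rows, n_cols = chunk_shape
    r_start, r_stop, _ = row_slice.indices(n_rows)
    c_start, c_stop, _ = col_slice.indices(n_cols)
    out = [[0.0] * (c_stop - c_start) for _ in range(r_stop - r_start)]
    touched = 0
    for br in range(r_start // bh, (r_stop - 1) // bh + 1):
        for bc in range(c_start // bw, (c_stop - 1) // bw + 1):
            tile_h = min(bh, n_rows - br * bh)
            tile_w = min(bw, n_cols - bc * bw)
            raw = zlib.decompress(blocks[(br, bc)])
            tile = struct.unpack(f"<{tile_h * tile_w}d", raw)
            touched += 1
            # Copy the part of this tile that falls inside the requested region.
            for r in range(max(r_start, br * bh), min(r_stop, br * bh + tile_h)):
                for c in range(max(c_start, bc * bw), min(c_stop, bc * bw + tile_w)):
                    out[r - r_start][c - c_start] = tile[(r - br * bh) * tile_w + (c - bc * bw)]
    return out, touched


# A 6x8 chunk split into 2x4 tiles gives 6 tiles in total.
chunk = [[float(r * 8 + c) for c in range(8)] for r in range(6)]
blocks = encode_blocks(chunk, block_shape=(2, 4))
region, touched = read_region(blocks, (6, 8), (2, 4), slice(1, 3), slice(5, 7))
```

In this example the requested 2x2 region spans two of the six tiles, so only those two are decompressed. With a single compressed buffer per chunk, the whole chunk would have to be decoded to serve the same read.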
Does this make sense? I think so. @FrancescAlted suggests this explicitly in these slides https://www.blosc.org/docs/Caterva-HDF5-Workshop.pdf.
The path forward would be to create a numcodecs codec for Caterva.