Partial chunk read? #521

Closed
vigji opened this issue Nov 24, 2019 · 7 comments
vigji commented Nov 24, 2019

I am trying to use zarr together with dask and xarray to store large volumetric imaging data (T x Z x X x Y dask arrays). Previously, I was saving the dataset split across hand-rolled .hdf5 files, from which I could read out small slices of the data (e.g., one element along one dimension, mostly for visualisation purposes) without having to load the full chunk.
Is it possible to do something like that in zarr, or do I always have to load the full chunk? I tried disabling all compression and it still does not seem able to read out only a small part of a chunk. I know I could just save small chunks, but this would turn my datasets into thousands of files, with more overhead when loading large slices is required.
Am I missing something? If not, has such a feature been suggested for future zarr versions?
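For reference, this is roughly the access pattern I mean (a minimal sketch; the store name and layout are hypothetical):

```python
import dask.array as da

# Hypothetical store written elsewhere, shape (T, Z, Y, X), chunked for analysis.
arr = da.from_zarr("volume.zarr")

# Even with compression disabled, zarr reads every chunk that overlaps the
# slice in full and then subsets it, so a single-frame read stays chunk-sized.
frame = arr[0, 0].compute()
```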

@constantinpape

Previously, I was saving the dataset split across hand-rolled .hdf5 files, from which I could read out small slices of the data (e.g., one element along one dimension, mostly for visualisation purposes) without having to load the full chunk.

Maybe I am missing something here, but to the best of my knowledge HDF5 will always read full chunks. See also the section "Pitfalls: Chunks are too large" here, or the explanations here.
I am not aware of any optimizations for reading partial data from chunks in storage order from an HDF5 container (which would only work for uncompressed data anyway).

You can of course request a sub-part of the chunk, but both zarr and hdf5 (I think) will internally read the full chunk and then slice the request from it.
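To illustrate with zarr alone (a minimal sketch; shapes and chunking are hypothetical): any slice that touches a chunk forces the whole chunk to be read and decompressed before the requested subset is returned.

```python
import zarr

# In-memory array chunked into (10, 10, 256, 256) blocks.
z = zarr.zeros((100, 10, 256, 256), chunks=(10, 10, 256, 256), dtype="uint16")

# This request needs a single (256, 256) plane, but the store still loads
# and decompresses the full (10, 10, 256, 256) chunk that contains it.
plane = z[0, 0, :, :]
```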

@jakirkham
Member

It has been raised before. I don’t know if there is a GitHub issue for it (there may already be one).

I recall reading somewhere that Blosc supports partial decompression, though I'm not sure about other compressors. So there's a possibility of doing this on compressed data.

This may be possible in some cases, but may be a fair bit of work. Are you mainly curious about this or are you able to show this has some notable impact for your workload?

In practice I think many people use Zarr with some other parallelism library (like Dask) where they align computational chunks to stored chunks. With this sort of workflow partial reads don’t matter.

What does your workflow look like?
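As a rough sketch of that chunk-aligned pattern (store names are hypothetical): dask picks up the zarr chunking by default, so each task reads exactly the chunks it computes on and partial chunk reads never come up.

```python
import dask.array as da

arr = da.from_zarr("volume.zarr")      # dask chunks default to the zarr chunks
mean_over_time = arr.mean(axis=0)      # each task consumes whole stored chunks
mean_over_time.to_zarr("mean.zarr")    # hypothetical output store
```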

@jrbourbeau
Member

For reference, this was previously brought up in #40.

@jakirkham
Member

Thanks for dredging that up, @jrbourbeau ! 😀

vigji commented Nov 30, 2019

@constantinpape as far as I understand, HDF5 chunks are internal chunks of a single larger HDF5 file. To do something like that, I would have to save thousands of chunks in my zarr store.

@jakirkham right now I have two different needs in my workflow. My data are 4D volumetric time series. For any kind of analysis, e.g. source extraction, I chunk them along the spatial or temporal dimensions and run parallel operations (with Dask) on them; for this part, I keep the zarr chunk size aligned to the sizes that I manipulate in a single Dask task. On the other side, I also need good visualisation performance, for which I need to load single frames while scrolling through the stacks, and currently doing this on the same zarr dataset makes my interface slow. I think that other people who are testing out the napari interface with zarr support might have the same needs.
I guess one solution would be to save a lot of very small chunks, but this does not sound optimal.
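For instance, a small-chunk copy just for viewing would look roughly like this (a sketch; names and shapes are hypothetical):

```python
import dask.array as da

data = da.from_zarr("movie.zarr")  # (T, Z, Y, X), analysis-sized chunks

# Rechunk a second copy to single-frame chunks so the viewer only ever
# reads one small chunk per frame, at the cost of many more chunk files.
small = data.rechunk((1, 1, data.shape[2], data.shape[3]))
small.to_zarr("movie_view.zarr")
```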

@constantinpape

@constantinpape as far as I understand, HDF5 chunks are internal chunks of a single larger HDF5 file. To do something like that, I would have to save thousands of chunks in my zarr store.

OK, I understand your point now: in order to make access efficient for different usage patterns, you would need to make the chunks smaller. In HDF5 this is not a problem, because chunks are stored internally, so having more chunks does not hurt (note that this is not quite true: more chunks will increase the overall file size in HDF5 and make access slower, see one of my links above).
In zarr, this would mean having many more underlying files, which can become an issue with the filesystem.

I agree that having partial chunk reads would be a good solution for this, but it might be challenging to implement. (@jakirkham I didn't know about partial decompression in Blosc, that's interesting.)

@vigji Have you tried using the ZipStore? This might be helpful for your use case, because you could make chunks smaller but still have everything stored in one file.
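A minimal ZipStore sketch (zarr v2 API; file name, shape and chunking are hypothetical), where many small chunks live inside a single zip archive instead of many separate files:

```python
import zarr

# Write: single-frame chunks, all stored inside one .zip file.
store = zarr.ZipStore("movie.zarr.zip", mode="w")
root = zarr.group(store=store)
data = root.zeros("data", shape=(1000, 30, 512, 512),
                  chunks=(1, 1, 512, 512), dtype="uint16")
# ... fill `data` here (note a ZipStore entry cannot be rewritten in place) ...
store.close()

# Read back a single frame; only its small chunk is read from the archive.
store = zarr.ZipStore("movie.zarr.zip", mode="r")
frame = zarr.open_group(store, mode="r")["data"][0, 0]
store.close()
```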

@rabernat
Contributor

Can this be closed now that #667 has been merged?
