Partial chunk read? #521

Closed
vigji opened this issue Nov 24, 2019 · 7 comments
vigji commented Nov 24, 2019

I am trying to use zarr together with dask and xarray to store large volumetric imaging data (T x Z x X x Y dask arrays). Previously, I was saving the dataset split across hand-rolled .hdf5 files, from which I could read out small slices of the data (e.g., one element along one dimension, mostly for visualisation purposes) without having to load the full chunk.
Is it possible to do something like that in zarr, or do I always have to load the full chunk? I tried disabling all compression and it still does not seem able to read out only a small part of a chunk. I know I could just save small chunks, but this would turn my datasets into thousands of files, with more overhead when loading large slices is required.
Am I missing something? If not, has such a feature been suggested for future zarr versions?
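For reference, this is roughly the access pattern I mean (a minimal sketch; the store name and layout are hypothetical):

```python
import dask.array as da

# Hypothetical store written elsewhere, shape (T, Z, Y, X), chunked for analysis.
arr = da.from_zarr("volume.zarr")

# Even with compression disabled, zarr reads every chunk that overlaps the
# slice in full and then subsets it, so a single-frame read stays chunk-sized.
frame = arr[0, 0].compute()
```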

@constantinpape

Previously, I was saving the dataset split across hand-rolled .hdf5 files, from which I could read out small slices of the data (e.g., one element along one dimension, mostly for visualisation purposes) without having to load the full chunk.

Maybe I am missing something here, but to the best of my knowledge HDF5 will always read full chunks. See also the section "Pitfalls: Chunks are too large" here, or the explanations here.
I am not aware of any optimizations for reading partial data from chunks in storage order from an HDF5 container (which would only work for uncompressed data anyway).

You can of course request a sub-part of the chunk, but both zarr and hdf5 (I think) will internally read the full chunk and then slice the request from it.
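To illustrate with zarr alone (a minimal sketch; shapes and chunking are hypothetical): any slice that touches a chunk forces the whole chunk to be read and decompressed before the requested subset is returned.

```python
import zarr

# In-memory array chunked into (10, 10, 256, 256) blocks.
z = zarr.zeros((100, 10, 256, 256), chunks=(10, 10, 256, 256), dtype="uint16")

# This request needs a single (256, 256) plane, but the store still loads
# and decompresses the full (10, 10, 256, 256) chunk that contains it.
plane = z[0, 0, :, :]
```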

@jakirkham
Member

It has been raised before. I don’t know if there is a GitHub issue for it (there may already be one).

I recall reading somewhere that Blosc supports partial decompression, though I'm not sure about other compressors. So there's a possibility of doing this on compressed data.

This may be possible in some cases, but may be a fair bit of work. Are you mainly curious about this or are you able to show this has some notable impact for your workload?

In practice I think many people use Zarr with some other parallelism library (like Dask) where they align computational chunks to stored chunks. With this sort of workflow partial reads don’t matter.

What does your workflow look like?
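As a rough sketch of that chunk-aligned pattern (store names are hypothetical): dask picks up the zarr chunking by default, so each task reads exactly the chunks it computes on and partial chunk reads never come up.

```python
import dask.array as da

arr = da.from_zarr("volume.zarr")      # dask chunks default to the zarr chunks
mean_over_time = arr.mean(axis=0)      # each task consumes whole stored chunks
mean_over_time.to_zarr("mean.zarr")    # hypothetical output store
```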

@jrbourbeau
Member

For reference, this was previously brought up in #40.

@jakirkham
Member

Thanks for dredging that up, @jrbourbeau ! 😀

vigji commented Nov 30, 2019

@constantinpape as far as I understand, HDF5 chunks are internal chunks of a single larger HDF5 file. To do something like that, I would have to save thousands of chunks in my zarr store.

@jakirkham right now I have two different needs in my workflow. My data are 4D volumetric time series. For any kind of analysis, e.g. source extraction, I chunk them along the spatial or temporal dimensions and run parallel operations (with Dask) on them; for this part, I keep the zarr chunk size aligned to the sizes that I manipulate in a single Dask task. On the other side, I also need good visualisation performance, for which I need to load single frames while scrolling through the stacks, and currently doing this on the same zarr dataset makes my interface slow. I think that other people who are testing out the napari interface with zarr support might have the same needs.
I guess one solution would be to save a lot of very small chunks, but this does not sound optimal.
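For instance, a small-chunk copy just for viewing would look roughly like this (a sketch; names and shapes are hypothetical):

```python
import dask.array as da

data = da.from_zarr("movie.zarr")  # (T, Z, Y, X), analysis-sized chunks

# Rechunk a second copy to single-frame chunks so the viewer only ever
# reads one small chunk per frame, at the cost of many more chunk files.
small = data.rechunk((1, 1, data.shape[2], data.shape[3]))
small.to_zarr("movie_view.zarr")
```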

@constantinpape

@constantinpape as far as I understand, HDF5 chunks are internal chunks of a single larger HDF5 file. To do something like that, I would have to save thousands of chunks in my zarr store.

OK, I understand your point now: in order to make access efficient for different usage patterns, you would need to make the chunks smaller. In HDF5 this is not a problem, because chunks are stored internally, so having more chunks does not hurt (note that this is not quite true: more chunks will increase the overall file size in HDF5 and make access slower, see one of my links above).
In zarr, this would mean having many more underlying files, which can become an issue with the filesystem.

I agree that having partial chunk reads would be a good solution for this, but it might be challenging to implement. (@jakirkham I didn't know about partial decompression in Blosc, that's interesting.)

@vigji Have you tried using the ZipStore? This might be helpful for your use case, because you could make chunks smaller but still have everything stored in one file.
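A minimal ZipStore sketch (zarr v2 API; file name, shape and chunking are hypothetical), where many small chunks live inside a single zip archive instead of many separate files:

```python
import zarr

# Write: single-frame chunks, all stored inside one .zip file.
store = zarr.ZipStore("movie.zarr.zip", mode="w")
root = zarr.group(store=store)
data = root.zeros("data", shape=(1000, 30, 512, 512),
                  chunks=(1, 1, 512, 512), dtype="uint16")
# ... fill `data` here (note a ZipStore entry cannot be rewritten in place) ...
store.close()

# Read back a single frame; only its small chunk is read from the archive.
store = zarr.ZipStore("movie.zarr.zip", mode="r")
frame = zarr.open_group(store, mode="r")["data"][0, 0]
store.close()
```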

@rabernat
Contributor

Can this be closed now that #667 has been merged?
