Partial chunk read? #521
Comments
Maybe I am missing something here, but to the best of my knowledge HDF5 will always read full chunks. See also the section "Pitfalls: Chunks are too large" here or the explanations here. You can of course request a sub-part of the chunk, but both zarr and hdf5 (I think) will internally read the full chunk and then slice the request from it.
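A minimal sketch of what this looks like from the API side (the shapes, chunking and the `stack.zarr` path are made up for illustration):

```python
import numpy as np
import zarr

# Made-up T x Z x Y x X stack with chunks that span many frames.
z = zarr.open("stack.zarr", mode="w", shape=(40, 10, 128, 128),
              chunks=(10, 10, 128, 128), dtype="uint16")
z[:] = np.random.randint(0, 2**12, size=z.shape, dtype="uint16")

# Requesting a single frame returns just that frame, but zarr (like HDF5)
# internally reads and decompresses the whole (10, 10, 128, 128) chunk
# that contains it before slicing out the requested part.
frame = z[0, 0]
print(frame.shape)  # (128, 128)
```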
It has been raised before. I don’t know if there is a GitHub issue for it (there may already be one). I recall reading somewhere that Blosc supports partial decompression, though I'm not sure about other compressors. So there’s a possibility of doing this on compressed data.

This may be possible in some cases, but may be a fair bit of work. Are you mainly curious about this, or are you able to show this has some notable impact for your workload?

In practice I think many people use Zarr with some other parallelism library (like Dask) where they align computational chunks to stored chunks. With this sort of workflow partial reads don’t matter. What does your workflow look like?
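For the aligned-chunks workflow mentioned above, a rough sketch (the `data.zarr` path and the computation are hypothetical):

```python
import dask.array as da
import zarr

# Hypothetical 4D dataset already written to disk as a zarr array.
z = zarr.open("data.zarr", mode="r")

# chunks=z.chunks means each Dask task maps onto exactly one stored chunk,
# so every chunk is read and decompressed once, never partially.
arr = da.from_array(z, chunks=z.chunks)

# Any blockwise computation then stays aligned with storage.
per_frame_mean = arr.mean(axis=(-2, -1)).compute()
```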
For reference, this was previously brought up in #40.
Thanks for dredging that up, @jrbourbeau! 😀
@constantinpape as far as I understand, hdf5 chunks are internal chunks of a larger hdf5 file. To do something like that I would have to save thousands of chunks in my zarr.

@jakirkham right now I face two different needs in my workflow. My data are 4D volumetric time series. For any kind of analysis, e.g. source extraction, I chunk them along the spatial or temporal dimensions and run parallel operations on them (with Dask), and for this part I keep the zarr chunk size aligned to the sizes that I manipulate in a single Dask task. On the other side, I also need good visualisation performance, for which I need to load single frames when scrolling through the stacks, and currently doing this on the same zarr dataset makes my interface slow. I think that other people who are testing out the napari interface with zarr support might have the same needs.
Ok, I understand your point now: in order to make access efficient for different usage patterns, you would need to make the chunks smaller. In hdf5 this is not a problem because chunks are stored internally, so having more chunks does not hurt (note that this is not quite true: more chunks will increase the overall file size in hdf5 and make access slower, see one of my links above). I agree that partial chunk reads would be a good solution for this, but they might be challenging to implement. (@jakirkham I didn't know about partial decompression in blosc, that's interesting.)

@vigji have you tried using the ZipStore? This might be helpful for your use case, because you could make the chunks smaller but still have everything stored in one file.
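For reference, a minimal ZipStore sketch (file names, shapes and chunking are made up; this assumes zarr v2's `zarr.ZipStore`):

```python
import zarr

# Many small chunks, but only a single .zip file on disk.
store = zarr.ZipStore("movie.zip", mode="w")
z = zarr.create(shape=(50, 5, 128, 128), chunks=(1, 1, 128, 128),
                dtype="uint16", store=store)
z[:] = 0  # placeholder data; each chunk is written exactly once
store.close()

# Single-frame reads only need the chunk holding that frame.
store = zarr.ZipStore("movie.zip", mode="r")
z = zarr.open(store, mode="r")
frame = z[0, 0]
store.close()
```

One caveat worth keeping in mind: entries in a zip file cannot be overwritten in place, so a ZipStore is best suited to write-once data.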
Can this be closed now that #667 has been merged?
I am trying to use zarr together with dask and xarray to store large volumetric imaging data (T x Z x X x Y dask arrays). Previously, I was saving the dataset split across hand-crafted .hdf5 files, from which I could read out small slices of the data (e.g., one element along one dimension, mostly for visualisation purposes) without having to load a full chunk.

Is it possible to do something like that in zarr, or do I always have to load the full chunk? I tried disabling all compression and it still does not seem capable of reading out only a small part of a chunk. I know I could just save small chunks, but that would turn my datasets into thousands of files, with more overhead when loading large slices.

Am I missing something? If not, has such a feature been suggested for future zarr versions?
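A rough sketch of the trade-off described above, assuming a dask workflow (shapes, chunk sizes and store paths are invented for illustration):

```python
import dask.array as da

# Made-up T x Z x Y x X movie.
movie = da.random.randint(0, 2**12, size=(100, 10, 256, 256),
                          chunks=(20, 10, 256, 256)).astype("uint16")

# Chunks sized for batch analysis: one Dask task per stored chunk, but a
# single-frame read has to pull in a whole 20-frame chunk.
movie.to_zarr("analysis_chunks.zarr")

# Chunks sized for visualisation: cheap single-frame reads, at the cost of
# thousands of small files in the store.
movie.rechunk((1, 1, 256, 256)).to_zarr("frame_chunks.zarr")
```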