Consider support for storing chunks in z-order #40
Alistair Miles: That's a neat trick. However, unless I'm missing something (very possible), …
Alistair Miles: Sorry, accidentally sent last comment incomplete. Basically I think that … Cc @benjeffery.
Stephan Hoyer: This would be for storing data within chunks, supposing support for partially reading chunks (possible with many backends). Given the fixed overhead for accessing each individual chunk, there is a minimum chunk size below which it doesn't make sense to chunk any further (depending on details, perhaps somewhere in the range of 1e3 to 1e6 elements). This could yield significant benefits in that case.
Alistair Miles: Ah OK. So what's the main use case? Taking a multidimensional slice where … Just so I'm completely with you, could you expand a bit more on what you …
Stephan Hoyer: A more common use case might be any slicing operation that is poorly aligned with the chunking, even if it involves multiple chunks. For example, suppose I have an array with chunks of size (100, 100, 100), and now want to index out a single point along the first two dimensions, e.g. … I don't have a direct use case for this right now (since I'm not actually using zarr yet), but I can see lots of examples where this sort of thing might be useful.

Oops, I had a typo (fixed now). What I was referring to is that there is some fixed overhead associated with storing and manipulating each chunk in every storage and task-scheduling system, which is why we don't chunk arrays into single values.
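To illustrate the locality argument behind z-order, here is a minimal sketch (mine, not from the thread): under a z-order layout within a chunk, the elements of a small square region occupy one contiguous run of linear offsets, while under row-major order the same region is split across several separate runs.

```python
def morton2(x, y):
    # Interleave the bits of x and y: bit i of x -> bit 2i, bit i of y -> bit 2i+1.
    z = 0
    for i in range(16):
        z |= (x >> i & 1) << (2 * i) | (y >> i & 1) << (2 * i + 1)
    return z

# Linear offsets touched when reading the 4x4 corner of an 8x8 chunk:
row_major = sorted(y * 8 + x for y in range(4) for x in range(4))
z_order = sorted(morton2(x, y) for y in range(4) for x in range(4))

print(row_major)  # four separate runs of 4, spanning offsets 0..27
print(z_order)    # one contiguous run: offsets 0..15
```

With row-major order the region needs four partial reads (or one read spanning 28 elements, most of them discarded); with z-order it is a single read of exactly 16 elements.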
Alistair Miles: Thanks Stephan. Just to add that this relates to how Zarr might make use of the blosc_getitem function to extract partial contents of a chunk. I know bcolz uses blosc_getitem, but there things are simpler because it only considers 1-D arrays; so far I haven't had the brain space to figure out how to use it for multidimensional arrays.
Alistair Miles: I don't have bandwidth to explore this myself, but I'm very happy to discuss further if there is interest and someone else has time. FWIW I think a natural starting point would be to add support for a codec operation analogous to …

Btw, the next release of Zarr will have all code for compression and filter codecs removed and obtained via a dependency on a new numcodecs package, so this issue may live more naturally there as a codec issue.

One other note: I think that although z-order might speed up access for small regions of an array, it would incur a cost for reading larger regions, because all read and write operations would need to pass through the z-order transformation.
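As a sketch of what such a partial-decode codec operation might do internally (hypothetical helper names, not an actual Zarr or numcodecs API): translate the requested region into z-order offsets within the chunk, then coalesce those into contiguous (start, length) runs, each of which could be fetched with a partial read along the lines of blosc_getitem.

```python
def morton2(x, y):
    # Bit-interleave a 2-D coordinate into its z-order (Morton) index.
    z = 0
    for i in range(16):
        z |= (x >> i & 1) << (2 * i) | (y >> i & 1) << (2 * i + 1)
    return z

def partial_read_runs(xs, ys):
    """Hypothetical planner: return the contiguous (start, length) runs of
    z-order offsets covering the region xs x ys within a single chunk."""
    offsets = sorted(morton2(x, y) for x in xs for y in ys)
    runs, start, prev = [], offsets[0], offsets[0]
    for o in offsets[1:]:
        if o != prev + 1:  # gap: close the current run, open a new one
            runs.append((start, prev - start + 1))
            start = o
        prev = o
    runs.append((start, prev - start + 1))
    return runs

print(partial_read_runs(range(4), range(4)))  # [(0, 16)]: a single read
```

Note the tradeoff raised above: a compact square region collapses into one run, but a long thin slice (e.g. a single column) fragments into many tiny runs, which is where z-order can cost more than it saves.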
Is this a duplicate of zarr-developers/numcodecs#669 (or vice versa)?
Original issue description: Apparently, this can result in significant I/O savings when indexing multi-dimensional arrays. See http://bl.ocks.org/jaredwinick/5073432 (note that the graphic is interactive).
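For readers unfamiliar with z-order: the Morton index is formed by interleaving the bits of the coordinates, so that points close in space tend to be close in the 1-D ordering. A minimal 2-D sketch:

```python
def encode(x, y):
    # Interleave bits: bit i of x -> bit 2i, bit i of y -> bit 2i+1.
    z = 0
    for i in range(32):
        z |= (x >> i & 1) << (2 * i) | (y >> i & 1) << (2 * i + 1)
    return z

def decode(z):
    # Inverse: de-interleave even bits into x, odd bits into y.
    x = y = 0
    for i in range(32):
        x |= (z >> 2 * i & 1) << i
        y |= (z >> (2 * i + 1) & 1) << i
    return x, y

print(encode(2, 3))  # 14 (binary 1110: bits y1 x1 y0 x0 = 1 1 1 0)
print(decode(14))    # (2, 3)
```

The same bit-interleaving scheme extends to any number of dimensions by cycling through one bit from each coordinate in turn.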