Shortening edge chunks #233
Labels: enhancement (New features or improvements)
Any thoughts on this @alimanfoo?
I'm generally in favour of converging on common standards so very happy to
have discussion on this.
FWIW I think it would be worth understanding the pros and cons of each
approach a bit better, with a view to ultimately fixing on just one
approach. Although it's attractive in the short term to add flexibility to
support both approaches for compatibility with N5, it would add complexity
for anyone else wanting to do a new zarr spec implementation from scratch,
as well as add complexity in the zarr code base.
One of the pros of the current zarr "uniform chunks" approach is that it
simplifies the implementation of resize operations. This applies both when
increasing and when decreasing the size of a dimension. When increasing the
size of a dimension, only the metadata has to be updated, no other action
needs to be taken. When decreasing the size of a dimension, the metadata
has to be updated, and any chunks no longer overlapping the new shape are
deleted.
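A minimal sketch of the shrink case under uniform chunks, assuming one stored chunk per grid index (function and variable names here are illustrative, not zarr API):

```python
import itertools
import math

def chunks_to_delete(old_shape, new_shape, chunk_shape):
    """Chunk grid indices that no longer overlap the array after shrinking.

    Under uniform chunks, shrinking a dimension means updating metadata and
    deleting these chunks; growing a dimension touches no chunks at all.
    """
    old_grid = [math.ceil(s / c) for s, c in zip(old_shape, chunk_shape)]
    new_grid = [math.ceil(s / c) for s, c in zip(new_shape, chunk_shape)]
    return [idx
            for idx in itertools.product(*(range(n) for n in old_grid))
            # a chunk survives only if every coordinate fits the new grid
            if any(i >= n for i, n in zip(idx, new_grid))]
```

For example, shrinking a `(5,)` array with `(2,)` chunks to `(3,)` deletes only chunk `(2,)`; growing it deletes nothing.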
Under the "shortened edge chunks" approach, I think that resize operations
are potentially more costly. This is because when increasing the size of a
dimension, any chunks that were previously short edge chunks but are now
fully within the array need to be modified to the full chunk shape. When
decreasing the size of a dimension, any chunk previously within the array
but now falling at an edge needs to be modified to the edge chunk shape.
The exact implementation details I guess would depend on whether chunks
have a shape header or not, but the general point I think is that resize
operations are not so straightforward.
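To make that cost concrete, here is a 1-D sketch (hypothetical helpers, not zarr API) of which surviving chunks would need their stored bytes rewritten, rather than just a metadata update, under a shortened-edge-chunks scheme:

```python
import math

def stored_len(dim_len, chunk_len, i):
    # stored extent of chunk i when edge chunks are shortened
    return min(chunk_len, dim_len - i * chunk_len)

def chunks_to_rewrite(old_len, new_len, chunk_len):
    """Indices of surviving chunks whose stored size changes on resize."""
    shared = min(math.ceil(old_len / chunk_len),
                 math.ceil(new_len / chunk_len))
    return [i for i in range(shared)
            if stored_len(old_len, chunk_len, i)
               != stored_len(new_len, chunk_len, i)]
```

Growing a `(5,)` dimension with chunk length 2 to `(7,)` forces chunk 2 (previously the short edge chunk) to be rewritten at full size; under uniform chunks this list is always empty.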
The obvious con to the uniform chunks approach is that there is some extra
storage overhead associated with the chunks that overlap the edges of the
array, and also some extra compute overhead associated with
compressing/decompressing those chunks when data from the array are
written/read.
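That storage overhead is easy to bound. A rough sketch, counting uncompressed elements (the compressed overhead is typically much smaller, since constant padding compresses well):

```python
import math

def padding_fraction(shape, chunk_shape):
    """Fraction of stored elements that are padding under uniform chunks."""
    grid = [math.ceil(s / c) for s, c in zip(shape, chunk_shape)]
    stored = math.prod(grid) * math.prod(chunk_shape)
    return (stored - math.prod(shape)) / stored
```

For a `(5,)` array with `(2,)` chunks, 1 of the 6 stored elements is padding; when chunk shapes divide the array shape evenly, the overhead is zero.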
If we relaxed the uniform chunking requirement more generally (https://github.com/zarr-developers/zarr/issues/245), it could solve this issue and still allow for fast appends amongst other nice benefits. Admittedly there would be some overhead involved in tracking non-trivial chunk sizes. So it would need some thought/evaluation.
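A sketch of the bookkeeping that relaxed (explicit, non-uniform) chunking would require: mapping an element coordinate to its chunk becomes a boundary search instead of a single integer division. Names here are illustrative, not a proposed API.

```python
import bisect
from itertools import accumulate

def chunk_index(coord, chunk_sizes):
    """Map an element coordinate to its chunk grid index, given explicit
    per-dimension chunk sizes, e.g. chunk_sizes = [[2, 3, 2]] for a (7,)
    array split into chunks of lengths 2, 3 and 2."""
    # cumulative end offsets per dimension, then binary search
    return tuple(bisect.bisect_right(list(accumulate(sizes)), x)
                 for x, sizes in zip(coord, chunk_sizes))
```

A fast append then only adds or extends the final entry of each dimension's size list, leaving interior chunks untouched.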
In the process of investigating an analogous format (https://github.com/zarr-developers/zarr/issues/231), one of the points raised was that another implementation shortens its edge chunks. As a trivial example, if an array has a shape of `(5,)` and a chunk size of `(2,)`, the last chunk will be smaller than the other ones. Currently we write out this chunk to the same-size file (even though we effectively ignore the extra bytes). However, we could opt to write this chunk out as a smaller file, since the extra bytes would be unneeded. If this were implemented in a consistent manner, it should be possible to compute the truncated shape of these edge chunks using only the array shape, the number of chunks, and the current chunk index. It would also be easy to check whether we are handling a truncated chunk by comparing its size to that of a typical chunk before reshaping, thus handling files that don't use truncation in a compatible way.
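A sketch of both halves of that idea (hypothetical helper names; the decoder accepts either a full padded chunk or a truncated one, as suggested above):

```python
import math
import numpy as np

def edge_chunk_shape(shape, chunk_shape, chunk_index):
    """True shape of the chunk at chunk_index, truncated at array edges,
    computed from the array shape and nominal chunk shape alone."""
    return tuple(min(c, s - i * c)
                 for s, c, i in zip(shape, chunk_shape, chunk_index))

def decode_chunk(buf, shape, chunk_shape, chunk_index, dtype):
    """Decode raw chunk bytes, accepting padded or truncated storage."""
    expected = edge_chunk_shape(shape, chunk_shape, chunk_index)
    arr = np.frombuffer(buf, dtype=dtype)
    if arr.size == math.prod(chunk_shape):
        # full-size (padded) chunk: reshape, then trim to the edge extent
        return arr.reshape(chunk_shape)[tuple(slice(n) for n in expected)]
    # truncated edge chunk: the bytes already match the computed shape
    return arr.reshape(expected)
```

For the `(5,)` array with `(2,)` chunks, chunk `(2,)` decodes to shape `(1,)` whether it was stored padded (two elements) or truncated (one element).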
Though if we would like to be more explicit, we could also include a Zarr array option (default disabled?) for this behavior and write it out to `.zarray`.