Shortening edge chunks #233
Labels: enhancement (New features or improvements)
Any thoughts on this @alimanfoo?
I'm generally in favour of converging on common standards so very happy to
have discussion on this.
FWIW I think it would be worth understanding the pros and cons of each
approach a bit better, with a view to ultimately fixing on just one
approach. Although it's attractive in the short term to add flexibility to
support both approaches for compatibility with N5, it would add complexity
for anyone else wanting to do a new zarr spec implementation from scratch,
as well as add complexity in the zarr code base.
One of the pros of the current zarr "uniform chunks" approach is that it
simplifies the implementation of resize operations. This applies both when
increasing and when decreasing the size of a dimension. When increasing the
size of a dimension, only the metadata has to be updated, no other action
needs to be taken. When decreasing the size of a dimension, the metadata
has to be updated, and any chunks no longer overlapping the new shape are
deleted.
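A minimal sketch of the shrink case under uniform chunks, assuming one stored chunk per grid index (function and variable names here are illustrative, not zarr API):

```python
import itertools
import math

def chunks_to_delete(old_shape, new_shape, chunk_shape):
    """Chunk grid indices that no longer overlap the array after shrinking.

    Under uniform chunks, shrinking a dimension means updating metadata and
    deleting these chunks; growing a dimension touches no chunks at all.
    """
    old_grid = [math.ceil(s / c) for s, c in zip(old_shape, chunk_shape)]
    new_grid = [math.ceil(s / c) for s, c in zip(new_shape, chunk_shape)]
    return [idx
            for idx in itertools.product(*(range(n) for n in old_grid))
            # a chunk survives only if every coordinate fits the new grid
            if any(i >= n for i, n in zip(idx, new_grid))]
```

For example, shrinking a `(5,)` array with `(2,)` chunks to `(3,)` deletes only chunk `(2,)`; growing it deletes nothing.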
Under the "shortened edge chunks" approach, I think that resize operations
are potentially more costly. This is because when increasing the size of a
dimension, any chunks that were previously short edge chunks but are now
fully within the array need to be modified to the full chunk shape. When
decreasing the size of a dimension, any chunk previously within the array
but now falling at an edge needs to be modified to the edge chunk shape.
The exact implementation details I guess would depend on whether chunks
have a shape header or not, but the general point I think is that resize
operations are not so straightforward.
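To make that cost concrete, here is a 1-D sketch (hypothetical helpers, not zarr API) of which surviving chunks would need their stored bytes rewritten, rather than just a metadata update, under a shortened-edge-chunks scheme:

```python
import math

def stored_len(dim_len, chunk_len, i):
    # stored extent of chunk i when edge chunks are shortened
    return min(chunk_len, dim_len - i * chunk_len)

def chunks_to_rewrite(old_len, new_len, chunk_len):
    """Indices of surviving chunks whose stored size changes on resize."""
    shared = min(math.ceil(old_len / chunk_len),
                 math.ceil(new_len / chunk_len))
    return [i for i in range(shared)
            if stored_len(old_len, chunk_len, i)
               != stored_len(new_len, chunk_len, i)]
```

Growing a `(5,)` dimension with chunk length 2 to `(7,)` forces chunk 2 (previously the short edge chunk) to be rewritten at full size; under uniform chunks this list is always empty.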
The obvious con to the uniform chunks approach is that there is some extra
storage overhead associated with the chunks that overlap the edges of the
array, and also some extra compute overhead associated with
compressing/decompressing those chunks when data from the array are
written/read.
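That storage overhead is easy to bound. A rough sketch, counting uncompressed elements (the compressed overhead is typically much smaller, since constant padding compresses well):

```python
import math

def padding_fraction(shape, chunk_shape):
    """Fraction of stored elements that are padding under uniform chunks."""
    grid = [math.ceil(s / c) for s, c in zip(shape, chunk_shape)]
    stored = math.prod(grid) * math.prod(chunk_shape)
    return (stored - math.prod(shape)) / stored
```

For a `(5,)` array with `(2,)` chunks, 1 of the 6 stored elements is padding; when chunk shapes divide the array shape evenly, the overhead is zero.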
If we relaxed the uniform chunking requirement more generally (https://github.com/zarr-developers/zarr/issues/245), it could solve this issue and still allow for fast appends amongst other nice benefits. Admittedly there would be some overhead involved in tracking non-trivial chunk sizes. So it would need some thought/evaluation.
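A sketch of the bookkeeping that relaxed (explicit, non-uniform) chunking would require: mapping an element coordinate to its chunk becomes a boundary search instead of a single integer division. Names here are illustrative, not a proposed API.

```python
import bisect
from itertools import accumulate

def chunk_index(coord, chunk_sizes):
    """Map an element coordinate to its chunk grid index, given explicit
    per-dimension chunk sizes, e.g. chunk_sizes = [[2, 3, 2]] for a (7,)
    array split into chunks of lengths 2, 3 and 2."""
    # cumulative end offsets per dimension, then binary search
    return tuple(bisect.bisect_right(list(accumulate(sizes)), x)
                 for x, sizes in zip(coord, chunk_sizes))
```

A fast append then only adds or extends the final entry of each dimension's size list, leaving interior chunks untouched.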
In the process of investigating an analogous format (https://github.com/zarr-developers/zarr/issues/231), one of the points raised was that another implementation shortens its edge chunks. As a trivial example, if an array has a shape of `(5,)` and a chunk size of `(2,)`, the last chunk will be smaller than the other ones. Currently we write out this chunk to the same-size file (even though we effectively ignore the extra bytes). However, we could opt to write this chunk out as a smaller file, since the extra bytes would be unneeded. If this were implemented in a consistent manner, it should be possible to compute the truncated shape of these edge chunks using only the array shape, the number of chunks, and the current chunk index. It would also be easy to check whether we are handling a truncated chunk by comparing its size to that of a typical chunk before reshaping, thus handling files that don't use truncation in a compatible way.
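A sketch of both halves of that idea (hypothetical helper names; the decoder accepts either a full padded chunk or a truncated one, as suggested above):

```python
import math
import numpy as np

def edge_chunk_shape(shape, chunk_shape, chunk_index):
    """True shape of the chunk at chunk_index, truncated at array edges,
    computed from the array shape and nominal chunk shape alone."""
    return tuple(min(c, s - i * c)
                 for s, c, i in zip(shape, chunk_shape, chunk_index))

def decode_chunk(buf, shape, chunk_shape, chunk_index, dtype):
    """Decode raw chunk bytes, accepting padded or truncated storage."""
    expected = edge_chunk_shape(shape, chunk_shape, chunk_index)
    arr = np.frombuffer(buf, dtype=dtype)
    if arr.size == math.prod(chunk_shape):
        # full-size (padded) chunk: reshape, then trim to the edge extent
        return arr.reshape(chunk_shape)[tuple(slice(n) for n in expected)]
    # truncated edge chunk: the bytes already match the computed shape
    return arr.reshape(expected)
```

For the `(5,)` array with `(2,)` chunks, chunk `(2,)` decodes to shape `(1,)` whether it was stored padded (two elements) or truncated (one element).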
Though if we would like to be more explicit, we could also include a Zarr array option (default disabled?) for this behavior and write it out to `.zarray`.