[draft] zarr object models #46

Open: wants to merge 17 commits into base: main
spec: style
d-v-b committed Oct 29, 2023

commit 659a09048effc268c44d9b9535ef607fae6bfd3d
16 changes: 6 additions & 10 deletions draft/ZEP0006.md
@@ -21,28 +21,24 @@ Created: 2023-07-20

## Abstract

This ZEP defines Zarr Object Models, or ZOMs. A ZOM is a language-independent interface that describes an abstract Zarr hierarchy as a tree of nodes. ZOMs are parametrized by a particular Zarr version, so there is a ZOM for Zarr Version 2, which differs from the ZOM for Zarr V3.
This ZEP defines Zarr Object Models, or ZOMs. A ZOM is a language-independent interface that describes an abstract Zarr hierarchy as a tree of nodes. The ZOM forms the basis for a declarative, type-safe approach to managing Zarr hierarchies.

The basic ZOM defines two types of nodes: arrays and groups. Both types of nodes have an `attributes` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. In the base ZOM, groups have a property called `members`, which is an object with string keys and values that are either arrays or groups. This definition is designed to be abstract enough that it can be implemented by a range of programming languages, and expressed in a wide range of interchange formats.
A ZOM defines two types of nodes: arrays and groups, which schematize Zarr arrays and Zarr groups, respectively. ZOM arrays and groups both have an `attributes` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. ZOM groups have a property called `members`, which is an object with string keys and values that are either arrays or groups; thus, the ZOM can represent an arbitrarily structured tree of arrays and groups.

The ZOM forms the basis for a declarative, type-safe approach to managing Zarr hierarchies.
These definitions are designed to be abstract enough to be implemented by a range of programming languages, and expressed in a wide range of interchange formats. Applications can use the ZOM for a particular Zarr version to implement declarative APIs for accessing structured Zarr hierarchies.

## Motivation and Scope

The Zarr specifications define models of arrays and groups; the Zarr specifications do not define models for hierarchies of arrays and groups. This is unfortunate, because many users of Zarr work primarily with structured hierarchies.
The Zarr specifications define models for arrays and groups, but they do not define models for *hierarchies* of arrays and groups. This is unfortunate, because many applications using Zarr operate on the level of structured hierarchies rather than individual groups or arrays.

For example, the python library `xarray` defines data structures that can be persisted to Zarr as a Zarr group containing one or more Zarr arrays with specific metadata. The full specification of this format can be found [here](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html#zarr-encoding). For `xarray` (or any other application that works with structured hieararchies) to save data to Zarr, it must first create a compliant Zarr hierarchy. To read data from Zarr, `xarray` must first check if the potential source of data is an `xarray`-compliant Zarr hierarchy.
For example, the Python library `xarray` defines data structures that can be persisted to Zarr as a Zarr group containing one or more Zarr arrays with specific metadata (a key called `_ARRAY_DIMENSIONS`, which must have a list of strings as its value). The full specification of this format can be found [here](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html#zarr-encoding). For `xarray` (or any other application that works with structured hierarchies) to save data to Zarr, it must first create a compliant Zarr hierarchy. To read data from Zarr, `xarray` must first check if the potential source of data is an `xarray`-compliant Zarr hierarchy.

Creating and validating Zarr hierarchies can be done procedurally, i.e. as a sequence of Zarr array and group access routines, or declaratively, as a hierarchy definition followed by a procedure that implements the definition. The latter is preferable, but it first requires a machine-readable data model for a Zarr hierarchy.

Such models should be sufficient to express the structure of a Zarr hierarchy, and these models must be usable by Zarr implementations as a basis for declarative APIs for creating and validating Zarr hierarchies. That is the central goal of this proposal.
Creating and validating Zarr hierarchies can be done procedurally, i.e., as a sequence of Zarr array and group access routines, or declaratively, as a hierarchy definition followed by a procedure that creates a Zarr hierarchy consistent with that definition. In many cases the declarative approach is preferable, but it requires a machine-readable data model for a Zarr hierarchy. Defining such a model is the central goal of this proposal.
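
To make the declarative approach concrete, here is a minimal sketch in Python. It is not part of the proposal: the plain-dict layout, the `kind` key, and the function name `is_xarray_compliant` are assumptions made for illustration, and the check is a simplified reading of the `xarray` rule described above (a group holding at least one array, where each array carries a list of strings under `_ARRAY_DIMENSIONS`).

```python
from __future__ import annotations

from typing import Any

# Hypothetical declarative definition of an xarray-style hierarchy.
# The "kind", "attributes", and "members" keys are assumptions for this sketch.
definition: dict[str, Any] = {
    "kind": "group",
    "attributes": {},
    "members": {
        "temperature": {
            "kind": "array",
            "attributes": {"_ARRAY_DIMENSIONS": ["time", "lat", "lon"]},
        },
    },
}


def is_xarray_compliant(node: dict[str, Any]) -> bool:
    """Return True if `node` is a group with at least one array member and
    every array member has a list of strings under `_ARRAY_DIMENSIONS`."""
    if node.get("kind") != "group":
        return False
    arrays = [
        member
        for member in node.get("members", {}).values()
        if member.get("kind") == "array"
    ]
    if not arrays:
        return False
    for array in arrays:
        dims = array.get("attributes", {}).get("_ARRAY_DIMENSIONS")
        if not isinstance(dims, list) or not all(isinstance(d, str) for d in dims):
            return False
    return True


print(is_xarray_compliant(definition))
```

The definition is ordinary data, so the same object can serve two purposes: as the input to a procedure that creates a hierarchy, and as the reference that an existing hierarchy is checked against.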

## Definition of hierarchy structure

This document distinguishes the *structure* of a Zarr hierarchy from the data stored in the hierarchy. The structure of a Zarr hierarchy is the layout of the tree of arrays and groups, and the metadata of those arrays and groups. This definition omits the data stored in the arrays, and the particular storage backend used to store data and metadata. By these definitions, two distinct Zarr hierarchies can have the same structure even if their arrays contain different values, and/or the hierarchies are stored using different storage backends.

Because the structure of a Zarr hierarchy is decoupled, by definition, from the data stored in the hierarchy, it should be possible to represent the structure of a Zarr hierarchy with a compact data structure or interface. Such a data structure or interface would facilitate operations like evaluating whether two Zarr hierarchies are identically structured, evaluating whether a given Zarr hierarchy has a specific structure, or creating a Zarr hierarchy with a desired structure. This document formalizes the Zarr Object Model (ZOM), an abstract model of the structure of a Zarr hierarchy. The ZOM serves as a foundation for tools that create and manipulate Zarr hierarchies at a structural level.
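
As an illustration of how compact such a data structure can be, the following Python sketch models the two node types with dataclasses. The class names `ArrayNode` and `GroupNode` and the open-ended `metadata` field are assumptions for this example, not names defined by the proposal. The `metadata` field stands in for the version-specific array properties that the base ZOM leaves unspecified. Because dataclasses compare field by field, testing whether two hierarchies are identically structured reduces to `==`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class ArrayNode:
    # String keys, arbitrary values.
    attributes: dict[str, Any] = field(default_factory=dict)
    # Placeholder for version-specific array properties (an assumption of
    # this sketch; the base ZOM does not define them).
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class GroupNode:
    attributes: dict[str, Any] = field(default_factory=dict)
    # String keys; values are arrays or groups, so trees nest arbitrarily.
    members: dict[str, Union[ArrayNode, GroupNode]] = field(default_factory=dict)


def make_tree(shape: list[int]) -> GroupNode:
    """Build a small example hierarchy: one group holding one array."""
    return GroupNode(
        attributes={"title": "demo"},
        members={"x": ArrayNode(metadata={"shape": shape, "dtype": "<f8"})},
    )


a = make_tree([10])
b = make_tree([10])  # a distinct object with the same structure
c = make_tree([20])  # differs in array metadata

print(a == b)  # structural comparison via dataclass equality
print(a == c)
```

No array values or storage backend appear anywhere in these objects, which is what makes the comparison purely structural.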

## Specification of the base Zarr Object Model

We begin with a definition of a "base" Zarr Object Model. On its own, the base ZOM is not useful for working with actual Zarr hierarchies, because it contains a reference to an unspecified Zarr version. By supplying definitions from a particular Zarr version, we can specialize the base ZOM and produce an object that can be used for doing actual work.