[draft] zarr object models #46

Open
wants to merge 17 commits into
base: main

@d-v-b d-v-b commented Sep 7, 2023

this ZEP defines a representation of a zarr hierarchy, called a Zarr Object Model (ZOM). The purpose of this ZEP is to standardize an abstract representation of zarr hierarchies to support declarative Zarr APIs, and to give type systems access to the structure of zarr hierarchies. A side effect of this ZEP is a standardization of consolidated metadata, which can be defined as a flattening transformation applied to a ZOM representation of a zarr hierarchy.

I didn't use the template structure for this ZEP because it felt limiting, but if that's a big problem I can bring more of that structure back in.

In terms of what needs to be done:

  • The prose needs work. I want to carefully distinguish the "base ZOM", which is a kind of meta-model, from the ZOM for particular zarr versions (2 and 3), and from ZOM representations of actual zarr hierarchies. A delicate touch is also required to distinguish zarr arrays (which contain data) from the ZOM representation of a zarr array, which contains no data.
  • I need to complete the JSON schemas for zarr V3. help is appreciated!

cc @jhamman

jbms commented Sep 7, 2023

For zarr v3, the user-defined attributes are stored within the main metadata file under the attributes member name. Renaming that to attrs in the representation proposed here may be confusing.

Author

d-v-b commented Sep 7, 2023

@jbms That's a great point; I'm happy to expand attrs to attributes. On the topic of naming, are there any objections to members?

Member

normanrz commented Sep 7, 2023

I am not sure I got the motivation for this. I understand the motivation of consolidated metadata. Maybe this ZEP should morph into that? Then, it should be an extension to the zarr.json for persistence imo. Also, I would focus the ZEP on v3.

Contributor

ivirshup commented Sep 7, 2023

Agree with @normanrz, a section for motivation and usage would be helpful.

Author

d-v-b commented Sep 7, 2023

I totally agree that the motivation needs to be expanded. At the moment, all we say is

> Such a data structure or interface would facilitate operations like evaluating whether two Zarr hierarchies are identically structured, evaluating whether a given Zarr hierarchy has a specific structure, or creating a Zarr hierarchy with a desired structure.

@normanrz do you agree that these are valid motivations, and worthy of expanding on, or should we provide additional motivations?

> Also, I would focus the ZEP on v3.

The basic ZOM applies to v3 and v2 equally, and I think it's important to emphasize this, because the ZOM representation would be useful for converting from v2 hierarchies to v3 (and from v3 to v4, if v4 ever exists). Would it help if I made this point clearly in the ZEP?

keller-mark commented Sep 8, 2023

To help with the motivation, I think this point

> give type systems access to the structure of zarr hierarchies

could be emphasized further, perhaps with an example of how it could eventually work.
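One way this could eventually look, as a minimal sketch only: the class names `ArrayNode` and `GroupNode` and the reduced field set below are hypothetical and are not the schema the ZEP defines. The idea is that once a hierarchy has a declared shape, a static type checker can reason about code that walks it.

```python
from typing import Literal, TypedDict, Union


# Hypothetical, heavily reduced node types; a real ZOM for v2 or v3 would
# carry the full array/group metadata for that Zarr version.
class ArrayNode(TypedDict):
    node_type: Literal["array"]
    attributes: dict[str, object]
    shape: tuple[int, ...]
    data_type: str


class GroupNode(TypedDict):
    node_type: Literal["group"]
    attributes: dict[str, object]
    members: dict[str, Union[ArrayNode, "GroupNode"]]


def array_names(group: GroupNode) -> list[str]:
    """Return the names of the arrays directly contained in a group."""
    return sorted(
        name
        for name, node in group["members"].items()
        if node["node_type"] == "array"
    )


# A type checker can verify this literal against GroupNode before any
# Zarr store is touched.
image: GroupNode = {
    "node_type": "group",
    "attributes": {"description": "example"},
    "members": {
        "raw": {
            "node_type": "array",
            "attributes": {},
            "shape": (64, 64),
            "data_type": "uint8",
        },
    },
}

print(array_names(image))
```

A misspelled key or a member with an unknown `node_type` would be flagged by a checker such as mypy or pyright rather than failing at runtime.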

It seems like this is a kind of "extended consolidated metadata" and could be framed more in that way. Beyond the "base consolidated metadata", my understanding is that this would also include the contents of the .zattrs/.zarray/.zgroup files (and the v3 equivalent) which would be used to implement the support for typing / validation / comparison.

> the ZOM representation would be useful for converting from v2 hierarchies to v3 (and from v3 to v4, if v4 ever exists)

Perhaps it would be simpler to use the metadata file names directly in the flattened representation rather than abstracting / trying to unify across versions. Using $refs in JSON schemas could enable supporting both v2 and v3 despite possibly using different/split metadata file names.

It could be useful to define the name of an optional property that could be used to specify the URL of a JSON schema to use for validation of ZOM-structured stores such as OME-NGFF or AnnData. For example, Vega-Lite uses a $schema property for this.

Perhaps outside the scope of this proposal, but related, it might be useful for Zarr to make a distinction between ZOM-structured stores vs. non-ZOM-structured stores in the name of the root file/directory (as a convention, not a requirement). For example, as a human, if I look in my file explorer and see something.zarr, it would be nice at first glance to know whether it contains a particular ZOM-structured store or not (without context or looking at the contents). For example, a convention like something.anndata.zarr could be used for this. This would be analogous to using .h5ad rather than .h5 to store AnnData-structured HDF5. I have started doing this in personal work with Zarr stores, but it does not seem to be a convention for Zarr stores in the wild.

Member

@jhamman jhamman left a comment

Thanks for getting this started @d-v-b! I think this will really aid in the maintenance and development of Zarr implementations, old and new.

## Implementation

- pydantic zarr
- ?
Member

  • dataclass zarr

(I have an unpublished version that I can share soon)

Member

zarrita also has attrs classes that define the metadata (minus the new members properties) https://github.com/scalableminds/zarrita/blob/async/zarrita/metadata.py#L259

- - The origins of consolidated metadata:
* <https://github.com/pangeo-data/pangeo/issues/309>
* <https://github.com/zarr-developers/zarr-python/pull/268>

Member

it may also be worth summarizing some of the intended benefits to existing/internal applications. For example, the utilization of a standard data object internally within zarr-python may help improve workflow for creating large hierarchies by allowing users to create the ZOM metadata before passing it to a zarr.creation method.

draft/ZEP0006.md Outdated
And Zarr V3:

```json
# insert schema for v3 here
```
Member

let me know if you could use some help generating this.

Author

some help here would be great, thank you!

Author

d-v-b commented Sep 20, 2023

Just a clarification about the timeline: I'm working on a transatlantic move, so I can't promise a huge investment in this until mid-October. Thanks for your patience!

Author

d-v-b commented Sep 21, 2023

as per feedback, the field used for user metadata has been renamed from attrs to attributes. I also added a motivating example (comparing if two zarr hierarchies have the same structure), and I moved the zarr v2 stuff behind the zarr v3 examples (although those need fleshing out)

Author

d-v-b commented Sep 21, 2023

see also: https://github.com/google/tensorstore/blob/master/tensorstore/driver/zarr/schema.yml

draft/ZEP0006.md Outdated
Comment on lines 57 to 59
- `members`: a key-value data structure where the keys are strings and the values are arrays or groups. This property allows a ZOM group to represent the hierarchical relationship between Zarr groups and the Zarr arrays or Zarr groups contained within them.

If future versions of Zarr use a property called `members` for some element of Zarr group metadata, then there would be a naming collision between the `members` property of a Zarr group and the `members` property of a ZOM group. In this case, the ZOM group would rename the Zarr group's `members` property to `_members`, and any additional name collisions would be resolved by prepending additional underscore ("_") characters. E.g., in the unlikely case that `members` and `_members` are *both* listed in Zarr group metadata, then the schema group representation would map the `members` property of the Zarr group to a property called `__members`.
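The renaming rule above can be sketched in a few lines. This is an illustration under the assumption that group metadata arrives as a plain dict; the function name `resolve_members_collision` is hypothetical and not part of any Zarr library.

```python
from typing import Any


def resolve_members_collision(group_metadata: dict[str, Any]) -> dict[str, Any]:
    """Rename a metadata key called `members` so it cannot clash with the
    ZOM group's own `members` property.

    The key is renamed to `_members`; if that name is already taken in the
    metadata, underscores are prepended until a free name is found.
    """
    out = dict(group_metadata)  # do not mutate the caller's dict
    if "members" in out:
        value = out.pop("members")
        new_key = "_members"
        while new_key in out:
            new_key = "_" + new_key
        out[new_key] = value
    return out


simple = resolve_members_collision({"zarr_format": 3, "members": "from-zarr"})
double = resolve_members_collision({"members": "from-zarr", "_members": "also-from-zarr"})
print(sorted(simple))
print(sorted(double))
```

After this step the ZOM group is free to add its own `members` property without shadowing anything from the Zarr group metadata.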
Member

I want to reiterate the idea of broadening this ZEP to include persisted consolidated metadata. Basically, why not allow storing the members property in a zarr.json?
We would need to define the semantics of consolidated metadata (e.g. do member nodes still need JSON files, does the members hierarchy need to be exhaustive). I would be happy to contribute that if there is interest in moving this ZEP in that direction.

Author

There's definitely interest in that, my apologies for not making this clearer earlier. I'm not a user of consolidated metadata so I don't have a lot of experience with it, but I think for this ZEP to encompass consolidated metadata functionality as it exists today (i.e., a flat list of string keys pointing to JSON objects) we would need to define a tree flattening operation, and possibly make members nullable (because in a flattened representation a ZOM group shouldn't hold a reference to its contents). Alternatively, if the flattened representation of the hierarchy used in consolidated metadata isn't essential to its function, we could simply put a ZOM in JSON and leave it to clients to do the flattening. I don't have strong feelings either way! You should absolutely feel free to contribute something here.
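A tree flattening operation of the kind described here might look like the sketch below. It is only an illustration: the path format (`/` for the root, `/`-joined names below it), the function name `flatten`, and the choice to set `members` to `None` on flattened groups are all assumptions, not something the ZEP or zarr-python specifies.

```python
from typing import Any

Node = dict[str, Any]


def flatten(node: Node, path: str = "") -> dict[str, Node]:
    """Turn a nested ZOM-like dict into a flat {path: node} mapping."""
    entry = {k: v for k, v in node.items() if k != "members"}
    if node.get("node_type") == "group":
        # A flattened group no longer holds references to its contents.
        entry["members"] = None
    flat = {path or "/": entry}
    for name, child in (node.get("members") or {}).items():
        flat.update(flatten(child, f"{path}/{name}"))
    return flat


tree: Node = {
    "node_type": "group",
    "attributes": {},
    "members": {
        "some_group": {
            "node_type": "group",
            "attributes": {},
            "members": {
                "some_array": {"node_type": "array", "attributes": {}},
            },
        },
    },
}

flat = flatten(tree)
print(sorted(flat))
```

The inverse operation (rebuilding the tree from path keys) is left out; it would need a rule for groups that appear only implicitly as path prefixes.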

Contributor

@rabernat rabernat Sep 26, 2023

I think we should keep consolidated metadata separate from this ZOM concept. Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.

We intend to propose a ZEP which uses a STAC style link-relation element to allow a client to traverse an entire hierarchy without being able to list a store. This is similar to consolidated metadata but more scalable because it does not require all the metadata to be in a single json file. For context, we have hierarchies with 100_000 nodes. Would be happy to collaborate and iterate with you on that.

Author

> Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.

In this case, maybe it would be good to get a statement like this in this ZEP to clarify the relationship between the abstract ZOM and consolidated metadata.

Contributor

I am now wondering if this members property needs to be tweaked to accommodate the case of very large hierarchies. As is, it seems like the entire hierarchy may have to be explicitly populated all at once.

In python terms, I'd like to allow members to be either a set of child objects or a generator that yields such objects lazily. Is this making it too complicated?
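To make the lazy variant concrete, here is a rough sketch. Everything in it is hypothetical: `LISTING` stands in for a real store, and `GroupSketch`, `ArraySketch`, and `open_node` are illustrative names rather than an existing API. Members are exposed through a zero-argument callable that returns a fresh iterator, so nothing below a group is opened until someone iterates.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional, Union

# Hypothetical in-memory "store": a group path maps to its child paths,
# an array path maps to None. A real store would issue list/get requests.
LISTING: dict[str, Optional[list[str]]] = {
    "": ["a", "b"],
    "a": ["a/x"],
    "b": [],
    "a/x": None,
}
loaded: list[str] = []  # records which nodes were actually opened


@dataclass
class ArraySketch:
    path: str


@dataclass
class GroupSketch:
    path: str
    # Callable returning a new iterator each time, so members can be
    # walked more than once.
    list_members: Callable[[], Iterator[tuple[str, Union["GroupSketch", ArraySketch]]]]

    def members(self) -> Iterator[tuple[str, Union["GroupSketch", ArraySketch]]]:
        return self.list_members()


def open_node(path: str) -> Union[GroupSketch, ArraySketch]:
    loaded.append(path)
    children = LISTING[path]
    if children is None:
        return ArraySketch(path)

    def iter_children() -> Iterator[tuple[str, Union[GroupSketch, ArraySketch]]]:
        for child in children:
            yield child.rsplit("/", 1)[-1], open_node(child)

    return GroupSketch(path, iter_children)


root = open_node("")
names = [name for name, _ in root.members()]
print(names, loaded)
```

Iterating the root opens its direct children only; `a/x` stays untouched until the members of `a` are requested. The cost is that such an object is no longer a plain serializable value, which may matter for a ZOM that is meant to round-trip through JSON.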

Member

> (i.e., a flat list of string keys pointing to JSON objects)

It wouldn't have to have this structure. It could be a nested json structure like this:

{
  "zarr_format": 3,
  "node_type": "group",
  "attributes": {},
  "members": {
    "some_group": {
      "zarr_format": 3,
      "node_type": "group",
      "attributes": {},
      "members": {
        "some_array": {
          "zarr_format": 3,
          "node_type": "array",
          ...
        }
      }
    }
  }
}

> I think we should keep consolidated metadata separate from this ZOM concept. Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.

I think there is strong overlap between the ZOM and consolidated metadata. This ZEP introduces a JSON schema that describes the existing metadata of groups and arrays with a new addition of the members property. I think it would be very confusing, if consolidated metadata would end up with different terminology than the ZOM.

> We intend to propose a ZEP which uses a STAC style link-relation element to allow a client to traverse an entire hierarchy without being able to list a store. This is similar to consolidated metadata but more scalable because it does not require all the metadata to be in a single json file. For context, we have hierarchies with 100_000 nodes. Would be happy to collaborate and iterate with you on that.

That sounds great. As I said, we can discuss the semantics and features of the consolidated metadata. That could include linking. I don't think we should limit ourselves by what the implementation in zarr-python currently has.

Author

@d-v-b d-v-b Sep 26, 2023

> I am now wondering if this members property needs to be tweaked to accommodate the case of very large hierarchies. As is, it seems like the entire hierarchy may have to be explicitly populated all at once.

This concern is real! See also zarr-developers/pydantic-zarr#2 . The proposal there was to make members nullable, where None would encode "The members have not been parsed", and to give a tree parser the option to limit the depth of traversal, which would result in "truncated" GroupSpec instances being valid. But maybe the python generator approach obviates the need to express this with nullability? I'm open to suggestions here

Contributor

It would be nice to be able to distinguish between "there are definitely no members" vs. "there may be members, but they have to be discovered explicitly"

Author

in pydantic-zarr members is now nullable, and it's been extremely useful. That being said, this can be viewed as a well-defined transformation of the base type, so it's not clear if the ZEP actually needs to address it.

@sanketverma1704
Member

Hi @d-v-b. I've fixed the RTD build issue in #51.

The current PR can be viewed here: https://zeps--46.org.readthedocs.build/en/46/draft/ZEP0006.html

@d-v-b
Copy link
Author

d-v-b commented Oct 29, 2023

I threw in a JSON schema for ZOM[v3], a ZOM[v3] example hierarchy serialized to JSON, and some python / typescript static typing examples.

I am wondering if it would make more sense to push these hierarchy definitions into the v2 and v3 specs as addenda? This ZEP could exist for posterity, but this would be an easier way to formally associate a particular ZOM with a specific Zarr version. Since the change would be purely additive, it seems safe to do retrospectively (in the case of Zarr v2).

Thoughts?

@bogovicj

What should the value of attributes be when there are no user-specified attributes? Say, for a zarr v2 group with no .zattrs.

Since the attributes field is required, null and {} both seem reasonable to me. The thing about {} is that it won't be possible to distinguish between non-existent attributes, and attributes containing an empty object. So perhaps null is preferable.

If we're willing not to require the attributes field, then a node without the field could work too.

I haven't formed a preference.

Author

d-v-b commented Nov 10, 2023

@bogovicj so far I've been thinking about exclusively using {} for "empty attributes", but that's just because I have never been in a situation where "no attributes at all" vs "attributes are there, but empty" was an important distinction. I think this experience stems from attributes historically being "easy" to access, I guess because they are small in size and always have the same name.

By contrast, I think there's an argument for making the .members property of a group nullable to distinguish "there are no members" from "have not checked for members". This accommodates two scenarios: first, some storage backends (http) don't allow discovering subgroups / arrays, so members: null would signify that it wasn't possible to check for members. Second, some hierarchy parsers might want to limit how deep into a zarr hierarchy they go, so members: null would signify that no check for members was done for a given group. See zarr-developers/pydantic-zarr#2 for a discussion of this latter point.
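The second scenario (a depth-limited parse) can be sketched as below. This is an illustration on plain dicts, not the pydantic-zarr implementation: the function name `truncate` and the `max_depth` parameter are hypothetical. `members: None` means "not checked", while `members: {}` means "checked, and there are none".

```python
from typing import Any, Optional

Node = dict[str, Any]


def truncate(node: Node, max_depth: Optional[int]) -> Node:
    """Copy a ZOM-like dict, replacing members below max_depth with None.

    max_depth=None means no limit; max_depth=0 keeps only the node itself.
    """
    if node["node_type"] != "group":
        return dict(node)
    out = {k: v for k, v in node.items() if k != "members"}
    members = node.get("members")
    if members is None or (max_depth is not None and max_depth <= 0):
        out["members"] = None  # "have not checked for members"
    else:
        remaining = None if max_depth is None else max_depth - 1
        out["members"] = {
            name: truncate(child, remaining) for name, child in members.items()
        }
    return out


tree: Node = {
    "node_type": "group",
    "attributes": {},
    "members": {
        "g": {
            "node_type": "group",
            "attributes": {},
            "members": {"a": {"node_type": "array", "attributes": {}}},
        },
        "e": {"node_type": "group", "attributes": {}, "members": {}},
    },
}

full = truncate(tree, None)
shallow = truncate(tree, 1)
print(full["members"]["e"]["members"], shallow["members"]["e"]["members"])
```

In the full parse the empty group `e` reports `{}`; in the shallow parse both `e` and `g` report `None`, because neither was inspected. Without nullability the two cases would be indistinguishable.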

That being said, if the attributes: null vs attributes: {} distinction can do some work for someone, then I'd be fine considering making attributes nullable

9 participants