-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC: Dataset with schema #124
Conversation
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏 |
Thanks for putting this together @ravwojdyla. I think this is useful for doing internal type checking, and for documenting the xarray variables (what they mean, as well as their names, kinds, dimensions). Apart from the latter, I think this change should be largely invisible to the user. This is so that sgkit can be used as vanilla xarray for users, and also so that we can change things more easily in the future.
I don't think this is particularly usable, so I would not expose this method to users, and instead keep the variable name function arguments in the methods where they are immediately discoverable by users. As a user, I'd like to be able to rename variables manually using xarray APIs calls, and then sgkit would do the necessary revalidation without me having to worry about it (or even know it has happened). Regarding the variable catalogue and documentation, we could have a flat namespace so they all appeared in one place (although I'm not opposed to introducing some kind of logical grouping). Instead of |
@tomwhite thanks for the review:
I agree It's probably fair to observe that we would like vanilla xarray to work with sgkit functions, we don't want to use wrappers, yet pure xarray does lack the "context" those methods require (at least variable names), which means we have at least 4 options:
Without a doubt there is more options out there, but what do you think about those above. There are obviously pluses/minuses to each of those. wdyt? (fyi @eric-czech ) Edit: I'm quite fond of the idea of wrapper/encapsulation (+ public xarray field), but out of the options above without dive into details: [2, 1, 4, 3]. |
To some of those points:
Given that both xr.merge and xr.concat would drop schema attributes by default, I think it would actually be quite common to lose all the attributes between sgkit function calls unless a user is willing to backtrack and rethink the merging. I can imagine this meaning that they'd need to redeclare custom fields multiple times (unless there was some global context for it as mentioned in later comments).
+1 to this for sure, I like the freedom we would have with something like
+1 to that since it aligns more with
This reminds me of tf.name_scope. Personally, I like the idea of names being customizable in the function signatures, within a context manager, and globally. I can see that being a lot of work though. I haven't found it burdensome to rename things when necessary via Xarray so I think this one would be excellent, but also something we can put off on users for a while. I don't know when to say it should be a higher priority.
I definitely agree that would force us to make the functions more consistent, though I think it is an awkward construct when mixed with a bunch of Xarray/Pandas code. I'm a +1 on finding some way to make that an internal detail. |
@tomwhite @jeromekelleher @alimanfoo any opinions on this PR, and comments above ^ |
I like the idea, and it's an elegant way of stating and enforcing the shared common structure of datasets. I wonder if this is a bit premature though, or perhaps even opposed to the way that things are currently evolving. From my (hands-off, admittedly) perspective we seem to be moving towards an idiom of the dataset being a "vanilla" xarray dataset that every function adds to, and can have lots of variables and be sliced and diced in various different ways. For example, stats functions will all compute a result and these results will be included as variables in the output dataset. So, assuming we have a big library with lots of functions, then is the output variable of every function going to have to be registered in the central schema? I can see this being annoying for developers and something that we would have to explain repeatedly. But, maybe this is better than the slightly anarchic way things are going at the moment 🤷 My take would be to let the library evolve in the fairly loosely coupled way that it has been moving for another few months, and to pick this up again when things have been fleshed out a bit more. |
At the moment we have fixed variable names, which I don't think is a problem yet. Of the four options, I think I like option 1 best (provide the context as a function argument) - at least to start with, then move to 4 (function expose all variable names, and if not provided look at the global context). |
Closing this because this PR is based on an internal branch. On a weekly meeting we have decided to adjust this PR to 1, and once that is done, explore if we want to explore option 4. |
See https://github.com/pystatgen/sgkit/pull/124 for more context
See https://github.com/pystatgen/sgkit/pull/124 for more context
See https://github.com/pystatgen/sgkit/pull/124 for more context
See https://github.com/pystatgen/sgkit/pull/124 for more context
See https://github.com/pystatgen/sgkit/pull/124 for more context
See https://github.com/pystatgen/sgkit/pull/124 for more context
See https://github.com/pystatgen/sgkit/pull/124 for more context
See https://github.com/pystatgen/sgkit/pull/124 for more context
This is a POC. And I would like to first get your feedback about the idea, before finishing up tests, doc and coverage. This comes from the #43, follows up from the comment in #103. #43 is a lot more elaborated since it's trying to strive for statically typed feedback, whilst this is a lot more stripped down.
I have tried to:
The core idea is:
attrs
of the xr DatasetBenefits:
As a user:
SgkitSchema.spec
before calling the functionSgkitSchema.spec
returns a new Dataset (shallow copied) with updated schema/attrsWhat is missing:
regenie
(essentially the same as regression)This POC removes all the required/optional variable names from function arguments, and if those need to be specified or are custom user needs to specify it via
SgkitSchema.spec
, alternatively we could keep those where necessary (example) and callSgkitSchema.spec
inside the functions (triggering validation etc).One more point: we could make it redundant to declare default names in schema, and if missing in schema, but requested: assume default name, check variable against the spec, and return name.