
Allow basic xarray commands to be applied to whole PredictionEnsemble object #102

Closed
bradyrx opened this issue Apr 28, 2019 · 7 comments · Fixed by #243

Comments


bradyrx commented Apr 28, 2019

I think it would be extremely useful to be able to call something like ReferenceEnsemble.sel(time=slice(1980, 2000)) and have it apply to every product (e.g., reconstruction, data, uninitialized run).

Some commands I'd like to implement (a rough sketch of how this forwarding could look follows the list):

  • sel()
  • isel()
  • squeeze()
  • groupby()
  • rolling()
  • resample()
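A rough sketch of how this forwarding could look, assuming a hypothetical PredictionEnsemble container that holds one xarray Dataset per product (none of these names are climpred's actual API):

class PredictionEnsemble(object):
    def __init__(self, initialized, **products):
        # e.g. products = dict(reconstruction=recon_ds, uninitialized=uninit_ds)
        self._datasets = {'initialized': initialized, **products}

    def _apply(self, method, *args, **kwargs):
        # Forward an xarray Dataset method to every product and wrap
        # the results in a new ensemble.
        applied = {name: getattr(ds, method)(*args, **kwargs)
                   for name, ds in self._datasets.items()}
        initialized = applied.pop('initialized')
        return PredictionEnsemble(initialized, **applied)

    def sel(self, *args, **kwargs):
        return self._apply('sel', *args, **kwargs)

    def isel(self, *args, **kwargs):
        return self._apply('isel', *args, **kwargs)

# usage (with init_ds and recon_ds defined elsewhere):
# ensemble = PredictionEnsemble(init_ds, reconstruction=recon_ds)
# subset = ensemble.sel(time=slice('1980', '2000'))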

ahuang11 commented May 10, 2019

I forgot to mention one thing during the call: I'm worried about the maintainability of building an object on top of xarray datasets.

From https://xarray.pydata.org/en/stable/internals.html

"
One standard solution to this problem is to subclass Dataset and/or DataArray to add domain specific functionality. However, inheritance is not very robust. It’s easy to inadvertently use internal APIs when subclassing, which means that your code may break when xarray upgrades. Furthermore, many builtin methods will only return native xarray objects.

The standard advice is to use composition over inheritance, but reimplementing an API as large as xarray’s on your own objects can be an onerous task, even if most methods are only forwarding to xarray implementations.
"

Just recently, for example, xarray introduced the coarsen method, which may be useful for this project (as may the interp and roll methods).
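For example, coarsen block-averages along dimensions. A minimal sketch using the xarray tutorial dataset (the window sizes are arbitrary):

import xarray as xr

ds = xr.tutorial.open_dataset('air_temperature')
# Average over 5x5 windows in lat/lon; boundary='trim' drops leftover
# points when a dimension length is not a multiple of the window size.
coarse = ds.coarsen(lat=5, lon=5, boundary='trim').mean()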

A suggestion via accessors: that way you can still use the native xarray methods, such as filtering by attrs. The naming convention and dimensions should definitely be checked in the functions, but this is just a rough sketch (full code below).

Then you can implement a score method (also shown in the code below).

I understand that I'm introducing new things and Aaron said not to :P so you guys can decide!

import xarray as xr
import numpy as np
# rmse was undefined in the original sketch; assuming it comes from
# xskillscore, whose rmse(a, b, dim) signature matches the call below.
from xskillscore import rmse

mod_ds = xr.tutorial.open_dataset('air_temperature')
# clear attrs
mod_ds.attrs = {}
mod_ds['air'].attrs = {}

obs_ds = xr.tutorial.open_dataset('air_temperature')
obs_ds['air'] -= np.random.rand(*obs_ds['air'].shape)
# clear attrs
obs_ds.attrs = {}
obs_ds['air'].attrs = {}

perfect_ds = xr.tutorial.open_dataset('air_temperature')
perfect_ds['air'] -= np.random.rand(*perfect_ds['air'].shape)
# clear attrs
perfect_ds.attrs = {}
perfect_ds['air'].attrs = {}

@xr.register_dataset_accessor('cp')
class ClimPred(object):
    def __init__(self, xr_obj):
        self.ds = xr_obj

    def label(self, name, category, ds=None):
        # Tag each data variable with a 'category' attr and fold the product
        # name into the variable name, e.g. 'air' -> 'air (CESM model)'.
        if ds is None:
            ds = self.ds

        for var in ds:
            if 'category' in ds[var].attrs:
                continue
            ds[var] = ds[var].assign_attrs(**{'category': category})
            ds = ds.rename({var: f'{var} ({name} {category})'})
        return ds

    def add_reference(self, ds, name, category):
        ds = xr.merge([self.ds, self.label(name, category, ds=ds)])
        return ds

    def score(self):
        # For each (obs, model) pair, compute RMSE over time and append
        # it to the dataset as a new variable, e.g. 'air rmse (CESM NMME)'.
        for obs in self.ds.filter_by_attrs(**{'category': 'obs'}):
            for mod in self.ds.filter_by_attrs(**{'category': 'model'}):
                field_name = mod.split(' ')[0]
                mod_name = mod.split(' ')[1].lstrip('(')
                obs_name = obs.split(' ')[1].lstrip('(')
                var_name = f'{field_name} rmse ({mod_name} {obs_name})'
                self.ds[var_name] = rmse(self.ds[obs], self.ds[mod], 'time')
        return self.ds

ds = mod_ds.cp.label('CESM', 'model')
ds = ds.cp.add_reference(obs_ds, 'NMME', 'obs')
ds = ds.cp.add_reference(perfect_ds, 'perfect', 'obs')
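Scoring then looks like this (continuing the sketch; rmse here is the metric imported at the top):

# Appends RMSE variables such as 'air rmse (CESM NMME)' to the dataset:
scored = ds.cp.score()
print(list(scored.data_vars))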

@aaronspring

I had strong objections to introducing such things after two weeks.

@aaronspring

About the user interface: I prefer to have just one data variable in the process, so you can also compute skill over a dataset with lots of variables. That may be a use case particular to the perfect model.

Accessors sound useful. You probably know more about them than we do.


bradyrx commented May 13, 2019

Thanks for bringing up this discussion. I see your point about missing out on new features, or things breaking, if we don't use accessors.

I'm concerned by the following things with this accessor mockup:

  1. It requires an intro step that isn't very clear. You have to run ds.cp.label() on the initialized output to generate the beginning of the object. Perhaps there is another way to do this, though.
  2. It doesn't seem like there's any way to make a clear distinction between a perfect-model setup and a hindcast setup. This is important for our package since the statistics are very different between the two. I guess this could be alleviated by having an attribute like ds.type that says whether it's perfect model or hindcast. Then our functions can look for that label to guide things under the hood.
  3. Dimension naming will become confusing. The initialized object (hindcast) should have an init dimension, while the observations and uninitialized should have time.
  4. This forces the user to use ds.cp always which feels clunky. I.e., you have to remember to add the .cp accessor for all of our functions. Although this probably isn't a huge deal and would be documented well.
  5. It looks like you'd need separate datasets for each variable (which is maybe what @aaronspring is saying).
  6. Would there be any way to return the statistics as a separate ds? I feel like it would get crowded if rmse results and so on continue to get appended to the main dataset.

On the bright side:

  1. You could just call ds.sel(), ds.coarsen(), etc. directly on this dataset and have the functions applied. You couldn't do HindcastEnsemble.coarsen() without building an explicit method that loops through the individual datasets to coarsen them.
  2. Given the previous point, this would probably make the classes' code a lot shorter, since otherwise we'd have to add a method for each xarray command we want to support.

With the classes, I really appreciate that we can get a clear view of what's going on:
[screenshot: the printed summary of the class object]

If the biggest concern is to allow for multiple models, I can modify the classes to accept them.

@ahuang11, let me know your thoughts on all this. I'm open to overhauling these classes for the next round (v2), so we can have plenty of time to discuss this. I think accessors might limit the scope of our project for the above reasons. But they also lend important features like keeping up with new xarray methods.


ahuang11 commented May 14, 2019

  1. It requires an intro step that isn't very clear. You have to run ds.cp.label() on the initialized output to generate the beginning of the object. Perhaps there is another way to do this, though.

Yep, you can do

ds = ClimPred(ds).label('CESM', 'model')

Or write a completely separate function:

    def label(ds, name, category):
        for var in ds:
            if 'category' in ds[var].attrs:
                continue
            ds[var] = ds[var].assign_attrs(**{'category': category})
            ds = ds.rename({var: f'{var} ({name} {category})'})
        return ds

    ds = label(ds, 'CESM', 'model')

and in the class, check if it's labeled properly.
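For example, the check could look like this (an illustrative sketch to add to the accessor's __init__):

    def __init__(self, xr_obj):
        self.ds = xr_obj
        # Refuse to operate on unlabeled variables.
        for var in self.ds:
            if 'category' not in self.ds[var].attrs:
                raise ValueError(f"{var} has no 'category' attr; run label() first")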

  2. It doesn't seem like there's any way to make a clear distinction between a perfect-model setup and a hindcast setup. This is important for our package since the statistics are very different between the two. I guess this could be alleviated by having an attribute like ds.type that says whether it's perfect model or hindcast. Then our functions can look for that label to guide things under the hood.

Right, it was a quick mockup. You could add arbitrary attributes to the dataset, and use filter_by_attrs or do a simple check like if ds.attrs['setup'] == 'perfect_model'.
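For example (a minimal sketch; the 'setup' attribute and the two skill functions are made up for illustration):

ds.attrs['setup'] = 'perfect_model'  # tagged once when the object is built

def compute_skill(ds):
    # Branch on the hypothetical 'setup' attribute.
    if ds.attrs.get('setup') == 'perfect_model':
        return perfect_model_skill(ds)  # placeholder for that code path
    return hindcast_skill(ds)           # placeholder for the hindcast path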

  3. Dimension naming will become confusing. The initialized object (hindcast) should have an init dimension, while the observations and uninitialized should have time.

No comment :p

  4. This forces the user to use ds.cp always which feels clunky. I.e., you have to remember to add the .cp accessor for all of our functions. Although this probably isn't a huge deal and would be documented well.

You don't have to. You can instantiate the class once, cp = ClimPred(ds), and then call cp.function().

  5. It looks like you'd need separate datasets for each variable (which is maybe what @aaronspring is saying).

You don't have to. You can again add the variable name to the attributes and filter that way.

  6. Would there be any way to return the statistics as a separate ds? I feel like it would get crowded if rmse results and so on continue to get appended to the main dataset.

Sure, this was just a mockup. You can easily return stat_ds.

With the classes, I really appreciate that we can get a clear view of what's going on:
[screenshot: the printed summary of the class object]

Yes, that is a good overview, and I believe you can reimplement that as an accessor method like ds.cp.summary()
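For example, a summary method added to the ClimPred accessor sketched above might look like this (purely illustrative):

    def summary(self):
        # Print one line per data variable, grouped by the 'category' attr.
        for category in ('model', 'obs'):
            print(f'{category}:')
            for var in self.ds.filter_by_attrs(category=category):
                print(f'    {var}: dims={dict(self.ds[var].sizes)}')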

If the biggest concern is to allow for multiple models, I can modify the classes to accept them.

Nah, just maintainability, keeping up with xarray.

@ahuang11, let me know your thoughts on all this. I'm open to overhauling these classes for the next round (v2), so we can have plenty of time to discuss this. I think accessors might limit the scope of our project for the above reasons. But they also lend important features like keeping up with new xarray methods.

Yes, v2 sounds like a good target. I suppose I want to get this right ahead of time so it won't be like a Python 2 to 3 transition if this package gets popular :P

@aaronspring

  2. It doesn't seem like there's any way to make a clear distinction between a perfect-model setup and a hindcast setup. This is important for our package since the statistics are very different between the two. I guess this could be alleviated by having an attribute like ds.type that says whether it's perfect model or hindcast. Then our functions can look for that label to guide things under the hood.

For xr commands like .groupby('time.year').mean('time') or coarsen(), I don't see any difference between the perfect-model and hindcast types.


bradyrx commented May 14, 2019

@ahuang11 -- thanks for the clarifications. I figured there were alleviations for all points but wanted to see what you had to say. I think accessors are the way to go, and maybe that's the major item for the v2 rollout that you and I can focus on. I am particularly happy to hear we can rewrite it as cp = ClimPred(ds), for instance.

Yes, v2 sounds like a good target. I suppose I want to get this right ahead of time so it won't be like a Python 2 to 3 transition if this package gets popular :P

I agree 100%. This forced me to learn Python classes anyhow, so I'm not mad. Let's put a pin in this for now and pick it up in a conference call following v1's completion. For now, we can have the classes access our updated functions (which shouldn't require change), and then we'll do a rewrite "early on" to switch to accessors.

For xr commands like .groupby('time.year').mean('time') or coarsen() I dont see any differences between perfect-model and hindcast type.

Agreed. My comment was more about comparisons and so on. But we can alleviate that by forcing a ds.type attribute on initialization.

Good conversation! Excited to dig into this. Appropriately implementing this will put the package in the place I really want to see it... with smart easy-to-use objects and methods.
