
Allow basic xarray commands to be applied to whole PredictionEnsemble object #102

Closed
bradyrx opened this issue Apr 28, 2019 · 7 comments · Fixed by #243

Comments


bradyrx commented Apr 28, 2019

I think it would be extremely useful to be able to call something like ReferenceEnsemble.sel(time=slice(1980, 2000)) and have it apply to every product (e.g., reconstruction, data, uninitialized run).

Some commands I'd like to implement (a rough sketch of how this forwarding could look follows the list):

  • sel()
  • isel()
  • squeeze()
  • groupby()
  • rolling()
  • resample()
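A rough sketch of how this forwarding could look, assuming a hypothetical PredictionEnsemble container that holds one xarray Dataset per product (none of these names are climpred's actual API):

class PredictionEnsemble(object):
    def __init__(self, initialized, **products):
        # e.g. products = dict(reconstruction=recon_ds, uninitialized=uninit_ds)
        self._datasets = {'initialized': initialized, **products}

    def _apply(self, method, *args, **kwargs):
        # Forward an xarray Dataset method to every product and wrap
        # the results in a new ensemble.
        applied = {name: getattr(ds, method)(*args, **kwargs)
                   for name, ds in self._datasets.items()}
        initialized = applied.pop('initialized')
        return PredictionEnsemble(initialized, **applied)

    def sel(self, *args, **kwargs):
        return self._apply('sel', *args, **kwargs)

    def isel(self, *args, **kwargs):
        return self._apply('isel', *args, **kwargs)

# usage (with init_ds and recon_ds defined elsewhere):
# ensemble = PredictionEnsemble(init_ds, reconstruction=recon_ds)
# subset = ensemble.sel(time=slice('1980', '2000'))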

ahuang11 commented May 10, 2019

I forgot to mention one thing during the call: I'm worried about the maintainability of building an object on top of xarray datasets.

From https://xarray.pydata.org/en/stable/internals.html

"
One standard solution to this problem is to subclass Dataset and/or DataArray to add domain specific functionality. However, inheritance is not very robust. It’s easy to inadvertently use internal APIs when subclassing, which means that your code may break when xarray upgrades. Furthermore, many builtin methods will only return native xarray objects.

The standard advice is to use composition over inheritance, but reimplementing an API as large as xarray’s on your own objects can be an onerous task, even if most methods are only forwarding to xarray implementations.
"

Just recently, for example, xarray introduced the coarsen method, which may be useful for this project (as may the interp and roll methods).
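For example, coarsen block-averages along dimensions. A minimal sketch using the xarray tutorial dataset (the window sizes are arbitrary):

import xarray as xr

ds = xr.tutorial.open_dataset('air_temperature')
# Average over 5x5 windows in lat/lon; boundary='trim' drops leftover
# points when a dimension length is not a multiple of the window size.
coarse = ds.coarsen(lat=5, lon=5, boundary='trim').mean()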

A suggestion via accessors: that way you can still use the native xarray methods, such as filtering by attrs. The naming convention and dimensions should definitely be checked in the functions, but this is just a rough sketch (full code below).

Then you can implement a score method (also shown in the code below).

I understand that I'm introducing new things and Aaron said not to :P so you guys can decide!

import xarray as xr
import numpy as np
# rmse was undefined in the original sketch; assuming it comes from
# xskillscore, whose rmse(a, b, dim) signature matches the call below.
from xskillscore import rmse

mod_ds = xr.tutorial.open_dataset('air_temperature')
# clear attrs
mod_ds.attrs = {}
mod_ds['air'].attrs = {}

obs_ds = xr.tutorial.open_dataset('air_temperature')
obs_ds['air'] -= np.random.rand(*obs_ds['air'].shape)
# clear attrs
obs_ds.attrs = {}
obs_ds['air'].attrs = {}

perfect_ds = xr.tutorial.open_dataset('air_temperature')
perfect_ds['air'] -= np.random.rand(*perfect_ds['air'].shape)
# clear attrs
perfect_ds.attrs = {}
perfect_ds['air'].attrs = {}

@xr.register_dataset_accessor('cp')
class ClimPred(object):
    def __init__(self, xr_obj):
        self.ds = xr_obj

    def label(self, name, category, ds=None):
        # Tag each data variable with a 'category' attr and fold the product
        # name into the variable name, e.g. 'air' -> 'air (CESM model)'.
        if ds is None:
            ds = self.ds

        for var in ds:
            if 'category' in ds[var].attrs:
                continue
            ds[var] = ds[var].assign_attrs(**{'category': category})
            ds = ds.rename({var: f'{var} ({name} {category})'})
        return ds

    def add_reference(self, ds, name, category):
        ds = xr.merge([self.ds, self.label(name, category, ds=ds)])
        return ds

    def score(self):
        # For each (obs, model) pair, compute RMSE over time and append
        # it to the dataset as a new variable, e.g. 'air rmse (CESM NMME)'.
        for obs in self.ds.filter_by_attrs(**{'category': 'obs'}):
            for mod in self.ds.filter_by_attrs(**{'category': 'model'}):
                field_name = mod.split(' ')[0]
                mod_name = mod.split(' ')[1].lstrip('(')
                obs_name = obs.split(' ')[1].lstrip('(')
                var_name = f'{field_name} rmse ({mod_name} {obs_name})'
                self.ds[var_name] = rmse(self.ds[obs], self.ds[mod], 'time')
        return self.ds

ds = mod_ds.cp.label('CESM', 'model')
ds = ds.cp.add_reference(obs_ds, 'NMME', 'obs')
ds = ds.cp.add_reference(perfect_ds, 'perfect', 'obs')
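Scoring then looks like this (continuing the sketch; rmse here is the metric imported at the top):

# Appends RMSE variables such as 'air rmse (CESM NMME)' to the dataset:
scored = ds.cp.score()
print(list(scored.data_vars))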

@aaronspring

I had strong objections to introducing such things after two weeks.

@aaronspring

About the user interface: I prefer to have just one data variable in the process, so you can also compute skill over a dataset with lots of variables. That may be a use case particular to the perfect model.

Accessors sound useful. You probably know more about them than we do.


bradyrx commented May 13, 2019

Thanks for bringing up this discussion. I see your point about missing out on new features, or things breaking, if we don't use accessors.

I'm concerned by the following things with this accessor mockup:

  1. It requires an intro step that isn't very clear. You have to run ds.cp.label() on the initialized output to generate the beginning of the object. Perhaps there is another way to do this, though.
  2. It doesn't seem like there's any way to make a clear distinction between a perfect-model setup and a hindcast setup. This is important for our package since the statistics are very different between the two. I guess this could be alleviated by having an attribute like ds.type that says whether it's perfect model or hindcast. Then our functions can look for that label to guide things under the hood.
  3. Dimension naming will become confusing. The initialized object (hindcast) should have an init dimension, while the observations and uninitialized should have time.
  4. This forces the user to use ds.cp always which feels clunky. I.e., you have to remember to add the .cp accessor for all of our functions. Although this probably isn't a huge deal and would be documented well.
  5. It looks like you'd need separate datasets for each variable (which is maybe what @aaronspring is saying).
  6. Would there be any way to return the statistics as a separate ds? I feel like it would get crowded if rmse results and so on continue to get appended to the main dataset.

On the bright side:

  1. You could just call ds.sel(), ds.coarsen(), etc. directly on this dataset and have the functions applied. You couldn't do HindcastEnsemble.coarsen() without building an explicit method that loops through the individual datasets to coarsen them.
  2. Given the previous point, this would probably make the classes' code a lot shorter, since otherwise we'd have to add a method for each xarray command we want to support.

With the classes, I really appreciate that we can get a clear view of what's going on:
[screenshot: the printed summary of the class object]

If the biggest concern is to allow for multiple models, I can modify the classes to accept them.

@ahuang11, let me know your thoughts on all this. I'm open to overhauling these classes for the next round (v2), so we can have plenty of time to discuss this. I think accessors might limit the scope of our project for the above reasons. But they also lend important features like keeping up with new xarray methods.


ahuang11 commented May 14, 2019

  1. It requires an intro step that isn't very clear. You have to run ds.cp.label() on the initialized output to generate the beginning of the object. Perhaps there is another way to do this, though.

Yep, you can do

ds = ClimPred(ds).label('CESM', 'model')

Or write a completely separate function:

    def label(ds, name, category):
        for var in ds:
            if 'category' in ds[var].attrs:
                continue
            ds[var] = ds[var].assign_attrs(**{'category': category})
            ds = ds.rename({var: f'{var} ({name} {category})'})
        return ds

    ds = label(ds, 'CESM', 'model')

and in the class, check if it's labeled properly.
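For example, the check could look like this (an illustrative sketch to add to the accessor's __init__):

    def __init__(self, xr_obj):
        self.ds = xr_obj
        # Refuse to operate on unlabeled variables.
        for var in self.ds:
            if 'category' not in self.ds[var].attrs:
                raise ValueError(f"{var} has no 'category' attr; run label() first")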

  2. It doesn't seem like there's any way to make a clear distinction between a perfect-model setup and a hindcast setup. This is important for our package since the statistics are very different between the two. I guess this could be alleviated by having an attribute like ds.type that says whether it's perfect model or hindcast. Then our functions can look for that label to guide things under the hood.

Right, it was a quick mockup. You could add arbitrary attributes to the dataset, and use filter_by_attrs or do a simple check like if ds.attrs['setup'] == 'perfect_model'.
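For example (a minimal sketch; the 'setup' attribute and the two skill functions are made up for illustration):

ds.attrs['setup'] = 'perfect_model'  # tagged once when the object is built

def compute_skill(ds):
    # Branch on the hypothetical 'setup' attribute.
    if ds.attrs.get('setup') == 'perfect_model':
        return perfect_model_skill(ds)  # placeholder for that code path
    return hindcast_skill(ds)           # placeholder for the hindcast path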

  3. Dimension naming will become confusing. The initialized object (hindcast) should have an init dimension, while the observations and uninitialized should have time.

No comment :p

  4. This forces the user to use ds.cp always which feels clunky. I.e., you have to remember to add the .cp accessor for all of our functions. Although this probably isn't a huge deal and would be documented well.

You don't have to. You can instantiate the class once, cp = ClimPred(ds), and then call cp.function().

  5. It looks like you'd need separate datasets for each variable (which is maybe what @aaronspring is saying).

You don't have to. You can again add the variable name to the attributes and filter that way.

  6. Would there be any way to return the statistics as a separate ds? I feel like it would get crowded if rmse results and so on continue to get appended to the main dataset.

Sure, this was just a mockup. You can easily return stat_ds.

With the classes, I really appreciate that we can get a clear view of what's going on:
[screenshot: the printed summary of the class object]

Yes, that is a good overview, and I believe you can reimplement that as an accessor method like ds.cp.summary()
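For example, a summary method added to the ClimPred accessor sketched above might look like this (purely illustrative):

    def summary(self):
        # Print one line per data variable, grouped by the 'category' attr.
        for category in ('model', 'obs'):
            print(f'{category}:')
            for var in self.ds.filter_by_attrs(category=category):
                print(f'    {var}: dims={dict(self.ds[var].sizes)}')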

If the biggest concern is to allow for multiple models, I can modify the classes to accept them.

Nah, just maintainability, keeping up with xarray.

@ahuang11, let me know your thoughts on all this. I'm open to overhauling these classes for the next round (v2), so we can have plenty of time to discuss this. I think accessors might limit the scope of our project for the above reasons. But they also lend important features like keeping up with new xarray methods.

Yes, v2 sounds like a good target. I suppose I want to get this right ahead of time so it won't be like a Python 2 to 3 transition if this package gets popular :P

@aaronspring

  2. It doesn't seem like there's any way to make a clear distinction between a perfect-model setup and a hindcast setup. This is important for our package since the statistics are very different between the two. I guess this could be alleviated by having an attribute like ds.type that says whether it's perfect model or hindcast. Then our functions can look for that label to guide things under the hood.

For xr commands like .groupby('time.year').mean('time') or coarsen(), I don't see any difference between the perfect-model and hindcast types.


bradyrx commented May 14, 2019

@ahuang11 -- thanks for the clarifications. I figured there were alleviations for all points but wanted to see what you had to say. I think accessors are the way to go, and maybe that's the major item for the v2 rollout that you and I can focus on. I am particularly happy to hear we can rewrite it as cp = ClimPred(ds), for instance.

Yes, v2 sounds like a good target. I suppose I want to get this right ahead of time so it won't be like a Python 2 to 3 transition if this package gets popular :P

I agree 100%. This forced me to learn Python classes anyhow, so I'm not mad. Let's put a pin in this for now and pick it up in a conference call following v1's completion. For now, we can have the classes access our updated functions (which shouldn't require change), and then we'll do a rewrite "early on" to switch to accessors.

For xr commands like .groupby('time.year').mean('time') or coarsen() I dont see any differences between perfect-model and hindcast type.

Agreed. My comment was more about comparisons and so on. But we can alleviate that by forcing a ds.type attribute on initialization.

Good conversation! Excited to dig into this. Appropriately implementing this will put the package in the place I really want to see it... with smart easy-to-use objects and methods.
