Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Advice on unit-aware arithmetic #2176

Closed
mcgibbon opened this issue May 23, 2018 · 9 comments
Closed

Advice on unit-aware arithmetic #2176

mcgibbon opened this issue May 23, 2018 · 9 comments

Comments

@mcgibbon
Copy link
Contributor

This isn't really a bug report. In sympl we're using DataArrays that allow unit-aware operations using the 'units' attribute as the only persistent unit storage. We use pint as a backend to operate on unit strings, but this is never exposed to the user and could be swapped for another backend without much consequence.

Basically, we currently have this implemented as a subclass sympl.DataArray. @dopplershift recently introduced me to the accessor interface, and I've been thinking about whether to switch over to that way of extending DataArray.

The problem I have is that the new code that results from using an accessor is quite cumbersome. The issue lies in that we mainly use new implementations for arithmetic operations. So, for example, the following code:

dt = DataArray(timestep.total_seconds(), attrs={'units': 's'})
for key in tendencies_list[0].keys():
    return_state[key] = state[key] + dt * (
        1.5 * tendencies_list[-1][key] - 0.5 * tendencies_list[-2][key]
    )

instead becomes

dt = DataArray(timestep.total_seconds(), attrs={'units': 's'})
for key in tendencies_list[0].keys():
    return_state[key] = state[key].sympl.add(
        dt.sympl.multiply(
            tendencies_list[-1][key].sympl.multiply(1.5).sympl.subtract(
                tendencies_list[-2][key].sympl.multiply(0.5)
            )
        )
    )

This could be a little less cumbersome if we avoid a sympl namespace and instead add separate accessors for each method. At the least it reads naturally. However, there's a reason you don't generally recommend doing this.

dt = DataArray(timestep.total_seconds(), attrs={'units': 's'})
for key in tendencies_list[0].keys():
    return_state[key] = state[key].add(
        dt.multiply(
            tendencies_list[-1][key].multiply(1.5).subtract(
                tendencies_list[-2][key].multiply(0.5)
            )
        )
    )

I'm looking for advice on what is best for sympl to do here. Right now I'm leaning towards that we should use a subclass rather than an accessor - does this seem like an appropriate case to do so?

@jhamman
Copy link
Member

jhamman commented May 23, 2018

cc @kmpaul who has also been working on this sort of thing recently.

@mcgibbon
Copy link
Contributor Author

For reference, here are some of the sort of methods I've been adding that aren't currently in sympl:

def multiply(self, other):
    if isinstance(other, xr.DataArray):
        result = self._dataarray * other
        result.attrs['units'] = multiply_units(self._dataarray.attrs['units'], other.attrs['units'])
    else:
        result = self._dataarray * other
        result.attrs['units'] = self._dataarray.attrs['units']
    return result

def divide(self, other):
    print(self._dataarray, other)
    if isinstance(other, xr.DataArray):
        result = self._dataarray / other
        result.attrs['units'] = divide_units(self._dataarray.attrs['units'], other.attrs['units'])
    else:
        result = self._dataarray / other
        result.attrs['units'] = self._dataarray.attrs['units']
    return result

def add(self, other):
    result = self._dataarray + other.sympl.to_units(self._dataarray.attrs['units'])
    result.attrs['units'] = self._dataarray.attrs['units']
    return result

def subtract(self, other):
    result = self._dataarray - other.sympl.to_units(self._dataarray.attrs['units'])
    result.attrs['units'] = self._dataarray.attrs['units']
    return result

@kmpaul
Copy link
Contributor

kmpaul commented May 23, 2018

Thanks, @jhamman, for the reference.

I have, indeed, been thinking about this a lot. I've hit the same problem that you have @mcgibbon and came to the same conclusion that subclassing was the only way to go. I'm not sure I'm happy with my solution, but I wrote a (partial) implementation of a DataArray subclass to deal with units. @jhamman suggested that I pull this out into a separate repo, and I hope to do that one day... However, if you are curious, now, you can view it here:

https://github.com/NCAR/PyConform/tree/rewrite/source/pyconform/physarrays

Note that this implementation deals with time as just another unit, rather than requiring any special dealing with time. This works if you units package can deal with calendared times, too. I am currently using cf_units to deal with the units, not pint, but the result is similar. However, cf_units does not deal with calendar conversion, and it doesn't always deal with non-standard calendars well. In the end, though, if I could voice my own 2 cents, I believe that calendared times should be dealt with using the same mechanics as any other unit. Just my bias.

Also, note that this implementation also deals with the positive attribute of the data, in addition to the units and calendar attributes. The positive attribute can be tricky, but if the values don't match between two arrays, then you need to convert before doing any math.

Let me know what you think about this. If it's a priority, I can pull this out into its own repository.

@shoyer
Copy link
Member

shoyer commented May 23, 2018

If you implement arithmetic special methods like __add__ and __radd__ on your accessor, you could write things like:

dt = DataArray(timestep.total_seconds(), attrs={'units': 's'})
for key in tendencies_list[0].keys():
    return_state[key] = state[key].sympl + dt.sympl * (
        1.5 * tendencies_list[-1][key].sympl - 0.5 * tendencies_list[-2][key].sympl
    )

It's awkward but you can still use the infix notation.

Subclassing or writing a DataArray wrapper like what @kmpaul did in physarrays seems like the cleanest way to do this currently. The main challenge is to avoid relying on non-public parts of the DataArray API, but I think overwriting arithmetic would be pretty safe.

In the long term, I think the best solution here is probably an extensible interface for data in xarray objects that would let you put an array with units into an xarray object, similar to the hooks we want for sparse arrays (#1938).

@mcgibbon
Copy link
Contributor Author

@shoyer that notation might work, thanks for pointing it out! Maybe we can think of a more natural name for the accessor ("with_units"? "keep_units"? "uarray"? "u"?). I find the "storage" of units as a string in attrs to be much cleaner than any other implementation I've seen so far (like implementations that have a unit container over an underlying array, or an array of unit-aware objects). It has the added benefit that this is how units are conventionally stored in netCDF files. I don't think it's necessary to use a class other than ndarray for data storage.

@kmpaul the main reason I stayed away from cf_units is that I had bad experiences trying to get it to build with its dependencies in the past. Particularly it's an issue that it depends on the Udunits C library, which requires MinGW to install on Windows and has generally been a headache for me. I'd much prefer a pure Python unit handling implementation. For Sympl, we don't care so much about time units, because time is stored using datetime objects (potentially from the cftime package for alternate calendars). This is also the way that time units are conventionally stored in xarray, once decoded.

It may make sense for us to use some kind of stand-alone unit-aware DataArray implementation. I'd just need to be convinced that yours is well-designed, thoroughly tested, and easy to install with pip. The main things concerning me about PhysArray are 1) As a container rather than subclass, it does not implement many of the methods of DataArray and 2) There are a few design choices I don't understand, like why calendar is always a property of a PhysArray even when it isn't storing a time, why cftime objects aren't used instead of units to manage time, and why the positive attribute is important enough for PhysArray to manage (I've never seen it in any data I've used, and it's easy to check if a DataArray is all positive or negative with a function call). We could discuss these by e-mail if you like (it's my username at uw.edu). Other possibilities are that I'll take the implementation we come up with and give it its own package, or that we'll collaborate on such a package.

@kmpaul
Copy link
Contributor

kmpaul commented May 24, 2018

Well, I'm certainly not trying to argue that the PhysArray implementation is the solution. It's just a solution, and a solution for a much smaller problem.

I understand your concerns about cf_units. I've heard many people make the same argument for pint, and I appreciate it. I went with cf_units because of its dependence on UDUNITS, which is declared as the "standard" in the CF Conventions. I never had the time (nor do I foresee ever having the time) to see if pint was fully compliant with UDUNITS, so I went with something I knew was.

I personally think it's fine to discuss this here, unless other people would like to see this go offline. To address some of your issues:

  1. As a container rather than subclass, it does not implement many of the methods of DataArray

Yes. I agree. In fact, my first implementation was a subclass of DataArray, but after reading the Xarray documentation here and the Issue #706, I decided on composition for this implementation.

  1. There are a few design choices I don't understand, like why calendar is always a property of a PhysArray even when it isn't storing a time

I personally didn't see much of a benefit to a check like hasattr(obj, 'calendar') versus obj.calendar is None (other than the fact that these two checks are opposite). So, for my code, I felt that obj.calendar is None was a reasonable check to see if the units were "calendared time" units.

why cftime objects aren't used instead of units to manage time

As I mentioned in my post, this is just my personal preference. I think that calendared time units are...well...just units, and that the same mechanism for dealing with all other units should deal with calendared time units. It's just an aesthetic that I prefer. (That said, I'm extremely happy that someone finally dealt with non-standard calendars with cftime.)

why the positive attribute is important enough for PhysArray to manage (I've never seen it in any data I've used, and it's easy to check if a DataArray is all positive or negative with a function call)

The CF Conventions define the positive attribute to indicate that the field represents the vertical component of a vector, and the value of positive indicates what direction the vertical component points (up or down) if the data is positive. So, if you add field X and field Y, and X.positive == 'up' but Y.positive == 'down', then you need to actually subtract the fields. It's an annoying attribute that I've had to deal with when preparing data for CMIP6.

In the end, I'm happy doing whatever the community wants with this code. I can pull it out into it's own repo (along with the unit tests, which are also in the PyConform repo). And then, after that, I have no problem with people taking it in a different direction (e.g., using pint, injecting properties based on the value of the attributes, etc.). I'm also happy with a subclass option, as my C++ experiences in the past have made me very comfortable with inheritance. I take guidance easily. I just don't usually have a lot of time to devote to development any more these days. :-(

@dopplershift
Copy link
Contributor

My problem with custom classes (subclasses or composition) is that you will never get those from a Dataset that you get from open_dataset(). That’s a problem for me when I’m trying to make things work for users who want to do simple scripted data analysis. I’m much more interested in #1938 (or #1118).

I’m also looking at moving from pint to unyt, which is Yt’s unit support, brought into a standalone package. Beyond some performance benefits (I’ve heard), it has the benefit of being a ndarray subclass, which means asanyarray will leave it alone. That would seem to make it easier to cram into a DataArray.

@mcgibbon
Copy link
Contributor Author

@dopplershift That's a good point. It's pretty trivial to create a sympl.DataArray from an xarray.DataArray, so perhaps I should be using a decorator that will convert xarray.DataArray to sympl.DataArray whenever one is passed into a sympl call. This would be similarly easy to do in metpy. One could also write a function to convert Dataset into one that contains unit-aware DataArray objects, or an open_dataset that calls xarray.open_dataset and then does such a conversion, though I'd wonder if certain Dataset calls (e.g. mean) might undo such a conversion.

In sympl our main concerns are unit checking at the boundary of components and in properly converting units when time stepping or adding outputs of components together. Maybe sympl should only be using this DataArray subclass internally, with type conversions or wrapping when taking DataArrays into and out of its methods? That would solve a lot of our problems.

unyt may be a better choice than pint for MetPy. Like I said in Sympl we don't use pint for unit information storage, only for conversion and arithmetic, so whether it uses an ndarray subclass doesn't apply for us.

@mcgibbon
Copy link
Contributor Author

Thank you for the input everyone, this discussion has been very useful! Closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants