Do not convert subclasses of ndarray unless required #1118

Closed
wants to merge 8 commits
2 changes: 1 addition & 1 deletion xarray/core/variable.py
@@ -185,7 +185,7 @@ def _as_array_or_item(data):

TODO: remove this (replace with np.asarray) once these issues are fixed
"""
-    data = np.asarray(data)
+    data = np.asanyarray(data)
Member:
This method should probably be renamed, e.g., _as_any_array_or_item, just to make it super clear what it's doing (also update the TODO in the docstring)

Contributor Author:
Ok

if data.ndim == 0:
if data.dtype.kind == 'M':
data = np.datetime64(data, 'ns')
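The distinction this one-line change turns on can be demonstrated with a minimal ndarray subclass (`TaggedArray` here is an illustrative stand-in for `quantities.Quantity`, not part of the PR):

```python
import numpy as np

class TaggedArray(np.ndarray):
    """Minimal ndarray subclass standing in for e.g. quantities.Quantity."""
    pass

x = np.arange(3).view(TaggedArray)

# np.asarray coerces to the base class, dropping subclass information:
assert type(np.asarray(x)) is np.ndarray
# np.asanyarray passes ndarray subclasses through untouched:
assert type(np.asanyarray(x)) is TaggedArray
```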
64 changes: 64 additions & 0 deletions xarray/test/test_quantities.py
@@ -0,0 +1,64 @@
import numpy as np
Member:
Can you add a docstring explaining that this module uses quantities as a concrete example, but really it's just checking for subclass compatibility.

Also, I might call it something more generic, like test_numpy_subclasses.py.

Contributor Author:
I see. I was rather thinking of this test file as specialised towards physical units handling with quantities (i.e. as a regression-test suite geared towards #525). I'd rather make a separate test case for general subclasses, maybe with a dummy subclass.

Member:
Sure, test_units_subclass.py also seems reasonable.

import pandas as pd

from xarray import (align, broadcast, Dataset, DataArray, Variable)

from xarray.test import (TestCase, unittest)

try:
import quantities as pq
has_quantities = True
except ImportError:
has_quantities = False

def requires_quantities(test):
return test if has_quantities else unittest.skip('requires dask')(test)
Contributor:
Need to update the message here.


@requires_quantities
class TestWithQuantities(TestCase):
def setUp(self):
self.x = np.arange(10) * pq.A
self.y = np.arange(20)
self.xp = np.arange(10) * pq.J
self.v = np.arange(10*20).reshape(10,20) * pq.V
self.da = DataArray(self.v,dims=['x','y'],coords=dict(x=self.x,y=self.y,xp=(['x'],self.xp)))
Member:
Please run a PEP8 check on this code -- we try to stick to the 80 character line width and use spaces after commas.

Contributor Author:
Sure.


def assertEqualWUnits(self,a,b):
self.assertIsNotNone(getattr(a, 'units', None))
self.assertIsNotNone(getattr(b, 'units', None))
self.assertEqual(a.units,b.units)
np.testing.assert_allclose(a.magnitude,b.magnitude)

def test_units_in_data_and_coords(self):
da = self.da
self.assertEqualWUnits(da.xp.data,self.xp)
self.assertEqualWUnits(da.data,self.v)
Member:
Can we adjust things so that DataArray.values still always returns a base numpy.ndarray, but DataArray.data can return a subclass? That would be consistent with our current behavior for handling duck-type arrays with dask.

Contributor Author:
Fine, but I am not so happy about this.

Member:
can you elaborate? :)

Contributor Author:
The question is what exact guarantee .values wants to give beyond .data. For me, units are an inherent part of the data (see my comment in #525). My view is that an instance of Quantity behaves very closely to a bare ndarray, and thus shouldn't be treated any differently. The problem, of course, is that this will not necessarily be the case for a general subclass. So the question is where to draw the line between .values and .data: should it be bare ndarray vs. everything else, or ndarray subclasses vs. things that behave like arrays but aren't? The best answer depends both on what is out there in terms of subclasses and array-like objects, and on what code using xarray expects to be able to do with what it gets from .values. I don't want to judge that, because I don't think I know enough about the ecosystem xarray lives in (e.g. dask: is a dask array an ndarray subclass or not?). In any case, if you think .values should be bare ndarray only, then I'll respect that.

Member:
dask arrays are not ndarray subclasses, so agree, it's not entirely the same.

Either way, some users will need to wrap their calls to .values or .data with np.asanyarray() (to ensure the result is an ndarray subclass and not possibly a dask array) or np.asarray() (to ensure the result is a base ndarray). Actually, it's probably worth checking if np.asanyarray(data_array) does the right thing in your patch -- possibly DataArray.__array__ needs to be updated to call np.asanyarray.

Just speaking personally -- and I'm happy for others to chime in with their opinions -- I find a shortcut for np.asarray more essential than np.asanyarray. There are lots of libraries not written to handle anything other than base numpy array objects, and being able to convert into something that is guaranteed to work is valuable.

Contributor Author:
Ok, so if you say .values is really just meant to be a shortcut over asarray or asanyarray, then I agree the former is more important.
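The __array__ caveat raised above can be sketched with a dummy container. `Wrapper` is hypothetical and only mirrors the pattern of an `__array__` method that coerces with `np.asarray`; it is not xarray's actual DataArray:

```python
import numpy as np

class Tagged(np.ndarray):
    pass

class Wrapper:
    """Hypothetical container whose __array__ coerces with np.asarray,
    mirroring the concern that DataArray.__array__ may strip subclasses."""
    def __init__(self, data):
        self.data = data

    def __array__(self, dtype=None, copy=None):
        return np.asarray(self.data, dtype=dtype)

w = Wrapper(np.arange(3).view(Tagged))
# np.asanyarray defers to __array__, so the subclass is lost anyway:
assert type(np.asanyarray(w)) is np.ndarray
```

This is why, as suggested above, `__array__` itself would need to call `np.asanyarray` for the subclass to survive the round trip.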


def test_arithmetics(self):
da = self.da
f = DataArray(np.arange(10*20).reshape(10,20)*pq.A,dims=['x','y'],coords=dict(x=self.x,y=self.y))
self.assertEqualWUnits((da*f).data, da.data*f.data)
Member:
What about arithmetic that requires broadcasting? e.g., between arrays with different dimensions and/or different dimension order?

That will require ensuring that ops.transpose and ops.broadcast_to work on subclasses (e.g., xarray.Variable.transpose and xarray.Variable.expand_dims). I think the former should work already because of NumPy, but broadcast_to will probably require adding subok=True.

Contributor Author:
I'll have a go.
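A quick sketch of the broadcasting point, assuming a toy ndarray subclass: transpose goes through plain ndarray machinery and keeps the subclass, while np.broadcast_to needs subok=True to do the same.

```python
import numpy as np

class Tagged(np.ndarray):
    pass

x = np.arange(3).view(Tagged)

# transpose already preserves the subclass via ordinary ndarray methods:
assert type(x.reshape(1, 3).T) is Tagged
# broadcast_to strips the subclass by default...
assert type(np.broadcast_to(x, (2, 3))) is np.ndarray
# ...but passes it through with subok=True:
assert type(np.broadcast_to(x, (2, 3), subok=True)) is Tagged
```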


def test_unit_checking(self):
da = self.da
f = DataArray(np.arange(10*20).reshape(10,20)*pq.A,dims=['x','y'],coords=dict(x=self.x,y=self.y))
with self.assertRaisesRegex(ValueError,'Unable to convert between units'):
da + f

@unittest.expectedFailure
def test_units_in_indexes(self):
"""
Indexes are borrowed from Pandas, and Pandas does not support units.
Therefore, we currently don't intend to support units on indexes either.
Member:
usually we use comments instead of docstrings on test methods

Contributor Author:
Why? I do not see how docstrings could be worse than comments under any circumstance... On the other hand, docstrings allow for introspection from unit-testing tools. But then, I don't really care.

Contributor:
I agree, not important, but I'm really surprised by this.

Member:
This is the convention for NumPy. To be honest I'm not sure why, but I recall being told so at one point by @charris.

I see now that this isn't followed for pandas, so I guess we can break it here. I don't feel too strongly either way.

Member:
If I remember correctly, the test docstrings changed the way nose printed out the test names at runtime, which was a bit annoying when not all tests had docstrings. I don't know about pytest though.

Contributor:
pytest doesn't seem to do that, with either -v or -vv.

"""
da = self.da
self.assertEqualWUnits(da.x.data,self.x)

@unittest.expectedFailure
def test_sel(self):
self.assertEqualWUnits(self.da.sel(y=self.y[0]).values,self.v[:,0])
Member:
Does this indicate that indexing does not preserve units?

I think we should be able to fix that, see NumpyIndexingAdapter in indexing.py.

Contributor Author:
Apparently yes. I didn't get around to actually looking into this.


@unittest.expectedFailure
def test_mean(self):
self.assertEqualWUnits(self.da.mean('x').values,self.v.mean(0))
Member:
Let's add a comment here noting that xarray uses a wrapped version of np.nanmean, and I don't think there's an easy way to make that work with duck-typing.

Contributor Author:
Ok

Contributor Author:
Actually, I don't fully understand this yet: bare Quantities do work fine with nanmean. Are you saying that mean is wrapped in a more complicated way than other reduction methods, or would your comment apply to any reduction? I quickly tried finding the relevant code and converged on DataArrayGroupBy.reduce -> .apply, which seems to be the case for most reductions. Is that what is called when I do .mean() on a DataArray?

Member:
Oh, interesting -- I did not realize that Quantities worked with np.nanmean.

We handle our wrapping of nanmean and other aggregation functions over in ops.py: https://github.com/pydata/xarray/blob/master/xarray/core/ops.py#L305

This dispatches to bottleneck.nanmean if available (which is much faster), falling back to numpy.nanmean if necessary. It's quite possible that bottleneck doesn't handle subclasses even though NumPy does. If that's the case, then I think you could simply adjust the logic on this line to always use numpy instead of bottleneck for subclasses (e.g., by adding `isinstance(values, np.ndarray) and type(values) is not np.ndarray` to the condition):

if isinstance(axis, tuple) or not values.dtype.isnative or no_bottleneck:
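A sketch of the suggested tweak, with an illustrative helper name (the real logic lives inline in ops.py): the dispatch condition gains a subclass check so such arrays fall back to numpy's implementation rather than bottleneck.

```python
import numpy as np

def use_numpy_fallback(values, axis, no_bottleneck):
    # Illustrative sketch (function name is made up): extend the dispatch
    # condition so strict ndarray subclasses skip bottleneck and use
    # numpy's implementation, which handles them correctly.
    is_subclass = (isinstance(values, np.ndarray)
                   and type(values) is not np.ndarray)
    return bool(isinstance(axis, tuple) or not values.dtype.isnative
                or no_bottleneck or is_subclass)

class Tagged(np.ndarray):
    pass

# A subclass triggers the numpy fallback; a base ndarray does not:
assert use_numpy_fallback(np.arange(3).view(Tagged), 0, False)
assert not use_numpy_fallback(np.arange(3), 0, False)
```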