Do not convert subclasses of ndarray unless required #1118
Conversation
This does seem pretty reasonable to me. But is this actually running in CI? I think you need to add the quantities dependency to the CI requirements.
You are right. There seem to be quite a number of varying requirements files -- which ones should it go into?
It either uses conda (by default) or pip. Please do add it to everything, except perhaps the build with minimal dependencies.
xarray/test/test_quantities.py (Outdated)

    has_quantities = False

    def requires_quantities(test):
        return test if has_quantities else unittest.skip('requires dask')(test)
Need to update the message here.
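A possible fix for that message might look like this (a sketch; the try/except import is an assumption, since only `has_quantities = False` is visible in the hunk above):

```python
import unittest

try:
    import quantities as pq  # noqa: F401
    has_quantities = True
except ImportError:
    has_quantities = False


def requires_quantities(test):
    # The skip message now names the dependency that is actually missing.
    if has_quantities:
        return test
    return unittest.skip('requires python-quantities')(test)
```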
Travis succeeds, though there are lots of failures under environments with allowed failures. They look unrelated to me, but I find it hard to tell. AppVeyor doesn't seem to run the quantities tests, so I guess the requirements there are missing too. Where would I add requirements for AppVeyor?
Indeed, the allowed failures are unrelated: #1109. See here for the AppVeyor installation: https://github.com/pydata/xarray/blob/master/appveyor.yml -- it would be good to switch it to use …
Can you add a few more integration tests to verify that basic operations work with ndarray subclasses, even if they don't preserve the subclass nature? For example, it would be good to add such a test for xarray.concat.
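A sketch of the kind of integration test this asks for (the test name and fixtures are assumptions; whether it currently passes is exactly what such a test would tell us):

```python
import numpy as np
import quantities as pq
import xarray


def test_concat_with_ndarray_subclass():
    # Two small DataArrays whose underlying .data are Quantity instances
    # (an ndarray subclass carrying units).
    a = xarray.DataArray(np.arange(3) * pq.m, dims=['x'])
    b = xarray.DataArray(np.arange(3, 6) * pq.m, dims=['x'])
    result = xarray.concat([a, b], dim='x')
    # Even if the subclass (and its units) is not preserved, the operation
    # should at least succeed and the plain values should round-trip.
    np.testing.assert_array_equal(np.asarray(result), np.arange(6))
```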
xarray/test/test_quantities.py (Outdated)

    self.y = np.arange(20)
    self.xp = np.arange(10) * pq.J
    self.v = np.arange(10*20).reshape(10,20) * pq.V
    self.da = DataArray(self.v,dims=['x','y'],coords=dict(x=self.x,y=self.y,xp=(['x'],self.xp)))
Please run a PEP8 check on this code -- we try to stick to the 80 character line width and use spaces after commas.
Sure.
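For reference, the setUp lines from the hunk above, rewrapped to pass a PEP8 check (same content, just line wrapping and spaces after commas):

```python
self.y = np.arange(20)
self.xp = np.arange(10) * pq.J
self.v = np.arange(10 * 20).reshape(10, 20) * pq.V
self.da = DataArray(self.v, dims=['x', 'y'],
                    coords=dict(x=self.x, y=self.y, xp=(['x'], self.xp)))
```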
xarray/test/test_quantities.py (Outdated)

    @unittest.expectedFailure
    def test_sel(self):
        self.assertEqualWUnits(self.da.sel(y=self.y[0]).values,self.v[:,0])
Does this indicate that indexing does not preserve units? I think we should be able to fix that, see NumpyIndexingAdapter in indexing.py.
Apparently yes. I didn't get around to actually looking into this yet.
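For context, plain NumPy indexing on a Quantity does keep the subclass and its units, which is why fixing this inside xarray's indexing machinery seems plausible (a quick check, assuming python-quantities is installed):

```python
import numpy as np
import quantities as pq

v = np.arange(10 * 20).reshape(10, 20) * pq.V

# Direct NumPy indexing preserves the Quantity subclass and its units ...
print(type(v[:, 0]), v[:, 0].units)

# ... while an explicit np.asarray somewhere along the indexing path
# would strip the result down to a plain ndarray.
print(type(np.asarray(v)[:, 0]))
```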
xarray/test/test_quantities.py (Outdated)

    @unittest.expectedFailure
    def test_mean(self):
        self.assertEqualWUnits(self.da.mean('x').values,self.v.mean(0))
Let's add a comment here noting that xarray uses a wrapped version of np.nanmean, and I don't think there's an easy way to make that work with duck-typing.
Ok
Actually, I don't fully understand this yet: bare Quantities do work fine with nanmean. Are you saying that mean is wrapped in a more complicated way than other reduction methods, or would your comment apply to any reduction? I quickly tried finding the relevant code and converged at DataArrayGroupBy.reduce -> .apply, which seems to be the path for most reductions. Is that what is called when I do .mean() on a DataArray?
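A quick check of the claim that bare Quantities work with nanmean (assuming python-quantities is installed; the printed type shows whether the units survive the reduction):

```python
import numpy as np
import quantities as pq

q = np.arange(10.0) * pq.J
result = np.nanmean(q)
print(type(result), result)
```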
Oh, interesting -- I did not realize that Quantities worked with np.nanmean.
We handle our wrapping of nanmean and other aggregation functions over in ops.py: https://github.com/pydata/xarray/blob/master/xarray/core/ops.py#L305
This dispatches to bottleneck.nanmean if available (which is much faster), falling back to numpy.nanmean if necessary. It's quite possible that bottleneck doesn't handle subclasses even though NumPy does. If that's the case, then I think you could simply adjust the logic on this line to always use numpy instead of bottleneck for subclasses (e.g., isinstance(values, np.ndarray) and type(values) != np.ndarray):

Line 325 in 6c90583:

    if isinstance(axis, tuple) or not values.dtype.isnative or no_bottleneck:
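The suggested adjustment could look roughly like the following standalone sketch. It is not xarray's actual ops.py code, just an illustration of the dispatch idea under the assumption that bottleneck should only handle plain ndarrays:

```python
import numpy as np

try:
    import bottleneck as bn
except ImportError:
    bn = None


def nanmean_dispatch(values, axis=None):
    # Treat strict ndarray subclasses (e.g. quantities.Quantity) separately:
    # bottleneck may mishandle them even though NumPy does not.
    is_subclass = (isinstance(values, np.ndarray)
                   and type(values) is not np.ndarray)
    if (bn is None or is_subclass or isinstance(axis, tuple)
            or not values.dtype.isnative):
        return np.nanmean(values, axis=axis)  # slower, but subclass-safe
    return bn.nanmean(values, axis=axis)  # fast path for plain ndarrays
```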
xarray/test/test_quantities.py (Outdated)

    def test_units_in_data_and_coords(self):
        da = self.da
        self.assertEqualWUnits(da.xp.data,self.xp)
        self.assertEqualWUnits(da.data,self.v)
Can we adjust things so that DataArray.values still always returns a base numpy.ndarray, but DataArray.data can return a subclass? That would be consistent with our current behavior for handling duck-typed arrays with dask.
Fine, but I am not so happy about this.
can you elaborate? :)
The question is what exact guarantee .values wants to give beyond .data. For me, units are an inherent part of the data (see my comment in #525). My view here is that the behaviour of an instance of Quantity is indeed very close to that of a bare ndarray, and thus it shouldn't be treated any differently. The problem, of course, is that this will not necessarily be the case for a general subclass. So the question is where to draw the line between .values and .data: should it be bare ndarray vs. everything else, or ndarray subclasses vs. things that behave like arrays but aren't? The best answer will depend both on what is out there in terms of subclasses and array-like objects, and on what code using xarray expects to be able to do with what it gets from .values. I don't want to judge that, because I don't think I know enough about the ecosystem xarray lives in (e.g. dask: is a dask array an ndarray subclass or not?). In any case, if you think .values should be bare ndarray only, then I'll respect that.
Dask arrays are not ndarray subclasses, so agreed, it's not entirely the same.
Either way, some users will need to wrap their calls to .values or .data with np.asanyarray() (to ensure the result is an ndarray, possibly a subclass, and not a dask array) or np.asarray() (to ensure the result is a base ndarray). Actually, it's probably worth checking that np.asanyarray(data_array) does the right thing in your patch -- possibly DataArray.__array__ needs to be updated to call np.asanyarray.
Just speaking personally -- and I'm happy for others to chime in with their opinions -- I find a shortcut for np.asarray more essential than np.asanyarray. There are lots of libraries not written to handle anything other than base numpy array objects, and being able to convert into something that is guaranteed to work is valuable.
Ok, so if you say .values is really just meant to be a shortcut for asarray or asanyarray, then I agree the former is more important.
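To make the distinction concrete (a small demo, assuming python-quantities is installed):

```python
import numpy as np
import quantities as pq

q = np.arange(5) * pq.s

print(type(np.asarray(q)))     # base numpy.ndarray: the subclass is stripped
print(type(np.asanyarray(q)))  # the Quantity subclass passes through unchanged
```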
@@ -0,0 +1,64 @@ (start of the new test file)

    import numpy as np
Can you add a docstring explaining that this module uses quantities as a concrete example, but really it's just checking for subclass compatibility? Also, I might call it something more generic, like test_numpy_subclasses.py.
I see. I was rather thinking of this test-file as specialised towards physical units handling with quantities (i.e. as a regression-test database geared towards #525). I'd rather make a separate test-case for general subclasses, maybe with a dummy subclass.
Sure, test_units_subclass.py also seems reasonable.
xarray/core/variable.py (Outdated)

    @@ -185,7 +185,7 @@ def _as_array_or_item(data):
         TODO: remove this (replace with np.asarray) once these issues are fixed
         """
    -    data = np.asarray(data)
    +    data = np.asanyarray(data)
This method should probably be renamed, e.g. to _as_any_array_or_item, just to make it super clear what it's doing (also update the TODO in the docstring).
Ok
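A minimal sketch of what the renamed helper could look like (hypothetical; based only on the diff and the rename suggested above, the real function in variable.py may do more):

```python
import numpy as np


def _as_any_array_or_item(data):
    """Return data as an ndarray (possibly a subclass) or a scalar item.

    TODO: remove this (replace with np.asanyarray) once these issues are fixed
    """
    data = np.asanyarray(data)
    if data.ndim == 0:
        # hypothetical detail: unwrap 0-d arrays into scalar items
        data = data[()]
    return data
```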
xarray/test/test_quantities.py (Outdated)

    def test_arithmetics(self):
        da = self.da
        f = DataArray(np.arange(10*20).reshape(10,20)*pq.A,dims=['x','y'],coords=dict(x=self.x,y=self.y))
        self.assertEqualWUnits((da*f).data, da.data*f.data)
What about arithmetic that requires broadcasting, e.g. between arrays with different dimensions and/or different dimension order? That will require ensuring that ops.transpose and ops.broadcast_to work on subclasses (e.g., xarray.Variable.transpose and xarray.Variable.expand_dims). I think the former should already work because of NumPy, but broadcast_to will probably require adding subok=True.
I'll have a go.
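A quick illustration of why subok=True matters for np.broadcast_to (assuming python-quantities is installed):

```python
import numpy as np
import quantities as pq

q = np.arange(3) * pq.A

print(type(np.broadcast_to(q, (4, 3))))              # base ndarray: subclass lost
print(type(np.broadcast_to(q, (4, 3), subok=True)))  # subclass should survive
```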
    def test_units_in_indexes(self):
        """
        Indexes are borrowed from Pandas, and Pandas does not support units.
        Therefore, we currently don't intend to support units on indexes either.
usually we use comments instead of docstrings on test methods
Why? I do not see how docstrings could be worse than comments under any circumstance... On the other hand, docstrings allow for introspection from unit-testing tools. But then, I don't really care.
I agree, not important, but I'm really surprised by this.
This is the convention for NumPy. To be honest I'm not sure why, but I recall being told so at one point by @charris.
I see now that this isn't followed for pandas, so I guess we can break it here. I don't feel too strongly either way.
If I remember correctly, the test docstrings changed the way nose printed out the test names at runtime, which was a bit annoying when not all tests had a consistent docstring. I don't know about pytest, though.
pytest doesn't seem to do that with either -v or -vv.
With the new changes, this will now conflict with #1128, though that should be easy to resolve.
#1128 has been merged -- can you try rebasing? Thanks!
Is there anything I can do to help move this forward? I'd really like to have this capability.
@lamorton This branch has conflicts with master (due to #1128) that need to be resolved with a merge or rebase. But otherwise, I think this is nearly ready to go in. Obviously it's not a complete fix, but it might be enough to be useful. If you want to take a shot at resolving the merge conflicts in a new PR, I think that would be welcome (I doubt @burnpanck is currently working on it).
Would anyone like to take this up? I was going to close this as stale, but it's fairly close!
Is this still relevant? If so, I'd be willing to take over. However, I am not really familiar with GitHub yet: would I open a new pull request for that?
Hi @keewis - yes! You can start a new PR by branching off this PR (https://help.github.com/en/articles/checking-out-pull-requests-locally). I think you'll find GH easy to use - feel free to ping here with any issues and we'll be happy to help.
Closing for the same reason as #2956.
By changing a single np.asarray to np.asanyarray, it becomes possible to use ndarray subclasses within DataArray and Dataset. No guarantee is made at this point that their behaviour is retained under all circumstances, but for some use cases it will work.

In particular, this allows storing physical quantities represented using the Quantity subclass from the package python-quantities. In that sense, it is a partial fix for #525. Tests are included to highlight some of the things that do work, and some which don't. It does not work for coordinates when they are used as indexes, since pandas does not support ndarray subclasses. Most likely, it will also not work for any higher-level operation on the data, such as those involving np.concatenate. Thus, any user of this feature should remain cautious.

Nonetheless, for me, this change provides enough added value that I'd consider it worthwhile to keep, assuming it does not cause harm anywhere else. It does pass the tests on my machine (no dask though); let's see what Travis says.

I expect that astropy's units would behave similarly, though since I have never worked with them, I did not include any tests.
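For anyone wanting to try this out, usage after the change would look roughly like this (a sketch; python-quantities and a patched xarray are assumed, and the .values behaviour reflects the proposal discussed above rather than a guarantee):

```python
import numpy as np
import quantities as pq
import xarray

# Wrap physical quantities (ndarray subclasses) directly in DataArrays.
distance = xarray.DataArray(np.arange(5) * pq.m, dims=['x'])
time = xarray.DataArray(np.full(5, 2.0) * pq.s, dims=['x'])

speed = distance / time
print(type(speed.data))    # with this patch: a Quantity carrying units
print(type(speed.values))  # per the proposal above: a plain numpy.ndarray
```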