-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add methods for combining variables of differing dimensionality #1597
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionality looks pretty handy, and the docstring example for stack_cat
is super helpful in showing how this works. So thanks for putting this together!
I agree that these should be separate methods, but I'm not in love with the name stack_cat
/unstack_cat
. Maybe to_stacked_array
/from_stacked_array
would be better? Generic xarray methods usually return an object of the same type, so an explicit to
/from
in the name is helpful in signalling that this works differently.
Thanks! Yah...I'm not very good at naming things. I think IMO the behavior of unstack_cat is kind of what I would expect |
Or maybe |
If we have a separate method for I like |
That naming sounds good to me. Also, I was having an issue with |
Okay. I just changed the names of the methods and wrote a test case for the problem with the dtype of the stacked dimensions not being preserved by At the moment, I am filling in the missing dimensions with None, so the resulting index has |
xarray/core/dataarray.py
Outdated
|
||
idx = self.indexes[dim] | ||
if not isinstance(idx, pd.MultiIndex): | ||
raise ValueError(f"{dim} is not a stacked coordinate") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this string formatting syntax will not work for Python versions < 3.6.
xarray/core/dataarray.py
Outdated
variables = idx.levels[level] | ||
|
||
# pull variables out of datarray | ||
data_dict = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use an OrderedDict
here.
xarray/core/dataset.py
Outdated
assign_coords[dim] = None | ||
|
||
expand_dims = set(dims).difference(set(val.dims)) | ||
expand_dims.add('variable') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be:
expand_dims.add(variable_dim)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed. fixed.
xarray/core/dataarray.py
Outdated
# pull variables out of datarray | ||
data_dict = {} | ||
for k in variables: | ||
data_dict[k] = self.sel(variable=k).squeeze(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add a keyword argument variable_dim
to match the to_stacked_array
argument?
data_dict[k] = self.sel(**{variable_dim: k).squeeze(drop=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, definitely. good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will just make a general comment. It was not immediately obvious to me what the general point of this PR was until reading through #1317. I therefore recommend you add a new section explaining these features to the reshaping page. It would be good to explicitly mention the ML motivation and link to relevant scikit-learn docs / examples. You can use the original issue as the basis for this.
@rabernat Your point is well taken. I will add some docs/motivation to the reshaping page. |
@nbren12 - can you add these methods to |
I just added some docs, but need to add to whats-new still, |
Okay. I think I'm done with the updates to the documentation. |
@nbren12 in the linked issue (#1317 (comment)) you wrote:
Is this something that would make sense to resolve before we merge this? |
@shoyer If you are okay with it, I think we might want to leave that to a later date if ever. I am not exactly sure what a useful API for that would be ATM. On the other hand, I have been using the original For more complicated uses (e.g. taking different subsets of each variable and concatenating the output), I have started working on a project which is similar to sklearn-pandas. Since there are a million ways several xarray variables could be processed/subsetted, stacked and then concatenated, I think this functionality should probably remain in a third party package for now. |
@nbren12 OK, sounds good to me. Yes, it's a good idea to put more speculative stuff in other packages. I will take a look at your actual implementation. |
Could not review pull request. It may be too large, or contain no reviewable changes. |
Sorry for random activity. I accidentally hard reset the master branch nbren12/xarray to pydata/xarray. |
@nbren12 - can we revive this? There or some conflicts but I think this was pretty close to done. |
xarray/tests/test_dataset.py
Outdated
ds = xr.Dataset({'a': a, 'b': 1.0}) | ||
ds_flat = ds.to_stacked_array('features', ['x']) | ||
ds_comp = ds_flat.to_unstacked_dataset('features') | ||
self.assertDatasetIdentical(ds, ds_comp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you break these into a bunch of smaller tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I tried to split these tests into more coherent groups
I'd be happy to pick this up again if you think it will go through. |
@jhamman I did a little more work on this today. How do you recommend I update this to master? rebase? |
This partially resolves #1317. Change names of methods stack_cat -> to_stacked_array unstack_cat -> to_unstacked_dataset Test that the dtype of the stacked dimensions is preserved This is not passing at the moment because concatenating None with a dimension that has values upcasts the combined dtype to object Fix dtypes of stacked dimensions This commit ensures that the dtypes of the stacked coordinate match the input dimensions. Use new index variable rather than patching the old one I didn't like the inplace modification of a private member. Handle variable_dim correctly I also fixed 1. f-string formatting issue 2. Use an OrderedDict as @jhamman recommends Add documentation to api.rst and reshaping.rst I also added appropriate See Also sections to the docstrings for to_stacked_array and to_unstacked_dataset. Add changes to whats-new Fixing style errors. Split up lengthy test Remove "physical variable" from docs This is in response to Joe's "nit"
An error arose when checking for the precence of a dimension in array. The code 'dim in data' no longer works. Replaced this with 'dim in data.dims'
xarray/core/dataset.py
Outdated
without broadcasting. | ||
|
||
This function is basically version of Dataset.to_array which does not | ||
broadcast the variables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a typo: 'a version'
maybe reformulate to
This function is similar to Dataset.to_array but does not broadcast the variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Changed to "This method is similar to Dataset.to_array but does not broadcast variables."
Does anybody have an idea why the |
It looks like 43834ac might have fixed this. Let's see if the tests pass. |
It looks like the tests passed. @benbovy How does it look now? Did I fix the issues you mentioned? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @nbren12 for the wait, and thanks for your updates! Your last changes globally look good to me!
I have just a couple of additional minor comments inline and one question below.
Paradoxically I think making to_stacked_array accept the list of dimensions which won't be stacked is clearer. These unstacked dims needed to shared across all variables, which is a very simple requirement.
I agree that it is a good idea if it helps making things clearer. I've found the docs very clear about this.
Out of curiosity, is there actual use cases where it is useful to provide multiple dimensions for sample_dims
? If I understand well, the primary goal of Dataset.to_stacked_array()
is for creating 2D data arrays that could be used, e.g., in machine learning applications. If this covers 90% of the use cases, would it make sense allowing only one sample_dim
? Would the simplified logic and better readability be worth it? Just wondering, I don't have strong opinion on this.
xarray/core/dataset.py
Outdated
dims_include_sample_dims = set(sample_dims) <= set(dims) | ||
if not dims_include_sample_dims: | ||
raise ValueError( | ||
"All DataArrays must share the dims: {}. ".format(dims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change "All DataArrays" by "All data variables in Dataset" for this error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking the example from the docs/docstrings:
# the line below gives "ValueError: All DataArrays must share the dims: ('x',)."
data.to_stacked_array('z', ['x', 'y'])
# the line below gives "ValueError: All DataArrays must share the dims: ('x', 'y')."
data.to_stacked_array('z', ['foo'])
Those error messages are still a bit confusing to me. For the second I would expect a KeyError: dimension 'foo' not found
. I also don't know why the message says in this second example that all data variables must share the 'y' dimension .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how that is confusing, but technically it is true. I think it would be a little unwieldy to have differerent error messages for different numbers of sample_dimensions. I change the message to "All variables in the dataset must contain the dimensions {}." Hopefully, that is better.
xarray/core/dataset.py
Outdated
|
||
if name is not None: | ||
dataset.name = name | ||
|
||
return dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "data_array" might be a better name for this variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this
Yes! My main use-case does. For example, if you have a weather dataset where every lat/lon pair should be considered a separate "sample", you would use |
Codecov Report
@@ Coverage Diff @@
## master #1597 +/- ##
==========================================
+ Coverage 94.8% 94.83% +0.02%
==========================================
Files 67 67
Lines 12876 12912 +36
==========================================
+ Hits 12207 12245 +38
+ Misses 669 667 -2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a minor suggestions, but this looks really good!
xarray/core/dataarray.py
Outdated
|
||
idx = self.indexes[dim] | ||
if not isinstance(idx, pd.MultiIndex): | ||
raise ValueError(dim, "is not a stacked coordinate") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add test coverage for this error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. done
xarray/core/dataset.py
Outdated
"dimensions {}.".format(dims) | ||
) | ||
|
||
def f(val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please give this as sensible name rather than f
, e.g., ensure_stacked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I changed it to ensure_stackable
. since the arrays aren't stacked yet.
xarray/core/dataset.py
Outdated
# must be list for .expand_dims | ||
expand_dims = list(expand_dims) | ||
|
||
return val.assign_coords(**assign_coords) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: here and below, per PEP8, prefer using parentheses rather than \
for multi-line expressions, e.g.,
return (val.assign_coords(**assign_coords)
.expand_dims(expand_dims)
.stack(**{new_dim: (variable_dim,) + stacking_dims}))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. done.
doc/whats-new.rst
Outdated
@@ -189,6 +189,8 @@ Enhancements | |||
- Allow ``expand_dims`` method to support inserting/broadcasting dimensions | |||
with size > 1. (:issue:`2710`) | |||
By `Martin Pletcher <https://github.com/pletchm>`_. | |||
- New methods for reshaping Datasets of variables with different dimensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to move up to 0.12.3 now -- sorry for the churn here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I moved this up to that release and added a new section header
v0.12.3 (unreleased)
--------------------
New functions/methods
~~~~~~~~~~~~~~~~~~~~~
- New methods for reshaping Datasets of variables with different dimensions
(:issue:`1317`). By `Noah Brenowitz <https://github.com/nbren12>`_.
you have a lint error: #1597 (comment) |
drat. Should be fixed now. |
OK, I plan to merge this shortly unless there are any objections :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thanks @nbren12q`
I don't think it matters that much; I would have been +0.1 on a Dataset.from_stacked_array(cls...
since from...
allows classes to return their own instances. But rarely do people inherit from these anyway.
xarray/core/dataarray.py
Outdated
# pull variables out of datarray | ||
data_dict = OrderedDict() | ||
for k in variables: | ||
data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than sending as kwargs, if we send as a dict then this will work with non-str keys (though dim names is only partially supported anyway atm)
data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True) | |
data_dict[k] = self.sel({variable_dim: k}).squeeze(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made this change.
xarray/core/dataset.py
Outdated
|
||
return (val.assign_coords(**assign_coords) | ||
.expand_dims(expand_dims) | ||
.stack(**{new_dim: (variable_dim,) + stacking_dims})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.stack(**{new_dim: (variable_dim,) + stacking_dims})) | |
.stack({new_dim: (variable_dim,) + stacking_dims})) |
Up a level: not sure of the best way of enforcing this beyond manually checking.
Potentially a common test fixture that has a non-str var name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change too. I think re-working the test suite would be a fair amount of work, and potentially make the tests harder to read. There are many places in xarray tests that use hardcoded names for dimensions and variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, tbc the ideation was for broader than this PR! Wasn't expecting any big change now
y = DataArray(pd.Index(np.r_[:20], name='y')) | ||
a = x * y | ||
b = x * y * y | ||
return a, b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to change, but this would be ideal as test fixture
xarray/core/dataarray.py
Outdated
level : int or str | ||
The MultiIndex level to expand to a dataset along. Can either be | ||
the integer index of the level or its name. | ||
label : int, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I think we've said int, default 0
rather than optional
where there's a default; but I don't have a strong view
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I changed this to your suggestion
It looks like the CI errors above aren’t related to this PR. There seems to be an issue with |
Thanks @nbren12! |
phew! Thanks for all the reviews and discussion everyone! |
Kudos to @nbren12 for sticking with this one! Looking forward to seeing this one in the wild. |
Thanks Joe! |
git diff upstream/master | flake8 --diff
whats-new.rst
for all changes andapi.rst
for new APIWhile working on #1317, I settled upon combining
stack
andto_array
to create two dimensional numpy arrays given an xarray Dataset. Unfortunately,to_array
automatically broadcasts the variables of dataset, which is not always a desirable behavior. For instance, I was trying to combine precipitation (a horizontal field) and temperature (a 3D field) into one array.This PR enables this by adding two new methods to xarray:
Dataset.stack_cat
, andDataArray.unstack_cat
.stack_cat
usesstack
,expand_dims
, andconcat
to reshape a Dataset into a Dataarray with a helpful MultiIndex, andunstack_cat
reverses the process.I implemented this functionality as a new method since
to_array
is such a clean method already. I really appreciate your thoughts on this. Thanks!cc @jhamman @shoyer