
Add methods for combining variables of differing dimensionality #1597

Merged · 26 commits · Jul 5, 2019

Conversation

@nbren12 (Contributor) commented Sep 27, 2017

While working on #1317, I settled on combining stack and to_array to create two-dimensional numpy arrays from an xarray Dataset. Unfortunately, to_array automatically broadcasts the variables of the dataset, which is not always desirable behavior. For instance, I was trying to combine precipitation (a horizontal field) and temperature (a 3D field) into one array.

This PR addresses this by adding two new methods to xarray:

  • Dataset.stack_cat, and
  • DataArray.unstack_cat.

stack_cat uses stack, expand_dims, and concat to reshape a Dataset into a DataArray with a helpful MultiIndex, and unstack_cat reverses the process.

I implemented this functionality as new methods since to_array is already such a clean method. I would really appreciate your thoughts on this. Thanks!
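The broadcasting problem can be sketched with plain NumPy (a hypothetical illustration; the grid sizes and variable names are made up):

```python
import numpy as np

# Hypothetical fields on a 10x10 horizontal grid with 20 vertical levels.
temperature = np.zeros((10, 10, 20))   # 3D field
precipitation = np.ones((10, 10))      # horizontal (2D) field

# Broadcasting precipitation up to temperature's shape, which is roughly
# what to_array does, stores 20 redundant copies of each value:
broadcast = np.broadcast_to(precipitation[..., None], temperature.shape)
print(broadcast.shape)  # (10, 10, 20)

# Stacking without broadcasting keeps a single column for precipitation:
stacked = np.concatenate(
    [temperature.reshape(100, 20), precipitation.reshape(100, 1)], axis=1)
print(stacked.shape)  # (100, 21)
```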

cc @jhamman @shoyer

@jhamman jhamman requested review from shoyer and rabernat September 28, 2017 20:13
@jhamman (Member) commented Sep 28, 2017

I want to let @shoyer and @rabernat weigh in on the API here.

@shoyer (Member) left a review:

This functionality looks pretty handy, and the docstring example for stack_cat is super helpful in showing how this works. So thanks for putting this together!

I agree that these should be separate methods, but I'm not in love with the name stack_cat/unstack_cat. Maybe to_stacked_array/from_stacked_array would be better? Generic xarray methods usually return an object of the same type, so an explicit to/from in the name is helpful in signalling that this works differently.

xarray/core/dataarray.py (review thread resolved)
@nbren12 (Contributor, author) commented Sep 29, 2017

Thanks! Yah... I'm not very good at naming things. I think Dataset.to_stacked_array makes sense, but I would typically expect something like from_stacked_array to be a class method of Dataset (e.g. pd.MultiIndex.from_tuples). We could always move unstack_cat to Dataset, but then we lose the ability to chain method calls.

IMO the behavior of unstack_cat is kind of what I would expect DataArray.to_dataset to do in the case where dim is a stacked coordinate, so maybe we could put it in there.

@nbren12 (Contributor, author) commented Sep 29, 2017

Or maybe DataArray.to_dataset_unstacked is good.

@shoyer (Member) commented Sep 29, 2017

If we have a separate method for to_stacked_array(), we should have a separate method for unstacking, too.

I like DataArray.to_unstacked_dataset() as the complement to Dataset.to_stacked_array(). I agree that it should be a DataArray method for chain-ability.

@nbren12 (Contributor, author) commented Sep 30, 2017

That naming sounds good to me.

Also, I was having an issue with unstack_cat returning an index with dtype object that I would like to sort out.

@nbren12 (Contributor, author) commented Sep 30, 2017

Okay. I just changed the names of the methods and wrote a test case for the problem with the dtype of the stacked dimensions not being preserved by to_stacked_array.

At the moment, I am filling in the missing dimensions with None, so the resulting index has dtype object. This is then concatenated with the bona fide indices of the variables which are not missing this index, so the index as a whole is cast to the lowest common denominator, which is object. Unfortunately we can't just use NaN, because NaN is a floating-point value and not all numpy dtypes have an equivalent of it. @shoyer Do you have any thoughts about how to resolve this?

xarray/core/dataset.py (outdated; review thread resolved)

idx = self.indexes[dim]
if not isinstance(idx, pd.MultiIndex):
raise ValueError(f"{dim} is not a stacked coordinate")
Review comment (Member):
this string formatting syntax will not work for Python versions < 3.6.

variables = idx.levels[level]

# pull variables out of datarray
data_dict = {}
Review comment (Member):
use an OrderedDict here.

assign_coords[dim] = None

expand_dims = set(dims).difference(set(val.dims))
expand_dims.add('variable')
Review comment (Member):
should this be:

expand_dims.add(variable_dim)

?

Reply (Contributor, author):
indeed. fixed.

# pull variables out of datarray
data_dict = {}
for k in variables:
data_dict[k] = self.sel(variable=k).squeeze(drop=True)
Review comment (Member):
should we add a keyword argument variable_dim to match the to_stacked_array argument?

data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True)

Reply (Contributor, author):
yes, definitely. good catch.

@rabernat (Contributor) left a review:

I will just make a general comment. It was not immediately obvious to me what the general point of this PR was until reading through #1317. I therefore recommend you add a new section explaining these features to the reshaping page. It would be good to explicitly mention the ML motivation and link to relevant scikit-learn docs / examples. You can use the original issue as the basis for this.

@nbren12 (Contributor, author) commented Sep 30, 2017

@rabernat Your point is well taken. I will add some docs/motivation to the reshaping page.

@jhamman (Member) commented Oct 20, 2017

@nbren12 - can you add these methods to api.rst and add a note to whats-new (in addition to adding some docs to the reshaping page)?

@nbren12 (Contributor, author) commented Nov 8, 2017

I just added some docs, but still need to add a whats-new entry.

@nbren12 (Contributor, author) commented Nov 9, 2017

Okay. I think I'm done with the updates to the documentation.

@shoyer (Member) commented Nov 11, 2017

@nbren12 in the linked issue (#1317 (comment)) you wrote:

After using my own version of this code for the past month or so, it has occurred to me that this API probably will not support stacking arrays with different sizes along shared dimensions. For instance, I need to "stack" humidity below an altitude of 10 km with temperature between 0 and 16 km. IMO, the easiest way to do this would be to change these methods into top-level functions which can take any dict or iterable of DataArrays. We could leave that for a later PR of course.

Is this something that would make sense to resolve before we merge this?

@nbren12 (Contributor, author) commented Nov 11, 2017

@shoyer If you are okay with it, I think we might want to leave that to a later date if ever. I am not exactly sure what a useful API for that would be ATM. On the other hand, I have been using the original stack_cat and unstack_cat functions for a couple of months, and they have handled my basic uses pretty well.

For more complicated uses (e.g. taking different subsets of each variable and concatenating the output), I have started working on a project which is similar to sklearn-pandas. Since there are a million ways several xarray variables could be processed/subsetted, stacked and then concatenated, I think this functionality should probably remain in a third party package for now.

@shoyer (Member) commented Nov 12, 2017

@nbren12 OK, sounds good to me. Yes, it's a good idea to put more speculative stuff in other packages. I will take a look at your actual implementation.

@stickler-ci (Contributor) commented:
Could not review pull request. It may be too large, or contain no reviewable changes.

@nbren12 nbren12 reopened this Feb 18, 2018
@nbren12 (Contributor, author) commented Feb 18, 2018

Sorry for the random activity. I accidentally hard-reset the master branch of nbren12/xarray to pydata/xarray.

@jhamman (Member) commented Dec 7, 2018

@nbren12 - can we revive this? There are some conflicts, but I think this was pretty close to done.

ds = xr.Dataset({'a': a, 'b': 1.0})
ds_flat = ds.to_stacked_array('features', ['x'])
ds_comp = ds_flat.to_unstacked_dataset('features')
self.assertDatasetIdentical(ds, ds_comp)
Review comment (Member):
Can you break these into a bunch of smaller tests?

Reply (Contributor, author):
Sure. I tried to split these tests into more coherent groups

doc/reshaping.rst (outdated; review thread resolved)
@nbren12 (Contributor, author) commented Dec 8, 2018

I'd be happy to pick this up again if you think it will go through.

@pep8speaks commented Apr 1, 2019

Hello @nbren12! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-04 20:50:20 UTC

@nbren12 (Contributor, author) commented Apr 1, 2019

@jhamman I did a little more work on this today. How do you recommend I update this to master? rebase?

nbren12 added 2 commits April 1, 2019 12:10
This partially resolves #1317.

Change names of methods

stack_cat -> to_stacked_array
unstack_cat -> to_unstacked_dataset

Test that the dtype of the stacked dimensions is preserved

This is not passing at the moment because concatenating None with
a dimension that has values upcasts the combined dtype to object

Fix dtypes of stacked dimensions

This commit ensures that the dtypes of the stacked coordinate match the input dimensions.

Use new index variable rather than patching the old one

I didn't like the inplace modification of a private member.

Handle variable_dim correctly

I also fixed

1. f-string formatting issue
2. Use an OrderedDict as @jhamman recommends

Add documentation to api.rst and reshaping.rst

I also added appropriate See Also sections to the docstrings for
to_stacked_array and to_unstacked_dataset.

Add changes to whats-new

Fixing style errors.

Split up lengthy test

Remove "physical variable" from docs

This is in response to Joe's "nit"
An error arose when checking for the presence of a dimension in an array.
The code 'dim in data' no longer works.
Replaced this with 'dim in data.dims'
without broadcasting.

This function is basically version of Dataset.to_array which does not
broadcast the variables.
Review comment (Collaborator):
I think there is a typo: 'a version'

maybe reformulate to
This function is similar to Dataset.to_array but does not broadcast the variables.

Reply (Contributor, author):
Okay. Changed to "This method is similar to Dataset.to_array but does not broadcast variables."

@nbren12 (Contributor, author) commented Jun 7, 2019

Does anybody have an idea why the py36-dask-dev tests are failing? None of the errors seem related to this PR.

@nbren12 (Contributor, author) commented Jun 22, 2019

It looks like 43834ac might have fixed this. Let's see if the tests pass.

@nbren12 (Contributor, author) commented Jun 22, 2019

It looks like the tests passed. @benbovy How does it look now? Did I fix the issues you mentioned?

@benbovy (Member) left a review:

Sorry @nbren12 for the wait, and thanks for your updates! Your last changes globally look good to me!

I have just a couple of additional minor comments inline and one question below.

Paradoxically I think making to_stacked_array accept the list of dimensions which won't be stacked is clearer. These unstacked dims need to be shared across all variables, which is a very simple requirement.

I agree that it is a good idea if it helps making things clearer. I've found the docs very clear about this.

Out of curiosity, are there actual use cases where it is useful to provide multiple dimensions for sample_dims? If I understand correctly, the primary goal of Dataset.to_stacked_array() is to create 2D data arrays that could be used, e.g., in machine learning applications. If this covers 90% of the use cases, would it make sense to allow only one sample_dim? Would the simplified logic and better readability be worth it? Just wondering, I don't have a strong opinion on this.
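For reference, the single-sample_dim case described here yields a 2D matrix directly. A minimal sketch using the merged method name (variable names and sizes are made up):

```python
import numpy as np
import xarray as xr

# 'a' has the sample dim 'x' plus a feature dim 'z'; 'b' has only 'x'.
ds = xr.Dataset({
    "a": (("x", "z"), np.zeros((4, 2))),
    "b": ("x", np.zeros(4)),
})

# One sample dim -> a 2D data matrix: 4 samples by 3 features
# (2 z-levels of 'a' plus 1 column for 'b'), with no broadcasting.
matrix = ds.to_stacked_array("features", sample_dims=["x"])
print(matrix.shape)  # (4, 3)
```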

dims_include_sample_dims = set(sample_dims) <= set(dims)
if not dims_include_sample_dims:
raise ValueError(
"All DataArrays must share the dims: {}. ".format(dims)
Review comment (Member):

I would change "All DataArrays" to "All data variables in Dataset" for this error message.

Review comment (Member):

Taking the example from the docs/docstrings:

# the line below gives "ValueError: All DataArrays must share the dims: ('x',)."
data.to_stacked_array('z', ['x', 'y'])

# the line below gives "ValueError: All DataArrays must share the dims: ('x', 'y')."
data.to_stacked_array('z', ['foo'])      

Those error messages are still a bit confusing to me. For the second I would expect a KeyError: dimension 'foo' not found. I also don't know why the message says in this second example that all data variables must share the 'y' dimension.

Reply (Contributor, author):

I can see how that is confusing, but technically it is true. I think it would be a little unwieldy to have different error messages for different numbers of sample dimensions. I changed the message to "All variables in the dataset must contain the dimensions {}." Hopefully that is better.


if name is not None:
dataset.name = name

return dataset
Review comment (Member):

nit: "data_array" might be a better name for this variable.

Reply (Contributor, author):

I changed this

@nbren12 (Contributor, author) commented Jun 25, 2019

are there actual use cases where it is useful to provide multiple dimensions for sample_dims?

Yes! My main use-case does. For example, if you have a weather dataset where every lat/lon pair should be considered a separate "sample", you would use sample_dims=['lat', 'lon'].
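A sketch of that use case with the merged API (field names and sizes are made up):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    "temperature": (("lat", "lon", "z"), np.zeros((4, 5, 3))),
    "precip": (("lat", "lon"), np.zeros((4, 5))),
})

# Both 'lat' and 'lon' are sample dims; 'z' and the variable name are
# stacked into the feature dimension (3 temperature levels + 1 precip).
arr = ds.to_stacked_array("features", sample_dims=["lat", "lon"])

# Collapsing the two sample dims then gives the 2D sample/feature matrix:
matrix = arr.stack(sample=["lat", "lon"]).transpose("sample", "features")
print(matrix.shape)  # (20, 4)
```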

@codecov bot commented Jul 2, 2019

Codecov Report

Merging #1597 into master will increase coverage by 0.02%.
The diff coverage is 97.22%.

@@            Coverage Diff             @@
##           master    #1597      +/-   ##
==========================================
+ Coverage    94.8%   94.83%   +0.02%     
==========================================
  Files          67       67              
  Lines       12876    12912      +36     
==========================================
+ Hits        12207    12245      +38     
+ Misses        669      667       -2

@nbren12 (Contributor, author) commented Jul 2, 2019

Okay. I responded to @benbovy's last comments and merged the upstream changes to master. How is this looking?

cc @jhamman

@shoyer (Member) left a review:

I have a few minor suggestions, but this looks really good!


idx = self.indexes[dim]
if not isinstance(idx, pd.MultiIndex):
raise ValueError(dim, "is not a stacked coordinate")
Review comment (Member):

please add test coverage for this error

Reply (Contributor, author):

sure. done

"dimensions {}.".format(dims)
)

def f(val):
Review comment (Member):

please give this a sensible name rather than f, e.g., ensure_stacked

Reply (Contributor, author):

Sure. I changed it to ensure_stackable, since the arrays aren't stacked yet.

# must be list for .expand_dims
expand_dims = list(expand_dims)

return val.assign_coords(**assign_coords) \
Review comment (Member):

nit: here and below, per PEP8, prefer using parentheses rather than \ for multi-line expressions, e.g.,

            return (val.assign_coords(**assign_coords)
                    .expand_dims(expand_dims)
                    .stack(**{new_dim: (variable_dim,) + stacking_dims}))

Reply (Contributor, author):

Sure. done.

@@ -189,6 +189,8 @@ Enhancements
- Allow ``expand_dims`` method to support inserting/broadcasting dimensions
with size > 1. (:issue:`2710`)
By `Martin Pletcher <https://github.com/pletchm>`_.
- New methods for reshaping Datasets of variables with different dimensions
Review comment (Member):

this needs to move up to 0.12.3 now -- sorry for the churn here!

Reply (Contributor, author):

Sure. I moved this up to that release and added a new section header

v0.12.3 (unreleased)
--------------------

New functions/methods
~~~~~~~~~~~~~~~~~~~~~

- New methods for reshaping Datasets of variables with different dimensions
  (:issue:`1317`). By `Noah Brenowitz <https://github.com/nbren12>`_.

xarray/core/dataset.py (two outdated review threads; resolved)
@shoyer (Member) commented Jul 2, 2019

you have a lint error: #1597 (comment)

@nbren12 (Contributor, author) commented Jul 2, 2019

drat. Should be fixed now.

@shoyer shoyer dismissed rabernat’s stale review July 3, 2019 00:44

This is clearly documented now.

@shoyer (Member) commented Jul 3, 2019

OK, I plan to merge this shortly unless there are any objections :)

@max-sixty (Collaborator) left a review:

This is great! Thanks @nbren12!

I don't think it matters that much; I would have been +0.1 on a Dataset.from_stacked_array(cls... since from... allows classes to return their own instances. But rarely do people inherit from these anyway.

# pull variables out of datarray
data_dict = OrderedDict()
for k in variables:
data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True)
Review comment (Collaborator):

Rather than sending as kwargs, if we send a dict then this will work with non-str keys (though non-str dim names are only partially supported anyway atm)

Suggested change
data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True)
data_dict[k] = self.sel({variable_dim: k}).squeeze(drop=True)

Reply (Contributor, author):

I have made this change.


return (val.assign_coords(**assign_coords)
.expand_dims(expand_dims)
.stack(**{new_dim: (variable_dim,) + stacking_dims}))
Review comment (Collaborator):

Suggested change
.stack(**{new_dim: (variable_dim,) + stacking_dims}))
.stack({new_dim: (variable_dim,) + stacking_dims}))

Up a level: not sure of the best way of enforcing this beyond manually checking.
Potentially a common test fixture that has a non-str var name?

Reply (Contributor, author):

I made this change too. I think re-working the test suite would be a fair amount of work, and potentially make the tests harder to read. There are many places in xarray tests that use hardcoded names for dimensions and variables.

Reply (Collaborator):

Right, to be clear the idea was broader than this PR! Wasn't expecting any big change now.

y = DataArray(pd.Index(np.r_[:20], name='y'))
a = x * y
b = x * y * y
return a, b
Review comment (Collaborator):

No need to change, but this would be ideal as test fixture

level : int or str
The MultiIndex level to expand to a dataset along. Can either be
the integer index of the level or its name.
label : int, optional
Review comment (Collaborator):

Generally I think we've said int, default 0 rather than optional where there's a default; but I don't have a strong view

Reply (Contributor, author):

Okay. I changed this to your suggestion

@nbren12 (Contributor, author) commented Jul 5, 2019

It looks like the CI errors above aren’t related to this PR. There seems to be an issue with to_pandas.

@shoyer shoyer merged commit 9c0fb6c into pydata:master Jul 5, 2019
@shoyer (Member) commented Jul 5, 2019

Thanks @nbren12!

@nbren12 (Contributor, author) commented Jul 5, 2019

phew! Thanks for all the reviews and discussion everyone!

@jhamman (Member) commented Jul 5, 2019

Kudos to @nbren12 for sticking with this one! Looking forward to seeing this one in the wild.

@nbren12 (Contributor, author) commented Jul 5, 2019

Thanks Joe!

Successfully merging this pull request may close these issues.

API for reshaping DataArrays as 2D "data matrices" for use in machine learning