
Add methods for combining variables of differing dimensionality #1597

Merged · 26 commits · Jul 5, 2019

Conversation

@nbren12 (Contributor) commented Sep 27, 2017

While working on #1317, I settled on combining stack and to_array to create two-dimensional numpy arrays from an xarray Dataset. Unfortunately, to_array automatically broadcasts the variables of the dataset, which is not always desirable behavior. For instance, I was trying to combine precipitation (a horizontal field) and temperature (a 3D field) into one array.

This PR addresses this by adding two new methods to xarray:

  • Dataset.stack_cat, and
  • DataArray.unstack_cat.

stack_cat uses stack, expand_dims, and concat to reshape a Dataset into a DataArray with a helpful MultiIndex, and unstack_cat reverses the process.

I implemented this functionality as new methods since to_array is already such a clean method. I would really appreciate your thoughts on this. Thanks!
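The broadcasting problem can be sketched with plain NumPy (a hypothetical illustration; the grid sizes and variable names are made up):

```python
import numpy as np

# Hypothetical fields on a 10x10 horizontal grid with 20 vertical levels.
temperature = np.zeros((10, 10, 20))   # 3D field
precipitation = np.ones((10, 10))      # horizontal (2D) field

# Broadcasting precipitation up to temperature's shape, which is roughly
# what to_array does, stores 20 redundant copies of each value:
broadcast = np.broadcast_to(precipitation[..., None], temperature.shape)
print(broadcast.shape)  # (10, 10, 20)

# Stacking without broadcasting keeps a single column for precipitation:
stacked = np.concatenate(
    [temperature.reshape(100, 20), precipitation.reshape(100, 1)], axis=1)
print(stacked.shape)  # (100, 21)
```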

cc @jhamman @shoyer

@jhamman jhamman requested review from shoyer and rabernat September 28, 2017 20:13
@jhamman (Member) commented Sep 28, 2017

I want to let @shoyer and @rabernat weigh in on the API here.

@shoyer (Member) left a review:

This functionality looks pretty handy, and the docstring example for stack_cat is super helpful in showing how this works. So thanks for putting this together!

I agree that these should be separate methods, but I'm not in love with the name stack_cat/unstack_cat. Maybe to_stacked_array/from_stacked_array would be better? Generic xarray methods usually return an object of the same type, so an explicit to/from in the name is helpful in signalling that this works differently.

xarray/core/dataarray.py (review thread resolved)
@nbren12 (Contributor, author) commented Sep 29, 2017

Thanks! Yah... I'm not very good at naming things. I think Dataset.to_stacked_array makes sense, but I would typically expect something like from_stacked_array to be a class method of Dataset (e.g. pd.MultiIndex.from_tuples). We could always move unstack_cat to Dataset, but then we lose the ability to chain method calls.

IMO the behavior of unstack_cat is kind of what I would expect DataArray.to_dataset to do in the case where dim is a stacked coordinate, so maybe we could put it in there.

@nbren12 (Contributor, author) commented Sep 29, 2017

Or maybe DataArray.to_dataset_unstacked is good.

@shoyer (Member) commented Sep 29, 2017

If we have a separate method for to_stacked_array(), we should have a separate method for unstacking, too.

I like DataArray.to_unstacked_dataset() as the complement to Dataset.to_stacked_array(). I agree that it should be a DataArray method for chain-ability.

@nbren12 (Contributor, author) commented Sep 30, 2017

That naming sounds good to me.

Also, I was having an issue with unstack_cat returning an index with dtype object that I would like to sort out.

@nbren12 (Contributor, author) commented Sep 30, 2017

Okay. I just changed the names of the methods and wrote a test case for the problem with the dtype of the stacked dimensions not being preserved by to_stacked_array.

At the moment, I am filling in the missing dimensions with None, so the resulting index has dtype object. This is then concatenated with the bona fide indices of the variables which are not missing this index, so the index as a whole is cast to the lowest common denominator, which is object. Unfortunately we can't just use NaN, because NaN is a floating-point value and not all numpy dtypes have an equivalent of it. @shoyer Do you have any thoughts about how to resolve this?

xarray/core/dataset.py (outdated; review thread resolved)

idx = self.indexes[dim]
if not isinstance(idx, pd.MultiIndex):
raise ValueError(f"{dim} is not a stacked coordinate")
Review comment (Member):
this string formatting syntax will not work for Python versions < 3.6.

variables = idx.levels[level]

# pull variables out of datarray
data_dict = {}
Review comment (Member):
use an OrderedDict here.

assign_coords[dim] = None

expand_dims = set(dims).difference(set(val.dims))
expand_dims.add('variable')
Review comment (Member):
should this be:

expand_dims.add(variable_dim)

?

Reply (Contributor, author):
indeed. fixed.

# pull variables out of datarray
data_dict = {}
for k in variables:
data_dict[k] = self.sel(variable=k).squeeze(drop=True)
Review comment (Member):
should we add a keyword argument variable_dim to match the to_stacked_array argument?

data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True)

Reply (Contributor, author):
yes, definitely. good catch.

@rabernat (Contributor) left a review:

I will just make a general comment. It was not immediately obvious to me what the general point of this PR was until reading through #1317. I therefore recommend you add a new section explaining these features to the reshaping page. It would be good to explicitly mention the ML motivation and link to relevant scikit-learn docs / examples. You can use the original issue as the basis for this.

@nbren12 (Contributor, author) commented Sep 30, 2017

@rabernat Your point is well taken. I will add some docs/motivation to the reshaping page.

@jhamman (Member) commented Oct 20, 2017

@nbren12 - can you add these methods to api.rst and add a note to whats-new (in addition to adding some docs to the reshaping page)?

@nbren12 (Contributor, author) commented Nov 8, 2017

I just added some docs, but still need to add a whats-new entry.

@nbren12 (Contributor, author) commented Nov 9, 2017

Okay. I think I'm done with the updates to the documentation.

@shoyer (Member) commented Nov 11, 2017

@nbren12 in the linked issue (#1317 (comment)) you wrote:

After using my own version of this code for the past month or so, it has occurred to me that this API probably will not support stacking arrays with different sizes along shared dimensions. For instance, I need to "stack" humidity below an altitude of 10 km with temperature between 0 and 16 km. IMO, the easiest way to do this would be to change these methods into top-level functions which can take any dict or iterable of DataArrays. We could leave that for a later PR of course.

Is this something that would make sense to resolve before we merge this?

@nbren12 (Contributor, author) commented Nov 11, 2017

@shoyer If you are okay with it, I think we might want to leave that to a later date if ever. I am not exactly sure what a useful API for that would be ATM. On the other hand, I have been using the original stack_cat and unstack_cat functions for a couple of months, and they have handled my basic uses pretty well.

For more complicated uses (e.g. taking different subsets of each variable and concatenating the output), I have started working on a project which is similar to sklearn-pandas. Since there are a million ways several xarray variables could be processed/subsetted, stacked and then concatenated, I think this functionality should probably remain in a third party package for now.

@shoyer (Member) commented Nov 12, 2017

@nbren12 OK, sounds good to me. Yes, it's a good idea to put more speculative stuff in other packages. I will take a look at your actual implementation.

@stickler-ci (Contributor) commented:
Could not review pull request. It may be too large, or contain no reviewable changes.

@nbren12 nbren12 reopened this Feb 18, 2018
@nbren12 (Contributor, author) commented Feb 18, 2018

Sorry for the random activity. I accidentally hard-reset the master branch of nbren12/xarray to pydata/xarray.

@jhamman (Member) commented Dec 7, 2018

@nbren12 - can we revive this? There are some conflicts, but I think this was pretty close to done.

ds = xr.Dataset({'a': a, 'b': 1.0})
ds_flat = ds.to_stacked_array('features', ['x'])
ds_comp = ds_flat.to_unstacked_dataset('features')
self.assertDatasetIdentical(ds, ds_comp)
Review comment (Member):
Can you break these into a bunch of smaller tests?

Reply (Contributor, author):
Sure. I tried to split these tests into more coherent groups

doc/reshaping.rst (outdated; review thread resolved)
@nbren12 (Contributor, author) commented Dec 8, 2018

I'd be happy to pick this up again if you think it will go through.

@pep8speaks commented Apr 1, 2019

Hello @nbren12! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-04 20:50:20 UTC

@nbren12 (Contributor, author) commented Apr 1, 2019

@jhamman I did a little more work on this today. How do you recommend I update this to master? rebase?

nbren12 added 2 commits April 1, 2019 12:10
This partially resolves #1317.

Change names of methods

stack_cat -> to_stacked_array
unstack_cat -> to_unstacked_dataset

Test that the dtype of the stacked dimensions is preserved

This is not passing at the moment because concatenating None with
a dimension that has values upcasts the combined dtype to object

Fix dtypes of stacked dimensions

This commit ensures that the dtypes of the stacked coordinate match the input dimensions.

Use new index variable rather than patching the old one

I didn't like the inplace modification of a private member.

Handle variable_dim correctly

I also fixed

1. f-string formatting issue
2. Use an OrderedDict as @jhamman recommends

Add documentation to api.rst and reshaping.rst

I also added appropriate See Also sections to the docstrings for
to_stacked_array and to_unstacked_dataset.

Add changes to whats-new

Fixing style errors.

Split up lengthy test

Remove "physical variable" from docs

This is in response to Joe's "nit"
An error arose when checking for the presence of a dimension in an array.
The code 'dim in data' no longer works.
Replaced this with 'dim in data.dims'
without broadcasting.

This function is basically version of Dataset.to_array which does not
broadcast the variables.
Review comment (Collaborator):
I think there is a typo: 'a version'

maybe reformulate to
This function is similar to Dataset.to_array but does not broadcast the variables.

Reply (Contributor, author):
Okay. Changed to "This method is similar to Dataset.to_array but does not broadcast variables."

@nbren12 (Contributor, author) commented Jun 7, 2019

Does anybody have an idea why the py36-dask-dev tests are failing? None of the errors seem related to this PR.

@nbren12 (Contributor, author) commented Jun 22, 2019

It looks like 43834ac might have fixed this. Let's see if the tests pass.

@nbren12 (Contributor, author) commented Jun 22, 2019

It looks like the tests passed. @benbovy How does it look now? Did I fix the issues you mentioned?

@benbovy (Member) left a review:

Sorry @nbren12 for the wait, and thanks for your updates! Your last changes globally look good to me!

I have just a couple of additional minor comments inline and one question below.

Paradoxically I think making to_stacked_array accept the list of dimensions which won't be stacked is clearer. These unstacked dims need to be shared across all variables, which is a very simple requirement.

I agree that it is a good idea if it helps making things clearer. I've found the docs very clear about this.

Out of curiosity, are there actual use cases where it is useful to provide multiple dimensions for sample_dims? If I understand correctly, the primary goal of Dataset.to_stacked_array() is to create 2D data arrays that could be used, e.g., in machine learning applications. If this covers 90% of the use cases, would it make sense to allow only one sample_dim? Would the simplified logic and better readability be worth it? Just wondering, I don't have a strong opinion on this.
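For reference, the single-sample_dim case described here yields a 2D matrix directly. A minimal sketch using the merged method name (variable names and sizes are made up):

```python
import numpy as np
import xarray as xr

# 'a' has the sample dim 'x' plus a feature dim 'z'; 'b' has only 'x'.
ds = xr.Dataset({
    "a": (("x", "z"), np.zeros((4, 2))),
    "b": ("x", np.zeros(4)),
})

# One sample dim -> a 2D data matrix: 4 samples by 3 features
# (2 z-levels of 'a' plus 1 column for 'b'), with no broadcasting.
matrix = ds.to_stacked_array("features", sample_dims=["x"])
print(matrix.shape)  # (4, 3)
```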

dims_include_sample_dims = set(sample_dims) <= set(dims)
if not dims_include_sample_dims:
raise ValueError(
"All DataArrays must share the dims: {}. ".format(dims)
Review comment (Member):

I would change "All DataArrays" to "All data variables in Dataset" for this error message.

Review comment (Member):

Taking the example from the docs/docstrings:

# the line below gives "ValueError: All DataArrays must share the dims: ('x',)."
data.to_stacked_array('z', ['x', 'y'])

# the line below gives "ValueError: All DataArrays must share the dims: ('x', 'y')."
data.to_stacked_array('z', ['foo'])      

Those error messages are still a bit confusing to me. For the second I would expect a KeyError: dimension 'foo' not found. I also don't know why the message says in this second example that all data variables must share the 'y' dimension.

Reply (Contributor, author):

I can see how that is confusing, but technically it is true. I think it would be a little unwieldy to have different error messages for different numbers of sample dimensions. I changed the message to "All variables in the dataset must contain the dimensions {}." Hopefully that is better.


if name is not None:
dataset.name = name

return dataset
Review comment (Member):

nit: "data_array" might be a better name for this variable.

Reply (Contributor, author):

I changed this

@nbren12 (Contributor, author) commented Jun 25, 2019

are there actual use cases where it is useful to provide multiple dimensions for sample_dims?

Yes! My main use-case does. For example, if you have a weather dataset where every lat/lon pair should be considered a separate "sample", you would use sample_dims=['lat', 'lon'].
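A sketch of that use case with the merged API (field names and sizes are made up):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    "temperature": (("lat", "lon", "z"), np.zeros((4, 5, 3))),
    "precip": (("lat", "lon"), np.zeros((4, 5))),
})

# Both 'lat' and 'lon' are sample dims; 'z' and the variable name are
# stacked into the feature dimension (3 temperature levels + 1 precip).
arr = ds.to_stacked_array("features", sample_dims=["lat", "lon"])

# Collapsing the two sample dims then gives the 2D sample/feature matrix:
matrix = arr.stack(sample=["lat", "lon"]).transpose("sample", "features")
print(matrix.shape)  # (20, 4)
```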

@codecov bot commented Jul 2, 2019

Codecov Report

Merging #1597 into master will increase coverage by 0.02%.
The diff coverage is 97.22%.

@@            Coverage Diff             @@
##           master    #1597      +/-   ##
==========================================
+ Coverage    94.8%   94.83%   +0.02%     
==========================================
  Files          67       67              
  Lines       12876    12912      +36     
==========================================
+ Hits        12207    12245      +38     
+ Misses        669      667       -2

@nbren12 (Contributor, author) commented Jul 2, 2019

Okay. I responded to @benbovy's last comments and merged the upstream changes to master. How is this looking?

cc @jhamman

@shoyer (Member) left a review:

I have a few minor suggestions, but this looks really good!


idx = self.indexes[dim]
if not isinstance(idx, pd.MultiIndex):
raise ValueError(dim, "is not a stacked coordinate")
Review comment (Member):

please add test coverage for this error

Reply (Contributor, author):

sure. done

"dimensions {}.".format(dims)
)

def f(val):
Review comment (Member):

please give this a sensible name rather than f, e.g., ensure_stacked

Reply (Contributor, author):

Sure. I changed it to ensure_stackable, since the arrays aren't stacked yet.

# must be list for .expand_dims
expand_dims = list(expand_dims)

return val.assign_coords(**assign_coords) \
Review comment (Member):

nit: here and below, per PEP8, prefer using parentheses rather than \ for multi-line expressions, e.g.,

            return (val.assign_coords(**assign_coords)
                    .expand_dims(expand_dims)
                    .stack(**{new_dim: (variable_dim,) + stacking_dims}))

Reply (Contributor, author):

Sure. done.

@@ -189,6 +189,8 @@ Enhancements
- Allow ``expand_dims`` method to support inserting/broadcasting dimensions
with size > 1. (:issue:`2710`)
By `Martin Pletcher <https://github.com/pletchm>`_.
- New methods for reshaping Datasets of variables with different dimensions
Review comment (Member):

this needs to move up to 0.12.3 now -- sorry for the churn here!

Reply (Contributor, author):

Sure. I moved this up to that release and added a new section header

v0.12.3 (unreleased)
--------------------

New functions/methods
~~~~~~~~~~~~~~~~~~~~~

- New methods for reshaping Datasets of variables with different dimensions
  (:issue:`1317`). By `Noah Brenowitz <https://github.com/nbren12>`_.

xarray/core/dataset.py (two outdated review threads; resolved)
@shoyer (Member) commented Jul 2, 2019

you have a lint error: #1597 (comment)

@nbren12 (Contributor, author) commented Jul 2, 2019

drat. Should be fixed now.

@shoyer shoyer dismissed rabernat’s stale review July 3, 2019 00:44

This is clearly documented now.

@shoyer (Member) commented Jul 3, 2019

OK, I plan to merge this shortly unless there are any objections :)

@max-sixty (Collaborator) left a review:

This is great! Thanks @nbren12!

I don't think it matters that much; I would have been +0.1 on a Dataset.from_stacked_array(cls... since from... allows classes to return their own instances. But rarely do people inherit from these anyway.

# pull variables out of datarray
data_dict = OrderedDict()
for k in variables:
data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True)
Review comment (Collaborator):

Rather than sending as kwargs, if we send a dict then this will work with non-str keys (though non-str dim names are only partially supported anyway atm)

Suggested change
data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True)
data_dict[k] = self.sel({variable_dim: k}).squeeze(drop=True)

Reply (Contributor, author):

I have made this change.


return (val.assign_coords(**assign_coords)
.expand_dims(expand_dims)
.stack(**{new_dim: (variable_dim,) + stacking_dims}))
Review comment (Collaborator):

Suggested change
.stack(**{new_dim: (variable_dim,) + stacking_dims}))
.stack({new_dim: (variable_dim,) + stacking_dims}))

Up a level: not sure of the best way of enforcing this beyond manually checking.
Potentially a common test fixture that has a non-str var name?

Reply (Contributor, author):

I made this change too. I think re-working the test suite would be a fair amount of work, and potentially make the tests harder to read. There are many places in xarray tests that use hardcoded names for dimensions and variables.

Reply (Collaborator):

Right, to be clear the idea was broader than this PR! Wasn't expecting any big change now.

y = DataArray(pd.Index(np.r_[:20], name='y'))
a = x * y
b = x * y * y
return a, b
Review comment (Collaborator):

No need to change, but this would be ideal as test fixture

level : int or str
The MultiIndex level to expand to a dataset along. Can either be
the integer index of the level or its name.
label : int, optional
Review comment (Collaborator):

Generally I think we've said int, default 0 rather than optional where there's a default; but I don't have a strong view

Reply (Contributor, author):

Okay. I changed this to your suggestion

@nbren12 (Contributor, author) commented Jul 5, 2019

It looks like the CI errors above aren’t related to this PR. There seems to be an issue with to_pandas.

@shoyer shoyer merged commit 9c0fb6c into pydata:master Jul 5, 2019
@shoyer (Member) commented Jul 5, 2019

Thanks @nbren12!

@nbren12 (Contributor, author) commented Jul 5, 2019

phew! Thanks for all the reviews and discussion everyone!

@jhamman (Member) commented Jul 5, 2019

Kudos to @nbren12 for sticking with this one! Looking forward to seeing this one in the wild.

@nbren12 (Contributor, author) commented Jul 5, 2019

Thanks Joe!

Successfully merging this pull request may close these issues.

API for reshaping DataArrays as 2D "data matrices" for use in machine learning