Add methods for combining variables of differing dimensionality #1597

Merged
merged 26 commits into from
Jul 5, 2019
Changes from 6 commits
Commits
8c947e7
Add stack_cat and unstack_cat methods
nbren12 Sep 27, 2017
e997f7f
Fix to_stacked_array with new master
nbren12 Apr 1, 2019
3d757da
Move entry in whats-new to most recent release
nbren12 Apr 1, 2019
151dc71
Fix code styling errors
nbren12 Apr 1, 2019
8a1a8ef
Improve docstring of to_stacked_array
nbren12 Apr 2, 2019
e8594f1
Move "See Also" section to end of docstring
nbren12 Apr 2, 2019
0f1ba22
Doc and comment improvements.
nbren12 Apr 12, 2019
1e1f4d9
Merge remote-tracking branch 'upstream/master'
nbren12 Apr 12, 2019
35e0ecf
Improve documented example
nbren12 Jun 7, 2019
23d9246
Add name argument to to_stacked_array and test
nbren12 Jun 7, 2019
099d440
Allow level argument to be an int or str
nbren12 Jun 7, 2019
e40b6a2
Remove variable_dim argument of to_unstacked_array
nbren12 Jun 7, 2019
35a2365
Actually removed variable_dim
nbren12 Jun 7, 2019
35715dc
Merge remote-tracking branch 'upstream/master'
nbren12 Jun 7, 2019
5ca9a1d
Change function signature of to_stacked_array
nbren12 Jun 7, 2019
2979c75
Fix lint error
nbren12 Jun 7, 2019
c17dc09
Fix validation and failing tests
nbren12 Jun 7, 2019
ce3b52e
Fix typo
nbren12 Jun 7, 2019
4ade43d
Merge remote-tracking branch 'upstream/master'
nbren12 Jun 22, 2019
6d520c2
Improve docs and error messages
nbren12 Jul 2, 2019
2669797
Remove extra spaces
nbren12 Jul 2, 2019
24b2237
Merge remote-tracking branch 'upstream/master'
nbren12 Jul 2, 2019
13587c2
Test warning in to_unstacked_dataset
nbren12 Jul 2, 2019
95e2da9
Improve formatting and naming
nbren12 Jul 2, 2019
7aa7095
Fix flake8 error
nbren12 Jul 2, 2019
e08622a
Respond to @max-sixty's suggestions
nbren12 Jul 4, 2019
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -199,6 +199,7 @@ Reshaping and reorganizing
   Dataset.transpose
   Dataset.stack
   Dataset.unstack
   Dataset.to_stacked_array
   Dataset.shift
   Dataset.roll
   Dataset.sortby
@@ -370,6 +371,7 @@ Reshaping and reorganizing
   DataArray.transpose
   DataArray.stack
   DataArray.unstack
   DataArray.to_unstacked_dataset
   DataArray.shift
   DataArray.roll
   DataArray.sortby
30 changes: 30 additions & 0 deletions doc/reshaping.rst
@@ -133,6 +133,36 @@ pandas, it does not automatically drop missing values. Compare:
We departed from pandas's behavior here because predictable shapes for new
array dimensions is necessary for :ref:`dask`.

Stacking different variables together
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These stacking and unstacking operations are particularly useful for reshaping
xarray objects for use in machine learning packages, such as `scikit-learn
<http://scikit-learn.org/stable/>`_, that usually require two-dimensional numpy
arrays as inputs. For datasets with only one variable, we only need ``stack``
and ``unstack``, but combining multiple variables in a
:py:class:`xarray.Dataset` is more complicated. If the variables in the dataset
have matching numbers of dimensions, we can call
:py:meth:`~xarray.Dataset.to_array` and then stack along the new coordinate.
But :py:meth:`~xarray.Dataset.to_array` will broadcast the dataarrays together,
which will effectively tile the lower dimensional variable along the missing
dimensions. The method :py:meth:`xarray.Dataset.to_stacked_array` allows
combining variables of differing dimensions without this wasteful copying while
:py:meth:`xarray.DataArray.to_unstacked_dataset` reverses this operation. These
methods are used like this:
Contributor

Here I would mention explicitly that it does this by using a MultiIndex in the output. (Is that a correct interpretation?)

Contributor Author

Done.
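For readers unfamiliar with how a MultiIndex can label columns that come from variables of different dimensionality, here is a minimal pandas sketch. It builds the index by hand rather than through xarray, and the label values are taken from the documented example; note that recent pandas versions expose the integer positions as `codes`, where the docstrings in this diff still show the older `labels` name.

```python
import numpy as np
import pandas as pd

# One entry per stacked column: three for variable 'a' (y = 0, 1, 2)
# and one for 'b', which has no y coordinate, so its y label is NaN.
variable_labels = ["a", "a", "a", "b"]
y_labels = np.array([0, 1, 2, np.nan])

z_index = pd.MultiIndex.from_arrays(
    [variable_labels, y_labels], names=["variable", "y"]
)

print(list(z_index.names))
# The missing y label is not stored as a level value; it is recorded
# as code -1 in the integer positions for the 'y' level.
print(z_index.codes[1].tolist())
```

The `-1` code is what lets the lower-dimensional variable occupy a single column instead of being tiled along `y`.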


.. ipython:: python

    arr = xr.DataArray(np.arange(6).reshape(2, 3),
                       coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
    data = xr.Dataset({'a': arr, 'b': arr.isel(y=0)})
    stacked = data.to_stacked_array("z", ['y'])
    stacked
    unstacked = stacked.to_unstacked_dataset("z")
    unstacked

In this example, ``stacked`` is a two dimensional array that we can easily pass to a scikit-learn or another generic numerical method.
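The cost of the broadcasting route described in this section can be sketched in plain NumPy, without xarray. This is only an illustration of the array shapes involved, using the same values as the documented example; the variable names are chosen for the sketch and are not part of the PR.

```python
import numpy as np

# Same values as the documented example: `a` has dims (x, y), `b` only (x).
a = np.arange(6).reshape(2, 3)
b = a[:, 0]

# Broadcasting route (roughly what to_array followed by stack amounts to):
# `b` is tiled along y so both variables share the shape (2, 3).
b_tiled = np.broadcast_to(b[:, None], a.shape)
broadcast_features = np.stack([a, b_tiled], axis=1).reshape(2, -1)

# Non-broadcasting route: give `b` a length-one feature axis and concatenate.
compact_features = np.concatenate([a, b[:, None]], axis=1)

print(broadcast_features.shape)
print(compact_features.shape)
print(compact_features)
```

The compact array has four feature columns per sample instead of six, and its values match the `stacked` output shown in the `to_stacked_array` docstring.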

.. _reshape.set_index:

Set and reset index
2 changes: 2 additions & 0 deletions doc/whats-new.rst
@@ -24,6 +24,8 @@ Enhancements
- Allow ``expand_dims`` method to support inserting/broadcasting dimensions
with size > 1. (:issue:`2710`)
By `Martin Pletcher <https://github.com/pletchm>`_.
- New methods for reshaping Datasets of variables with different dimensions
Member

this needs to move up to 0.12.3 now -- sorry for the churn here!

Contributor Author

Sure. I moved this up to that release and added a new section header

v0.12.3 (unreleased)
--------------------

New functions/methods
~~~~~~~~~~~~~~~~~~~~~

- New methods for reshaping Datasets of variables with different dimensions
  (:issue:`1317`). By `Noah Brenowitz <https://github.com/nbren12>`_.

(:issue:`1317`). By `Noah Brenowitz <https://github.com/nbren12>`_.


Bug fixes
60 changes: 60 additions & 0 deletions xarray/core/dataarray.py
@@ -1402,6 +1402,66 @@ def unstack(self, dim=None):
        ds = self._to_temp_dataset().unstack(dim)
        return self._from_temp_dataset(ds)

    def to_unstacked_dataset(self, dim, level=0,
                             variable_dim='variable'):
        """Unstack DataArray expanding to Dataset along a given level of a
        stacked coordinate.

        This is the inverse operation of Dataset.to_stacked_array.

        Parameters
        ----------
        dim : str
            Name of existing dimension to unstack
        level : int
            Index of level to expand to dataset along

        Returns
        -------
        unstacked: Dataset

        Examples
        --------
        >>> import numpy as np
        >>> import xarray as xr
        >>> arr = xr.DataArray(np.arange(6).reshape(2, 3),
        ...                    coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
        >>> data = xr.Dataset({'a': arr, 'b': arr.isel(y=0)})
        >>> data
        <xarray.Dataset>
        Dimensions:  (x: 2, y: 3)
        Coordinates:
          * x        (x) <U1 'a' 'b'
          * y        (y) int64 0 1 2
        Data variables:
            a        (x, y) int64 0 1 2 3 4 5
            b        (x) int64 0 3
        >>> stacked = data.to_stacked_array("z", ['y'])
        >>> stacked.indexes['z']
        MultiIndex(levels=[['a', 'b'], [0, 1, 2]],
                   labels=[[0, 0, 0, 1], [0, 1, 2, -1]],
                   names=['variable', 'y'])
        >>> roundtripped = stacked.to_unstacked_dataset(dim='z')
        >>> data.identical(roundtripped)
        True

        See Also
        --------
        Dataset.to_stacked_array
        """

        idx = self.indexes[dim]
        if not isinstance(idx, pd.MultiIndex):
            raise ValueError(dim, "is not a stacked coordinate")
Member

please add test coverage for this error

Contributor Author

sure. done
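As a rough illustration of the check under discussion and of the kind of test being requested, the sketch below uses a hypothetical helper, `check_stacked`, that is not part of xarray. One observation that holds for plain Python: `ValueError(dim, "is not a stacked coordinate")`, as written in this revision of the diff, stores two arguments, so `str(err)` renders as a tuple; the sketch formats a single message string instead.

```python
import pandas as pd


def check_stacked(index, dim):
    """Hypothetical helper mirroring the MultiIndex check in the diff."""
    if not isinstance(index, pd.MultiIndex):
        raise ValueError("'{}' is not a stacked coordinate".format(dim))
    return index


# A plain (non-stacked) index should be rejected.
flat_index = pd.Index([0, 1, 2], name="y")
try:
    check_stacked(flat_index, "y")
except ValueError as err:
    message = str(err)

print(message)

# A MultiIndex passes through unchanged.
stacked_index = pd.MultiIndex.from_product([["a", "b"], [0, 1]])
print(check_stacked(stacked_index, "z") is stacked_index)
```

In a test suite the same assertion would typically be written with `pytest.raises(ValueError)`.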

        variables = idx.levels[level]

        # pull variables out of DataArray
        data_dict = OrderedDict()
        for k in variables:
            data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True)
Collaborator

Rather than sending as kwargs, if we send as a dict then this will work with non-str keys (though dim names is only partially supported anyway atm)

Suggested change
-            data_dict[k] = self.sel(**{variable_dim: k}).squeeze(drop=True)
+            data_dict[k] = self.sel({variable_dim: k}).squeeze(drop=True)

Contributor Author

I have made this change.
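The reason a dict works where keyword unpacking does not can be shown in plain Python. The `sel` function below is a hypothetical stand-in that accepts indexers either way; it is not xarray's implementation.

```python
def sel(indexers=None, **indexers_kwargs):
    """Hypothetical stand-in accepting indexers as a dict or as keywords."""
    combined = dict(indexers or {})
    combined.update(indexers_kwargs)
    return combined


variable_dim = 0  # a non-string dimension name

# Dict form: any hashable key is accepted.
dict_result = sel({variable_dim: "a"})
print(dict_result)

# Keyword form: Python itself rejects non-string keyword names.
try:
    sel(**{variable_dim: "a"})
except TypeError as err:
    keyword_error = str(err)
    print(keyword_error)
```

With a string name such as `'variable'` both call styles behave the same, which is why the difference only shows up for non-string dimension names.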


        # unstacked dataset
        return Dataset(data_dict)

    def transpose(self, *dims):
        """Return a new DataArray object with transposed dimensions.

101 changes: 101 additions & 0 deletions xarray/core/dataset.py
@@ -2627,6 +2627,107 @@ def stack(self, dimensions=None, **dimensions_kwargs):
            result = result._stack_once(dims, new_dim)
        return result

    def to_stacked_array(self, new_dim, dims, variable_dim='variable'):
Contributor

Maybe we can also use the syntax stacked = data.to_stacked_array(z=['y'])

Contributor Author

While your suggestion is certainly more analogous to stack, I think it should be clear that this function only performs a single stacking operation. That is why I chose a more verbose signature.

        """Combine variables of differing dimensionality into a DataArray
        without broadcasting.

        This function is basically version of Dataset.to_array which does not
        broadcast the variables.
Collaborator

I think there is a typo: 'a version'

maybe reformulate to
This function is similar to Dataset.to_array but does not broadcast the variables.

Contributor Author

Okay. Changed to "This method is similar to Dataset.to_array but does not broadcast variables."


        Parameters
        ----------
        new_dim : str
            Name of the new stacked coordinate
        dims : Sequence[str]
            Dimensions to be stacked. Not all variables in the dataset need to
            have these dimensions.
        variable_dim : str, optional
            Name of the level in the MultiIndex object which corresponds to
            the variables.

        Returns
        -------
        stacked : DataArray

        See Also
        --------
        Dataset.to_array
        Dataset.stack
        DataArray.to_unstacked_dataset

        Examples
        --------

        >>> arr = DataArray(np.arange(6).reshape(2, 3),
        ...                 coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
        >>> data = Dataset({'a': arr, 'b': arr.isel(y=0)})
        >>> data

        <xarray.Dataset>
        Dimensions:  (x: 2, y: 3)
        Coordinates:
          * x        (x) <U1 'a' 'b'
          * y        (y) int64 0 1 2
        Data variables:
            a        (x, y) int64 0 1 2 3 4 5
            b        (x) int64 0 3
        >>> stacked = data.to_stacked_array("z", ['y'])
        >>> stacked.indexes['z']

        MultiIndex(levels=[['a', 'b'], [0, 1, 2]],
                   labels=[[0, 0, 0, 1], [0, 1, 2, -1]],
                   names=['variable', 'y'])
        >>> stacked

        <xarray.DataArray 'a' (x: 2, z: 4)>
        array([[0, 1, 2, 0],
               [3, 4, 5, 3]])
        Coordinates:
          * x         (x) <U1 'a' 'b'
          * z         (z) MultiIndex
          - variable  (z) object 'a' 'a' 'a' 'b'
          - y         (z) object 0 1 2 nan

        """
        dims = tuple(dims)

        def f(val):
Member

please give this a sensible name rather than f, e.g., ensure_stacked

Contributor Author

Sure. I changed it to ensure_stackable, since the arrays aren't stacked yet.

            # ensure square output

            assign_coords = {variable_dim: val.name}
            for dim in dims:
                if (dim not in val.dims):
                    assign_coords[dim] = None

            expand_dims = set(dims).difference(set(val.dims))
            expand_dims.add(variable_dim)
            # must be list for .expand_dims
            expand_dims = list(expand_dims)

            return val.assign_coords(**assign_coords) \
Member

nit: here and below, per PEP8, prefer using parentheses rather than \ for multi-line expressions, e.g.,

            return (val.assign_coords(**assign_coords)
                    .expand_dims(expand_dims)
                    .stack(**{new_dim: (variable_dim,) + stacking_dims}))

Contributor Author

Sure. done.
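The style point is independent of xarray, so a small standalone sketch may help. It uses ordinary string methods as a stand-in for the method chain in the diff; both spellings produce the same value, and the parenthesized one avoids the fragile trailing backslashes.

```python
words = ["  Stack ", " Unstack  "]

# Backslash continuation: works, but a stray space after any backslash
# would turn this into a syntax error.
joined_backslash = ", ".join(words) \
    .lower() \
    .replace(" ", "")

# Parenthesized form preferred by PEP 8: no continuation characters needed.
joined_parens = (", ".join(words)
                 .lower()
                 .replace(" ", ""))

print(joined_backslash == joined_parens)
print(joined_parens)
```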

                .expand_dims(expand_dims) \
                .stack(**{new_dim: (variable_dim,) + dims})

        # concatenate the arrays
        Xs = [f(self[key]) for key in self.data_vars]
        dataset = xr.concat(Xs, dim=new_dim)

        # coerce the levels of the MultiIndex to have the same type as the
        # input dimensions. This code is messy, so it might be better to just
        # input a dummy value for the singleton dimension.
        idx = dataset.indexes[new_dim]
        levels = [idx.levels[0]]\
            + [level.astype(self[level.name].dtype)
               for level in idx.levels[1:]]
        new_idx = idx.set_levels(levels)
        # patch in the new index object
        # dataset[new_dim].variable._data.array = new_idx
        # This commented line below is much cleaner than the junk above, but I
        # wanted to modify the IndexVariable inplace to make sure the attrs
        # and encodings are the same
Contributor

Is this comment still accurate?

Contributor Author

I don't think so. I think it is cleaner to declare a new IndexVariable rather than modify the existing one in-place. I deleted the comment.

        dataset[new_dim] = IndexVariable(new_dim, new_idx)
        return dataset
Member

nit: "data_array" might be a better name for this variable.

Contributor Author

I changed this
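The level-coercion step in this method can be tried in isolation with pandas alone. The sketch below builds an index resembling the stacked coordinate by hand and assumes the original `y` coordinate was `int64`; both the index and the assumed dtype are illustrative, not taken from the PR's test data.

```python
import numpy as np
import pandas as pd

# Index resembling the stacked coordinate: 'b' has no y value (NaN).
idx = pd.MultiIndex.from_arrays(
    [["a", "a", "a", "b"], np.array([0, 1, 2, np.nan])],
    names=["variable", "y"],
)

# Assumed dtype of the original y coordinate in the dataset.
original_y_dtype = np.dtype("int64")

# Cast only the non-variable level. The missing entry is recorded in the
# codes (as -1), not in the level values, so the integer cast is safe.
new_levels = [idx.levels[0], idx.levels[1].astype(original_y_dtype)]
new_idx = idx.set_levels(new_levels)

print(idx.levels[1].dtype, "->", new_idx.levels[1].dtype)
print(new_idx.codes[1].tolist())
```

`set_levels` returns a new index, which matches the approach settled on in the thread above of assigning a fresh `IndexVariable` rather than mutating the existing one in place.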


    def _unstack_once(self, dim):
        index = self.get_index(dim)
        # GH2619. For MultiIndex, we need to call remove_unused.
56 changes: 56 additions & 0 deletions xarray/tests/test_dataset.py
@@ -58,6 +58,14 @@ def create_test_multiindex():
    return Dataset({}, {'x': mindex})


def create_test_stacked_array():
    x = DataArray(pd.Index(np.r_[:10], name='x'))
    y = DataArray(pd.Index(np.r_[:20], name='y'))
    a = x * y
    b = x * y * y
    return a, b
Collaborator

No need to change, but this would be ideal as test fixture



class InaccessibleVariableDataStore(backends.InMemoryDataStore):
    def __init__(self):
        super(InaccessibleVariableDataStore, self).__init__()
@@ -2252,6 +2260,54 @@ def test_stack_unstack_slow(self):
        actual = stacked.isel(z=slice(None, None, -1)).unstack('z')
        assert actual.identical(ds[['b']])

    def test_to_stacked_array_dtype_dims(self):
        # make a two dimensional dataset
        a, b = create_test_stacked_array()
        D = xr.Dataset({'a': a, 'b': b})
        feature_dims = ['y']
        y = D.to_stacked_array('features', feature_dims)
        assert y.indexes['features'].levels[1].dtype == D.y.dtype
        assert y.dims == ('x', 'features')

    def test_to_stacked_array_to_unstacked_dataset(self):
        # make a two dimensional dataset
        a, b = create_test_stacked_array()
        D = xr.Dataset({'a': a, 'b': b})
        feature_dims = ['y']
        y = D.to_stacked_array('features', feature_dims)\
            .transpose("x", "features")

        x = y.to_unstacked_dataset("features")
        assert_identical(D, x)

        # test on just one sample
        x0 = y[0].to_unstacked_dataset("features")
        d0 = D.isel(x=0)
        assert_identical(d0, x0)

    def test_to_stacked_array_to_unstacked_dataset_different_dimension(self):
        # test when variables have different dimensionality
        a, b = create_test_stacked_array()
        feature_dims = ['y']
        D = xr.Dataset({'a': a, 'b': b.isel(y=0)})

        y = D.to_stacked_array('features', feature_dims)
        x = y.to_unstacked_dataset('features')
        assert_identical(D, x)

        # another test
        ds = D.isel(x=0)
        ds_flat = ds.to_stacked_array('features', ['y'])
        ds_comp = ds_flat.to_unstacked_dataset('features')
        assert_identical(ds, ds_comp)

    def test_to_stacked_array_to_unstacked_dataset_scalar(self):
        a = xr.DataArray(np.r_[:6], dims=('x', ), coords={'x': np.r_[:6]})
        ds = xr.Dataset({'a': a, 'b': 1.0})
        ds_flat = ds.to_stacked_array('features', ['x'])
        ds_comp = ds_flat.to_unstacked_dataset('features')
        assert_identical(ds, ds_comp)

    def test_update(self):
        data = create_test_data(seed=0)
        expected = data.copy()