Add methods for combining variables of differing dimensionality #1597

Merged on Jul 5, 2019 (26 commits)
Commits
8c947e7
Add stack_cat and unstack_cat methods
nbren12 Sep 27, 2017
e997f7f
Fix to_stacked_array with new master
nbren12 Apr 1, 2019
3d757da
Move entry in whats-new to most recent release
nbren12 Apr 1, 2019
151dc71
Fix code styling errors
nbren12 Apr 1, 2019
8a1a8ef
Improve docstring of to_stacked_array
nbren12 Apr 2, 2019
e8594f1
Move "See Also" section to end of docstring
nbren12 Apr 2, 2019
0f1ba22
Doc and comment improvements.
nbren12 Apr 12, 2019
1e1f4d9
Merge remote-tracking branch 'upstream/master'
nbren12 Apr 12, 2019
35e0ecf
Improve documented example
nbren12 Jun 7, 2019
23d9246
Add name argument to to_stacked_array and test
nbren12 Jun 7, 2019
099d440
Allow level argument to be an int or str
nbren12 Jun 7, 2019
e40b6a2
Remove variable_dim argument of to_unstacked_array
nbren12 Jun 7, 2019
35a2365
Actually removed variable_dim
nbren12 Jun 7, 2019
35715dc
Merge remote-tracking branch 'upstream/master'
nbren12 Jun 7, 2019
5ca9a1d
Change function signature of to_stacked_array
nbren12 Jun 7, 2019
2979c75
Fix lint error
nbren12 Jun 7, 2019
c17dc09
Fix validation and failing tests
nbren12 Jun 7, 2019
ce3b52e
Fix typo
nbren12 Jun 7, 2019
4ade43d
Merge remote-tracking branch 'upstream/master'
nbren12 Jun 22, 2019
6d520c2
Improve docs and error messages
nbren12 Jul 2, 2019
2669797
Remove extra spaces
nbren12 Jul 2, 2019
24b2237
Merge remote-tracking branch 'upstream/master'
nbren12 Jul 2, 2019
13587c2
Test warning in to_unstacked_dataset
nbren12 Jul 2, 2019
95e2da9
Improve formatting and naming
nbren12 Jul 2, 2019
7aa7095
Fix flake8 error
nbren12 Jul 2, 2019
e08622a
Respond to @max-sixty's suggestions
nbren12 Jul 4, 2019
2 changes: 2 additions & 0 deletions doc/api.rst
@@ -204,6 +204,7 @@ Reshaping and reorganizing
Dataset.transpose
Dataset.stack
Dataset.unstack
Dataset.to_stacked_array
Dataset.shift
Dataset.roll
Dataset.sortby
@@ -377,6 +378,7 @@ Reshaping and reorganizing
DataArray.transpose
DataArray.stack
DataArray.unstack
DataArray.to_unstacked_dataset
DataArray.shift
DataArray.roll
DataArray.sortby
42 changes: 42 additions & 0 deletions doc/reshaping.rst
@@ -133,6 +133,48 @@ pandas, it does not automatically drop missing values. Compare:
We departed from pandas's behavior here because predictable shapes for new
array dimensions are necessary for :ref:`dask`.

Stacking different variables together
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These stacking and unstacking operations are particularly useful for reshaping
xarray objects for use in machine learning packages, such as `scikit-learn
<http://scikit-learn.org/stable/>`_, that usually require two-dimensional numpy
arrays as inputs. For datasets with only one variable, we only need ``stack``
and ``unstack``, but combining multiple variables in a
:py:class:`xarray.Dataset` is more complicated. If the variables in the dataset
have matching numbers of dimensions, we can call
:py:meth:`~xarray.Dataset.to_array` and then stack along the new coordinate.
But :py:meth:`~xarray.Dataset.to_array` broadcasts the data arrays together,
which effectively tiles the lower-dimensional variables along the missing
dimensions. The method :py:meth:`xarray.Dataset.to_stacked_array` combines
variables of differing dimensions without this wasteful copying, while
:py:meth:`xarray.DataArray.to_unstacked_dataset` reverses the operation.
Just as with :py:meth:`xarray.Dataset.stack`, the stacked coordinate is
represented by a :py:class:`pandas.MultiIndex` object. These methods are used
like this:

.. ipython:: python

    data = xr.Dataset(
        data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
                   'b': ('x', [6, 7])},
        coords={'y': ['u', 'v', 'w']}
    )
    stacked = data.to_stacked_array("z", sample_dims=['x'])
    stacked
    unstacked = stacked.to_unstacked_dataset("z")
    unstacked

In this example, ``stacked`` is a two-dimensional array that we can easily pass to scikit-learn or another generic
numerical method.

.. note::

    Unlike with ``stack``, in ``to_stacked_array``, the user specifies the dimensions they **do not** want stacked.
    For a machine learning task, these unstacked dimensions can be interpreted as the dimensions over which samples are
    drawn, whereas the stacked coordinates are the features. Naturally, all variables should possess these sampling
    dimensions.
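
To make the cost of broadcasting concrete, the following sketch compares the number of elements produced by ``to_array`` and ``to_stacked_array`` for the same dataset used in the example above. It assumes an xarray release that includes the methods added in this pull request; the variable names ``broadcast`` and ``stacked`` are only illustrative.

```python
import numpy as np
import xarray as xr

data = xr.Dataset(
    data_vars={
        "a": (("x", "y"), [[0, 1, 2], [3, 4, 5]]),
        "b": ("x", [6, 7]),
    },
    coords={"y": ["u", "v", "w"]},
)

# to_array broadcasts "b" against "a", so each value of "b" is repeated
# once per entry of "y".
broadcast = data.to_array(dim="variable")

# to_stacked_array keeps "b" as a single column in the stacked dimension.
stacked = data.to_stacked_array("z", sample_dims=["x"])

print(broadcast.size)  # number of elements after broadcasting
print(stacked.size)    # number of elements without broadcasting
print(stacked.transpose("x", "z").values)
```

The difference grows with the size of the dimensions that the lower-dimensional variable lacks, which is why avoiding the broadcast matters for large datasets.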


.. _reshape.set_index:

Set and reset index
6 changes: 6 additions & 0 deletions doc/whats-new.rst
@@ -18,6 +18,12 @@ What's New
v0.12.3 (unreleased)
--------------------

New functions/methods
~~~~~~~~~~~~~~~~~~~~~

- New methods :py:meth:`Dataset.to_stacked_array` and
  :py:meth:`DataArray.to_unstacked_dataset` for reshaping Datasets of variables
  with different dimensions
  (:issue:`1317`). By `Noah Brenowitz <https://github.com/nbren12>`_.

Enhancements
~~~~~~~~~~~~

66 changes: 66 additions & 0 deletions xarray/core/dataarray.py
@@ -1540,6 +1540,72 @@ def unstack(self, dim: Union[Hashable, Sequence[Hashable], None] = None
        ds = self._to_temp_dataset().unstack(dim)
        return self._from_temp_dataset(ds)

    def to_unstacked_dataset(self, dim, level=0):
        """Unstack DataArray expanding to Dataset along a given level of a
        stacked coordinate.

        This is the inverse operation of Dataset.to_stacked_array.

        Parameters
        ----------
        dim : str
            Name of existing dimension to unstack
        level : int or str, default 0
            The MultiIndex level to expand to a dataset along. Can either be
            the integer index of the level or its name.

        Returns
        -------
        unstacked: Dataset

        Examples
        --------
        >>> import numpy as np
        >>> import xarray as xr
        >>> arr = xr.DataArray(np.arange(6).reshape(2, 3),
        ...                    coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
        >>> data = xr.Dataset({'a': arr, 'b': arr.isel(y=0)})
        >>> data
        <xarray.Dataset>
        Dimensions:  (x: 2, y: 3)
        Coordinates:
          * x        (x) <U1 'a' 'b'
          * y        (y) int64 0 1 2
        Data variables:
            a        (x, y) int64 0 1 2 3 4 5
            b        (x) int64 0 3
        >>> stacked = data.to_stacked_array("z", ['x'])
        >>> stacked.indexes['z']
        MultiIndex(levels=[['a', 'b'], [0, 1, 2]],
                   labels=[[0, 0, 0, 1], [0, 1, 2, -1]],
                   names=['variable', 'y'])
        >>> roundtripped = stacked.to_unstacked_dataset(dim='z')
        >>> data.identical(roundtripped)
        True

        See Also
        --------
        Dataset.to_stacked_array
        """

        idx = self.indexes[dim]
        if not isinstance(idx, pd.MultiIndex):
            raise ValueError("'{}' is not a stacked coordinate".format(dim))

        level_number = idx._get_level_number(level)
        variables = idx.levels[level_number]
        variable_dim = idx.names[level_number]

        # pull variables out of the DataArray
        data_dict = OrderedDict()
        for k in variables:
            data_dict[k] = self.sel({variable_dim: k}).squeeze(drop=True)

        # unstacked dataset
        return Dataset(data_dict)

    def transpose(self,
                  *dims: Hashable,
                  transpose_coords: Optional[bool] = None) -> 'DataArray':
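
The level lookup in ``to_unstacked_dataset`` relies on the private pandas helper ``_get_level_number``. As a rough illustration of what that step computes, the sketch below resolves a level given by position or by name using only public ``pandas.MultiIndex`` attributes. The helper name ``resolve_level`` is hypothetical and is not part of xarray or pandas.

```python
import pandas as pd


def resolve_level(idx, level=0):
    """Return (level_name, level_values) for a MultiIndex level.

    Hypothetical helper for illustration; ``level`` may be an integer
    position or a level name.
    """
    if not isinstance(idx, pd.MultiIndex):
        raise ValueError("index is not a MultiIndex")
    if isinstance(level, int):
        number = level
    else:
        number = list(idx.names).index(level)
    return idx.names[number], list(idx.levels[number])


# A stacked index similar to the docstring example: variable "b" has no
# "y" value, so its entry in the "y" level is missing.
idx = pd.MultiIndex.from_tuples(
    [("a", 0.0), ("a", 1.0), ("a", 2.0), ("b", float("nan"))],
    names=["variable", "y"],
)

by_position = resolve_level(idx, 0)
by_name = resolve_level(idx, "variable")
print(by_position)
print(by_name)
```

Both calls identify the same level, which mirrors how the ``level`` argument accepts either an ``int`` or a ``str``.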
113 changes: 113 additions & 0 deletions xarray/core/dataset.py
@@ -2698,6 +2698,119 @@ def stack(self, dimensions=None, **dimensions_kwargs):
            result = result._stack_once(dims, new_dim)
        return result

    def to_stacked_array(self, new_dim, sample_dims, variable_dim='variable',
                         name=None):
        """Combine variables of differing dimensionality into a DataArray
        without broadcasting.

        This method is similar to Dataset.to_array but does not broadcast the
        variables.

        Parameters
        ----------
        new_dim : str
            Name of the new stacked coordinate
        sample_dims : Sequence[str]
            Dimensions that **will not** be stacked. Each array in the dataset
            must share these dimensions. For machine learning applications,
            these define the dimensions over which samples are drawn.
        variable_dim : str, optional
            Name of the level in the stacked coordinate which corresponds to
            the variables.
        name : str, optional
            Name of the new data array.

        Returns
        -------
        stacked : DataArray
            DataArray with the specified dimensions and data variables
            stacked together. The stacked coordinate is named ``new_dim``
            and represented by a MultiIndex object with a level containing the
            data variable names. The name of this level is controlled using
            the ``variable_dim`` argument.

        See Also
        --------
        Dataset.to_array
        Dataset.stack
        DataArray.to_unstacked_dataset

        Examples
        --------
        >>> data = Dataset(
        ...     data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
        ...                'b': ('x', [6, 7])},
        ...     coords={'y': ['u', 'v', 'w']}
        ... )

        >>> data
        <xarray.Dataset>
        Dimensions:  (x: 2, y: 3)
        Coordinates:
          * y        (y) <U1 'u' 'v' 'w'
        Dimensions without coordinates: x
        Data variables:
            a        (x, y) int64 0 1 2 3 4 5
            b        (x) int64 6 7

        >>> data.to_stacked_array("z", sample_dims=['x'])
        <xarray.DataArray (x: 2, z: 4)>
        array([[0, 1, 2, 6],
               [3, 4, 5, 7]])
        Coordinates:
          * z         (z) MultiIndex
          - variable  (z) object 'a' 'a' 'a' 'b'
          - y         (z) object 'u' 'v' 'w' nan
        Dimensions without coordinates: x

        """
        stacking_dims = tuple(dim for dim in self.dims
                              if dim not in sample_dims)

        for variable in self:
            dims = self[variable].dims
            dims_include_sample_dims = set(sample_dims) <= set(dims)
            if not dims_include_sample_dims:
                # report the required sample dimensions, not the dimensions
                # of the variable that failed the check
                raise ValueError(
                    "All variables in the dataset must contain the "
                    "dimensions {}.".format(sample_dims)
                )

        def ensure_stackable(val):
            assign_coords = {variable_dim: val.name}
            for dim in stacking_dims:
                if dim not in val.dims:
                    assign_coords[dim] = None

            expand_dims = set(stacking_dims).difference(set(val.dims))
            expand_dims.add(variable_dim)
            # must be list for .expand_dims
            expand_dims = list(expand_dims)

            return (val.assign_coords(**assign_coords)
                    .expand_dims(expand_dims)
                    .stack({new_dim: (variable_dim,) + stacking_dims}))

        # concatenate the arrays
        stackable_vars = [ensure_stackable(self[key])
                          for key in self.data_vars]
        data_array = xr.concat(stackable_vars, dim=new_dim)

        # coerce the levels of the MultiIndex to have the same type as the
        # input dimensions. This code is messy, so it might be better to just
        # input a dummy value for the singleton dimension.
        idx = data_array.indexes[new_dim]
        levels = ([idx.levels[0]]
                  + [level.astype(self[level.name].dtype)
                     for level in idx.levels[1:]])
        new_idx = idx.set_levels(levels)
        data_array[new_dim] = IndexVariable(new_dim, new_idx)

        if name is not None:
            data_array.name = name

        return data_array

    def _unstack_once(self, dim):
        index = self.get_index(dim)
        # GH2619. For MultiIndex, we need to call remove_unused.
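
Conceptually, ``to_stacked_array`` flattens the non-sample dimensions of each variable and concatenates the results along one new axis. The sketch below shows that idea with plain NumPy arrays, under the simplifying assumption that the sample dimensions are the leading axes of every array. The function ``stack_features`` is hypothetical and omits the MultiIndex bookkeeping that the xarray method performs.

```python
import numpy as np


def stack_features(arrays, n_sample_dims=1):
    """Flatten the trailing (non-sample) axes of each array and concatenate.

    Hypothetical illustration: assumes the first ``n_sample_dims`` axes of
    every array are the shared sample dimensions.
    """
    columns = []
    for arr in arrays:
        arr = np.asarray(arr)
        sample_shape = arr.shape[:n_sample_dims]
        # An array with no extra axes becomes a single feature column.
        columns.append(arr.reshape(sample_shape + (-1,)))
    return np.concatenate(columns, axis=-1)


a = np.array([[0, 1, 2], [3, 4, 5]])  # dims (x, y)
b = np.array([6, 7])                  # dims (x,)
features = stack_features([a, b])
print(features)
```

No value of ``b`` is duplicated here, in contrast to what a broadcast along ``y`` would produce.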
6 changes: 6 additions & 0 deletions xarray/tests/test_dataarray.py
@@ -1798,6 +1798,12 @@ def test_stack_nonunique_consistency(self):
        expected = DataArray(orig.to_pandas().stack(), dims='z')
        assert_identical(expected, actual)

    def test_to_unstacked_dataset_raises_value_error(self):
        data = DataArray([0, 1], dims='x', coords={'x': [0, 1]})
        with pytest.raises(
                ValueError, match="'x' is not a stacked coordinate"):
            data.to_unstacked_dataset('x', 0)

    def test_transpose(self):
        da = DataArray(np.random.randn(3, 4, 5), dims=('x', 'y', 'z'),
                       coords={'x': range(3), 'y': range(4), 'z': range(5),
63 changes: 63 additions & 0 deletions xarray/tests/test_dataset.py
@@ -110,6 +110,14 @@ def create_test_multiindex():
    return Dataset({}, {'x': mindex})


def create_test_stacked_array():
    x = DataArray(pd.Index(np.r_[:10], name='x'))
    y = DataArray(pd.Index(np.r_[:20], name='y'))
    a = x * y
    b = x * y * y
    return a, b
Review comment (Collaborator): No need to change, but this would be ideal as a test fixture.



class InaccessibleVariableDataStore(backends.InMemoryDataStore):
    def __init__(self):
        super(InaccessibleVariableDataStore, self).__init__()
@@ -2449,6 +2457,61 @@ def test_stack_unstack_slow(self):
        actual = stacked.isel(z=slice(None, None, -1)).unstack('z')
        assert actual.identical(ds[['b']])

    def test_to_stacked_array_invalid_sample_dims(self):
        data = xr.Dataset(
            data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
                       'b': ('x', [6, 7])},
            coords={'y': ['u', 'v', 'w']}
        )
        with pytest.raises(ValueError):
            data.to_stacked_array("features", sample_dims=['y'])

    def test_to_stacked_array_name(self):
        name = 'adf9d'

        # make a two dimensional dataset
        a, b = create_test_stacked_array()
        D = xr.Dataset({'a': a, 'b': b})
        sample_dims = ['x']

        y = D.to_stacked_array('features', sample_dims, name=name)
        assert y.name == name

    def test_to_stacked_array_dtype_dims(self):
        # make a two dimensional dataset
        a, b = create_test_stacked_array()
        D = xr.Dataset({'a': a, 'b': b})
        sample_dims = ['x']
        y = D.to_stacked_array('features', sample_dims)
        assert y.indexes['features'].levels[1].dtype == D.y.dtype
        assert y.dims == ('x', 'features')

    def test_to_stacked_array_to_unstacked_dataset(self):
        # make a two dimensional dataset
        a, b = create_test_stacked_array()
        D = xr.Dataset({'a': a, 'b': b})
        sample_dims = ['x']
        y = D.to_stacked_array('features', sample_dims)\
            .transpose("x", "features")

        x = y.to_unstacked_dataset("features")
        assert_identical(D, x)

        # test on just one sample
        x0 = y[0].to_unstacked_dataset("features")
        d0 = D.isel(x=0)
        assert_identical(d0, x0)

    def test_to_stacked_array_to_unstacked_dataset_different_dimension(self):
        # test when variables have different dimensionality
        a, b = create_test_stacked_array()
        sample_dims = ['x']
        D = xr.Dataset({'a': a, 'b': b.isel(y=0)})

        y = D.to_stacked_array('features', sample_dims)
        x = y.to_unstacked_dataset('features')
        assert_identical(D, x)

    def test_update(self):
        data = create_test_data(seed=0)
        expected = data.copy()