Add methods for combining variables of differing dimensionality #1597

Merged
merged 26 commits into from
Jul 5, 2019
Changes from 11 commits
26 commits
8c947e7
Add stack_cat and unstack_cat methods
nbren12 Sep 27, 2017
e997f7f
Fix to_stacked_array with new master
nbren12 Apr 1, 2019
3d757da
Move entry in whats-new to most recent release
nbren12 Apr 1, 2019
151dc71
Fix code styling errors
nbren12 Apr 1, 2019
8a1a8ef
Improve docstring of to_stacked_array
nbren12 Apr 2, 2019
e8594f1
Move "See Also" section to end of docstring
nbren12 Apr 2, 2019
0f1ba22
Doc and comment improvements.
nbren12 Apr 12, 2019
1e1f4d9
Merge remote-tracking branch 'upstream/master'
nbren12 Apr 12, 2019
35e0ecf
Improve documented example
nbren12 Jun 7, 2019
23d9246
Add name argument to to_stacked_array and test
nbren12 Jun 7, 2019
099d440
Allow level argument to be an int or str
nbren12 Jun 7, 2019
e40b6a2
Remove variable_dim argument of to_unstacked_array
nbren12 Jun 7, 2019
35a2365
Actually removed variable_dim
nbren12 Jun 7, 2019
35715dc
Merge remote-tracking branch 'upstream/master'
nbren12 Jun 7, 2019
5ca9a1d
Change function signature of to_stacked_array
nbren12 Jun 7, 2019
2979c75
Fix lint error
nbren12 Jun 7, 2019
c17dc09
Fix validation and failing tests
nbren12 Jun 7, 2019
ce3b52e
Fix typo
nbren12 Jun 7, 2019
4ade43d
Merge remote-tracking branch 'upstream/master'
nbren12 Jun 22, 2019
6d520c2
Improve docs and error messages
nbren12 Jul 2, 2019
2669797
Remove extra spaces
nbren12 Jul 2, 2019
24b2237
Merge remote-tracking branch 'upstream/master'
nbren12 Jul 2, 2019
13587c2
Test warning in to_unstacked_dataset
nbren12 Jul 2, 2019
95e2da9
Improve formatting and naming
nbren12 Jul 2, 2019
7aa7095
Fix flake8 error
nbren12 Jul 2, 2019
e08622a
Respond to @max-sixty's suggestions
nbren12 Jul 4, 2019
22 changes: 16 additions & 6 deletions doc/reshaping.rst
@@ -154,16 +154,26 @@ represented by a :py:class:`pandas.MultiIndex` object. These methods are used
like this:

.. ipython:: python

arr = xr.DataArray(np.arange(6).reshape(2, 3),
coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
data = xr.Dataset({'a': arr, 'b': arr.isel(y=0)})
stacked = data.to_stacked_array("z", ['y'])
data = xr.Dataset(
data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
'b': ('x', [6, 7])},
coords={'y': ['u', 'v', 'w']}
)
stacked = data.to_stacked_array("z", sample_dims=['x'])
stacked
rabernat marked this conversation as resolved.
unstacked = stacked.to_unstacked_dataset("z")
unstacked

In this example, ``stacked`` is a two dimensional array that we can easily pass to a scikit-learn or another generic numerical method.
In this example, ``stacked`` is a two dimensional array that we can easily pass to a scikit-learn or another generic
numerical method.

.. note::

Unlike with ``stack``, in ``to_stacked_array``, the user specifies the dimensions they **do not** want stacked.
For a machine learning task, these unstacked dimensions can be interpreted as the dimensions over which samples are
drawn, whereas the stacked coordinates are the features. Naturally, all variables should possess these sampling
dimensions.


.. _reshape.set_index:
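
The note in the hunk above says that ``sample_dims`` names the dimensions that stay unstacked, while every other dimension is folded into the new feature axis. The following pure-Python sketch illustrates only that bookkeeping. It is not xarray's implementation; the helper name `stacked_feature_index` and the plain-dict inputs are assumptions made for illustration.

```python
from itertools import product


def stacked_feature_index(var_dims, dim_labels, sample_dims):
    """Return the (variable, label, ...) tuples forming the stacked feature axis.

    var_dims:    maps variable name -> tuple of its dimension names
    dim_labels:  maps dimension name -> list of coordinate labels
    sample_dims: dimensions that are left unstacked
    """
    # Every dimension that is not a sample dimension gets stacked.
    stacking_dims = [d for d in dim_labels if d not in sample_dims]
    index = []
    for name, dims in var_dims.items():
        # A variable lacking a stacking dimension is not broadcast along it;
        # it contributes a single column labelled None for that dimension.
        per_dim = [dim_labels[d] if d in dims else [None] for d in stacking_dims]
        for labels in product(*per_dim):
            index.append((name, *labels))
    return index


var_dims = {"a": ("x", "y"), "b": ("x",)}
dim_labels = {"x": [0, 1], "y": ["u", "v", "w"]}
features = stacked_feature_index(var_dims, dim_labels, sample_dims=["x"])
print(features)  # [('a', 'u'), ('a', 'v'), ('a', 'w'), ('b', None)]
```

The four tuples correspond to the four columns of ``stacked`` in the documented example: three for ``a`` (one per ``y`` label) and one for ``b``, which has no ``y`` dimension.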

18 changes: 12 additions & 6 deletions xarray/core/dataarray.py
@@ -1408,8 +1408,7 @@ def unstack(self, dim=None):
ds = self._to_temp_dataset().unstack(dim)
return self._from_temp_dataset(ds)

def to_unstacked_dataset(self, dim, level=0,
variable_dim='variable'):
def to_unstacked_dataset(self, dim, level=0):
"""Unstack DataArray expanding to Dataset along a given level of a
stacked coordinate.

@@ -1419,8 +1418,12 @@ def to_unstacked_dataset(self, dim, level=0,
----------
dim : str
Name of existing dimension to unstack
level : int
Index of level to expand to dataset along
level : int or str
The MultiIndex level to expand to a dataset along. Can either be
the integer index of the level or its name.
label : int, optional
Collaborator

Generally I think we've said `int, default 0` rather than `optional` where there's a default, but I don't have a strong view.

Contributor Author

Okay. I changed this to your suggestion

Label of the level to expand dataset along. Overrides the label
argument if given.

benbovy marked this conversation as resolved.
Returns
-------
@@ -1458,7 +1461,10 @@ def to_unstacked_dataset(self, dim, level=0,
idx = self.indexes[dim]
if not isinstance(idx, pd.MultiIndex):
raise ValueError(dim, "is not a stacked coordinate")
Member

please add test coverage for this error

Contributor Author

sure. done

variables = idx.levels[level]

level_number = idx._get_level_number(level)
variables = idx.levels[level_number]
variable_dim = idx.names[level_number]

# pull variables out of datarray
data_dict = OrderedDict()
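
The hunk above lets `level` be either an integer position or a level name and then looks up both the level's values and its name. The diff delegates the lookup to pandas' private `_get_level_number`. The sketch below is a hypothetical pure-Python stand-in that shows only the idea; `resolve_level` is not part of pandas or xarray, and pandas' own handling of edge cases (for example integer level names) may differ.

```python
def resolve_level(names, level):
    """Map a level given by name or by integer position to (position, name)."""
    if isinstance(level, str):
        if level not in names:
            raise KeyError(f"level {level!r} not found in {names}")
        position = names.index(level)
    else:
        # Indexing a range normalises negative positions and raises
        # IndexError when the position is out of bounds.
        position = range(len(names))[level]
    return position, names[position]


names = ["variable", "y"]
by_position = resolve_level(names, 0)   # (0, 'variable')
by_name = resolve_level(names, "y")     # (1, 'y')
print(by_position, by_name)
```

Resolving the position first is what allows the new code to derive `variable_dim` from the index itself, which is why the `variable_dim` argument could be dropped from `to_unstacked_dataset`.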
@@ -1468,7 +1474,7 @@ def to_unstacked_dataset(self, dim, level=0,
# unstacked dataset
return Dataset(data_dict)

def transpose(self, *dims) -> 'DataArray':
def transpose(self, *dims, transpose_coords=None) -> 'DataArray':
"""Return a new DataArray object with transposed dimensions.

Parameters
72 changes: 42 additions & 30 deletions xarray/core/dataset.py
@@ -2650,23 +2650,27 @@ def stack(self, dimensions=None, **dimensions_kwargs):
result = result._stack_once(dims, new_dim)
return result

def to_stacked_array(self, new_dim, dims, variable_dim='variable'):
def to_stacked_array(self, new_dim, sample_dims, variable_dim='variable',
name=None):
"""Combine variables of differing dimensionality into a DataArray
without broadcasting.

This function is basically version of Dataset.to_array which does not
broadcast the variables.
This method is similar to Dataset.to_array but does not broadcast the
variables.

Parameters
----------
new_dim : str
Name of the new stacked coordinate
dims : Sequence[str]
Dimensions to be stacked. Not all variables in the dataset need to
have these dimensions.
sample_dims : Sequence[str]
Dimensions that **will not** be stacked. Each array in the dataset
must share these dimensions. For machine learning applications,
these define the dimensions over which samples are drawn.
variable_dim : str, optional
Name of the level in the stacked coordinate which corresponds to
the variables.
dcherian marked this conversation as resolved.
name : str, optional
Name of the new data array.

Returns
-------
@@ -2685,56 +2689,60 @@ def to_stacked_array(self, new_dim, dims, variable_dim='variable'):

Examples
--------
>>> data = Dataset(
... data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
... 'b': ('x', [6, 7])},
... coords={'y': ['u', 'v', 'w']}
... )

>>> arr = DataArray(np.arange(6).reshape(2, 3),
... coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
>>> data = Dataset({'a': arr, 'b': arr.isel(y=0)})
>>> data

<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) <U1 'a' 'b'
* y (y) int64 0 1 2
* y (y) <U1 'u' 'v' 'w'
Dimensions without coordinates: x
Data variables:
a (x, y) int64 0 1 2 3 4 5
b (x) int64 0 3
>>> stacked = data.to_stacked_array("z", ['y'])
>>> stacked.indexes['z']

MultiIndex(levels=[['a', 'b'], [0, 1, 2]],
labels=[[0, 0, 0, 1], [0, 1, 2, -1]],
names=['variable', 'y'])
>>> stacked

<xarray.DataArray 'a' (x: 2, z: 4)>
array([[0, 1, 2, 0],
[3, 4, 5, 3]])
b (x) int64 6 7

>>> data.to_stacked_array("z", ['x'])
jhamman marked this conversation as resolved.
<xarray.DataArray (x: 2, z: 4)>
array([[0, 1, 2, 6],
[3, 4, 5, 7]])
Coordinates:
* x (x) <U1 'a' 'b'
* z (z) MultiIndex
- variable (z) object 'a' 'a' 'a' 'b'
- y (z) object 0 1 2 nan
- y (z) object 'u' 'v' 'w' nan
Dimensions without coordinates: x

"""
dims = tuple(dims)
stacking_dims = tuple(dim for dim in self.dims
if dim not in sample_dims)

for variable in self:
dims = self[variable].dims
dims_include_sample_dims = set(sample_dims) <= set(dims)
if not dims_include_sample_dims:
raise ValueError(
"All DataArrays must share the dims: {}. ".format(dims)
Member

I would replace "All DataArrays" with "All data variables in Dataset" in this error message.

Member

Taking the example from the docs/docstrings:

# the line below gives "ValueError: All DataArrays must share the dims: ('x',)."
data.to_stacked_array('z', ['x', 'y'])

# the line below gives "ValueError: All DataArrays must share the dims: ('x', 'y')."
data.to_stacked_array('z', ['foo'])      

Those error messages are still a bit confusing to me. For the second, I would expect a KeyError: dimension 'foo' not found. I also don't know why the message in this second example says that all data variables must share the 'y' dimension.

Contributor Author

I can see how that is confusing, but technically it is true. I think it would be a little unwieldy to have different error messages for different numbers of sample_dimensions. I changed the message to "All variables in the dataset must contain the dimensions {}." Hopefully, that is better.
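
One detail that may explain the confusing output quoted in this thread: the message in the hunk is formatted with `dims`, which at that point holds the failing variable's own dimensions rather than `sample_dims`. The sketch below shows one way the two failure cases could be told apart, using a `KeyError` for an unknown dimension as suggested above. It is a hypothetical pure-Python illustration, not the code that was merged; `check_sample_dims` and the dict-of-tuples input are assumptions.

```python
def check_sample_dims(var_dims, sample_dims):
    """Validate sample_dims against a mapping of variable name -> dims tuple."""
    all_dims = {d for dims in var_dims.values() for d in dims}

    # Case 1: the caller named a dimension that exists nowhere in the dataset.
    unknown = [d for d in sample_dims if d not in all_dims]
    if unknown:
        raise KeyError(f"dimensions {unknown} not found in the dataset")

    # Case 2: the dimension exists, but some variable does not have it.
    for name, dims in var_dims.items():
        missing = [d for d in sample_dims if d not in dims]
        if missing:
            raise ValueError(
                f"variable {name!r} lacks the dimensions {missing}; all variables "
                f"in the dataset must contain the dimensions {list(sample_dims)}"
            )


var_dims = {"a": ("x", "y"), "b": ("x",)}
check_sample_dims(var_dims, ["x"])  # passes silently

for bad in (["x", "y"], ["foo"]):
    try:
        check_sample_dims(var_dims, bad)
    except (KeyError, ValueError) as err:
        print(type(err).__name__, err)
```

Reporting `sample_dims` (and the offending variable) rather than the variable's own dimensions keeps the message accurate regardless of how many sample dimensions were requested.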

)

def f(val):
Member

Please give this a sensible name rather than f, e.g., ensure_stacked.

Contributor Author

Sure. I changed it to ensure_stackable, since the arrays aren't stacked yet.

# ensure square output

assign_coords = {variable_dim: val.name}
for dim in dims:
for dim in stacking_dims:
if (dim not in val.dims):
jhamman marked this conversation as resolved.
assign_coords[dim] = None

expand_dims = set(dims).difference(set(val.dims))
expand_dims = set(stacking_dims).difference(set(val.dims))
expand_dims.add(variable_dim)
# must be list for .expand_dims
expand_dims = list(expand_dims)

return val.assign_coords(**assign_coords) \
Member

nit: here and below, per PEP8, prefer using parentheses rather than \ for multi-line expressions, e.g.,

            return (val.assign_coords(**assign_coords)
                    .expand_dims(expand_dims)
                    .stack(**{new_dim: (variable_dim,) + stacking_dims}))

Contributor Author

Sure. done.

.expand_dims(expand_dims) \
.stack(**{new_dim: (variable_dim,) + dims})
.stack(**{new_dim: (variable_dim,) + stacking_dims})

# concatenate the arrays
Xs = [f(self[key]) for key in self.data_vars]
@@ -2749,6 +2757,10 @@ def f(val):
for level in idx.levels[1:]]
new_idx = idx.set_levels(levels)
dataset[new_dim] = IndexVariable(new_dim, new_idx)

if name is not None:
dataset.name = name

return dataset
Member

nit: "data_array" might be a better name for this variable.

Contributor Author

I changed this


def _unstack_once(self, dim):
45 changes: 26 additions & 19 deletions xarray/tests/test_dataset.py
@@ -2405,21 +2405,41 @@ def test_stack_unstack_slow(self):
actual = stacked.isel(z=slice(None, None, -1)).unstack('z')
assert actual.identical(ds[['b']])

def test_to_stacked_array_invalid_sample_dims(self):
data = xr.Dataset(
data_vars={'a': (('x', 'y'), [[0, 1, 2], [3, 4, 5]]),
'b': ('x', [6, 7])},
coords={'y': ['u', 'v', 'w']}
)
with pytest.raises(ValueError):
data.to_stacked_array("features", sample_dims=['y'])

def test_to_stacked_array_name(self):
name = 'adf9d'

# make a two dimensional dataset
a, b = create_test_stacked_array()
D = xr.Dataset({'a': a, 'b': b})
sample_dims = ['x']

y = D.to_stacked_array('features', sample_dims, name=name)
assert y.name == name

def test_to_stacked_array_dtype_dims(self):
# make a two dimensional dataset
a, b = create_test_stacked_array()
D = xr.Dataset({'a': a, 'b': b})
feature_dims = ['y']
y = D.to_stacked_array('features', feature_dims)
sample_dims = ['x']
y = D.to_stacked_array('features', sample_dims)
assert y.indexes['features'].levels[1].dtype == D.y.dtype
assert y.dims == ('x', 'features')

def test_to_stacked_array_to_unstacked_dataset(self):
# make a two dimensional dataset
a, b = create_test_stacked_array()
D = xr.Dataset({'a': a, 'b': b})
feature_dims = ['y']
y = D.to_stacked_array('features', feature_dims)\
sample_dims = ['x']
y = D.to_stacked_array('features', sample_dims)\
.transpose("x", "features")

x = y.to_unstacked_dataset("features")
@@ -2433,26 +2453,13 @@ def test_to_stacked_array_to_unstacked_dataset(self):
def test_to_stacked_array_to_unstacked_dataset_different_dimension(self):
# test when variables have different dimensionality
a, b = create_test_stacked_array()
feature_dims = ['y']
sample_dims = ['x']
D = xr.Dataset({'a': a, 'b': b.isel(y=0)})

y = D.to_stacked_array('features', feature_dims)
y = D.to_stacked_array('features', sample_dims)
x = y.to_unstacked_dataset('features')
assert_identical(D, x)

# another test
ds = D.isel(x=0)
ds_flat = ds.to_stacked_array('features', ['y'])
ds_comp = ds_flat.to_unstacked_dataset('features')
assert_identical(ds, ds_comp)

def test_to_stacked_array_to_unstacked_dataset_scalar(self):
a = xr.DataArray(np.r_[:6], dims=('x', ), coords={'x': np.r_[:6]})
ds = xr.Dataset({'a': a, 'b': 1.0})
ds_flat = ds.to_stacked_array('features', ['x'])
ds_comp = ds_flat.to_unstacked_dataset('features')
assert_identical(ds, ds_comp)

def test_update(self):
data = create_test_data(seed=0)
expected = data.copy()