
API for reshaping DataArrays as 2D "data matrices" for use in machine learning #1317

Closed
nbren12 opened this issue Mar 22, 2017 · 9 comments · Fixed by #1597

@nbren12
Contributor

nbren12 commented Mar 22, 2017

Machine learning and linear algebra problems are often expressed in terms of operations on matrices rather than arrays of arbitrary dimension, and there is currently no convenient way to turn DataArrays (or combinations of DataArrays) into a single "data matrix".

As an example, I have needed to use scikit-learn lately with data from DataArray objects. Scikit-learn requires the data to be expressed as simple 2-dimensional matrices: the rows are called samples, and the columns are known as features. It is annoying and error-prone to transpose and reshape a data array by hand to fit into this format. For instance, this GitHub repo for xarray-aware sklearn-like objects devotes many lines of code to massaging data arrays into data matrices. I think that this reshaping workflow might be common enough to warrant some kind of treatment in xarray.

I have written some code in this gist, which I have found pretty convenient for doing this. The gist has an XRReshaper class which can be used for reshaping data to and from a matrix format. The basic usage for an EOF analysis of a dataset A(lat, lon, time) looks like this:

feature_dims = ['lat', 'lon']

rs = XRReshaper(A)
data_matrix, _ = rs.to(feature_dims)

# Some linear algebra or machine learning
_,_, eofs = svd(data_matrix)

eofs_dataarray = rs.get(eofs[0], ['mode'] + feature_dims)

I am not sure this is the best API, but it seems to work pretty well and I have used it here to implement some xarray-aware sklearn-like objects for PCA, which can be used like

feature_dims = ['lat', 'lon']
pca = XPCA(feature_dims, n_components=10, weight=cos(A.lat))
pca.fit(A)
pca.transform(A)
eofs = pca.components_

Another syntax which might be helpful is some kind of context manager approach like

with XRReshaper(A) as (rs, data_matrix):
    # do some stuff with data_matrix
# use rs to restore output to a data array
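
A minimal sketch of how such a context manager could be built on top of stack() (the reshaped() helper here is hypothetical, just to illustrate the idea):

from contextlib import contextmanager

@contextmanager
def reshaped(da, feature_dims):
    """Hypothetical helper: yield the stacked DataArray and its raw matrix.
    Assumes a single remaining sample dimension, so the data is 2D."""
    stacked = da.stack(features=feature_dims)
    yield stacked, stacked.data

# usage (illustrative):
# with reshaped(A, ['lat', 'lon']) as (rs, data_matrix):
#     ...  # do some stuff with data_matrix
# rs.unstack('features') restores the original layout
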
@fmaussion
Member

I personally have no opinion on the subject, but maybe @ajdawson wants to chime in (as the author of the eofs package which includes xarray support).

@shoyer
Member

shoyer commented Mar 23, 2017

I've written similar code in the past as well, so I would be pretty supportive of adding a utility class for this. Actually one of my colleagues wrote a virtually identical class for our xarray equivalent in TensorFlow -- take a look at it for some possible alternative API options.

For xarray, .stack() and .to_array() (or .to_dataframe()) can do most of the heavy lifting instead of manual reshaping.
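
To illustrate that route, a small sketch (the toy DataArray here is just for the example):

import numpy as np
import xarray as xr

# toy DataArray standing in for a real (time, lat, lon) field
A = xr.DataArray(np.random.rand(5, 3, 4), dims=['time', 'lat', 'lon'])

stacked = A.stack(features=['lat', 'lon'])          # dims become ('time', 'features')
X = stacked.transpose('time', 'features').values    # plain 2D (samples, features) matrix

# ... run sklearn / linear algebra on X ...

# anything still carrying the 'features' dimension unstacks back to (lat, lon)
A_roundtrip = stacked.unstack('features')

# for a Dataset, D.to_array('variable') first concatenates the data variables
# along a new 'variable' dimension, which can then be stacked the same way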

Thanks for the pointer to xlearn, too!

@nbren12
Contributor Author

nbren12 commented Mar 23, 2017

Cool! Thanks for that link. As far as the API is concerned, I think I like the ReshapeCoder approach a little better because it does not require keeping track of a feature_dims list throughout the code, like my class does. It could also generalize beyond just creating a 2D array.

To produce a dataset B(samples, features) from a dataset A(x, y, z, t), how do you feel about a syntax like this:

rs = Reshaper(dict(samples=('t',), features=('x', 'y', 'z')), coords=A.coords)

B = rs.encode(A)


_, _, eofs = svd(B.data)

# eofs is now a 2D dask array so we need to give 
# it dimension information
eof_dims = ['mode', 'features']
rs.decode(eofs, eof_dims)

# to decode an xarray object we don't need to pass dimension info
rs.decode(B)

On the other hand, it would be nice to be able to reshape data through a syntax like

A.reshape.encode(dict(...))
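
Something like that accessor syntax could be prototyped outside xarray with the accessor-registration mechanism; a hypothetical sketch (the 'reshape' accessor name and its encode method are made up for illustration):

import xarray as xr

@xr.register_dataarray_accessor('reshape')   # hypothetical accessor
class ReshapeAccessor:
    def __init__(self, da):
        self._da = da

    def encode(self, dim_map):
        # dim_map like dict(samples=('t',), features=('x', 'y', 'z'))
        return self._da.stack(**{new: list(old) for new, old in dim_map.items()})

# usage (illustrative):
# B = A.reshape.encode(dict(samples=('t',), features=('x', 'y', 'z')))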

@nbren12
Contributor Author

nbren12 commented Mar 23, 2017

I had the chance to play around with stack and unstack, and it appears that these actually do nearly all the work needed here, so you can disregard my last comment. The only logic which is somewhat unwieldy is the code which creates a DataArray from the eofs dask array. Here is a complete example using the air dataset:

import numpy as np
import xarray as xr
from dask.array.linalg import svd_compressed
from xarray.tutorial import load_dataset

air = load_dataset("air_temperature").air

A = air.stack(features=['lat', 'lon']).chunk()
A -= A.mean('features')

_, _, eofs = svd_compressed(A.data, 4)

# wrap eofs in a DataArray, reusing coordinates from A where possible
dims = ['modes', 'features']
coords = {}

for i, dim in enumerate(dims):
    if dim in A.dims:
        coords[dim] = A[dim]
    else:
        coords[dim] = np.arange(eofs.shape[i])

eofs = xr.DataArray(eofs, dims=dims, coords=coords).unstack('features')

This is pretty compact as is, so maybe the ugly final bit could be replaced with a convenience function like unstack_array(eofs, dims, coords) or a method call A.unstack_array(eofs, dims, new_coords={}).
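
A rough sketch of what such a convenience function might look like (the name unstack_array and its signature follow the suggestion above; this is not an existing xarray function):

import numpy as np
import pandas as pd
import xarray as xr

def unstack_array(data, dims, template, new_coords=None):
    """Hypothetical helper: wrap a raw array in a DataArray, reusing coordinates
    from `template` where dimension names match, then unstack any stacked dims."""
    new_coords = new_coords or {}
    coords = {}
    for i, dim in enumerate(dims):
        if dim in template.dims:
            coords[dim] = template[dim]
        elif dim in new_coords:
            coords[dim] = new_coords[dim]
        else:
            coords[dim] = np.arange(data.shape[i])
    out = xr.DataArray(data, dims=dims, coords=coords)
    # unstack any dimension that inherited a MultiIndex from the template
    for dim in dims:
        if dim in template.indexes and isinstance(template.indexes[dim], pd.MultiIndex):
            out = out.unstack(dim)
    return out

# with the example above:
# eofs = unstack_array(eofs, ['modes', 'features'], A)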

@nbren12
Contributor Author

nbren12 commented Sep 18, 2017

@shoyer I wrote a class that does this a while ago.
It is available here: data_matrix.py. It is used like this

# D is a dataset
# the signature for DataMatrix.__init__ is 
# DataMatrix(feature_dims, sample_dims, variables)
mat = DataMatrix(['z'], ['x'], ['a', 'b'])
y = mat.dataset_to_mat(D)
x = mat.mat_to_dataset(y)

One of the problems I had to handle was concatenating/stacking DataArrays with different numbers of dimensions: stack and unstack combined with to_array can only handle the case where the desired feature variables all have the same dimensionality. At the moment my code stacks the desired dimensions for each variable and then manually calls np.hstack to produce the final matrix, but I bet it would be easy to create a pandas Index object which can handle this use case.
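
Roughly, that hstack approach looks like the following (a simplified sketch of the idea, not the actual data_matrix.py code; it assumes a single sample dimension):

import numpy as np

def dataset_to_mat(ds, variables, feature_dims, sample_dim):
    """Sketch: stack each variable's feature dims separately, then hstack the columns.
    Variables may have different feature dimensionality (e.g. ('x', 'z') vs ('x',)),
    which to_array() cannot handle without broadcasting."""
    columns = []
    for name in variables:
        da = ds[name]
        dims_to_stack = [d for d in feature_dims if d in da.dims]
        if dims_to_stack:
            da = da.stack(features=dims_to_stack)
        else:
            da = da.expand_dims('features')   # sample-only variable becomes one column
        columns.append(da.transpose(sample_dim, 'features').values)
    return np.hstack(columns)                 # shape (samples, total features)

# e.g. mat = dataset_to_mat(D, ['a', 'b'], feature_dims=['z'], sample_dim='x')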

Would you be open to a PR along these lines?

@jhamman
Member

jhamman commented Sep 27, 2017

I can see the use of a Dataset to_array/stack method that does not broadcast arrays. Feel free to open a PR and we'll take a look.

@nbren12
Contributor Author

nbren12 commented Oct 19, 2017

After using my own version of this code for the past month or so, it has occurred to me that this API probably will not support stacking arrays with different sizes along shared dimensions. For instance, I need to "stack" humidity below an altitude of 10 km with temperature between 0 and 16 km. IMO, the easiest way to do this would be to change these methods into top-level functions which can take any dict or iterable of DataArrays. We could leave that for a later PR, of course.
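
For example, such a top-level function might look roughly like this (hypothetical sketch; the name stack_arrays is made up):

import numpy as np

def stack_arrays(arrays, sample_dim, feature_dims):
    """Hypothetical sketch: stack a dict of DataArrays that share `sample_dim`
    but may have different sizes along the feature dimensions
    (e.g. humidity on z < 10 km, temperature on 0-16 km)."""
    columns, slices, start = [], {}, 0
    for name, da in arrays.items():
        stacked = da.stack(features=[d for d in feature_dims if d in da.dims])
        col = stacked.transpose(sample_dim, 'features').values
        columns.append(col)
        slices[name] = slice(start, start + col.shape[1])   # bookkeeping for inverting later
        start += col.shape[1]
    return np.hstack(columns), slices

# X, slices = stack_arrays({'qv': qv.sel(z=slice(0, 10e3)),
#                           'temp': temp.sel(z=slice(0, 16e3))},
#                          sample_dim='time', feature_dims=['z'])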

@shoyer
Member

shoyer commented Oct 19, 2017

IMO, the easiest way to do this would be to change these methods into top-level functions which can take any dict or iterable of DataArrays.

👍 for a function- or class-based interface if that makes sense. Can you share a few examples of what using your proposed API would look like?

@nbren12
Contributor Author

nbren12 commented Oct 19, 2017

Sorry. I guess I should have made my last comment in the PR.

shoyer pushed a commit that referenced this issue Jul 5, 2019
* Add stack_cat and unstack_cat methods

This partially resolves #1317.

Change names of methods

stack_cat -> to_stacked_array
unstack_cat -> to_unstacked_dataset

Test that the dtype of the stacked dimensions is preserved

This is not passing at the moment because concatenating None with
a dimension that has values upcasts the combined dtype to object

Fix dtypes of stacked dimensions

This commit ensures that the dtypes of the stacked coordinate match the input dimensions.

Use new index variable rather than patching the old one

I didn't like the inplace modification of a private member.

Handle variable_dim correctly

I also fixed

1. f-string formatting issue
2. Use an OrderedDict as @jhamman recommends

Add documentation to api.rst and reshaping.rst

I also added appropriate See Also sections to the docstrings for
to_stacked_array and to_unstacked_dataset.

Add changes to whats-new

Fixing style errors.

Split up lengthy test

Remove "physical variable" from docs

This is in response to Joe's "nit"

* Fix to_stacked_array with new master

An error arose when checking for the presence of a dimension in an array.
The code 'dim in data' no longer works.
Replaced this with 'dim in data.dims'.

* Move entry in whats-new to most recent release

* Fix code styling errors

It needs to pass `pycodestyle xarray`

* Improve docstring of to_stacked_array

Added example and additional description.

* Move "See Also" section to end of docstring

* Doc and comment improvements.

* Improve documented example

@benbovy pointed out that the old example was confusing.

* Add name argument to to_stacked_array and test

* Allow level argument to be an int or str

* Remove variable_dim argument of to_unstacked_array

* Actually removed variable_dim

* Change function signature of to_stacked_array

Previously, this function was passed a list of dimensions which should be
stacked together. However, @benbovy found that the function failed when the
_non-stacked_ dimensions were not shared across all variables. Thus, it is
easier to specify the dimensions which should remain unchanged, rather than the
dimensions to be stacked.

The function to_stacked_array now takes an argument `sample_dim` which defines
these non-stacked dimensions. If these dims are not shared across all
variables, then an error is raised.

* Fix lint error

The line was too long

* Fix validation and failing tests

1. the test which stacks a scalar and an array doesn't make sense anymore given
the new API.

2. Fixed a bug in the validation code which raised an error almost always.

* Fix typo

* Improve docs and error messages

* Remove extra spaces

* Test warning in to_unstacked_dataset

* Improve formatting and naming

* Fix flake8 error

* Respond to @max-sixty's suggestions
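
For reference, the to_stacked_array / to_unstacked_dataset methods described in the commit message above end up being used roughly like this (a sketch; note that in recent xarray the sample-dimension argument is spelled sample_dims, so check the docs for the exact signature):

import xarray as xr

# toy Dataset with variables of different dimensionality
D = xr.Dataset(
    {'a': (('x', 'z'), [[0, 1, 2], [3, 4, 5]]),
     'b': (('x',), [6, 7])},
    coords={'z': [10, 20, 30]},
)

# stack everything except the sample dimension 'x' into a single 'features' dim
y = D.to_stacked_array('features', sample_dims=['x'])   # dims: ('x', 'features')

# round-trip back to a Dataset
D2 = y.to_unstacked_dataset('features')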