
API for reshaping DataArrays as 2D "data matrices" for use in machine learning #1317

Closed
nbren12 opened this issue Mar 22, 2017 · 9 comments · Fixed by #1597

@nbren12
Contributor

nbren12 commented Mar 22, 2017

Machine learning and linear algebra problems are often expressed in terms of operations on matrices rather than arrays of arbitrary dimension, and there is currently no convenient way to turn DataArrays (or combinations of DataArrays) into a single "data matrix".

As an example, I have needed to use scikit-learn lately with data from DataArray objects. Scikit-learn requires the data to be expressed as simple 2-dimensional matrices: the rows are called samples, and the columns are known as features. It is annoying and error-prone to transpose and reshape a data array by hand to fit into this format. For instance, this GitHub repo for xarray-aware sklearn-like objects devotes many lines of code to massaging data arrays into data matrices. I think that this reshaping workflow might be common enough to warrant some kind of treatment in xarray.

I have written some code in this gist, which I have found pretty convenient for doing this. The gist has an XRReshaper class which can be used for reshaping data to and from a matrix format. The basic usage for an EOF analysis of a dataset A(lat, lon, time) looks like this:

feature_dims = ['lat', 'lon']

rs = XRReshaper(A)
data_matrix, _ = rs.to(feature_dims)

# Some linear algebra or machine learning
_,_, eofs = svd(data_matrix)

eofs_dataarray = rs.get(eofs[0], ['mode'] + feature_dims)

I am not sure this is the best API, but it seems to work pretty well and I have used it here to implement some xarray-aware sklearn-like objects for PCA, which can be used like

feature_dims = ['lat', 'lon']
pca = XPCA(feature_dims, n_components=10, weight=cos(A.lat))
pca.fit(A)
pca.transform(A)
eofs = pca.components_

Another syntax which might be helpful is some kind of context manager approach like

with XRReshaper(A) as (rs, data_matrix):
    # do some stuff with data_matrix
# use rs to restore output to a data array
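
A minimal sketch of how such a context manager could be built on top of stack() (the reshaped() helper here is hypothetical, just to illustrate the idea):

from contextlib import contextmanager

@contextmanager
def reshaped(da, feature_dims):
    """Hypothetical helper: yield the stacked DataArray and its raw matrix.
    Assumes a single remaining sample dimension, so the data is 2D."""
    stacked = da.stack(features=feature_dims)
    yield stacked, stacked.data

# usage (illustrative):
# with reshaped(A, ['lat', 'lon']) as (rs, data_matrix):
#     ...  # do some stuff with data_matrix
# rs.unstack('features') restores the original layout
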
@fmaussion
Member

I personally have no opinion on the subject, but maybe @ajdawson wants to chime in (as the author of the eofs package which includes xarray support).

@shoyer
Member

shoyer commented Mar 23, 2017

I've written similar code in the past as well, so I would be pretty supportive of adding a utility class for this. Actually one of my colleagues wrote a virtually identical class for our xarray equivalent in TensorFlow -- take a look at it for some possible alternative API options.

For xarray, .stack() and .to_array() (or .to_dataframe()) can do most of the heavy lifting instead of manual reshaping.
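
To illustrate that route, a small sketch (the toy DataArray here is just for the example):

import numpy as np
import xarray as xr

# toy DataArray standing in for a real (time, lat, lon) field
A = xr.DataArray(np.random.rand(5, 3, 4), dims=['time', 'lat', 'lon'])

stacked = A.stack(features=['lat', 'lon'])          # dims become ('time', 'features')
X = stacked.transpose('time', 'features').values    # plain 2D (samples, features) matrix

# ... run sklearn / linear algebra on X ...

# anything still carrying the 'features' dimension unstacks back to (lat, lon)
A_roundtrip = stacked.unstack('features')

# for a Dataset, D.to_array('variable') first concatenates the data variables
# along a new 'variable' dimension, which can then be stacked the same way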

Thanks for the pointer to xlearn, too!

@nbren12
Contributor Author

nbren12 commented Mar 23, 2017

Cool! Thanks for that link. As far as the API is concerned, I think I like the ReshapeCoder approach a little better because it does not require keeping track of a feature_dims list throughout the code, like my class does. It could also generalize beyond just creating a 2D array.

To produce a dataset B(samples, features) from a dataset A(x, y, z, t), how do you feel about a syntax like this:

rs = Reshaper(dict(samples=('t',), features=('x', 'y', 'z')), coords=A.coords)

B = rs.encode(A)


_, _, eofs = svd(B.data)

# eofs is now a 2D dask array so we need to give 
# it dimension information
eof_dims = ['mode', 'features']
rs.decode(eofs, eof_dims)

# to decode an xarray object we don't need to pass dimension info
rs.decode(B)

On the other hand, it would be nice to be able to reshape data through a syntax like

A.reshape.encode(dict(...))
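
Something like that accessor syntax could be prototyped outside xarray with the accessor-registration mechanism; a hypothetical sketch (the 'reshape' accessor name and its encode method are made up for illustration):

import xarray as xr

@xr.register_dataarray_accessor('reshape')   # hypothetical accessor
class ReshapeAccessor:
    def __init__(self, da):
        self._da = da

    def encode(self, dim_map):
        # dim_map like dict(samples=('t',), features=('x', 'y', 'z'))
        return self._da.stack(**{new: list(old) for new, old in dim_map.items()})

# usage (illustrative):
# B = A.reshape.encode(dict(samples=('t',), features=('x', 'y', 'z')))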

@nbren12
Contributor Author

nbren12 commented Mar 23, 2017

I had the chance to play around with stack and unstack, and it appears that these actually do nearly all the work needed here, so you can disregard my last comment. The only logic which is somewhat unwieldy is the code which creates a DataArray from the eofs dask array. Here is a complete example using the air dataset:

import numpy as np
import xarray as xr
from dask.array.linalg import svd_compressed
from xarray.tutorial import load_dataset

air = load_dataset("air_temperature").air

A = air.stack(features=['lat', 'lon']).chunk()
A -= A.mean('features')

_, _, eofs = svd_compressed(A.data, 4)

# wrap eofs in a DataArray, reusing coordinates from A where possible
dims = ['modes', 'features']
coords = {}

for i, dim in enumerate(dims):
    if dim in A.dims:
        coords[dim] = A[dim]
    else:
        coords[dim] = np.arange(eofs.shape[i])

eofs = xr.DataArray(eofs, dims=dims, coords=coords).unstack('features')

This is pretty compact as is, so maybe the ugly final bit could be replaced with a convenience function like unstack_array(eofs, dims, coords) or a method call A.unstack_array(eofs, dims, new_coords={}).
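
A rough sketch of what such a convenience function might look like (the name unstack_array and its signature follow the suggestion above; this is not an existing xarray function):

import numpy as np
import pandas as pd
import xarray as xr

def unstack_array(data, dims, template, new_coords=None):
    """Hypothetical helper: wrap a raw array in a DataArray, reusing coordinates
    from `template` where dimension names match, then unstack any stacked dims."""
    new_coords = new_coords or {}
    coords = {}
    for i, dim in enumerate(dims):
        if dim in template.dims:
            coords[dim] = template[dim]
        elif dim in new_coords:
            coords[dim] = new_coords[dim]
        else:
            coords[dim] = np.arange(data.shape[i])
    out = xr.DataArray(data, dims=dims, coords=coords)
    # unstack any dimension that inherited a MultiIndex from the template
    for dim in dims:
        if dim in template.indexes and isinstance(template.indexes[dim], pd.MultiIndex):
            out = out.unstack(dim)
    return out

# with the example above:
# eofs = unstack_array(eofs, ['modes', 'features'], A)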

@nbren12
Contributor Author

nbren12 commented Sep 18, 2017

@shoyer I wrote a class that does this a while ago.
It is available here: data_matrix.py. It is used like this

# D is a dataset
# the signature for DataMatrix.__init__ is 
# DataMatrix(feature_dims, sample_dims, variables)
mat = DataMatrix(['z'], ['x'], ['a', 'b'])
y = mat.dataset_to_mat(D)
x = mat.mat_to_dataset(y)

One of the problems I had to handle was concatenating/stacking DataArrays with different numbers of dimensions: stack and unstack combined with to_array can only handle the case where the desired feature variables all have the same dimensionality. At the moment my code stacks the desired dimensions for each variable and then manually calls np.hstack to produce the final matrix, but I bet it would be easy to create a pandas Index object which can handle this use case.
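
Roughly, that hstack approach looks like the following (a simplified sketch of the idea, not the actual data_matrix.py code; it assumes a single sample dimension):

import numpy as np

def dataset_to_mat(ds, variables, feature_dims, sample_dim):
    """Sketch: stack each variable's feature dims separately, then hstack the columns.
    Variables may have different feature dimensionality (e.g. ('x', 'z') vs ('x',)),
    which to_array() cannot handle without broadcasting."""
    columns = []
    for name in variables:
        da = ds[name]
        dims_to_stack = [d for d in feature_dims if d in da.dims]
        if dims_to_stack:
            da = da.stack(features=dims_to_stack)
        else:
            da = da.expand_dims('features')   # sample-only variable becomes one column
        columns.append(da.transpose(sample_dim, 'features').values)
    return np.hstack(columns)                 # shape (samples, total features)

# e.g. mat = dataset_to_mat(D, ['a', 'b'], feature_dims=['z'], sample_dim='x')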

Would you be open to a PR along these lines?

@jhamman
Member

jhamman commented Sep 27, 2017

I can see the use of a Dataset to_array/stack method that does not broadcast arrays. Feel free to open a PR and we'll take a look.

@nbren12
Contributor Author

nbren12 commented Oct 19, 2017

After using my own version of this code for the past month or so, it has occurred to me that this API probably will not support stacking arrays with different sizes along shared dimensions. For instance, I need to "stack" humidity below an altitude of 10 km with temperature between 0 and 16 km. IMO, the easiest way to do this would be to change these methods into top-level functions which can take any dict or iterable of DataArrays. We could leave that for a later PR, of course.
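
For example, such a top-level function might look roughly like this (hypothetical sketch; the name stack_arrays is made up):

import numpy as np

def stack_arrays(arrays, sample_dim, feature_dims):
    """Hypothetical sketch: stack a dict of DataArrays that share `sample_dim`
    but may have different sizes along the feature dimensions
    (e.g. humidity on z < 10 km, temperature on 0-16 km)."""
    columns, slices, start = [], {}, 0
    for name, da in arrays.items():
        stacked = da.stack(features=[d for d in feature_dims if d in da.dims])
        col = stacked.transpose(sample_dim, 'features').values
        columns.append(col)
        slices[name] = slice(start, start + col.shape[1])   # bookkeeping for inverting later
        start += col.shape[1]
    return np.hstack(columns), slices

# X, slices = stack_arrays({'qv': qv.sel(z=slice(0, 10e3)),
#                           'temp': temp.sel(z=slice(0, 16e3))},
#                          sample_dim='time', feature_dims=['z'])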

@shoyer
Member

shoyer commented Oct 19, 2017

IMO, the easiest way to do this would be to change these methods into top-level functions which can take any dict or iterable of DataArrays.

👍 for a function- or class-based interface if that makes sense. Can you share a few examples of what using your proposed API would look like?

@nbren12
Contributor Author

nbren12 commented Oct 19, 2017

Sorry. I guess I should have made my last comment in the PR.

shoyer pushed a commit that referenced this issue Jul 5, 2019
* Add stack_cat and unstack_cat methods

This partially resolves #1317.

Change names of methods

stack_cat -> to_stacked_array
unstack_cat -> to_unstacked_dataset

Test that the dtype of the stacked dimensions is preserved

This is not passing at the moment because concatenating None with
a dimension that has values upcasts the combined dtype to object

Fix dtypes of stacked dimensions

This commit ensures that the dtypes of the stacked coordinate match the input dimensions.

Use new index variable rather than patching the old one

I didn't like the inplace modification of a private member.

Handle variable_dim correctly

I also fixed

1. f-string formatting issue
2. Use an OrderedDict as @jhamman recommends

Add documentation to api.rst and reshaping.rst

I also added appropriate See Also sections to the docstrings for
to_stacked_array and to_unstacked_dataset.

Add changes to whats-new

Fixing style errors.

Split up lengthy test

Remove "physical variable" from docs

This is in response to Joe's "nit"

* Fix to_stacked_array with new master

An error arose when checking for the presence of a dimension in an array.
The code 'dim in data' no longer works.
Replaced this with 'dim in data.dims'.

* Move entry in whats-new to most recent release

* Fix code styling errors

It needs to pass `pycodestyle xarray`

* Improve docstring of to_stacked_array

Added example and additional description.

* Move "See Also" section to end of docstring

* Doc and comment improvements.

* Improve documented example

@benbovy pointed out that the old example was confusing.

* Add name argument to to_stacked_array and test

* Allow level argument to be an int or str

* Remove variable_dim argument of to_unstacked_array

* Actually removed variable_dim

* Change function signature of to_stacked_array

Previously, this function was passed a list of dimensions which should be
stacked together. However, @benbovy found that the function failed when the
_non-stacked_ dimensions were not shared across all variables. Thus, it is
easier to specify the dimensions which should remain unchanged, rather than the
dimensions to be stacked.

The function to_stacked_array now takes an argument `sample_dim` which defines
these non-stacked dimensions. If these dims are not shared across all
variables, then an error is raised.

* Fix lint error

The line was too long

* Fix validation and failing tests

1. the test which stacks a scalar and an array doesn't make sense anymore given
the new API.

2. Fixed a bug in the validation code which raised an error almost always.

* Fix typo

* Improve docs and error messages

* Remove extra spaces

* Test warning in to_unstacked_dataset

* Improve formatting and naming

* Fix flake8 error

* Respond to @max-sixty's suggestions
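
For reference, the to_stacked_array / to_unstacked_dataset methods described in the commit message above end up being used roughly like this (a sketch; note that in recent xarray the sample-dimension argument is spelled sample_dims, so check the docs for the exact signature):

import xarray as xr

# toy Dataset with variables of different dimensionality
D = xr.Dataset(
    {'a': (('x', 'z'), [[0, 1, 2], [3, 4, 5]]),
     'b': (('x',), [6, 7])},
    coords={'z': [10, 20, 30]},
)

# stack everything except the sample dimension 'x' into a single 'features' dim
y = D.to_stacked_array('features', sample_dims=['x'])   # dims: ('x', 'features')

# round-trip back to a Dataset
D2 = y.to_unstacked_dataset('features')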