API for reshaping DataArrays as 2D "data matrices" for use in machine learning #1317
Comments
I've written similar code in the past as well, so I would be pretty supportive of adding a utility class for this. Actually, one of my colleagues wrote a virtually identical class for our xarray equivalent in TensorFlow -- take a look at it for some possible alternative API options. Thanks for the pointer to xlearn, too!
Cool! Thanks for that link. As far as the API is concerned, I think I like the `Reshaper`-style approach. To produce a data matrix from a dataset and decode the results:

```python
rs = Reshaper(dict(samples=('t',), features=('x', 'y', 'z')), coords=A.coords)
B = rs.encode(A)
_, _, eofs = svd(B.data)
# eofs is now a 2D dask array, so we need to give
# it dimension information
eof_dims = ['mode', 'features']
rs.decode(eofs, eof_dims)
# to decode an xarray object we don't need to pass dimension info
rs.decode(B)
```

On the other hand, it would be nice to be able to reshape data through a more direct syntax.
I had the chance to play around with this approach:

```python
air = load_dataset("air_temperature").air
A = air.stack(features=['lat', 'lon']).chunk()
A -= A.mean('features')
_, _, eofs = svd_compressed(A.data, 4)

# wrap eofs in a DataArray
dims = ['modes', 'features']
coords = {}
for i, dim in enumerate(dims):
    if dim in A.dims:
        coords[dim] = A[dim]
    else:
        coords[dim] = np.arange(eofs.shape[i])
eofs = xr.DataArray(eofs, dims=dims, coords=coords).unstack('features')
```

This is pretty compact as is, so maybe the ugly final bit could be replaced with a convenience function.
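The "ugly final bit" is essentially a "wrap this bare array like that template" operation. A minimal sketch of what such a convenience function could look like (the name `wrap_like` is hypothetical, not an xarray API):

```python
import numpy as np
import xarray as xr

def wrap_like(data, dims, template):
    """Wrap a bare array in a DataArray, reusing any coordinates that
    `template` already defines for the given dims; dims unknown to the
    template get a simple integer-range coordinate."""
    coords = {}
    for i, dim in enumerate(dims):
        if dim in template.coords:
            coords[dim] = template[dim]
        else:
            coords[dim] = np.arange(data.shape[i])
    return xr.DataArray(data, dims=dims, coords=coords)
```

With `A` as above, the final block would shrink to something like `wrap_like(eofs, ['modes', 'features'], A).unstack('features')` (modulo how recent xarray versions handle re-attaching a stacked MultiIndex coordinate).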
@shoyer I wrote a class that does this a while ago:

```python
# D is a dataset
# the signature for DataMatrix.__init__ is
# DataMatrix(feature_dims, sample_dims, variables)
mat = DataMatrix(['z'], ['x'], ['a', 'b'])
y = mat.dataset_to_mat(D)
x = mat.mat_to_dataset(y)
```

One of the problems I had to handle was concatenating/stacking DataArrays with different numbers of dimensions. Would you be open to a PR along these lines?
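For readers following along, here is a self-contained sketch of what such a `DataMatrix` could look like. This is an illustration built on `stack`/`concat`, assuming every listed variable carries all of the sample and feature dims; it is not the commenter's actual implementation:

```python
import numpy as np
import xarray as xr

class DataMatrix:
    """Flatten selected variables of a Dataset into a 2D (samples, features)
    matrix, and invert the operation."""

    def __init__(self, feature_dims, sample_dims, variables):
        self.feature_dims = list(feature_dims)
        self.sample_dims = list(sample_dims)
        self.variables = list(variables)

    def dataset_to_mat(self, D):
        # remember each variable's dimension sizes for the inverse transform
        self._sizes = {v: dict(D[v].sizes) for v in self.variables}
        pieces = []
        for v in self.variables:
            s = (D[v]
                 .stack(samples=self.sample_dims, features=self.feature_dims)
                 .reset_index(['samples', 'features'], drop=True)
                 .transpose('samples', 'features'))
            pieces.append(s)
        return xr.concat(pieces, dim='features')

    def mat_to_dataset(self, y):
        y = np.asarray(y)
        out, start = {}, 0
        for v in self.variables:
            sizes = self._sizes[v]
            sshape = [sizes[d] for d in self.sample_dims]
            fshape = [sizes[d] for d in self.feature_dims]
            n = int(np.prod(fshape))
            block = y[:, start:start + n].reshape(sshape + fshape)
            out[v] = xr.DataArray(block, dims=self.sample_dims + self.feature_dims)
            start += n
        return xr.Dataset(out)
```

The `reset_index(..., drop=True)` step discards the stacked MultiIndex so the per-variable blocks can be concatenated along `features`; the inverse relies on the remembered shapes rather than on `unstack`.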
I can see the use of a Dataset to_array/stack method that does not broadcast arrays. Feel free to open a PR and we'll take a look.
After using my own version of this code for the past month or so, it has occurred to me that this API probably will not support stacking arrays with different sizes along shared dimensions. For instance, I need to "stack" humidity below an altitude of 10 km with temperature between 0 and 16 km. IMO, the easiest way to do this would be to change these methods into top-level functions which can take any dict or iterable of DataArrays. We could leave that for a later PR, of course.
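To make that concrete, a rough sketch of such a top-level function (the name `stack_features` and its behavior of simply dropping the stacked index are illustrative assumptions):

```python
import numpy as np
import xarray as xr

def stack_features(arrays, sample_dims, new_dim='features'):
    """Stack an iterable of DataArrays into one (samples, features) array,
    even when the non-sample dims have different sizes -- e.g. humidity
    below 10 km next to temperature from 0-16 km."""
    pieces = []
    for a in arrays:
        extra = [d for d in a.dims if d not in sample_dims]
        s = a.stack({new_dim: extra})
        # drop the stacked MultiIndex so differently sized inputs concat cleanly
        s = s.reset_index(new_dim, drop=True)
        pieces.append(s)
    return xr.concat(pieces, dim=new_dim)
```

Because the stacked index is dropped, two arrays that share a `z` dimension of different lengths can still be concatenated along the new `features` dimension, as long as their sample dims line up.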
👍 for a function or class based interface if that makes sense. Can you share a few examples of what using your proposed API would look like?
Sorry. I guess I should have made my last comment in the PR. |
* Add stack_cat and unstack_cat methods. This partially resolves #1317.
  - Change names of methods: stack_cat -> to_stacked_array, unstack_cat -> to_unstacked_dataset.
  - Test that the dtype of the stacked dimensions is preserved. This was not passing at the time because concatenating None with a dimension that has values upcasts the combined dtype to object.
  - Fix dtypes of stacked dimensions: ensure that the dtypes of the stacked coordinate match the input dimensions.
  - Use a new index variable rather than patching the old one (I didn't like the in-place modification of a private member).
  - Handle variable_dim correctly. Also fixed an f-string formatting issue and used an OrderedDict as @jhamman recommends.
  - Add documentation to api.rst and reshaping.rst, plus appropriate "See Also" sections in the docstrings for to_stacked_array and to_unstacked_dataset.
  - Add changes to whats-new.
  - Fix style errors.
  - Split up a lengthy test.
  - Remove "physical variable" from the docs, in response to Joe's "nit".
* Fix to_stacked_array with new master. An error arose when checking for the presence of a dimension in an array: `dim in data` no longer works, so it was replaced with `dim in data.dims`.
* Move entry in whats-new to the most recent release.
* Fix code styling errors (it needs to pass `pycodestyle xarray`).
* Improve docstring of to_stacked_array: added an example and additional description.
* Move "See Also" section to the end of the docstring.
* Doc and comment improvements.
* Improve the documented example: @benbovy pointed out that the old example was confusing.
* Add name argument to to_stacked_array, with a test.
* Allow the level argument to be an int or str.
* Remove variable_dim argument of to_unstacked_array.
* Actually removed variable_dim.
* Change the function signature of to_stacked_array. Previously, the function was passed the list of dimensions to be stacked together, but @benbovy found that it failed when the non-stacked dimensions were not shared across all variables. It is easier to specify the dimensions which should remain unchanged, so to_stacked_array now takes a `sample_dims` argument defining these non-stacked dimensions; if these dims are not shared across all variables, an error is raised.
* Fix lint error (the line was too long).
* Fix validation and failing tests: (1) the test which stacks a scalar and an array doesn't make sense anymore given the new API; (2) fixed a bug in the validation code which raised an error almost always.
* Fix typo.
* Improve docs and error messages.
* Remove extra spaces.
* Test warning in to_unstacked_dataset.
* Improve formatting and naming.
* Fix flake8 error.
* Respond to @max-sixty's suggestions.
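The interface that ultimately came out of this PR is `Dataset.to_stacked_array` and `DataArray.to_unstacked_dataset`. A quick usage example of that final API:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'a': (('x', 'y'), np.arange(6).reshape(2, 3)),
                 'b': ('x', [0.1, 0.2])},
                coords={'y': ['u', 'v', 'w']})

# 'x' stays as the sample dimension; all other dims are stacked into 'features'
arr = ds.to_stacked_array('features', sample_dims=['x'])
print(arr.shape)  # (2, 4): three columns from 'a' plus one from 'b'

# round trip back to a Dataset
ds2 = arr.to_unstacked_dataset('features')
```

Note that `b`, which lacks the `y` dimension, is still stacked in as a single feature column, which is exactly the non-broadcasting behavior discussed above.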
Machine learning and linear algebra problems are often expressed in terms of operations on matrices rather than arrays of arbitrary dimension, and there is currently no convenient way to turn DataArrays (or combinations of DataArrays) into a single "data matrix".
As an example, I have needed to use scikit-learn lately with data from DataArray objects. Scikit-learn requires the data to be expressed as simple 2-dimensional matrices, where the rows are called samples and the columns are known as features. It is annoying and error-prone to transpose and reshape a data array by hand to fit this format. For instance, this github repo for xarray-aware sklearn-like objects devotes many lines of code to massaging data arrays into data matrices. I think that this reshaping workflow might be common enough to warrant some kind of treatment in xarray.
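For instance, getting a gridded field into scikit-learn's expected shape by hand looks something like this (the dimension names here are illustrative):

```python
import numpy as np
import xarray as xr

# a (time, lat, lon) field; scikit-learn wants an (n_samples, n_features) matrix
da = xr.DataArray(np.random.rand(10, 4, 5), dims=('time', 'lat', 'lon'))

# rows = time samples, columns = flattened (lat, lon) features
X = da.stack(features=('lat', 'lon')).transpose('time', 'features')
print(X.shape)  # (10, 20)
```

`X.values` can then be handed to an sklearn estimator, and `X.unstack('features')` undoes the flattening; the annoyance is having to repeat this boilerplate (and keep the dimension bookkeeping straight) in every analysis.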
I have written some code in this gist that I have found pretty convenient for doing this. The gist has an `XRReshaper` class which can be used for reshaping data to and from a matrix format, and it shows the basic usage for an EOF analysis of a dataset `A(lat, lon, time)`. I am not sure this is the best API, but it seems to work pretty well, and I have used it here to implement some xarray-aware sklearn-like objects for PCA.
Another syntax which might be helpful is some kind of context-manager approach.