
save/load DataArray to numpy npz functions #768

Closed · jonathanstrong opened this issue Feb 17, 2016 · 11 comments

@jonathanstrong commented Feb 17, 2016

hey -

Apologies if this is bad form: I wanted to pass this along but don't have time to do a proper pull request.

I have found pickle to be really problematic for serializing data, so I wrote these two functions to save a DataArray to numpy's binary npz format and read it back. In general, the numpy format is much less likely to bomb when you try to load it on another computer because of some unseen dependency. If there's interest, I could probably add this as a serialization method on DataArray in the next month or so.

import numpy as np
import xray

def to_npz(da, file_or_buffer):
    # 'dims' and 'values' are reserved as key names in the archive
    if 'dims' in da.dims:
        raise ValueError('Can\'t use "dims" as a dim name.')
    if 'values' in da.dims:
        raise ValueError('Can\'t use "values" as a dim name.')
    arrays = {}
    arrays['dims'] = da.dims
    # save each dimension's index as its own array in the archive
    for dim in da.dims:
        arrays[dim] = da.indexes[dim]
    arrays['values'] = da.values
    np.savez(file_or_buffer, **arrays)

def from_npz(file_or_buffer):
    data = np.load(file_or_buffer)
    assert hasattr(data, 'keys'), \
        "np.load returned a {}, not a dict-like object".format(type(data))
    assert 'dims' in data, 'Can\'t locate "dims" key in file'
    assert 'values' in data, 'Can\'t locate "values" key in file'
    for dimname in data['dims']:
        assert dimname in data, 'Can\'t locate "{}" key in file'.format(dimname)
    # rebuild the coords mapping from the per-dimension arrays
    coords = dict(zip(data['dims'], [data[dimname] for dimname in data['dims']]))
    return xray.DataArray(data['values'], dims=data['dims'], coords=coords)

It's pretty speedy; here is an example for a (3, 4, 5) shaped DataArray:

In [42]:
def save_and_load_again(da):
    # npz is a binary format, so the file must be opened in binary mode
    with open('/path/to/datarray.npz', 'wb') as f:
        to_npz(da, f)
    with open('/path/to/datarray.npz', 'rb') as f:
        a = from_npz(f)
    return a
%time (save_and_load_again(da) == da).all()
CPU times: user 12.6 ms, sys: 0 ns, total: 12.6 ms
Wall time: 26.2 ms
Out[42]:
<xray.DataArray ()>
array(True, dtype=bool)
@shoyer (Member) commented Feb 18, 2016

This is a pretty reasonable way to save data, but my only concern is that it's not clear to me that we need another file format when netCDF already solves this problem, in a completely portable way. Have you tried using xarray's netCDF IO?
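For reference, a minimal round trip with the built-in netCDF IO looks roughly like this (the file name is illustrative; the module was still named xray at the time and is now xarray):

import xray

ds = xray.Dataset({'foo': (('x', 'y'), [[1, 2], [3, 4]])})
ds.to_netcdf('example.nc')               # write the Dataset to a netCDF file
ds2 = xray.open_dataset('example.nc')    # lazily read it back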

@jonathanstrong (Author) commented

I hadn't, for a number of reasons. First, I've used csv, hdf, sql, json, yaml, and other formats, but as someone who isn't working in the physical sciences I had never come across netCDF until using this library. Second, the documentation on netCDF is fairly dense. Third, I didn't want to deal with installing the library.

I just did try it, and it seems great for Datasets. As far as I can tell there is no way to save DataArrays directly, though?

Finally, I would note that pandas has IO methods for csv, excel, hdf, sql, json, msgpack, html, gbq, stata, "clipboard", and pickle. I think it's a strength to offer more choices.

@shoyer (Member) commented Feb 20, 2016

> I hadn't, for a number of reasons. First, I've used csv, hdf, sql, json, yaml, and other formats, but as someone who isn't working in the physical sciences I had never come across netCDF until using this library. Second, the documentation on netCDF is fairly dense. Third, I didn't want to deal with installing the library.

OK, these are all fair points. Though you probably already have SciPy installed, which is enough for basic netCDF support.
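A sketch of what reading through SciPy looks like (note that SciPy's backend only handles the classic netCDF3 format):

# engine='scipy' uses scipy.io.netcdf instead of the netCDF4 C library
ds = xray.open_dataset('example.nc', engine='scipy')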

> I just did try it, and it seems great for Datasets. As far as I can tell there is no way to save DataArrays directly, though?

This is true. But converting a DataArray to a Dataset is quite simple: arr.to_dataset(name='foo'), so I'm not sure it's worth adding.
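A sketch of that round trip (the variable name 'foo' is arbitrary):

da = xray.DataArray([[1, 2], [3, 4]], dims=('x', 'y'))
da.to_dataset(name='foo').to_netcdf('arr.nc')   # wrap in a Dataset, then save
da2 = xray.open_dataset('arr.nc')['foo']        # pull the array back out by name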

> Finally, I would note that pandas has IO methods for csv, excel, hdf, sql, json, msgpack, html, gbq, stata, "clipboard", and pickle. I think it's a strength to offer more choices.

Yes, choice is good -- but also note that none of those are file formats invented for pandas! I am slightly wary of going down this path, because at the point at which you have a file format that can faithfully represent every xarray object, you have basically reinvented netCDF :).

That said, something like JSON is generally useful enough (with a different niche than netCDF) that it could make sense to add IO support.

@jonathanstrong (Author) commented

hey,

So, after using netCDF for a few days, I'm definitely not looking back. This is great. By way of background, I am building a way to integrate storage of arbitrary arrays into an otherwise highly-structured schema. After my attempt to use PostgreSQL arrays flamed out (too slow, even at the raw SQL level), I moved on to saving a file path in my schema. I thought it would be sensible to keep everything as pure ndarrays for simplicity. After trying out netCDF, I bit the bullet and wrote constructors for numpy, pandas, and xarray types, and it's working great.

Looking back, I think the documentation could use some work to help people like me, who haven't used netCDF, realize how good it is.

If you look at the docs, they start with pickle, which for me is kind of a red flag, since in my experience pickle is the world's flakiest persistence method (it always has dependency issues). Then the netCDF section starts with:

"Currently, the only disk based serialization format that xarray directly supports is netCDF."

I read this and think: OK, so the IO is not really there yet. It reads like an apology that there aren't more choices.

"netCDF is a file format for fully self-described datasets that is widely used in the geosciences and supported on almost all platforms."

Hmm...geosciences...who knows what those people are doing? I'm also generally suspicious of academics when it comes to code.

"We use netCDF because xarray was based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects."

Ok, so it's easy for you. What about me?

I've been a bit over the top, but you can see how someone who doesn't use netCDF might read this and think they need to write their own IO functions.

If it were me, I would start off selling how great this format is for xarray. Like, "netcdf is a blazing-fast, binary data format that allows transparent, self-describing persistence with zero of the dependency issues you get with pickle or other formats. It allows xarray Datasets to be saved intact and even used in out-of-core computations for larger-than-memory arrays."

Or something like that.

Finally, regarding DataArray not having its own method to save: I think this is a deficiency that is easily solved. Getting into this library, I started with just DataArrays. Now that I am using Datasets, I can see how powerful they are. But at first the simpler DataArrays were all I was using, and they had no direct IO.

To solve this, you could use a "magic" string for DataArrays. On save, to_netcdf converts the DataArray to a Dataset under the magic key. On load, the load function recognizes the magic string and breaks that DataArray back out to return it directly. A sketch of the idea follows below.

I think that would be a quick, relatively painless way to give DataArrays equal footing with Datasets.
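A minimal sketch of the idea (the sentinel name and helper functions here are made up for illustration, not part of any xarray API):

MAGIC_NAME = '__xray_dataarray__'  # hypothetical sentinel, not a real xarray constant

def dataarray_to_netcdf(da, path):
    # wrap the DataArray in a single-variable Dataset under the sentinel name
    da.to_dataset(name=MAGIC_NAME).to_netcdf(path)

def open_dataarray(path):
    ds = xray.open_dataset(path)
    if MAGIC_NAME in ds:
        # a file written by dataarray_to_netcdf: unwrap and return the array
        return ds[MAGIC_NAME]
    raise ValueError('not a DataArray file: {}'.format(path))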

Anyway - my two cents. I am a huge fan of this library and am happy to chip in on any of the above if desired. Thanks for your hard work on it.

@darothen commented

Hi @jonathanstrong,

Just thought it would be useful to point out that netCDF is maintained by Unidata, a branch of the University Corporation for Atmospheric Research. In fact, netCDF-4 is essentially built on top of HDF5, a much more widely known file format with first-class support in many tools, including an I/O layer in pandas. While it would certainly be great to "sell" netCDF as a format in the documentation, those of us who still have to write netCDF-based I/O modules for our Fortran models might throw up a little in our mouths when we do so...

@shoyer (Member) commented Feb 22, 2016

@jonathanstrong this is really helpful feedback! You are right to be suspicious of academics when it comes to file formats :) If you have concrete suggestions for doc improvements along these lines, please do put together a PR!

I've thought about the "magic name" approach, too -- my only concern is that it would be weird to get a DataArray back from xarray.open_dataset. But maybe xarray.open is a better name, anyways...

@dopplershift (Contributor) commented

cc @WardF

@max-sixty (Collaborator) commented

I'd vote for something format-specific, such as xr.from_netcdf, unless open / open_dataset supports other formats...

@jhamman (Member) commented Feb 23, 2016

@jonathanstrong - Thanks for the input. I agree, we could spice up our IO docs. Like you, I think it makes sense to play down the pickle serialization method.

@MaximilianR

> unless open / open_dataset supports other formats...

It does. From the docs:

> Formats supported by PyNIO
>
> xarray can also read GRIB, HDF4 and other file formats supported by PyNIO, if PyNIO is installed. To use PyNIO to read such files, supply engine='pynio' to xarray.open_dataset.
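A sketch of that usage (assuming PyNIO is installed; the file name is illustrative):

# engine='pynio' routes the read through PyNIO instead of the netCDF library
ds = xray.open_dataset('forecast.grib', engine='pynio')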

@max-sixty (Collaborator) commented

@jhamman nice!

@fmaussion (Member) commented

Closing this partly via #1169 and in favor of #1154
