
ENH: Support using opened netCDF4.Dataset (Fixes #1459) #1508

Merged: 1 commit merged into pydata:master on Aug 31, 2017

Conversation

@dopplershift (Contributor) commented Aug 16, 2017

Make the filename argument to NetCDF4DataStore polymorphic so that a
Dataset can be passed in.

#1459 discussed adding an alternate constructor (i.e. a class method) to NetCDF4DataStore to allow this, which would be my preferred approach rather than making a filename polymorphic (via isinstance). Unfortunately, alternate constructors only work by taking one set of parameters (or setting defaults) and then passing them to the original constructor. Given that, there's no way to make an alternate constructor without also making the original constructor somehow aware of this functionality--or breaking backwards-compatibility. I'm open to suggestions to the contrary.
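The constraint described above can be illustrated with a toy example (stand-in names, not xarray's actual code): a classmethod alternate constructor still funnels everything through `__init__`, so `__init__` must itself recognize an already open dataset, one way or another.

```python
class FakeDataset:
    """Stand-in for netCDF4.Dataset."""
    def __init__(self, path):
        self.path = path


class Store:
    def __init__(self, filename):
        # The original constructor assumes `filename` is a path. To accept a
        # dataset here, __init__ itself has to check for one -- which is
        # exactly the polymorphism the classmethod was meant to avoid.
        if isinstance(filename, FakeDataset):
            self.ds = filename                 # reuse the open dataset
        else:
            self.ds = FakeDataset(filename)    # "open" the file

    @classmethod
    def from_dataset(cls, dataset):
        # An alternate constructor can only rearrange its arguments; it
        # still has to call __init__ in the end.
        return cls(dataset)


existing = FakeDataset('test.nc')
store = Store.from_dataset(existing)
assert store.ds is existing  # reused, not re-opened
```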

@@ -182,7 +182,10 @@ def _extract_nc4_variable_encoding(variable, raise_on_invalid=False,
 def _open_netcdf4_group(filename, mode, group=None, **kwargs):
     import netCDF4 as nc4

-    ds = nc4.Dataset(filename, mode=mode, **kwargs)
+    if isinstance(filename, nc4.Dataset):
+        ds = filename
Member:

One potential gotcha with reusing an existing netCDF4 dataset is that we disable automatic mask-and-scale on each variable (see below in this function).
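The side effect in question can be sketched with stub objects standing in for netCDF4's (`set_auto_maskandscale` is the real netCDF4 variable method): the store walks every variable and disables automatic mask-and-scale, which mutates the Dataset the caller handed in.

```python
class StubVariable:
    """Stand-in for netCDF4.Variable."""
    def __init__(self):
        self.maskandscale = True  # netCDF4 enables this by default

    def set_auto_maskandscale(self, flag):
        self.maskandscale = flag


class StubDataset:
    """Stand-in for netCDF4.Dataset."""
    def __init__(self):
        self.variables = {'t': StubVariable(), 'p': StubVariable()}


ds = StubDataset()  # imagine this is the caller's already open Dataset

# Mirror of the loop referred to above: disable automatic mask-and-scale
# on every variable. Note this mutates the dataset in place.
for var in ds.variables.values():
    var.set_auto_maskandscale(False)

assert not any(v.maskandscale for v in ds.variables.values())
```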

Contributor Author:

Not sure why that would be a problem. It's transmuting state, but IMO if you're handing xarray a Dataset, you kinda want it to do its thing...

Member:

Let's just make it very clear in the docs.

    if isinstance(filename, nc4.Dataset):
        ds = filename
    else:
        ds = nc4.Dataset(filename, mode=mode, **kwargs)

    with close_on_error(ds):
Member:

We don't want to automatically close already open netCDF4 datasets if there's a problem loading a group.

Contributor Author:

Ok, I'll see about that when I refactor.

@@ -197,6 +200,9 @@ class NetCDF4DataStore(WritableCFDataStore, DataStorePickleMixin):
     """Store for reading and writing data via the Python-NetCDF4 library.

     This store supports NetCDF3, NetCDF4 and OpenDAP datasets.

+    `filename` can be an already opened netCDF4 ``Dataset``; this will not
+    support pickling.
Member:

Maybe:

> `filename` can be an already opened netCDF4 ``Dataset`` or ``Group``. In this case, the other netCDF4-specific arguments are ignored, and the DataStore object cannot be pickled.

@@ -210,9 +216,9 @@ def __init__(self, filename, mode='r', format='NETCDF4', group=None,
     self.ds = opener()
Member:

In theory, I think we could support pickling if we were passed all the right keyword arguments in addition to the existing netCDF4.Dataset object, and used the filename/group from the input argument as arguments to opener. But I'm not entirely sure that's worth doing.

It's probably simpler for now to explicitly raise an error for the cases where we would re-use an opener, namely when pickling and/or using autoclose=True. Can you add that? Right now I would guess that using either of those options could result in somewhat opaque errors.
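A minimal sketch of the kind of guard being requested (function and class names here are hypothetical, not xarray's API): fail fast when an option that would require re-opening the file is combined with an already open dataset.

```python
def check_reopen_options(filename, autoclose=False):
    """Hypothetical guard: reject options that need to re-open the file
    when `filename` is already an open dataset (duck-typed here by the
    presence of a `.variables` attribute)."""
    if hasattr(filename, 'variables') and autoclose:
        raise ValueError(
            'autoclose=True cannot be used with an already open '
            'netCDF4.Dataset: the store has no way to re-open it')


class FakeOpenDataset:
    variables = {}


check_reopen_options('data.nc', autoclose=True)   # a path is fine
check_reopen_options(FakeOpenDataset())           # open dataset, default opts

try:
    check_reopen_options(FakeOpenDataset(), autoclose=True)
except ValueError as err:
    print('rejected early:', err)
```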

Contributor Author:

So I can catch autoclose, but am I trying to throw an error in __getstate__ and __setstate__ for pickle? Not sure where to go for that part of the problem.

Member:

Actually, netCDF4.Dataset has a __reduce__ method that raises NotImplementedError, so maybe this isn't necessary.
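This can be checked with plain pickle: any object whose class defines a `__reduce__` that raises will make pickling of anything holding it fail loudly, so no extra `__getstate__` guard is needed for that path. (The classes below are stand-ins mimicking the behaviour described above, not netCDF4 itself.)

```python
import pickle


class Unpicklable:
    """Mimics netCDF4.Dataset's refusal to be pickled."""
    def __reduce__(self):
        raise NotImplementedError('this object cannot be pickled')


class Holder:
    """Any object carrying an Unpicklable attribute, e.g. a data store."""
    def __init__(self):
        self.ds = Unpicklable()


try:
    # Pickling Holder pickles its __dict__, which reaches the Unpicklable
    # attribute and triggers its __reduce__.
    pickle.dumps(Holder())
except NotImplementedError as err:
    print('pickling failed loudly:', err)
```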

@@ -55,6 +55,12 @@ Enhancements
   (:issue:`576`).
   By `Stephan Hoyer <https://github.com/shoyer>`_.

+- Support using an existing, opened netCDF4 ``Dataset`` with
+  :py:class:`~xarray.backends.NetCDF4DataStore`. This permits creating an
Member:

What do you think about moving the current constructor logic into a classmethod `NetCDF4DataStore.open(filename, mode, format, group, writer, clobber, diskless, persist, autoclose)`?

And adjusting the constructor: `NetCDF4DataStore.__init__(self, netcdf4_dataset, opener=None, writer=None, autoclose=None)`.

Right now, I don't think anyone is using the NetCDF4DataStore constructor directly -- there's literally no good reason for that. This also gives us a pattern we could use for other constructors (e.g., pydap) where passing in an existing object is desirable.
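The proposed split might look roughly like this (a simplified sketch with a stand-in `fake_open` helper, not the final xarray code): the constructor only wraps an already open dataset, while all file-opening logic lives in the classmethod.

```python
def fake_open(filename, mode):
    """Stand-in for nc4.Dataset(filename, mode=mode, ...)."""
    return {'filename': filename, 'mode': mode}


class NetCDF4DataStoreSketch:
    def __init__(self, netcdf4_dataset, opener=None, writer=None,
                 autoclose=False):
        # The constructor never touches the filesystem itself.
        self.ds = netcdf4_dataset
        self._opener = opener
        self._autoclose = autoclose

    @classmethod
    def open(cls, filename, mode='r'):
        # All file-opening logic lives here instead of in __init__.
        ds = fake_open(filename, mode)
        return cls(ds, opener=lambda: fake_open(filename, mode))


# Normal path: open by filename via the classmethod.
store = NetCDF4DataStoreSketch.open('obs.nc')
assert store.ds['filename'] == 'obs.nc'

# New path: hand an existing "dataset" straight to the constructor.
existing = fake_open('obs.nc', 'r')
assert NetCDF4DataStoreSketch(existing).ds is existing
```

The same shape generalizes to other backends (e.g., pydap), which is the pattern being suggested above.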

Contributor Author:

I'm happy to refactor like that--I just didn't think it was on the table.

@dopplershift (Contributor Author):
I've taken a crack at refactoring the constructor to make that take the dataset instance and have an open class method that opens the dataset. It was a little more effort to break everything apart (owing to opener) but I think what I have is ok. I'd appreciate feedback before I go and update the whats new and rebase to clean up.

    self._opener = functools.partial(opener, mode=self._mode)
    super(NetCDF4DataStore, self).__init__(writer)

    ds = nc4.Dataset(filename, mode=mode, format=format, clobber=clobber,
Member:

Can you just re-use opener here instead? e.g., ds = opener() like how it was in __init__ before?

Contributor Author:

So that was my attempt at not duplicating the work needed in the opener, the constructor, and our new alternate constructor (open):

  1. Open the filename and get a dataset
  2. Find the group
  3. Set the variables

If we drop the group (as discussed above) then all the constructor needs to do is turn off scale and mask. This would duplicate turning it off in opener--are you ok with the duplication of the "work" (not the code to do so)?

Member:

> If we drop the group (as discussed above) then all the constructor needs to do is turn off scale and mask. This would duplicate turning it off in opener--are you ok with the duplication of the "work" (not the code to do so)?

Yes, this seems OK but not ideal. (In my ideal world, we would check to see if the netCDF4.Dataset has auto-mask-and-scale turned on, and raise an error in that case rather than silently converting it. But I don't think that's possible with netCDF4-Python's current API.)

     if format is None:
         format = 'NETCDF4'
-    opener = functools.partial(_open_netcdf4_group, filename, mode=mode,
+    opener = functools.partial(_open_netcdf4_group, filename,
Member:

I think this is missing mode.

Contributor Author:

I was looking in common.py and it seems like mode gets overridden and thus was not really needed:

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['ds']
        if self._mode == 'w':
            # file has already been created, don't override when restoring
            state['_mode'] = 'a'
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.ds = self._opener(mode=self._mode)

Happy to defer to your judgement.
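The `__getstate__`/`__setstate__` pattern quoted in the comment above can be exercised end to end with a stub store: a store opened with mode='w' pickles itself as mode='a', so restoring from the pickle re-opens (appends to) the existing file instead of truncating it.

```python
import pickle


class PicklableStore:
    """Stub store reproducing the mode handling quoted above."""

    def __init__(self, filename, mode):
        self._filename = filename
        self._mode = mode
        self.ds = self._opener(mode=self._mode)

    def _opener(self, mode):
        # Stand-in for re-opening the file; a real 'w' would truncate it.
        return {'filename': self._filename, 'mode': mode}

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['ds']  # the open dataset itself cannot be pickled
        if self._mode == 'w':
            # file has already been created, don't override when restoring
            state['_mode'] = 'a'
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.ds = self._opener(mode=self._mode)


restored = pickle.loads(pickle.dumps(PicklableStore('out.nc', 'w')))
assert restored._mode == 'a'       # 'w' was downgraded on pickling
assert restored.ds['mode'] == 'a'  # restore appends, never truncates
```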

Contributor Author:

I guess given how much the test suite likes my refactor, I've messed something up though... 😁

-    def __init__(self, filename, mode='r', format='NETCDF4', group=None,
+    def __init__(self, netcdf4_dataset, mode='r', group=None, writer=None,
+                 opener=None, autoclose=False):
+        self.ds = _get_netcdf4_group(netcdf4_dataset, group)
Member:

Why not pass in a netCDF4 group directly and drop the group argument?

Contributor Author:

So anyone using this part of the interface who wants a specific group needs to pass the group itself in, rather than its name? I can live with that.

Member:

Yes, exactly.
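What "passing the group directly" saves can be sketched as follows (stub objects and a hypothetical `resolve_group` helper, loosely modeled on the '/'-separated group lookup discussed in this thread): the name-resolution walk moves out of the store and onto the caller.

```python
class StubGroup:
    """Stand-in for a netCDF4 Group/Dataset node."""
    def __init__(self, groups=None):
        self.groups = groups or {}


def resolve_group(ds, group):
    """Hypothetical helper: walk a '/'-separated group path."""
    if group in (None, '', '/'):
        return ds
    for part in group.strip('/').split('/'):
        ds = ds.groups[part]
    return ds


root = StubGroup({'model': StubGroup({'run1': StubGroup()})})

# Old interface: pass the root plus a name; the store resolves it.
by_name = resolve_group(root, '/model/run1')

# Proposed interface: the caller resolves the group and passes the object.
by_object = root.groups['model'].groups['run1']

assert by_name is by_object
```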

@dopplershift (Contributor Author):

Ok, so I think we're closer now. The tests at least pass on my machine now. 😁

@dopplershift mentioned this pull request Aug 28, 2017 (13 tasks)
@jhamman modified the milestone (0.10) Aug 28, 2017
@shoyer mentioned this pull request Aug 29, 2017 (4 tasks)
@shoyer (Member) left a comment:

Generally looks good to me, just a few minor changes.

    self.is_remote = is_remote_uri(filename)
    self._filename = filename
    self._mode = 'a' if mode == 'w' else mode
    self._opener = functools.partial(opener, mode=self._mode)
Member:

This logic on these two lines with self._mode seems a little redundant / strange, but I'm concerned that it might be important to avoid overwriting files when pickling a datastore or using autoclose. Can you restore it to __init__? I would rather tackle this clean-up in another PR.

Contributor Author:

Done, although calling partial on opener is conditional on opener existing, since it's None in the case of an existing Dataset.
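The conditional being described might look like this (a simplified sketch with a hypothetical helper name, not the actual diff): bind `mode` into the opener only when there is an opener to bind it into.

```python
import functools


def bind_mode(opener, mode):
    """Hypothetical helper: bind `mode` into the opener, if there is one.

    When the store wraps an already open Dataset, `opener` is None and
    there is nothing to bind.
    """
    return functools.partial(opener, mode=mode) if opener is not None else None


assert bind_mode(None, 'r') is None  # existing-Dataset case: no opener

opened = bind_mode(lambda mode: ('opened', mode), 'r')
assert opened() == ('opened', 'r')
```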

@@ -762,6 +762,18 @@ def test_0dimensional_variable(self):
         expected = Dataset({'x': ((), 123)})
         self.assertDatasetIdentical(expected, ds)

+    def test_read_open_dataset(self):
Member:

For more clarity: test_already_open_dataset

Contributor Author:

Done.

Commit: Make the filename argument to NetCDF4DataStore polymorphic so that a Dataset can be passed in.

@dopplershift (Contributor Author):
I went ahead and rebased on master and squashed down the WIP commits. I suspect this is ready now.

@dopplershift (Contributor Author):

Looks like test failures are those mentioned in #1540

@shoyer shoyer merged commit b190501 into pydata:master Aug 31, 2017
@shoyer (Member) commented Aug 31, 2017

Thanks @dopplershift !
