-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes OS error arising from too many files open #1198
Conversation
The intent of this PR is to address (or at least partially address) the following issues: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very nice to see some progress on this!
xarray/backends/netCDF4_.py
Outdated
# netCDF4 only allows closing the root group | ||
while ds.parent is not None: | ||
ds = ds.parent | ||
if ds.isopen(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put this while loop in a helper function, something like _find_root
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, sounds good.
xarray/backends/api.py
Outdated
@@ -249,6 +248,8 @@ def maybe_decode_store(store, lock=False): | |||
else: | |||
ds2 = ds | |||
|
|||
store.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably guard this behind an option?
xarray/backends/netCDF4_.py
Outdated
self._filename = filename | ||
self._mode = 'a' if mode == 'w' else mode | ||
self._opener = functools.partial(_open_netcdf4_group, filename, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we reuse the same partial created above for opener
, maybe just with a different value for the mode
argument? (would be nice to have less code duplication)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to self._opener = functools.partial(opener, mode=self._mode)
. Is this the best way to do this from previous usage of
self._opener = opener(mode=self._mode)
mode=self._mode,
group=group, clobber=clobber,
diskless=diskless, persist=persist,
format=format)
?
xarray/test/test_backends.py
Outdated
@@ -492,6 +492,21 @@ def create_tmp_file(suffix='.nc', allow_cleanup_failure=False): | |||
if not allow_cleanup_failure: | |||
raise | |||
|
|||
@contextlib.contextmanager | |||
def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why you can't reuse create_tmp_file
internally here?
This would be most straightforwardly done with contextlib.ExitStack
, though we would need a backport to Python 2.7:
https://docs.python.org/3.6/library/contextlib.html#contextlib.ExitStack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shoyer, this makes the code cleaner. However, it may be operating slower because there could be more contexts in play. It does remove redundant code, which is always a plus.
7070fd7
to
349e7db
Compare
@shoyer, all the checks "pass" but there are still errors in the "allowed" list. If you get a change could you please provide me some perspective on whether these are errors on my end or not? I'm not exactly sure how to interpret them. Once I know I have correctness in this code I plan to fix the inlines you graciously highlighted above. I think we are getting close here, assuming that I have enough testing to demonstrate we have accurately fixed the too many open file issue. Any additional ideas you have for tests would be really helpful too. |
@pwolfram the allowed failures are pre-existing, not related to this change. |
Thanks @shoyer, Does that mean if the checks pass the code is at least minimally correct in terms of not breaking previous design choices? E.g., does this imply that we are ok except for cleanup / implementation details on this PR? |
If the checks pass, it means that it doesn't directly break anything that we have tests for. Which should cover most functionality. However, we'll still need to be careful not to introduce performance regressions -- we don't have any automated performance tests yet. |
@shoyer, I just realized this might conflict with #1087. Do you foresee this causing problems and what order do you plan to merge this PR and #1087 (which obviously predates this one...)? We are running into the snag with #463 in our analysis and my personal preference would be to get some type of solution into place sooner than later. Thanks for considering this request. Also, I'm not sure exactly the best way to test performance either. Could we potentially use something like the "toy" test cases for this purpose? Ideally we would have a test case with O(100) files to gain a clearer picture of the performance cost of this PR. Please let me know what you want me to do with this PR-- should I clean it up in anticipation of a merge or just wait for now to see if there are extra things that need fixed via additional testing? Note I have the full scipy, h5netcdf and pynio implementations that can also be reviewed because they weren't available when you did your review yesterday. |
Don't worry about #1087 -- I can rebase it.
…On Wed, Jan 11, 2017 at 8:54 PM Phillip Wolfram ***@***.***> wrote:
@shoyer <https://github.com/shoyer>, I just realized this might conflict
with #1087 <#1087>. Do you foresee
this causing problems and what order do you plan to merge this PR and
#1087 <#1087> (which obviously
predates this one...)? We are running into the snag with #463
<#463> in our analysis and my
personal preference would be to get some type of solution into place sooner
than later. Thanks for considering this request.
Also, I'm not sure exactly the best way to test performance either. Could
we potentially use something like the "toy" test cases for this purpose?
Ideally we would have a test case with O(100) files to gain a clearer
picture of the performance cost of this PR.
Please let me know what you want me to do with this PR-- should I clean it
up in anticipation of a merge or just wait for now to see if there are
extra things that need fixed via additional testing? Note I have the full
scipy, h5netcdf and pynio implementations that can also be reviewed because
they weren't available when you did your review yesterday.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1198 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1o6yrNYYulbAMkLTHaVKLQA0u3Kjks5rRbIEgaJpZM4LfwBQ>
.
|
This should be totally fine without performance or compatibility concerns as long as we set In the long term, it would be nice to handle autoclosing automatically (invoking it when the number of open files exceeds some limit), but we should probably be a little more clever for that. |
Thanks @shoyer. This makes sense. I think the path forward on the next round of edits should include making sure existing tests using Documentation is also obviously required. FYI as a heads up, I probably won't be able to get to this mid-week at the earliest but it appears we are close to a viable solution. |
I appreciate your work on this too-many-files-open error - I think your fixes will add a lot of value to the NetCDF multi-file functionality. In this notebook using K-Means clustering on multi-file NetCDF data sets I have repeatedly experienced the too-many-open files error, even with attempts to adjust via |
@PeterDSteinberg, did this PR fix the issue for you? I obviously need to update it but just wanted to confirm that the current branch resolved the too-many-open files error issue. Also, do you have any idea of the performance impact of these changes I'm proposing? |
5acafb6
to
1460a07
Compare
@shoyer and @PeterDSteinberg I've updated this PR to reflect requested changes. |
There are still a few more issues that need ironed out. I'll let you know when I've resolved them. |
I'm just chiming in to signify my interest in seeing this issue solved. I have just hit "OSError: Too many open files". The data itself is not even huge, but it's scattered across many files and it's a PITA to revert to manual concatenation -- I've grown used to dask doing the work for me ;-) |
@pwolfram this looks pretty close to me now -- let me know when it's ready for review. |
1460a07
to
73b601d
Compare
@shoyer, the pushed code represents my progress. The initial PR had a bug-- essentially a calculation couldn't be performed following the load. This fixes that bug and provides a test to ensure that this doesn't happen. However, I'm having trouble with h5netcdf, which I'm not very familiar with compared to netcdf. This represents my current progress, I just need some more time (or even inspiration from you) to sort out this last key issue... I'm getting the following error: ================================================================================================================== FAILURES ==================================================================================================================
___________________________________________________________________________________________ OpenMFDatasetTest.test_4_open_large_num_files_h5netcdf ___________________________________________________________________________________________
self = <xarray.tests.test_backends.OpenMFDatasetTest testMethod=test_4_open_large_num_files_h5netcdf>
@requires_dask
@requires_h5netcdf
def test_4_open_large_num_files_h5netcdf(self):
> self.validate_open_mfdataset_large_num_files(engine=['h5netcdf'])
xarray/tests/test_backends.py:1040:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xarray/tests/test_backends.py:1018: in validate_open_mfdataset_large_num_files
self.assertClose(ds.foo.sum().values, np.sum(randdata))
xarray/core/dataarray.py:400: in values
return self.variable.values
xarray/core/variable.py:306: in values
return _as_array_or_item(self._data)
xarray/core/variable.py:182: in _as_array_or_item
data = np.asarray(data)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/numpy/core/numeric.py:482: in asarray
return array(a, dtype, copy=False, order=order)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/array/core.py:1025: in __array__
x = self.compute()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/base.py:79: in compute
return compute(self, **kwargs)[0]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/base.py:179: in compute
results = get(dsk, keys, **kwargs)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:537: in get_sync
raise_on_exception=True, **kwargs)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:500: in get_async
fire_task()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:476: in fire_task
callback=queue.put)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:525: in apply_sync
res = func(*args, **kwds)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:268: in execute_task
result = _execute_task(task, data)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:248: in _execute_task
args2 = [_execute_task(a, cache) for a in args]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:248: in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:245: in _execute_task
return [_execute_task(a, cache) for a in arg]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:245: in <listcomp>
return [_execute_task(a, cache) for a in arg]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:249: in _execute_task
return func(*args2)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/array/core.py:52: in getarray
c = a[b]
xarray/core/indexing.py:401: in __getitem__
return type(self)(self.array[key])
xarray/core/indexing.py:376: in __getitem__
return type(self)(self.array, self._updated_key(key))
xarray/core/indexing.py:354: in _updated_key
for size, k in zip(self.array.shape, self.key):
xarray/core/indexing.py:364: in shape
for size, k in zip(self.array.shape, self.key):
xarray/core/utils.py:414: in shape
return self.array.shape
xarray/backends/netCDF4_.py:37: in __getattr__
return getattr(self.datastore.ds.variables[self.var], attr)
../../anaconda/envs/test_env_xarray35/lib/python3.5/contextlib.py:66: in __exit__
next(self.gen)
xarray/backends/h5netcdf_.py:105: in ensure_open
self.close()
xarray/backends/h5netcdf_.py:190: in close
_close_ds(self.ds)
xarray/backends/h5netcdf_.py:70: in _close_ds
find_root(ds).close()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/h5netcdf/core.py:458: in close
self._h5file.close()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/h5py/_hl/files.py:302: in close
self.id.close()
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)
???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)
???
h5py/h5f.pyx:282: in h5py.h5f.FileID.close (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/h5f.c:3905)
???
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)
???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E RuntimeError: dictionary changed size during iteration
h5py/_objects.pyx:119: RuntimeError
============================================================================================ 1 failed, 1415 passed, 95 skipped in 116.54 seconds =============================================================================================
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x10f16e598>
Traceback (most recent call last):
File "/Users/pwolfram/anaconda/envs/test_env_xarray35/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable |
73b601d
to
923473c
Compare
I'll take a look tomorrow. Getting all these backends to behave correctly
and consistently is a constant battle.
…On Sat, Feb 4, 2017 at 9:20 PM Phillip Wolfram ***@***.***> wrote:
@shoyer <https://github.com/shoyer>, the pushed code represents my
progress. The initial PR had a bug-- essentially a calculation couldn't be
performed following the load. This fixes that bug and provides a test to
ensure that this doesn't happen. However, I'm having trouble with h5netcdf,
which I'm not very familiar with compared to netcdf. This represents my
current progress, I just need some more time (or even inspiration from you)
to sort out this last key issue...
I'm getting the following error:
================================================================================================================== FAILURES ==================================================================================================================
___________________________________________________________________________________________ OpenMFDatasetTest.test_4_open_large_num_files_h5netcdf ___________________________________________________________________________________________
self = <xarray.tests.test_backends.OpenMFDatasetTest testMethod=test_4_open_large_num_files_h5netcdf>
@requires_dask
@requires_h5netcdf
def test_4_open_large_num_files_h5netcdf(self):> self.validate_open_mfdataset_large_num_files(engine=['h5netcdf'])
xarray/tests/test_backends.py:1040:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xarray/tests/test_backends.py:1018: in validate_open_mfdataset_large_num_files
self.assertClose(ds.foo.sum().values, np.sum(randdata))
xarray/core/dataarray.py:400: in values
return self.variable.values
xarray/core/variable.py:306: in values
return _as_array_or_item(self._data)
xarray/core/variable.py:182: in _as_array_or_item
data = np.asarray(data)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/numpy/core/numeric.py:482: in asarray
return array(a, dtype, copy=False, order=order)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/array/core.py:1025: in __array__
x = self.compute()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/base.py:79: in compute
return compute(self, **kwargs)[0]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/base.py:179: in compute
results = get(dsk, keys, **kwargs)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:537: in get_sync
raise_on_exception=True, **kwargs)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:500: in get_async
fire_task()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:476: in fire_task
callback=queue.put)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:525: in apply_sync
res = func(*args, **kwds)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:268: in execute_task
result = _execute_task(task, data)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:248: in _execute_task
args2 = [_execute_task(a, cache) for a in args]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:248: in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:245: in _execute_task
return [_execute_task(a, cache) for a in arg]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:245: in <listcomp>
return [_execute_task(a, cache) for a in arg]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:249: in _execute_task
return func(*args2)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/array/core.py:52: in getarray
c = a[b]
xarray/core/indexing.py:401: in __getitem__
return type(self)(self.array[key])
xarray/core/indexing.py:376: in __getitem__
return type(self)(self.array, self._updated_key(key))
xarray/core/indexing.py:354: in _updated_key
for size, k in zip(self.array.shape, self.key):
xarray/core/indexing.py:364: in shape
for size, k in zip(self.array.shape, self.key):
xarray/core/utils.py:414: in shape
return self.array.shape
xarray/backends/netCDF4_.py:37: in __getattr__
return getattr(self.datastore.ds.variables[self.var], attr)
../../anaconda/envs/test_env_xarray35/lib/python3.5/contextlib.py:66: in __exit__
next(self.gen)
xarray/backends/h5netcdf_.py:105: in ensure_open
self.close()
xarray/backends/h5netcdf_.py:190: in close
_close_ds(self.ds)
xarray/backends/h5netcdf_.py:70: in _close_ds
find_root(ds).close()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/h5netcdf/core.py:458: in close
self._h5file.close()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/h5py/_hl/files.py:302: in close
self.id.close()
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)
???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)
???
h5py/h5f.pyx:282: in h5py.h5f.FileID.close (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/h5f.c:3905)
???
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)
???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E RuntimeError: dictionary changed size during iteration
h5py/_objects.pyx:119: RuntimeError
============================================================================================ 1 failed, 1415 passed, 95 skipped in 116.54 seconds =============================================================================================
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x10f16e598>
Traceback (most recent call last):
File "/Users/pwolfram/anaconda/envs/test_env_xarray35/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1198 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1he81D4jBPR8uq8EpiC1dHMBlTbnks5rZVwXgaJpZM4LfwBQ>
.
|
6c40f49
to
3ddf9c1
Compare
Subclasses also work in place of fixtures in many cases (we use them in
much of the existing code).
…On Wed, Mar 22, 2017 at 12:30 PM Phillip Wolfram ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In xarray/tests/test_backends.py
<#1198 (comment)>:
>
with self.assertRaisesRegexp(IOError, 'no files to open'):
- open_mfdataset('foo-bar-baz-*.nc')
+ for close in [True, False]:
I think I understand how to use fixtures for both of these arguments now...
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1198 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1tp6OBPnEY9sseevAKirVhHD0XR1ks5roXbrgaJpZM4LfwBQ>
.
|
@shoyer, if we generally cover test_backends for
or was there some other test that is needed? |
Agreed, that should do it. |
be6ab05
to
beef5af
Compare
@shoyer, that subclass-based approach you outlined worked (fixture parameters really don't work with classes as far as I could tell). We now have more comprehensive, named testing. Note, there was one minor point that required more explicitly specification that arose from the more rigorous testing: The scipy backend can handle objects like |
beef5af
to
8f2fb8c
Compare
@shoyer, I had a minor bug that is now removed. The last caveat no longer applicable:
I'll let you know when tests pass and this is ready for your final review. |
a531b10
to
8f2fb8c
Compare
@shoyer, this is ready for the final review now. Coveralls appears to have hung but other tests pass. |
@shoyer, all tests (including coveralls) passed. Please let me know if you have additional concerns and if we could merge fairly soon, e.g., because of MPAS-Dev/MPAS-Analysis#151 I would really appreciate it. |
xarray/backends/h5netcdf_.py
Outdated
from .netCDF4_ import (_nc4_group, _nc4_values_and_dtype, | ||
_extract_nc4_variable_encoding, BaseNetCDF4Array) | ||
|
||
|
||
class H5NetCDFFArrayWrapper(BaseNetCDF4Array): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spelling: extra F
xarray/backends/scipy_.py
Outdated
@@ -96,28 +110,38 @@ def __init__(self, filename_or_obj, mode='r', format=None, group=None, | |||
raise ValueError('invalid format for scipy.io.netcdf backend: %r' | |||
% format) | |||
|
|||
# if the string ends with .gz, then gunzip and open as netcdf file | |||
if type(filename_or_obj) is str and filename_or_obj.endswith('.gz'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use isinstance(filename_or_obj, basestring)
instead of type(filename_or_obj) is str
.
Also, move this logic into _open_scipy_netcdf
-- otherwise it won't work to reopen a gzipped file.
xarray/backends/scipy_.py
Outdated
version=version) | ||
except TypeError as e: | ||
# TODO: gzipped loading only works with NetCDF3 files. | ||
if 'is not a valid NetCDF 3 file' in e.message: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should only be triggered when reading a gzipped file. Right now, it can be triggered whenever an invalid netCDF3 file is read, gzipped or not.
This should be easier to fix when you move gzip.open
into this helper function (see my comment below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out-- sorry for this sloppiness.
xarray/tests/test_backends.py
Outdated
# autoclose = True | ||
|
||
class OpenMFDatasetTest(TestCase): | ||
autoclose = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need this class variable, since this test class does not use inheritance.
xarray/tests/test_backends.py
Outdated
#class H5NetCDFDataTestAutocloseTrue(H5NetCDFDataTest): | ||
# autoclose = True | ||
|
||
class OpenMFDatasetTest(TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to something like OpenMFDatasetManyFilesTest
(we have other tests of open_mfdataset
)
xarray/tests/test_backends.py
Outdated
@@ -1139,6 +1257,8 @@ def test_dataarray_compute(self): | |||
self.assertTrue(computed._in_memory) | |||
self.assertDataArrayAllClose(actual, computed) | |||
|
|||
class DaskTestAutocloseTrue(DaskTest): | |||
autoclose=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please run some sort of PEP8 check, e.g., git diff upstream/master | flake8 --diff
. There should be extra spaces around the =
sign here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to ignore PEP8 to xarray/core/pycompat.py
because code in there is essentially a copy / paste.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
xarray/tests/test_backends.py
Outdated
|
||
# check that calculation on opened datasets works properly | ||
ds = open_mfdataset(tmpfiles, engine=readengine, | ||
autoclose=self.autoclose) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just set autoclose=True
here.
xarray/tests/test_backends.py
Outdated
# split into multiple sets of temp files | ||
for ii in original.x.values: | ||
( | ||
original.isel(x=slice(ii, ii+1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be slightly more readable with an intermediate variable, or removing the line break on the line above.
Includes testing to demonstrate an OSError associated with opening too many files as encountered using open_mfdataset. Fixed for the following backends: * netCDF4 backend * scipy backend * pynio backend Open/close operations on h5netcdf appear to have an error associated with the h5netcdf library following correspondence with @shoyer. Thus, there are still challenges with h5netcdf; hence, support for h5netcdf is currently disabled. Note, by default `autoclose=False` for open_mfdataset so standard behavior is unchanged unless `autoclose=True`. This choice of default is to select standard xarray performance over general removal of the OSError associated with opening too many files as encountered using open_mfdataset.
8f2fb8c
to
20c5c3b
Compare
OK, in it goes! |
Thanks a bunch @shoyer! |
Previously, DataStore did not judiciously close files, resulting in opening a large number of files that
could result in an OSError related to too many files being open. This merge provides a solution for the netCDF, scipy, and h5netcdf backends.