Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rolling: bottleneck still not working properly with dask arrays #3165

Closed
peterhob opened this issue Jul 28, 2019 · 12 comments
Closed

rolling: bottleneck still not working properly with dask arrays #3165

peterhob opened this issue Jul 28, 2019 · 12 comments
Labels
plan to close May be closeable, needs more eyeballs topic-dask topic-rolling

Comments

@peterhob
Copy link

MCVE Code Sample

# Your code here
import numpy as np
import xarray as xr
# from dask.distributed import Client
temp= xr.DataArray(np.zeros((5000, 50000)),dims=("x","y")).chunk({"y":100, })
temp.rolling(x=100).mean()

Expected Output

Problem Description

I was thrilled to find that the new release (both 0.12.2 and 0.12.3) fixed the rolling window issue. However, When I tried, it seems the problem is still there. Previously, the above code runs with bottleneck installed. However, with the new version, with or without bottleneck, it simply gives the memory error as below.

I have tried to use old and new versions of Dask and pandas, but with no much difference. However, the dask Dataframe version of the code (shown below) runs ok.

import dask.dataframe as dd
import dask.array as da
import numpy as np

da_array=da.from_array(np.zeros((5000, 50000)), chunks=(5000,100))
df = dd.from_dask_array(da_array)
df.rolling(window=100,axis=0).mean()

I have also tried to apply the similar thing on dataset from netcdf files, it simply started consuming very large portion of memory and gives the similar errors.

Any help are appreciated.

/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/merge.py:17: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  PANDAS_TYPES = (pd.Series, pd.DataFrame, pd.Panel)
/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/dataarray.py:219: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  elif isinstance(data, pd.Panel):
Traceback (most recent call last):
  File "rolltest.py", line 5, in <module>
    temp.rolling(x=100).mean()
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/rolling.py", line 245, in wrapped_func
    return self.reduce(func, **kwargs)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/rolling.py", line 217, in reduce
    result = windows.reduce(func, dim=rolling_dim, **kwargs)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/dataarray.py", line 1636, in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, **kwargs)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/variable.py", line 1369, in reduce
    input_data = self.data if allow_lazy else self.values
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/variable.py", line 392, in values
    return _as_array_or_item(self._data)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/variable.py", line 213, in _as_array_or_item
    data = np.asarray(data)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/array/core.py", line 1047, in __array__
    x = self.compute()
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/base.py", line 399, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/base.py", line 399, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/array/core.py", line 828, in finalize
    return concatenate3(results)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/array/core.py", line 3621, in concatenate3
    result = np.empty(shape=shape, dtype=dtype(deepfirst(arrays)))
MemoryError

Output of xr.show_versions()

# Paste the output here xr.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.15.0-51-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2

xarray: 0.12.2
pandas: 0.24.2
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.3
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.2.2
distributed: 1.28.1
matplotlib: 3.1.0
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.0.0
pip: 19.1.1
conda: 4.7.5
pytest: None
IPython: 7.5.0
sphinx: None

@shoyer
Copy link
Member

shoyer commented Jul 28, 2019

Have you tried adding more chunking, e.g., along the x dimension? That’s that usual recommendation if you’re running out of memory.

@peterhob
Copy link
Author

peterhob commented Jul 29, 2019

Have you tried adding more chunking, e.g., along the x dimension? That’s that usual recommendation if you’re running out of memory.

Hi Shoyer,

Thanks for your reply and help. However, I have tried various chunks along each and both dimension (like 200 on x dimension, 100 on y dimension; or larger chunks like 2000 on y dimension), it doesn't work.

In both a ubuntu machine with 100 Gb memory and a local windows10 machine, it simply crashed in couple of seconds. Even though it says memory error, the code does not use much memory at all. Also even with the one dimension setup, the temp.data shows that each chunk only takes 4 mb memory (which makes me think it might be too small and then used larger chunks). I also used a new conda environment with clean install of just the necessary libraries, and the problem is still there.

Here is the neat new environment under which I tried again but gives the same errors,

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.15.0-51-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: None libnetcdf: None

xarray: 0.12.3
pandas: 0.25.0
numpy: 1.16.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.1.0
distributed: 2.1.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.0.1
pip: 19.2.1
conda: None
pytest: None
IPython: None
sphinx: None

By the way, the above code seems to work ok with previous 0.12.1 version Xarray and bottleneck.

Cheers,
Joey

@shoyer
Copy link
Member

shoyer commented Jul 29, 2019

Did you try converting np.zeros((5000, 50000) to use dask.array.zeros instead? The former will allocate 2 GB of data within each chunk

@peterhob
Copy link
Author

Did you try converting np.zeros((5000, 50000) to use dask.array.zeros instead? The former will allocate 2 GB of data within each chunk

Thank you for your suggestion. Tried as you suggested, still with same error.

import numpy as np
import xarray as xr
import dask.array as da
# from dask.distributed import Client
temp= xr.DataArray(da.zeros((5000, 50000)),dims=("x","y")).chunk({"y":100, })
temp.rolling(x=100).mean()

I have also tried saving the array to nc file and read it after that. Still rolling gives same error (with or without bottleneck and different chunks). Even though it says memory error, it doesn't consume too much memory.

@shoyer
Copy link
Member

shoyer commented Jul 29, 2019 via email

@peterhob
Copy link
Author

da.zeros((5000, 50000), chu

Tried but same error.

import numpy as np
import xarray as xr
import dask.array as da
temp= xr.DataArray(da.zeros((5000, 50000), chunks=(-1,100)),dims=("x","y"))
temp.rolling(x=100).mean()

Like I said, I have also saved to nc file and read it from disk (as below), but still same error.

import numpy as np
import xarray as xr
import dask.array as da
temp= xr.DataArray(da.zeros((5000, 50000), chunks=(-1,100)),dims=("x","y"))
temp.to_netcdf("temp.nc")
temp.close()
test = xr.open_dataarray("temp.nc",chunks={"y":100,})
test.rolling(x=100).mean()

@shoyer
Copy link
Member

shoyer commented Jul 29, 2019

For context, xarray's rolling window code creates a "virtual dimension" for the rolling window. So if your chunks are size (5000, 100) before the rolling window, they are size (5000, 100, 100) within the rolling window computation. So it's not entirely surprising that there are more issues with memory usage -- these are much bigger arrays, e.g., see

>>> temp.rolling(x=100).construct('window')
<xarray.DataArray (x: 5000, y: 50000, window: 100)>
dask.array<shape=(5000, 50000, 100), dtype=float64, chunksize=(50, 100, 100)>
Dimensions without coordinates: x, y, window

@shoyer
Copy link
Member

shoyer commented Jul 29, 2019

Actually, there does seem to be something fishy going on here. I find that I'm able to execute temp.rolling(x=100).construct('window').mean('window').compute() successfully but not temp.rolling(x=100).mean().compute(), even though that should mostly be equivalent to the former.

@shoyer
Copy link
Member

shoyer commented Jul 29, 2019

I think this triggers a case that dask's scheduler doesn't handle well, related to this issue: dask/dask#874

@peterhob
Copy link
Author

Actually, there does seem to be something fishy going on here. I find that I'm able to execute temp.rolling(x=100).construct('window').mean('window').compute() successfully but not temp.rolling(x=100).mean().compute(), even though that should mostly be equivalent to the former.

Thank you so much for pointing it out. I tried the rollling.construct and it worked! I also tried it on other netcdf files and it sure solved the problem. Thank you so much for your help!

If this is caused by Dask's scheduler and there is no quick fix yet, do you think mention the rolling.construct in the Xarray document as the recommended usage would worth doing? It can help newbies like me a lot.

Cheers,
Joey

@dcherian dcherian changed the title rolling window still not working properly with dask arrays rolling: bottleneck still not working properly with dask arrays Mar 1, 2021
ThomasLecocq added a commit to ROBelgium/MSNoise that referenced this issue Sep 18, 2024
hope it'll work in all cases
pydata/xarray#3165
@max-sixty
Copy link
Collaborator

This seems to be fixed, at least with numbagg installed — the motivating example at the top runs fine. Anything I'm missing / any objections to closing?

@max-sixty max-sixty added the plan to close May be closeable, needs more eyeballs label Sep 25, 2024
@dcherian
Copy link
Contributor

Yes this works fine:

# Your code here
import numpy as np
import xarray as xr
import dask
temp= xr.DataArray(dask.array.zeros((5000, 50000), (-1, 100)),dims=("x","y"))
with xr.set_options(use_numbagg=False):
    temp.rolling(x=100).mean().compute()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to close May be closeable, needs more eyeballs topic-dask topic-rolling
Projects
None yet
Development

No branches or pull requests

4 participants