rolling: bottleneck still not working properly with dask arrays #3165

peterhob · 2019-07-28T00:58:10Z

MCVE Code Sample

# Your code here
import numpy as np
import xarray as xr
# from dask.distributed import Client
temp= xr.DataArray(np.zeros((5000, 50000)),dims=("x","y")).chunk({"y":100, })
temp.rolling(x=100).mean()

Expected Output

Problem Description

I was thrilled to find that the new release (both 0.12.2 and 0.12.3) fixed the rolling window issue. However, When I tried, it seems the problem is still there. Previously, the above code runs with bottleneck installed. However, with the new version, with or without bottleneck, it simply gives the memory error as below.

I have tried to use old and new versions of Dask and pandas, but with no much difference. However, the dask Dataframe version of the code (shown below) runs ok.

import dask.dataframe as dd
import dask.array as da
import numpy as np

da_array=da.from_array(np.zeros((5000, 50000)), chunks=(5000,100))
df = dd.from_dask_array(da_array)
df.rolling(window=100,axis=0).mean()

I have also tried to apply the similar thing on dataset from netcdf files, it simply started consuming very large portion of memory and gives the similar errors.

Any help are appreciated.

/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/merge.py:17: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  PANDAS_TYPES = (pd.Series, pd.DataFrame, pd.Panel)
/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/dataarray.py:219: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version
  elif isinstance(data, pd.Panel):
Traceback (most recent call last):
  File "rolltest.py", line 5, in <module>
    temp.rolling(x=100).mean()
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/rolling.py", line 245, in wrapped_func
    return self.reduce(func, **kwargs)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/rolling.py", line 217, in reduce
    result = windows.reduce(func, dim=rolling_dim, **kwargs)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/dataarray.py", line 1636, in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, **kwargs)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/variable.py", line 1369, in reduce
    input_data = self.data if allow_lazy else self.values
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/variable.py", line 392, in values
    return _as_array_or_item(self._data)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/xarray/core/variable.py", line 213, in _as_array_or_item
    data = np.asarray(data)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/array/core.py", line 1047, in __array__
    x = self.compute()
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/base.py", line 399, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/base.py", line 399, in <listcomp>
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/array/core.py", line 828, in finalize
    return concatenate3(results)
  File "/miniconda3/envs/xarray/lib/python3.7/site-packages/dask/array/core.py", line 3621, in concatenate3
    result = np.empty(shape=shape, dtype=dtype(deepfirst(arrays)))
MemoryError

Output of `xr.show_versions()`

# Paste the output here xr.show_versions() here INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.15.0-51-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2

xarray: 0.12.2
pandas: 0.24.2
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.3
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.2.2
distributed: 1.28.1
matplotlib: 3.1.0
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.0.0
pip: 19.1.1
conda: 4.7.5
pytest: None
IPython: 7.5.0
sphinx: None

The text was updated successfully, but these errors were encountered:

shoyer · 2019-07-28T06:55:43Z

Have you tried adding more chunking, e.g., along the x dimension? That’s that usual recommendation if you’re running out of memory.

peterhob · 2019-07-29T08:55:34Z

Have you tried adding more chunking, e.g., along the x dimension? That’s that usual recommendation if you’re running out of memory.

Hi Shoyer,

Thanks for your reply and help. However, I have tried various chunks along each and both dimension (like 200 on x dimension, 100 on y dimension; or larger chunks like 2000 on y dimension), it doesn't work.

In both a ubuntu machine with 100 Gb memory and a local windows10 machine, it simply crashed in couple of seconds. Even though it says memory error, the code does not use much memory at all. Also even with the one dimension setup, the temp.data shows that each chunk only takes 4 mb memory (which makes me think it might be too small and then used larger chunks). I also used a new conda environment with clean install of just the necessary libraries, and the problem is still there.

Here is the neat new environment under which I tried again but gives the same errors,

Output of `xr.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 21:52:21) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.15.0-51-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: None libnetcdf: None

xarray: 0.12.3
pandas: 0.25.0
numpy: 1.16.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.1.0
distributed: 2.1.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.0.1
pip: 19.2.1
conda: None
pytest: None
IPython: None
sphinx: None

By the way, the above code seems to work ok with previous 0.12.1 version Xarray and bottleneck.

Cheers,
Joey

shoyer · 2019-07-29T16:20:07Z

Did you try converting np.zeros((5000, 50000) to use dask.array.zeros instead? The former will allocate 2 GB of data within each chunk

peterhob · 2019-07-29T22:30:37Z

Did you try converting np.zeros((5000, 50000) to use dask.array.zeros instead? The former will allocate 2 GB of data within each chunk

Thank you for your suggestion. Tried as you suggested, still with same error.

import numpy as np
import xarray as xr
import dask.array as da
# from dask.distributed import Client
temp= xr.DataArray(da.zeros((5000, 50000)),dims=("x","y")).chunk({"y":100, })
temp.rolling(x=100).mean()

I have also tried saving the array to nc file and read it after that. Still rolling gives same error (with or without bottleneck and different chunks). Even though it says memory error, it doesn't consume too much memory.

shoyer · 2019-07-29T22:33:56Z

You want to use the chunks argument *inside* da.zeros, e.g., da.zeros((5000, 50000), chunks=100).

…

On Mon, Jul 29, 2019 at 3:30 PM peterhob ***@***.***> wrote: Did you try converting np.zeros((5000, 50000) to use dask.array.zeros instead? The former will allocate 2 GB of data within each chunk Thank you for your suggestion. Tried as you suggested, still with same error. import numpy as npimport xarray as xrimport dask.array as da# from dask.distributed import Client temp= xr.DataArray(da.zeros((5000, 50000)),dims=("x","y")).chunk({"y":100, }) temp.rolling(x=100).mean() I have also tried saving the array to nc file and read it after that. Still rolling gives same error (with or without bottleneck and different chunks). Even though it says memory error, it doesn't consume too much memory. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3165?email_source=notifications&email_token=AAJJFVT7OTCOARO4WQZBCODQB5VQ5A5CNFSM4IHLGUAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3CGFKY#issuecomment-516186795>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAJJFVXZRBM7FOHY4NAT53TQB5VQ5ANCNFSM4IHLGUAA> .

peterhob · 2019-07-29T22:48:54Z

da.zeros((5000, 50000), chu

Tried but same error.

import numpy as np
import xarray as xr
import dask.array as da
temp= xr.DataArray(da.zeros((5000, 50000), chunks=(-1,100)),dims=("x","y"))
temp.rolling(x=100).mean()

Like I said, I have also saved to nc file and read it from disk (as below), but still same error.

import numpy as np
import xarray as xr
import dask.array as da
temp= xr.DataArray(da.zeros((5000, 50000), chunks=(-1,100)),dims=("x","y"))
temp.to_netcdf("temp.nc")
temp.close()
test = xr.open_dataarray("temp.nc",chunks={"y":100,})
test.rolling(x=100).mean()

shoyer · 2019-07-29T22:59:48Z

For context, xarray's rolling window code creates a "virtual dimension" for the rolling window. So if your chunks are size (5000, 100) before the rolling window, they are size (5000, 100, 100) within the rolling window computation. So it's not entirely surprising that there are more issues with memory usage -- these are much bigger arrays, e.g., see

>>> temp.rolling(x=100).construct('window')
<xarray.DataArray (x: 5000, y: 50000, window: 100)>
dask.array<shape=(5000, 50000, 100), dtype=float64, chunksize=(50, 100, 100)>
Dimensions without coordinates: x, y, window

shoyer · 2019-07-29T23:00:37Z

Actually, there does seem to be something fishy going on here. I find that I'm able to execute temp.rolling(x=100).construct('window').mean('window').compute() successfully but not temp.rolling(x=100).mean().compute(), even though that should mostly be equivalent to the former.

shoyer · 2019-07-29T23:05:57Z

I think this triggers a case that dask's scheduler doesn't handle well, related to this issue: dask/dask#874

peterhob · 2019-07-30T02:33:23Z

Actually, there does seem to be something fishy going on here. I find that I'm able to execute temp.rolling(x=100).construct('window').mean('window').compute() successfully but not temp.rolling(x=100).mean().compute(), even though that should mostly be equivalent to the former.

Thank you so much for pointing it out. I tried the rollling.construct and it worked! I also tried it on other netcdf files and it sure solved the problem. Thank you so much for your help!

If this is caused by Dask's scheduler and there is no quick fix yet, do you think mention the rolling.construct in the Xarray document as the recommended usage would worth doing? It can help newbies like me a lot.

Cheers,
Joey

hope it'll work in all cases pydata/xarray#3165

max-sixty · 2024-09-25T23:23:35Z

This seems to be fixed, at least with numbagg installed — the motivating example at the top runs fine. Anything I'm missing / any objections to closing?

dcherian · 2024-09-25T23:48:23Z

Yes this works fine:

# Your code here
import numpy as np
import xarray as xr
import dask
temp= xr.DataArray(dask.array.zeros((5000, 50000), (-1, 100)),dims=("x","y"))
with xr.set_options(use_numbagg=False):
    temp.rolling(x=100).mean().compute()

dcherian added the topic-dask label Jul 31, 2019

dcherian added the topic-rolling label Feb 16, 2021

This was referenced Mar 1, 2021

Use numpy & dask sliding_window_view for rolling #4977

Merged

fix bottleneck + Dask 1D rolling operations #4980

Closed

dcherian changed the title ~~rolling window still not working properly with dask arrays~~ rolling: bottleneck still not working properly with dask arrays Mar 1, 2021

ThomasLecocq added a commit to ROBelgium/MSNoise that referenced this issue Sep 18, 2024

rolling & bottleneck fix

f8da7e3

hope it'll work in all cases pydata/xarray#3165

max-sixty added the plan to close May be closeable, needs more eyeballs label Sep 25, 2024

dcherian closed this as completed Sep 25, 2024

phofl mentioned this issue Nov 12, 2024

Use map_overlap for rolling reductions with Dask #9770

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rolling: bottleneck still not working properly with dask arrays #3165

rolling: bottleneck still not working properly with dask arrays #3165

peterhob commented Jul 28, 2019

shoyer commented Jul 28, 2019

peterhob commented Jul 29, 2019 •

edited

Loading

shoyer commented Jul 29, 2019

peterhob commented Jul 29, 2019

shoyer commented Jul 29, 2019 via email

peterhob commented Jul 29, 2019

shoyer commented Jul 29, 2019

shoyer commented Jul 29, 2019

shoyer commented Jul 29, 2019

peterhob commented Jul 30, 2019

max-sixty commented Sep 25, 2024

dcherian commented Sep 25, 2024

rolling: bottleneck still not working properly with dask arrays #3165

rolling: bottleneck still not working properly with dask arrays #3165

Comments

peterhob commented Jul 28, 2019

MCVE Code Sample

Expected Output

Problem Description

Output of xr.show_versions()

shoyer commented Jul 28, 2019

peterhob commented Jul 29, 2019 • edited Loading

Output of xr.show_versions()

shoyer commented Jul 29, 2019

peterhob commented Jul 29, 2019

shoyer commented Jul 29, 2019 via email

peterhob commented Jul 29, 2019

shoyer commented Jul 29, 2019

shoyer commented Jul 29, 2019

shoyer commented Jul 29, 2019

peterhob commented Jul 30, 2019

max-sixty commented Sep 25, 2024

dcherian commented Sep 25, 2024

Output of `xr.show_versions()`

peterhob commented Jul 29, 2019 •

edited

Loading

Output of `xr.show_versions()`