rolling: bottleneck still not working properly with dask arrays #3165
Comments
Have you tried adding more chunking, e.g., along the x dimension? That's the usual recommendation if you're running out of memory.
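As a minimal sketch of that suggestion (the array shape matches the example discussed in this thread; the chunk sizes here are illustrative):

```python
import dask.array as da
import xarray as xr

# Chunk along x as well as y so each block handed to the rolling
# computation stays small; nothing is computed until .compute() is called.
temp = xr.DataArray(da.zeros((5000, 50000), chunks=(500, 500)), dims=("x", "y"))
result = temp.rolling(x=100).mean()
```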
Hi Shoyer, thanks for your reply and help. However, I have tried various chunk sizes along each dimension and along both (for example 200 along x, 100 along y, or larger chunks like 2000 along y), and it doesn't work. On both an Ubuntu machine with 100 GB of memory and a local Windows 10 machine, it simply crashes within a couple of seconds. Even though it reports a memory error, the code does not use much memory at all. Also, even with the one-dimension setup, temp.data shows that each chunk takes only 4 MB of memory (which made me think the chunks might be too small, so I then tried larger ones). I also created a new conda environment with a clean install of just the necessary libraries, and the problem is still there. Here is the clean new environment under which I tried again and got the same errors; output of `xr.show_versions()`:
Did you try converting `np.zeros((5000, 50000))` to use `dask.array.zeros` instead? The former will allocate 2 GB of data within each chunk.
Thank you for your suggestion. I tried as you suggested, still with the same error.

```python
import numpy as np
import xarray as xr
import dask.array as da
# from dask.distributed import Client

temp = xr.DataArray(da.zeros((5000, 50000)), dims=("x", "y")).chunk({"y": 100})
temp.rolling(x=100).mean()
```

I have also tried saving the array to a netCDF file and reading it back. rolling still gives the same error (with or without bottleneck and with different chunks). Even though it says memory error, it doesn't consume much memory.
You want to use the `chunks` argument *inside* `da.zeros`, e.g., `da.zeros((5000, 50000), chunks=100)`.
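For illustration, one way to confirm the chunk layout this produces (a sketch, reusing the shapes from this thread):

```python
import dask.array as da
import xarray as xr

# Chunks declared when the dask array is created, rather than rechunking afterwards.
temp = xr.DataArray(da.zeros((5000, 50000), chunks=100), dims=("x", "y"))
print(temp.data.chunksize)  # (100, 100)
print(temp.chunks)          # block sizes per dimension
```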
Tried that, but same error.

```python
import numpy as np
import xarray as xr
import dask.array as da

temp = xr.DataArray(da.zeros((5000, 50000), chunks=(-1, 100)), dims=("x", "y"))
temp.rolling(x=100).mean()
```

Like I said, I have also saved it to a netCDF file and read it back from disk (as below), but still the same error.

```python
import numpy as np
import xarray as xr
import dask.array as da

temp = xr.DataArray(da.zeros((5000, 50000), chunks=(-1, 100)), dims=("x", "y"))
temp.to_netcdf("temp.nc")
temp.close()
test = xr.open_dataarray("temp.nc", chunks={"y": 100})
test.rolling(x=100).mean()
```
For context, xarray's rolling window code creates a "virtual dimension" for the rolling window. So if your chunks are size (5000, 100) before the rolling window, they are size (5000, 100, 100) within the rolling window computation. So it's not entirely surprising that there are more issues with memory usage -- these are much bigger arrays.
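A sketch of how that virtual dimension can be made explicit with `rolling(...).construct(...)` (the window dimension name below is arbitrary):

```python
import dask.array as da
import xarray as xr

temp = xr.DataArray(da.zeros((5000, 50000), chunks=(-1, 100)), dims=("x", "y"))

# construct() exposes the rolling window as an extra, lazily built dimension,
# so a (5000, 100) chunk becomes a (5000, 100, 100) block inside the computation.
windowed = temp.rolling(x=100).construct("window")
print(windowed.shape)   # (5000, 50000, 100)
print(windowed.chunks)
```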
Actually, there does seem to be something fishy going on here. I find that I'm able to execute …
I think this triggers a case that dask's scheduler doesn't handle well, related to this issue: dask/dask#874
Thank you so much for pointing it out. I tried `rolling.construct` and it worked! I also tried it on other netCDF files, and it did solve the problem. Thank you so much for your help! If this is caused by dask's scheduler and there is no quick fix yet, do you think it would be worth mentioning `rolling.construct` in the xarray documentation as the recommended usage? It could help newbies like me a lot. Cheers
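For reference, a sketch of the `construct`-based workaround described above (shapes and chunk sizes follow the earlier examples; the window dimension name is arbitrary):

```python
import dask.array as da
import xarray as xr

temp = xr.DataArray(da.zeros((5000, 50000), chunks=(-1, 100)), dims=("x", "y"))

# Build the rolling window lazily as an extra dimension, then reduce over it;
# this sidesteps the scheduling pattern that blew up memory in rolling().mean().
result = temp.rolling(x=100).construct("window").mean("window")
computed = result.compute()
```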
This seems to be fixed, at least with numbagg installed; the motivating example at the top runs fine. Anything I'm missing / any objections to closing?
Yes this works fine:
MCVE Code Sample
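The code sample did not survive extraction here; judging from the discussion above (an `np.zeros((5000, 50000))` array chunked along `y`, rolled along `x`), it was presumably close to this reconstruction:

```python
import numpy as np
import xarray as xr

# Reconstructed from the thread, not the verbatim original.
temp = xr.DataArray(np.zeros((5000, 50000)), dims=("x", "y")).chunk({"y": 100})
temp.rolling(x=100).mean().compute()  # reported to raise MemoryError
```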
Expected Output
Problem Description
I was thrilled to find that the new releases (both 0.12.2 and 0.12.3) fixed the rolling window issue. However, when I tried, it seems the problem is still there. Previously, the above code ran with bottleneck installed. With the new version, with or without bottleneck, it simply gives the memory error below.
I have tried old and new versions of dask and pandas, without much difference. However, the dask DataFrame version of the code (shown below) runs OK.
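The DataFrame snippet referred to here is missing; a hedged sketch of what a dask.dataframe rolling mean along the same axis might look like (sizes and partitioning below are illustrative, not the reporter's original code):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# A rough dask.dataframe analogue: rolling mean over the row axis.
pdf = pd.DataFrame(np.zeros((5000, 500)))   # reduced width to keep the sketch light
ddf = dd.from_pandas(pdf, npartitions=10)
result = ddf.rolling(window=100).mean().compute()
```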
I have also tried applying the same thing to a dataset from netCDF files; it simply started consuming a very large portion of memory and gave similar errors.
Any help is appreciated.
Output of `xr.show_versions()`
xarray: 0.12.2
pandas: 0.24.2
numpy: 1.16.4
scipy: 1.3.0
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.3
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.2.2
distributed: 1.28.1
matplotlib: 3.1.0
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.0.0
pip: 19.1.1
conda: 4.7.5
pytest: None
IPython: 7.5.0
sphinx: None