round-trip performance with save_mfdataset / open_mfdataset #1340
My strong suspicion is that the bottleneck here is xarray checking all the coordinates for equality in concat, when deciding whether to add a "time" dimension or not. Try passing […]. This was a convenient check for small/in-memory datasets, but possibly it's not a good one going forward. It's generally slow to load all the coordinate data for comparisons, and it's even worse with the current implementation, which computes pair-wise comparisons of arrays with dask instead of doing them all at once in parallel.
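A rough sketch of skipping those comparisons when concatenating manually with xr.concat; the file names are placeholders, and coords='minimal' is my guess at the option being suggested (the actual snippet is not preserved in the thread):

```python
import xarray as xr

# Open each file lazily, then concatenate along "time". With coords="minimal",
# only coordinates that already contain the "time" dimension are concatenated,
# so xarray does not have to compare every non-dimension coordinate to decide
# whether it varies along "time".
paths = ["0000.nc", "0001.nc"]  # placeholder file names
datasets = [xr.open_dataset(p, chunks={}) for p in paths]
combined = xr.concat(datasets, dim="time", coords="minimal")
```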
http://xarray.pydata.org/en/latest/generated/xarray.open_mfdataset.html#xarray-open-mfdataset
Indeed, it's not. We should add some way to pipe these arguments through.
This sounds like the kind of thing I could manage.
I'm running into the same problem as Ryan regarding the ValueError. However, when I try the same fix
I get the error
Apologies if this should be on a different chain, but any idea what might be going on?
@karenamckinnon could you please share a traceback for the error?
@karenamckinnon From your traceback, it looks like you're using pandas 0.14, but xarray requires at least pandas 0.15.
Got it, thanks @shoyer! In case this happens again, which component of the traceback provided that information to you?
@karenamckinnon In this case, it was in the file paths, i.e.,
I have encountered some major performance bottlenecks in trying to write and then read multi-file netCDF datasets.
I start with an xarray dataset created by xgcm with the following repr:
An important point to note is that there are lots of "non-dimension coordinates" corresponding to various parameters of the numerical grid.
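Since the original repr isn't reproduced above, here is a small illustrative stand-in; the variable and coordinate names (THETA, rA, hFacC) and sizes are placeholders, not the actual xgcm output:

```python
import numpy as np
import xarray as xr

# Toy stand-in for the real dataset: one data variable on (time, Z, YC, XC)
# plus a few non-dimension coordinates (coordinates whose names are not
# dimensions) describing the grid geometry. The real dataset has many more
# of these, which is what makes the compatibility checks expensive.
nt, nz, ny, nx = 3, 4, 6, 8
ds = xr.Dataset(
    {"THETA": (("time", "Z", "YC", "XC"), np.random.rand(nt, nz, ny, nx))},
    coords={
        "time": np.arange(nt),
        "Z": np.arange(nz),
        "YC": np.arange(ny),
        "XC": np.arange(nx),
        "rA": (("YC", "XC"), np.ones((ny, nx))),              # cell area (placeholder)
        "hFacC": (("Z", "YC", "XC"), np.ones((nz, ny, nx))),  # cell fraction (placeholder)
    },
)
```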
I save this dataset to a multi-file netCDF dataset as follows:
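The saving code itself isn't shown above; a rough sketch of the usual pattern, continuing from the toy ds in the previous snippet and assuming one file per time step with made-up file names:

```python
# Split along "time" and write one netCDF file per snapshot.
# `ds` is the toy dataset from the previous snippet; the real call in the
# issue may have split the data differently.
times, datasets = zip(*ds.groupby("time"))
paths = [f"{int(t):04d}.nc" for t in times]  # hypothetical file names
xr.save_mfdataset(datasets, paths)
```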
This takes many hours to run, since it has to read and write all the data. (I think there are some performance issues here too, related to how dask schedules the read / write tasks, but that is probably a separate issue.)
Then I try to re-load this dataset
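The reading code isn't shown either; presumably a call along these lines, with the same assumed file pattern:

```python
import xarray as xr

# Re-open all per-timestep files as one dataset. Without telling xarray which
# dimension to concatenate along, this is the call that fails as described below.
ds = xr.open_mfdataset("*.nc")
```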
This raises an error:
I need to specify concat_dim='time' in order to properly concatenate the data. It seems like this should be unnecessary, since I am reading back data that was just written with xarray, but I understand why: the dimensions of the Data Variables in each file are just Z, YC, XC, with no time dimension. Once I do that, it works, but it takes 18 minutes to load the dataset. I assume this is because it has to check the compatibility of all the non-dimension coordinates. I just thought I would document this, because 18 minutes seems way too long to load a dataset.
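For completeness, a sketch of the call that does work under the same assumed file layout (note that newer xarray versions also require combine='nested' when passing concat_dim):

```python
import xarray as xr

# Concatenating explicitly along "time" avoids the ValueError, but the
# equality/compatibility checks on the many non-dimension coordinates are
# what make the load take so long.
ds = xr.open_mfdataset("*.nc", concat_dim="time")
```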