Switch our lazy array classes to use Dask instead? #1725
Comments
This comment has the full context: #1372 (comment). To repeat myself: You might ask why this separate lazy compute machinery exists. The answer is that dask fails to optimize element-wise operations like See dask/dask#746 for discussion and links to PRs about this. jcrist had a solution that worked, but it slowed down every dask array operation by 20%, which wasn't a great win. I wonder if this is worth revisiting with a simpler, less general optimization pass that doesn't bother with broadcasting. See the subclasses of
If we could optimize all these operations (and ideally chain them), then we could drop all the lazy loading stuff from xarray in favor of dask, which would be a real win. The downside of this switch is that lazy loading of data from disk would now require dask, which would be at least slightly annoying to some users. But it's probably worth the tradeoff from a maintainability perspective, and also to fix issues like #1372.
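For readers unfamiliar with the machinery being discussed, here is a minimal sketch of what a lazy element-wise array might look like. It is an illustration only, not xarray's actual implementation: the class name `LazyElementwiseArray` and its `map` method are hypothetical. The idea is that indexing happens first and the function runs only on the selected values, so chained steps cost nothing until data is accessed.

```python
import numpy as np


class LazyElementwiseArray:
    """Hypothetical wrapper (not xarray's real class): defers an
    element-wise function until the array is indexed."""

    def __init__(self, array, func):
        self.array = array  # anything that supports NumPy-style indexing
        self.func = func    # element-wise function, applied on access only

    @property
    def shape(self):
        return self.array.shape

    def __getitem__(self, key):
        # Index first, then apply func to just the selected values.
        return self.func(np.asarray(self.array[key]))

    def map(self, func):
        # Chain another element-wise step without computing anything yet.
        inner = self.func
        return LazyElementwiseArray(self.array, lambda x: func(inner(x)))


raw = np.arange(1000, dtype="int16")
# Two chained element-wise steps; nothing is computed here.
scaled = LazyElementwiseArray(raw, lambda x: x * 0.5).map(lambda x: x + 10.0)
# Only three elements are read and transformed.
first_three = scaled[:3]
```

Because the wrapper never touches elements outside the requested slice, the work done scales with the size of the selection rather than the size of the underlying array.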
I'm really not opposed to this. I also think it would be a good/reasonable thing to do before 1.0. There may be some pure-numpy Xarray users out there but I suspect they could 1) handle having dask as a dependency and 2) wouldn't be all that affected by the change since the in-memory workflow isn't particularly dependent on lazy evaluation.
Yeah, we could solve this by making dask a requirement only if you want to load netCDF files and/or load netCDF files lazily. Potentially
I just had to confront and understand how lazy CF decoding worked in order to move forward with #1528. In my initial implementation, I applied chunking to variables directly in My impression after this exercise is that having two different definitions of "lazy" within xarray leads to developer confusion! So I favor putting dask more central in xarray's data model.
I'm rather a numpy-xarray user than a dask-xarray user (since most often my data fits in memory), but I wouldn't mind at all having to install dask as a requirement!
Maybe like other users who are used to lazy loading, I'm a bit more concerned by this. I find it so handy to be able to load a medium-sized file instantly, quickly inspect its content, and then work with only a small subset of the variables / data, all of this without worrying about Assuming that numpy-loading is the default, new xarray users coming from If choosing By saying "making the default use Dask", do you mean that data from a file will be "loaded" as dask arrays by default? If this is the case, new xarray users who are probably not familiar with dask (at least less likely than they are familiar with numpy) will have to learn 1-2 concepts from dask before using xarray. This might not be a big deal, though. In summary, I'm also really not opposed to using dask to replace all the current lazy-loading machinery, but ideally it should be as transparent as possible with respect to the current "user experience".
Since #1532, the This highlights an important difference between how So on second thought, maybe the system we have now is better than using dask for "everything lazy."
@rabernat actually in #1532 we switched to not displaying a preview of any lazily loaded data on disk -- even if it isn't loaded with Dask. (I was not sure about this change, but I was alone in my reservations.) I do agree that our lazy arrays serve a useful purpose currently. I would only consider removing them if we can improve Dask so it works just as well for this use case.
On a somewhat related note, I am now proposing extending xarray's "lazy array" functionality to include limited support for arithmetic, without necessarily using dask: #2298
Should we close this? Seems like we've headed in a different direction with our lazy array implementation.
Ported from #1724, comment by @shoyer
The subtleties of checking _data vs data are undesirable, e.g., consider the bug on these lines: xarray/xarray/core/formatting.py, lines 212 to 213 in 1a01208