[Exploration]: How Dask works and how it is utilized in xarray #376
Replies: 5 comments
-
Here's a small intro from what I can recall from when I was working with Xarray and Dask a lot. Most of my experience with these libraries came from working on the ESGF Compute Service. This service translated WPS requests into an Xarray task graph (DAG) and then executed it on a Dask cluster allocated through Dask Gateway. The service also used Xarray's ability to read Zarr-formatted datasets from S3 stores to improve throughput for parallelized operations.

Here's a quick intro to Dask. Anything built with a Dask array, bag, dataframe, delayed, or future is turned into a task graph; the scheduler can optimize the graph and finally assigns the tasks to workers.

To answer the first question, the communication depends on the scheduler. There's either a single-machine or a distributed scheduler. For single-machine you have single-threaded, multi-threaded, or process-based execution. Multi-threaded is pretty straightforward since it can use shared variables in its thread pool, but the process-based scheduler uses cloudpickle to serialize/deserialize the messages and data passed between processes. The same serialize/deserialize pattern is used when using distributed for local/remote clusters.

In my experience, chunking is recommended when dealing with out-of-core operations. I remember losing performance with small datasets and chunking on a LocalCluster due to the communication overhead. Chunking works best when you have an independent variable, e.g. if you're averaging over time you could chunk by lat, lon, lev, or some combination. You can still benefit from chunking even if some of the tasks are not operating on an independent variable, e.g. when building large task graphs. An issue I ran into when working on the Compute service was using
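To make the above concrete, here is a minimal sketch (array sizes, chunk shapes, and the choice of scheduler are arbitrary illustrations, not anything from the Compute service) of how a chunked Dask computation builds a task graph that only runs on `.compute()`:

```python
import dask.array as da

# A 3D "time x lat x lon" style array, chunked along lat/lon so a
# time mean can be computed independently for each spatial chunk.
x = da.random.random((365, 180, 360), chunks=(365, 45, 90))

# This only builds a task graph; nothing is computed yet.
time_mean = x.mean(axis=0)

# Run it with the threaded single-machine scheduler (shared memory).
# Using scheduler="processes" or a distributed cluster instead means
# tasks and data are serialized with (cloud)pickle rather than shared.
result = time_mean.compute(scheduler="threads")
print(result.shape)  # (180, 360)
```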
-
Here are some related links.
-
Exploring whether the Xarray Groupby API is sequential or parallel

Overview
We use the

Background
Good to know: I found the related xarray issue, pydata/xarray#2852. Comments from that issue:
Action Items:
Conclusion
All the steps are sequential. In Xarray < 2022.06.0, the groupby and resampling operations are sequential (refer to notes below).
Other workarounds:
Next steps:
Xarray >= v2022.06.0 includes the flox-based groupby/resample implementation (pydata/xarray#5734, referenced below). We should experiment with it (see the sketch below).
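As a hedged sketch of that experiment (the dataset, the variable name `tas`, and the chunk sizes are made up for illustration), a grouped reduction on a chunked dataset looks like this; with xarray >= 2022.06.0 and flox installed, the groupby is vectorized rather than looping over groups sequentially:

```python
import numpy as np
import pandas as pd
import xarray as xr

# A small, made-up monthly dataset, chunked along time.
time = pd.date_range("2000-01-01", periods=120, freq="MS")
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.random.rand(120, 4, 8))},
    coords={"time": time, "lat": np.arange(4), "lon": np.arange(8)},
).chunk({"time": 12})

# Lazy grouped reduction; flox (if installed) replaces the sequential
# per-group loop with a single vectorized reduction.
seasonal_mean = ds["tas"].groupby("time.season").mean()

# Nothing is computed until we explicitly ask for it.
print(seasonal_mean.compute())
```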
-
Explore where we explicitly load Dask arrays into memory for sequential operations

Background
In xarray, Dask arrays are not loaded into memory unexpectedly (an exception is raised instead). In xCDAT, we load Dask arrays into memory in specific spots.
Action Items
Conclusion
xCDAT loads Dask arrays into memory when performing operations or computations on multi-dimensional arrays, specifically coordinate bounds. xCDAT loads coordinate bounds into memory in the following APIs in specific situations:
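As a hedged sketch (the file name, chunking, and `lat_bnds` variable are placeholders, not xCDAT's actual internals), explicitly loading coordinate bounds into memory while leaving the data variables lazy looks like this:

```python
import xarray as xr

# Open a dataset lazily with Dask chunks; "demo.nc" is a placeholder.
ds = xr.open_dataset("demo.nc", chunks={"time": 100})

# Coordinate bounds are tiny compared to the data variables, so
# loading them eagerly avoids re-executing their task graph every
# time a sequential operation (e.g. weight generation) reuses them.
if "lat_bnds" in ds:
    ds["lat_bnds"] = ds["lat_bnds"].load()

# The data variables remain lazy Dask arrays until .load()/.compute().
print(ds)
```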
-
Experiment with bounds and edges:
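One possible shape for that experiment (a sketch only; `bounds_to_edges` is a hypothetical helper, not an existing xCDAT or xarray function) is converting CF-style bounds with shape (n, 2) into a 1-D array of n + 1 edges:

```python
import numpy as np

# Hypothetical helper: assumes the bounds are contiguous, i.e. the
# upper bound of one cell equals the lower bound of the next.
def bounds_to_edges(bounds: np.ndarray) -> np.ndarray:
    return np.concatenate([bounds[:, 0], bounds[-1:, 1]])

# Example: three latitude cells spanning [-90, -30], [-30, 30], [30, 90].
lat_bnds = np.array([[-90.0, -30.0], [-30.0, 30.0], [30.0, 90.0]])
print(bounds_to_edges(lat_bnds))  # [-90. -30.  30.  90.]
```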
-
This post explores the internals of Dask so we can get a better understanding of how it works. We will also explore how Dask is utilized in xarray, and when to chunk xarray Datasets using Dask.

Dask Array Best Practices
https://docs.dask.org/en/stable/array-best-practices.html#best-practices
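For instance (a hedged illustration of the chunk-sizing advice on that page, not a quote from it), you can let Dask pick chunk sizes automatically and then inspect them:

```python
import dask.array as da

# chunks="auto" targets roughly the configured "array.chunk-size"
# (about 128 MiB by default), in line with the advice to prefer a
# modest number of reasonably large chunks.
x = da.random.random((20_000, 20_000), chunks="auto")
print(x.chunksize, x.numblocks)
```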
Chunking Best Practices
Other performance factors:
How do chunks communicate with one another?
Chunks communicate using indexing.
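For example (a minimal sketch with arbitrary shapes), a reduction along a chunked axis combines partial results from multiple chunks, so the chunks interact through the task graph via their block indices:

```python
import dask.array as da

# Four 500 x 500 chunks.
x = da.ones((1000, 1000), chunks=(500, 500))

# Reducing along axis 0 combines partial results from the chunks
# stacked along that axis.
y = x.mean(axis=0)

# Materialize the graph to count the tasks, including the
# inter-chunk aggregation steps.
print(len(dict(y.__dask_graph__())))
```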
NOTE: The xarray docs say that Dask communication by index is not yet implemented, so grouping and resampling are not optimized.
This documentation is partially outdated because pydata/xarray#5734 is now merged, which addresses the performance issues with groupby() and multi-file datasets.

Additional Questions:
Resources: