Speculatively assign tasks to workers #3974
Comments
Some first steps:
I'm wondering what problem this is solving exactly, because I see a few potential gains/pitfalls here:

A) Ensure there is no network traffic of intermediate results between the execution of e.g. A1 and B1

If A) is the goal, I'm wondering whether this approach is actually worth its complexity. After all, if our task2worker decision is accurate, we should not pay any network cost since B1 should be assigned to the same worker, shouldn't it?

If B) is the goal, I'm wondering what the state machine on the scheduler actually looks like. Can the scheduler still distinguish between ...

And actually, if one of the indirect goals is to remove the necessity of fusing, I would argue that one of the convenient things about fusing is that it also helps to reduce the overall number of tasks, since the scheduler becomes pretty busy once we reach graphs of significant size (>1M tasks), and this approach would not help with that. I'm wondering here whether the cost of optimization outweighs the cost of task overhead.
The main objective is to avoid task fusion on the client side. This allows us to transmit high level graphs directly to the scheduler and avoid creating low-level graphs on the client entirely. I expect that this will significantly outweigh the costs of added tasks on the scheduler. Also, we're working to make the scheduler itself faster. Currently we process around 5000 tasks per second. I suspect/hope that we can improve on this significantly. It's a bit easier to optimize just the scheduler rather than both the scheduler and the client.
This problem exists today. The scheduler assigns all tasks the "processing" state after they have been assigned to a worker, but they may still be waiting in a queue, or waiting on data to transfer. The scheduler does not know when a task has actually started running. This does come up in work stealing, and there is a complex handshake to handle it. I believe that the current behavior is that the scheduler suggests to a worker "Hey Alice, maybe you should go steal task X from worker Bob". Alice then tries to do that, and Bob either agrees or says "nope, I'm working on X right now".
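A rough sketch of that handshake, just to make the flow concrete (the class and method names here are hypothetical, not the actual dask.distributed stealing implementation):

```python
# Hypothetical sketch of the stealing handshake described above.
class Worker:
    def __init__(self, name):
        self.name = name
        self.queued = set()      # tasks waiting in the local queue
        self.executing = set()   # tasks currently running

    def try_steal_from(self, victim, key):
        """Thief side (Alice): ask the victim to hand over `key`."""
        if victim.release_for_steal(key):
            self.queued.add(key)
            return "steal-confirmed"
        return "steal-refused"

    def release_for_steal(self, key):
        """Victim side (Bob): give up the task only if it hasn't started."""
        if key in self.executing:
            return False         # "nope, I'm working on X right now"
        if key in self.queued:
            self.queued.discard(key)
            return True          # agree to the steal
        return False             # task is no longer here at all


# Scheduler side: "Hey Alice, maybe you should go steal task X from Bob."
alice, bob = Worker("Alice"), Worker("Bob")
bob.queued.add("X")
alice.try_steal_from(bob, "X")   # "steal-confirmed", since X hadn't started yet
```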
For context, here is a link to the dataframe optimization code. The only thing below the ... We get a lot of speedup if we're able to return the HighLevelGraph.
Motivation
Currently, when a task becomes ready to run, the scheduler finds the best worker for that task and distributes the task to that worker to be enqueued. The worker then handles whatever communication is necessary to collect the dependencies for that task, and once it's ready it puts it in a queue to be run in a local ThreadPoolExecutor. When it finishes, the worker informs the scheduler and goes on to its next task. When the scheduler receives news that the task has finished, it updates all of its dependents, each of which goes through this process again.
This is occasionally sub-optimal. Consider the following graph:
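A minimal sketch of the kind of graph meant here, written as a low-level Dask graph (the functions and values are illustrative; only the dependency structure matters): B1 depends on A1, and B2 depends on A2.

```python
from operator import add

# Two independent chains; only the shape matters:
#   A1 -> B1    and    A2 -> B2
dsk = {
    "A1": (add, 1, 1),
    "A2": (add, 2, 2),
    "B1": (add, "A1", 10),   # consumes A1's result
    "B2": (add, "A2", 10),   # consumes A2's result
}
```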
If we have one worker then both of the A's will be sent to that worker. The worker will finish A1, send the report to the scheduler that it is finished, and begin work on A2. It will then get word from the scheduler that B1 is ready to compute and will work on that next. As a result, the order of execution looks like the following: A1, A2, B1, B2.
When really we would have preferred A1, B1, A2, B2.
Today we often avoid this situation by doing task fusion on the client side, so that we really only have two tasks, each containing an A and its dependent B.
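Continuing the illustrative graph above, low-level fusion inlines each A into its B so the worker necessarily runs them back to back (real fusion may also rename the fused keys; that detail is omitted here):

```python
from operator import add

# After fusion each chain is a single task; nested tuples are evaluated
# as sub-tasks, so A1 is computed immediately before B1 on the same worker.
dsk_fused = {
    "B1": (add, (add, 1, 1), 10),   # A1 inlined into B1
    "B2": (add, (add, 2, 2), 10),   # A2 inlined into B2
}
```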
If we remove low-level task fusion (which we may want to do for performance reasons) then it would be nice to capture this same A, B, A, B behavior some other way.
Send tasks to workers before dependencies are set
One way to capture this same behavior is to send more tasks down to the worker, even before they are ready to run. Today we only send a task to a worker once we have high confidence that that is where it should run, which typically we only know after we understand the data sizes of all of its inputs. We only know this once its dependencies are done, so we only send tasks to workers once all of their dependencies are complete.
However, we could send not-yet-ready-to-run tasks to a worker with high confidence if all of that task's dependencies are also present or running on that worker. This happens to fully consume the low-level task fusion case: if A1 is running on a worker then we believe with high probability that B1 will also run on that same worker.
Changes
This would be a significant change to the worker. We would need to add states to track dependencies, much like how we do in the dask.local scheduler, or a very stripped down version of the dask.distributed scheduler.
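A very stripped-down sketch of what that dependency tracking on the worker could look like (everything here is hypothetical, not existing distributed code): tasks may arrive before their dependencies have finished, and a task only moves to the ready queue once every dependency it was shipped with has completed locally.

```python
from collections import defaultdict

class SpeculativeTaskState:
    """Hypothetical worker-side bookkeeping for speculatively assigned tasks."""

    def __init__(self):
        self.waiting = {}                   # key -> set of unfinished dependencies
        self.waiters = defaultdict(set)     # dependency -> keys waiting on it
        self.ready = []                     # keys whose dependencies are all done

    def add_task(self, key, dependencies, finished):
        """Register `key`; `finished` is the set of keys already complete here."""
        missing = set(dependencies) - finished
        if not missing:
            self.ready.append(key)          # behaves like today's eager path
            return
        self.waiting[key] = missing
        for dep in missing:
            self.waiters[dep].add(key)

    def task_finished(self, key):
        """Mark `key` done and promote tasks that were only waiting on it."""
        for waiter in self.waiters.pop(key, set()):
            remaining = self.waiting[waiter]
            remaining.discard(key)
            if not remaining:
                del self.waiting[waiter]
                self.ready.append(waiter)
```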
This will also require managing interactions with other parts of Dask like ...
Learning
I think that this task is probably a good learning task for someone who has some moderate exposure to the distributed scheduler and wants to level up a bit. Adding dependencies to the worker will, I think, force someone to think a lot about the systems that help make Dask run while mostly implementing minor versions of them. cc @quasiben @jacobtomlinson @madsbk