`decide_worker` could be expensive on large clusters with queuing enabled #7246

gjoseph92 · 2022-11-03T04:42:27Z

When queuing is enabled, we currently pick the worker for a queued task by finding the idle worker with the fewest tasks.

This means a linear search over the set of idle workers, which could, at worst, contain all workers.

So on really large clusters (thousands/tens of thousands of workers), this could possibly get expensive, because decide_worker is called for every task.

Previously, idle was a SortedSet, and when there were >20 workers (arbitrary cutoff), we'd just pick one round-robin style, which cost O(logn):

distributed/distributed/scheduler.py

Line 2204 in 02b9430

ws = wp_vals[self.n_tasks % n_workers]

We don't have data showing performance is a problem here, but we also haven't benchmarked a really large cluster yet. I don't want to prematurely optimize, but given that we intend to turn queuing on by default #7213, it would also be bad if the default were slow for large-cluster users.

Do we want to preemptively change this logic? Options I could imagine (simple->complex):

Do nothing. With 10k workers, there are probably plenty of other things that are more inefficient than decide_worker that we should improve first. Plus, idle will only be large at the beginning and end of the computation; most of the time it should be quite small.
If len(idle) > some arbitrary cutoff (maybe 20 again), just pick next(iter(self.idle)). (I'd like to make idle no longer sorted since it's rather expensive and we're only sorting by name, not something useful Remove sortedcontainers #7245.)

We could do something simple with CPython set iteration order (or use a dict[WorkerState, None]) to make this properly round-robin.
Maintain a structure binning idle workers by the number of tasks processing. This assumes worker thread counts are relatively small (in the thousands at most). We could find the least-occupied worker in O(1), and updating when tasks are added/removed would be O(1) as well. (Could also use a heap, but then the update would be O(logn). Taking a bucket-sort approach by assuming thread counts are small seems smarter.)

cc @fjetter @crusaderky

The text was updated successfully, but these errors were encountered:

mrocklin · 2022-11-03T13:08:33Z

Historical notes:

The idle sorted set was created because this did come up in profiles for larger numbers of workers.

We used SortedSet in order to have something that was indexable to provide round-robin behaviors (not because we cared about sorting). I found that the sorting nature didn't really cost anything (except the dependency). If there is another way to do this then grand.

Binning workers by number of tasks processing or occupancy is also fine by me. The idle/saturated sets were a simple example of binning (just three bins) although obviously based around occupancy rather than number of tasks (which should presumably change).

That's all from me. Just providing history.

gjoseph92 · 2022-11-03T14:53:45Z

That history is helpful, thanks.

I think the main question here is "should we do something for the large-worker case (even if it's trivial, like round robin) before we change the default? Or should we wait until we find it's actually a problem?"

mrocklin · 2022-11-03T15:33:56Z

Can we not do the same thing that we did before? Maintain `idle` and select round-robin from it?

…

On Thu, Nov 3, 2022 at 9:53 AM Gabe Joseph ***@***.***> wrote: That history is helpful, thanks. I think the main question here is "should we do *something* for the large-worker case (even if it's trivial, like round robin) before we change the default? Or should we wait until we find it's actually a problem?" — Reply to this email directly, view it on GitHub <#7246 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTGUA6Q5RVQDTOZPE33WGPGYJANCNFSM6AAAAAARVX5J3Q> . You are receiving this because you commented.Message ID: ***@***.***>

mrocklin · 2022-11-03T15:37:27Z

Maybe the definition of idle changes to "has less tasks than cores"?

…

On Thu, Nov 3, 2022 at 10:33 AM Matthew Rocklin ***@***.***> wrote: Can we not do the same thing that we did before? Maintain `idle` and select round-robin from it? On Thu, Nov 3, 2022 at 9:53 AM Gabe Joseph ***@***.***> wrote: > That history is helpful, thanks. > > I think the main question here is "should we do *something* for the > large-worker case (even if it's trivial, like round robin) before we change > the default? Or should we wait until we find it's actually a problem?" > > — > Reply to this email directly, view it on GitHub > <#7246 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AACKZTGUA6Q5RVQDTOZPE33WGPGYJANCNFSM6AAAAAARVX5J3Q> > . > You are receiving this because you commented.Message ID: > ***@***.***> >

gjoseph92 · 2022-11-04T02:52:04Z

Here is a scheduler profile of a 512-worker cluster running anom_mean (~500k tasks) under #7257:

profile-215d100-sat-1_0-test_anom_mean-clipped.json

(I've cropped out the ~7min of idleness waiting for the 700mb graph to upload. That is itself a big issue, but for talking about scheduler performance it makes the percentages you see in speedscope more useful.)

Good news: decide_worker_rootish_queuing_enabled is 0.15% of total time. (And about half of it is in __iter__ over the sorted container, which would get even faster with #7245.) For comparison, decide_worker_non_rootish is 1.6%.

This is so tiny that we should stop thinking about it because it doesn't seem to matter.

Bad news: the scheduler is pretty much 100% busy, and it looks like only 12% of that is even spent on transitions (the meat of the scheduling work). There's a lot of other overhead. As usual, I think a lot of it (~50%?) is 'non-blocking' IO blocking the event loop, plus tornado and asyncio overhead.

That's a different discussion, but the point is that decide_worker doesn't seem to be very important in the scheme of things.

gjoseph92 · 2022-11-10T16:52:17Z

Seems like we're okay with this, closing.

gjoseph92 mentioned this issue Nov 3, 2022

Turn on queuing by default #7213

Closed

11 tasks

This was referenced Nov 4, 2022

All tasks without dependencies are root-ish #7221

Open

Round-robin worker selection makes poor choices with worker-saturation > 1.0 #7197

Open

gjoseph92 closed this as completed Nov 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`decide_worker` could be expensive on large clusters with queuing enabled #7246

`decide_worker` could be expensive on large clusters with queuing enabled #7246

gjoseph92 commented Nov 3, 2022

mrocklin commented Nov 3, 2022

gjoseph92 commented Nov 3, 2022

mrocklin commented Nov 3, 2022 via email

mrocklin commented Nov 3, 2022 via email

gjoseph92 commented Nov 4, 2022 •

edited

Loading

gjoseph92 commented Nov 10, 2022

decide_worker could be expensive on large clusters with queuing enabled #7246

decide_worker could be expensive on large clusters with queuing enabled #7246

Comments

gjoseph92 commented Nov 3, 2022

mrocklin commented Nov 3, 2022

gjoseph92 commented Nov 3, 2022

mrocklin commented Nov 3, 2022 via email

mrocklin commented Nov 3, 2022 via email

gjoseph92 commented Nov 4, 2022 • edited Loading

gjoseph92 commented Nov 10, 2022

`decide_worker` could be expensive on large clusters with queuing enabled #7246

`decide_worker` could be expensive on large clusters with queuing enabled #7246

gjoseph92 commented Nov 4, 2022 •

edited

Loading