
Tasks with worker restrictions get stuck in no-worker when required worker is removed #7346

Open
hendrikmakait opened this issue Nov 24, 2022 · 9 comments
Labels: discussion (Discussing a topic with no specific actions yet)

Comments

@hendrikmakait
Member

hendrikmakait commented Nov 24, 2022

Problem

When tasks are restricted to run on a specific worker using worker_restrictions, they transition to no-worker if the specified worker is removed. The task will remain in no-worker until a worker with the same address rejoins. I see two use cases for this behavior:

  1. Restricting a task to run on a specific machine (with specific hardware)
    • In this case, the user may be better served by using host_restrictions or resource_restrictions
  2. Restricting a task to run on a specific worker instance
    • For example, we restrict tasks to run on specific worker instances in the P2PShuffle implementation
    • In this case, we want tasks to fail as soon as the instance is removed instead of remaining in no-worker indefinitely

Am I missing a use case here?

Depending on your deployment system, another worker with the same address may join the cluster. That would resolve the deadlock for use case 1, but it would cause trouble for tasks relying on use case 2, as another manifestation of #6392.
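For illustration, a minimal sketch of how a task ends up stuck; the cluster setup and the choice of target worker are assumptions, not taken from a real reproducer:

```python
from distributed import Client, LocalCluster

# Assumed setup: a local cluster with two workers.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

def inc(x):
    return x + 1

# Pin the task to one specific worker address and forbid running elsewhere.
target = next(iter(client.scheduler_info()["workers"]))
fut = client.submit(inc, 1, workers=[target], allow_other_workers=False)

# If the worker at `target` is removed before the task runs, the task
# transitions to no-worker and waits until a worker with that exact
# address rejoins, which may never happen.
```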

Possible solutions

  1. Adjust worker_restrictions to use IDs instead of addresses to ensure uniqueness and transition tasks to erred instead of no-worker. This would be a breaking change and I do not know if anybody relies on the current behavior.
  2. Introduce a new type of restriction that pins tasks to currently existing worker instances via their IDs and implements the behavior outlined above.

cc @fjetter

@fjetter added the discussion label on Nov 25, 2022
@crusaderky
Collaborator

+1; the change makes sense to me.

By this same logic, a task should transition to erred as soon as it lands on the scheduler if the workers satisfying its restriction are not there.

Adjust worker_restrictions to use IDs instead of addresses to ensure uniqueness and transition tasks to erred instead of no-worker. This would be a breaking change and I do not know if anybody relies on the current behavior.

We use worker addresses in restrictions pretty much all over the place. I'd not want to go through our whole test suite to adjust it.

Don't we already communicate the UUID of the worker to the scheduler? We also have aliases.
I think worker restrictions should match any worker that satisfies

  • address
  • or alias
  • or UUID

e.g. these should all be valid:

submit(inc, 1, workers=["tcp://127.0.0.1:12345"])
submit(inc, 1, workers=[0])
submit(inc, 1, workers=["ed6db7b2-6aee-47e8-964f-2c71481fce4a"])

The first two are already happening; see Scheduler.coerce_address.

Note that aliases are user-defined, but they are converted to addresses by update_graph - so if the worker leaves the cluster and later on another worker with the same name joins the cluster, it won't get the job.
I think it should be changed to the UUID instead to avoid collisions.
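For reference, a minimal sketch of the address and alias forms discussed above, assuming a LocalCluster whose workers carry the default integer names; UUID targeting does not exist today and is what is being proposed:

```python
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2)
client = Client(cluster)

def inc(x):
    return x + 1

addr = next(iter(client.scheduler_info()["workers"]))  # e.g. "tcp://127.0.0.1:12345"

# By address: matches only a worker listening on that exact address.
fut_by_address = client.submit(inc, 1, workers=[addr])

# By alias (worker name): resolved to the current address at submission time
# via Scheduler.coerce_address, so a later worker reusing the name is not matched.
fut_by_alias = client.submit(inc, 2, workers=[0])
```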

@gjoseph92
Collaborator

@hendrikmakait I've definitely been bitten by this before and sort of wished that the task would error if the worker I was pinning it to didn't exist.

a task should transition to erred as soon as it lands on the scheduler if the workers satisfying its restriction are not there

I don't think this would be good behavior in general—otherwise you couldn't scale up from zero if your tasks had resource restrictions. Resource restrictions are different from worker restrictions, because other workers can fulfill them. If we're treating a worker restriction as a unique identifier (which it isn't, but I think we intend it to be), then it makes sense that the task would error if the worker specified doesn't exist.
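To make that distinction concrete, a minimal sketch; the scheduler address, the "GPU" resource name, and the worker address are assumptions, and workers must be started with e.g. `--resources "GPU=1"` to advertise the resource:

```python
from distributed import Client

client = Client("tcp://scheduler:8786")  # assumed scheduler address

def work(x):
    return x

# Resource restriction: any worker advertising a GPU can run this, including
# workers that join later, so scaling up from zero still makes progress.
fut_resource = client.submit(work, 1, resources={"GPU": 1})

# Worker restriction: only the worker at this exact address qualifies; if it
# leaves, the task sits in no-worker indefinitely.
fut_worker = client.submit(
    work, 2, workers=["tcp://10.0.0.5:40331"], allow_other_workers=False
)
```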

I think it should be changed to the UUID instead to avoid collisions.

Agreed.

@hendrikmakait
Member Author

a task should transition to erred as soon as it lands on the scheduler if the workers satisfying its restriction are not there

I agree with @gjoseph92 that there are scale-up scenarios in which workers on specific hosts or with specific resources have yet to join the cluster when the tasks are submitted. I'd be fine with making this configurable via a flag so that users can decide whether they fail on mismatches or allow waiting for workers that will satisfy the restrictions (and potentially waiting indefinitely in the case of a typo).

@crusaderky: I didn't even know about aliases; thanks for bringing those up. Your suggestion for dealing with addresses, aliases, and UUIDs makes sense to me. I'd love to have that.

@hendrikmakait hendrikmakait changed the title Transition tasks with worker_restrictions to erred instead of no-worker on worker removal Tasks with worker_restrictions get stuck in no-worker when worker is removed Jan 3, 2024
@hendrikmakait hendrikmakait changed the title Tasks with worker_restrictions get stuck in no-worker when worker is removed Tasks with worker restrictions get stuck in no-worker when required worker is removed Jan 3, 2024
@trivialfis

trivialfis commented Nov 6, 2024

Hi, is there a way to disable restarting such failed tasks instead of letting them wait for a non-existent worker? I tried setting retries=0 in client.submit and an explicit dask.annotate(retries=0) before client.submit, but neither of them prevents Dask from restarting the task after it aborts.

My naive workaround at the moment is to set allow_other_workers=True and add a check inside the task to make sure it's not running on an unexpected worker.
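A sketch of that workaround, assuming the preferred worker's address is known up front; the addresses here are placeholders:

```python
from distributed import Client, get_worker

client = Client("tcp://scheduler:8786")   # assumed scheduler address
PREFERRED = "tcp://10.0.0.5:40331"        # assumed address of the preferred worker

def task(x):
    # With allow_other_workers=True the scheduler may reschedule the task
    # elsewhere once PREFERRED disappears; fail fast instead of running there.
    if get_worker().address != PREFERRED:
        raise RuntimeError("task landed on an unexpected worker")
    return x + 1

fut = client.submit(task, 1, workers=[PREFERRED], allow_other_workers=True)
```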

@hendrikmakait
Member Author

@trivialfis: From what I gathered reading through issues you have posted on, you are working on making XGBoost more robust. Is that correct? If so, we should probably talk as we have already dealt with several of the issues you will encounter and it's a non-trivial feat to become robust against them.

There is currently no off-the-shelf solution for your problem. We have fairly recently added a configurable no-workers-timeout that fails tasks that stay in no-worker for too long; it is disabled by default, though (#8806).
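For anyone landing here, a sketch of enabling that timeout via the Dask config; the 60-second value is arbitrary, and the option needs to be set in the scheduler's environment (or before the scheduler is created) to take effect, see #8806 for details:

```python
import dask

# Fail tasks that stay in no-worker for longer than 60 seconds instead of
# letting them wait indefinitely; the option is off (null) by default.
dask.config.set({"distributed.scheduler.no-workers-timeout": "60s"})
```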

@fjetter
Member

fjetter commented Nov 6, 2024

About the question above. We added an option no-workers-timeout that basically allows a task to fail after a set amount of time if no workers arrive. See also #8806

It might be time to enable this by default. I think this would already go a long way for XGBoost users.

@trivialfis

you are working on making XGBoost more robust. Is that correct?

Yes.

If so, we should probably talk

Love to! Would you like to set up a meeting, or should I join the next Dask dev meeting?

About the question above. We added an option no-workers-timeout

Thank you both for sharing!

@hendrikmakait
Member Author

Love to! Would you like to set up a meeting, or should I join the next Dask dev meeting?

Joining the next (read: tomorrow's) Dask dev meeting sounds like a great idea to get some engagement from maintainers in general. I suspect we'll still set up a smaller follow-up meeting later to talk details.

@trivialfis

@hendrikmakait Sounds good; I sent an email to the address shared in your profile.
