
Tasks with worker restrictions get stuck in no-worker when required worker is removed #7346

Open
hendrikmakait opened this issue Nov 24, 2022 · 9 comments
Labels: discussion (Discussing a topic with no specific actions yet)

Comments

@hendrikmakait
Member

hendrikmakait commented Nov 24, 2022

Problem

When tasks are restricted to run on a specific worker using worker_restrictions, they transition to no-worker if the specified worker is removed. The task will remain in no-worker until a worker with the same address rejoins. I see two use cases for this behavior:

  1. Restricting a task to run on a specific machine (with specific hardware)
    • In this case, the user may be better served by using host_restrictions or resource_restrictions
  2. Restricting a task to run on a specific worker instance
    • For example, we restrict tasks to run on specific worker instances in the P2PShuffle implementation
    • In this case, we want tasks to fail as soon as the instance is removed instead of remaining in no-worker indefinitely

Am I missing a use case here?

Depending on your deployment system, another worker with the same address may join the cluster. That would resolve the deadlock for use case 1, but it would cause trouble for tasks relying on use case 2, as another manifestation of #6392.
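For illustration, a minimal sketch of how a task ends up stuck; the cluster setup and the choice of target worker are assumptions, not taken from a real reproducer:

```python
from distributed import Client, LocalCluster

# Assumed setup: a local cluster with two workers.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

def inc(x):
    return x + 1

# Pin the task to one specific worker address and forbid running elsewhere.
target = next(iter(client.scheduler_info()["workers"]))
fut = client.submit(inc, 1, workers=[target], allow_other_workers=False)

# If the worker at `target` is removed before the task runs, the task
# transitions to no-worker and waits until a worker with that exact
# address rejoins, which may never happen.
```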

Possible solutions

  1. Adjust worker_restrictions to use IDs instead of addresses to ensure uniqueness and transition tasks to erred instead of no-worker. This would be a breaking change and I do not know if anybody relies on the current behavior.
  2. Introduce a new type of restriction that pins tasks to currently existing worker instances via their IDs and implements the behavior outlined above.

cc @fjetter

@fjetter added the discussion label on Nov 25, 2022
@crusaderky
Collaborator

+1; the change makes sense to me.

By this same logic, a task should transition to erred as soon as it lands on the scheduler if the workers satisfying its restriction are not there.

Adjust worker_restrictions to use IDs instead of addresses to ensure uniqueness and transition tasks to erred instead of no-worker. This would be a breaking change and I do not know if anybody relies on the current behavior.

We use worker addresses in restrictions pretty much all over the place. I'd not want to go through our whole test suite to adjust it.

Don't we already communicate the UUID of the worker to the scheduler? We also have aliases.
I think worker restrictions should match any worker that satisfies

  • address
  • or alias
  • or UUID

e.g. these should all be valid:

submit(inc, 1, workers=["tcp://127.0.0.1:12345"])
submit(inc, 1, workers=[0])
submit(inc, 1, workers=["ed6db7b2-6aee-47e8-964f-2c71481fce4a"])

The first two are already happening; see Scheduler.coerce_address.

Note that aliases are user-defined, but they are converted to addresses by update_graph - so if the worker leaves the cluster and later on another worker with the same name joins the cluster, it won't get the job.
I think it should be changed to the UUID instead to avoid collisions.
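For reference, a minimal sketch of the address and alias forms discussed above, assuming a LocalCluster whose workers carry the default integer names; UUID targeting does not exist today and is what is being proposed:

```python
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2)
client = Client(cluster)

def inc(x):
    return x + 1

addr = next(iter(client.scheduler_info()["workers"]))  # e.g. "tcp://127.0.0.1:12345"

# By address: matches only a worker listening on that exact address.
fut_by_address = client.submit(inc, 1, workers=[addr])

# By alias (worker name): resolved to the current address at submission time
# via Scheduler.coerce_address, so a later worker reusing the name is not matched.
fut_by_alias = client.submit(inc, 2, workers=[0])
```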

@gjoseph92
Collaborator

@hendrikmakait I've definitely been bitten by this before and sort of wished that the task would error if the worker I was pinning it to didn't exist.

a task should transition to erred as soon as it lands on the scheduler if the workers satisfying its restriction are not there

I don't think this would be good behavior in general—otherwise you couldn't scale up from zero if your tasks had resource restrictions. Resource restrictions are different from worker restrictions, because other workers can fulfill them. If we're treating a worker restriction as a unique identifier (which it isn't, but I think we intend it to be), then it makes sense that the task would error if the worker specified doesn't exist.
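To make that distinction concrete, a minimal sketch; the scheduler address, the "GPU" resource name, and the worker address are assumptions, and workers must be started with e.g. `--resources "GPU=1"` to advertise the resource:

```python
from distributed import Client

client = Client("tcp://scheduler:8786")  # assumed scheduler address

def work(x):
    return x

# Resource restriction: any worker advertising a GPU can run this, including
# workers that join later, so scaling up from zero still makes progress.
fut_resource = client.submit(work, 1, resources={"GPU": 1})

# Worker restriction: only the worker at this exact address qualifies; if it
# leaves, the task sits in no-worker indefinitely.
fut_worker = client.submit(
    work, 2, workers=["tcp://10.0.0.5:40331"], allow_other_workers=False
)
```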

I think it should be changed to the UUID instead to avoid collisions.

Agreed.

@hendrikmakait
Member Author

a task should transition to erred as soon as it lands on the scheduler if the workers satisfying its restriction are not there

I agree with @gjoseph92 that there are scale-up scenarios in which workers on specific hosts or with specific resources have yet to join the cluster when the tasks are submitted. I'd be fine with making this configurable via a flag so that users can decide whether they fail on mismatches or allow waiting for workers that will satisfy the restrictions (and potentially waiting indefinitely in the case of a typo).

@crusaderky: I didn't even know about aliases; thanks for bringing those up. Your suggestion for dealing with addresses, aliases, and UUIDs makes sense to me. I'd love to have that.

@hendrikmakait hendrikmakait changed the title Transition tasks with worker_restrictions to erred instead of no-worker on worker removal Tasks with worker_restrictions get stuck in no-worker when worker is removed Jan 3, 2024
@hendrikmakait hendrikmakait changed the title Tasks with worker_restrictions get stuck in no-worker when worker is removed Tasks with worker restrictions get stuck in no-worker when required worker is removed Jan 3, 2024
@trivialfis

trivialfis commented Nov 6, 2024

Hi, is there a way to disable restarting such failed tasks instead of letting them wait for a non-existent worker? I tried setting retries=0 in client.submit and an explicit dask.annotate(retries=0) before client.submit, but neither of them prevents Dask from restarting the task after it aborts.

My naive workaround at the moment is to set allow_other_workers=True and add a check inside the task to make sure it's not running on an unexpected worker.
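A sketch of that workaround, assuming the preferred worker's address is known up front; the addresses here are placeholders:

```python
from distributed import Client, get_worker

client = Client("tcp://scheduler:8786")   # assumed scheduler address
PREFERRED = "tcp://10.0.0.5:40331"        # assumed address of the preferred worker

def task(x):
    # With allow_other_workers=True the scheduler may reschedule the task
    # elsewhere once PREFERRED disappears; fail fast instead of running there.
    if get_worker().address != PREFERRED:
        raise RuntimeError("task landed on an unexpected worker")
    return x + 1

fut = client.submit(task, 1, workers=[PREFERRED], allow_other_workers=True)
```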

@hendrikmakait
Member Author

@trivialfis: From what I gathered reading through issues you have posted on, you are working on making XGBoost more robust. Is that correct? If so, we should probably talk as we have already dealt with several of the issues you will encounter and it's a non-trivial feat to become robust against them.

There is currently no off-the-shelf solution for your problem. We have fairly recently added a configurable no-workers-timeout that fails tasks that stay in no-worker for too long; it is disabled by default, though (#8806).
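For anyone landing here, a sketch of enabling that timeout via the Dask config; the 60-second value is arbitrary, and the option needs to be set in the scheduler's environment (or before the scheduler is created) to take effect, see #8806 for details:

```python
import dask

# Fail tasks that stay in no-worker for longer than 60 seconds instead of
# letting them wait indefinitely; the option is off (null) by default.
dask.config.set({"distributed.scheduler.no-workers-timeout": "60s"})
```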

@fjetter
Member

fjetter commented Nov 6, 2024

About the question above. We added an option no-workers-timeout that basically allows a task to fail after a set amount of time if no workers arrive. See also #8806

It might be time to enable this by default. I think this would already go a long way for XGBoost users.

@trivialfis

you are working on making XGBoost more robust. Is that correct?

Yes.

If so, we should probably talk

Love to! Would you like to set up a meeting, or should I join the next Dask dev meeting?

About the question above. We added an option no-workers-timeout

Thank you both for sharing!

@hendrikmakait
Member Author

Love to! Would you like to set up a meeting, or should I join the next Dask dev meeting?

Joining the next (read: tomorrow's) Dask dev meeting sounds like a great idea to get some engagement from maintainers in general. I suspect we'll still set up a smaller follow-up meeting later to talk details.

@trivialfis

@hendrikmakait Sounds good; I sent an email to the address shared in your profile.
