Tasks with worker restrictions get stuck in no-worker when required worker is removed #7346
+1; the change makes sense to me. By this same logic, a task should transition to erred as soon as it lands on the scheduler if the workers satisfying its restriction are not there.
We use worker addresses in restrictions pretty much all over the place. I'd not want to go through our whole test suite to adjust it. Don't we already communicate the UUID of the worker to the scheduler? We also have aliases.
e.g. these should all be valid:

    submit(inc, 1, workers=["tcp://127.0.0.1:12345"])
    submit(inc, 1, workers=[0])
    submit(inc, 1, workers=["ed6db7b2-6aee-47e8-964f-2c71481fce4a"])

The first two are already happening; see [...]. Note that aliases are user-defined, but they are converted to addresses by update_graph - so if the worker leaves the cluster and later on another worker with the same name joins the cluster, it won't get the job.
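For illustration, a minimal sketch of the first two forms (address and alias), which already work today, run against a throwaway local cluster; the inc function and variable names are just examples, not taken from the issue:

```python
from dask.distributed import Client


def inc(x):
    return x + 1


if __name__ == "__main__":
    client = Client(n_workers=2)  # throwaway local cluster

    # Look up the address and the user-facing name of one worker.
    workers = client.scheduler_info()["workers"]
    addr = next(iter(workers))
    name = workers[addr]["name"]

    # Pin by address -- this is what worker_restrictions stores today.
    by_address = client.submit(inc, 1, workers=[addr])

    # Pin by alias/name -- resolved to the current address at submission time,
    # so a later worker that reuses the name will not pick up the task.
    by_name = client.submit(inc, 2, workers=[name])

    print(by_address.result(), by_name.result())
```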
@hendrikmakait I've definitely been bitten by this before and sort of wished that the task would error if the worker I was pinning it to didn't exist.
I don't think this would be good behavior in general—otherwise you couldn't scale up from zero if your tasks had resource restrictions. Resource restrictions are different from worker restrictions, because other workers can fulfill them. If we're treating a worker restriction as a unique identifier (which it isn't, but I think we intend it to be), then it makes sense that the task would error if the worker specified doesn't exist.
Agreed.
I agree with @gjoseph92 that there are scale-up scenarios in which workers on specific hosts or with specific resources have yet to join the cluster when the tasks are submitted. I'd be fine with making this configurable via a flag so that users can decide whether they fail on mismatches or allow waiting for workers that will satisfy the restrictions (and potentially waiting indefinitely in the case of a typo). @crusaderky: I didn't even know about aliases; thanks for bringing those up. Your suggestion for dealing with addresses, aliases, and UUIDs makes sense to me. I'd love to have that.
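As a rough illustration of the fail-fast side of that trade-off, here is a client-side sketch; the helper name and error message are made up for this example, and it only checks at submit time, so a worker can still disappear afterwards:

```python
from dask.distributed import Client


def submit_pinned(client: Client, func, *args, worker_address: str):
    """Hypothetical helper: fail immediately if the pinned worker is absent,
    instead of letting the task sit in no-worker."""
    if worker_address not in client.scheduler_info()["workers"]:
        raise RuntimeError(f"worker {worker_address} is not connected")
    # allow_other_workers=False (the default) keeps the restriction strict.
    return client.submit(
        func, *args, workers=[worker_address], allow_other_workers=False
    )
```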
Hi, is there a way to disable restarting such failed tasks instead of letting them wait for a non-existent worker? I tried to set [...]. My naive workaround at the moment is to set [...].
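For what it's worth, one client-side workaround in the spirit of this question is a deadline plus explicit cancellation rather than waiting forever; a sketch, with the helper name and timeout value made up for this example:

```python
import asyncio

from dask.distributed import Client


def run_with_deadline(client: Client, func, *args, worker: str, timeout: float = 60):
    """Hypothetical workaround: give up and cancel instead of waiting in
    no-worker indefinitely for a worker that may never come back."""
    fut = client.submit(func, *args, workers=[worker])
    try:
        return fut.result(timeout=timeout)
    except (TimeoutError, asyncio.TimeoutError):
        # The pinned worker never (re)appeared within the deadline;
        # cancel the task so it does not linger on the scheduler.
        client.cancel(fut)
        raise
```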
@trivialfis: From what I gathered reading through issues you have posted on, you are working on making XGBoost more robust. Is that correct? If so, we should probably talk, as we have already dealt with several of the issues you will encounter and it's a non-trivial feat to become robust against them. There is currently no off-the-shelf solution for your problem. We have somewhat recently added a configurable [...].
About the question above: we added an option [...]. It might be time to enable this by default. I think this would already go a long way for xgboost users.
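The name of that option did not survive in this capture of the thread; purely as a generic sketch, scheduler-side behavior like this is usually toggled through the Dask config (the key below is a placeholder, not a real option):

```python
import dask

# Placeholder key for illustration only -- the real option name referenced
# above is not preserved in this capture.
dask.config.set({"distributed.scheduler.hypothetical-recovery-option": True})
print(dask.config.get("distributed.scheduler.hypothetical-recovery-option"))
```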
Yes.
Love to! Would you like to set up a meeting, or should I join the next dask dev meeting?
Thank you both for sharing!
Joining the next (read: tomorrow's) Dask dev meeting sounds like a great idea to get some engagement from maintainers in general. I suspect we'll still set up a smaller follow-up meeting later to talk details.
@hendrikmakait Sounds good, sent an email to the address shared in your profile. |
Problem
When tasks are restricted to run on a specific worker using worker_restrictions, they transition to no-worker if the specified worker is removed. The task will remain in no-worker until a worker with the same address rejoins. I see two use cases for this behavior:

1. The user wants the task to run on a worker on a particular host or with particular resources, which might be better served by host_restrictions or resource_restrictions.
2. The user wants the task to run on that one specific worker, in which case it would wait in no-worker indefinitely.

Am I missing a use case here?
Depending on your deployment system, another worker with the same address may join the cluster, which would resolve the deadlock of use case #1, but create trouble for tasks that want to achieve use case #2 as another manifestation of #6392.
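For concreteness, a minimal sketch of this behavior against a throwaway local cluster; inc and the task key are illustrative, not taken from the issue:

```python
import time

from dask.distributed import Client


def inc(x):
    return x + 1


if __name__ == "__main__":
    client = Client(n_workers=2)
    target = next(iter(client.scheduler_info()["workers"]))

    # Pinning works while the worker is around.
    assert client.submit(inc, 1, workers=[target]).result() == 2

    # Remove that worker, then submit another task pinned to its address.
    client.retire_workers([target])
    stuck = client.submit(inc, 2, workers=[target], key="pinned-after-removal")

    time.sleep(5)
    # The scheduler keeps the task in no-worker; from the client it just
    # looks pending until a worker with the same address happens to rejoin.
    print(stuck.status)  # "pending"
```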
Possible solutions
Adjust worker_restrictions to use IDs instead of addresses to ensure uniqueness and transition tasks to erred instead of no-worker. This would be a breaking change and I do not know if anybody relies on the current behavior.

cc @fjetter