Issues with tasks completing on workers after being released and re-submitted #7356

Open
gjoseph92 opened this issue Nov 28, 2022 · 1 comment
Labels: bug (Something is broken), deadlock (The cluster appears to not make any progress), scheduler


@gjoseph92 (Collaborator)

#7348 fixed a deadlock where a task is released by the client and then re-submitted, but the worker completes the task and reports it to the scheduler before it hears that the task was cancelled.

That fix was specific to the queued state, but I think the problem (and the potential for deadlocks) is broader than that. More fundamentally, there's a consistency issue whenever the following happens (a minimal sketch follows the list):

  1. client cancels a task
  2. scheduler forgets about the task immediately before confirming that workers have also forgotten about the task
  3. the same task key is submitted again
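
For concreteness, here is a rough sketch of that sequence from the client side (just an illustration of the timing, not a confirmed reproducer; whether the race fires depends on when the worker's messages reach the scheduler relative to the re-submission):

```python
# Illustration only: release a running task and immediately re-submit the same key,
# so the worker's "task finished" message can race with the scheduler having already
# forgotten (and recreated) the task.
import time

from distributed import Client


def slow():
    time.sleep(1)
    return 1


client = Client(processes=False, n_workers=1, threads_per_worker=1)

fut = client.submit(slow)   # task starts running on the worker
fut.release()               # steps 1-2: client cancels; scheduler forgets the key eagerly
fut = client.submit(slow)   # step 3: same key (same pure function, same args) submitted
                            # again before the worker confirms it forgot the first run
print(fut.result())
client.close()
```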

Some theoretical possibilities (I haven't come up with tests that reproduce them yet, but I imagine they're slight variations of the test added in #7348):

  1. The task has resource restrictions, so it's in no-worker instead of queued. Maybe this causes a deadlock?
  2. The task is processing instead of queued. When re-submitted, the scheduler picks a different worker for it. Then, the message from the original worker that the task is finished arrives. This would trigger a RuntimeError on the scheduler from the task completing on an unexpected worker.
  3. The task and its dependencies were cancelled, then resubmitted, so the task is now waiting. Then the task-completed message arrives, triggering the waiting->memory transition @crusaderky just added in #7205 ("Edge and impossible transitions to memory"). FWIW, the docstring of `transition_waiting_memory` does not mention this case, so I don't think it's intended, though the behavior would be okay.
  4. After being cancelled and resubmitted, the task ran on a new worker and erred. Then, the task finished message from the original worker arrives. This would trigger an impossible erred->memory transition.

A couple ways to address this:

  1. Just don't let the scheduler eagerly forget about tasks when a client asks it to; instead, wait for confirmation from workers. client_releases_keys would send a free-keys message to workers, and only forget the task once the workers confirm they've forgotten it. (We can transition the task to released immediately, just not forgotten.) This ensures the scheduler maintains a consistent view of the workers' state.
  2. On each TaskState, store the value of transition_counter at the time it was created, to use for disambiguating re-submitted tasks (sketched after this list). Send this value to workers as part of the task spec, and have workers include it in every message they send about the key. If the value doesn't match what the scheduler has on record for that task (i.e. it's been forgotten and re-submitted in the interim), we know the message is stale, and we can ignore it and just tell the worker to release that key. (We'd have to be careful about consistency here, though: we don't want to accidentally tell the worker to release a newer version of the key, if it has that too.)
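
To make option 2 concrete, here is a rough sketch of the stale-message check on the scheduler side (the names `creation_counter`, `MiniScheduler`, and `handle_task_finished` are hypothetical and only illustrate the idea; they are not the actual distributed API):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskState:
    key: str
    creation_counter: int  # scheduler's transition_counter when this instance was created
    state: str = "released"


class MiniScheduler:
    def __init__(self) -> None:
        self.transition_counter = 0
        self.tasks: dict[str, TaskState] = {}

    def new_task(self, key: str) -> TaskState:
        # Stamp each TaskState with the current transition_counter; the stamp would be
        # sent to workers as part of the task spec and echoed back in their messages.
        self.transition_counter += 1
        ts = TaskState(key, creation_counter=self.transition_counter)
        self.tasks[key] = ts
        return ts

    def handle_task_finished(self, key: str, creation_counter: int) -> str:
        ts = self.tasks.get(key)
        if ts is None or ts.creation_counter != creation_counter:
            # The message refers to a forgotten/re-submitted version of the key:
            # ignore it and (carefully) tell only that worker to free its stale copy.
            return "ignore-and-free"
        ts.state = "memory"
        return "ok"


sched = MiniScheduler()
old = sched.new_task("f-0")     # original submission
del sched.tasks["f-0"]          # client releases; scheduler forgets the key eagerly
new = sched.new_task("f-0")     # same key submitted again

# A late "task finished" from the original worker carries the old stamp and is ignored,
# while messages about the new TaskState are applied normally.
assert sched.handle_task_finished("f-0", old.creation_counter) == "ignore-and-free"
assert sched.handle_task_finished("f-0", new.creation_counter) == "ok"
```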

cc @fjetter @crusaderky @hendrikmakait

@gjoseph92 added the bug, deadlock, and scheduler labels on Nov 28, 2022
@fjetter (Member)

fjetter commented Dec 8, 2022

> More fundamentally, there's a consistency issue whenever:

I'm pretty sure I added more than one test that covers this case. That's the entire reason why we have cancelled/resumed states.

> 1. The task has resource restrictions, so it's in no-worker instead of queued. Maybe this causes a deadlock?

If it is transitioned to no-worker, then by definition the only workers that were executing the task are dead. Their batched comm is dead, and there is no way to come back from this since we removed worker reconnect. If we received such a message, the worker restrictions must have been changed mid-flight. I'm perfectly fine with something going wrong if somebody messes with internal state (I know `Scheduler.set_restrictions` doesn't have an underscore, but I still don't consider it "public").

> 2. The task is processing instead of queued. When re-submitted, the scheduler picks a different worker for it.

This is even specifically parametrized in the test I added in #7348.

> 3. The task and its dependencies were cancelled, then resubmitted, so the task is now waiting

IIUC, the worst thing that happens here is that we recompute the key: waiting->memory will just keep the task in waiting, and we'll get to it eventually. Not nice, but also not a problem for a rare edge case.

> 4. After being cancelled and resubmitted, the task ran on a new worker and erred. Then, the task finished message from the original worker arrives. This would trigger an impossible erred->memory transition.

No, it wouldn't. This is explicitly handled in the stimulus handler for erred tasks:

    # reports about a task that is not currently processing (e.g. one that was
    # forgotten and re-submitted) are dropped
    if ts is None or ts.state != "processing":
        return {}, {}, {}

A cleaner implementation would be something like #7372, but this should be good enough.

I just realized that #7372 is similar to your suggestion 2, but there are a few caveats on the worker side to figure out first before we can give these kinds of guarantees.


TL;DR: From what I understand, all of your theoretical problem cases are handled and won't result in a problem unless somebody changes restrictions mid-flight.
We'll need to look at this a bit more closely in #7353, where this could actually happen, IIUC.

Otherwise, I'm inclined to close this ticket unless there is an actual reproducer.

gjoseph92 added a commit to gjoseph92/distributed that referenced this issue Dec 16, 2022
This fixes `test_deadlock_resubmit_queued_tasks_fast`. The problem:
- an "old" TaskState `f-0` is forgotten and removed from `queued` HeapSet. Internally, this removes it from a set, but a weakref is still on the heap
- a new TaskState `f-0` is added to the HeapSet. It's pushed onto the heap, but since its priority is higher (newer generation), it comes after the old `f-0` object that's still on the front of the heap
- `pop` pops old `f-0` off the heap. The weakref is still alive (for whatever reason). `value in self._data` (the set) is True, because the _hash_ of old `f-0` and new `f-0` are the same (same key). So we return the stale, old TaskState object.

Much like `WorkerState`s, there should be exactly 1 `TaskState` instance per task. If there are multiple instances with the same key, they are different tasks, and have different state. xref dask#7372, dask#7356.
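
For illustration, here is a stripped-down sketch of that failure mode (this is not the real `distributed.collections.HeapSet`, just a minimal stand-in showing how a heap of weakrefs combined with key-based hashing can hand back a stale object):

```python
import heapq
import weakref


class Task:
    """Stand-in for TaskState: hash/eq are based on the key only."""

    def __init__(self, key: str, priority: int):
        self.key = key
        self.priority = priority

    def __hash__(self) -> int:
        return hash(self.key)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Task) and other.key == self.key


class NaiveHeapSet:
    def __init__(self) -> None:
        self._data = set()   # membership, keyed by hash/eq (i.e. by key)
        self._heap = []      # (priority, tiebreaker, weakref to task)

    def add(self, task: Task) -> None:
        self._data.add(task)
        heapq.heappush(self._heap, (task.priority, id(task), weakref.ref(task)))

    def discard(self, task: Task) -> None:
        self._data.discard(task)  # but its weakref is still sitting on the heap

    def pop(self) -> Task:
        while self._heap:
            _, _, ref = heapq.heappop(self._heap)
            task = ref()
            # BUG: for the *old* object this check passes, because the new object
            # with the same key compares equal, so we return stale state.
            if task is not None and task in self._data:
                return task
        raise KeyError("pop from an empty NaiveHeapSet")


old = Task("f-0", priority=1)   # original TaskState
hs = NaiveHeapSet()
hs.add(old)
hs.discard(old)                 # scheduler forgets the old task; weakref stays on the heap

new = Task("f-0", priority=2)   # re-submitted: same key, new TaskState, later priority
hs.add(new)

assert hs.pop() is old          # the stale, old object comes back instead of `new`
```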