Issues with tasks completing on workers after being released and re-submitted #7356

Open
gjoseph92 opened this issue Nov 28, 2022 · 1 comment
Labels: bug (Something is broken), deadlock (The cluster appears to not make any progress), scheduler


@gjoseph92 (Collaborator)

#7348 fixed a deadlock where a task is released by the client and then re-submitted, but the worker completes the task and reports it to the scheduler before it hears that the task was cancelled.

That fix was specific to the queued state, but I think the problem (and the potential for deadlocks) is broader than that. More fundamentally, there's a consistency issue whenever the following happens (a minimal sketch follows the list):

  1. client cancels a task
  2. scheduler forgets about the task immediately before confirming that workers have also forgotten about the task
  3. the same task key is submitted again
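
For concreteness, here is a rough sketch of that sequence from the client side (just an illustration of the timing, not a confirmed reproducer; whether the race fires depends on when the worker's messages reach the scheduler relative to the re-submission):

```python
# Illustration only: release a running task and immediately re-submit the same key,
# so the worker's "task finished" message can race with the scheduler having already
# forgotten (and recreated) the task.
import time

from distributed import Client


def slow():
    time.sleep(1)
    return 1


client = Client(processes=False, n_workers=1, threads_per_worker=1)

fut = client.submit(slow)   # task starts running on the worker
fut.release()               # steps 1-2: client cancels; scheduler forgets the key eagerly
fut = client.submit(slow)   # step 3: same key (same pure function, same args) submitted
                            # again before the worker confirms it forgot the first run
print(fut.result())
client.close()
```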

Some theoretical possibilities (I haven't come up with tests that reproduce them yet, but I imagine they're slight variations of the test added in #7348):

  1. The task has resource restrictions, so it's in no-worker instead of queued. Maybe this causes a deadlock?
  2. The task is processing instead of queued. When re-submitted, the scheduler picks a different worker for it. Then, the message from the original worker that the task is finished arrives. This would trigger a RuntimeError on the scheduler from the task completing on an unexpected worker.
  3. The task and its dependencies were cancelled, then resubmitted, so the task is now waiting. Then the task-completed message arrives, triggering the waiting->memory transition @crusaderky just added in #7205 ("Edge and impossible transitions to memory"). FWIW, the docstring of `transition_waiting_memory` does not mention this case, so I don't think it's intended, though the behavior would be okay.
  4. After being cancelled and resubmitted, the task ran on a new worker and erred. Then, the task finished message from the original worker arrives. This would trigger an impossible erred->memory transition.

A couple ways to address this:

  1. Just don't let the scheduler eagerly forget about tasks when a client asks it to; instead, wait for confirmation from workers. client_releases_keys would send a free-keys message to workers, and only forget the task once the workers confirm they've forgotten it. (We can transition the task to released immediately, just not forgotten.) This ensures the scheduler maintains a consistent view of the workers' state.
  2. On each TaskState, store the value of transition_counter at the time it was created, to use for disambiguating re-submitted tasks (sketched after this list). Send this value to workers as part of the task spec, and have workers include it in every message they send about the key. If the value doesn't match what the scheduler has on record for that task (i.e. it's been forgotten and re-submitted in the interim), we know the message is stale, and we can ignore it and just tell the worker to release that key. (We'd have to be careful about consistency here, though: we don't want to accidentally tell the worker to release a newer version of the key, if it has that too.)
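
To make option 2 concrete, here is a rough sketch of the stale-message check on the scheduler side (the names `creation_counter`, `MiniScheduler`, and `handle_task_finished` are hypothetical and only illustrate the idea; they are not the actual distributed API):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskState:
    key: str
    creation_counter: int  # scheduler's transition_counter when this instance was created
    state: str = "released"


class MiniScheduler:
    def __init__(self) -> None:
        self.transition_counter = 0
        self.tasks: dict[str, TaskState] = {}

    def new_task(self, key: str) -> TaskState:
        # Stamp each TaskState with the current transition_counter; the stamp would be
        # sent to workers as part of the task spec and echoed back in their messages.
        self.transition_counter += 1
        ts = TaskState(key, creation_counter=self.transition_counter)
        self.tasks[key] = ts
        return ts

    def handle_task_finished(self, key: str, creation_counter: int) -> str:
        ts = self.tasks.get(key)
        if ts is None or ts.creation_counter != creation_counter:
            # The message refers to a forgotten/re-submitted version of the key:
            # ignore it and (carefully) tell only that worker to free its stale copy.
            return "ignore-and-free"
        ts.state = "memory"
        return "ok"


sched = MiniScheduler()
old = sched.new_task("f-0")     # original submission
del sched.tasks["f-0"]          # client releases; scheduler forgets the key eagerly
new = sched.new_task("f-0")     # same key submitted again

# A late "task finished" from the original worker carries the old stamp and is ignored,
# while messages about the new TaskState are applied normally.
assert sched.handle_task_finished("f-0", old.creation_counter) == "ignore-and-free"
assert sched.handle_task_finished("f-0", new.creation_counter) == "ok"
```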

cc @fjetter @crusaderky @hendrikmakait

@gjoseph92 added the bug, deadlock, and scheduler labels on Nov 28, 2022
@fjetter (Member)

fjetter commented Dec 8, 2022

> More fundamentally, there's a consistency issue whenever:

I'm pretty sure I added more than one test that covers this case. That's the entire reason why we have cancelled/resumed states.

> 1. The task has resource restrictions, so it's in no-worker instead of queued. Maybe this causes a deadlock?

If it is transitioned to no-worker, then by definition the only workers that were executing the task are dead. Their batched comm is dead, and there is no way to come back from this since we removed worker reconnect. If we received such a message, the worker restrictions must have been changed mid-flight. I'm perfectly fine with something going wrong if somebody messes with internal state (I know `Scheduler.set_restrictions` doesn't have an underscore, but I still don't consider it "public").

> 2. The task is processing instead of queued. When re-submitted, the scheduler picks a different worker for it.

This is even specifically parametrized in the test I added in #7348.

> 3. The task and its dependencies were cancelled, then resubmitted, so the task is now waiting

IIUC, the worst thing that happens here is that we recompute the key: waiting->memory will just keep the task in waiting, and we'll get to it eventually. Not nice, but also not a problem for a rare edge case.

> 4. After being cancelled and resubmitted, the task ran on a new worker and erred. Then, the task finished message from the original worker arrives. This would trigger an impossible erred->memory transition.

No, it wouldn't. This is explicitly handled in the stimulus handler for erred tasks:

    # reports about a task that is not currently processing (e.g. one that was
    # forgotten and re-submitted) are dropped
    if ts is None or ts.state != "processing":
        return {}, {}, {}

A cleaner implementation would be something like #7372, but this should be good enough.

I just realized that #7372 is similar to your suggestion 2, but there are a few caveats on the worker side to figure out first before we can give these kinds of guarantees.


TL;DR: From what I understand, all of your theoretical problem cases are handled and won't result in a problem unless somebody changes restrictions mid-flight.
We'll need to look at this a bit more closely in #7353, where this could actually happen, IIUC.

Otherwise, I'm inclined to close this ticket unless there is an actual reproducer.

gjoseph92 added a commit to gjoseph92/distributed that referenced this issue Dec 16, 2022
This fixes `test_deadlock_resubmit_queued_tasks_fast`. The problem:
- an "old" TaskState `f-0` is forgotten and removed from `queued` HeapSet. Internally, this removes it from a set, but a weakref is still on the heap
- a new TaskState `f-0` is added to the HeapSet. It's pushed onto the heap, but since its priority is higher (newer generation), it comes after the old `f-0` object that's still on the front of the heap
- `pop` pops old `f-0` off the heap. The weakref is still alive (for whatever reason). `value in self._data` (the set) is True, because the _hash_ of old `f-0` and new `f-0` are the same (same key). So we return the stale, old TaskState object.

Much like `WorkerState`s, there should be exactly 1 `TaskState` instance per task. If there are multiple instances with the same key, they are different tasks, and have different state. xref dask#7372, dask#7356.
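
For illustration, here is a stripped-down sketch of that failure mode (this is not the real `distributed.collections.HeapSet`, just a minimal stand-in showing how a heap of weakrefs combined with key-based hashing can hand back a stale object):

```python
import heapq
import weakref


class Task:
    """Stand-in for TaskState: hash/eq are based on the key only."""

    def __init__(self, key: str, priority: int):
        self.key = key
        self.priority = priority

    def __hash__(self) -> int:
        return hash(self.key)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Task) and other.key == self.key


class NaiveHeapSet:
    def __init__(self) -> None:
        self._data = set()   # membership, keyed by hash/eq (i.e. by key)
        self._heap = []      # (priority, tiebreaker, weakref to task)

    def add(self, task: Task) -> None:
        self._data.add(task)
        heapq.heappush(self._heap, (task.priority, id(task), weakref.ref(task)))

    def discard(self, task: Task) -> None:
        self._data.discard(task)  # but its weakref is still sitting on the heap

    def pop(self) -> Task:
        while self._heap:
            _, _, ref = heapq.heappop(self._heap)
            task = ref()
            # BUG: for the *old* object this check passes, because the new object
            # with the same key compares equal, so we return stale state.
            if task is not None and task in self._data:
                return task
        raise KeyError("pop from an empty NaiveHeapSet")


old = Task("f-0", priority=1)   # original TaskState
hs = NaiveHeapSet()
hs.add(old)
hs.discard(old)                 # scheduler forgets the old task; weakref stays on the heap

new = Task("f-0", priority=2)   # re-submitted: same key, new TaskState, later priority
hs.add(new)

assert hs.pop() is old          # the stale, old object comes back instead of `new`
```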