Bound the wait for cancelled tasks to complete #82906
Labels
:Distributed Coordination/Task Management
Issues for anything around the Tasks API - both persistent and node level.
>enhancement
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Cancelling a task on a remote node is an active process: we send a cancellation request to the remote node and continue to wait for it to indicate the completion of the task. Typically the task will complete with a `TaskCancelledException` but it might fail in a different way, or even succeed, if the cancellation loses the race to completion. If the remote node is unable to respond for some reason then today we wait indefinitely, and this means we cannot free the resources held by the listener. If the remote node remains unresponsive for long enough then the build-up of listeners on other nodes can cause a cascading failure.

In contrast, if we specify a timeout on the transport request that triggers the remote task then we complete the listener eagerly at the timeout, although the task is still running remotely. Indeed we don't even attempt to cancel the remote task in this case (#66992) so it just keeps on running.
I believe we should not wait indefinitely for a cancelled task to complete and should instead unilaterally complete the waiting listener with a `TaskCancelledException` to protect against an unresponsive remote node. Maybe we should do this straight away, similarly to how we handle a timeout, but we could also allow some time for the cancellation to happen gracefully first.

Relates #82337 which describes a similar problem specifically about stats requests, since these are often the source of actual problems in this area.
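
To make the proposal concrete, here is a minimal sketch of the "allow a grace period, then complete the listener unilaterally" idea. It uses only plain `java.util.concurrent` types, not the actual Elasticsearch transport or task classes. The class name `BoundedCancellationWait`, the method `boundWaitAfterCancel`, and the nested `TaskCancelledException` are hypothetical stand-ins for illustration, and the grace period value is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class BoundedCancellationWait {

    /** Hypothetical stand-in for the exception type named in this issue. */
    public static class TaskCancelledException extends RuntimeException {
        public TaskCancelledException(String message) {
            super(message);
        }
    }

    /**
     * Call after the cancellation request has been sent to the remote node.
     * The returned future plays the role of the waiting listener: it completes
     * with whatever the remote node eventually reports, or with a
     * TaskCancelledException once the grace period elapses, whichever is first.
     */
    public static <T> CompletableFuture<T> boundWaitAfterCancel(
            CompletableFuture<T> remoteResponse,
            long gracePeriod,
            TimeUnit unit,
            ScheduledExecutorService scheduler) {

        CompletableFuture<T> listener = new CompletableFuture<>();

        // Fallback: stop waiting for an unresponsive node and release the listener.
        ScheduledFuture<?> timer = scheduler.schedule(() -> {
            listener.completeExceptionally(new TaskCancelledException(
                    "task cancelled; remote node did not respond within the grace period"));
        }, gracePeriod, unit);

        // Normal path: the remote node answered (success, cancellation, or another failure).
        remoteResponse.whenComplete((result, error) -> {
            timer.cancel(false);
            if (error != null) {
                listener.completeExceptionally(error);
            } else {
                listener.complete(result);
            }
        });

        return listener;
    }
}
```

A grace period of zero would correspond to the "do this straight away" option. Either way, a late response from the remote node is simply ignored, because a `CompletableFuture` only accepts its first completion.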