Bound the wait for cancelled tasks to complete #82906

DaveCTurner · 2022-01-21T11:07:24Z

Cancelling a task on a remote node is an active process: we send a cancellation request to the remote node and continue to wait for it to indicate the completion of the task. Typically the task will complete with a TaskCancelledException but it might fail in a different way, or even succeed, if the cancellation loses the race to completion. If the remote node is unable to respond for some reason then today we wait indefinitely, and this means we cannot free the resources held by the listener. If the remote node remains unresponsive for long enough then the build-up of listeners on other nodes can cause a cascading failure.

In contrast, if we specify a timeout on the transport request that triggers the remote task then we complete the listener eagerly at the timeout, although the task is still running remotely. Indeed we don't even attempt to cancel the remote task in this case (#66992) so it just keeps on running.

I believe we should not wait indefinitely for a cancelled task to complete and should instead unilaterally complete the waiting listener with a TaskCancelledException to protect against an unresponsive remote node. Maybe we should do this straight away, similarly to how we handle a timeout, but we could also allow some time for the cancellation to happen gracefully first.

Relates #82337 which describes a similar problem specifically about stats requests, since these are often the source of actual problems in this area.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2022-01-21T11:07:27Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner · 2022-03-15T14:20:18Z

We (the @elastic/es-distributed team) discussed this today and agreed to proceed, at least with a short timeout. However this would change the behaviour of cancellations submitted via the tasks API, because such cancellations would no longer report failures to cancel remote tasks. I propose making this change only for cancellations triggered by a client which closes its HTTP connection before it received a response. WDYT?

dnhatn · 2022-03-18T02:06:04Z

I propose making this change only for cancellations triggered by a client which closes its HTTP connection before it received a response. WDYT?

+1. I think this is the right choice.

DaveCTurner added >enhancement :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. team-discuss labels Jan 21, 2022

elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jan 21, 2022

DaveCTurner removed the team-discuss label May 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bound the wait for cancelled tasks to complete #82906

Bound the wait for cancelled tasks to complete #82906

DaveCTurner commented Jan 21, 2022 •

edited

Loading

elasticmachine commented Jan 21, 2022

DaveCTurner commented Mar 15, 2022

dnhatn commented Mar 18, 2022

Bound the wait for cancelled tasks to complete #82906

Bound the wait for cancelled tasks to complete #82906

Comments

DaveCTurner commented Jan 21, 2022 • edited Loading

elasticmachine commented Jan 21, 2022

DaveCTurner commented Mar 15, 2022

dnhatn commented Mar 18, 2022

DaveCTurner commented Jan 21, 2022 •

edited

Loading