
[native] Add functionality to cancel abandoned tasks. #21779

Merged
merged 1 commit into prestodb:master on Apr 30, 2024

Conversation

spershin
Contributor

@spershin commented Jan 25, 2024

Description

During restarts and cluster updates we noticed that the coordinator restarts quickly,
while workers take a long time because they are waiting for the running Tasks to finish.

The solution is to cancel the Tasks that are still running but haven't
been accessed by the coordinator for a considerable time.
We leverage the same timeout we use for cleaning up old Tasks.
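
Roughly, the check looks like this. A minimal sketch with illustrative names (Task, lastCoordinatorAccess, and cancelAbandonedTasks are stand-ins, not the actual TaskManager API):

```cpp
#include <chrono>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-in for a worker task; the real type lives in presto_cpp.
struct Task {
  bool running{true};
  std::chrono::steady_clock::time_point lastCoordinatorAccess;
  void requestCancel() { running = false; }
};

// Cancel every task still running that the coordinator has not accessed
// for longer than 'oldTaskCleanUp', the same timeout used to clean up
// old tasks.
void cancelAbandonedTasks(
    std::unordered_map<std::string, std::shared_ptr<Task>>& tasks,
    std::chrono::milliseconds oldTaskCleanUp) {
  const auto now = std::chrono::steady_clock::now();
  for (auto& [taskId, task] : tasks) {
    if (task->running &&
        now - task->lastCoordinatorAccess >= oldTaskCleanUp) {
      task->requestCancel(); // Abandoned: the coordinator forgot about it.
    }
  }
}
```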

Motivation and Context

To speed up cluster restarts and updates when queries have clearly failed already.

== NO RELEASE NOTE ==

@spershin requested a review from a team as a code owner January 25, 2024 00:09
@tanjialiang
Contributor

tanjialiang commented Jan 25, 2024

Shutdown was good 2 weeks ago. Should we identify why this is the case before we kill the signal?

@spershin
Contributor Author

Shutdown was good 2 weeks ago. Should we identify why this is the case before we kill the signal?

Shutdown has always suffered from long-running queries from time to time.
This is not related to the stuck driver threads.

@amitkdutta
Contributor

amitkdutta commented Jan 25, 2024

Here is a log snippet from shutdown

I0124 10:27:55.212841    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (9 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:27:56.213038    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (10 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:27:57.213208    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (11 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:27:58.213318    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (12 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:27:59.213505    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (13 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
**W0124 10:28:00.165964   511 PeriodicServiceInventoryManager.cpp:76] Error occurred during updating service address: (SELECTION_NO_HOST_IN_SMC) For service 'presto.tier.coordinator.https', no hosts available in SMC**
I0124 10:28:00.213656    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (14 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:01.213868    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (15 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:02.214051    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (16 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:03.214160    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (17 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:04.214345    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (18 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:05.214468    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (19 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:06.214578    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (20 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:06.628034   535 TaskManager.cpp:693] cleanOldTasks: Cleaned 8 old task(s) in 1 ms
I0124 10:28:06.718971   535 PeriodicTaskManager.cpp:618] Spill memory usage: current[0B] peak[0B]
I0124 10:28:06.740123   535 PeriodicTaskManager.cpp:405] Cache stats:
Cache size: XGB tinySize: YMB large size: WGB
....
Prefetch entries: 0 bytes: 0B
Alloc Megaclocks ABC124
I0124 10:28:07.214740    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (21 seconds so far) for 'Running' tasks to complete. 4 tasks left: Running=2 Finished=2 
I0124 10:28:08.214918    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (22 seconds so far) for 'Running' tasks to complete. 4 tasks left: Running=2 Finished=2 
I0124 10:28:09.215049    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (23 seconds so far) for 'Running' tasks to complete. 4 tasks left: Running=2 Finished=2 
...
W0124 10:28:33.103857   511 PeriodicServiceInventoryManager.cpp:76] Error occurred during updating service address: (SELECTION_NO_HOST_IN_SMC) For service 'presto.tier.coordinator.https', no hosts available in SMC

So if the worker receives the SHUTDOWN signal and can't discover the coordinator anymore, it's completely safe to abandon anything running; there is no point continuing. It can be better than using oldTaskCleanUpMs_, at least during shutdown.

@tdcmeehan
Contributor

IMO, we should follow Java worker logic, which takes this a step further and cancels all abandoned tasks on a timer:

Calls to get the task info and task status are heartbeats: basically, if we stop heartbeating, we should kill the task no matter what.
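
A rough sketch of what "on a timer" could mean on the native worker (hypothetical class; the real worker would hook this into its existing PeriodicTaskManager machinery rather than spawn a raw thread):

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical periodic driver: run 'check' every 'interval' until stopped.
class AbandonedTaskReaper {
 public:
  AbandonedTaskReaper(std::function<void()> check,
                      std::chrono::milliseconds interval)
      : thread_([this, check = std::move(check), interval] {
          while (!stop_.load()) {
            std::this_thread::sleep_for(interval);
            check(); // E.g. calls cancelAbandonedTasks(tasks, timeout).
          }
        }) {}

  ~AbandonedTaskReaper() { stop(); }

  // Idempotent: the first call signals the loop and joins the thread.
  void stop() {
    if (!stop_.exchange(true)) {
      thread_.join();
    }
  }

 private:
  std::atomic<bool> stop_{false}; // Declared before thread_ so the loop
                                  // never reads an uninitialized flag.
  std::thread thread_;
};
```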

@spershin
Contributor Author

@tdcmeehan

Are you saying we should fail abandoned queries even before shutting down? I'm a little bit concerned about it. A few network hiccups and we might be failing legitimate queries that are still running.

This PR solves a particular problem: during shutdown the worker does not get signals from the coordinator that tasks are no longer needed. Because of this, we wait until the container kills us (timeout). This makes restarts and pushes take too much time.

Also, the code snippet you referenced is likely running on the coordinator, which cancels queries abandoned by the client.
Here we are talking about the workers.

Let me know if I misunderstood.

@spershin
Contributor Author

@amitkdutta

So if the worker receives the SHUTDOWN signal and can't discover the coordinator anymore, it's completely safe to abandon anything running; there is no point continuing.

Almost like that, except we cancel tasks not because we don't see a coordinator, but because the coordinator does not ask about the tasks anymore (abandoned, because it restarted and does not know about them).

It can be better than using oldTaskCleanUpMs_, at least during shutdown.

Not sure I understand this.

@tdcmeehan
Contributor

@spershin you make sure the timeout interval is something that can tolerate hiccups. The code above is on the worker. IMO, this code should handle these different problems in the same way:

  1. The coordinator is shutting down, but something prevents the coordinator from explicitly killing the task. (It seems from your message it's recurrent: do you understand why this is the case? I presume it's just for native workers?)
  2. The coordinator has died suddenly (e.g. OOM).

If you don't handle them in the same way, then what's the plan for #2?

@spershin
Contributor Author

@tdcmeehan
As I pointed out earlier, for us the issue is during restart: it delays it.
I haven't had time to look up in the coordinator code why this would happen (nor am I very familiar with it). I suspect it should happen in Presto Java too, not just Native. We don't see this on Java, maybe because we don't shadow that extensively on Java (no such active development), and in prod we drain and then wait before restarting.
I would dare to assume (without looking at the code) that the coordinator itself lacks graceful shutdown logic?

For coordinator crashes/OOMs: we aren't bothered by this; it never happens. However, if it did happen, then the tasks would live on the workers until shutdown, true. The drawback is a bit of wasted CPU and memory.

So, you think we should just do this check and abandon tasks all the time, not just on shutdown?

@amitkdutta
Contributor

It can be better than using oldTaskCleanUpMs_, at least during shutdown.

What I meant is: once the worker sees the coordinator is gone

**W0124 10:28:00.165964   511 PeriodicServiceInventoryManager.cpp:76] Error occurred during updating service address: (SELECTION_NO_HOST_IN_SMC) For service 'presto.tier.coordinator.https', no hosts available in SMC**

then it can clean up, as there is no point running any tasks, instead of using a timing threshold.

@spershin
Contributor Author

@amitkdutta

I'm not sure this would work properly. The coordinator can restart quickly and come back to normal, and depending on the timeout we wouldn't notice it. Or there's some network hiccup, and we consider the coordinator lost if the timeout is small.

It seems safer to do it on a per-task basis: if a task hasn't been accessed by the coordinator for some time (for example, oldTaskCleanUpMs_), then it is safe to abort it.

@xiaoxmeng
Contributor

@tdcmeehan As I pointed out earlier, for us the issue is during restart: it delays it. I haven't had time to look up in the coordinator code why this would happen (nor am I very familiar with it). I suspect it should happen in Presto Java too, not just Native. We don't see this on Java, maybe because we don't shadow that extensively on Java (no such active development), and in prod we drain and then wait before restarting. I would dare to assume (without looking at the code) that the coordinator itself lacks graceful shutdown logic?

For coordinator crashes/OOMs: we aren't bothered by this; it never happens. However, if it did happen, then the tasks would live on the workers until shutdown, true. The drawback is a bit of wasted CPU and memory.

So, you think we should just do this check and abandon tasks all the time, not just on shutdown?

@spershin can we consider aligning with the Java worker on this, as @tdcmeehan suggested? We can do this in a follow-up; I feel it might be good to have.

xiaoxmeng previously approved these changes Apr 25, 2024
Contributor

@xiaoxmeng left a comment


@spershin LGTM. Is it possible to add a unit test for TaskManager shutdown to have better coverage? Thanks!

@@ -163,6 +163,10 @@ class TaskManager {
"concurrent_lifespans_per_task"};
static constexpr folly::StringPiece kSessionTimezone{"session_timezone"};

// This is called during shutting down. We terminate all running tasks which
// haven't been pinged by coordinator for a considerable time.
xiaoxmeng
Contributor

nit: haven't received heartbeats from coordinator ...

spershin
Contributor Author

@xiaoxmeng

nit: haven't received heartbeats from coordinator ...

Actually, this is not what's happening.
We don't get heartbeats from the coordinator.
The coordinator asks us all sorts of things about the task, and we update what we call the 'heartbeat' time.
So it boils down to a 'coordinator forgot about the task' event.
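
In code terms, the mechanism amounts to something like this (again with illustrative names): every coordinator request about a task refreshes a per-task timestamp, and a stale timestamp means the coordinator forgot about the task.

```cpp
#include <chrono>

// Illustrative: any coordinator request about a task (get info, get
// status, update, ...) refreshes the task's "heartbeat" time.
struct TaskHeartbeat {
  std::chrono::steady_clock::time_point last =
      std::chrono::steady_clock::now();

  // Called on every coordinator access to the task.
  void touch() { last = std::chrono::steady_clock::now(); }

  // True if the coordinator hasn't asked about the task within 'timeout':
  // the "coordinator forgot about the task" event.
  bool abandoned(std::chrono::milliseconds timeout) const {
    return std::chrono::steady_clock::now() - last >= timeout;
  }
};
```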

Request cancellation for tasks which haven't been
accessed by coordinator for a considerable time.
@spershin force-pushed the KillOrphanedTasksDuringShutdown branch from 3862c24 to 503fb01 on April 30, 2024 03:03
@spershin changed the title from "[native] Cancelling abandoned tasks during shutdown." to "[native] Add functionality to cancel abandoned tasks." on Apr 30, 2024
xiaoxmeng
Contributor

@xiaoxmeng left a comment

@spershin thanks!

@spershin merged commit affd83e into prestodb:master Apr 30, 2024
59 checks passed