
[native] Add functionality to cancel abandoned tasks. #21779

Merged
merged 1 commit into prestodb:master on Apr 30, 2024

Conversation

spershin
Contributor

@spershin commented Jan 25, 2024

Description

During restarts and cluster updates we noticed that the coordinator restarts quickly,
while workers take a long time because they are waiting for the running Tasks to finish.

The solution is to cancel the Tasks that are still running but haven't
been accessed by the coordinator for a considerable time.
We leverage the same timeout we use for cleaning up old Tasks.
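
Roughly, the check looks like this. A minimal sketch with illustrative names (Task, lastCoordinatorAccess, and cancelAbandonedTasks are stand-ins, not the actual TaskManager API):

```cpp
#include <chrono>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-in for a worker task; the real type lives in presto_cpp.
struct Task {
  bool running{true};
  std::chrono::steady_clock::time_point lastCoordinatorAccess;
  void requestCancel() { running = false; }
};

// Cancel every task still running that the coordinator has not accessed
// for longer than 'oldTaskCleanUp', the same timeout used to clean up
// old tasks.
void cancelAbandonedTasks(
    std::unordered_map<std::string, std::shared_ptr<Task>>& tasks,
    std::chrono::milliseconds oldTaskCleanUp) {
  const auto now = std::chrono::steady_clock::now();
  for (auto& [taskId, task] : tasks) {
    if (task->running &&
        now - task->lastCoordinatorAccess >= oldTaskCleanUp) {
      task->requestCancel(); // Abandoned: the coordinator forgot about it.
    }
  }
}
```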

Motivation and Context

To speed up cluster restarts and updates when queries have clearly failed already.

== NO RELEASE NOTE ==

@spershin requested a review from a team as a code owner January 25, 2024 00:09
@tanjialiang
Contributor

tanjialiang commented Jan 25, 2024

Shutdown was good 2 weeks ago. Should we identify why this is the case before we kill the signal?

@spershin
Contributor Author

Shutdown was good 2 weeks ago. Should we identify why this is the case before we kill the signal?

Shutdown has always suffered from long-running queries from time to time.
This is not related to the stuck driver threads.

@amitkdutta
Contributor

amitkdutta commented Jan 25, 2024

Here is a log snippet from shutdown

I0124 10:27:55.212841    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (9 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:27:56.213038    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (10 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:27:57.213208    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (11 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:27:58.213318    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (12 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:27:59.213505    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (13 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
**W0124 10:28:00.165964   511 PeriodicServiceInventoryManager.cpp:76] Error occurred during updating service address: (SELECTION_NO_HOST_IN_SMC) For service 'presto.tier.coordinator.https', no hosts available in SMC**
I0124 10:28:00.213656    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (14 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:01.213868    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (15 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:02.214051    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (16 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:03.214160    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (17 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:04.214345    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (18 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:05.214468    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (19 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:06.214578    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (20 seconds so far) for 'Running' tasks to complete. 12 tasks left: Running=2 Finished=10 
I0124 10:28:06.628034   535 TaskManager.cpp:693] cleanOldTasks: Cleaned 8 old task(s) in 1 ms
I0124 10:28:06.718971   535 PeriodicTaskManager.cpp:618] Spill memory usage: current[0B] peak[0B]
I0124 10:28:06.740123   535 PeriodicTaskManager.cpp:405] Cache stats:
Cache size: XGB tinySize: YMB large size: WGB
....
Prefetch entries: 0 bytes: 0B
Alloc Megaclocks ABC124
I0124 10:28:07.214740    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (21 seconds so far) for 'Running' tasks to complete. 4 tasks left: Running=2 Finished=2 
I0124 10:28:08.214918    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (22 seconds so far) for 'Running' tasks to complete. 4 tasks left: Running=2 Finished=2 
I0124 10:28:09.215049    93 TaskManager.cpp:1068] [PRESTO_SHUTDOWN] Waited (23 seconds so far) for 'Running' tasks to complete. 4 tasks left: Running=2 Finished=2 
...
W0124 10:28:33.103857   511 PeriodicServiceInventoryManager.cpp:76] Error occurred during updating service address: (SELECTION_NO_HOST_IN_SMC) For service 'presto.tier.coordinator.https', no hosts available in SMC

So if the worker receives the SHUTDOWN signal and can't discover the coordinator anymore, it's completely safe to abandon anything running; there is no point continuing. It can be better than using oldTaskCleanUpMs_, at least during shutdown.

@tdcmeehan
Contributor

IMO, we should follow Java worker logic, which takes this a step further and cancels all abandoned tasks on a timer:

Calls to get the task info and task status are heartbeats: basically, if we stop heartbeating, we should kill the task no matter what.
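
A rough sketch of what "on a timer" could mean on the native worker (hypothetical class; the real worker would hook this into its existing PeriodicTaskManager machinery rather than spawn a raw thread):

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical periodic driver: run 'check' every 'interval' until stopped.
class AbandonedTaskReaper {
 public:
  AbandonedTaskReaper(std::function<void()> check,
                      std::chrono::milliseconds interval)
      : thread_([this, check = std::move(check), interval] {
          while (!stop_.load()) {
            std::this_thread::sleep_for(interval);
            check(); // E.g. calls cancelAbandonedTasks(tasks, timeout).
          }
        }) {}

  ~AbandonedTaskReaper() { stop(); }

  // Idempotent: the first call signals the loop and joins the thread.
  void stop() {
    if (!stop_.exchange(true)) {
      thread_.join();
    }
  }

 private:
  std::atomic<bool> stop_{false}; // Declared before thread_ so the loop
                                  // never reads an uninitialized flag.
  std::thread thread_;
};
```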

@spershin
Contributor Author

@tdcmeehan

Are you saying we should fail abandoned queries even before shutting down? I'm a little bit concerned about it. A few network hiccups and we might be failing legitimate queries that are still running.

This PR solves a particular problem: during shutdown the worker does not get signals from the coordinator that tasks are no longer needed. Because of this, we wait until the container kills us (timeout). This makes restarts and pushes take too much time.

Also, the code snippet you referenced is likely running on the coordinator, which cancels queries abandoned by the client.
Here we are talking about the workers.

Let me know if I misunderstood.

@spershin
Contributor Author

@amitkdutta

So if the worker receives the SHUTDOWN signal and can't discover the coordinator anymore, it's completely safe to abandon anything running; there is no point continuing.

Almost like that, except we cancel tasks not because we don't see a coordinator, but because the coordinator does not ask about the tasks anymore (abandoned, because it restarted and does not know about them).

It can be better than using oldTaskCleanUpMs_, at least during shutdown.

Not sure I understand this.

@tdcmeehan
Contributor

@spershin you make sure the timeout interval is something that can tolerate hiccups. The code above is on the worker. IMO, this code should handle these different problems in the same way:

  1. The coordinator is shutting down, but something prevents the coordinator from explicitly killing the task. (It seems from your message it's recurrent: do you understand why this is the case? I presume it's just for native workers?)
  2. The coordinator has died suddenly (e.g. OOM).

If you don't handle them in the same way, then what's the plan for #2?

@spershin
Contributor Author

@tdcmeehan
As I pointed out earlier, for us the issue is during restart: it delays it.
I haven't had time to look up in the coordinator code why this would happen (nor am I very familiar with it). I suspect it should happen in Presto Java too, not just Native. We don't see this on Java, maybe because we don't shadow that extensively on Java (no such active development), and in prod we drain and then wait before restarting.
I would dare to assume (without looking at the code) that the coordinator itself lacks graceful shutdown logic?

For coordinator crashes/OOMs: we aren't bothered by this; it never happens. However, if it did happen, then the tasks would live on the workers until shutdown, true. The drawback is a bit of wasted CPU and memory.

So, you think we should just do this check and abandon tasks all the time, not just on shutdown?

@amitkdutta
Contributor

It can be better than using oldTaskCleanUpMs_, at least during shutdown.

What I meant is: once the worker sees the coordinator is gone

**W0124 10:28:00.165964   511 PeriodicServiceInventoryManager.cpp:76] Error occurred during updating service address: (SELECTION_NO_HOST_IN_SMC) For service 'presto.tier.coordinator.https', no hosts available in SMC**

then it can clean up, as there is no point running any tasks, instead of using a timing threshold.

@spershin
Contributor Author

@amitkdutta

I'm not sure this would work properly. The coordinator can restart quickly and come back to normal, and depending on the timeout we wouldn't notice it. Or there's some network hiccup, and we consider the coordinator lost if the timeout is small.

It seems safer to do it on a per-task basis: if a task hasn't been accessed by the coordinator for some time (for example, oldTaskCleanUpMs_), then it is safe to abort it.

@xiaoxmeng
Contributor

@tdcmeehan As I pointed out earlier, for us the issue is during restart: it delays it. I haven't had time to look up in the coordinator code why this would happen (nor am I very familiar with it). I suspect it should happen in Presto Java too, not just Native. We don't see this on Java, maybe because we don't shadow that extensively on Java (no such active development), and in prod we drain and then wait before restarting. I would dare to assume (without looking at the code) that the coordinator itself lacks graceful shutdown logic?

For coordinator crashes/OOMs: we aren't bothered by this; it never happens. However, if it did happen, then the tasks would live on the workers until shutdown, true. The drawback is a bit of wasted CPU and memory.

So, you think we should just do this check and abandon tasks all the time, not just on shutdown?

@spershin can we consider aligning with the Java worker on this, as @tdcmeehan suggested? We can do this in a follow-up; I feel it might be good to have.

xiaoxmeng previously approved these changes Apr 25, 2024
Contributor

@xiaoxmeng left a comment


@spershin LGTM. Is it possible to add a unit test for TaskManager shutdown to have better coverage? Thanks!

@@ -163,6 +163,10 @@ class TaskManager {
"concurrent_lifespans_per_task"};
static constexpr folly::StringPiece kSessionTimezone{"session_timezone"};

// This is called during shutting down. We terminate all running tasks which
// haven't been pinged by coordinator for a considerable time.
xiaoxmeng
Contributor

nit: haven't received heartbeats from coordinator ...

spershin
Contributor Author

@xiaoxmeng

nit: haven't received heartbeats from coordinator ...

Actually, this is not what's happening.
We don't get heartbeats from the coordinator.
The coordinator asks us all sorts of things about the task, and we update what we call the 'heartbeat' time.
So it boils down to a 'coordinator forgot about the task' event.
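
In code terms, the mechanism amounts to something like this (again with illustrative names): every coordinator request about a task refreshes a per-task timestamp, and a stale timestamp means the coordinator forgot about the task.

```cpp
#include <chrono>

// Illustrative: any coordinator request about a task (get info, get
// status, update, ...) refreshes the task's "heartbeat" time.
struct TaskHeartbeat {
  std::chrono::steady_clock::time_point last =
      std::chrono::steady_clock::now();

  // Called on every coordinator access to the task.
  void touch() { last = std::chrono::steady_clock::now(); }

  // True if the coordinator hasn't asked about the task within 'timeout':
  // the "coordinator forgot about the task" event.
  bool abandoned(std::chrono::milliseconds timeout) const {
    return std::chrono::steady_clock::now() - last >= timeout;
  }
};
```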

Request cancellation for tasks which haven't been
accessed by coordinator for a considerable time.
@spershin force-pushed the KillOrphanedTasksDuringShutdown branch from 3862c24 to 503fb01 on April 30, 2024 03:03
@spershin changed the title from "[native] Cancelling abandoned tasks during shutdown." to "[native] Add functionality to cancel abandoned tasks." on Apr 30, 2024
xiaoxmeng
Contributor

@xiaoxmeng left a comment

@spershin thanks!

@spershin merged commit affd83e into prestodb:master Apr 30, 2024
59 checks passed