[native] Add functionality to cancel abandoned tasks. #21779
Conversation
Shutdown was good 2 weeks ago. Should we identify why this is the case before we kill the signal?
Shutdown always suffered from long running queries from time to time.
Here is a log snippet from shutdown
So if the worker receives the SHUTDOWN signal and it can't discover the coordinator anymore, it's completely safe to abandon anything running; there is no point continuing. It can be better than using
IMO, we should follow the Java worker logic, which takes this a step further and cancels all abandoned tasks on a timer: presto/presto-main/src/main/java/com/facebook/presto/execution/SqlTaskManager.java, line 516 in ed2a648.
Calls to get the task info and task status are heartbeats; basically, if we stop heartbeating, we should kill the task no matter what.
Are you saying we should fail abandoned queries even before shutting down? I'm a little concerned about that. A few network hiccups and we might be failing legitimate queries that are still running. This PR solves a particular problem: during shutdown the worker does not get signals from the coordinator that tasks are no longer needed. Because of this we wait until the container kills us (timeout). This makes restarts and pushes take too much time. Also, the code snippet you referenced is likely running on the coordinator, which cancels queries abandoned by the client. Let me know if I misunderstood.
Almost like that, except we cancel tasks not because we don't see the coordinator, but because the coordinator does not ask about the tasks anymore (abandoned, because it restarted and does not know about them).
Not sure I understand this.
@spershin you make sure the timeout interval is something that can tolerate hiccups. The code above is on the worker. IMO, this code should handle these different problems in the same way:
If you don't handle them in the same way, then what's the plan for #2?
@tdcmeehan For coordinator crashes/OOMs: we aren't bothered by this, it never happens. However, if it did happen, then the tasks would live on workers until shutdown, true. The drawback is a bit of wasted CPU and memory. So, you think we should just do this check and abandon tasks all the time, not just on shutdown?
What I meant is: once the worker sees the coordinator is gone, it can clean up, since there is no point running any tasks, instead of using a timing threshold.
I'm not sure this will work properly. The coordinator can restart quickly and come back to normal, and we wouldn't notice it, depending on the timeout. Or there could be some network hiccup and we would consider it lost if the timeout is small. It seems safer to do it on a per-task basis: if a task hasn't been accessed by the coordinator for some time (for example oldTaskCleanUpMs_), then it is safe to abort it.
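To make the per-task idea concrete, here is a minimal sketch (not the actual Prestissimo code) of the check described above: a task counts as abandoned only when the coordinator has not touched it for longer than a timeout, so a short network hiccup does not fail a legitimate query. The names TaskEntry, lastAccessMs, and cancelAbandonedTasks are hypothetical placeholders; only the oldTaskCleanUpMs timeout idea comes from the discussion.

#include <chrono>
#include <memory>
#include <string>
#include <unordered_map>

struct TaskEntry {
  long lastAccessMs{0};  // refreshed every time the coordinator asks about the task
  bool running{true};
  void requestCancel() { running = false; }  // stand-in for real task cancellation
};

long nowMs() {
  using namespace std::chrono;
  return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

// Cancel only tasks the coordinator has clearly abandoned; recently accessed
// tasks keep running.
void cancelAbandonedTasks(
    std::unordered_map<std::string, std::shared_ptr<TaskEntry>>& tasks,
    long oldTaskCleanUpMs) {
  const long cutoff = nowMs() - oldTaskCleanUpMs;
  for (auto& [taskId, task] : tasks) {
    if (task->running && task->lastAccessMs < cutoff) {
      task->requestCancel();
    }
  }
}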
@spershin can we consider aligning with the Java worker on this, as @tdcmeehan suggested? We can do this in a follow-up; I feel it would be good to have.
@spershin LGTM. Is it possible to add a unit test for shutdown in TaskManager for better coverage? Thanks!
@@ -163,6 +163,10 @@ class TaskManager {
      "concurrent_lifespans_per_task"};
  static constexpr folly::StringPiece kSessionTimezone{"session_timezone"};

  // This is called during shutting down. We terminate all running tasks which
  // haven't been pinged by coordinator for a considerable time.
nit: haven't received heartbeats from coordinator ...
Actually, this is not what's happening.
We don't get heartbeats from the coordinator.
The coordinator asks us all sorts of things about the task, and we update what we call the 'heartbeat' time.
So it boils down to a 'coordinator forgot about the task' event.
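A hypothetical sketch of what that means in code: there is no explicit heartbeat message; any coordinator request about a task refreshes a timestamp, and "abandoned" just means that timestamp got too old. The names below are illustrative, not the real TaskManager API.

struct TaskHeartbeat {
  long lastHeartbeatMs{0};

  // Called from every handler the coordinator hits for this task
  // (task status, task info, results, etc.).
  void touch(long nowMs) { lastHeartbeatMs = nowMs; }

  // True once the coordinator has effectively forgotten about the task.
  bool isAbandoned(long nowMs, long timeoutMs) const {
    return nowMs - lastHeartbeatMs > timeoutMs;
  }
};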
Request cancellation for tasks which haven't been accessed by coordinator for a considerable time.
3862c24 to 503fb01
@spershin thanks!
Description
During restarts and cluster updates we noticed that the coordinator restarts quickly,
while workers take a long time because they are waiting for the running Tasks to finish.
The solution is to cancel the Tasks that are still running but haven't
been accessed by the coordinator for a considerable time.
We leverage the same timeout we use for cleaning up the old Tasks.
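As a purely illustrative example of the timeout reuse described above (the 60-second value and the shouldCancelOnShutdown helper are made up, not the production defaults): with a cleanup timeout of 60 seconds, a task the coordinator touched 5 seconds ago survives shutdown, while one last touched 3 minutes ago gets cancelled.

#include <cstdio>

bool shouldCancelOnShutdown(long msSinceLastCoordinatorAccess, long oldTaskCleanUpMs) {
  return msSinceLastCoordinatorAccess > oldTaskCleanUpMs;
}

int main() {
  const long oldTaskCleanUpMs = 60000;  // example value only
  std::printf("recently accessed task: cancel=%d\n",
              shouldCancelOnShutdown(5000, oldTaskCleanUpMs));    // prints 0
  std::printf("abandoned task:         cancel=%d\n",
              shouldCancelOnShutdown(180000, oldTaskCleanUpMs));  // prints 1
  return 0;
}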
Motivation and Context
To speed up cluster restarts and updates, when queries have clearly failed already.