-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for task limit on kill tasks spawned by auto kill coordinator duty #14769
Conversation
Previously, the `KillUnusedSegments` coordinator duty, in charge of periodically deleting unused segments, could spawn an unlimited number of kill tasks for unused segments. This change adds 2 new coordinator dynamic configs that can be used to control the limit of tasks spawned by this coordinator duty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM! Some tweaks to consider.
@@ -149,7 +173,7 @@ private void killUnusedSegments(Collection<String> dataSourcesToKill) | |||
} | |||
} | |||
|
|||
log.debug("Submitted kill tasks for [%d] datasources.", submittedTasks); | |||
log.debug("Submitted [%d] kill tasks for [%d] datasources.", submittedTasks, dataSourcesToKill.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if any datasources were skipped due to not enough slots, we could consider logging them here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will get this in a follow up patch
while (activeTasks.hasNext()) { | ||
final TaskStatusPlus status = activeTasks.next(); | ||
|
||
// taskType can be null if middleManagers are running with an older version. Here, we consevatively regard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is important enough to capture as a method-level javadoc hint. ie: this might overestimate in some circumstances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taskType can be null if middleManagers are running with an older version.
Do we know in which versio the middle managers would report the task type as null? I saw the same comment in the compact segments duty that was from mid 2018. So I wonder if this null check is that important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats exactly where I got it from 😆 . Just wanted to play it safe.
return numActiveKillTasks; | ||
} | ||
|
||
private int getKillTaskCapacity(double killTaskSlotRatio, int maxKillTaskSlots) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I'd split this to a getTotalWorkerCapacity
, to keep the kill calculation separate. I think testing would be cleaner this way too
It would be great to see a unit test that explicitly could test getKillTaskCapacity
based only on numeric inputs, so that you can test 0s, infinities, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed, and added tests. Let me know if better now.
public void testNoAvailableTaskCapacityForKill() | ||
{ | ||
mockNoAvailableTaskSlotsForKill(); | ||
runAndVerifyNoKill(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* address review comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement!
server/src/main/java/org/apache/druid/server/coordinator/CoordinatorDynamicConfig.java
Outdated
Show resolved
Hide resolved
@@ -495,6 +533,8 @@ private static class Defaults | |||
static final int MAX_NON_PRIMARY_REPLICANTS_TO_LOAD = Integer.MAX_VALUE; | |||
static final boolean USE_ROUND_ROBIN_ASSIGNMENT = true; | |||
static final boolean SMART_SEGMENT_LOADING = true; | |||
static final double KILL_TASK_SLOT_RATIO = 1.0; | |||
static final int MAX_KILL_TASK_SLOTS = Integer.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment here that these defaults are to preserve the behavior before Druid 0.28 and a future version may want to consider better defaults so that kill tasks can not eat up all the capacity in the cluster would be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added, thanks
if (0 == availableKillTaskSlots) { | ||
log.warn("Not killing any unused segments because there are no available kill task slots at this time."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a warning? It seems like it could be quite verbose if there are long running kill tasks and the KillUnusedSegments duty is scheduled to run frequently.
Perhaps, a better way of surfacing this information is emitting the number of kill tasks and calculated kill capacity as metrics so that operators can look at these metrics to decide if they need more kill capacity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to debug. 💯 about adding metrics, will do that in a follow up patch.
while (activeTasks.hasNext()) { | ||
final TaskStatusPlus status = activeTasks.next(); | ||
|
||
// taskType can be null if middleManagers are running with an older version. Here, we consevatively regard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taskType can be null if middleManagers are running with an older version.
Do we know in which versio the middle managers would report the task type as null? I saw the same comment in the compact segments duty that was from mid 2018. So I wonder if this null check is that important
return numActiveKillTasks; | ||
} | ||
|
||
private int getTotalWorkerCapacity() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like a lot of this code is very similar to the code in CompactSegments
Can we refactor the code so that the code can be re-used in both duties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes absolutely correct. Will refactor this in a follow up patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 Looking forward to the follow up patches!
Description
Previously, the
KillUnusedSegments
coordinator duty, in charge of periodically deleting unused segments, could spawn an unlimited number of kill tasks for unused segments. This change adds 2 new coordinator dynamic configs that can be used to control the limit of tasks spawned by this coordinator dutykillTaskSlotRatio
: Ratio of total available task slots, including autoscaling if applicable that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is 1, which allows all available tasks to be used, which is the existing behaviormaxKillTaskSlots
: Maximum number of tasks that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is INT.MAX, which essentially allows for unbounded number of tasks, which is the existing behavior.Realize that we can effectively get away with just the one
killTaskSlotRatio
, but following similarly to the compaction config, which has similar properties; I thought it was good to have some control of the upper limit regardless of ratio provided.Release note
NEW:
killTaskSlotRatio
andmaxKillTaskSlots
coordinator dynamic config properties added that allow control of task resource usage spawned byKillUnusedSegments
coordinator task (auto kill)This PR has: