Rolling Supervisor restarts at taskDuration #14396
Conversation
It will be a really nice capability, as it will make task scheduling more seamless. But is there going to be any drawback to this change? If the tasks start in a staggered fashion, would that affect processing throughput in any way?
None that I can think of. The only thing I can imagine is if tasks are stuck in pending for so long that it takes longer than the configured task handoff for all the tasks to cycle. But in that case, the tasks are still restarted from earliest to latest, so I suspect everything should be ok. Each Kafka task handles reading from its partitions independently, so this should have no impact on ingestion throughput.
I am +1 on not having this on by default (the default would still be the current behavior) and on making the number of tasks to roll at one time configurable per supervisor. I am -1 on this PR as a whole due to the lack of unit testing for when the new config is used. I also have a few minor questions.
I looked into writing tests for this and struggled. SeekableStreamSupervisor is a huge abstract class, and the part that needs testing is hard to exercise in isolation. If there are any suggestions or pointers on how to write tests for this class, I'm open to trying it out. Otherwise, I think the fact that this is disabled by default and undocumented provides the safety net we need so that it won't break for Druid users.
Description
Currently, the number of slots needed for a streaming supervisor to hand over quickly is 2 * the number of tasks needed for reading from the stream. This is because when the taskDuration expires, the previously reading tasks need to publish their segments, and if this takes time and there are no available slots on the cluster, Druid will not be reading from the stream until a slot frees up.
This change makes it so that during regular operations, tasks are rolled over one at a time, so that an operator can plan for a capacity of (number of tasks needed to read from the stream) + 1. For example, with 10 reading tasks, the current behavior can require up to 20 slots during handoff, while rolling tasks over one at a time requires only 11. For config changes to the supervisor, like taskCount, all tasks will still need to stop, so an operator should factor this into their capacity planning. If config changes are rare, this means you do not need as much free capacity in the cluster for stable operations.
Current behavior
Here are some screenshots of metrics from a test where we ran streaming ingestion in a cluster with a capacity of 24 slots. The supervisor was ingesting data from a Kafka topic with 24 partitions using 10 tasks.
During this test, there was a batch ingestion task running that took 7 slots.
This screenshot shows 17 task slots being used from ~8am to 9:35am. At 9:35am, the batch task dies, and the used task count drops to 10. The screenshot shows spikes in used task slots at 7:54, 9:34, and from 8:54-8:56.
These spikes occur when the taskDuration has expired and the tasks need to roll over. This shows that the system needs to reserve capacity for the tasks to publish segments; otherwise there is a risk of increasing Kafka lag. The next screenshot shows Kafka lag during this run at ~8:55, where some partitions could not get a slot and so experienced a spike in lag.
Behavior when setting stopTaskCount
Here we see that the number of tasks rolling over at any point in time is capped, and that it takes ~10 minutes for all the tasks with an expired taskDuration to stop. Kafka lag metrics remain very low, and there are no spikes when the taskDuration is hit.
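For reference, here is a minimal sketch of what a Kafka supervisor spec exercising this might look like. The placement of stopTaskCount inside ioConfig, and the surrounding field values (topic name, counts, duration), are my assumptions for illustration, since the config is intentionally left undocumented by this PR:

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "example-topic",
      "taskCount": 10,
      "taskDuration": "PT1H",
      "stopTaskCount": 2
    }
  }
}
```

With a setting like this, only stopTaskCount tasks are asked to stop at a time when their taskDuration expires, so the extra capacity needed for handoff is bounded by stopTaskCount rather than by taskCount.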
Why not on by default / documented
This behavior of rolling over a subset of tasks is not on by default because it is not clear to me what a good default would be. Some defaults I've considered:
The first 2 rely on the assumption that the time it takes to publish segments will be predictable, which may not be the case, so we can't provide guarantees for capacity planning that the number of tasks you will need is X.
The 3rd option is nice for capacity planning, but could result in tasks holding on to locks for a much longer time than anticipated if the number of tasks in a supervisor is large.
The config stopTaskCount is not documented so that we do not need to support it in the future once we decide what the best behavior is and what it should be by default.