Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rolling Supervisor restarts at taskDuration #14396

Merged
merged 16 commits into from
Aug 7, 2023

Conversation

suneet-s
Copy link
Contributor

@suneet-s suneet-s commented Jun 8, 2023

Description

Currently, the number of slots needed for for a streaming supervisor to handover quickly is 2 * number of tasks needed for reading from the stream. This is because when the taskDuration expires, the previously reading tasks need to publish the segments, and if this takes time, and there are no available slots on the cluster, Druid will not be reading from the stream until a slot is available.

This change makes it so that during regular operations, tasks are rolled over one at a time, so that an operator can plan to have a capacity of number of tasks need to read from the stream + 1. For config changes to the supervisor, like taskCount, all tasks will need to stop, so an operator should factor this in to their capacity planning. If config changes are rare, this will make it so that you do not need as much free capacity in the cluster for stable operations.

Current behavior

Here are some screenshots of metrics from a test where we ran streaming ingestion in a cluster with a capacity of 24 slots. The was ingesting data from a kafka topic with 24 partitions using 10 tasks.
During this test, there was a batch ingestion task running that took 7 slots.

Screenshot 2023-06-21 at 1 40 32 PM

This screenshot shows 17 tasks being used from ~ 8am - 9:35am. At 9:35am, the batch task dies, and the task count goes down to 10. The screenshot shows spikes in used task slots at 7:54, 9:34 and from 8:54-56.

These spikes are when the taskDuration has expired and they need to rollover. This shows that the system needs to preserve capacity for the tasks to publish segments otherwise there is a risk of increasing kafka lag. The next screenshot shows kafka lag during this run at ~8:55, which shows that there were some partitions that could not get a slot, and so they experienced a spike in lag.

Screenshot 2023-06-21 at 1 50 38 PM

Behavior when setting stopTaskCount

Screenshot 2023-06-21 at 2 01 23 PM

Here we see the number of tasks rolling over at any point in time is capped, and that it takes ~ 10 minutes for all the tasks with an expired task duration to stop. Kafka lag metrics remain very low and there are no spikes when the taskDuration is hit.
Screenshot 2023-06-21 at 2 04 44 PM

Why not on by default / documented

This behavior of rolling a subset of tasks over is not on by default because it is not clear to me what a good default would be. Some defaults I've considered

  • Make it so that all tasks rollover in 10 minutes, so if you have 100 tasks in a supervisor, it would bounce 10 at a time.
  • Make it so that all tasks rollover within the taskDuration so that the max time for any task will be 2X taskDuration
  • Make it so that we always rollover 1 task at a time. This limits the upper capacity needed, but tasks may run for a lot longer than the taskDuration. Eg. with a taskDuration of 1 hour and 100 tasks, all taking 1 minute to publish segments, tasks would take ~ 160 minutes.

The first 2 rely on assumptions that the time it takes to publish segments will be predictable, which may not be the case, so we can't provide guarantees for capacity planning that the number of tasks you will need is X.
The 3rd option is nice for capacity planning, but could result in tasks holding on to locks for a much longer time than anticipated if the number of tasks in a supervisor is large.

The config stopTaskCount is not documented so that we do not need to support it in the future when we decide what the best behavior is and what it should be by default.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@suneet-s suneet-s changed the title Supervisors restart tasks one at a time WIP Supervisors restart tasks one at a time Jun 8, 2023
@abhishekagarwal87
Copy link
Contributor

It will be really nice capability as it will make task scheduling more seamless. But is there going to be any drawback of this change? If the tasks start in a staggered fashion, would that affect processing throughput in any way?

@suneet-s
Copy link
Contributor Author

But is there going to be any drawback of this change? If the tasks start in a staggered fashion, would that affect processing throughput in any way?

None that I can think of. The only thing I can imagine is if tasks are stuck in pending for such a long time, that it takes longer than the configured task handoff for all the tasks to cycle. But in that case, the tasks are still restarted from the earliest to latest, so I suspect everything should be ok.

Each kafka task is handling reading form the partitions independently, so this should have no impact on ingestion throughput

@suneet-s suneet-s changed the title WIP Supervisors restart tasks one at a time Rolling Supervisors restarts at taskDuration Jun 21, 2023
@suneet-s suneet-s changed the title Rolling Supervisors restarts at taskDuration Rolling Supervisor restarts at taskDuration Jun 22, 2023
Copy link
Contributor

@maytasm maytasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am +1 on not having this on by default (default would still be the current behavior) and having the config as number of tasks to roll at one time configurable per each supervisor). I am -1 on this PR as a whole due to the lack in unit testing for when the new config is used. I also have a few minor questions.

@suneet-s
Copy link
Contributor Author

I am +1 on not having this on by default (default would still be the current behavior) and having the config as number of tasks to roll at one time configurable per each supervisor). I am -1 on this PR as a whole due to the lack in unit testing for when the new config is used. I also have a few minor questions.

I looked into writing tests for this and struggled. SeekableStreamSupervisor is a huge abstract class, and the part that needs testing uses DateTime.now which isn't very testable, so I would need to refactor the class to use a Clock to simulate time passing by. It feels like trying to make this class testable introduces more risk. I even looked at adding an integration test, but it seemed like that would be flaky and take a long time to run as we would have to prove that there was no more than 1 task handing over at a time.

If there are any suggestions or pointers on how to write tests for this class, I'm open to trying it out. Otherwise, I think the fact that this is disabled by default and undocumented provides the safety net we need that it won't break for Druid users.

@suneet-s suneet-s merged commit b624a4e into apache:master Aug 7, 2023
@suneet-s suneet-s deleted the rolling-supervisors branch August 7, 2023 23:24
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants