feat: add commandRunnerCheck to healthcheck detail #6346

stevenpyzhang · 2020-10-01T18:29:04Z

Description

#6308
Exposes command runner health in the healthcheck endpoint

Testing done

Issued a request to the healthcheck endpoint
Unit test

Reviewer checklist

Ensure docs are updated if necessary (e.g., if a user-visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

vcrfxia

Thanks @stevenpyzhang -- LGTM with a question inline!

vcrfxia · 2020-10-02T13:12:04Z

ksqldb-rest-app/src/main/java/io/confluent/ksql/rest/healthcheck/HealthCheckAgent.java

+    public HealthCheckResponseDetail check(final HealthCheckAgent healthCheckAgent) {
+      return new HealthCheckResponseDetail(
+          healthCheckAgent.commandRunner.checkCommandRunnerStatus()
+              == CommandRunner.CommandRunnerStatus.RUNNING);


Do we understand what causes the command runner to transition into ERROR state in practice? From looking at the code, it looks like this happens if the command runner has been unable to poll for 15 seconds (by default). Do we have a sense of how often this happens, and why? It'd be good to have some assurance that this is rare, and that when it happens it signals an actual issue rather than something spurious. Otherwise, failing the healthcheck because of it would not provide a useful signal to users.

BTW, how was the 15 second default time limit chosen?

The 15 seconds is kind of arbitrary. A normal poll of the command topic is 5 seconds by default, so the rationale was just 3x that.

So there are two cases that can lead to returning ERROR, depending on whether a command is currently being processed.

If there's no command being processed, we check whether the command runner has gone 15 seconds without polling. If it has, the CommandRunner thread is most likely dead, so the server would still be running without a CommandRunner thread, which is an unhealthy state (this applies if the UncaughtExceptionHandler isn't enabled).

If there's a command being processed and it's been processing for longer than ksql.server.command.blocked.threshold.error.ms, that's also an unhealthy state for the server: it's probably stuck executing a command and isn't making progress on the command topic. This would most likely be a bug in the code, and the command wouldn't eventually finish executing on its own.
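
To make the two cases concrete, here is a minimal, self-contained sketch of the decision as described in this thread. It is not the actual CommandRunner code: the class name, method name, and parameters (`CommandRunnerStatusSketch`, `checkStatus`, `pollTimeout`, `blockedThreshold`) are hypothetical. Whether the idle-poll limit and the blocked-command limit come from the same config is not stated here, so the sketch takes them as separate arguments.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Illustrative only: hypothetical names, not the real CommandRunner implementation.
public class CommandRunnerStatusSketch {

  enum Status { RUNNING, ERROR }

  /**
   * @param currentCommandStart start time of the command being processed, or empty if idle
   * @param lastPollTime        last time the command topic was polled
   * @param now                 current time (passed in so the logic is easy to test)
   * @param pollTimeout         assumed limit on time without a poll while idle (e.g. 15s)
   * @param blockedThreshold    assumed limit on time spent on a single command
   */
  static Status checkStatus(
      final Optional<Instant> currentCommandStart,
      final Instant lastPollTime,
      final Instant now,
      final Duration pollTimeout,
      final Duration blockedThreshold) {
    if (currentCommandStart.isPresent()) {
      // Case 2: a command is in flight; flag ERROR if it has run past the threshold.
      final Duration processing = Duration.between(currentCommandStart.get(), now);
      return processing.compareTo(blockedThreshold) > 0 ? Status.ERROR : Status.RUNNING;
    }
    // Case 1: idle; flag ERROR if the runner has not polled within the timeout.
    final Duration sinceLastPoll = Duration.between(lastPollTime, now);
    return sinceLastPoll.compareTo(pollTimeout) > 0 ? Status.ERROR : Status.RUNNING;
  }

  public static void main(final String[] args) {
    final Instant now = Instant.parse("2020-10-02T13:00:00Z");
    final Duration limit = Duration.ofSeconds(15);

    // Idle, last poll 20 seconds ago: the runner thread is presumed dead.
    System.out.println(
        checkStatus(Optional.empty(), now.minusSeconds(20), now, limit, limit));

    // Command in flight for 10 seconds: still within the threshold.
    System.out.println(
        checkStatus(Optional.of(now.minusSeconds(10)), now.minusSeconds(10), now, limit, limit));
  }
}
```

With this shape, the healthcheck detail in the diff above reduces to comparing the returned status against RUNNING.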

feat: add commandRunnerCheck to healthcheck detail

d51eff0

stevenpyzhang requested a review from a team as a code owner October 1, 2020 18:29

vcrfxia approved these changes Oct 2, 2020

stevenpyzhang merged commit 5f64d05 into confluentinc:master Oct 2, 2020
