Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add commandRunnerCheck to healthcheck detail #6346

Conversation

stevenpyzhang
Copy link
Member

Description

#6308
Exposes command runner health in the healthcheck endpoint

Testing done

Issued a request to healthcheck endpoint
Unit test

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@stevenpyzhang stevenpyzhang requested a review from a team as a code owner October 1, 2020 18:29
Copy link
Contributor

@vcrfxia vcrfxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @stevenpyzhang -- LGTM with a question inline!

public HealthCheckResponseDetail check(final HealthCheckAgent healthCheckAgent) {
return new HealthCheckResponseDetail(
healthCheckAgent.commandRunner.checkCommandRunnerStatus()
== CommandRunner.CommandRunnerStatus.RUNNING);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we understand what causes the command runner to transition into ERROR state in practice? From looking at the code, it looks like this happens if the command runner has been unable to poll for 15 seconds (by default). Do we have a sense of how often this happens in practice, and why? It'd be good to have some assurance that this is rare, and that when it happens it singles an actual issue rather than something spurious, else failing the healthcheck because of it would not provide useful signal to users.

BTW, how was the 15 second default time limit chosen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 15 seconds is kind of arbitrary. A normal poll of the command topic by default is 5 seconds so just 3x that was the rationale.

So there's two conditions that can lead to returning an ERROR, if there's currently a command being processed or not.

If there's no command being processed, we check to see if the command runner has been unable to poll for 15 seconds. If it hasn't, it means that the CommandRunner thread is most likely dead so the server would still be running without a CommandRunner thread, which is an unhealthy state (this is if the UncaughtExceptionHandler isn't enabled).

If there's a command being processed, and it's been processing for longer than ksql.server.command.blocked.threshold.error.ms that's also an unhealthy state for the server since it's probably stuck executing a command (this would most likely be a bug in the code and this wouldn't eventually finish executing on its own) and isn't making progress on the command topic

@stevenpyzhang stevenpyzhang merged commit 5f64d05 into confluentinc:master Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants