fix: CommandRunner metric has correct metric displayed when thread dies #4653

Merged

Conversation

@stevenpyzhang stevenpyzhang commented Feb 27, 2020

Description

Fixes: #4652
Add a field to CommandRunner that records the last time the command topic was polled. When CommandRunner.checkCommandRunnerStatus is called, if too much time has elapsed since that last poll, the thread is reported as being in the ERROR state.
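
A minimal sketch of the check described above, assuming a simplified stand-in class. `PollHealthCheck`, `recordPoll`, and `checkStatus` are hypothetical names used for illustration and are not the actual CommandRunner API; the timeout is passed in rather than derived from any real configuration.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for the health check; not the real CommandRunner.
class PollHealthCheck {
  enum Status { RUNNING, ERROR }

  private final Clock clock;
  private final Duration timeout;
  // Written by the polling thread, read by whichever thread reports the metric.
  private final AtomicReference<Instant> lastPollTime = new AtomicReference<>();

  PollHealthCheck(Clock clock, Duration timeout) {
    this.clock = clock;
    this.timeout = timeout;
  }

  // Called by the polling loop each time it polls the command topic.
  void recordPoll() {
    lastPollTime.set(clock.instant());
  }

  // Called by the metric reporter; never blocks on the polling thread.
  Status checkStatus() {
    Instant last = lastPollTime.get();
    if (last == null) {
      // No poll has happened yet, for example while still starting up.
      return Status.RUNNING;
    }
    Duration sinceLastPoll = Duration.between(last, clock.instant());
    return sinceLastPoll.compareTo(timeout) < 0 ? Status.RUNNING : Status.ERROR;
  }
}
```

Injecting a `Clock` keeps the check testable without sleeping: a test can advance a fake clock past the timeout and assert that the status flips to ERROR.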

Testing done

Killed the command runner thread and watched the metric. It transitioned to error state after 15 seconds.

Reviewer checklist

  • Ensure docs are updated if necessary (e.g., if a user-visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@stevenpyzhang stevenpyzhang requested a review from a team as a code owner February 27, 2020 00:06
@stevenpyzhang stevenpyzhang changed the title from "fix: CommandRunner metric correctly reports status of thread if it's dead" to "fix: CommandRunner metric has correct metric displayed when thread dies" Feb 27, 2020
@stevenpyzhang stevenpyzhang force-pushed the fix-command-runner-metric branch from 3a80fc6 to 6d5be45 Compare February 27, 2020 00:09
@stevenpyzhang stevenpyzhang requested review from rodesai and aeroforero and removed request for aeroforero February 27, 2020 00:12
Comment on lines 268 to 272
  if (currentCommand == null) {
-   return CommandRunnerStatus.RUNNING;
+   return lastPollTime.get() == null ||
+       Duration.between(lastPollTime.get(), clock.instant()).toMillis()
+       < NEW_CMDS_TIMEOUT.toMillis() * 3
+       ? CommandRunnerStatus.RUNNING : CommandRunnerStatus.ERROR;
Contributor

Can we just get rid of this and set the status to ERROR if we haven't polled in a while?

Member Author

We only start polling after processPriorCommands completes, so this check could still be useful if the CommandRunner gets stuck on startup.

@stevenpyzhang stevenpyzhang force-pushed the fix-command-runner-metric branch from 6d5be45 to 9251d03 Compare February 27, 2020 20:40
@stevenpyzhang stevenpyzhang force-pushed the fix-command-runner-metric branch 5 times, most recently from aef4638 to 310350d Compare March 2, 2020 21:51
private final Duration commandRunnerHealthTimeout;
private final Clock clock;

public enum CommandRunnerStatus {
  RUNNING,
  STUCK,
Contributor

Why do we need this additional status? Once it's here, we have to support it forever. Why is it not enough to know that the runner is up or down?

Member Author

I thought it would be helpful to be able to distinguish between the CommandRunner thread being stuck on a particular Command and the thread itself having died.

If we force the server to shut down whenever the CommandRunner thread dies, then this additional status won't be needed.
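
To make the distinction concrete, here is a rough sketch of how a three-state status could be derived. Everything in it is an assumption for illustration: `RunnerStatusClassifier`, its `classify` method, and its parameters are hypothetical and do not reflect the code in this PR.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical classifier separating "thread died" from "thread stuck on a command".
class RunnerStatusClassifier {
  enum Status { RUNNING, STUCK, ERROR }

  // commandStartedAt is null when no command is currently being executed.
  static Status classify(
      boolean threadAlive,
      Instant commandStartedAt,
      Instant now,
      Duration healthTimeout) {
    if (!threadAlive) {
      // The thread itself has died.
      return Status.ERROR;
    }
    if (commandStartedAt != null
        && Duration.between(commandStartedAt, now).compareTo(healthTimeout) >= 0) {
      // The thread is alive but has spent too long on a single command.
      return Status.STUCK;
    }
    return Status.RUNNING;
  }
}
```

If the server were shut down whenever the thread dies, the ERROR branch would never be observed by a running server, which is the case where the extra status stops being necessary.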

@stevenpyzhang stevenpyzhang force-pushed the fix-command-runner-metric branch from 310350d to b69ad12 Compare March 18, 2020 22:03
@stevenpyzhang stevenpyzhang requested a review from rodesai March 18, 2020 23:01
@stevenpyzhang stevenpyzhang force-pushed the fix-command-runner-metric branch from 5cb1af0 to 11c0849 Compare March 20, 2020 18:40
Contributor

@rodesai rodesai left a comment

LGTM

@stevenpyzhang stevenpyzhang merged commit 1db542b into confluentinc:master Mar 23, 2020
Development

Successfully merging this pull request may close these issues.

CommandRunner metric doesn't accurately report if the thread is running or not
2 participants