Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts #691

Closed

Conversation

hasbro17
Copy link
Contributor

The cluster-etcd-operator readiness probe uses the etcdctl endpoint health cmd to check the health of the cluster

--endpoints=unixs://${NODE_NODE_ENVVAR_NAME_IP}:0 \
endpoint health -w json | grep \"health\":true

As part of the endpoint health command, etcdctl checks the active alarms with grpc calls to the Alarm method:
https://github.com/etcd-io/etcd/blob/f2bc5eee916de63ac525ce5a843e849f73921415/etcdctl/ctlv3/command/ep_command.go#L139

To eliminate noise from the etcdGRPCRequestsSlow alerts we should discount the effect of Alarm grpc calls which could end up being slow due to the repeated calls, and indicating this to the user is not useful.
This is only done downstream through our custom.libsonnet rules file for now.

Side note:
Depending on when you installed gojsontoyaml it may or may not wrap long lines in its output.
The way forward since v1.0.0 however seems to be non-wrapped lines so we should probably stick to that. Note you may want to run go get go get github.com/brancz/gojsontoyaml again if make generate results in wrapped lines for you.
See:

/cc @lilic @hexfusion

@openshift-ci openshift-ci bot requested review from hexfusion and lilic October 12, 2021 23:17
@openshift-ci openshift-ci bot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Oct 12, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 12, 2021

@hasbro17: This pull request references Bugzilla bug 2013243, which is invalid:

  • expected the bug to target the "4.10.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Oct 12, 2021
@hasbro17
Copy link
Contributor Author

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 12, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 12, 2021

@hasbro17: This pull request references Bugzilla bug 2013243, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @geliu2016

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from geliu2016 October 12, 2021 23:18
@hasbro17
Copy link
Contributor Author

/cherry-pick release-4.9

@openshift-cherrypick-robot

@hasbro17: once the present PR merges, I will cherry-pick it on top of release-4.9 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@lilic lilic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

/hold

in case we want to add check for gojsontoyaml check in this PR. Feel free to remove hold if we want to do that in a separate one.

description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
leading to 50% increase in database size over the past four hours on etcd
instance {{ $labels.instance }}, please check as it might be disruptive.'
description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes leading to 50% increase in database size over the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way forward since v1.0.0 however seems to be non-wrapped lines so we should probably stick to that.

Yes, I think this formatting changed was due by @hexfusion maybe having an outdated gojsontoyaml install during recent changes? Maybe we can add another check for this here https://github.com/openshift/cluster-etcd-operator/blob/master/hack/generate.sh#L5?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I will keep that in mind but we should bake it into the script.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 13, 2021
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 13, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, lilic

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 13, 2021
@lilic
Copy link
Contributor

lilic commented Oct 13, 2021

/retest

@hexfusion
Copy link
Contributor

I have been thinking about this one a bit. While Alarm is not really an expected result of a health check the fact that it is slow should be the same as any RPC thus part of the alert? With defrag it made sense to omit because the duration of the request was not inline with the thresholds of the alert. But do we care so much which RPC is slow vs in general slow given there is no explicit reason for the slow request?

In fact this helps us to identify Alarm as strange in the steady state.

@hasbro17
Copy link
Contributor Author

Closing this for now as we're not fully on board with specifically excluding Alarm from slow grpc requests given that it's still an indication that RPC calls in general are slow.

@hasbro17 hasbro17 closed this Oct 26, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 26, 2021

@hasbro17: This pull request references Bugzilla bug 2013243. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

In response to this:

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants