Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts #691

hasbro17 · 2021-10-12T23:17:16Z

The cluster-etcd-operator readiness probe uses the etcdctl endpoint health cmd to check the health of the cluster

cluster-etcd-operator/bindata/etcd/pod.yaml

Lines 210 to 211 in ea4c530

    
                       --endpoints=unixs://${NODE_NODE_ENVVAR_NAME_IP}:0 \ 
        
                       endpoint health -w json | grep \"health\":true

As part of the endpoint health command, etcdctl checks the active alarms with grpc calls to the Alarm method:
https://github.com/etcd-io/etcd/blob/f2bc5eee916de63ac525ce5a843e849f73921415/etcdctl/ctlv3/command/ep_command.go#L139

To eliminate noise from the etcdGRPCRequestsSlow alerts we should discount the effect of Alarm grpc calls which could end up being slow due to the repeated calls, and indicating this to the user is not useful.
This is only done downstream through our custom.libsonnet rules file for now.

Side note:
Depending on when you installed gojsontoyaml it may or may not wrap long lines in its output.
The way forward since v1.0.0 however seems to be non-wrapped lines so we should probably stick to that. Note you may want to run go get go get github.com/brancz/gojsontoyaml again if make generate results in wrapped lines for you.
See:

/cc @lilic @hexfusion

openshift-ci · 2021-10-12T23:17:20Z

@hasbro17: This pull request references Bugzilla bug 2013243, which is invalid:

expected the bug to target the "4.10.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hasbro17 · 2021-10-12T23:17:58Z

/bugzilla refresh

openshift-ci · 2021-10-12T23:18:05Z

@hasbro17: This pull request references Bugzilla bug 2013243, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.10.0) matches configured target release for branch (4.10.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @geliu2016

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hasbro17 · 2021-10-12T23:24:39Z

/cherry-pick release-4.9

openshift-cherrypick-robot · 2021-10-12T23:24:40Z

@hasbro17: once the present PR merges, I will cherry-pick it on top of release-4.9 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lilic

/lgtm
/approve

/hold

in case we want to add check for gojsontoyaml check in this PR. Feel free to remove hold if we want to do that in a separate one.

lilic · 2021-10-13T08:20:46Z

manifests/0000_90_etcd-operator_03_prometheusrule.yaml

-        description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
-          leading to 50% increase in database size over the past four hours on etcd
-          instance {{ $labels.instance }}, please check as it might be disruptive.'
+        description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes leading to 50% increase in database size over the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.'


the way forward since v1.0.0 however seems to be non-wrapped lines so we should probably stick to that.

Yes, I think this formatting changed was due by @hexfusion maybe having an outdated gojsontoyaml install during recent changes? Maybe we can add another check for this here https://github.com/openshift/cluster-etcd-operator/blob/master/hack/generate.sh#L5?

yeah I will keep that in mind but we should bake it into the script.

openshift-ci · 2021-10-13T08:31:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, lilic

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [lilic]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

lilic · 2021-10-13T08:31:35Z

/retest

hexfusion · 2021-10-13T13:46:00Z

I have been thinking about this one a bit. While Alarm is not really an expected result of a health check the fact that it is slow should be the same as any RPC thus part of the alert? With defrag it made sense to omit because the duration of the request was not inline with the thresholds of the alert. But do we care so much which RPC is slow vs in general slow given there is no explicit reason for the slow request?

In fact this helps us to identify Alarm as strange in the steady state.

hasbro17 · 2021-10-26T22:16:59Z

Closing this for now as we're not fully on board with specifically excluding Alarm from slow grpc requests given that it's still an indication that RPC calls in general are slow.

openshift-ci · 2021-10-26T22:17:07Z

@hasbro17: This pull request references Bugzilla bug 2013243. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

In response to this:

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts

21e7878

openshift-ci bot requested review from hexfusion and lilic October 12, 2021 23:17

openshift-ci bot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Oct 12, 2021

openshift-ci bot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Oct 12, 2021

openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 12, 2021

openshift-ci bot requested a review from geliu2016 October 12, 2021 23:18

lilic reviewed Oct 13, 2021

View reviewed changes

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 13, 2021

openshift-ci bot assigned lilic Oct 13, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 13, 2021

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 13, 2021

hasbro17 closed this Oct 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts #691

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts #691

hasbro17 commented Oct 12, 2021

openshift-ci bot commented Oct 12, 2021

hasbro17 commented Oct 12, 2021

openshift-ci bot commented Oct 12, 2021

hasbro17 commented Oct 12, 2021

openshift-cherrypick-robot commented Oct 12, 2021

lilic left a comment

lilic Oct 13, 2021

hexfusion Oct 13, 2021

openshift-ci bot commented Oct 13, 2021

lilic commented Oct 13, 2021

hexfusion commented Oct 13, 2021

hasbro17 commented Oct 26, 2021

openshift-ci bot commented Oct 26, 2021

	--endpoints=unixs://${NODE_NODE_ENVVAR_NAME_IP}:0 \
	endpoint health -w json \| grep \"health\":true

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts #691

Bug 2013243: jssonet: ignore Alarm method in etcdGRPCRequestsSlow alerts #691

Conversation

hasbro17 commented Oct 12, 2021

openshift-ci bot commented Oct 12, 2021

hasbro17 commented Oct 12, 2021

openshift-ci bot commented Oct 12, 2021

hasbro17 commented Oct 12, 2021

openshift-cherrypick-robot commented Oct 12, 2021

lilic left a comment

Choose a reason for hiding this comment

lilic Oct 13, 2021

Choose a reason for hiding this comment

hexfusion Oct 13, 2021

Choose a reason for hiding this comment

openshift-ci bot commented Oct 13, 2021

lilic commented Oct 13, 2021

hexfusion commented Oct 13, 2021

hasbro17 commented Oct 26, 2021

openshift-ci bot commented Oct 26, 2021