-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
contrib/mixin/mixin.libsonnet: Adjust gRPC failed requests #13127
Conversation
OK is not the only one that is allowed, this before also captured context canceled, NotFound, and other non error requests.
@hexfusion please have a look, thanks! |
@@ -88,7 +88,7 @@ | |||
{ | |||
alert: 'etcdHighNumberOfFailedGRPCRequests', | |||
expr: ||| | |||
100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code!="OK"}[5m])) without (grpc_type, grpc_code) | |||
100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be easier to maintain list of failed codes or successful ones? I lean towards keeping the list of successful codes, as it better handles case when new code is introduced. It assumes that new codes are errors, so it will alert by default which is better than assuming that new codes are ok and not noticing problem. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure can do that, I get your point, just not sure when we would be adding new codes, as these are the gRPC standard status codes listed here https://grpc.github.io/grpc/core/md_doc_statuscodes.html.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serathius let me know what you think, happy to reverse the order if you think it still makes sense here, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok as it doesn't look there will be any difference in cost of maintaining this list up to date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's old, but haven't we left out codes that actually weren't creating the noise? E.g. in K8s, only the 'Canceled' code was really creating noise for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the 'Canceled' code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, only 'Canceled' is giving the noise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 this cleans up a fair amount of noise, thanks @lilic. Regarding new codes we have not seen an addition to the list since the initial commit[1] so I think this approach is fine.
etcd-io/etcd#13127 Let me try to summarize the issue: The problem of the expression comes from the grpc_code!="OK" part: it treats other non-error codes (canceled, invalidArgument, notFound, and etc) as errors so the expression reports a high rate of errors thus triggers too many alerts (see GRPC standard here). The above problem in the expression has been fixed in the upstream etcd repo: etcd-io/etcd#13127 Signed-off-by: svparamzin <[email protected]>
OK
gRPC code is not the only one that is successful, this also capturedcontext canceled, NotFound, and other non-error requests codes.
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.