contrib/mixin/mixin.libsonnet: Adjust gRPC failed requests #13127

Merged 1 commit on Jun 23, 2021
18 changes: 9 additions & 9 deletions contrib/mixin/mixin.libsonnet
@@ -33,7 +33,7 @@
)
)
> 0
- ||| % {etcd_instance_labels: $._config.etcd_instance_labels, etcd_selector: $._config.etcd_selector, network_failure_range: $._config.scrape_interval_seconds*4},
+ ||| % { etcd_instance_labels: $._config.etcd_instance_labels, etcd_selector: $._config.etcd_selector, network_failure_range: $._config.scrape_interval_seconds * 4 },
'for': '10m',
labels: {
severity: 'critical',
@@ -88,7 +88,7 @@
{
alert: 'etcdHighNumberOfFailedGRPCRequests',
expr: |||
- 100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code!="OK"}[5m])) without (grpc_type, grpc_code)
+ 100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
Member

Which would be easier to maintain: a list of failed codes or a list of successful ones? I lean towards keeping the list of successful codes, as it better handles the case when a new code is introduced. It assumes that new codes are errors, so it will alert by default, which is better than assuming that new codes are OK and not noticing a problem. WDYT?

Contributor Author

Sure, I can do that. I get your point; I'm just not sure when new codes would be added, as these are the standard gRPC status codes listed here: https://grpc.github.io/grpc/core/md_doc_statuscodes.html

Contributor Author

@serathius let me know what you think, happy to reverse the order if you think it still makes sense here, thanks!

Member

I think it's OK, as it doesn't look like there will be any difference in the cost of keeping this list up to date.

I know this is old, but haven't we also left out codes that weren't actually creating noise? E.g. in K8s, only the 'Canceled' code was really creating noise for me.

+1 on the 'Canceled' code

Same here, only 'Canceled' is creating noise.
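
To make the trade-off in this thread concrete, here is a rough Python sketch of which status codes each matcher counts as failed. It assumes the `grpc_code` label uses the Go-style code names (for example `Canceled` with a single "l"), and it relies on PromQL regex matchers being fully anchored, which `re.fullmatch` imitates. The `hypothetical_deny_matcher` (excluding only `OK` and `Canceled`) is not part of this PR; it only illustrates the variant hinted at in the later comments.

```python
import re

# gRPC status names as they are assumed to appear in the grpc_code label.
GRPC_CODES = [
    "OK", "Canceled", "Unknown", "InvalidArgument", "DeadlineExceeded",
    "NotFound", "AlreadyExists", "PermissionDenied", "ResourceExhausted",
    "FailedPrecondition", "Aborted", "OutOfRange", "Unimplemented",
    "Internal", "Unavailable", "DataLoss", "Unauthenticated",
]

# Pattern copied from the new expression in this PR.
NEW_PATTERN = (
    "Unknown|FailedPrecondition|ResourceExhausted|Internal|"
    "Unavailable|DataLoss|DeadlineExceeded"
)


def old_matcher(code: str) -> bool:
    """grpc_code!="OK": every code except OK counts as a failure."""
    return code != "OK"


def new_matcher(code: str) -> bool:
    """grpc_code=~"<pattern>": only the listed codes count as failures."""
    return re.fullmatch(NEW_PATTERN, code) is not None


def hypothetical_deny_matcher(code: str) -> bool:
    """grpc_code!~"OK|Canceled": illustrative variant, NOT part of this PR."""
    return re.fullmatch("OK|Canceled", code) is None


# Codes that used to count as failures but no longer do after this PR.
no_longer_counted = [c for c in GRPC_CODES if old_matcher(c) and not new_matcher(c)]
# Codes that still count as failures after this PR.
still_counted = [c for c in GRPC_CODES if new_matcher(c)]

if __name__ == "__main__":
    print("no longer counted:", no_longer_counted)
    print("still counted:", still_counted)
    # A code name that is not in the list (made up here) is ignored by the
    # new matcher but would be counted by a deny-list style matcher.
    print(new_matcher("SomeFutureCode"), hypothetical_deny_matcher("SomeFutureCode"))
```

Under these assumptions the new matcher drops nine codes, including `Canceled`, while a deny-list would keep alerting on any code name it does not recognize, which is the behaviour the first review comment argues for.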

/
sum(rate(grpc_server_handled_total{%(etcd_selector)s}[5m])) without (grpc_type, grpc_code)
> 1
@@ -105,7 +105,7 @@
{
alert: 'etcdHighNumberOfFailedGRPCRequests',
expr: |||
- 100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code!="OK"}[5m])) without (grpc_type, grpc_code)
+ 100 * sum(rate(grpc_server_handled_total{%(etcd_selector)s, grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
/
sum(rate(grpc_server_handled_total{%(etcd_selector)s}[5m])) without (grpc_type, grpc_code)
> 5
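
Both alert expressions above have the same shape: 100 times the per-second rate of failed requests, divided by the rate of all requests, compared against a threshold (1 in the first rule, 5 in the second). A small Python sketch of that arithmetic follows; the function names and the sample rates are made up for illustration, and returning NaN for a zero total is an assumption about how a 0/0 division should be treated here.

```python
import math


def failed_request_percent(failed_rate: float, total_rate: float) -> float:
    """Mirror `100 * failed / total`, with both rates in requests per second."""
    if total_rate == 0:
        # Assumption: treat 0/0 as NaN so that no threshold comparison passes.
        return math.nan
    return 100 * failed_rate / total_rate


def exceeds(percent: float, threshold: float) -> bool:
    """Mirror the trailing `> threshold` comparison of the alert expression."""
    return percent > threshold


if __name__ == "__main__":
    # Illustrative numbers: 0.5 failed requests/s out of 25 requests/s.
    pct = failed_request_percent(0.5, 25.0)
    print(pct, exceeds(pct, 1), exceeds(pct, 5))
```

With these sample rates the failure share is 2%, which is above the threshold of 1 but below the threshold of 5.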
@@ -207,7 +207,7 @@
summary: 'etcd cluster 99th percentile commit durations are too high.',
},
},
- {
+ {
alert: 'etcdBackendQuotaLowSpace',
expr: |||
(etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100 > 95
@@ -219,8 +219,8 @@
annotations: {
message: 'etcd cluster "{{ $labels.job }}": database size exceeds the defined quota on etcd instance {{ $labels.instance }}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.',
},
- },
- {
+ },
+ {
alert: 'etcdExcessiveDatabaseGrowth',
expr: |||
increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
@@ -232,7 +232,7 @@
annotations: {
message: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes leading to 50% increase in database size over the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.',
},
- },
+ },
],
},
],
@@ -243,7 +243,7 @@
uid: std.md5('etcd.json'),
title: 'etcd',
description: 'etcd sample Grafana dashboard with Prometheus',
- tags: [ 'etcd-mixin' ],
+ tags: ['etcd-mixin'],
style: 'dark',
timezone: 'browser',
editable: true,
@@ -369,7 +369,7 @@
step: 2,
},
{
- expr: 'sum(rate(grpc_server_handled_total{job="$cluster",grpc_type="unary",grpc_code!="OK"}[5m]))',
+ expr: 'sum(rate(grpc_server_handled_total{job="$cluster",grpc_type="unary",grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m]))',
format: 'time_series',
intervalFactor: 2,
legendFormat: 'RPC Failed Rate',