mixin/receive: add limits alerting #6466

Merged
merged 8 commits into from
Jun 22, 2023
Changes from 5 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -397,6 +397,7 @@ The binaries published with this release are built with Go1.17.8 to avoid [CVE-2
- [#4874](https://github.com/thanos-io/thanos/pull/4874) Query: Add `--endpoint-strict` flag to statically configure Thanos API server endpoints. It is similar to `--store-strict` but supports passing any Thanos gRPC APIs: StoreAPI, MetadataAPI, RulesAPI, TargetsAPI and ExemplarsAPI.
- [#4868](https://github.com/thanos-io/thanos/pull/4868) Rule: Support ruleGroup limit introduced by Prometheus v2.31.0.
- [#4897](https://github.com/thanos-io/thanos/pull/4897) Query: Add validation for querier address flags.
- [#6466](https://github.com/thanos-io/thanos/pull/6466) Mixin (Receive): add limits alerting for configuration reload and meta-monitoring.

### Fixed

18 changes: 18 additions & 0 deletions examples/alerts/alerts.md
@@ -530,6 +530,24 @@ rules:
  for: 3h
  labels:
    severity: critical
- alert: ThanosReceiveLimitsConfigReloadFailure
  annotations:
    description: Thanos Receive {{$labels.job}} has not been able to reload the limits configuration.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitsconfigreloadfailure
    summary: Thanos Receive has not been able to reload the limits configuration.
  expr: sum by(job) (increase(thanos_receive_limits_config_reload_err_total{job=~".*thanos-receive.*"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate
  annotations:
    description: Thanos Receive {{$labels.job}} is failing for {{$value | humanize}}% of meta monitoring queries.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitshighmetamonitoringqueriesfailurerate
    summary: Thanos Receive has not been able to update the number of head series.
  expr: (sum by(job) (increase(thanos_receive_metamonitoring_failed_queries_total{job=~".*thanos-receive.*"}[5m])) / 20) * 100 > 20
  for: 5m
  labels:
    severity: warning
```
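The failure-rate expression in `ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate` divides the 5-minute increase of failed queries by a constant 20 and compares the resulting percentage with 20. The following is a minimal sketch of that arithmetic, assuming one meta-monitoring query every 15 seconds (the interval stated in the mixin source comment). The function and constant names are hypothetical and are not part of Thanos.

```python
# Hypothetical helpers mirroring the arithmetic of the alert expression.
# Nothing here is Thanos code; the names are illustrative only.

QUERY_INTERVAL_SECONDS = 15  # assumed interval between meta-monitoring queries
WINDOW_SECONDS = 5 * 60      # the [5m] range used by increase()
THRESHOLD_PERCENT = 20       # default metaMonitoringErrorThreshold


def expected_queries() -> float:
    """Number of queries expected inside the window (300 / 15 = 20)."""
    return WINDOW_SECONDS / QUERY_INTERVAL_SECONDS


def failure_percent(failed_queries: float) -> float:
    """Percentage of expected queries that failed in the window.

    Algebraically the same as (failed / 20) * 100 in the PromQL expression.
    """
    return failed_queries * 100 / expected_queries()


def condition_holds(failed_queries: float) -> bool:
    """True when the expression would return a sample (strictly above threshold)."""
    return failure_percent(failed_queries) > THRESHOLD_PERCENT


if __name__ == "__main__":
    for failed in (0, 4, 5, 20):
        print(failed, failure_percent(failed), condition_holds(failed))
```

Under these assumptions, four failed queries in five minutes sit exactly on the threshold and do not satisfy the condition, while five do. The alert additionally needs the condition to hold for the `for: 5m` duration before it fires. If a deployment queries at a different interval, the hard-coded divisor 20 no longer equals the number of expected queries, so the reported percentage would be skewed.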

## Replicate
18 changes: 18 additions & 0 deletions examples/alerts/alerts.yaml
@@ -274,6 +274,24 @@ groups:
    for: 3h
    labels:
      severity: critical
  - alert: ThanosReceiveLimitsConfigReloadFailure
    annotations:
      description: Thanos Receive {{$labels.job}} has not been able to reload the limits configuration.
      runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitsconfigreloadfailure
      summary: Thanos Receive has not been able to reload the limits configuration.
    expr: sum by(job) (increase(thanos_receive_limits_config_reload_err_total{job=~".*thanos-receive.*"}[5m])) > 0
    for: 5m
    labels:
      severity: warning
  - alert: ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate
    annotations:
      description: Thanos Receive {{$labels.job}} is failing for {{$value | humanize}}% of meta monitoring queries.
      runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitshighmetamonitoringqueriesfailurerate
      summary: Thanos Receive has not been able to update the number of head series.
    expr: (sum by(job) (increase(thanos_receive_metamonitoring_failed_queries_total{job=~".*thanos-receive.*"}[5m])) / 20) * 100 > 20
    for: 5m
    labels:
      severity: warning
- name: thanos-sidecar
  rules:
  - alert: ThanosSidecarBucketOperationsFailed
26 changes: 26 additions & 0 deletions mixin/alerts/receive.libsonnet
@@ -5,6 +5,7 @@
    httpErrorThreshold: 5,
    ingestionThreshold: 50,
    forwardErrorThreshold: 20,
    metaMonitoringErrorThreshold: 20,
    refreshErrorThreshold: 0,
    p99LatencyThreshold: 10,
    dimensions: std.join(', ', std.objectFields(thanos.targetGroups) + ['job']),
@@ -144,6 +145,31 @@
              severity: 'critical',
            },
          },
          {
            alert: 'ThanosReceiveLimitsConfigReloadFailure',
            annotations: {
              description: 'Thanos Receive {{$labels.job}}%s has not been able to reload the limits configuration.' % location,
              summary: 'Thanos Receive has not been able to reload the limits configuration.',
            },
            expr: 'sum by(%(dimensions)s) (increase(thanos_receive_limits_config_reload_err_total{%(selector)s}[5m])) > 0' % thanos.receive,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
          },
          {
            alert: 'ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate',
            annotations: {
              description: 'Thanos Receive {{$labels.job}}%s is failing for {{$value | humanize}}%% of meta monitoring queries.' % location,
              summary: 'Thanos Receive has not been able to update the number of head series.',
            },
            // Values are updated every 15s, 20 times over 5 minutes.
            expr: '(sum by(%(dimensions)s) (increase(thanos_receive_metamonitoring_failed_queries_total{%(selector)s}[5m])) / 20) * 100 > %(metaMonitoringErrorThreshold)s' % thanos.receive,
            'for': '5m',
            labels: {
              severity: 'warning',
            },
          },
        ],
      },
    ],
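The Jsonnet `%` operator used in the rules above follows Python's string-formatting rules, so the way the templates expand into the generated YAML can be reproduced in Python. The sketch below is illustrative only. The values for `selector`, `dimensions`, and `location` are assumptions chosen to match the rendered rule in `examples/alerts/alerts.yaml`; they are not read from the mixin configuration.

```python
# Illustrative rendering of the mixin templates with Python's % operator.
# The dictionary below stands in for `thanos.receive`; its values are assumed.
receive = {
    "selector": 'job=~".*thanos-receive.*"',  # assumed selector
    "dimensions": "job",                      # assumed: no extra target groups
    "metaMonitoringErrorThreshold": 20,       # default added by this change
}
location = ""  # assumed empty, i.e. no extra location text in the description

expr_template = (
    "(sum by(%(dimensions)s) "
    "(increase(thanos_receive_metamonitoring_failed_queries_total"
    "{%(selector)s}[5m])) / 20) * 100 > %(metaMonitoringErrorThreshold)s"
)

# `%s` takes the single positional value; `%%` renders as a literal percent sign.
description_template = (
    "Thanos Receive {{$labels.job}}%s is failing for "
    "{{$value | humanize}}%% of meta monitoring queries."
)

expr = expr_template % receive
description = description_template % location

if __name__ == "__main__":
    print(expr)
    print(description)
```

Note that only the comparison threshold is templated. The divisor 20 is a literal in the template, so overriding `metaMonitoringErrorThreshold` changes the right-hand side of the comparison but not the assumed number of queries per window.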
2 changes: 2 additions & 0 deletions mixin/runbook.md
@@ -63,6 +63,8 @@
|ThanosReceiveHighHashringFileRefreshFailures|Thanos Receive is failing to refresh hasring file.|Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value humanize}} of attempts failed.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighhashringfilerefreshfailures](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighhashringfilerefreshfailures)|
|ThanosReceiveConfigReloadFailure|Thanos Receive has not been able to reload configuration.|Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveconfigreloadfailure](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveconfigreloadfailure)|
|ThanosReceiveNoUpload|Thanos Receive has not uploaded latest data to object storage.|Thanos Receive {{$labels.instance}} has not uploaded latest data to object storage.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivenoupload](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivenoupload)|
|ThanosReceiveLimitsConfigReloadFailure|Thanos Receive has not been able to reload the limits configuration.|Thanos Receive {{$labels.job}} has not been able to reload the limits configuration.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitsconfigreloadfailure](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitsconfigreloadfailure)|
|ThanosReceiveLimitsHighMetaMonitoringQueriesFailureRate|Thanos Receive has not been able to update the number of head series.|Thanos Receive {{$labels.job}} is failing for {{$value humanize}}% of meta monitoring queries.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitshighmetamonitoringqueriesfailurerate](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivelimitshighmetamonitoringqueriesfailurerate)|

## thanos-rule
