Commit

Added alert ThanosReceiveTrafficBelowThreshold to flag unusually low ingestion rate

Signed-off-by: spaparaju <[email protected]>
spaparaju committed Apr 29, 2021
1 parent f1ee264 commit 4c6aa8b
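
The condition this alert encodes can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the rule expression in this commit, not code from the repository; the function name and arguments are hypothetical, and the `for: 1h` hold period is not modelled.

```python
def traffic_below_threshold(avg_rate_1h: float,
                            avg_rate_12h: float,
                            threshold_pct: float = 50.0) -> bool:
    """Return True when the 1h average rate is below threshold_pct
    percent of the 12h average rate (hypothetical helper)."""
    if avg_rate_12h == 0:
        # Assumption: with non-negative rates, PromQL division by zero
        # gives NaN or +Inf, neither of which passes a "< threshold"
        # filter, so no alert is produced in this case.
        return False
    return (avg_rate_1h / avg_rate_12h) * 100 < threshold_pct


# 40 req/s over the last hour against 100 req/s over 12 hours is 40%.
print(traffic_below_threshold(40.0, 100.0))
# Exactly 50% does not satisfy the strict "<" comparison.
print(traffic_below_threshold(50.0, 100.0))
```

Because the comparison is strict, a ratio of exactly 50% does not trigger the alert; the rate has to fall below the configured threshold.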
Showing 5 changed files with 56 additions and 1 deletion.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -13,7 +13,8 @@ We use _breaking :warning:_ to mark changes that are not backward compatible (re
## Unreleased

### Added
-
- [#4117](https://github.com/thanos-io/thanos/pull/4117) Mixin: new alert ThanosReceiveTrafficBelowThreshold to flag if the ingestion average of the last hour is 50% of the ingestion average for the last 12 hours.

### Fixed
-
### Changed
17 changes: 17 additions & 0 deletions examples/alerts/alerts.md
@@ -562,6 +562,23 @@ rules:
  for: 3h
  labels:
    severity: critical
- alert: ThanosReceiveTrafficBelowThreshold
  annotations:
    description: At Thanos Receive {{$labels.job}} in {{$labels.namespace}} , the
      average 1-hr avg. metrics ingestion rate is {{$value | humanize}}% of 12-hr
      avg. ingestion rate.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivetrafficbelowthreshold
    summary: Thanos Receive is experiencing low avg. 1-hr ingestion rate relative
      to avg. 12-hr ingestion rate.
  expr: |
    (
      avg by (job) (rate(http_requests_total{code=~"2..", job=~".*thanos-receive.*", handler="receive"}[1h]))
    /
      avg by (job) (rate(http_requests_total{code=~"2..", job=~".*thanos-receive.*", handler="receive"}[12h]))
    ) * 100 < 50
  for: 1h
  labels:
    severity: warning
```
## Replicate
17 changes: 17 additions & 0 deletions examples/alerts/alerts.yaml
@@ -282,6 +282,23 @@ groups:
    for: 3h
    labels:
      severity: critical
  - alert: ThanosReceiveTrafficBelowThreshold
    annotations:
      description: At Thanos Receive {{$labels.job}} in {{$labels.namespace}} , the
        average 1-hr avg. metrics ingestion rate is {{$value | humanize}}% of 12-hr
        avg. ingestion rate.
      runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivetrafficbelowthreshold
      summary: Thanos Receive is experiencing low avg. 1-hr ingestion rate relative
        to avg. 12-hr ingestion rate.
    expr: |
      (
        avg by (job) (rate(http_requests_total{code=~"2..", job=~".*thanos-receive.*", handler="receive"}[1h]))
      /
        avg by (job) (rate(http_requests_total{code=~"2..", job=~".*thanos-receive.*", handler="receive"}[12h]))
      ) * 100 < 50
    for: 1h
    labels:
      severity: warning
- name: thanos-sidecar
  rules:
  - alert: ThanosSidecarPrometheusDown
19 changes: 19 additions & 0 deletions mixin/alerts/receive.libsonnet
@@ -3,6 +3,7 @@
  receive+:: {
    selector: error 'must provide selector for Thanos Receive alerts',
    httpErrorThreshold: 5,
    ingestionThreshold: 50,
    forwardErrorThreshold: 20,
    refreshErrorThreshold: 0,
    p99LatencyThreshold: 10,
@@ -143,6 +144,24 @@
              severity: 'critical',
            },
          },
          {
            alert: 'ThanosReceiveTrafficBelowThreshold',
            annotations: {
              description: 'At Thanos Receive {{$labels.job}} in {{$labels.namespace}} , the average 1-hr avg. metrics ingestion rate is {{$value | humanize}}% of 12-hr avg. ingestion rate.',
              summary: 'Thanos Receive is experiencing low avg. 1-hr ingestion rate relative to avg. 12-hr ingestion rate.',
            },
            expr: |||
              (
                avg by (%(dimensions)s) (rate(http_requests_total{code=~"2..", %(selector)s, handler="receive"}[1h]))
              /
                avg by (%(dimensions)s) (rate(http_requests_total{code=~"2..", %(selector)s, handler="receive"}[12h]))
              ) * 100 < %(ingestionThreshold)s
            ||| % thanos.receive,
            'for': '1h',
            labels: {
              severity: 'warning',
            },
          },
        ],
      },
    ],
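
To see how the templated expression in the mixin turns into the concrete rule in `examples/alerts/alerts.yaml`, here is a rough Python sketch. It relies on the fact that Python's `%` operator with a dict accepts the same `%(name)s` placeholders used in the Jsonnet template. The `receive_config` values are assumptions read back from the generated YAML in this commit, not the actual contents of `thanos.receive`, and `render_expr` is a hypothetical helper.

```python
# Template copied from the Jsonnet expr above (indentation is illustrative).
EXPR_TEMPLATE = """\
(
  avg by (%(dimensions)s) (rate(http_requests_total{code=~"2..", %(selector)s, handler="receive"}[1h]))
/
  avg by (%(dimensions)s) (rate(http_requests_total{code=~"2..", %(selector)s, handler="receive"}[12h]))
) * 100 < %(ingestionThreshold)s"""

# Assumed values, inferred from the rendered rule in alerts.yaml.
receive_config = {
    "dimensions": "job",
    "selector": 'job=~".*thanos-receive.*"',
    "ingestionThreshold": 50,
}


def render_expr(config: dict) -> str:
    """Substitute the %(name)s placeholders with values from config."""
    return EXPR_TEMPLATE % config


rendered = render_expr(receive_config)
print(rendered)
```

Changing `ingestionThreshold` in the config is then enough to tighten or relax the alert without touching the expression itself.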
1 change: 1 addition & 0 deletions mixin/runbook.md
@@ -63,6 +63,7 @@
|ThanosReceiveHighHashringFileRefreshFailures|Thanos Receive is failing to refresh hasring file.|Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value humanize}} of attempts failed.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighhashringfilerefreshfailures](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivehighhashringfilerefreshfailures)|
|ThanosReceiveConfigReloadFailure|Thanos Receive has not been able to reload configuration.|Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveconfigreloadfailure](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceiveconfigreloadfailure)|
|ThanosReceiveNoUpload|Thanos Receive has not uploaded latest data to object storage.|Thanos Receive {{$labels.instance}} has not uploaded latest data to object storage.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivenoupload](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivenoupload)|
|ThanosReceiveTrafficBelowThreshold|Thanos Receive is experiencing low avg. 1-hr ingestion rate relative to avg. 12-hr ingestion rate.|At Thanos Receive {{$labels.job}} in {{$labels.namespace}} , the average 1-hr avg. metrics ingestion rate is {{$value humanize}}% of 12-hr avg. ingestion rate.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivetrafficbelowthreshold](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosreceivetrafficbelowthreshold)|

## thanos-rule
