Mixin: Lower alerting threshold for BucketIndexNotUpdated alert (#7879)
This change causes us to page when a bucket index has missed two compactor
cleanup cycles (one every 15 minutes), plus a 5-minute buffer to avoid
false positives.

Queriers check the update time of the index when they perform queries, so as
soon as its age exceeds the 1-hour maximum, queries begin to fail. We
therefore need to alert on the age of the index before queries start failing.

Signed-off-by: Nick Pillitteri <[email protected]>
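
The threshold arithmetic described in the commit message can be sketched as follows. This is only an illustration, not code from the Mimir repository: the constant and function names are hypothetical, and the values are the defaults quoted in this commit (900-second cleanup interval, 300-second buffer, 3600-second maximum index age).

```python
# Hypothetical names; values are the defaults quoted in the commit.
CLEANUP_INTERVAL_SECONDS = 900   # compactor cleanup cycle (15 minutes)
MISSED_CYCLES = 2                # alert after two missed updates
BUFFER_SECONDS = 300             # slack to avoid false positives
MAX_INDEX_AGE_SECONDS = 3600     # age beyond which queriers fail queries


def alert_threshold_seconds(
    interval: int = CLEANUP_INTERVAL_SECONDS,
    missed: int = MISSED_CYCLES,
    buffer: int = BUFFER_SECONDS,
) -> int:
    """Return the index age (in seconds) above which the alert fires."""
    return missed * interval + buffer


def index_state(age_seconds: float) -> str:
    """Classify a bucket index by the age of its last successful update."""
    if age_seconds > MAX_INDEX_AGE_SECONDS:
        return "queries failing"
    if age_seconds > alert_threshold_seconds():
        return "alerting"
    return "ok"


if __name__ == "__main__":
    threshold = alert_threshold_seconds()
    print(f"alert threshold: {threshold}s")
    print(f"headroom before queries fail: {MAX_INDEX_AGE_SECONDS - threshold}s")
```

With these defaults the alert fires once the index is more than 2100 seconds old, which leaves 1500 seconds before queries start to fail. The previous threshold of 7200 seconds was above the 3600-second limit, so queries would already be failing by the time the alert fired.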
56quarters authored Apr 11, 2024
1 parent 0830c4d commit 6f5018c
Showing 5 changed files with 12 additions and 8 deletions.
7 changes: 4 additions & 3 deletions CHANGELOG.md
@@ -37,9 +37,7 @@
### Mixin

* [CHANGE] Alerts: Removed obsolete `MimirQueriesIncorrect` alert that used test-exporter metrics. Test-exporter support was however removed in Mimir 2.0 release. #7774
- * [CHANGE] Fine-tuned `terminationGracePeriodSeconds` for the following components: #7364
-   * Querier: changed from `30` to `180`
-   * Query-scheduler: changed from `30` to `180`
+ * [CHANGE] Alerts: Change threshold for `MimirBucketIndexNotUpdated` alert to fire before queries begin to fail due to bucket index age. #7879
* [FEATURE] Dashboards: added 'Remote ruler reads networking' dashboard. #7751
* [ENHANCEMENT] Alerts: allow configuring alerts range interval via `_config.base_alerts_range_interval_minutes`. #7591
* [ENHANCEMENT] Dashboards: Add panels for monitoring distributor and ingester when using ingest-storage. These panels are disabled by default, but can be enabled using `show_ingest_storage_panels: true` config option. Similarly existing panels used when distributors and ingesters use gRPC for forwarding requests can be disabled by setting `show_grpc_ingestion_panels: false`. #7670 #7699
@@ -61,6 +59,9 @@
### Jsonnet

* [CHANGE] Memcached: Change default read timeout for chunks and index caches to `750ms` from `450ms`. #7778
+ * [CHANGE] Fine-tuned `terminationGracePeriodSeconds` for the following components: #7364
+   * Querier: changed from `30` to `180`
+   * Query-scheduler: changed from `30` to `180`
* [ENHANCEMENT] Compactor: add `$._config.cortex_compactor_concurrent_rollout_enabled` option (disabled by default) that makes use of rollout-operator to speed up the rollout of compactors. #7783 #7878
* [ENHANCEMENT] Shuffle-sharding: add `$._config.shuffle_sharding.ingest_storage_partitions_enabled` and `$._config.shuffle_sharding.ingester_partitions_shard_size` options, that allow configuring partitions shard size in ingest-storage mode. #7804
* [BUGFIX] Guard against missing samples in KEDA queries. #7691
@@ -801,7 +801,7 @@ spec:
}}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirbucketindexnotupdated
expr: |
- min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
+ min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100
labels:
severity: critical
- name: mimir_compactor_alerts
2 changes: 1 addition & 1 deletion operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -775,7 +775,7 @@ groups:
}}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirbucketindexnotupdated
expr: |
- min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
+ min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100
labels:
severity: critical
- name: mimir_compactor_alerts
2 changes: 1 addition & 1 deletion operations/mimir-mixin-compiled/alerts.yaml
@@ -789,7 +789,7 @@ groups:
}}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirbucketindexnotupdated
expr: |
- min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
+ min by(cluster, namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100
labels:
severity: critical
- name: mimir_compactor_alerts
7 changes: 5 additions & 2 deletions operations/mimir-mixin/alerts/blocks.libsonnet
@@ -220,10 +220,13 @@
},
},
{
- // Alert if the bucket index has not been updated for a given user.
+ // Alert if the bucket index has not been updated for a given user. The default update interval is 900 seconds
+ // so we alert if we've missed two updates plus a 300 second buffer to avoid false-positives. It's important
+ // that this alert fire before queriers start to return errors because the bucket index is too old (3600 seconds
+ // by default).
alert: $.alertName('BucketIndexNotUpdated'),
expr: |||
- min by(%(alert_aggregation_labels)s, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
+ min by(%(alert_aggregation_labels)s, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 2100
||| % $._config,
labels: {
severity: 'critical',