Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alerts: Fixed compactor alert to use correct aggregation function. #2875

Merged
merged 1 commit into from
Jul 10, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion examples/alerts/alerts.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ rules:
- alert: ThanosCompactHasNotRun
annotations:
message: Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.
expr: (time() - max(thanos_objstore_bucket_last_successful_upload_time{job=~"thanos-compact.*"}))
expr: (time() - max(max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~"thanos-compact.*"}[24h])))
/ 60 / 60 > 24
labels:
severity: warning
Expand Down
2 changes: 1 addition & 1 deletion examples/alerts/alerts.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ groups:
- alert: ThanosCompactHasNotRun
annotations:
message: Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.
expr: (time() - max(thanos_objstore_bucket_last_successful_upload_time{job=~"thanos-compact.*"}))
expr: (time() - max(max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~"thanos-compact.*"}[24h])))
/ 60 / 60 > 24
labels:
severity: warning
Expand Down
2 changes: 1 addition & 1 deletion mixin/alerts/compact.libsonnet
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@
annotations: {
message: 'Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.',
},
expr: '(time() - max(thanos_objstore_bucket_last_successful_upload_time{%(selector)s})) / 60 / 60 > 24' % thanos.compact,
expr: '(time() - max(max_over_time(thanos_objstore_bucket_last_successful_upload_time{%(selector)s}[24h]))) / 60 / 60 > 24' % thanos.compact,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm .. the max_over_time doesn't look quite right .. don't we just want to ignore the rollout, so ignore identifying labels? something along the lines of max without(instance, namespace, pod) (...) ? (and make identifying labels configurable)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends what we want to achieve.

To me we want manual eyes on compactor when it has no uploads for longer time = 1d. This alert does that, if after 1d none of new instances uploaded anything to bucket, we have a problem (:

What's missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested, works as expected IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got it now. Yes, this seems fine.
The problem is the fact that for every new deployment the thanos_objstore_bucket_last_successful_upload_time until the first upload is 0 so we alert on not having uploaded since 1970...
However, taking the max_over_time mitigates this.
LGTM

labels: {
severity: 'warning',
},
Expand Down