alerts: Fixed compactor alert to use correct aggregation function. #2875
max is aggregating across series. We need to aggregate something across time as well as series due to rollout.
Signed-off-by: Bartlomiej Plotka <[email protected]>
@@ -73,7 +73,7 @@
 annotations: {
 message: 'Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.',
 },
-expr: '(time() - max(thanos_objstore_bucket_last_successful_upload_time{%(selector)s})) / 60 / 60 > 24' % thanos.compact,
+expr: '(time() - max(max_over_time(thanos_objstore_bucket_last_successful_upload_time{%(selector)s}[24h]))) / 60 / 60 > 24' % thanos.compact,
Hmm, the max_over_time doesn't look quite right. Don't we just want to ignore the rollout, i.e. ignore the identifying labels? Something along the lines of max without(instance, namespace, pod) (...)? (And make the identifying labels configurable.)
Depends on what we want to achieve.
To me, we want manual eyes on the compactor when it has had no uploads for a longer time (1d). This alert does that: if after 1d none of the new instances has uploaded anything to the bucket, we have a problem (:
What's missing?
Tested, works as expected IMO
Ok, got it now. Yes, this seems fine.
The problem is that for every new deployment, thanos_objstore_bucket_last_successful_upload_time is 0 until the first upload, so we alert on not having uploaded since 1970...
However, taking the max_over_time mitigates this.
LGTM
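To make the reasoning above concrete, here is a rough Python sketch of the two expressions. It is a toy model, not Prometheus itself: the 60s scrape interval, the 5m instant-vector lookback, the series name thanos-compact-0 and all timestamps are assumptions made up for illustration. It models a single series whose gauge resets to 0 after a restart.

```python
# Toy model of the old and new alert expressions. Not Prometheus itself.
HOUR = 3600
WINDOW = 24 * HOUR   # the [24h] range used by max_over_time in the new expression
LOOKBACK = 5 * 60    # assumed instant-vector lookback of 5 minutes


def instant_value(samples, now):
    """Most recent sample value within the lookback, or None if there is none."""
    recent = [(t, v) for t, v in samples if now - LOOKBACK < t <= now]
    return max(recent)[1] if recent else None


def max_over_time(samples, now):
    """Largest sample value within the 24h window, or None if there is none."""
    values = [v for t, v in samples if now - WINDOW < t <= now]
    return max(values) if values else None


def hours_since_upload(series, now, per_series):
    """(time() - max(per_series(...))) / 60 / 60, or None for an empty result."""
    values = [per_series(samples, now) for samples in series.values()]
    values = [v for v in values if v is not None]
    return (now - max(values)) / HOUR if values else None


now = 1_700_000_000            # hypothetical evaluation time
last_upload = now - 2 * HOUR   # hypothetical last successful upload, 2h ago

# One series, scraped every 60s. The process restarted about 10 minutes ago,
# so the gauge reads 0 from then on, until the first upload after the restart.
series = {
    "thanos-compact-0": (
        [(t, last_upload) for t in range(now - 2 * HOUR, now - 600, 60)]
        + [(t, 0) for t in range(now - 600, now + 1, 60)]
    ),
}

old_hours = hours_since_upload(series, now, instant_value)
new_hours = hours_since_upload(series, now, max_over_time)

print("old expression fires:", old_hours > 24)  # gauge is 0, so "since 1970"
print("new expression fires:", new_hours > 24)  # pre-restart value still in window
```

In this model the old expression sees only the post-restart 0 and fires immediately, while the new one still sees the pre-restart timestamp inside the 24h window and stays quiet. If no upload happens for a full day after the restart, that older value falls out of the window and the new expression fires as well.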
happy to revisit should this not catch compaction failures in the future :)
Question: doesn't this change only protect from false positives when the pod ID doesn't change between rollouts (e.g. StatefulSet), while still triggering otherwise? Am I missing anything?
That was the intention iirc, yes.
The alert was flaky on essentially every rollout, because the last upload time is 0 after a reset.