alerts: Fixed compactor alert to use correct aggregation function. #2875

Merged: brancz merged 1 commit into master from fixed-upload-compact-alert on Jul 10, 2020

Conversation

bwplotka
Member

`max` only aggregates across series. Because of rollouts, we need to aggregate across time as well as across series.

The alert was flaky on essentially every rollout, because the last upload time is 0 after a reset.

Signed-off-by: Bartlomiej Plotka [email protected]
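To illustrate the difference described above, here is a minimal Python sketch of a single gauge series across a restart. It is a simplified model and not how Prometheus evaluates queries: the helper names (`instant_value`, `max_over_time`, `hours_since_upload`), the hourly scrape interval, and the timestamps are all assumptions made for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (scrape time in seconds, gauge value)

HOUR = 60 * 60
DAY = 24 * HOUR


def instant_value(samples: List[Sample], now: float) -> float:
    """Latest sample at or before `now` (what the old expression looked at)."""
    visible = [value for ts, value in samples if ts <= now]
    return visible[-1]


def max_over_time(samples: List[Sample], now: float, window: float = DAY) -> float:
    """Largest sample value in the window (now - window, now]."""
    in_window = [value for ts, value in samples if now - window < ts <= now]
    return max(in_window)


def hours_since_upload(last_upload: float, now: float) -> float:
    """Mirrors (time() - <last upload>) / 60 / 60 from the alert expression."""
    return (now - last_upload) / 60 / 60


# Hypothetical timeline: the last upload happened at `base`; the process is
# restarted three hours later, after which the gauge reports 0.
base = 1_594_339_200.0
samples = [(base + h * HOUR, base if h < 3 else 0.0) for h in range(31)]

# Four hours after the last upload, shortly after the restart.
now = base + 4 * HOUR
old_hours = hours_since_upload(instant_value(samples, now), now)
new_hours = hours_since_upload(max_over_time(samples, now), now)

# Thirty hours in, with no upload since the restart.
later = base + 30 * HOUR
later_hours = hours_since_upload(max_over_time(samples, later), later)

print(old_hours > 24, new_hours > 24, later_hours > 24)
```

In this model the instantaneous value crosses the 24 hour threshold right after the restart, the windowed maximum does not, and the windowed maximum still crosses it once a full day passes without any upload.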

@@ -73,7 +73,7 @@
annotations: {
message: 'Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.',
},
- expr: '(time() - max(thanos_objstore_bucket_last_successful_upload_time{%(selector)s})) / 60 / 60 > 24' % thanos.compact,
+ expr: '(time() - max(max_over_time(thanos_objstore_bucket_last_successful_upload_time{%(selector)s}[24h]))) / 60 / 60 > 24' % thanos.compact,
Member

hmm .. the max_over_time doesn't look quite right .. don't we just want to ignore the rollout, so ignore identifying labels? something along the lines of max without(instance, namespace, pod) (...) ? (and make identifying labels configurable)

Member Author

Depends what we want to achieve.

To me, we want human eyes on the compactor when it has had no uploads for a longer time (1d). This alert does that: if after 1d none of the new instances has uploaded anything to the bucket, we have a problem (:

What's missing?

Member Author

Tested, works as expected IMO

Contributor

Ok, got it now. Yes, this seems fine.
The problem is that for every new deployment, thanos_objstore_bucket_last_successful_upload_time is 0 until the first upload, so we alert on not having uploaded since 1970...
However, taking the max_over_time mitigates this.
LGTM
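The "since 1970" arithmetic can be checked with a short Python sketch. The evaluation date (the day this PR was merged) and the function name `hours_without_upload` are assumptions chosen for illustration; only the formula comes from the alert expression.

```python
from datetime import datetime, timezone


def hours_without_upload(now_s: float, last_upload_s: float) -> float:
    """Mirrors (time() - <last upload>) / 60 / 60 from the alert expression."""
    return (now_s - last_upload_s) / 60 / 60


# Assumed evaluation time: midnight UTC on the day this PR was merged.
now = datetime(2020, 7, 10, tzinfo=timezone.utc).timestamp()

# A freshly started compactor reports 0 until its first upload.
fresh_gauge = 0.0

hours = hours_without_upload(now, fresh_gauge)
years = hours / 24 / 365.25

print(hours, round(years, 1), hours > 24)
```

A gauge value of 0 is read as the Unix epoch, so the computed gap is roughly fifty years, far above the 24 hour threshold.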

@bwplotka bwplotka requested a review from brancz July 10, 2020 10:46
Member

@brancz brancz left a comment

happy to revisit should this not catch compaction failures in the future :)

@brancz brancz merged commit 1ae66b9 into master Jul 10, 2020
@brancz brancz deleted the fixed-upload-compact-alert branch July 10, 2020 11:04
@pracucci
Contributor

Question: doesn't this change only protect against false positives when the pod ID doesn't change between rollouts (e.g. StatefulSet), while the alert still triggers otherwise? Am I missing anything?

@brancz
Member

brancz commented Jul 22, 2020

That was the intention iirc yes.
