alerts: Fixed compactor alert to use correct aggregation function. #2875

bwplotka · 2020-07-10T09:40:13Z

max only aggregates across series. Because of rollouts, we need to aggregate across time as well as across series.

The alert was flaky on essentially every rollout, because the last-upload time is 0 after a reset.

Signed-off-by: Bartlomiej Plotka [email protected]

brancz · 2020-07-10T10:00:00Z

mixin/alerts/compact.libsonnet

@@ -73,7 +73,7 @@
            annotations: {
              message: 'Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.',
            },
-            expr: '(time() - max(thanos_objstore_bucket_last_successful_upload_time{%(selector)s})) / 60 / 60 > 24' % thanos.compact,
+            expr: '(time() - max(max_over_time(thanos_objstore_bucket_last_successful_upload_time{%(selector)s}[24h]))) / 60 / 60 > 24' % thanos.compact,


Hmm, the max_over_time doesn't look quite right. Don't we just want to ignore the rollout, i.e. ignore the identifying labels? Something along the lines of max without(instance, namespace, pod) (...)? (And make the identifying labels configurable.)

Depends what we want to achieve.

To me, we want manual eyes on the compactor when it has had no uploads for a longer time (1d). This alert does that: if after 1d none of the new instances has uploaded anything to the bucket, we have a problem (:

What's missing?

Tested, works as expected IMO

Ok, got it now. Yes, this seems fine.
The problem is that for every new deployment, thanos_objstore_bucket_last_successful_upload_time is 0 until the first upload, so we alert on not having uploaded since 1970...
However, taking the max_over_time mitigates this.
LGTM
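
To make the difference between the two expressions concrete, here is a minimal Python sketch that approximates how each one evaluates around a rollout. It is only an illustration, not Prometheus itself: the series names (old-pod, new-pod), the 60-second scrape interval, the example timestamps, and the helper functions are all assumptions made for this example. The 5-minute lookback mirrors Prometheus' default lookback delta; real staleness handling is more involved.

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (scrape time, gauge value), both Unix seconds
HOUR = 3600.0
STEP = 60.0  # assumed scrape interval


def latest_value(samples: List[Sample], now: float, lookback: float = 300.0) -> Optional[float]:
    """Approximate an instant selector: newest sample within the lookback window."""
    recent = [(t, v) for t, v in samples if now - lookback <= t <= now]
    return max(recent)[1] if recent else None


def max_in_window(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Approximate max_over_time(metric[window]) for a single series."""
    values = [v for t, v in samples if now - window <= t <= now]
    return max(values) if values else None


def hours_since_upload_old(series: Dict[str, List[Sample]], now: float) -> float:
    """Old expression: (time() - max(metric)) / 60 / 60."""
    values = [latest_value(s, now) for s in series.values()]
    # Assumes at least one series has a fresh sample at `now`.
    return (now - max(v for v in values if v is not None)) / 60 / 60


def hours_since_upload_new(series: Dict[str, List[Sample]], now: float) -> float:
    """New expression: (time() - max(max_over_time(metric[24h]))) / 60 / 60."""
    values = [max_in_window(s, now, 24 * HOUR) for s in series.values()]
    # Assumes at least one series has a sample in the last 24h.
    return (now - max(v for v in values if v is not None)) / 60 / 60


def scrape(start: float, end: float, value: float) -> List[Sample]:
    """One sample every STEP seconds in [start, end], all with the same value."""
    count = int((end - start) // STEP)
    return [(start + i * STEP, value) for i in range(count + 1)]


ROLLOUT = 1_594_374_000.0  # arbitrary Unix time chosen for the example

series = {
    # Old instance: last successful upload 3h before the rollout, gone afterwards.
    "old-pod": scrape(ROLLOUT - 3 * HOUR, ROLLOUT, ROLLOUT - 3 * HOUR),
    # New instance: gauge stays at 0 because it has not uploaded anything yet.
    "new-pod": scrape(ROLLOUT + STEP, ROLLOUT + 30 * HOUR, 0.0),
}

one_hour_after = ROLLOUT + HOUR
print(hours_since_upload_old(series, one_hour_after) > 24)   # old alert condition
print(hours_since_upload_new(series, one_hour_after) > 24)   # new alert condition
print(hours_since_upload_new(series, ROLLOUT + 26 * HOUR) > 24)  # still no upload a day later
```

Under these assumptions, one hour after the rollout the old expression only sees the new instance's 0 and therefore crosses the 24-hour threshold, while the new expression still sees the old instance's last upload inside the 24h window and reports 4 hours. If no instance uploads for more than a day after the rollout, the new expression crosses the threshold as well, which is the case the alert is meant to catch.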

brancz

happy to revisit should this not catch compaction failures in the future :)

pracucci · 2020-07-21T08:31:32Z

Question: doesn't this change only protect from false positives when the pod ID doesn't change between rollouts (e.g. StatefulSet), while it still triggers otherwise? Am I missing anything?

brancz · 2020-07-22T07:47:40Z

That was the intention, iirc, yes.

alerts: Fixed compactor alert to use correct aggregation function.

eaa92c5

bwplotka requested review from metalmatze, GiedriusS, brancz and squat July 10, 2020 09:40

brancz reviewed Jul 10, 2020

bwplotka requested a review from brancz July 10, 2020 10:46

brancz approved these changes Jul 10, 2020

brancz merged commit 1ae66b9 into master Jul 10, 2020

brancz deleted the fixed-upload-compact-alert branch July 10, 2020 11:04

metalmatze approved these changes Jul 10, 2020
