-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust and rename ThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
alert; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds
metrics
#4508
Adjust and rename ThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
alert; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds
metrics
#4508
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure if this is right approach. Sounds like we would like to set some dependency and alert only if Prometheus is healthy but sidecar is not.
I think in this case that might work, but let's make sure to be explicit. I would even vote for alert name to be explit, to something SidecarNoConnectionToStartedPrometheus
, WDYT?
Thanks a lot @bwplotka for this quick review.
Do you mean to express alert dependencies through inhibition rules? If so, I thought of that and @simonpasquier suggested that it might over complicate.
+1, How about |
Yes it might be hard, especially when one might not alert when Prometheus is reloading (which makes sense!) I would prefer what I propose as from title you expect this failing when it's not connected, but it will NOT fail if not connected AND if Prometheus is reloading which might surprise op. (: |
CHANGELOG.md
Outdated
@@ -18,6 +18,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re | |||
|
|||
- [#4468](https://github.com/thanos-io/thanos/pull/4468) Rule: Fix temporary rule filename composition issue. | |||
- [#4476](https://github.com/thanos-io/thanos/pull/4476) UI: fix incorrect html escape sequence used for '>' symbol. | |||
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Fix ThanosSidecarUnhealthy alert during prometheus WAL replay and remove ThanosSidecarUnhealthy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entry does not make sense (fix ThanosSidecarUnhealthy
and remove ThanosSidecarUnhealthy
) - what we try to say here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it was a typo :'(
What about alert name? What about |
@bwplotka looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Let's fix changelog, otherwise LGTM 💪🏽
CHANGELOG.md
Outdated
@@ -18,6 +18,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re | |||
|
|||
- [#4468](https://github.com/thanos-io/thanos/pull/4468) Rule: Fix temporary rule filename composition issue. | |||
- [#4476](https://github.com/thanos-io/thanos/pull/4476) UI: fix incorrect html escape sequence used for '>' symbol. | |||
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Refactor sidecar alerts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not useful changelog item, is it? (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to have, how about "Refactor sidecar alerts(ThanosSidecarUnhealthy, ThanosSidecarPrometheusDown)"?
Since this PR does 3 things, found it bit hard to put a concise title :)
- fix ThanosSidecarUnhealthy
- removes ThanosSidecarPrometheusDown
- rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be specific here, otherwise it's hard for user to learn quickly if they are impacted or not.
What about
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Refactor sidecar alerts. | |
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Adjusted and renamed `ThanosSidecarUnhealthy` to `ThanosSidecarNoConnectionToStartedPrometheus`; Removed `ThanosSidecarPrometheusDown` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwplotka Applied your suggestion, PTAL.
ThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 more suggestions. Also please run make docs
again - tests are failing.
Thanks!
examples/alerts/alerts.yaml
Outdated
expr: | | ||
time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240 | ||
time() - max by (pod, job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for challenging this last minute, but isn't expression with thanos_sidecar_prometheus_up
simpler?
time() - max by (pod, job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240 | |
thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"}) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question :) AFAIK, thanos_sidecar_last_heartbeat_success_time_seconds and thanos_sidecar_prometheus_up carries the same information encoded in different ways. Let me evaluate and switch to thanos_sidecar_prometheus_up if it is the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwplotka If we switch to thanos_sidecar_prometheus_up
, We may not able to show the duration at which sidecar is not connected in the alert description. Hope that is fine.
e.g.
Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 720 seconds.
@bwplotka I don't know, for some reason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM, just we need to move the changelog item now and add one more info.
CHANGELOG.md
Outdated
@@ -28,6 +28,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re | |||
- [#4476](https://github.com/thanos-io/thanos/pull/4476) UI: fix incorrect html escape sequence used for '>' symbol. | |||
- [#4532](https://github.com/thanos-io/thanos/pull/4532) Mixin: Fixed "all jobs" selector in thanos mixin dashboards. | |||
- [#4607](https://github.com/thanos-io/thanos/pull/4607) Azure: Fix Azure MSI Rate Limit | |||
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Adjust and rename `ThanosSidecarUnhealthy` to `ThanosSidecarNoConnectionToStartedPrometheus`; Remove `ThanosSidecarPrometheusDown`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has to be moved now to the top of this change log (different release)
ThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
ThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
; Remove thanos_sidecar_last_heartbeat_success_time_seconds
metrics
ThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
; Remove thanos_sidecar_last_heartbeat_success_time_seconds
metricsThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove unused ThanosSidecarPrometheusDown
; Remove thanos_sidecar_last_heartbeat_success_time_seconds
metrics
ThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove unused ThanosSidecarPrometheusDown
; Remove thanos_sidecar_last_heartbeat_success_time_seconds
metricsThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds
metrics
ThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds
metricsThanosSidecarUnhealthy
to ThanosSidecarNoConnectionToStartedPrometheus
; Remove ThanosSidecarPrometheusDown
alert; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds
metrics
Prior to this fix, ThanosSidecarUnhealthy would fire even when Prometheus is busy with WAL replay. This would trigger a false positive alert. This PR considers prometheus_tsdb_data_replay_duration_seconds metric from Prometheus for ThanosSidecarUnhealthy alert. In order to correlate Thanos and Prometheus metrics we need to specify common label(s) which can be confiured through thanosPrometheusCommonDimensions jsonnet variable. This PR also removes ThanosSidecarPrometheusDown as it would fire at the same as ThanosSidecarUnhealthy. Fixes #3915. Co-authored-by: Bartlomiej Plotka <[email protected]> Signed-off-by: Arunprasad Rajkumar <[email protected]>
…ometheus Signed-off-by: Arunprasad Rajkumar <[email protected]>
…decar_prometheus_up Signed-off-by: Arunprasad Rajkumar <[email protected]>
…_time_seconds metric Signed-off-by: Arunprasad Rajkumar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏽 Thanks!
This commit pulls latest changes form thanos mixins and sets `thanosPrometheusCommonDimensions` to `namespace, pod` for k8s use case. Refer thanos-io/thanos#4508 for more details. Signed-off-by: Arunprasad Rajkumar <[email protected]>
This commit pulls latest changes from thanos mixins and sets `thanosPrometheusCommonDimensions` to `namespace, pod` for k8s use case. Refer thanos-io/thanos#4508 for more details. Signed-off-by: Arunprasad Rajkumar <[email protected]>
Prior to this fix,
ThanosSidecarUnhealthy
would fire even whenPrometheus is busy with WAL replay. This would trigger a false positive alert.
This PR considers
prometheus_tsdb_data_replay_duration_seconds
metric fromPrometheus for
ThanosSidecarUnhealthy
alert. In order to correlateThanos and Prometheus metrics we need to specify common label(s) which
can be configured through
thanosPrometheusCommonDimensions
jsonnetvariable.
This PR also removes
ThanosSidecarPrometheusDown
as it would fire at the same asThanosSidecarUnhealthy
.Rename
ThanosSidecarUnhealthy
toThanosSidecarNoConnectionToStartedPrometheus
.Remove unused
thanos_sidecar_last_heartbeat_success_time_seconds
metric.Fixes #3915.
Signed-off-by: Arunprasad Rajkumar [email protected]
Changes
Verification