ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts fire during Prometheus WAL replay #3915
Comments
My 2 cents (but I'm not a Thanos maintainer): do we need both alerts?
Looks like we need to change the endpoint used for the heartbeat.
As of now, I agree that there is no point in having them if both alerts are going to fire at the same time.
Hi @prmsrswt, I'm trying to get familiar with the Thanos codebase and would love to take this up with some guidance. But I'm not sure where to change the endpoints.
We spent 30 minutes going through the code, so hopefully @idoqo understands the codebase well enough if we would like to change this. But... I am not convinced we should. First of all, I would exclude alerts from this discussion. We can arrange alerts in any way we want; we could create inhibit rules to make sure dependencies are set. I would love to focus on what these metrics actually mean. Maybe it's alerting we might want to adjust then?
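As an illustration of the inhibit-rule idea mentioned above, a minimal Alertmanager sketch could look like the following, keeping the alert names as they are today; the `namespace`/`pod` labels are assumptions about what both alerts have in common:

```yaml
# Illustrative Alertmanager snippet: treat ThanosSidecarPrometheusDown as the
# cause and suppress the symptom alert ThanosSidecarUnhealthy while it fires
# for the same target. The namespace/pod labels are assumptions about the
# labels shared by both alerts.
inhibit_rules:
  - source_match:
      alertname: ThanosSidecarPrometheusDown
    target_match:
      alertname: ThanosSidecarUnhealthy
    equal: ['namespace', 'pod']
```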
Thank you for the clarification @bwplotka, it makes more sense now that you explained it. I was wondering exactly that, whether my understanding of these metrics was correct, before making any assumptions about the alerts, hence why I started this discussion. As for alerting, I think the alerts should be made more resilient to WAL replays: during these periods where the sidecar is unhealthy, there is nothing the administrator can really do except wait for Prometheus to become ready. One way we could do that is by taking the WAL replay into account in the alert expressions. On another note, is it really meaningful to keep both alerts? Currently, both of them fire for the same reason, which is the sidecar being unable to reach the Prometheus instance.
Hello 👋 Looks like there was no activity on this issue for the last two months. |
This issue is still valid. cc @arajkumar @slashpai, can you please have a look once you have some time on your hands?
@arajkumar raised this issue again in Slack here.
Apparently I missed this discussion :) The proposed solution has already been discussed here. Let me address @bwplotka's concern w.r.t. alerts instead of tweaking the metrics.
IMHO, how about making use of `prometheus_tsdb_data_replay_duration_seconds` [1]?

[1] https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L211
Yesterday @paulfantom and I had a pretty long pair-programming session over Slack :) He helped me tweak the alert expression by including `prometheus_tsdb_data_replay_duration_seconds` (a rough sketch of the resulting expression is shown after this comment).
We had to introduce another label set (commonDimensions) which is common between Prometheus and Thanos. However, @paulfantom raised a concern similar to @simonpasquier's about having two different alerts, ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown, related to the same situation. @paulfantom's suggestion was to remove ThanosSidecarPrometheusDown as it seems to be a cause-based rather than a symptom-based alert. It would be helpful if you folks could shed some light on this. Thank you!
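For reference, a minimal sketch of what such a tweaked rule could look like, assuming the heartbeat-based condition is kept and `namespace`/`pod` stand in for the common dimensions; the threshold and label names are illustrative, not the exact expression from the PR:

```yaml
# Illustrative sketch only: fire when the sidecar heartbeat has been failing
# for a while AND Prometheus has already finished replaying its WAL
# (prometheus_tsdb_data_replay_duration_seconds stays at 0 until replay is
# done), joined on labels shared by both metric sets. Threshold and label
# names are assumptions.
- alert: ThanosSidecarUnhealthy
  expr: |
    time() - max by (namespace, pod) (thanos_sidecar_last_heartbeat_success_time_seconds) >= 600
    and on (namespace, pod)
    prometheus_tsdb_data_replay_duration_seconds != 0
  labels:
    severity: critical
```

The join on shared dimensions is what makes the Prometheus-side replay metric usable as a guard for the Thanos-side alert.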
I really like this approach based on the `prometheus_tsdb_data_replay_duration_seconds` metric. I would also be in favor of removing `ThanosSidecarPrometheusDown`.
Those should result in …
Hello 👋 Looks like there was no activity on this issue for the last two months. |
…ionToStartedPrometheus`; Remove `ThanosSidecarPrometheusDown` alert; Remove unused `thanos_sidecar_last_heartbeat_success_time_seconds` metrics (#4508)

* Refactor sidecar alerts

  Prior to this fix, ThanosSidecarUnhealthy would fire even when Prometheus is busy with WAL replay, triggering a false-positive alert. This PR considers the prometheus_tsdb_data_replay_duration_seconds metric from Prometheus for the ThanosSidecarUnhealthy alert. In order to correlate Thanos and Prometheus metrics we need to specify common label(s), which can be configured through the thanosPrometheusCommonDimensions jsonnet variable.

  This PR also removes ThanosSidecarPrometheusDown as it would fire at the same time as ThanosSidecarUnhealthy.

  Fixes #3915.

  Co-authored-by: Bartlomiej Plotka <[email protected]>
  Signed-off-by: Arunprasad Rajkumar <[email protected]>

* Rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus

  Signed-off-by: Arunprasad Rajkumar <[email protected]>

* Simplify ThanosSidecarNoConnectionToStartedPrometheus using thanos_sidecar_prometheus_up

  Signed-off-by: Arunprasad Rajkumar <[email protected]>

* Remove unused implementation of thanos_sidecar_last_heartbeat_success_time_seconds metric

  Signed-off-by: Arunprasad Rajkumar <[email protected]>

Co-authored-by: Bartlomiej Plotka <[email protected]>
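As a rough sketch of where the thread landed (per the commit message above, not the exact rule shipped in the mixin), the renamed alert can be expressed along these lines, with the join labels standing in for whatever thanosPrometheusCommonDimensions is configured to:

```yaml
# Illustrative sketch of the renamed alert: the sidecar cannot reach Prometheus
# even though Prometheus has already finished its WAL replay, so this points at
# a real connectivity problem rather than a startup phase. The join labels
# (namespace, pod) are assumptions standing in for thanosPrometheusCommonDimensions.
- alert: ThanosSidecarNoConnectionToStartedPrometheus
  expr: |
    thanos_sidecar_prometheus_up == 0
    and on (namespace, pod)
    prometheus_tsdb_data_replay_duration_seconds != 0
  for: 5m
  labels:
    severity: critical
```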
Hello everyone! I wanted to start a discussion regarding Thanos sidecar alerts' behaviour during Prometheus WAL replay.
Currently, during Prometheus WAL replay, both the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts can fire, depending on the duration of the replay.

To put more context around that, the `ThanosSidecarUnhealthy` and `ThanosSidecarPrometheusDown` alerts are based on two different metrics:

- `thanos_sidecar_prometheus_up`
- `thanos_sidecar_last_heartbeat_success_time_seconds`
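Roughly, the two alerts are shaped like this (selectors and thresholds here are an illustrative sketch, not exact copies of the mixin rules):

```yaml
# Rough sketch of the two alerts discussed here, based on the metrics listed
# above. Grouping labels and the threshold are assumptions.
- alert: ThanosSidecarPrometheusDown
  expr: thanos_sidecar_prometheus_up == 0
  for: 5m
  labels:
    severity: critical
- alert: ThanosSidecarUnhealthy
  expr: time() - max by (job, pod) (thanos_sidecar_last_heartbeat_success_time_seconds) >= 600
  labels:
    severity: critical
```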
Both of these metrics are updated based on the HTTP response to a heartbeat request against the `/api/v1/status/config` endpoint of the Prometheus instance to which the Thanos sidecar is attached. Only an HTTP 2XX response means that Prometheus is up and that the heartbeat succeeded.

This works flawlessly, but it might not handle the case of a WAL replay correctly. During a WAL replay, Prometheus isn't ready and responds to queries with an HTTP 503 Service Unavailable until the replay is finished. This is the case for all the endpoints of its API except `/-/healthy`, as Prometheus is still healthy despite not being ready.

What this means in practice is that during a WAL replay, both alerts fire and report that a Thanos sidecar is unhealthy and that its Prometheus instance is down. However, what really happens is that both the sidecar and Prometheus are healthy but not ready: one is waiting for the other to become ready, and the other is finishing its WAL replay.
That being said, I think that the `thanos_sidecar_prometheus_up` metric should be updated to reflect Prometheus health instead of its readiness. However, I am quite uncertain about the sidecar healthiness: should it reflect the healthiness/readiness of its Prometheus instance, or should it report unhealthy when Prometheus isn't ready?