ruler: probes are not started while WAL is being replayed #4280
Comments
Valid bug. Maybe @ianbillett wants to try to fix it? 🤗
BTW, note that we are close to making the stateless ruler happen (#4250), but the WAL will still be there 🙃
Cool, I'm also happy to work on this. Was just not sure if this was an actual bug or not.
@OGKevin nice report! Since you reported the bug, you have the right to fix it. It doesn't seem too complicated, so a nice opportunity to get a commit into the Thanos code base :) To fix, I think we just need to change the ordering of the ruler startup (as you point out above) so that we initialise the probes, and mark ourselves as healthy before then initialising the TSDB.
Due to the current dependency chain, the current order makes sense: the HTTP and gRPC servers need the Rule Manager, and the Rule Manager needs the TSDB. I can think of 2 solutions atm:
I think we need solution 2, but it won't be complex. It will be exactly the same as thanos/pkg/receive/multitsdb.go, line 383 in 13ab756.
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
Thanos, Prometheus and Golang version used:
Thanos: 20e8105ec5b697a4755f2ff735509924173bd5cf
What happened:
The ruler crashed and was restarted, so it now needs to replay the WAL. However, the probes are only started after the WAL has been read. The code at thanos/cmd/thanos/rule.go, lines 320 to 323 in f2564a7, eventually calls head.Init in tsdb, which replays the WAL. The health probes are started afterwards, at thanos/cmd/thanos/rule.go, lines 531 to 537 in f2564a7.
This means that while ruler is reading the WAL, the liveness probe will fail and you end up in a crash loop unless you significantly increase the initial delay.
What you expected to happen:
The probes should be the first thing to start, while the rest of the ruler starts up. Reading the WAL should be considered healthy but not ready, while a failure to read the WAL should be considered unhealthy.