Ruler not evaluating any rules #4772
Comments
This is an epic report - @jessicalins thank you! Perfect pattern for providing all possible info 🤗
What about
Yup, too late ): Good point about goroutines - we forgot. Let's capture it next time it happens. We lost all pprof things 🤗
I happen to have the same behaviour on my clusters since the v0.23.1 update. Rolling back Thanos Ruler to 0.22.0 did the trick for us to work around this issue. Maybe I can help providing those pprof dumps. I don't have experience with that right now but I can try the request you provided @GiedriusS Edit: I spoke too soon: rolling back to 0.22.0 didn't help that much actually.
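For anyone who wants to grab those dumps before restarting, here is a minimal sketch, assuming the ruler exposes Go's standard net/http/pprof handlers on its HTTP port (10902 by default); `thanos-ruler:10902` is a placeholder address for your own deployment. It just saves a full goroutine dump to a file you can attach here.

```go
// Minimal sketch for capturing a goroutine dump from a Thanos Ruler.
// Assumes the standard net/http/pprof endpoints are served on the HTTP
// port; "thanos-ruler:10902" is a placeholder address.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns human-readable stack traces for every goroutine,
	// which is enough to see what the leaked goroutines are blocked on.
	resp, err := http.Get("http://thanos-ruler:10902/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("ruler-goroutines.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```

The same endpoint (without `debug=2`) can also be fed straight to `go tool pprof` for an interactive view.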
This is from one of our Thanos Rulers currently failing to process some recording rules (I haven't figured out yet if this applies to all of them or not). Version 0.22.0. I'm waiting to have some failures from a 0.23.1 ruler.
One lead we are testing right now for this issue: fine-tuning Thanos Query & Query Frontend. It is still a bit too soon to draw any conclusions, though as of now our Thanos Ruler recording rules are way more stable.
Update: increasing Thanos Query performance helped for some time, but eventually our Thanos Rule instances end up evaluating no rules at all. So I suppose something clogs Thanos Ruler at some point and those goroutines never end properly.
We hit this too in one of our clusters with ruler version 0.23.1 and the same pattern (an increasing number of goroutines over time). I am unfortunately not able to provide pprof, because the priority when this was discovered was to mitigate, so we restarted all the pods. Could it be possible, however, that this is caused by a similar issue to #4795?
@jleloup We didn't encounter this kind of issue with v0.21.1 so I am going to roll back the ruler to that version.
v0.23.2 contains the fix for #4795 so I'd suggest trying that out to see whether it helps (:
@GiedriusS thanks for the quick response. Quick question on that: is that code path executed in ruler mode?
The Ruler executes queries using the same /api/v1/query_range API, and that API might not return any responses due to #4795. So I think what happens in this case is that the Prometheus rule manager still continuously tries to evaluate those alerting/recording rules, but because no response is retrieved from Thanos, the memory usage stays more or less the same. 🤔
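A quick way to check that theory from outside the ruler is to hit the same endpoint by hand and see whether anything comes back at all. A minimal sketch, assuming a querier reachable at the placeholder address `thanos-query:9090`; the expression and time range are arbitrary:

```go
// Minimal sketch probing the /api/v1/query_range endpoint the ruler
// relies on. "thanos-query:9090" and the query expression are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

func main() {
	now := time.Now()
	params := url.Values{}
	params.Set("query", "up") // any cheap expression will do
	params.Set("start", strconv.FormatInt(now.Add(-5*time.Minute).Unix(), 10))
	params.Set("end", strconv.FormatInt(now.Unix(), 10))
	params.Set("step", "30s")

	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get("http://thanos-query:9090/api/v1/query_range?" + params.Encode())
	if err != nil {
		panic(err) // no response at all matches the symptom described above
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```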
That might be what happened in our case: we upgraded all Thanos components from v0.21.0 to v0.23.1. We noticed some query performance degradation (at the same time the ruler in one cluster got stuck this way), downgraded the Thanos Query instances but not the ruler instances, and didn't notice this ruler being stuck in this state until now.
Hello, I think I have the same issue with 0.24. Can others confirm?
Facing this in 0.24 as well.
Hello 👋 Looks like there was no activity on this issue for the last two months.
This issue is still being observed in thanos:v0.24.0
Thanos version used:
Thanos v0.23.1, deployed as a sidecar
Object Storage Provider: S3
What happened:
ThanosNoRuleEvaluations (in here)

What you expected to happen:
Anything else we need to know:
Screenshots that may help debugging the issue:
After restarting the pods: