
Ruler not evaluating any rules #4772

Open
jessicalins opened this issue Oct 12, 2021 · 17 comments

@jessicalins
Contributor

Thanos version used:
Thanos v0.23.1, deployed as a sidecar

Object Storage Provider: S3

What happened:

  • Thanos Ruler did not evaluate any rules, causing an alert to fire (the alert definition used is the same as ThanosNoRuleEvaluations from the Thanos mixin)
  • Ruler pods were also up and healthy.
  • After the ruler stopped evaluating, it did not log any lines.
  • The Ruler memory profile was also affected:
    [screenshot: Ruler memory profile graph]
  • Issue was resolved after ruler pods were restarted

What you expected to happen:

  • Rules do not stop being evaluated as long as there are rules to evaluate
  • Ruler always evaluates at the specified interval
  • If the Ruler stops evaluating, log lines are emitted

Anything else we need to know:
Screenshots that may help debugging the issue:

[screenshots attached]

After restarting the pods:
[screenshot: metrics after restarting the pods]

@bwplotka
Member

This is an epic report, thank you @jessicalins! A perfect pattern for providing all possible info 🤗

@GiedriusS
Member

What about the output of ${HTTP_IP}:${HTTP_PORT}/debug/pprof/goroutine?debug=1 on Rule when this happens? Could you please upload it? I'm still not sure what happened here.
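
(For anyone capturing this later, a minimal Go sketch for saving that goroutine dump; the address below is a placeholder, 10902 being the default Thanos HTTP port, so substitute whatever --http-address the Ruler was started with.)

```go
// Sketch: fetch the Ruler's goroutine dump and save it to a file.
// The address is an assumption; substitute the Ruler's real --http-address.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=1 returns a human-readable, aggregated goroutine listing.
	resp, err := http.Get("http://localhost:10902/debug/pprof/goroutine?debug=1")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("thanos-rule.goroutines.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```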

@bwplotka
Member

Yup, too late ): Good point about goroutines - we forgot. Let's capture it next time it happens. We lost all the pprof data 🤗

@jleloup
Contributor

jleloup commented Dec 2, 2021

I happen to have the same behaviour on my clusters since the v0.23.1 update.

Rolling back Thanos Ruler to 0.22.0 did the trick for us to work around this issue.

Maybe I can help by providing those pprof dumps. I don't have any experience with that right now, but I can try the request you provided, @GiedriusS.

Edit: I spoke too soon: rolling back to 0.22.0 didn't actually help that much.
We got our recording rule results back for some time, but now they are missing again.

@jleloup
Contributor

jleloup commented Dec 2, 2021

thanos-rule.pprof.txt

This is from one of our Thanos Rulers that is currently failing to process some recording rules (I haven't figured out yet whether this applies to all of them or not). Version 0.22.0.

I'm waiting to have some failures from a 0.23.1 ruler.

@jleloup
Contributor

jleloup commented Dec 2, 2021

One lead we are testing right now for this issue: fine-tuning Thanos Query & Query Frontend.
We have increased some concurrency parameters and the like to ensure that there is no bottleneck on the query path that would slow down Thanos Ruler queries.

It is still a bit too soon to draw any conclusions, but as of now our Thanos Ruler records are much more stable.

@jleloup
Contributor

jleloup commented Dec 13, 2021

Update: increasing Thanos Query performance helped for some time, but eventually our Thanos Rule instances end up evaluating no rules at all.
The only thing I can add is that the number of goroutines increases a lot when Thanos Ruler stops evaluating:

[screenshot, 2021-12-13 16:20: goroutine count graph]

So I suppose something clogs Thanos Ruler at some point and those goroutines never end properly.
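
A rough, self-contained sketch for watching that growth over time: it polls the Ruler's /metrics endpoint and prints the standard go_goroutines gauge so the increase can be correlated with the point where evaluations stop. The address is a placeholder (10902 is the default Thanos HTTP port).

```go
// Sketch: periodically scrape the Ruler's /metrics endpoint and log the
// go_goroutines gauge. The address is a placeholder.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	for {
		resp, err := http.Get("http://localhost:10902/metrics")
		if err != nil {
			fmt.Println("scrape failed:", err)
		} else {
			sc := bufio.NewScanner(resp.Body)
			for sc.Scan() {
				if line := sc.Text(); strings.HasPrefix(line, "go_goroutines ") {
					fmt.Println(time.Now().Format(time.RFC3339), line)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
```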

@jmichalek132
Contributor

We hit this too in one of our clusters with Ruler version 0.23.1 and the same pattern (an increase in the number of goroutines over time). I am unfortunately not able to provide a pprof dump, because the priority when this was discovered was to mitigate, so we restarted all the pods. Could it, however, be possible that this is caused by a similar issue to #4795?

@jmichalek132
Contributor

@jleloup We didn't encounter this kind of issue with v0.21.1, so I am going to roll back the Ruler to that version.

@GiedriusS
Member

v0.23.2 contains the fix for #4795 so I'd suggest trying that out to see whether it helps (:

@jmichalek132
Contributor

@GiedriusS thanks for the quick response. A quick question on that: is that code path executed in Ruler mode?

@GiedriusS
Member

Ruler executes queries using the same /api/v1/query_range API, and that API might not return any responses due to #4795. So I think what happens in this case is that the Prometheus rule manager still continuously tries to evaluate those alerting/recording rules, but because no response is retrieved from Thanos, the memory usage stays more or less the same. 🤔
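
To illustrate the suspected mechanism, here is a self-contained sketch (not Thanos code): if evaluations keep being fired on a fixed schedule against a query endpoint that never answers, and the client has no timeout, every evaluation goroutine blocks and the goroutine count only grows, matching the graphs posted above.

```go
// Illustration only (not Thanos code): scheduled "evaluations" issued against
// a query endpoint that never answers, using a client with no timeout.
// Each evaluation goroutine blocks indefinitely, so the count keeps rising.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime"
	"time"
)

func main() {
	// Stand-in for a stuck /api/v1/query_range endpoint.
	stuck := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Hour) // never answers within this program's lifetime
	}))

	client := &http.Client{} // note: no Timeout set

	for i := 0; i < 20; i++ {
		go func() {
			// Each scheduled evaluation blocks here.
			_, _ = client.Get(stuck.URL + "/api/v1/query_range")
		}()
		time.Sleep(100 * time.Millisecond)
		fmt.Println("goroutines:", runtime.NumGoroutine())
	}
}
```

The real Ruler query path is more involved (contexts, retries), so this only shows the shape of the leak, not the exact code path behind #4795.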

@jmichalek132
Contributor


That might be what happened in our case: we upgraded all Thanos components from v0.21.0 to v0.23.1. We noticed some query performance degradation (at the same time the Ruler in one cluster got stuck this way), so we downgraded the Thanos Query instances, but not the Ruler instances, and we didn't notice this Ruler being stuck in this state until now.

@ahurtaud
Contributor

Hello, I think I have the same issue with 0.24. Can others confirm?
I also commented on #4924, which may be a duplicate...

@phoenixking25

Facing this in 0.24 as well.

@stale

stale bot commented Jun 12, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jun 12, 2022
@panchambaruahwise

This issue is still being observed in thanos:v0.24.0

@stale stale bot removed the stale label Aug 8, 2022