Thanos Ruler fails to evaluate all recording rules correctly #4924

Closed
sharathfeb12 opened this issue Dec 6, 2021 · 7 comments

Comments

@sharathfeb12

I am currently running Thanos v0.24.0-rc.0.

Some recording rules are evaluated fine, while others appear to have last been evaluated 2 days ago. This happens very frequently. Restarting the pod fixes the issue temporarily. The issue is reproducible on v0.23.0 as well.

Here are the args passed to the ruler:

- rule
- --log.level=info
- --log.format=logfmt
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --objstore.config=$(OBJSTORE_CONFIG)
- --data-dir=/thanos/data
- --eval-interval=1m
- --label=rule_replica="$(NAME)"
- --alert.label-drop=rule_replica
- --remote-write.config-file=/etc/thanos/conf/rw-config.yaml
- --query=dnssrv+_http._tcp.observatorium-thanos-query.monitoring.svc.cluster.local
- --rule-file=/etc/thanos/rules/*/*.yaml
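The symptom described above (some recording rules last evaluated days ago despite `--eval-interval=1m`) can be checked programmatically. The following is a minimal sketch, not something from the original report. It assumes the ruler serves a Prometheus-compatible rules API at `/api/v1/rules` on the `--http-address` port, and that each rule carries `type`, `name` and `lastEvaluation` fields; the function names and the 5 minute threshold are hypothetical choices made for illustration.

```python
import json
import re
import urllib.request
from datetime import datetime, timedelta, timezone

# RFC 3339 timestamp with optional fractional seconds (possibly nanoseconds).
_RFC3339 = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def parse_rfc3339(ts):
    """Parse an RFC 3339 timestamp, truncating the fraction to microseconds."""
    m = _RFC3339.match(ts)
    if m is None:
        raise ValueError(f"not an RFC 3339 timestamp: {ts!r}")
    base, frac, zone = m.groups()
    dt = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    micros = int((frac or ".0")[1:7].ljust(6, "0"))
    if zone == "Z":
        offset = timedelta(0)
    else:
        sign = 1 if zone[0] == "+" else -1
        offset = sign * timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
    return dt.replace(microsecond=micros, tzinfo=timezone(offset))


def stale_rules(payload, now, max_age):
    """Return (group, rule, age) for recording rules not evaluated within max_age.

    The payload layout (data.groups[].rules[]) is an ASSUMPTION based on the
    Prometheus rules API format.
    """
    stale = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") != "recording":
                continue
            age = now - parse_rfc3339(rule["lastEvaluation"])
            if age > max_age:
                stale.append((group["name"], rule["name"], age))
    return stale


def fetch_rules(base_url):
    """Fetch the rules payload; the endpoint path is an assumption (see above)."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/api/v1/rules") as resp:
        return json.load(resp)
```

With the ruler's HTTP port forwarded locally, something like `stale_rules(fetch_rules("http://localhost:10902"), datetime.now(timezone.utc), timedelta(minutes=5))` would list the rules that have fallen behind. A threshold of a few evaluation intervals avoids flagging rules that are merely slow.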

image

@GiedriusS
Member

Hello, could you please dump the goroutine stacks when this happens and upload them?

@bwplotka
Member

bwplotka commented Dec 6, 2021

Thanks for reporting! A goroutine pprof profile, available at /debug/pprof/goroutine and taken at the moment things are stuck, would be super helpful!
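For anyone collecting the requested dump, here is a minimal sketch of one way to do it. It assumes the ruler's HTTP port (10902 in the args above) is reachable, for example through a port-forward, and that the standard Go `net/http/pprof` handler is served there; in that handler the profile is named `goroutine` (singular), and `?debug=2` returns full stacks as plain text. The helper names and the 10 minute threshold are hypothetical.

```python
import re
import urllib.request
from collections import Counter

# Header line of one goroutine in the debug=2 text dump, for example:
#   goroutine 42 [chan receive, 2880 minutes]:
_HEADER = re.compile(r"^goroutine (\d+) \[([^\],]+)(?:, (\d+) minutes)?[^\]]*\]:$")


def goroutine_url(base_url):
    """Build the URL of the full-stack goroutine dump."""
    return base_url.rstrip("/") + "/debug/pprof/goroutine?debug=2"


def fetch_dump(base_url):
    """Download the dump as text (network call, not run on import)."""
    with urllib.request.urlopen(goroutine_url(base_url)) as resp:
        return resp.read().decode("utf-8", errors="replace")


def summarize(dump, min_minutes=10):
    """Count goroutines per state and list those blocked for min_minutes or more."""
    states = Counter()
    long_blocked = []
    for line in dump.splitlines():
        m = _HEADER.match(line.strip())
        if m is None:
            continue
        gid, state, minutes = m.groups()
        states[state] += 1
        if minutes is not None and int(minutes) >= min_minutes:
            long_blocked.append((int(gid), state, int(minutes)))
    return states, long_blocked
```

Saving the output of `fetch_dump("http://localhost:10902")` to a file gives something that can be attached to the issue. The `summarize` helper is only a convenience for spotting goroutines that have been blocked for a long time, which is where a stuck rule evaluation would be expected to show up.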

@jleloup
Contributor

jleloup commented Dec 7, 2021

Isn't this issue similar to #4772?

@jleloup
Contributor

jleloup commented Dec 7, 2021

I think the .pprof file I uploaded there is actually more relevant to this issue. The behaviour I saw in my Thanos Ruler is closer to what is described here: restarting the pods helped only for a few dozen minutes before they again failed to process records.

Link to the .pprof: #4772 (comment)

@ahurtaud
Contributor

ahurtaud commented Jan 5, 2022

I have the same issue on v0.24.0.
Here is the goroutine pprof taken after the ruler had been stuck for a long time:
ruler-goroutine.zip

@stale

stale bot commented Apr 17, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 17, 2022
@stale

stale bot commented May 1, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed May 1, 2022