query: component stuck/stalled + goroutines > 10k + hit max concurrent queries #5346
Wasn't able to pull the pprof output when the component was bricked, but here it is after we rebooted:
Could you share relevant pprofs with https://share.polarsignals.com/? That makes it very easy for us :)
Hey @wiardvanrij, maybe I'm dumb, but I can't seem to spit out a file that the Polar Signals tool likes. Mind providing a command or two to do so? I've tried
but no dice.
Is it the same with 0.24.0 or any newer version? Wouldn't be surprised if you are running into this.
The default output should be an x.pb.gz file if you omit the -o flag, so you can just run your first command and name the result foo.pb.gz.
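For reference, a couple of command sketches that should produce an upload-friendly goroutine profile (assuming the query component's HTTP port is the default 10902 and that its /debug/pprof endpoints are reachable; adjust host/port to your setup):

```sh
# Grab the goroutine profile straight from the query component's pprof endpoint.
# Go's pprof handlers emit gzip-compressed protobuf by default, so the file can
# be uploaded to share.polarsignals.com as-is.
curl -s http://thanos-query:10902/debug/pprof/goroutine > goroutine.pb.gz

# Alternatively, let go tool pprof fetch it and write the compressed proto itself.
go tool pprof -proto http://thanos-query:10902/debug/pprof/goroutine > goroutine.pb.gz
```

The same pattern works for heap or CPU profiles via /debug/pprof/heap and /debug/pprof/profile.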
@wiardvanrij gah! So close. Here it is:
@GiedriusS yeah, it's very possible; we're going to update our images to v0.24.0 this week and see if that resolves this.
@wiardvanrij we actually experienced this leak again just now; I was able to export a pprof here:
Seems like I was correct. I think you are running into #4795. Please try upgrading to 0.24.0 or 0.23.2.
Yo @GiedriusS @wiardvanrij, things seem quiet after bumping to v0.24.0. Til next time!
Thanos, Prometheus and Golang version used:
We're using the official Thanos Docker image from the public registry (thanosio/thanos).

Object Storage Provider:
S3
What happened:
We've been successfully running a Thanos cluster for around a year now: some store gateways fronting S3, Thanos sidecars on the Prometheus nodes, and a query-frontend + query component (just 1 replica at the moment).

However, over the weekend we ran into an issue where the query-frontend component (the ingress for all our Grafana, Ruler, and developer querying) started timing out on all queries.

Digging further, it appears that our query component itself went silent for around 72 hours (silent = logs stopped, but the pod was still healthy and passing health checks). The last registered error log was over a week old, which is even earlier than when we started to notice the timeouts.

Digging EVEN FURTHER, it appears that the query component got stalled; to test this, we were able to get queries running again by rebooting the query pod.

I also noticed that, during this "outage", the query component was hitting max concurrent queries, which defaults to 20 (a rough way to check this from the metrics endpoint is sketched after the list of related issues below).

FYI: our issue appears to have the same symptoms as the following past issues:
#705
#4766
#4925
#5079
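For what it's worth, a rough way to watch both the goroutine growth from the title and the concurrency-gate saturation from outside the component is to scrape the query component's metrics endpoint. A sketch, assuming the default HTTP port 10902; go_goroutines is the standard Go client metric, while the exact name of the concurrency-gate metric is an assumption and may differ between versions:

```sh
# Goroutine count as reported by the Go runtime; during the incident this
# climbed past 10k.
curl -s http://thanos-query:10902/metrics | grep '^go_goroutines'

# In-flight queries held by the concurrency gate; the ceiling comes from
# --query.max-concurrent, which defaults to 20.
curl -s http://thanos-query:10902/metrics | grep 'gate_queries_in_flight'
```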
What you expected to happen:
We'd expect the query component to either NOT brick, or to fail health checks if it does, so our k8s scheduler could replace it.
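For illustration, the kind of probe we have in mind is sketched below (hypothetical values; it assumes the query HTTP port is the default 10902 and the stock /-/healthy endpoint, which evidently kept returning 200 throughout the stall, so in practice a deeper check would be needed to actually catch this):

```yaml
# Hypothetical livenessProbe on the thanos-query container. /-/healthy alone
# did not catch the stall, so this would likely need to be paired with a check
# that exercises the query path (e.g. an exec probe issuing a cheap query).
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902
  periodSeconds: 30
  failureThreshold: 4
```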
How to reproduce it (as minimally and precisely as possible):

Not 100% sure; this appears to be somewhat symptomatic of a goroutine leak.
Full logs to relevant components:
Can't seem to find helpful logs, since the query component went silent after experiencing the goroutine leak. But here are our query flags, for posterity:

Anything else we need to know: