query: component stuck/stalled + goroutines > 10k + hit max concurrent queries #5346

Closed
parkedwards opened this issue May 9, 2022 · 10 comments


@parkedwards commented May 9, 2022

Thanos, Prometheus and Golang version used:

$ thanos --version

thanos, version 0.23.0 (branch: HEAD, revision: fe0d695e8df8619f2e6588e6259230a13535001a)
  build user:       circleci@80970a8015b2
  build date:       20210927-14:57:42
  go version:       go1.16.8
  platform:         linux/amd64

We're using the official Thanos Docker image from the public registry (thanosio/thanos).

Object Storage Provider:
S3

What happened:
We've been successfully running a Thanos cluster for around a year now - we've got some storegateways fronting s3, thanos sidecars on the prometheus nodes, and a query-frontend + query component (just 1x replica atm)

However, over the weekend, we ran into an issue where the query-frontend component (ingress to all our grafana, ruler, and developer querying) started timing out on all queries

Digging further, it appears that our query component itself went silent for around 72 hours (silent = logs stopped, but the pod was still healthy and passing health checks). The last registered error log was over a week old, predating the point at which we started noticing the timeouts.

Digging even further, it appears the query component had stalled; to confirm this, we restarted the query pod, which got queries running again.

I also noticed that, during this "outage":

  • the number of goroutines spiked to over 10k (a rough alert sketch for this follows below)
  • the query component hit the max concurrent queries limit, which defaults to 20
    [screenshot: goroutine count and concurrent-query metrics during the incident]
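For reference, a minimal Prometheus alerting-rule sketch that would catch this failure mode via the standard Go runtime metric every Thanos component exports. The job label, threshold, and durations are assumptions, not values taken from this issue:

    # Sketch only: alert when a Thanos Query instance appears to be leaking goroutines.
    # go_goroutines is the standard Go client metric; the job label and 10k threshold are assumed.
    groups:
      - name: thanos-query-goroutines
        rules:
          - alert: ThanosQueryGoroutineLeak
            expr: go_goroutines{job="thanos-query"} > 10000
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Thanos Query {{ $labels.instance }} has over 10k goroutines"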

FYI: our issue appears to have the same symptoms as the following past issues:
#705
#4766
#4925
#5079

What you expected to happen:
We'd expect the query component to either not brick, or to fail its health checks when it does, so that our k8s scheduler could replace it.
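For context, a typical liveness probe for the query container looks roughly like the sketch below (port and thresholds are assumptions). Thanos does expose /-/healthy and /-/ready on its HTTP address, but as described above /-/healthy kept passing during the stall, so a probe like this alone would not have restarted the pod:

    # Sketch of a liveness probe on the query HTTP port (values assumed).
    # Note: in this incident /-/healthy kept returning OK while the query path
    # was stuck, so this probe by itself would not have caught the stall.
    livenessProbe:
      httpGet:
        path: /-/healthy
        port: 10902
      periodSeconds: 30
      failureThreshold: 4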

How to reproduce it (as minimally and precisely as possible):
Not 100% sure; this appears to be symptomatic of a goroutine leak.

Full logs to relevant components:

Can't seem to find helpful logs, since the query component went silent after experiencing the goroutine leak. But here are our query flags, for posterity

    Args:
      query
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --query.replica-label=prometheus_replica
      --query.replica-label=replica
      --query.replica-label=prometheus
      --store=dnssrv+_grpc._tcp.prometheus-kube-prometheus-prometheus-thanos.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-storegateway-0.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-storegateway-1.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-storegateway-2.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-ruler.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-receive.monitoring.svc.cluster.local
      --store=<our prometheus DNS records>

Anything else we need to know:

@parkedwards (Author)

Wasn't able to pull the pprof output while the component was bricked, but here it is after we rebooted:

# go tool pprof -symbolize=remote "thanos-query.monitoring.svc.cluster.local:9090/debug/pprof/goroutine"
Fetching profile over HTTP from http://thanos-query.monitoring.svc.cluster.local:9090/debug/pprof/goroutine
Saved profile in /root/pprof/pprof.thanos.goroutine.001.pb.gz
File: thanos
Type: goroutine
Time: May 9, 2022 at 5:28pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)

(pprof) top 10
Showing nodes accounting for 75, 100% of 75 total
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
        72 96.00% 96.00%         72 96.00%  runtime.gopark
         1  1.33% 97.33%         26 34.67%  internal/poll.(*pollDesc).wait
         1  1.33% 98.67%          1  1.33%  runtime.notetsleepg
         1  1.33%   100%          1  1.33%  runtime/pprof.runtime_goroutineProfileWithLabels
         0     0%   100%          8 10.67%  bufio.(*Reader).Peek
         0     0%   100%         13 17.33%  bufio.(*Reader).Read
         0     0%   100%          2  2.67%  bufio.(*Reader).ReadLine
         0     0%   100%          2  2.67%  bufio.(*Reader).ReadSlice
         0     0%   100%         10 13.33%  bufio.(*Reader).fill
         0     0%   100%          1  1.33%  github.com/baidubce/bce-sdk-go/util/log.NewLogger.func1

@wiardvanrij (Member) commented May 9, 2022

Could you share the relevant pprofs via https://share.polarsignals.com/? That makes it very easy for us :)
Perhaps you can also try to fetch the profiles while you are experiencing the problem; that would most likely be very useful for finding the root cause as well (:

@parkedwards (Author)

Hey @wiardvanrij, maybe I'm dumb, but I can't seem to produce a file output that the Polar Signals tool likes. Would you mind providing a command or two to do so? I've tried

curl <internal query host>/debug/pprof/goroutine -o output.pb

# or
curl <internal query host>/debug/pprof/goroutine?debug=1 -o output.pb

# or
curl <internal query host>/debug/pprof/goroutine?debug=2 -o output.pb

# and
gzip ./output.pb

but no dice

@GiedriusS (Member)

Is it the same with 0.24.0 or any newer version? Wouldn't be surprised if you are running into this.

@wiardvanrij (Member)

> Hey @wiardvanrij, maybe I'm dumb, but I can't seem to produce a file output that the Polar Signals tool likes. Would you mind providing a command or two to do so? I've tried
>
> curl <internal query host>/debug/pprof/goroutine -o output.pb
>
> # or
> curl <internal query host>/debug/pprof/goroutine?debug=1 -o output.pb
>
> # or
> curl <internal query host>/debug/pprof/goroutine?debug=2 -o output.pb
>
> # and
> gzip ./output.pb
>
> but no dice

The default output is already a gzipped protobuf (a .pb.gz file), so you can just use your first command and name the output foo.pb.gz.
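In other words, something roughly like this should produce a file that share.polarsignals.com accepts (host placeholder kept from the commands above; without a ?debug= parameter the endpoint already serves a gzipped protobuf, so no extra gzip step is needed):

    # Fetch the goroutine profile in its default binary (gzipped protobuf) form;
    # no ?debug= parameter and no extra gzip step required.
    curl <internal query host>/debug/pprof/goroutine -o goroutine.pb.gz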

@parkedwards (Author)

@wiardvanrij gah! so close. here it is:
https://share.polarsignals.com/cdfe34b/

@parkedwards (Author)

@GiedriusS yeah, it's very possible; we're gonna update our images to v0.24.0 this week and see if that resolves this.

@parkedwards (Author)

@wiardvanrij we actually experienced this leak again just now; I was able to export a pprof here:

https://share.polarsignals.com/b33f04f/

@GiedriusS (Member) commented May 12, 2022

Seems like I was correct. I think you are running into #4795. Please try upgrading to 0.24.0 or 0.23.2.
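For anyone else hitting this, the fix amounts to bumping the container image to one of those releases, roughly like the sketch below (the v-prefixed tag naming is assumed from the thanosio/thanos registry):

    # Sketch: bump the query container image to a release containing the fix.
    image: thanosio/thanos:v0.24.0   # or v0.23.2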

@parkedwards (Author)

yo @GiedriusS @wiardvanrij -- things seem quiet after bumping to v0.24.0. thanks again for your help, and I can close this out

til next time!
