query: component stuck/stalled + goroutines > 10k + hit max concurrent queries #5346

Closed
parkedwards opened this issue May 9, 2022 · 10 comments


@parkedwards commented May 9, 2022

Thanos, Prometheus and Golang version used:

$ thanos --version

thanos, version 0.23.0 (branch: HEAD, revision: fe0d695e8df8619f2e6588e6259230a13535001a)
  build user:       circleci@80970a8015b2
  build date:       20210927-14:57:42
  go version:       go1.16.8
  platform:         linux/amd64

We're using the official Thanos Docker image from the public registry (thanosio/thanos).

Object Storage Provider:
S3

What happened:
We've been successfully running a Thanos cluster for around a year now - we've got some storegateways fronting s3, thanos sidecars on the prometheus nodes, and a query-frontend + query component (just 1x replica atm)

However, over the weekend, we ran into an issue where the query-frontend component (ingress to all our grafana, ruler, and developer querying) started timing out on all queries

Digging further, it appears that our query component itself went silent for around 72 hours (silent = logs stopped, but the pod was still healthy and passing health checks). The last registered error log was over a week old, predating the point at which we started noticing the timeouts.

Digging even further, it appears the query component had stalled; to confirm this, we restarted the query pod, which got queries running again.

I also noticed that, during this "outage":

  • the number of goroutines spiked to over 10k (a rough alert sketch for this follows below)
  • the query component hit the max concurrent queries limit, which defaults to 20
    [screenshot: goroutine count and concurrent-query metrics during the incident]
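For reference, a minimal Prometheus alerting-rule sketch that would catch this failure mode via the standard Go runtime metric every Thanos component exports. The job label, threshold, and durations are assumptions, not values taken from this issue:

    # Sketch only: alert when a Thanos Query instance appears to be leaking goroutines.
    # go_goroutines is the standard Go client metric; the job label and 10k threshold are assumed.
    groups:
      - name: thanos-query-goroutines
        rules:
          - alert: ThanosQueryGoroutineLeak
            expr: go_goroutines{job="thanos-query"} > 10000
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Thanos Query {{ $labels.instance }} has over 10k goroutines"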

FYI: our issue appears to have the same symptoms as the following past issues:
#705
#4766
#4925
#5079

What you expected to happen:
We'd expect the query component to either not brick, or to fail its health checks when it does, so that our k8s scheduler could replace it.
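For context, a typical liveness probe for the query container looks roughly like the sketch below (port and thresholds are assumptions). Thanos does expose /-/healthy and /-/ready on its HTTP address, but as described above /-/healthy kept passing during the stall, so a probe like this alone would not have restarted the pod:

    # Sketch of a liveness probe on the query HTTP port (values assumed).
    # Note: in this incident /-/healthy kept returning OK while the query path
    # was stuck, so this probe by itself would not have caught the stall.
    livenessProbe:
      httpGet:
        path: /-/healthy
        port: 10902
      periodSeconds: 30
      failureThreshold: 4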

How to reproduce it (as minimally and precisely as possible):
Not 100% sure; this appears to be symptomatic of a goroutine leak.

Full logs to relevant components:

Can't seem to find helpful logs, since the query component went silent after experiencing the goroutine leak. But here are our query flags, for posterity

    Args:
      query
      --log.level=info
      --log.format=logfmt
      --grpc-address=0.0.0.0:10901
      --http-address=0.0.0.0:10902
      --query.replica-label=prometheus_replica
      --query.replica-label=replica
      --query.replica-label=prometheus
      --store=dnssrv+_grpc._tcp.prometheus-kube-prometheus-prometheus-thanos.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-storegateway-0.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-storegateway-1.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-storegateway-2.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-ruler.monitoring.svc.cluster.local
      --store=dnssrv+_grpc._tcp.thanos-receive.monitoring.svc.cluster.local
      --store=<our prometheus DNS records>

Anything else we need to know:

@parkedwards (Author)

Wasn't able to pull the pprof output while the component was bricked, but here it is after we rebooted:

# go tool pprof -symbolize=remote "thanos-query.monitoring.svc.cluster.local:9090/debug/pprof/goroutine"
Fetching profile over HTTP from http://thanos-query.monitoring.svc.cluster.local:9090/debug/pprof/goroutine
Saved profile in /root/pprof/pprof.thanos.goroutine.001.pb.gz
File: thanos
Type: goroutine
Time: May 9, 2022 at 5:28pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)

(pprof) top 10
Showing nodes accounting for 75, 100% of 75 total
Showing top 10 nodes out of 71
      flat  flat%   sum%        cum   cum%
        72 96.00% 96.00%         72 96.00%  runtime.gopark
         1  1.33% 97.33%         26 34.67%  internal/poll.(*pollDesc).wait
         1  1.33% 98.67%          1  1.33%  runtime.notetsleepg
         1  1.33%   100%          1  1.33%  runtime/pprof.runtime_goroutineProfileWithLabels
         0     0%   100%          8 10.67%  bufio.(*Reader).Peek
         0     0%   100%         13 17.33%  bufio.(*Reader).Read
         0     0%   100%          2  2.67%  bufio.(*Reader).ReadLine
         0     0%   100%          2  2.67%  bufio.(*Reader).ReadSlice
         0     0%   100%         10 13.33%  bufio.(*Reader).fill
         0     0%   100%          1  1.33%  github.com/baidubce/bce-sdk-go/util/log.NewLogger.func1

@wiardvanrij (Member) commented May 9, 2022

Could you share the relevant pprofs via https://share.polarsignals.com/? That makes it very easy for us :)
Perhaps you can also try to fetch the profiles while you are experiencing the problem; that would most likely be very useful for finding the root cause as well (:

@parkedwards (Author)

Hey @wiardvanrij, maybe I'm dumb, but I can't seem to produce a file output that the Polar Signals tool likes. Would you mind providing a command or two to do so? I've tried

curl <internal query host>/debug/pprof/goroutine -o output.pb

# or
curl <internal query host>/debug/pprof/goroutine?debug=1 -o output.pb

# or
curl <internal query host>/debug/pprof/goroutine?debug=2 -o output.pb

# and
gzip ./output.pb

but no dice

@GiedriusS (Member)

Is it the same with 0.24.0 or any newer version? Wouldn't be surprised if you are running into this.

@wiardvanrij (Member)

> Hey @wiardvanrij, maybe I'm dumb, but I can't seem to produce a file output that the Polar Signals tool likes. Would you mind providing a command or two to do so? I've tried
>
> curl <internal query host>/debug/pprof/goroutine -o output.pb
>
> # or
> curl <internal query host>/debug/pprof/goroutine?debug=1 -o output.pb
>
> # or
> curl <internal query host>/debug/pprof/goroutine?debug=2 -o output.pb
>
> # and
> gzip ./output.pb
>
> but no dice

The default output is already a gzipped protobuf (a .pb.gz file), so you can just use your first command and name the output foo.pb.gz.
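In other words, something roughly like this should produce a file that share.polarsignals.com accepts (host placeholder kept from the commands above; without a ?debug= parameter the endpoint already serves a gzipped protobuf, so no extra gzip step is needed):

    # Fetch the goroutine profile in its default binary (gzipped protobuf) form;
    # no ?debug= parameter and no extra gzip step required.
    curl <internal query host>/debug/pprof/goroutine -o goroutine.pb.gz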

@parkedwards (Author)

@wiardvanrij gah! so close. here it is:
https://share.polarsignals.com/cdfe34b/

@parkedwards (Author)

@GiedriusS yeah, it's very possible; we're gonna update our images to v0.24.0 this week and see if that resolves this.

@parkedwards (Author)

@wiardvanrij we actually experienced this leak again just now; I was able to export a pprof here:

https://share.polarsignals.com/b33f04f/

@GiedriusS (Member) commented May 12, 2022

Seems like I was correct. I think you are running into #4795. Please try upgrading to 0.24.0 or 0.23.2.
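For anyone else hitting this, the fix amounts to bumping the container image to one of those releases, roughly like the sketch below (the v-prefixed tag naming is assumed from the thanosio/thanos registry):

    # Sketch: bump the query container image to a release containing the fix.
    image: thanosio/thanos:v0.24.0   # or v0.23.2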

@parkedwards (Author)

yo @GiedriusS @wiardvanrij -- things seem quiet after bumping to v0.24.0. thanks again for your help, and I can close this out

til next time!
