query: Query API is unresponsive but health/ready APIs report healthy/ready #4766
Comments
Could you please paste the goroutine dump from …?
Hopefully I've attached a couple of outputs from the goroutine endpoint while this is happening. I've not gathered output from that endpoint before, so I'm hoping it all worked fine.
Hello, same issue for me with a similar setup on 0.23.1.
This is a strange format for a goroutine dump. I was looking more for the stack traces of all the goroutines 😂
Apologies if I've done this wrong; I simply curled …
@AdmiralGT This was a comment for my PDFs, I think :).
@GiedriusS Have you been able to look into this? More of our clusters have started encountering the same problem, and each time it requires manual intervention because the ready and health APIs report everything as fine. It's causing us significant problems.
Mhm, it seems like: [linked stack trace] This is stuck, and then [second linked stack trace] seems like it never finishes its work. Investigating further.
I believe the issue is because of this: https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188 We are RLock()ing twice in methods such as …
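For context, here is a minimal, self-contained Go sketch of the failure mode from that Stack Overflow answer (none of this is Thanos code): once a writer is queued on a sync.RWMutex, new RLock() calls block so the writer is not starved, so a goroutine that re-enters RLock() while still holding the read lock deadlocks against the waiting writer. This is why the sync package documentation forbids recursive read locking.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // first read lock, held by the main goroutine

	go func() {
		// A writer arrives while the read lock is still held. Lock()
		// blocks, and from this point sync.RWMutex also blocks any
		// *new* readers so the writer is not starved.
		mu.Lock()
		defer mu.Unlock()
		fmt.Println("writer acquired the lock")
	}()

	time.Sleep(100 * time.Millisecond) // give the writer time to queue

	// Second RLock() on the same goroutine: it waits behind the queued
	// writer, while the writer waits for the first read lock to be
	// released. Neither can proceed; the Go runtime reports
	// "all goroutines are asleep - deadlock!".
	mu.RLock()
	fmt.Println("unreachable")
}
```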
Avoid RLock()ing twice as described here: thanos-io#4766 (comment) (due to https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188). Fix it by removing HasClients() and simply replacing it with `er.clients != nil`. Signed-off-by: Giedrius Statkevičius <[email protected]>
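A hedged sketch of what the fix boils down to (the struct and method names here are illustrative; only the removed HasClients() helper and the `er.clients != nil` check come from the commit message above): drop the helper that takes the read lock itself, and have callers that already hold the lock read the field directly.

```go
package main

import "sync"

// endpointRef is illustrative, not the real Thanos type.
type endpointRef struct {
	mu      sync.RWMutex
	clients []string // placeholder for the real client type
}

// Before the fix (removed): a helper that takes the read lock itself,
// which deadlocks when called from a method already holding it:
//
//	func (er *endpointRef) HasClients() bool {
//		er.mu.RLock()
//		defer er.mu.RUnlock()
//		return er.clients != nil
//	}

// After the fix: callers that already hold the lock check the field
// directly, so the mutex is acquired only once per call path.
func (er *endpointRef) update() {
	er.mu.RLock()
	defer er.mu.RUnlock()
	if er.clients != nil { // inlined check, no second RLock
		_ = len(er.clients)
	}
}

func main() {
	er := &endpointRef{clients: []string{"store-1"}}
	er.update()
}
```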
Same here, we ran into significant downtime due to this issue. I think it would be worth releasing a patch version with this fix.
Avoid RLock()ing twice as described here: #4766 (comment) (due to https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188). Fix it by removing HasClients() and simply replacing it with `er.clients != nil`. Signed-off-by: Giedrius Statkevičius <[email protected]> Signed-off-by: Aymeric <[email protected]> Co-authored-by: Giedrius Statkevičius <[email protected]>
Issue: https://issues.redhat.com/browse/OCPBUGS-2037 Problem: thanos-io/thanos#4766 Solution: update to 0.23.2, which contains the patch thanos-io/thanos#4795 Signed-off-by: JoaoBraveCoding <[email protected]>
Thanos, Prometheus and Golang version used: Thanos v0.23.0-rc.1
Object Storage Provider: Azure, accessed via a managed identity.
What happened: I have deployed two Thanos Query nodes, Prometheus, Thanos Query Frontend, Ruler, Store, and Grafana in multiple Azure Kubernetes Service clusters using the thanos, kube-prometheus-stack, and grafana Helm charts. The Query nodes are configured hierarchically: one query node collects metrics solely from its own cluster, and the other queries the per-cluster query node in every deployed cluster. Query Frontend sits in front of the top-level query node.
The query API of the top-level query node is completely unresponsive: API calls neither respond nor time out. I've tried the different API endpoints, e.g. query and labels, but they are all unresponsive. However, both the healthy and ready APIs report success, and I'm not seeing any significant logs from the query node other than some SRV records failing to resolve (which is expected, as those clusters are currently down).
The query API on the lower-level query node is fine and I can query metrics from the cluster; I'm just unable to query metrics covering multiple clusters.
What you expected to happen: the query API to respond to queries (or time them out) when the pod is reported healthy.
How to reproduce it (as minimally and precisely as possible): I don't have exact reproduction steps. I have deployed multiple clusters; in six, the top-level query API came up successfully, while in two it was unresponsive. Restarting the Query pod resolved the issue in one cluster, but since the healthy API reports success, this does not happen automatically.
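Until a patched release is rolled out, one possible mitigation (an assumption on my part, not something from this thread) is to drive liveness from the query path itself rather than from /-/healthy. The sketch below probes the standard Prometheus-compatible /api/v1/query endpoint with a trivial `vector(1)` query and a hard deadline; the service name and port are placeholders for the top-level querier's HTTP address.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// probeQuery issues a trivial instant query with a hard deadline.
// Unlike /-/healthy, this exercises the code path that hangs, so a
// wedged querier surfaces as a timeout instead of a 200 OK.
func probeQuery(base string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	u := base + "/api/v1/query?query=" + url.QueryEscape("vector(1)")
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // a hang shows up here as context.DeadlineExceeded
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Placeholder address for the top-level querier's HTTP port.
	if err := probeQuery("http://thanos-query:10902"); err != nil {
		fmt.Println("query API probe failed:", err)
	}
}
```

Wiring something like this into a Kubernetes liveness probe, or a watchdog that restarts the pod after repeated failures, would recover a wedged querier automatically instead of requiring manual intervention.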