query: Query API is unresponsive but health/ready APIs report healthy/ready #4766

Closed
AdmiralGT opened this issue Oct 11, 2021 · 12 comments · Fixed by #4795

Comments

@AdmiralGT
Contributor

Thanos, Prometheus and Golang version used: Thanos v0.23.0-rc.1

Object Storage Provider: Azure, accessed via a managed identity.

What happened: I have deployed two Thanos Query nodes, Prometheus, Thanos Query Frontend, Ruler, Store, and Grafana in multiple Azure Kubernetes Service clusters using the thanos, kube-prometheus-stack and grafana Helm charts. The Query nodes are configured hierarchically: one query node collects metrics solely from its own cluster, and the other queries the query node in every deployed cluster. Query Frontend sits in front of the top-level query node.

The query API of the top-level query node is completely unresponsive: API calls neither return nor time out. I've tried different API endpoints (e.g. query, labels), but they are all unresponsive. However, both the healthy and ready APIs report success, and I'm not seeing any significant logs in the query node other than some SRV records failing to resolve (which is expected, as those clusters are currently down).

The query API of the low-level query node is fine and I can query metrics from the cluster; I am only unable to query metrics covering multiple clusters.

What you expected to happen: The Query API should either respond to queries or time out whenever the pod is reported healthy.

How to reproduce it (as minimally and precisely as possible): I don't have exact reproduction steps. I have deployed multiple clusters: in 6, the top-level query API came up successfully, while in 2 it was unresponsive. Restarting the Query pod resolved the issue in one cluster, but because the healthy API reports success, the restart does not happen automatically.

@GiedriusS
Member

Could you please paste the goroutine dump from ${IP}:${HTTPPORT}/debug/pprof when this happens?

@AdmiralGT
Contributor Author

AdmiralGT commented Oct 12, 2021

I've attached a couple of outputs from the goroutine endpoint, captured while this is happening. I've not gathered output from that endpoint before, so I'm hoping it all worked.

goroutines.tar.gz

@ahurtaud
Contributor

ahurtaud commented Oct 15, 2021

Hello,

I have the same issue with a similar setup on 0.23.1.
It always happens on the same platform (the biggest query leaf).
profilegoroutine2.pdf
profile-goroutines.pdf

@GiedriusS
Member

This is such a weird format of a goroutine dump. I was more looking for the stacktraces of all of them 😂

@AdmiralGT
Contributor Author

Apologies if I've done this wrong; I simply curled /debug/pprof/goroutine and wrote the output to a file. I tried curling /debug/pprof but got a 301. If you can let me know what I should be doing, I can gather more diagnostics.

@ahurtaud
Contributor

@AdmiralGT I think that comment was about my PDFs :).
With your tar file, it is possible to run go tool pprof again and get what is needed. On my side I had to fall back, because the cluster in question is in production and we cannot leave it as is :/
So I am not sure I will be able to get a new goroutine dump.

@AdmiralGT
Contributor Author

@GiedriusS Have you been able to look into this? More of our clusters have started encountering the same problem. Each time it requires manual intervention, because the ready and health APIs report everything as fine, and it's causing us significant problems.

@GiedriusS
Member

GiedriusS commented Oct 19, 2021

Mhm, it seems like:

         1   runtime.gopark
             runtime.goparkunlock (inline)
             runtime.semacquire1
             sync.runtime_Semacquire
             sync.(*WaitGroup).Wait
             github.com/thanos-io/thanos/pkg/query.(*EndpointSet).getActiveEndpoints
             github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update
             main.runQuery.func3.1
             github.com/thanos-io/thanos/pkg/runutil.Repeat
             main.runQuery.func3
             github.com/oklog/run.(*Group).Run.func1

This goroutine is stuck, and as a result RLock() calls get stuck too. Another goroutine is stuck on dialing:

         3   runtime.gopark
             runtime.netpollblock
             internal/poll.runtime_pollWait
             internal/poll.(*pollDesc).wait
             internal/poll.(*pollDesc).waitWrite (inline)
             internal/poll.(*FD).WaitWrite (inline)
             net.(*netFD).connect
             net.(*netFD).dial
             net.socket
             net.internetSocket
             net.(*sysDialer).doDialTCP
             net.(*sysDialer).dialTCP
             net.(*sysDialer).dialSingle
             net.(*sysDialer).dialSerial
             net.(*Dialer).DialContext
             google.golang.org/grpc.DialContext.func2
             google.golang.org/grpc.newProxyDialer.func1
             google.golang.org/grpc/internal/transport.dial
             google.golang.org/grpc/internal/transport.newHTTP2Client
             google.golang.org/grpc/internal/transport.NewClientTransport (inline)
             google.golang.org/grpc.(*addrConn).createTransport
             google.golang.org/grpc.(*addrConn).tryAllAddrs
             google.golang.org/grpc.(*addrConn).resetTransport

Seems like it never finishes its work. Investigating further.

@GiedriusS
Member

I believe the issue is because of this: https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188

We are RLock()ing twice in methods such as HasStoreAPI() and others, because they call HasClients(), which takes another RLock(). If a Lock() call arrives between the two, the second RLock() queues behind the pending writer while the writer waits for the first read lock, which is then never released, so everything becomes permanently blocked. The fix is probably to remove RLocking from all HasXAPI() methods and simply check er.clients != nil there. @hitanshu-mehta how does this sound?

GiedriusS added a commit to GiedriusS/thanos that referenced this issue Oct 21, 2021
Avoid RLock()ing twice as described here:
thanos-io#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with `er.clients != nil`.

Signed-off-by: Giedrius Statkevičius <[email protected]>
GiedriusS added a commit that referenced this issue Oct 21, 2021
GiedriusS added a commit to GiedriusS/thanos that referenced this issue Oct 22, 2021
@lud97x

lud97x commented Oct 26, 2021

Hi,
Just for information, we are expecting the release containing this fix to resolve our stuck-query issue too.

@pentlander

Same here, we ran into significant downtime due to this issue. I think it would be worth releasing a patch version with this fix.

aymericDD pushed a commit to aymericDD/thanos that referenced this issue Dec 6, 2021
aymericDD pushed a commit to aymericDD/thanos that referenced this issue Dec 7, 2021
Avoid RLock()ing twice as described here:
thanos-io#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with `er.clients != nil`.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Aymeric <[email protected]>
bwplotka pushed a commit that referenced this issue Dec 7, 2021
Avoid RLock()ing twice as described here:
#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with `er.clients != nil`.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Aymeric <[email protected]>

Co-authored-by: Giedrius Statkevičius <[email protected]>
JoaoBraveCoding added a commit to JoaoBraveCoding/cluster-monitoring-operator that referenced this issue Nov 4, 2022
Issue: https://issues.redhat.com/browse/OCPBUGS-2037

Problem: thanos-io/thanos#4766

Solution: update to 0.23.2 that contains the patch thanos-io/thanos#4795

Signed-off-by: JoaoBraveCoding <[email protected]>