query: Query API is unresponsive but health/ready APIs report healthy/ready #4766
Comments
Could you please paste the goroutine dump from …?
Hopefully I've attached a couple of outputs from the goroutine endpoint while this is happening. I've not gathered output from that endpoint before, so I'm hoping it all worked fine.
Hello, same issue for me with a similar setup on 0.23.1.
This is a strange format for a goroutine dump. I was looking more for the stack traces of all the goroutines 😂
Apologies if I've done this wrong; I simply curled …
@AdmiralGT This was a comment for my PDFs, I think :).
@GiedriusS Have you been able to look into this? More of our clusters have started encountering the same problem, and each time it requires manual intervention because the ready and health APIs report everything as fine. It's causing us significant problems.
Mhm, it seems like: [linked stack trace] This is stuck, and then [second linked stack trace] seems like it never finishes its work. Investigating further.
I believe the issue is because of this: https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188 We are RLock()ing twice in methods such as …
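For context, here is a minimal, self-contained Go sketch of the failure mode from that Stack Overflow answer (none of this is Thanos code): once a writer is queued on a sync.RWMutex, new RLock() calls block so the writer is not starved, so a goroutine that re-enters RLock() while still holding the read lock deadlocks against the waiting writer. This is why the sync package documentation forbids recursive read locking.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // first read lock, held by the main goroutine

	go func() {
		// A writer arrives while the read lock is still held. Lock()
		// blocks, and from this point sync.RWMutex also blocks any
		// *new* readers so the writer is not starved.
		mu.Lock()
		defer mu.Unlock()
		fmt.Println("writer acquired the lock")
	}()

	time.Sleep(100 * time.Millisecond) // give the writer time to queue

	// Second RLock() on the same goroutine: it waits behind the queued
	// writer, while the writer waits for the first read lock to be
	// released. Neither can proceed; the Go runtime reports
	// "all goroutines are asleep - deadlock!".
	mu.RLock()
	fmt.Println("unreachable")
}
```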
Avoid RLock()ing twice as described here: thanos-io#4766 (comment) (due to https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188). Fix it by removing HasClients() and simply replacing it with `er.clients != nil`. Signed-off-by: Giedrius Statkevičius <[email protected]>
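A hedged sketch of what the fix boils down to (the struct and method names here are illustrative; only the removed HasClients() helper and the `er.clients != nil` check come from the commit message above): drop the helper that takes the read lock itself, and have callers that already hold the lock read the field directly.

```go
package main

import "sync"

// endpointRef is illustrative, not the real Thanos type.
type endpointRef struct {
	mu      sync.RWMutex
	clients []string // placeholder for the real client type
}

// Before the fix (removed): a helper that takes the read lock itself,
// which deadlocks when called from a method already holding it:
//
//	func (er *endpointRef) HasClients() bool {
//		er.mu.RLock()
//		defer er.mu.RUnlock()
//		return er.clients != nil
//	}

// After the fix: callers that already hold the lock check the field
// directly, so the mutex is acquired only once per call path.
func (er *endpointRef) update() {
	er.mu.RLock()
	defer er.mu.RUnlock()
	if er.clients != nil { // inlined check, no second RLock
		_ = len(er.clients)
	}
}

func main() {
	er := &endpointRef{clients: []string{"store-1"}}
	er.update()
}
```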
Same here, we ran into significant downtime due to this issue. I think it would be worth releasing a patch version with this fix.
Avoid RLock()ing twice as described here: #4766 (comment) (due to https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188). Fix it by removing HasClients() and simply replacing it with `er.clients != nil`. Signed-off-by: Giedrius Statkevičius <[email protected]> Signed-off-by: Aymeric <[email protected]> Co-authored-by: Giedrius Statkevičius <[email protected]>
Issue: https://issues.redhat.com/browse/OCPBUGS-2037 Problem: thanos-io/thanos#4766 Solution: update to 0.23.2, which contains the patch thanos-io/thanos#4795 Signed-off-by: JoaoBraveCoding <[email protected]>
Thanos, Prometheus and Golang version used: Thanos v0.23.0-rc.1
Object Storage Provider: Azure, accessed via a managed identity.
What happened: I have deployed two Thanos Query nodes, Prometheus, Thanos Query Frontend, Ruler, Store, and Grafana in multiple Azure Kubernetes Service clusters using the thanos, kube-prometheus-stack, and grafana Helm charts. The Query nodes are configured hierarchically: one query node collects metrics solely from its own cluster, and the other queries the per-cluster query node in every deployed cluster. Query Frontend sits in front of the top-level query node.
The query API of the top-level query node is completely unresponsive: API calls neither respond nor time out. I've tried the different API endpoints, e.g. query and labels, but they are all unresponsive. However, both the healthy and ready APIs report success, and I'm not seeing any significant logs from the query node other than some SRV records failing to resolve (which is expected, as those clusters are currently down).
The query API on the lower-level query node is fine and I can query metrics from the cluster; I'm just unable to query metrics covering multiple clusters.
What you expected to happen: the query API to respond to queries (or time them out) when the pod is reported healthy.
How to reproduce it (as minimally and precisely as possible): I don't have exact reproduction steps. I have deployed multiple clusters; in six, the top-level query API came up successfully, while in two it was unresponsive. Restarting the Query pod resolved the issue in one cluster, but since the healthy API reports success, this does not happen automatically.
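Until a patched release is rolled out, one possible mitigation (an assumption on my part, not something from this thread) is to drive liveness from the query path itself rather than from /-/healthy. The sketch below probes the standard Prometheus-compatible /api/v1/query endpoint with a trivial `vector(1)` query and a hard deadline; the service name and port are placeholders for the top-level querier's HTTP address.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// probeQuery issues a trivial instant query with a hard deadline.
// Unlike /-/healthy, this exercises the code path that hangs, so a
// wedged querier surfaces as a timeout instead of a 200 OK.
func probeQuery(base string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	u := base + "/api/v1/query?query=" + url.QueryEscape("vector(1)")
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // a hang shows up here as context.DeadlineExceeded
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Placeholder address for the top-level querier's HTTP port.
	if err := probeQuery("http://thanos-query:10902"); err != nil {
		fmt.Println("query API probe failed:", err)
	}
}
```

Wiring something like this into a Kubernetes liveness probe, or a watchdog that restarts the pod after repeated failures, would recover a wedged querier automatically instead of requiring manual intervention.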