Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query: error DeadlineExceeded missing error details #4923

Closed
Jakob3xD opened this issue Dec 3, 2021 · 20 comments · Fixed by #6428
Closed

query: error DeadlineExceeded missing error details #4923

Jakob3xD opened this issue Dec 3, 2021 · 20 comments · Fixed by #6428

Comments

@Jakob3xD
Copy link
Contributor

Jakob3xD commented Dec 3, 2021

Thanos, Prometheus and Golang version used:
The Docker images:
Thanos: 0.23.0 and newer
go: go1.16.8

What happened:
When the query fails to connect to a store/sidecar it does not display certificates issues like before. This makes debugging difficulty.

level=warn
ts=2021-12-03T16:11:21.556102388Z
caller=endpointset.go:500
component=endpointset
msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901

What you expected to happen:
The error should display more information if the certificate is invalid like in Thanos v0.22.0

level=warn ts=2021-12-03T16:30:14.55260649Z
caller=endpointset.go:512
component=endpointset
msg="update of node failed" err="getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901 after err: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local\": rpc error: code = DeadlineExceeded desc = context deadline exceeded"
address=prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901

How to reproduce it (as minimally and precisely as possible):

  • Create an invalid Cert for the Store/Sidecar
  • Run Thanos 0.23.0 or newer with TLS enabled at the Store/Sidecar
@hanjm
Copy link
Member

hanjm commented Dec 4, 2021

The error message says the cert is not valid, please check cert. the diff is .scrape-blockstorage vs cpe-scrape-blockstorage.
transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local

@Jakob3xD
Copy link
Contributor Author

Jakob3xD commented Dec 4, 2021

Yes, I know what the error message is complaining about but only because it showed the Cert error, which is only displayed in Thanos Versions v0.22.0 and bellow. Thats is my issue. In Thanos v0.23.0 does NOT display the cert error.

@RoryDoherty
Copy link

Ran into a similar issue and had to temporarily install v0.22.0 to see the full error message
Thanos v0.24.0 still does not display the full error message

@nicolastakashi
Copy link
Contributor

I've faced the same issue today, I've missed adding the store port and I could only see the errors after rollback to version 0.22

@stale
Copy link

stale bot commented Apr 16, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 16, 2022
@stale
Copy link

stale bot commented Apr 30, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Apr 30, 2022
@ahrex
Copy link

ahrex commented May 9, 2022

Still reproducible and needed.

@wiardvanrij
Copy link
Member

Is this also reproducible with the latest version?

@ahrex
Copy link

ahrex commented May 9, 2022

I have yet to ask our internal Thanos maintainers about whether the issue persists on the latest version or not.

It looks like we've held our deploy back at v0.22.0 so we can have the additional context in error messages.

I did some digging and found the following:

  • thanos commit 2bf7e1ca06e60cd1dc722c97640b639b9d8ce704 between tags v0.22.0 and v0.23.0 upgraded a few dependencies, including grpc:
ahlex@galette:~/repos/thanos% git diff 2bf7e1ca06e60cd1dc722c97640b639b9d8ce704{~,} -- go.mod | rg -C 2 -w grpc
-       google.golang.org/api v0.50.0
-       google.golang.org/genproto v0.0.0-20210713002101-d411969a0d9a
-       google.golang.org/grpc v1.39.0
+       google.golang.org/api v0.56.0
+       google.golang.org/genproto v0.0.0-20210903162649-d08c68adba83
+       google.golang.org/grpc v1.40.0
        gopkg.in/alecthomas/kingpin.v2 v2.2.6
        gopkg.in/fsnotify.v1 v1.4.7
--
+       github.com/prometheus/prometheus => github.com/prometheus/prometheus v1.8.2-0.20210914090109-37468d88dce8
        github.com/sercand/kuberesolver => github.com/sercand/kuberesolver v2.4.0+incompatible
        google.golang.org/grpc => google.golang.org/grpc v1.29.1
  • inspecting differences between grpc versions v1.39.0 and v1.40.0 show a simplification in the errors returned:
ahlex@galette:~/repos/grpc-go% git diff v1.39.0..v1.40.0 -- internal/status/status.go | rg -C 5 -F 'Error()'
 type Error struct {
-       e *spb.Status
+       s *Status
 }
 
 func (e *Error) Error() string {
-       return fmt.Sprintf("rpc error: code = %s desc = %s", codes.Code(e.e.GetCode()), e.e.GetMessage())
+       return e.s.String()
 }
 
 // GRPCStatus returns the Status represented by se.
  • with what I think is the oversimplification of going from e.e.GetMessage() to e.s.String(), introduced in ba41bbac225e6e1a9b822fe636c40c3b7d977894 of grpc:
ahlex@galette:~/repos/grpc-go% git diff ba41bbac225e6e1a9b822fe636c40c3b7d977894{~,} -- internal/status/status.go | rg -C 5 -F 'Error()'
 type Error struct {
-       e *spb.Status
+       s *Status
 }
 
 func (e *Error) Error() string {
-       return fmt.Sprintf("rpc error: code = %s desc = %s", codes.Code(e.e.GetCode()), e.e.GetMessage())
+       return e.s.String()
 }
 
 // GRPCStatus returns the Status represented by se.

To help answer:

Is this also reproducible with the latest version?

  • getting the version of grpc in the latest tagged thanos version v0.26.0:
ahlex@galette:~/repos/thanos% git show v0.26.0:go.mod | rg -F google.golang.org/grpc
        google.golang.org/grpc v1.43.0
        google.golang.org/grpc/examples v0.0.0-20211119005141-f45e61797429
        google.golang.org/grpc => google.golang.org/grpc v1.40.1
  • checking for the simplified Error() response in grpc version v1.43.0:
ahlex@galette:~/repos/grpc-go% git show v1.43.0:internal/status/status.go | rg -C 5 -F 'Error()'
// and a nil *Error should never be returned by this package.
type Error struct {
        s *Status
}

func (e *Error) Error() string {
        return e.s.String()
}

// GRPCStatus returns the Status represented by se.
func (e *Error) GRPCStatus() *Status {

Yes, I believe the latest Thanos can repro the issue, but I would like someone to also validate my digging.


As for a solution forward, if y'all think my research is on the right track, I'm more than happy to open a PR into grpc-go, if no one else beats me to it.

Thanks!

@wiardvanrij
Copy link
Member

wiardvanrij commented May 9, 2022

I think that makes sense. The only thing I'm wondering if we shouldn't call something else to get the full info, but at a first simple look, I could not find such method. Perhaps we should make an issue / PR there indeed. I.e. best time/effort would perhaps be to create an issue there. Just to double check we are on the right path.

Anyhow, good spot/debugging ;) thank you

@ahrex
Copy link

ahrex commented May 10, 2022

Hello! So I spent some time trying to PR a fix, to find out that I was a bit off the mark.

While I still think the problem is between grpc-go versions v1.39.0 and v1.40.0, upon closer inspection the changes to internal/status/status.go look fine.

I did some more digging and found out I'm not proficient enough in all of the changes to make a definitive call, so I filed grpc/grpc-go#5342 to have the experts look at it. I did find some solid anchors to what I think needs to be bubbled up, and I've dropped those hints onto the upstream bug report.

Sorry I wasn't more helpful, and hopefully we get a fix soon!

@wiardvanrij
Copy link
Member

All good, you helped a lot :)

@mbrancato
Copy link

Just wanted to add this here, I ran into this same issue during an upgrade:

level=warn ts=2022-06-09T03:31:27.679924722Z caller=endpointset.go:517 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from metrics.mycluster.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=metrics.mycluster.com:443

I'm not sure if others are experiencing this using the bitnami helm chart, but that is how I installed Thanos. The issue was that the querier parameter --grpc-client-server-name was missing after upgrading from the 9.x to 10.x version of the chart. At some point they changed the values structure from query.grpc.client.servername to query.grpc.client.serverName (capitalized N).

@stale
Copy link

stale bot commented Aug 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 13, 2022
@erikgb
Copy link

erikgb commented Sep 14, 2022

Still relevant! Problem persists in version 0.28.0.

@stale stale bot removed the stale label Sep 14, 2022
@stale
Copy link

stale bot commented Nov 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Nov 13, 2022
@ddreier
Copy link

ddreier commented Nov 23, 2022

Still relevant, and rather annoying when trying to decipher why two Thanos components can't talk.

@stale stale bot removed the stale label Nov 23, 2022
@AntonioDiTuri
Copy link

AntonioDiTuri commented May 25, 2023

Is this fixed? I have the same error message using Thanos 0.26.0 but only for one sidecar, the other work. How to get more information why communication cannot be established by different component?

@simonpasquier
Copy link
Contributor

We've got bitten by this too. In our case, there was an issue with the TLS version not being supported end-to-end and the fact that the endpointset logs didn't include the TLS error made it hard to troubleshoot.

To reproduce, you can take a working Thanos setup with Query and sidecar and set the --grpc-client-server-name argument to a wrong value for the Query.

@simonpasquier
Copy link
Contributor

In the mean time, setting the following environment variables might help troubleshooting: GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.