Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential regression in DeadlineExceeded error description #5342

Closed
ahrex opened this issue May 10, 2022 · 5 comments
Closed

Potential regression in DeadlineExceeded error description #5342

ahrex opened this issue May 10, 2022 · 5 comments

Comments

@ahrex
Copy link

ahrex commented May 10, 2022

Hello, this is a follow-up of thanos-io/thanos#4923.

I believe there is a regression in error reporting in picker_wrapper.go, around:

errStr = ctx.Err().Error()

Introduced between grpc-go versions v1.39.0 and v1.40.0.

In good version v1.39.0, a TLS error during connection attempt shows more verbose information bubbled up to the user:

getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local\": rpc error: code = DeadlineExceeded desc = context deadline exceeded

While in borked version v1.40.0, only a generic error message is given:

getting metadata: fallback fetching info from prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The middle bits missing in v1.40.0 are:

latest balancer error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for prometheus-sidecar-blockstorage-1-0.prometheus-operated.scrape-blockstorage.svc.cluster.local, not prometheus-sidecar-blockstorage-1-0.prometheus-operated.cpe-scrape-blockstorage.svc.cluster.local\": rpc error: code = DeadlineExceeded desc =

Which in this case, identifies a TLS hostname mismatch.

If this context were to disappear, as it did in v1.40.0, debugging becomes a lot more difficult to root cause DeadlineExceeded.


From digging through the code, v1.39.0's error message all have latest balancer error, meaning they take this code path:

grpc-go/picker_wrapper.go

Lines 106 to 107 in 462d867

if lastPickErr != nil {
errStr = "latest balancer error: " + lastPickErr.Error()

I don't know if changing the code path to hit line 109 is related to making the error message generic, but something happened such that the error message we get no longer contains the detailed information found in:

return nil, connectionErrorf(isTemporary(err), err, "transport: authentication handshake failed: %v", err)

Could we bubble the error message in http2_client.go#L264 all the way back up to picker_wrapper.go#L109?

Thanks!

@dfawley
Copy link
Member

dfawley commented May 10, 2022

Have you tried with the latest (1.46) version, and is the behavior the same as with 1.40?

@easwars
Copy link
Contributor

easwars commented May 10, 2022

Do you have an easy way to reproduce this?

I tried modifying the TLS example here to specify a wrong server name and I see the following error:

2022/05/10 08:51:16 client.UnaryEcho(_) = _, rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for *.test.example.com, not localhost": 

And this is on master.

@github-actions
Copy link

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label May 16, 2022
@ahrex
Copy link
Author

ahrex commented May 16, 2022

Do you have an easy way to reproduce this?

I don't have a min repro, sorry!

However, given the following:

thanos version gRPC version descriptive error messages
0.22.0 v1.39.0
0.23.0 v1.40.0

I've filed an internal task to compile Thanos 0.22.0 with gRPC versions v1.40.0 and v1.42.0, with hopes of hitting the regression on v1.40.0, and maybe resolving the regression in v1.42.0.

@github-actions github-actions bot removed the stale label May 16, 2022
@github-actions
Copy link

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label May 22, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants