Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csi: use a blocking initial connection with timeout #7965

Merged
merged 1 commit into from
May 15, 2020

Conversation

tgross
Copy link
Member

@tgross tgross commented May 14, 2020

For #7424 (1 of probably many) and specifically the case described in #7931 (comment)

The plugin supervisor lazily connects to plugins, but this means we only get "Unavailable" back from the gRPC call in cases where the plugin can never be reached (for example, if the Nomad client has the wrong permissions for the socket).

This changeset improves the operator experience by switching to a blocking DialWithContext. It eagerly connects so that we can validate the connection is real and get a "failed to open" error in case where Nomad can't establish the initial connection.


The new error messages in a scenario where Nomad has the wrong permissions will look like the following (repeated each fingerprint attempt):

2020-05-14T19:38:35.203Z [DEBUG] client.alloc_runner.task_runner.task_hook: CSI Plugin not ready: alloc_id=30851259-1fde-8821-9f6e-3e5806f700a1 task=plugin error="failed to stat socket: stat /tmp/NomadClient616705218/csi/monolith/hostpath-plugin0/csi.sock: no such file or directory

2020-05-14T19:38:41.223Z [DEBUG] client.alloc_runner.task_runner.task_hook: CSI Plugin not ready: alloc_id=30851259-1fde-8821-9f6e-3e5806f700a1 task=plugin error="failed to create csi client: failed to open grpc connection to addr: /tmp/NomadClient616705218/csi/monolith/hostpath-plugin0/csi.sock, err: context deadline exceeded

@tgross tgross added this to the 0.11.3 milestone May 14, 2020
The plugin supervisor lazily connects to plugins, but this means we
only get "Unavailable" back from the gRPC call in cases where the
plugin can never be reached (for example, if the Nomad client has the
wrong permissions for the socket).

This changeset improves the operator experience by switching to a
blocking `DialWithContext`. It eagerly connects so that we can
validate the connection is real and get a "failed to open" error in
case where Nomad can't establish the initial connection.
@tgross tgross force-pushed the csi_better_error_on_bad_connect branch from 95daea0 to eb2f77a Compare May 14, 2020 19:59
if err != nil {
return false, fmt.Errorf("failed to create csi client: %v", err)
}
defer client.Close()
Copy link
Member Author

@tgross tgross May 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a panic that can't happen in 0.11.2 or earlier because we never actually connect in this block, so the error condition never happens.

@tgross tgross requested review from langmartin and nickethier May 14, 2020 20:08
Copy link
Contributor

@langmartin langmartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, I thought about it a little and talked myself out of wanting a way to change the timeout. It might be interesting to have a couple of coarse grained tuning parameters eventually where we'd have or derive an out-of-dc-connection-timeout. Being open source is a type of allowing tuning of something most folks won't need to tune, though 😀

@tgross
Copy link
Member Author

tgross commented May 14, 2020

nice, I thought about it a little and talked myself out of wanting a way to change the timeout.

Yeah, the timeout only impacts the initial connection and we retry it on the next fingerprint anyways, so I figured it'd be safe to have it be short.

@tgross tgross merged commit 09a27f9 into master May 15, 2020
@tgross tgross deleted the csi_better_error_on_bad_connect branch May 15, 2020 12:17
@tgross
Copy link
Member Author

tgross commented May 27, 2020

Cherry-picked to 0.11.3 branch as d31bb8b

@github-actions
Copy link

github-actions bot commented Jan 4, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants