Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: make gRPC client creation more robust #12057

Merged
merged 6 commits into from
Feb 15, 2022
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Feb 11, 2022

Related to #11784. It doesn't fix it but makes debugging it more legible.
I've broken this up into mostly bite-sized commits for review.


Nomad communicates with CSI plugin tasks via gRPC. The plugin
supervisor hook uses this to ping the plugin for health checks which
it emits as task events. After the first successful health check the
plugin supervisor registers the plugin in the client's dynamic plugin
registry, which in turn creates a CSI plugin manager instance that has
its own gRPC client for fingerprinting the plugin and sending mount
requests.

If the plugin manager instance fails to connect to the plugin on its
first attempt, it exits. The plugin supervisor hook is unaware that
connection failed so long as its own pings continue to work. A
transient failure during plugin startup may mislead the plugin
supervisor hook into thinking the plugin is up (so there's no need to
restart the allocation) but no fingerprinter is started.

  • Refactors the gRPC client to connect on first use. This provides the
    plugin manager instance the ability to retry the gRPC client
    connection until success.
  • Add a 30s timeout to the plugin supervisor so that we don't poll
    forever waiting for a plugin that will never come back up.

Minor improvements:

  • The plugin supervisor hook creates a new gRPC client for every probe
    and then throws it away. Instead, reuse the client as we do for the
    plugin manager.
  • The gRPC client constructor has a 1 second timeout. Clarify that this
    timeout applies to the connection and not the rest of the client
    lifetime.

The plugin supervisor registers the plugin in the `Poststart` hook, so
the task itself should be running. If the plugin can't communicate
with us after 30s, exit and mark the task as unhealthy so that it can
be restarted.
Copy link
Contributor

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; just the one question about kill

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line theme/storage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants