CSI plugin fails to be marked healthy after reboot #11784
Comments
Hi @ygersie, thanks for filing this issue. We'll take a look at this as soon as we can. Derek
Leaving a note that this issue appears to be related to, but may have subtly different code paths from, #9810.
Hey @ygersie, just wanted to give an update. I don't have a firm root cause on this issue yet, but I spent some time trying to reproduce it and discovered that the way we instantiate our gRPC client to the plugins is not very robust, and that can definitely lead to the symptoms you've presented here. We have two gRPC clients that talk to the plugins. One is in the …

Once the … That sequence of events lines up with the evidence in this issue and in my reproductions, but I'm not quite comfortable calling this issue solved until I've done a bit more testing. In the meantime, I've opened #12057 with two changes to solve the problem I've described here.

Those improvements will ship in Nomad 1.3.0 and get backported to 1.2.x and 1.1.x.
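When chasing this kind of symptom, the server-side view of the plugin is worth comparing against the allocations. A minimal sketch of those checks, assuming the democratic-csi plugin ID that appears in the logs later in this thread (substitute your own plugin ID):

```bash
# Plugin health as the servers see it (Controllers/Nodes Healthy vs. Expected):
nomad plugin status org.democratic-csi.nfs

# Volumes backed by the plugin; these stay unschedulable while the plugin is unhealthy:
nomad volume status
```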
Naturally, almost as soon as I posted, I ended up with a better repro...
In this scenario I killed the Docker container that was running so that Nomad would recreate it. The unix socket is there but it's refusing connections from any client:
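As a sketch of how to confirm that, any client that can open a unix socket will do. The host-side path shown here is an assumption (the socket lives under the Nomad client's data_dir), so adjust it to your setup:

```bash
# socat prints "Connection refused" and exits non-zero if nothing is accepting
# connections on the socket:
sudo socat - UNIX-CONNECT:/var/lib/nomad/client/csi/plugins/<alloc-id>/csi.sock </dev/null
```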
The Docker container has the mount as expected:
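A hedged example of that check (the container name filter and the use of jq are assumptions; any way of locating the plugin container works):

```bash
# Find the plugin container, then confirm the CSI socket directory is bind-mounted:
docker ps --filter name=plugin
docker inspect -f '{{json .Mounts}}' <container-id> | jq .
```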
If I exec into the task, I see the socket:
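For reference, a sketch of two ways to look inside the task (the task name "plugin" and the alloc/container IDs are placeholders):

```bash
# Through Nomad:
nomad alloc exec -task plugin <alloc-id> ls -la /csi/

# Or directly through Docker:
docker exec -it <container-id> ls -la /csi/
```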
Let's look at the two plugin processes we have. One is for a controller (which is broken) and the other is for a node (which is working fine):
Ok, let's look at the (working) node plugin's socket:
Just as we expect, and we can see the Nomad process has got the other end as well. How about our controller plugin's socket?
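A rough sketch of how to answer that question from the host, assuming standard socket tooling is installed:

```bash
# Listening unix sockets with their owning processes:
sudo ss -xlp | grep csi

# Or every open unix socket file, to see which processes hold each end:
sudo lsof -U | grep csi.sock
```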
No one is listening! In this case something went wrong with our plugin when it was created, and it's not listening on the socket at all. But the allocation logs after the restart seem to think it should be:

```
{"level":"info","message":"initializing csi driver: nfs-client","service":"democratic-csi"}
{"level":"debug","message":"setting default identity service caps","service":"democratic-csi"}
{"level":"debug","message":"setting default identity volume_expansion caps","service":"democratic-csi"}
{"level":"debug","message":"setting default controller caps","service":"democratic-csi"}
{"level":"debug","message":"setting default node caps","service":"democratic-csi"}
{"level":"info","message":"starting csi server - name: org.democratic-csi.nfs, version: 1.4.3, driver: nfs-client, mode: controller, csi version: 1.5.0, address: , socket: unix:///csi/csi.sock","service":"democratic-csi"} Unfortunately I don't have the right toolset on the box I'm testing this on to properly dump the heap of the plugin, but clearly this isn't a good state. It also suggests that #12057 as it currently stands isn't a good idea. Regardless of the reason, we should almost certainly be killing the plugin task if we get "stuck" like this. I'm at the end of my week here but I'll pick it up again on Monday and revise #12057 as necessary. |
I would like to confirm that this also happens with ceph-csi. Start a node system job and a controller service job, then restart any nodes. When restarted, preexisting ceph-csi containers run again on boot, marked as healthy in …
Ok, with the patch in #12057 I tested again by running plugins on my cluster and restarting/draining nodes. The node allocations come back up:
As do the controller allocations:
Now the plugin status doesn't see the new allocations but does have the correct count! I'll note that all the missing allocations here are the replacement allocations.
If we look at the allocs, they're healthy as we'd expect:
The only logs we see for the new plugin allocations are the logs for the initial ping that fails. We should probably add a log line that says when they finally get registered as well, to make it clear this is a temporary condition.
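For anyone following along, the comparison being made above boils down to something like the following (the job and plugin names are assumptions based on this thread):

```bash
# The replacement allocations report healthy:
nomad job status plugin-nfs-controller
nomad alloc status <new-alloc-id>

# ...while `nomad plugin status` does not show the replacement allocations yet,
# even though the Controllers/Nodes Healthy vs. Expected counts are correct:
nomad plugin status org.democratic-csi.nfs
```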
Then I created an NFS volume that supports multi-writer and ran the following job that deploys allocations on all 3 nodes:

```hcl
job "httpd" {
datacenters = ["dc1"]
group "web" {
count = 3
volume "csi_data" {
type = "csi"
read_only = false
source = "csi-volume-nfs0"
access_mode = "multi-node-multi-writer"
attachment_mode = "file-system"
}
constraint {
operator = "distinct_hosts"
value = "true"
}
network {
mode = "bridge"
port "www" {
to = 8001
}
}
task "http" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "8001", "-h", "/srv"]
ports = ["www"]
}
volume_mount {
volume = "csi_data"
destination = "/srv"
read_only = false
}
resources {
cpu = 128
memory = 128
}
}
}
}
```

And it works with all allocations running!
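A sketch of running and verifying it, assuming the jobspec is saved locally and the volume name matches the one registered above:

```bash
nomad job run httpd.nomad.hcl
nomad job status httpd

# The volume status shows which allocations the multi-writer volume is attached to:
nomad volume status csi-volume-nfs0
```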
This suggests that the client has a working state for the plugins, but the server has an incorrect state with respect to which allocations serve the plugin (but not which nodes!). Unfortunately the server logs don't have anything useful for me here, but this is a good starting point for driving down to the root cause of the rest of this bug.
With the experiment above in mind, I've written up some tests in #12078 that demonstrate the behaviors we're seeing here, and from there I have a working hypothesis: the dynamic plugin registry assumes that plugins are singletons, which matches the behavior of other Nomad plugins. But of course dynamic plugins like CSI are implemented by allocations, which can run concurrently. This concurrency includes restoring allocations after restarts on top of the existing dynamic plugin registry! We need to handle the possibility of multiple allocations for a given plugin type + ID, as well as behaviors around interleaved allocation starts and stops, while maintaining the illusion of a singleton from the perspective of callers who just want to mount/unmount volumes. I've left a TODO list in #12078 and have started on the implementation.
Should be fixed by #12078, which will ship in Nomad 1.3.0.
Nomad version
v1.2.3
Issue
I looked around in the issues and saw multiple CSI-related issues, but I'm not sure if this was already reported.
When rebooting a node that runs CSI plugins, the plugins are never marked as healthy, making it impossible for the volumes associated with those plugins to be scheduled. The only fix is to wait until the plugins are started and registered as healthy on the node and then restart the Nomad agent. Another interesting thing is that when you restart the agent, the allocation events of the plugin show the task exited:
the logs also show Nomad created a container:
but the containers were never actually restarted at that time (23 minutes ago was 16:06, the first time the plugins started):
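One hedged way to double-check that from the Docker side (the formats shown are standard docker CLI templates):

```bash
# Container uptime at a glance:
docker ps --format 'table {{.Names}}\t{{.Status}}'

# Exact start time per container; if the plugins had really been recreated,
# StartedAt would be recent rather than the original 16:06 start:
docker inspect -f '{{.Name}} {{.State.StartedAt}}' $(docker ps -q)
```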
So here's what is shown in the various states after I reboot a node:
The state never seems to change and this is the result:
Reproduction steps
Reboot a node that runs CSI plugins and check the various states.
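A rough sequence, assuming a cluster with CSI plugin jobs already deployed (commands are a sketch, not an exact transcript):

```bash
# On a client node that runs the plugins:
sudo reboot

# After the node is back, the plugin allocations restart but never return to healthy:
watch -n 5 nomad plugin status
nomad volume status        # volumes remain unschedulable

# Workaround observed in this report: restart the Nomad agent once the plugin
# tasks are running again:
sudo systemctl restart nomad
```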