Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI plugin fails to be marked healthy after reboot #11784

Closed
ygersie opened this issue Jan 6, 2022 · 9 comments · Fixed by #12078
Closed

CSI plugin fails to be marked healthy after reboot #11784

ygersie opened this issue Jan 6, 2022 · 9 comments · Fixed by #12078
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/storage type/bug
Milestone

Comments

@ygersie
Copy link
Contributor

ygersie commented Jan 6, 2022

Nomad version

v1.2.3

Issue

I looked around in the issues and saw multiple CSI related issues but not sure if this was already reported.
When rebooting a node that runs csi plugins the plugins are never marked as healthy making it impossible for the volumes associated with those plugins to be scheduled. The only fix is to wait until the plugins are started and are registered as healthy on the node and then restart the nomad agent. Another interesting thing is that when you restart the agent the allocation events of the plugin shows the task exited:

Time                       Type                   Description
2022-01-06T16:29:27+01:00  Started                Task started by client
2022-01-06T16:29:08+01:00  Plugin became healthy  plugin: aws-ebs-us-west-2c
2022-01-06T16:29:08+01:00  Restarting             Task restarting in 16.433792327s
2022-01-06T16:29:08+01:00  Terminated             Exit Code: 0, Exit Message: "unexpected EOF"
2022-01-06T16:06:53+01:00  Plugin became healthy  plugin: aws-ebs-us-west-2c
2022-01-06T16:06:47+01:00  Started                Task started by client
2022-01-06T16:06:45+01:00  Task Setup             Building Task Directory
2022-01-06T16:06:45+01:00  Received               Task received by client

the logs also show Nomad created a container:

2022-01-06T15:29:26.362Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=33acf7f1862452ab921d37c715795fea75c5e6368a47598bc1b1a76db8cd001e
2022-01-06T15:29:27.362Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=3ca8e32e99e57882affdf151e571f782dff75016f4768cce0112c5ec8d4e7a8c
2022-01-06T15:29:27.368Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=b02849a790ab2d5cb9050e2508d69934a4235a924af446dea20bb6c1991455de

but the containers were never actually restarted at that time (23 minutes ago was 16:06, the first time the plugins started):

root@mynode:~# docker ps
CONTAINER ID   IMAGE                                   COMMAND                  CREATED          STATUS          PORTS     NAMES
3ca8e32e99e5   amazon/aws-ebs-csi-driver:release-1.4   "sh -c 'umask 002 &&…"   23 minutes ago   Up 23 minutes             plugin-e60daf88-b4d7-d439-28cb-4648af8ec53a
33acf7f18624   amazon/aws-efs-csi-driver:v1.3.4        "sh -c 'umask 002 &&…"   23 minutes ago   Up 23 minutes             plugin-9243e3cf-6e16-1668-7f95-692aadf271ce
b02849a790ab   amazon/aws-ebs-csi-driver:release-1.4   "sh -c 'umask 002 &&…"   23 minutes ago   Up 23 minutes             plugin-e9640812-41bd-7337-186b-cc39060c6176

So here's what is shown in the various states after I reboot a node:

$ nomad plugin status -verbose
Container Storage Interface
ID                  Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs-us-west-2a  <none>           0/1                           0/1
$ nomad volume status
Container Storage Interface
ID                               Name                            Plugin ID           Schedulable  Access Mode
us-west-2a-myvolume[0]           myvolume                        aws-ebs-us-west-2a  false        <none>
$ nomad node status e51ef358
ID              = e51ef358-1c33-7468-c46c-fdab82f44ec8
Name            = agent-name
Class           = default
DC              = us-west-2a
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 4m24s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

..............

Allocations
ID        Node ID   Task Group             Version  Desired  Status    Created     Modified
d7709737  e51ef358  node-us-west-2a        2        run      running   3m22s ago   3m5s ago
e612cbf0  e51ef358  controller-us-west-2a  2        run      running   3m22s ago   3m9s ago
$ nomad alloc status d7709737
ID                  = d7709737-e01c-a117-3320-b77c5df0720c
Eval ID             = dce00c2f
Name                = csi-ebs-nodes.node-us-west-2a[0]
Node ID             = e51ef358
Node Name           = node-name
Job ID              = csi-ebs-nodes
Job Version         = 2
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 5m55s ago
Modified            = 5m38s ago

Task "plugin" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/500 MHz  7.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2022-01-06T15:38:59Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                   Description
2022-01-06T16:39:05+01:00  Plugin became healthy  plugin: aws-ebs-us-west-2a
2022-01-06T16:38:59+01:00  Started                Task started by client
2022-01-06T16:38:49+01:00  Task Setup             Building Task Directory
2022-01-06T16:38:49+01:00  Received               Task received by client
$ nomad alloc status e612cbf0
ID                  = e612cbf0-c647-fce7-b03c-9f165dd0d503
Eval ID             = 4c3be9f3
Name                = csi-ebs-controller.controller-us-west-2a[0]
Node ID             = e51ef358
Node Name           = node-name
Job ID              = csi-ebs-controller
Job Version         = 2
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 8m2s ago
Modified            = 7m49s ago

Task "plugin" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/500 MHz  7.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2022-01-06T15:38:51Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                   Description
2022-01-06T16:38:57+01:00  Plugin became healthy  plugin: aws-ebs-us-west-2a
2022-01-06T16:38:51+01:00  Started                Task started by client
2022-01-06T16:38:49+01:00  Task Setup             Building Task Directory
2022-01-06T16:38:49+01:00  Received               Task received by client

The state never seems to change and this is the result:

$ nomad plan example.hcl
+ Job: "example"
+ Task Group: "example-0" (1 create)
  + Task: "example" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "example-0" (failed to place 1 allocation):
    * Class "default": 3 nodes excluded by filter
    * Constraint "CSI plugin aws-ebs-us-west-2a is missing from client e51ef358-1c33-7468-c46c-fdab82f44ec8": 1 nodes excluded by filter

Reproduction steps

Reboot a node that runs CSI plugins and check the various states.

@DerekStrickland DerekStrickland self-assigned this Jan 6, 2022
@DerekStrickland
Copy link
Contributor

Hi @ygersie

Thanks for filing this issue. We'll take a look at this as soon as we can.

Derek

@tgross
Copy link
Member

tgross commented Feb 3, 2022

Leaving a note that this issue appears to be related to but may have subtly different code paths from #9810

@tgross
Copy link
Member

tgross commented Feb 11, 2022

Hey @ygersie just wanted to give an update. I don't have a firm root cause on this issue but I spent some time trying to reproduce and discovered that instantiating our gRPC client to the plugins was not very robust and this can definitely lead to the symptoms you've presented here.

We have two gRPC clients that talk to the plugins. One is in the plugin_supervisor_hook that's associated with the plugin task. This hook detects the initial health of the plugin and registers the plugin with the Nomad client's internal plugin registry. The hook is also what writes the task event that you see here.

Recent Events:
Time                       Type                   Description
2022-01-06T16:39:05+01:00  Plugin became healthy  plugin: aws-ebs-us-west-2a

Once the plugin_supervisor_hook registers the plugin, a second gRPC client is spawned for the plugin instanceManager which is what fingerprints the plugins and actually talks to the plugins on behalf of other tasks that need volumes. Unfortunately if the first attempt to connect to the plugin fails, we just bail out instance.go#L78. So we never fingerprint the plugin and can't talk to it to ask it to mount volumes, but we also never see a task event that the plugin became unhealthy!

That sequence of events lines up with the evidence in this issue and in my reproductions, but I'm not quite comfortable calling this issue solved until I've done a bit more testing.

In the meantime, I've opened #12057 with two changes to solve the problem I've described here:

  • The plugin instanceManager will retry creating the gRPC client every 5 seconds until it succeeds or gets a shutdown notice.
  • Reuse the gRPC in the plugin_supervisor_hook so that we're not doing all that work every 30s and then throwing it away.

Those improvements will ship in Nomad 1.3.0 and get backported to 1.2.x and 1.1.x.

@tgross tgross added this to the 1.3.0 milestone Feb 11, 2022
@tgross
Copy link
Member

tgross commented Feb 11, 2022

Naturally almost as soon as I posted I end up with a better repro...

Recent Events:
Time                       Type                     Description
2022-02-11T15:55:46-05:00  Plugin became unhealthy  error: failed to probe plugin: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/nomad/data/client/csi/controller/org.democratic-csi.nfs/csi.sock: connect: connection refused"
2022-02-11T15:55:32-05:00  Started                  Task started by client
2022-02-11T15:55:32-05:00  Restarting               Task restarting in 16.368197257s
2022-02-11T15:55:32-05:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-02-11T15:53:16-05:00  Plugin became healthy    plugin: org.democratic-csi.nfs
2022-02-11T15:53:11-05:00  Started                  Task started by client
2022-02-11T15:53:11-05:00  Task Setup               Building Task Directory
2022-02-11T15:53:11-05:00  Received                 Task received by client

In this scenario I killed the Docker container that was running so that Nomad would recreate it. The unix socket is there but it's refusing connections from any client:

# socat - UNIX-CONNECT://var/nomad/data/client/csi/controller/org.democratic-csi.nfs/csi.sock
2022/02/11 21:18:08 socat[18314] E connect(5, AF=1 "//var/nomad/data/client/csi/controller/org.democratic-csi.nfs/csi.sock", 72): Connection refused

The Docker container has the mount as expected:

$ docker inspect 4b7 | jq '.[0].Mounts[] | select(.Destination == "/csi")'
{
  "Type": "bind",
  "Source": "/var/nomad/data/client/csi/node/org.democratic-csi.nfs",
  "Destination": "/csi",
  "Mode": "",
  "RW": true,
  "Propagation": "rshared"
}

If I exec into the task, I see the socket:

$ nomad alloc exec 42a /bin/bash
root@79af57406091:/home/csi/app# ls -lah /csi
total 8.0K
drwx------ 2 root root 4.0K Feb 11 20:53 .
drwxr-xr-x 1 root root 4.0K Feb 11 20:55 ..
srwxr-xr-x 1 root root    0 Feb 11 20:53 csi.sock

Let's look at the two plugin processes we have. One is for a controller (which is broken) and the other is for a node (which is working fine):

$ ps -ef | grep democratic
root       16799   16770  0 20:55 ?        00:00:00 node --nouse-idle-notification --expose-gc bin/democratic-csi --csi-version=1.5.0 --csi-name=org.democratic-csi.nfs --driver-config-file=/local/driver-config-file.yaml --log-level=debug --csi-mode=controller --server-socket=/csi/csi.sock
root       17139   17110  0 20:59 ?        00:00:01 node --nouse-idle-notification --expose-gc bin/democratic-csi --csi-version=1.5.0 --csi-name=org.democratic-csi.nfs --driver-config-file=/local/driver-config-file.yaml --log-level=debug --csi-mode=node --server-socket=/csi/csi.sock

Ok, let's look at the (working) node plugin's socket:

$ sudo lsof +E -U
COMMAND     PID            USER   FD   TYPE             DEVICE SIZE/OFF   NODE NAME
...
nomad     16438            root   28u  unix 0xffff91ef5c3a0c00      0t0  84227 type=STREAM ->INO=84228 17139,node,19u
nomad     16438            root   29u  unix 0xffff91ef5c3a1400      0t0  84230 type=STREAM ->INO=84231 17139,node,20u
node      17139            root   18u  unix 0xffff91ef3121b800      0t0  84223 ///csi/csi.sock type=STREAM
node      17139            root   19u  unix 0xffff91ef5c3a0400      0t0  84228 ///csi/csi.sock type=STREAM ->INO=84227 16438,nomad,28u
node      17139            root   20u  unix 0xffff91ef5c3a1800      0t0  84231 ///csi/csi.sock type=STREAM ->INO=84230 16438,nomad,29u

Just as we expect, and we can see the Nomad process has got the other end as well. How about our controller plugin's socket?

$ sudo lsof +E -U | grep 16799

No one is listening! In this case something went wrong with our plugin when it was created, and it's not listening on the socket at all. But allocation logs after the restart seem to think it should be.

{"level":"info","message":"initializing csi driver: nfs-client","service":"democratic-csi"}
{"level":"debug","message":"setting default identity service caps","service":"democratic-csi"}
{"level":"debug","message":"setting default identity volume_expansion caps","service":"democratic-csi"}
{"level":"debug","message":"setting default controller caps","service":"democratic-csi"}
{"level":"debug","message":"setting default node caps","service":"democratic-csi"}
{"level":"info","message":"starting csi server - name: org.democratic-csi.nfs, version: 1.4.3, driver: nfs-client, mode: controller, csi version: 1.5.0, address: , socket: unix:///csi/csi.sock","service":"democratic-csi"}

Unfortunately I don't have the right toolset on the box I'm testing this on to properly dump the heap of the plugin, but clearly this isn't a good state. It also suggests that #12057 as it currently stands isn't a good idea. Regardless of the reason, we should almost certainly be killing the plugin task if we get "stuck" like this.

I'm at the end of my week here but I'll pick it up again on Monday and revise #12057 as necessary.

@gregory112
Copy link

I would like to also confirm that this also happens with ceph-csi. Start a node system job and controller service job and restart any nodes. When restarted, preexisting ceph-csi containers run again on boot, marked as healthy in job status. However, it does not show up in plugin status, just like in here. The issue can reliably be reproduced just by rebooting.

@tgross
Copy link
Member

tgross commented Feb 15, 2022

Ok #12057 is up for review has been merged. It doesn't fix this issue by itself but should help make it much more clear what's happening:

  • Plugin tasks will get killed (and restart) after 30s if the initial probe fails.
  • The fingerprinter will continuously retry to establish its connection until the task exits, so that we don't end up in a situation where the fingerprinter failed to connect but the plugin supervisor is still merrily continuing without it.

With that patch I tested again by running plugins on my cluster and restarting/draining nodes. The node allocations come back up:

$ nomad job status democratic-csi-nfs-node
ID            = democratic-csi-nfs-node
...

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
190c1fcb  d708b712  nodes       0        run      running   2m42s ago  2m31s ago
ce744e65  8e4428b4  nodes       0        run      running   3m26s ago  3m16s ago
76dcd56f  8e4428b4  nodes       0        stop     complete  6m35s ago  3m22s ago
c427f14e  d708b712  nodes       0        stop     complete  6m35s ago  2m51s ago
f710e1b0  fecb9b26  nodes       0        run      running   6m35s ago  6m25s ago

As do the controller allocations:

$ nomad job status democratic-csi-nfs-controller
ID            = democratic-csi-nfs-controller
...

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
cbee8837  8e4428b4  controller  0        run      running   5m33s ago  5m23s ago
4e2106cc  d708b712  controller  0        stop     complete  7m29s ago  5m33s ago
ba0d6689  8e4428b4  controller  0        stop     complete  9m20s ago  6m4s ago
efa615d3  fecb9b26  controller  0        run      running   9m20s ago  9m10s ago

Now the plugin status doesn't see the new allocations but does have the correct count! I'll note that all the missing allocations here are the replacement allocations.

$ nomad plugin status o
ID                   = org.democratic-csi.nfs
Provider             = org.democratic-csi.nfs
Version              = 1.4.3
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
f710e1b0  fecb9b26  nodes       0        run      running   6m39s ago  6m29s ago
efa615d3  fecb9b26  controller  0        run      running   6m42s ago  6m32s ago
ba0d6689  8e4428b4  controller  0        stop     complete  6m42s ago  3m26s ago
76dcd56f  8e4428b4  nodes       0        stop     complete  6m39s ago  3m26s ago
190c1fcb  d708b712  nodes       0        run      running   2m46s ago  2m35s ago

If we look at the allocs, they're healthy as we'd expect:

$ nomad alloc status ce744e65
ID                  = ce744e65-67eb-cf97-920c-251083fa2f7e
Eval ID             = b793de3d
Name                = democratic-csi-nfs-node.nodes[0]
Node ID             = 8e4428b4
Node Name           = nomad-client1
Job ID              = democratic-csi-nfs-node
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 3m51s ago
Modified            = 3m41s ago

Task "plugin" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/500 MHz  26 MiB/200 MiB  300 MiB

Task Events:
Started At     = 2022-02-15T18:41:54Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                   Description
2022-02-15T13:42:04-05:00  Plugin became healthy  plugin: org.democratic-csi.nfs
2022-02-15T13:41:54-05:00  Started                Task started by client
2022-02-15T13:41:54-05:00  Task Setup             Building Task Directory
2022-02-15T13:41:54-05:00  Received               Task received by client

The only logs we see for the new plugin allocations are the logs for the initial ping that fails. We should probably add a log line that says when they finally get registered too so as to make it clear this is a temporary condition.

2022-02-15T18:41:59.938Z [INFO] client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=ce744e65-67eb-cf97-920c-251083fa2f7e plugin.name=org.democratic-csi.nfs plugin.type=node task=plugin grpc.code=DeadlineExceeded duration="10.654µs" grpc.service=csi.v1.Identity grpc.method=Probe
2022-02-15T18:41:59.939Z [DEBUG] client.alloc_runner.task_runner.task_hook: CSI plugin not ready: alloc_id=ce744e65-67eb-cf97-920c-251083fa2f7e task=plugin error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
...
2022-02-15T18:42:35.406Z [INFO] client.alloc_runner.task_runner.task_hook.csi_client: finished client unary call: alloc_id=cbee8837-10d6-93a9-d962-ede0ec9b63bc plugin.name=org.democratic-csi.nfs plugin.type=controller task=plugin grpc.code=DeadlineExceeded duration="20.634µs" grpc.service=csi.v1.Identity grpc.method=Probe
2022-02-15T18:42:35.407Z [DEBUG] client.alloc_runner.task_runner.task_hook: CSI plugin not ready: alloc_id=cbee8837-10d6-93a9-d962-ede0ec9b63bc task=plugin error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Then I created a NFS volume that supports multi-writer and ran the following job that deploys allocations on all 3 nodes:

jobspec
job "httpd" {
  datacenters = ["dc1"]

  group "web" {

    count = 3

    volume "csi_data" {
      type            = "csi"
      read_only       = false
      source          = "csi-volume-nfs0"
      access_mode     = "multi-node-multi-writer"
      attachment_mode = "file-system"
    }

    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/srv"]
        ports   = ["www"]
      }

      volume_mount {
        volume      = "csi_data"
        destination = "/srv"
        read_only   = false
      }

      resources {
        cpu    = 128
        memory = 128
      }

    }
  }
}

And it works with all allocations running!

$ nomad volume status csi-volume-nfs0
ID                   = csi-volume-nfs0
Name                 = csi-volume-nfs0
External ID          = csi-volume-nfs0
Plugin ID            = org.democratic-csi.nfs
Provider             = org.democratic-csi.nfs
Version              = 1.4.3
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = multi-node-multi-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID                                    Node ID                               Task Group  Version  Desired  Status   Created  Modified
9fe2fcf2-bda1-ef29-fda3-929aacfd3822  8e4428b4-7de4-d696-8e2d-cc78d65476e6  web         0        run      running  30s ago  19s ago
24002022-c6a1-5be3-4f3b-d9aecf3a0a4d  fecb9b26-87e6-c03e-470f-18443c20f737  web         0        run      running  30s ago  19s ago
bdc7431f-590e-9c39-328a-bec03a297030  d708b712-6d12-a1a0-2bd0-c5594e739a94  web         0        run      running  30s ago  19s ago

This suggests that the client has a working state for the plugins but the server has an incorrect state with respect to which allocations serve the plugin (but not which nodes!). Unfortunately the server logs don't have any useful logs for me here, but this is a good starting point to drive down to root cause on the rest of this bug.

@tgross
Copy link
Member

tgross commented Feb 16, 2022

With the experiment above in mind I've written up some tests in #12078 that demonstrate the behaviors we're seeing here, and from there I have a working hypothesis:

The dynamic plugin registry assumes that plugins are singletons, which matches the behavior of other Nomad plugins. But of course dynamic plugins like CSI are implemented by allocations which can run concurrently. This concurrency includes restoring the allocation from restarts on top of the existing dynamic plugin registry!

We need to handle the possibility of multiple allocations for a given plugin type + ID, as well as behaviors around interleaved allocation starts and stops. while maintaining the illusion of it being a singleton from the perspective of callers who just want to mount/unmount volumes. I've left a TODO list in #12078 and have started on implementation.

@tgross tgross linked a pull request Feb 17, 2022 that will close this issue
5 tasks
@tgross
Copy link
Member

tgross commented Feb 23, 2022

Should be fixed by #12078, which will ship in Nomad 1.3.0

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/storage type/bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants