
[CSI] requests to job left freezing after ceph csi plugin restart #11999

Closed
kriestof opened this issue Feb 3, 2022 · 10 comments
kriestof commented Feb 3, 2022

Nomad version

Nomad v1.2.5

Operating system and Environment details

Arch Linux
kernel 5.16.5-arch1-1

Issue

I've set up a Docker registry that stores images on a Ceph filesystem using ceph-csi. Everything works fine until the ceph-csi plugin is restarted; then none of the Docker registry images can be pulled. This is especially troubling because the plugin occasionally restarts on its own, after which the docker-registry job has to be restarted manually as well. Unfortunately, I haven't been able to narrow down the cause: there is no descriptive error in either ceph-csi or the docker-registry job. The same error also happened on Nomad 1.1.x (and even earlier) and with different ceph-csi plugin versions.

Reproduction steps

Steps to reproduce are a little troublesome -- you need a running Ceph instance. But maybe somebody who runs Ceph with Nomad could share their results. I'm also open to testing a more minimal example, but so far no simple storage + CSI setup comes to mind.

nomad run -hcl1 plugin-ceph-nodes.nomad
nomad volume register registry-volume.hcl
nomad run -hcl1 image-registry.hcl

docker pull nginx:latest
docker tag nginx:latest registry.service.consul:5000/nginx
docker push registry.service.consul:5000/nginx
curl http://registry.service.consul:5000/v2/_catalog # should return something like {"repositories":["nginx"]}

nomad stop plugin-ceph-nodes
nomad run -hcl1 plugin-ceph-nodes.nomad

# Give it a few minutes break
curl http://registry.service.consul:5000/v2/_catalog # Should timeout

nomad stop docker-registry
nomad run -hcl1 image-registry.hcl
curl http://registry.service.consul:5000/v2/_catalog # Should return result

Expected Result

After a ceph-csi restart, running jobs should either keep working with their mounts intact or be restarted automatically.

Actual Result

It seems that after the CSI plugin disconnects, the volume directory mounted into the docker-registry job is left frozen.

Job file (if appropriate)

plugin-ceph-nodes.nomad:

job "plugin-ceph-nodes" {
  datacenters = ["dc1"]

  type = "system"

  group "nodes" {
    task "ceph-node" {
      driver = "docker"

      template {
        data        = <<EOF
[{
  "clusterID": "xxxxxxxxxxxxx",
  "monitors": [
     "xxx.xx.xxx.xx", "xxx.xx.xxx.xx", ...
  ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }

      config {
        image = "quay.io/cephcsi/cephcsi:v3.4.0"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--type=cephfs",
          "--drivername=cephfs.csi.ceph.com",
          "--nodeid=${node.unique.id}",
          "--instanceid=${NOMAD_ALLOC_ID}",
          "--nodeserver=true"
        ]

        privileged = true

        mounts = [
          {
            type     = "tmpfs"
            target   = "/tmp/csi/keys"
            readonly = false
            tmpfs_options {
              size = 1000000 # size in bytes
            }
          }
        ]

        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json",
          "/lib/modules:/lib/modules"
        ]
      }

      resources {
        cpu    = 500
        memory = 256
      }

      csi_plugin {
        id        = "ceph-csi"
        type      = "node"
        mount_dir = "/csi"
      }
    }

    task "ceph-controller" {
      driver = "docker"

      template {
        data        = <<EOF
[{
  "clusterID": "xxxxxxxx",
  "monitors":[
     "xxx.xx.xxx.xx", "xxx.xx.xxx.xx", ...
  ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }

      config {
        image = "quay.io/cephcsi/cephcsi:v3.4.0"

        args = [
          "--type=cephfs",
          "--controllerserver=true",
          "--drivername=cephfs.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.id}",
          "--instanceid=${NOMAD_ALLOC_ID}"
        ]

        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json"
        ]
      }

      resources {
        cpu    = 200
        memory = 500
      }

      csi_plugin {
        id        = "ceph-csi"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}

registry-volume.hcl:

type = "csi"
id = "registry-volume"
name = "Docker registry"
external_id = "xxxxxxxxx"
plugin_id = "ceph-csi"

capability {
  access_mode = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

parameters {
}

secrets {
  userID  = "uuuuu"
  userKey = "ppppp"
}

context {
  pool = "cephfs_data"
  clusterID = "xxxxxxxxxxxxxxxxx"
  monitors = "xxx.xx.xxx.xx,xxx.xx.xxx.xx, ..."
  provisionVolume = "false"
  rootPath = "/docker-registry"
}

image-registry.hcl:

job "registry" {
  datacenters = ["dc1"]

  group "registry" {
    network {
      port  "http"  {
        static = 5000
      }
    }

    volume "registry-volume" {
      type      = "csi"
      read_only = false
      source    = "registry-volume"
      access_mode = "multi-node-multi-writer"
      attachment_mode = "file-system"
    }


    task "web" {
      volume_mount {
        volume      = "registry-volume"
        destination = "/var/lib/registry"
        read_only   = false
      }

      driver = "docker"

      config {
        image = "registry:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
      }

      service {
        name = "registry"
        port = "http"

        check {
          name     = "alive"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)


tgross commented Feb 3, 2022

Hi @kriestof, there may be something related happening in #11784 in terms of the plugin restart.

But architecturally speaking, when a CSI plugin exits, nothing at all should be happening to the volumes it mounted. The only way this should cause a problem is if the job that wants the volume also restarts/reschedules, and that's because Nomad can't talk to the plugin if it's not running.

Nomad doesn't touch the volumes for a running job at all no matter what happens to the plugins that mounted them, so there may be a bug in the Ceph plugin you're using. Can you provide the logs from the plugin allocation? It may have some clues.


kriestof commented Feb 3, 2022

@tgross thank you for the quick response. I've managed to get a more verbose log level out of ceph-csi (added "--v=5"). It seems that while the job is starting, ceph-csi mounts the volume properly, but after the plugin restart nothing happens at all.

That seems consistent: the request to open the directory is stuck because the plugin does not respond (or the request never even reaches it). After a new instance of the ceph-csi plugin is created, for some reason it cannot handle the previous connection. It might well be a problem with ceph-csi; I'm just not sure how to tackle it.

ceph csi controller before registry job start

I0203 21:16:09.699580       1 cephcsi.go:167] Driver version: v3.4.0 and Git version: 94ef181bc866c8ccaf507de66f213b26608309ff
I0203 21:16:09.699907       1 cephcsi.go:212] Starting driver type: cephfs with name: cephfs.csi.ceph.com
I0203 21:16:09.736161       1 volumemounter.go:87] loaded mounter: kernel
I0203 21:16:09.776261       1 volumemounter.go:98] loaded mounter: fuse
I0203 21:16:09.776468       1 driver.go:94] Enabling controller service capability: CREATE_DELETE_VOLUME
I0203 21:16:09.776477       1 driver.go:94] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0203 21:16:09.776480       1 driver.go:94] Enabling controller service capability: EXPAND_VOLUME
I0203 21:16:09.776484       1 driver.go:94] Enabling controller service capability: CLONE_VOLUME
I0203 21:16:09.776487       1 driver.go:106] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0203 21:16:09.776827       1 server.go:131] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0203 21:16:15.066006       1 utils.go:176] ID: 1 GRPC call: /csi.v1.Identity/Probe
I0203 21:16:15.069269       1 utils.go:180] ID: 1 GRPC request: {}
I0203 21:16:15.069314       1 utils.go:187] ID: 1 GRPC response: {}
[...]

ceph csi node before registry job start

I0203 21:16:17.527420       1 cephcsi.go:212] Starting driver type: cephfs with name: cephfs.csi.ceph.com
I0203 21:16:17.645130       1 volumemounter.go:87] loaded mounter: kernel
I0203 21:16:17.683367       1 volumemounter.go:98] loaded mounter: fuse
I0203 21:16:17.683989       1 server.go:131] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0203 21:16:20.496160       1 utils.go:176] ID: 1 GRPC call: /csi.v1.Identity/Probe
I0203 21:16:20.498296       1 utils.go:180] ID: 1 GRPC request: {}
I0203 21:16:20.498386       1 utils.go:187] ID: 1 GRPC response: {}
I0203 21:16:20.549602       1 utils.go:176] ID: 2 GRPC call: /csi.v1.Identity/GetPluginInfo
I0203 21:16:20.549665       1 utils.go:180] ID: 2 GRPC request: {}
[...]
I0203 21:21:50.675781       1 utils.go:176] ID: 40 GRPC call: /csi.v1.Identity/Probe
I0203 21:21:50.675820       1 utils.go:180] ID: 40 GRPC request: {}
I0203 21:21:50.675836       1 utils.go:187] ID: 40 GRPC response: {}
I0203 21:21:57.354965       1 utils.go:176] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f GRPC call: /csi.v1.Node/NodeStageVolume
I0203 21:21:57.355260       1 utils.go:180] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f GRPC request: {"secrets":"***stripped***","staging_target_path":"/csi/staging/registry-volume/rw-file-system-multi-node-multi-writer","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":5}},"volume_context":{"clusterID":"xxxxx","monitors":"-----","pool":"---","provisionVolume":"false","rootPath":"/docker-registry"},"volume_id":"----"}
I0203 21:21:57.526672       1 mount_linux.go:173] Cannot run systemd-run, assuming non-systemd OS
I0203 21:21:57.526694       1 mount_linux.go:174] systemd-run failed with: exit status 1
I0203 21:21:57.526702       1 mount_linux.go:175] systemd-run output: System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
I0203 21:21:57.526753       1 volumemounter.go:134] requested mounter: , chosen mounter: kernel
I0203 21:21:57.526792       1 nodeserver.go:157] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f cephfs: mounting volume 53245bc5-e638-4b24-b692-505feaef8f2f with Ceph kernel client
I0203 21:21:57.528300       1 cephcmds.go:60] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f command succeeded: modprobe [ceph]
I0203 21:21:57.662270       1 cephcmds.go:60] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f command succeeded: mount [-t ceph 172.24.213.67,172.24.213.68,172.24.213.69,172.24.213.70,172.24.213.71:/docker-registry /csi/staging/registry-volume/rw-file-system-multi-node-multi-writer -o name=nomad,secretfile=/tmp/csi/keys/keyfile-028551869,_netdev]
I0203 21:21:57.662394       1 nodeserver.go:133] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f cephfs: successfully mounted volume 53245bc5-e638-4b24-b692-505feaef8f2f to /csi/staging/registry-volume/rw-file-system-multi-node-multi-writer
I0203 21:21:57.662502       1 utils.go:187] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f GRPC response: {}
I0203 21:21:57.685145       1 utils.go:176] ID: 42 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f GRPC call: /csi.v1.Node/NodePublishVolume
I0203 21:21:57.685465       1 utils.go:180] ID: 42 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f GRPC request: {"secrets":"***stripped***","staging_target_path":"/csi/staging/registry-volume/rw-file-system-multi-node-multi-writer","target_path":"/csi/per-alloc/408722ad-3448-b7cb-ca8f-dd476e3879f1/registry-volume/rw-file-system-multi-node-multi-writer","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":5}},"volume_context":{"clusterID":"53245bc5-e638-4b24-b692-505feaef8f2f","monitors":"172.24.213.67,172.24.213.68,172.24.213.69,172.24.213.70,172.24.213.71","pool":"cephfs_data","provisionVolume":"false","rootPath":"/docker-registry"},"volume_id":"53245bc5-e638-4b24-b692-505feaef8f2f"}
I0203 21:21:57.700548       1 mount_linux.go:173] Cannot run systemd-run, assuming non-systemd OS
I0203 21:21:57.700580       1 mount_linux.go:174] systemd-run failed with: exit status 1
I0203 21:21:57.700599       1 mount_linux.go:175] systemd-run output: System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
I0203 21:21:57.705551       1 cephcmds.go:60] ID: 42 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f command succeeded: mount [-o bind,_netdev /csi/staging/registry-volume/rw-file-system-multi-node-multi-writer /csi/per-alloc/408722ad-3448-b7cb-ca8f-dd476e3879f1/registry-volume/rw-file-system-multi-node-multi-writer]
I0203 21:21:57.705593       1 nodeserver.go:266] ID: 42 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f cephfs: successfully bind-mounted volume 53245bc5-e638-4b24-b692-505feaef8f2f to /csi/per-alloc/408722ad-3448-b7cb-ca8f-dd476e3879f1/registry-volume/rw-file-system-multi-node-multi-writer
I0203 21:21:57.705699       1 utils.go:187] ID: 42 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f GRPC response: {}
I0203 21:22:20.588007       1 utils.go:176] ID: 43 GRPC call: /csi.v1.Identity/Probe
I0203 21:22:20.588144       1 utils.go:180] ID: 43 GRPC request: {}
I0203 21:22:20.588192       1 utils.go:187] ID: 43 GRPC response: {}
[...]

ceph csi controller after registry job start

I0203 21:30:21.533288       1 cephcsi.go:167] Driver version: v3.4.0 and Git version: 94ef181bc866c8ccaf507de66f213b26608309ff
I0203 21:30:21.533414       1 cephcsi.go:212] Starting driver type: cephfs with name: cephfs.csi.ceph.com
I0203 21:30:21.669072       1 volumemounter.go:87] loaded mounter: kernel
I0203 21:30:21.865440       1 volumemounter.go:98] loaded mounter: fuse
I0203 21:30:21.865575       1 driver.go:94] Enabling controller service capability: CREATE_DELETE_VOLUME
I0203 21:30:21.865584       1 driver.go:94] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0203 21:30:21.865588       1 driver.go:94] Enabling controller service capability: EXPAND_VOLUME
I0203 21:30:21.865592       1 driver.go:94] Enabling controller service capability: CLONE_VOLUME
I0203 21:30:21.865595       1 driver.go:106] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0203 21:30:21.865813       1 server.go:131] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0203 21:30:26.115558       1 utils.go:176] ID: 1 GRPC call: /csi.v1.Identity/Probe
I0203 21:30:26.118469       1 utils.go:180] ID: 1 GRPC request: {}
I0203 21:30:26.118513       1 utils.go:187] ID: 1 GRPC response: {}
I0203 21:30:26.231061       1 utils.go:176] ID: 2 GRPC call: /csi.v1.Identity/GetPluginInfo
I0203 21:30:26.231119       1 utils.go:180] ID: 2 GRPC request: {}
I0203 21:30:26.231131       1 identityserver-default.go:38] ID: 2 Using default GetPluginInfo
I0203 21:30:26.231191       1 utils.go:187] ID: 2 GRPC response: {"name":"cephfs.csi.ceph.com","vendor_version":"v3.4.0"}
I0203 21:30:26.231927       1 utils.go:176] ID: 3 GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0203 21:30:26.231957       1 utils.go:180] ID: 3 GRPC request: {}
I0203 21:30:26.234463       1 utils.go:187] ID: 3 GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}},{"Type":{"VolumeExpansion":{"type":1}}},{"Type":{"Service":{"type":2}}}]}
I0203 21:30:26.234819       1 utils.go:176] ID: 4 GRPC call: /csi.v1.Identity/Probe
I0203 21:30:26.234843       1 utils.go:180] ID: 4 GRPC request: {}
I0203 21:30:26.234861       1 utils.go:187] ID: 4 GRPC response: {}

ceph csi node after registry job start:

I0203 21:30:21.533298       1 cephcsi.go:167] Driver version: v3.4.0 and Git version: 94ef181bc866c8ccaf507de66f213b26608309ff
I0203 21:30:21.533421       1 cephcsi.go:212] Starting driver type: cephfs with name: cephfs.csi.ceph.com
I0203 21:30:21.669202       1 volumemounter.go:87] loaded mounter: kernel
I0203 21:30:21.865032       1 volumemounter.go:98] loaded mounter: fuse
I0203 21:30:21.865493       1 server.go:131] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0203 21:30:26.551735       1 utils.go:176] ID: 1 GRPC call: /csi.v1.Identity/Probe
I0203 21:30:26.555213       1 utils.go:180] ID: 1 GRPC request: {}
I0203 21:30:26.555279       1 utils.go:187] ID: 1 GRPC response: {}
I0203 21:30:26.689151       1 utils.go:176] ID: 2 GRPC call: /csi.v1.Identity/GetPluginInfo
I0203 21:30:26.689256       1 utils.go:180] ID: 2 GRPC request: {}
I0203 21:30:26.689276       1 identityserver-default.go:38] ID: 2 Using default GetPluginInfo
I0203 21:30:26.689391       1 utils.go:187] ID: 2 GRPC response: {"name":"cephfs.csi.ceph.com","vendor_version":"v3.4.0"}
I0203 21:30:26.691346       1 utils.go:176] ID: 3 GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0203 21:30:26.691501       1 utils.go:180] ID: 3 GRPC request: {}
I0203 21:30:26.691944       1 utils.go:187] ID: 3 GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}},{"Type":{"VolumeExpansion":{"type":1}}},{"Type":{"Service":{"type":2}}}]}
I0203 21:30:26.692901       1 utils.go:176] ID: 4 GRPC call: /csi.v1.Node/NodeGetInfo
I0203 21:30:26.692976       1 utils.go:180] ID: 4 GRPC request: {}
I0203 21:30:26.692993       1 nodeserver-default.go:60] ID: 4 Using default NodeGetInfo
I0203 21:30:26.693197       1 utils.go:187] ID: 4 GRPC response: {"accessible_topology":{},"node_id":"27a53a35-403b-a099-4d1f-f4bf7bee2934"}
I0203 21:30:26.694012       1 utils.go:176] ID: 5 GRPC call: /csi.v1.Identity/Probe
I0203 21:30:26.694088       1 utils.go:180] ID: 5 GRPC request: {}
I0203 21:30:26.694148       1 utils.go:187] ID: 5 GRPC response: {}
I0203 21:30:26.694846       1 utils.go:176] ID: 6 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0203 21:30:26.694950       1 utils.go:180] ID: 6 GRPC request: {}
I0203 21:30:26.695346       1 utils.go:187] ID: 6 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0203 21:30:26.821976       1 utils.go:176] ID: 7 GRPC call: /csi.v1.Identity/Probe
I0203 21:30:26.822039       1 utils.go:180] ID: 7 GRPC request: {}
I0203 21:30:26.822090       1 utils.go:187] ID: 7 GRPC response: {}
I0203 21:30:56.696819       1 utils.go:176] ID: 8 GRPC call: /csi.v1.Identity/Probe
I0203 21:30:56.696930       1 utils.go:180] ID: 8 GRPC request: {}
I0203 21:30:56.696976       1 utils.go:187] ID: 8 GRPC response: {}
I0203 21:30:56.697682       1 utils.go:176] ID: 9 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0203 21:30:56.697972       1 utils.go:180] ID: 9 GRPC request: {}
I0203 21:30:56.698285       1 utils.go:187] ID: 9 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0203 21:30:56.823462       1 utils.go:176] ID: 10 GRPC call: /csi.v1.Identity/Probe
I0203 21:30:56.823522       1 utils.go:180] ID: 10 GRPC request: {}
I0203 21:30:56.823546       1 utils.go:187] ID: 10 GRPC response: {}
I0203 21:31:26.699324       1 utils.go:176] ID: 11 GRPC call: /csi.v1.Identity/Probe
I0203 21:31:26.699428       1 utils.go:180] ID: 11 GRPC request: {}
I0203 21:31:26.699469       1 utils.go:187] ID: 11 GRPC response: {}
I0203 21:31:26.700013       1 utils.go:176] ID: 12 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0203 21:31:26.700052       1 utils.go:180] ID: 12 GRPC request: {}
I0203 21:31:26.700128       1 utils.go:187] ID: 12 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0203 21:31:26.828145       1 utils.go:176] ID: 13 GRPC call: /csi.v1.Identity/Probe
I0203 21:31:26.828259       1 utils.go:180] ID: 13 GRPC request: {}
I0203 21:31:26.828674       1 utils.go:187] ID: 13 GRPC response: {}


tgross commented Feb 3, 2022

That seems consistent: the request to open the directory is stuck because the plugin does not respond (or the request never even reaches it). After a new instance of the ceph-csi plugin is created, for some reason it cannot handle the previous connection.

I think here's where the confusion is. Once your docker registry job has successfully mounted the volume, it no longer needs to talk to the plugin. We can see that worked successfully in your logs here:

I0203 21:21:57.662270 1 cephcmds.go:60] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f command succeeded: mount [-t ceph 172.24.213.67,172.24.213.68,172.24.213.69,172.24.213.70,172.24.213.71:/docker-registry /csi/staging/registry-volume/rw-file-system-multi-node-multi-writer -o name=nomad,secretfile=/tmp/csi/keys/keyfile-028551869,_netdev]
I0203 21:21:57.662394 1 nodeserver.go:133] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f cephfs: successfully mounted volume 53245bc5-e638-4b24-b692-505feaef8f2f to /csi/staging/registry-volume/rw-file-system-multi-node-multi-writer

At this point, your registry task is making filesystem calls, mediated by the kernel, to Ceph MDS, not to the Ceph plugin. So if the Ceph plugin goes away, it shouldn't touch those mounts on the registry job.

I think what might be interesting are the logs on the plugin after it restarts, and compare them against the same time frame from the Ceph MDS logs. Because that's what you said caused the problem. Did the registry task's allocation logs show anything interesting?


kriestof commented Feb 4, 2022

Ok, so now I understand better, and it gets even more confusing. Based on what you said, stopping the plugin after the job has started should not impact the job. But when I stop the plugin (even without starting it again), the job's volume is left frozen.

There is nothing interesting in the Ceph logs or the job logs. The only thing I get is from dmesg on the Nomad client:

[Fri Feb  4 12:54:20 2022] libceph: mds0 (1)172.24.213.67:6801 connect error
[Fri Feb  4 12:54:20 2022] libceph: mon1 (1)172.24.213.68:6789 connect error

I've also checked the mounts there. The effect is exactly the same: I can't enter the mounted Ceph directory, and my console freezes.
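(One way to probe a possibly-hung mountpoint without freezing the console is to wrap the access in coreutils `timeout`. A minimal sketch; the path below is illustrative and should be replaced with the real staging/publish path from the plugin logs.)

```shell
# Probe a possibly-hung mountpoint without hanging the shell.
# MNT is an illustrative path; substitute the real staging path.
MNT=/var/nomad/client/csi/monolith/cephfs/staging/registry-volume/rw-file-system-multi-node-multi-writer

if timeout 5 ls "$MNT" > /dev/null 2>&1; then
  echo "mount responsive"
else
  echo "mount hung or unreachable"
fi
```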

I'm not too deep into CSI or kernel details, but my current bet is that ceph-csi is needed even after startup. I guess the mount is executed inside the plugin container, then exposed outside the container with a bind mount and bind-mounted into any job that uses it. Hence, if the plugin is stopped, the connection to Ceph may break because some process running there goes away.

I'll try to contact ceph-csi folks to ask what they think about it.


tgross commented Feb 4, 2022

I'm not too deep into CSI or kernel details, but my current bet is that ceph-csi is needed even after startup. I guess the mount is executed inside the plugin container, then exposed outside the container with a bind mount and bind-mounted into any job that uses it. Hence, if the plugin is stopped, the connection to Ceph may break because some process running there goes away.

Yeah, I think you might be on to something here, but I'm not seeing it in the Ceph plugin logs.

Just to help bring you up to speed... we have:

  • A plugin job for plugin cephfs in allocation d433812a-8c5a-66e2-9302-b84eda63bb46 (I made up this alloc ID for illustration).
  • A job with a docker-registry task that mounts a CSI volume registry-volume at a read-write mountpoint volume0 in allocation 408722ad-3448-b7cb-ca8f-dd476e3879f1 (this alloc ID is the one in your logs for the registry).

Here's what we'd expect the Nomad datadir on the client looks like:

root@linux:/var/nomad# tree .
├── alloc
│   ├── 408722ad-3448-b7cb-ca8f-dd476e3879f1
│   │   ├── alloc
│   │   │   ├── data
│   │   │   ├── logs
│   │   │   ├── tmp
│   │   │   └── registry-volume # (final mount)
│   │   └── docker-registry # (task dir)
│   └── d433812a-8c5a-66e2-9302-b84eda63bb46
│       ├── alloc # (alloc dir)
│       └── cephfs-task # (task dir)
└── client
    ├── client-id
    └── csi
        └── monolith
            └── cephfs
                ├── csi.sock
                ├── per-alloc
                │   └── 408722ad-3448-b7cb-ca8f-dd476e3879f1
                │      └── registry-volume
                │          └── rw-file-system-multi-node-multi-writer
                └── staging
                    └── registry-volume
                        └── rw-file-system-multi-node-multi-writer

This has the following mounts:

  • Nomad bind-mounts /var/nomad/client/csi/monolith/cephfs into the plugin container at /csi.
  • The plugin "stages" the device by mounting it to /var/nomad/client/csi/monolith/cephfs/staging/registry-volume/rw-file-system-multi-node-multi-writer.
  • The plugin "publishes" /var/nomad/client/csi/monolith/cephfs/staging/registry-volume/rw-file-system-multi-node-multi-writer by bind-mounting to /var/nomad/client/csi/monolith/cephfs/per-alloc/408722ad-3448-b7cb-ca8f-dd476e3879f1/registry-volume/rw-file-system-multi-node-multi-writer when the docker-registry task starts.
  • Nomad has dockerd bind-mount /var/nomad/client/csi/monolith/cephfs/per-alloc/408722ad-3448-b7cb-ca8f-dd476e3879f1/registry-volume/rw-file-system-multi-node-multi-writer to /alloc/registry-volume, which we see show up in the alloc directory at /var/nomad/alloc/408722ad-3448-b7cb-ca8f-dd476e3879f1/alloc/registry-volume.

Note that there are no bind-mounts in the plugin's allocation directory /var/nomad/alloc/d433812a-8c5a-66e2-9302-b84eda63bb46, which is destroyed when the client GC's the plugin allocation.
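As an aside, you can verify this chain from the host with `findmnt` (a sketch, assuming Linux and util-linux; the alloc ID and paths are the illustrative ones above):

```shell
# Show which filesystem backs the published volume path.
# The per-alloc path uses the illustrative alloc ID from the tree above.
VOL=/var/nomad/client/csi/monolith/cephfs/per-alloc/408722ad-3448-b7cb-ca8f-dd476e3879f1/registry-volume/rw-file-system-multi-node-multi-writer
findmnt --target "$VOL"

# Or scan the raw mount table for ceph/csi entries:
grep -E 'ceph|csi' /proc/self/mountinfo
```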

So given what we know, let's look at our log entries again:

I0203 21:21:57.662270 1 cephcmds.go:60] ID: 41 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f command succeeded: mount [-t ceph 172.24.213.67,172.24.213.68,172.24.213.69,172.24.213.70,172.24.213.71:/docker-registry /csi/staging/registry-volume/rw-file-system-multi-node-multi-writer -o name=nomad,secretfile=/tmp/csi/keys/keyfile-028551869,_netdev]
...
I0203 21:21:57.705551 1 cephcmds.go:60] ID: 42 Req-ID: 53245bc5-e638-4b24-b692-505feaef8f2f command succeeded: mount [-o bind,_netdev /csi/staging/registry-volume/rw-file-system-multi-node-multi-writer /csi/per-alloc/408722ad-3448-b7cb-ca8f-dd476e3879f1/registry-volume/rw-file-system-multi-node-multi-writer]

That seems to follow what we expect!

I've also checked mounts there. Effect is exactly the same. I can't enter mounted ceph directory. My console is left freezing.

When you say "checked mounts there", do you mean you nomad alloc exec'd into the Ceph plugin container and you can't cd into the mounted directory? Or the docker registry container (which I'd expect if the application can't use it).

The next thing I'd try to do is to nomad alloc exec into the Ceph plugin container after the registry has been started up but before the plugin has been restarted, and see if you can call mount there. That might tell you whether the plugin is doing anything unexpected with the mount paths from the perspective of its own mount namespace.


kriestof commented Feb 4, 2022

I've got an answer from the ceph-csi folks, and it seems to be resolved. All you need to do is use host networking (add network_mode = "host" to the plugin task's config stanza). You can find more details here. Then, after stopping or restarting the plugin, the volume can still be accessed. Now I only need to reboot all the machines to get rid of these stale mounts, but I don't expect any further trouble here.
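For reference, a sketch of the change against the plugin job above (applied to both the node and controller tasks; everything else in the config stanza stays as posted):

```hcl
config {
  image        = "quay.io/cephcsi/cephcsi:v3.4.0"
  network_mode = "host"  # run the plugin in the host network namespace so
                         # the kernel cephfs mount survives plugin restarts
  # ... args, privileged, mounts, volumes as in the original job file ...
}
```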

It's been a long road; I've been trying to fix this for the last year. Couldn't be any better! Thank you!

@kriestof kriestof closed this as completed Feb 4, 2022

tgross commented Feb 4, 2022

Glad to hear we got that worked out! I'll try to add some notes about that in https://github.com/hashicorp/nomad/tree/main/demo/csi/ceph-csi-plugin (where we're demoing cephrbd, which doesn't seem to have this problem).


kriestof commented Feb 4, 2022

@tgross My guess is the same applies to rbd; it's possible nobody else has noticed it so far. Unless you know how the CSI plugin works, you might just occasionally (around once a month) end up with broken jobs, and the easiest fix is to restart them. On the other hand, it may be that the rbd plugin doesn't restart on its own, so you don't end up with randomly broken jobs. Also, I guess not many people run beta storage in production.

@Rakshith-R

Glad to hear we got that worked out! I'll try to add some notes about that in https://github.com/hashicorp/nomad/tree/main/demo/csi/ceph-csi-plugin (where we're demo'ing cephrbd which doesn't seem to have this problem).

@tgross
Actually, rbd also needs to be running on host networking; otherwise rbd writes will hang.
rook/rook#8085 (comment)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022