[CSI] requests to job left freezing after ceph csi plugin restart #11999
Hi @kriestof, there may be something related to #11784 happening here in terms of the plugin restart. But architecturally speaking, when a CSI plugin exits, nothing at all should happen to the volumes it mounted. The only way this should cause a problem is if the job that wants the volume also restarts/reschedules, because Nomad can't talk to the plugin if it's not running. Nomad doesn't touch the volumes for a running job at all, no matter what happens to the plugins that mounted them, so there may be a bug in the Ceph plugin you're using. Can you provide the logs from the plugin allocation? It may have some clues.
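To make that concrete, here is a minimal, hypothetical sketch of how a job consumes a CSI volume with the Docker driver. The volume name and access/attachment modes below match what shows up later in this thread, but the job layout, image, and mount path are illustrative assumptions, not the reporter's actual job file:

```hcl
# Illustrative sketch only; names, image, and mount path are assumptions.
job "docker-registry" {
  datacenters = ["dc1"]

  group "registry" {
    # The volume claim happens when the allocation is placed. The CSI plugin
    # is only involved at mount/unmount time, not while the task is running.
    # Depending on the Nomad version, access/attachment mode may instead be
    # set when the volume is registered rather than in the job spec.
    volume "registry-volume" {
      type            = "csi"
      source          = "registry-volume"
      attachment_mode = "file-system"
      access_mode     = "multi-node-multi-writer"
    }

    task "docker-registry" {
      driver = "docker"

      config {
        image = "registry:2"
      }

      # Inside the task this is an ordinary kernel mount; the task never
      # talks to the CSI plugin directly.
      volume_mount {
        volume      = "registry-volume"
        destination = "/var/lib/registry"
      }
    }
  }
}
```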
@tgross thank you for the quick response. I've managed to get a better log level on ceph-csi (added "--v=5"). It seems that while the job is starting, ceph-csi mounts the volume properly. But then, after the plugin restart, nothing at all happens. That's understandable: the request to open the directory is stuck because the plugin does not respond (or the request never even reaches it). After a new instance of the ceph-csi plugin job is created, for some reason it cannot handle the previous connection. It might as well be a problem with ceph-csi; I'm just not sure at all how to tackle that one.

ceph csi controller before registry job start
ceph csi node before registry job start
ceph csi controller after registry job start
ceph csi node after registry job start:
I think here's where the confusion is. Once your docker registry job has successfully mounted the volume, it no longer needs to talk to the plugin. We can see that worked successfully in your logs here:
At this point, your registry task is making filesystem calls, mediated by the kernel, to the Ceph MDS, not to the Ceph plugin. So if the Ceph plugin goes away, it shouldn't touch those mounts on the registry job. I think what might be interesting is to look at the logs on the plugin after it restarts and compare them against the same time frame in the Ceph MDS logs, because that's what you said caused the problem. Did the registry task's allocation logs show anything interesting?
Ok, so now I understand better, and it gets even more confusing. Based on what you said, stopping the plugin after the job has started should not impact the job. But when I stop the plugin (even without starting it again), the job's volume is left frozen. There is nothing interesting in the Ceph logs or the job logs. The only thing I get is from dmesg on the Nomad client:
I've also checked the mounts there. The effect is exactly the same: I can't enter the mounted Ceph directory, and my console just hangs. I'm not too deep into CSI or kernel details, but my current bet is that ceph-csi is needed even after startup. I guess the mount is executed inside the plugin container, then exposed with a bind mount outside of the container and bind-mounted into any job. Hence, maybe if the plugin is stopped, the connection with Ceph gets broken because of some process running there. I'll try to contact the ceph-csi folks and ask what they think about it.
Yeah I think you might be on to something here but I'm not seeing it in the Ceph plugin logs. Just to help bring you up to speed... we have:
Here's what we'd expect the Nomad datadir on the client to look like:

root@linux:/var/nomad# tree .
.
├── alloc
│ ├── 408722ad-3448-b7cb-ca8f-dd476e3879f1
│ │ ├── alloc
│ │ │ ├── data
│ │ │ ├── logs
│ │ │ ├── tmp
│ │ │ └── registry-volume # (final mount)
│ │ └── docker-registry # (task dir)
│ └── d433812a-8c5a-66e2-9302-b84eda63bb46
│ ├── alloc # (alloc dir)
│ └── cephfs-task # (task dir)
└── client
├── client-id
└── csi
└── monolith
└── cephfs
├── csi.sock
├── per-alloc
│ └── 408722ad-3448-b7cb-ca8f-dd476e3879f1
│ └── registry-volume
│ └── rw-file-system-multi-node-multi-writer
└── staging
└── registry-volume
                └── rw-file-system-multi-node-multi-writer

This has the following mounts:
Note that there are no bind-mounts in the plugin's allocation directory. So given what we know, let's look at our log entries again:
That seems to follow what we expect!
When you say "checked mounts there", do you mean you …

The next thing I'd try to do is to …
I've got an answer from the ceph-csi folks, and it seems to be resolved. All you need to do is use the host network for the plugin (adding …). It's been a long way; I've been trying to fix this for the last year. Couldn't be any better! Thank you!
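For anyone who finds this later, here is a rough, hypothetical sketch of a ceph-csi node plugin task running on the host network with Nomad's Docker driver. The comment above only says that switching the plugin to the host network resolved the issue; the job name, image tag, and plugin arguments below are assumptions for illustration:

```hcl
# Hypothetical sketch; only the host-network setting reflects the fix
# described above. Names, image tag, and args are illustrative.
job "plugin-cephfs-node" {
  datacenters = ["dc1"]
  type        = "system"

  group "nodes" {
    task "ceph-node" {
      driver = "docker"

      config {
        image = "quay.io/cephcsi/cephcsi:v3.5.1"

        # Run the plugin in the host network namespace so the Ceph client
        # connections it sets up are not tied to a per-allocation network
        # namespace that disappears when the plugin restarts.
        network_mode = "host"
        privileged   = true

        args = [
          "--type=cephfs",
          "--drivername=cephfs.csi.ceph.com",
          "--nodeserver=true",
          "--endpoint=unix:///csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "cephfs"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}
```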
Glad to hear we got that worked out! I'll try to add some notes about that in https://github.com/hashicorp/nomad/tree/main/demo/csi/ceph-csi-plugin (where we're demo'ing …).
@tgross My guess is the same should apply to rbd. It's possible nobody else has noticed it so far. Unless you know how the CSI plugin works, you might just end up occasionally (around once a month) with broken jobs, and the easiest solution is to restart them. On the other hand, it may be that the rbd plugin doesn't restart on its own, so you don't end up with randomly broken jobs. Also, I guess not many people run the beta storage feature in production.
@tgross
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v1.2.5
Operating system and Environment details
Arch Linux
kernel 5.16.5-arch1-1
Issue
I've set up a docker registry that stores its images on a Ceph filesystem using ceph-csi. Everything works fine until the ceph-csi plugin is restarted; then none of the docker registry images can be pulled. It is especially troubling because the plugin occasionally restarts itself, and then the docker registry job has to be restarted manually as well. Unfortunately, I'm not able to narrow down the cause: there is no descriptive error in either ceph-csi or the docker-registry job. This kind of error also happened on the 1.1.x versions of Nomad (actually even before) and with different ceph-csi plugin versions.
Reproduction steps
Steps to reproduce are a little bit troublesome -- you need a running Ceph instance. But maybe somebody who runs Ceph with Nomad could share their results. I'm also open to testing another, more minimal example, but so far no easy example with storage + CSI comes to mind.
Expected Result
After a ceph-csi restart, running jobs should either keep working with proper mounts or be restarted automatically.
Actual Result
It seems that after the CSI plugin disconnects, the mounted volume directory in the docker registry job is left frozen: any access to it hangs.
Job file (if appropriate)
plugin-ceph-node.nomad
registry-volume.hcl:
image-registry.hcl:
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)