Starting GPU container with rootless runsc: operation not permitted #11076
Comments
Can you get it to work without gVisor in rootless mode (with runc)? Are you setting up the nvidia-container-runtime with rootless mode: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#rootless-mode? runsc is trying to emulate the |
Not yet. Using the same
Yeah, it's
Ok, I think I've sorted it out in NVIDIA/libnvidia-container#288. tl;dr, the
argv := []string{
	cliPath,
	"--load-kmods",
	"--user=root:root", // Additional flag
	"configure",
	fmt.Sprintf("--ldconfig=%s", ldconfigPath), // Edited to remove '@'
	"--no-cgroups",
	"--utility",
	"--compute",
	fmt.Sprintf("--pid=%d", goferCmd.Process.Pid),
	fmt.Sprintf("--device=%s", devices),
	spec.Root.Path,
}
Then Should gVisor pass the updated flags when |
Thanks @sfc-gh-lshi for doing the investigation! Is the Lines 82 to 91 in 94aa652
There are 2 ways in which Rootless containers can be run:
(2) is handled in different ways in runsc. You can see checks like this: gvisor/runsc/container/container.go Lines 1354 to 1358 in 94aa652
I assume the changes you have mentioned will only work in (1). It is possible to achieve (1) without the |
I agree that There are other ways to approach this, you can introduce an additional |
Description
In #11069 I obtained a working OCI runtime spec for using sudo runsc to directly access the GPU. The same configuration fails to start in rootless mode, though; the failure appears in the runsc debug logs below. This error seems to originate from libnvidia-container, either here or here.
Changes made
nvidia-container-cli is invoked through gofer, so I started adding capabilities there to see if that is the issue. Unfortunately, that didn't help, but here is what I did:
1. Download the source: wget https://github.com/google/gvisor/archive/refs/tags/release-20240807.0.zip && unzip release-20240807.0.zip.
2. Modify runsc/container/container.go to provide gofer with all capabilities and make them inheritable, so that nvidia-container-cli configure is called with the same capabilities.
3. In runsc/container/BUILD, add @com_github_syndtr_gocapability//capability:go_default_library to deps for container.
4. Build runsc: mkdir -p bin && make copy TARGETS=runsc DESTINATION=bin/.
5. Run runsc with the reproduction steps below; /home/$USER/runsc.log will show that all capabilities are set and inheritable.
Steps to reproduce
Create a rootfs, and add this config.json.
runsc version
runsc version release-20240807.0
spec: 1.1.0-rc.1
docker version (if using docker)
No response
uname
Linux lshi-gvisor-gpu 6.5.0-1025-gcp #27~22.04.1-Ubuntu SMP Tue Jul 16 23:03:39 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
D1022 23:27:09.028129 36265 container.go:544] Run container, cid: container, rootDir: "/run/user/1001/runsc"
D1022 23:27:09.028148 36265 container.go:200] Create container, cid: container, rootDir: "/run/user/1001/runsc"
D1022 23:27:09.028238 36265 container.go:262] Creating new sandbox for container, cid: container
D1022 23:27:09.028285 36265 cgroup.go:428] New cgroup for pid: self, *cgroup.cgroupV2: &{Mountpoint:/sys/fs/cgroup Path:/container Controllers:[cpuset cpu io memory hugetlb pids rdma misc] Own:[]}
D1022 23:27:09.028326 36265 cgroup_v2.go:132] Installing cgroup path "/sys/fs/cgroup/container"
D1022 23:27:09.028345 36265 cgroup_v2.go:177] Deleting cgroup "/sys/fs/cgroup/container"
W1022 23:27:09.028369 36265 container.go:1767] Skipping cgroup configuration in rootless mode: open /sys/fs/cgroup/cgroup.subtree_control: permission denied
D1022 23:27:09.028428 36265 container.go:1919] Executing ["/sbin/modprobe" "nvidia"]
D1022 23:27:09.030535 36265 container.go:1919] Executing ["/sbin/modprobe" "nvidia-uvm"]
D1022 23:27:09.032889 36265 donation.go:32] Donating FD 3: "/home/lshi/runsc.log"
D1022 23:27:09.032907 36265 donation.go:32] Donating FD 4: "/home/lshi/tmp/config.json"
D1022 23:27:09.032918 36265 donation.go:32] Donating FD 5: "|1"
D1022 23:27:09.032923 36265 donation.go:32] Donating FD 6: "gofer IO FD"
D1022 23:27:09.032927 36265 donation.go:32] Donating FD 7: "gofer dev IO FD"
D1022 23:27:09.032931 36265 donation.go:32] Donating FD 8: "nvproxy sync gofer FD"
D1022 23:27:09.032935 36265 container.go:1364] Starting gofer: /proc/self/exe [runsc-gofer --nvproxy=true --root=/run/user/1001/runsc --debug=true --debug-log=/home/lshi/runsc.log --host-uds=all --network=host --strace=true --rootless=true --debug-log-fd=3 gofer --bundle /home/lshi/tmp --gofer-mount-confs=lisafs:none --spec-fd=4 --mounts-fd=5 --io-fds=6 --dev-io-fd=7 --sync-nvproxy-fd=8]
I1022 23:27:09.034687 36265 container.go:1368] Gofer started, PID: 36273
D1022 23:27:09.034714 36265 container.go:2017] Executing ["/usr/bin/nvidia-container-cli" "--load-kmods" "configure" "--ldconfig=@/sbin/ldconfig.real" "--no-cgroups" "--utility" "--compute" "--pid=36273" "--device=all" "/home/lshi/tmp/rootfs"]
D1022 23:27:09.037032 36265 container.go:790] Destroy container, cid: container
D1022 23:27:09.037081 36265 container.go:1101] Killing gofer for container, cid: container, PID: 36273
W1022 23:27:09.038009 36265 util.go:64] FATAL ERROR: running container: creating container: cannot create gofer process: nvproxy setup: nvidia-container-cli configure failed, err: exit status 1
stdout:
stderr: nvidia-container-cli: initialization error: privilege change failed: operation not permitted
W1022 23:27:09.038116 36265 main.go:231] Failure to execute command, err: 1