Starting container directly with runsc: GPU access blocked by operating system #11069
Comments
The provided OCI spec doesn't run with runc either. Here are a few problems with it:
I would recommend you run gVisor with Docker (which you confirmed works), copy the OCI spec generated by Docker, and work backwards to the desired OCI spec. For the rootfs, instead of bind mounting the host /usr and /lib directories, maybe create the rootfs from an exported container image (a hedged sketch follows). In trying to run this OCI spec, I ended up completely bricking my test VM to the point that I need to delete it now 😭
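Not part of the original comment: a minimal sketch of that workflow, assuming the elided suggestion was to build the rootfs from an exported container image. The image tag and binary path are illustrative, and runsc spec is used only to produce a starting-point config.json.

```sh
# Hedged sketch: build a self-contained OCI bundle instead of bind-mounting
# host /usr and /lib. Image tag and binary path are illustrative assumptions.
mkdir -p bundle/rootfs && cd bundle

# Populate the rootfs from an image that already ships the CUDA userspace bits.
docker export "$(docker create nvidia/cuda:12.2.0-base-ubuntu22.04)" | tar -xf - -C rootfs

# Generate a baseline spec, then edit it toward the Docker-generated one.
runsc spec -- /usr/bin/nvidia-smi
```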
Awesome, thanks a lot, that seems to work! 🎉 I'll open up a separate issue for the rootless case, since that still persists. For completeness, here's what I did:
To get the OCI runtime spec:
I made some edits to trim the configuration, ending up with:
To start a container (a hedged reconstruction of the command is sketched below):
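The actual commands were trimmed above; this is a hedged reconstruction of the final run step, reusing the flags from the Description. The bundle path is an illustrative assumption.

```sh
# Hedged reconstruction, not the poster's exact command: run from the bundle
# directory containing the edited config.json. Bundle path is illustrative.
cd ~/bundle
sudo runsc --nvproxy --network=host --host-uds=all \
  --debug --debug-log=/tmp/logs/runsc.log \
  run "container"
```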
Description
I would like to use runsc to start a container with GPU access:
sudo runsc --nvproxy --strace --debug --debug-log=/tmp/logs/runsc.log --network=host --host-uds=all run "container"
This container appears to start, but when running nvidia-smi the following error is returned:
Failed to initialize NVML: GPU access blocked by the operating system
Note that GPU access works fine through Docker + gVisor (a hedged example of that invocation follows).
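The working Docker invocation is not shown in the issue; under Docker + gVisor it typically looks something like the sketch below. The image tag is illustrative, and it assumes runsc is registered as a Docker runtime and the NVIDIA container toolkit is installed.

```sh
# Hedged example of the Docker + gVisor path that works; assumes a "runsc"
# runtime entry in /etc/docker/daemon.json and nvidia-container-toolkit.
docker run --rm --runtime=runsc --gpus=all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```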
How can I resolve this issue?
Bonus question
How can I make it work with --rootless? I'm happy to open up a different issue, but if there's an easy/quick pointer then we can tackle it here.
With --rootless, the command fails during the nvidia-container-cli configure run by the gofer (same config.json as below). This error seems to originate from libnvidia-container, either here or here. I already tried adding all capabilities (to the gofer) and making them inheritable right before the nvidia-container-cli configure call (a capability-inspection sketch follows).
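One way to verify what the gofer actually holds at that point is to dump its capability sets from /proc while the rootless run is failing. This is a hedged sketch; the pgrep pattern is an assumption about the process name.

```sh
# Hedged sketch: inspect the gofer's capability sets during the failing run.
# The pgrep pattern is an assumption; adjust it to match the gofer process.
GOFER_PID="$(pgrep -f 'runsc.*gofer' | head -n1)"
grep '^Cap' "/proc/${GOFER_PID}/status"

# Decode the effective-set hex mask into capability names (needs libcap tools).
capsh --decode="$(awk '/^CapEff/ {print $2}' "/proc/${GOFER_PID}/status")"
```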
Steps to reproduce
1. ssh into the GCP instance and say yes to installing GPU drivers: gcloud compute ssh --zone=us-central1-f $vmName.
2. Run nvidia-smi as a sanity check - this should work.
3. Install gVisor release 20240807.
4. Create config.json. root.path needs to be updated with your home directory. The lib mounts are present to provide the libraries required by nvidia-smi, and these paths are added to LD_LIBRARY_PATH (a hedged sketch of these edits follows).
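The config.json itself is elided above; the kind of edit described (a bind mount for a host library directory plus LD_LIBRARY_PATH) might look roughly like this jq sketch. The directory is an illustrative assumption and depends on where the driver libraries actually live.

```sh
# Hedged sketch of the described config.json edits: bind-mount a host library
# directory into the rootfs and point LD_LIBRARY_PATH at it. Path is illustrative.
jq '.mounts += [{
      "destination": "/usr/lib/x86_64-linux-gnu",
      "source": "/usr/lib/x86_64-linux-gnu",
      "type": "bind",
      "options": ["rbind", "ro"]
    }]
    | .process.env += ["LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu"]' \
  config.json > config.json.new && mv config.json.new config.json
```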
.runsc version
runsc version release-20240807.0
spec: 1.1.0-rc.1
docker version (if using docker)
No response
uname
Linux lshi-gvisor-gpu 6.5.0-1025-gcp #27~22.04.1-Ubuntu SMP Tue Jul 16 23:03:39 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
I1021 20:39:11.873793 33296 strace.go:564] [ 1: 1] nvidia-smi E stat(0x7ef93efab480 /usr/bin/nvidia-modprobe, 0x7ebbe460cc00)
I1021 20:39:11.873816 33296 strace.go:602] [ 1: 1] nvidia-smi X stat(0x7ef93efab480 /usr/bin/nvidia-modprobe, 0x7ebbe460cc00 {dev=29, ino=3, mode=S_IFREG|S_ISUID|0o755, nlink=1, uid=0, gid=0, rdev=0, size=43344, blksize=4096, blocks=85, atime=2024-10-21 19:58:30.573566562 +0000 UTC, mtime=2024-10-21 18:53:12.461679304 +0000 UTC, ctime=2024-10-21 18:53:12.461679304 +0000 UTC}) = 0 (0x0) (10.499µs)
I1021 20:39:11.873833 33296 strace.go:559] [ 1: 1] nvidia-smi E geteuid()
I1021 20:39:11.873839 33296 strace.go:596] [ 1: 1] nvidia-smi X geteuid() = 0 (0x0) (341ns)
I1021 20:39:11.873848 33296 strace.go:570] [ 1: 1] nvidia-smi E openat(AT_FDCWD /, 0x7ef93efab32b /proc/devices, O_RDONLY|0x0, 0o0)
I1021 20:39:11.873859 33296 strace.go:608] [ 1: 1] nvidia-smi X openat(AT_FDCWD /, 0x7ef93efab32b /proc/devices, O_RDONLY|0x0, 0o0) = 0 (0x0) errno=2 (no such file or directory) (2.456µs)
I1021 20:39:11.873868 33296 strace.go:570] [ 1: 1] nvidia-smi E openat(AT_FDCWD /, 0x7ebbe460ccf0 /proc/driver/nvidia/capabilities/mig/config, O_RDONLY|0x0, 0o0)
I1021 20:39:11.873876 33296 strace.go:608] [ 1: 1] nvidia-smi X openat(AT_FDCWD /, 0x7ebbe460ccf0 /proc/driver/nvidia/capabilities/mig/config, O_RDONLY|0x0, 0o0) = 0 (0x0) errno=2 (no such file or directory) (1.745µs)
I1021 20:39:11.873885 33296 strace.go:564] [ 1: 1] nvidia-smi E stat(0x7ebbe460cc10 , 0x7ebbe460cb20)
I1021 20:39:11.873910 33296 strace.go:602] [ 1: 1] nvidia-smi X stat(0x7ebbe460cc10 , 0x7ebbe460cb20) = 0 (0x0) errno=2 (no such file or directory) (16.376µs)
I1021 20:39:11.873927 33296 strace.go:570] [ 1: 1] nvidia-smi E openat(AT_FDCWD /, 0x7ef93efab32b /proc/devices, O_RDONLY|0x0, 0o0)
I1021 20:39:11.873937 33296 strace.go:608] [ 1: 1] nvidia-smi X openat(AT_FDCWD /, 0x7ef93efab32b /proc/devices, O_RDONLY|0x0, 0o0) = 0 (0x0) errno=2 (no such file or directory) (3.263µs)
I1021 20:39:11.873946 33296 strace.go:564] [ 1: 1] nvidia-smi E stat(0x7ef93efab480 /usr/bin/nvidia-modprobe, 0x7ebbe460cc00)
I1021 20:39:11.873975 33296 strace.go:602] [ 1: 1] nvidia-smi X stat(0x7ef93efab480 /usr/bin/nvidia-modprobe, 0x7ebbe460cc00 {dev=29, ino=3, mode=S_IFREG|S_ISUID|0o755, nlink=1, uid=0, gid=0, rdev=0, size=43344, blksize=4096, blocks=85, atime=2024-10-21 19:58:30.573566562 +0000 UTC, mtime=2024-10-21 18:53:12.461679304 +0000 UTC, ctime=2024-10-21 18:53:12.461679304 +0000 UTC}) = 0 (0x0) (10.458µs)
I1021 20:39:11.873990 33296 strace.go:559] [ 1: 1] nvidia-smi E geteuid()
I1021 20:39:11.873996 33296 strace.go:596] [ 1: 1] nvidia-smi X geteuid() = 0 (0x0) (248ns)
I1021 20:39:11.874005 33296 strace.go:570] [ 1: 1] nvidia-smi E openat(AT_FDCWD /, 0x7ef93efab32b /proc/devices, O_RDONLY|0x0, 0o0)
I1021 20:39:11.874016 33296 strace.go:608] [ 1: 1] nvidia-smi X openat(AT_FDCWD /, 0x7ef93efab32b /proc/devices, O_RDONLY|0x0, 0o0) = 0 (0x0) errno=2 (no such file or directory) (2.198µs)
I1021 20:39:11.874025 33296 strace.go:570] [ 1: 1] nvidia-smi E openat(AT_FDCWD /, 0x7ebbe460ccf0 /proc/driver/nvidia/capabilities/mig/monitor, O_RDONLY|0x0, 0o0)
I1021 20:39:11.874033 33296 strace.go:608] [ 1: 1] nvidia-smi X openat(AT_FDCWD /, 0x7ebbe460ccf0 /proc/driver/nvidia/capabilities/mig/monitor, O_RDONLY|0x0, 0o0) = 0 (0x0) errno=2 (no such file or directory) (1.696µs)
I1021 20:39:11.874042 33296 strace.go:564] [ 1: 1] nvidia-smi E stat(0x7ebbe460cc10 , 0x7ebbe460cb20)
I1021 20:39:11.874049 33296 strace.go:602] [ 1: 1] nvidia-smi X stat(0x7ebbe460cc10 , 0x7ebbe460cb20) = 0 (0x0) errno=2 (no such file or directory) (640ns)
I1021 20:39:11.874069 33296 strace.go:570] [ 1: 1] nvidia-smi E newfstatat(0x1 host:[2], 0x7ef9403d844f , 0x7ebbe460d260, 0x1000)
I1021 20:39:11.874086 33296 strace.go:608] [ 1: 1] nvidia-smi X newfstatat(0x1 host:[2], 0x7ef9403d844f , 0x7ebbe460d260 {dev=8, ino=2, mode=S_IFCHR|0o620, nlink=1, uid=0, gid=0, rdev=0, size=0, blksize=1024, blocks=0, atime=2024-10-21 20:39:07.411607567 +0000 UTC, mtime=2024-10-21 20:39:07.411607567 +0000 UTC, ctime=2024-10-21 20:39:11.659911063 +0000 UTC}, 0x1000) = 0 (0x0) (5.233µs)
D1021 20:39:11.874098 33296 usertrap_amd64.go:210] [ 1: 1] Found the pattern at ip 7ef940319f43:sysno 16
D1021 20:39:11.874105 33296 usertrap_amd64.go:122] [ 1: 1] Allocate a new trap: 0xc000140090 26
D1021 20:39:11.874112 33296 usertrap_amd64.go:223] [ 1: 1] Apply the binary patch addr 7ef940319f43 trap addr 62820 ([184 16 0 0 0 15 5] -> [255 36 37 32 40 6 0])
I1021 20:39:11.874122 33296 strace.go:567] [ 1: 1] nvidia-smi E ioctl(0x1 host:[2], 0x5401, 0x7ebbe460d1c0)
I1021 20:39:11.874130 33296 strace.go:605] [ 1: 1] nvidia-smi X ioctl(0x1 host:[2], 0x5401, 0x7ebbe460d1c0) = 0 (0x0) errno=25 (not a typewriter) (1.128µs)
D1021 20:39:11.874178 33296 usertrap_amd64.go:210] [ 1: 1] Found the pattern at ip 7ef940314880:sysno 1
D1021 20:39:11.874185 33296 usertrap_amd64.go:122] [ 1: 1] Allocate a new trap: 0xc000140090 27
D1021 20:39:11.874192 33296 usertrap_amd64.go:223] [ 1: 1] Apply the binary patch addr 7ef940314880 trap addr 62870 ([184 1 0 0 0 15 5] -> [255 36 37 112 40 6 0])
I1021 20:39:11.874203 33296 strace.go:567] [ 1: 1] nvidia-smi E write(0x1 host:[2], 0x708f20 "Failed to initialize NVML: GPU access blocked by the operating system\n", 0x46)
I1021 20:39:11.874241 33296 strace.go:605] [ 1: 1] nvidia-smi X write(0x1 host:[2], ..., 0x46) = 70 (0x46) (30.963µs)
I1021 20:39:11.874260 33296 strace.go:561] [ 1: 1] nvidia-smi E exit_group(0x11)
I1021 20:39:11.874275 33296 strace.go:599] [ 1: 1] nvidia-smi X exit_group(0x11) = 0 (0x0) (8.346µs)
D1021 20:39:11.874282 33296 task_exit.go:204] [ 1: 1] Transitioning from exit state TaskExitNone to TaskExitInitiated
D1021 20:39:11.875153 1 connection.go:127] sock read failed, closing connection: EOF
D1021 20:39:11.875201 1 connection.go:127] sock read failed, closing connection: EOF
D1021 20:39:11.875355 1 connection.go:127] sock read failed, closing connection: EOF
D1021 20:39:11.875442 1 connection.go:127] sock read failed, closing connection: EOF
D1021 20:39:11.875491 1 connection.go:127] sock read failed, closing connection: EOF
D1021 20:39:11.875568 1 connection.go:127] sock read failed, closing connection: EOF
I1021 20:39:11.875592 33296 loader.go:1215] Gofer socket disconnected, killing container "container"
D1021 20:39:11.875619 1 connection.go:127] sock read failed, closing connection: EOF
D1021 20:39:11.875670 33296 task_exit.go:361] [ 1: 1] Init process terminating, killing namespace
D1021 20:39:11.875708 33296 task_signals.go:481] [ 1: 1] No task notified of signal 9
D1021 20:39:11.875733 33296 task_exit.go:204] [ 1: 1] Transitioning from exit state TaskExitInitiated to TaskExitZombie
D1021 20:39:11.875745 33296 task_exit.go:204] [ 1: 1] Transitioning from exit state TaskExitZombie to TaskExitDead
D1021 20:39:11.875775 33296 controller.go:681] containerManager.Wait returned, cid: container, waitStatus: 0x1100, err:
I1021 20:39:11.875781 33296 boot.go:534] application exiting with exit status 17
D1021 20:39:11.875816 33296 urpc.go:571] urpc: successfully marshalled 39 bytes.
I1021 20:39:11.875844 33296 watchdog.go:221] Stopping watchdog
I1021 20:39:11.875858 33296 watchdog.go:225] Watchdog stopped
D1021 20:39:11.875884 33249 urpc.go:614] urpc: unmarshal success.
D1021 20:39:11.876068 1 connection.go:127] sock read failed, closing connection: EOF
I1021 20:39:11.876071 33296 main.go:222] Exiting with status: 4352
I1021 20:39:11.876133 1 gofer.go:341] All lisafs servers exited.
I1021 20:39:11.876198 1 main.go:222] Exiting with status: 0
D1021 20:39:11.878341 33249 container.go:790] Destroy container, cid: container
D1021 20:39:11.878395 33249 container.go:1087] Destroying container, cid: container
D1021 20:39:11.878404 33249 sandbox.go:1602] Destroying root container by destroying sandbox, cid: container
D1021 20:39:11.878412 33249 sandbox.go:1299] Destroying sandbox "container"
D1021 20:39:11.878425 33249 container.go:1101] Killing gofer for container, cid: container, PID: 33257
D1021 20:39:11.878453 33249 cgroup_v2.go:177] Deleting cgroup "/sys/fs/cgroup/container"
D1021 20:39:11.878474 33249 cgroup_v2.go:188] Removing cgroup for path="/sys/fs/cgroup/container"
I1021 20:39:11.878623 33249 main.go:222] Exiting with status: 4352