Local Testing (with GTX1080) - error creating driver: failed to create device library: failed to locate driver libraries: error locating "libnvidia-ml.so.1" #204

Open
wenzel-felix opened this issue Nov 12, 2024 · 3 comments


wenzel-felix commented Nov 12, 2024

Hi,

I was following the tutorial to get DRA running, and initially everything was working as expected until the installation of the driver.

The kubelet plugin fails immediately with:

Error: error creating driver: failed to create device library: failed to locate driver libraries: error locating "libnvidia-ml.so.1"
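
For reference, a minimal sketch of how to check whether the NVML library is visible inside the WSL2 distribution at all (the /usr/lib/wsl paths are assumptions based on the usual WSL driver layout):

# Search the WSL driver store and the dynamic linker cache for libnvidia-ml
find /usr/lib/wsl -name 'libnvidia-ml.so*' 2>/dev/null
ldconfig -p | grep libnvidia-ml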

Some extra information: in order for the cluster to be able to mount the CDI device, I needed to change the name/path from runtime.nvidia.com/gpu=all to nvidia.com/gpu=all, since that is the only CDI device I see (listing and a regeneration sketch below).

➜  k8s-dra-driver git:(main) ✗ sudo nvidia-ctk cdi list
INFO[0000] Found 1 CDI devices
nvidia.com/gpu=all
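
For completeness, a sketch of how the CDI spec can be regenerated and inspected with nvidia-ctk (standard toolkit invocation; /etc/cdi/nvidia.yaml is simply the default spec location and may differ from what the tutorial expects):

# Regenerate the CDI spec and confirm which device names it exposes
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
sudo nvidia-ctk cdi list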

Here are some details about my setup:

GPU: GTX 1080
OS: Windows/WSL

nvidia-container-toolkit version:

NVIDIA Container Toolkit CLI version 1.17.0
commit: 5bc031544833253e3ab6a36daec376dc13a4f479

runtime config:

➜  k8s-dra-driver git:(main) ✗ nvidia-ctk runtime configure --dry-run
INFO[0000] Loading config from /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
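
(For anyone reproducing this: a sketch of the non-dry-run counterpart that I believe produces the config above, assuming systemd is enabled in the WSL2 distribution for the restart.)

# Write the nvidia runtime into /etc/docker/daemon.json and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker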

nvidia-smi command on host:

➜  k8s-dra-driver git:(main) ✗ nvidia-smi
Tue Nov 12 20:54:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.02              Driver Version: 566.03         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080        On  |   00000000:01:00.0  On |                  N/A |
|  0%   58C    P0             45W /  210W |    1907MiB /   8192MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

nvidia-smi command on node:

root@k8s-dra-driver-cluster-worker:/# nvidia-smi
Tue Nov 12 19:55:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.02              Driver Version: 566.03         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080        On  |   00000000:01:00.0  On |                  N/A |
|  0%   61C    P0             46W /  210W |    1903MiB /   8192MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Any help would be highly appreciated!

I already saw similar behavior in another issue, but it's not the same: #65

wenzel-felix (Author) commented

After some more investigation and manually adjusting the CDI file, I now have the following problem:

➜  k8s-dra-driver git:(main) ✗ k logs -n nvidia nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-5pzdd
Error: error creating driver: error enumerating all possible devices: error enumerating GPUs and MIG devices: error initializing NVML: Not Supported

My current /etc/cdi/nvidia.yaml looks as follows - with the containerPaths manually adjusted:

cdiVersion: 0.3.0
containerEdits:
  env:
  - NVIDIA_VISIBLE_DEVICES=void
  hooks:
  - args:
    - nvidia-cdi-hook
    - create-symlinks
    - --link
    - /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/nvidia-smi::/usr/bin/nvidia-smi
    hookName: createContainer
    path: /usr/bin/nvidia-cdi-hook
  - args:
    - nvidia-cdi-hook
    - update-ldcache
    - --folder
    - /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03
    - --folder
    - /usr/lib/wsl/lib
    hookName: createContainer
    path: /usr/bin/nvidia-cdi-hook
  mounts:
  - containerPath: /usr/lib64/libdxcore.so
    hostPath: /usr/lib/wsl/lib/libdxcore.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libcuda.so.1.1
    hostPath: /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/libcuda.so.1.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libcuda_loader.so
    hostPath: /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/libcuda_loader.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvdxgdmal.so.1
    hostPath: /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/libnvdxgdmal.so.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-ml.so.1
    hostPath: /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/libnvidia-ml.so.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-ml_loader.so
    hostPath: /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/libnvidia-ml_loader.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-ptxjitcompiler.so.1
    hostPath: /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/libnvidia-ptxjitcompiler.so.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /sbin/nvcubins.bin
    hostPath: /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/nvcubins.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /sbin/nvidia-smi
    hostPath: /usr/lib/wsl/drivers/nvmdi.inf_amd64_8de72197f4c7fa03/nvidia-smi
    options:
    - ro
    - nosuid
    - nodev
    - bind
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/dxg
  name: all
kind: nvidia.com/gpu
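
For what it's worth, a sketch of the checks I would run next to see whether the mounts from the spec above actually land in the kubelet plugin pod (pod name taken from the log command above; this assumes the plugin image ships a shell, ls, and ldconfig):

# Verify that the adjusted container paths are present and known to the linker inside the pod
kubectl exec -n nvidia nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-5pzdd -- ls -l /usr/lib64/libnvidia-ml.so.1
kubectl exec -n nvidia nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-5pzdd -- ldconfig -p | grep -i nvidia-ml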

elezar (Member) commented Nov 28, 2024

@wenzel-felix one thing to note is that you're running on WSL2, which has not been tested in the context of the k8s-dra-driver. There also seems to be unexpected behaviour using the runtime.nvidia.com/gpu=all device in the NVIDIA Container Runtime for this case.

In order to provide some more context, could you confirm how you are installing docker? Are you running docker-ce in your WSL2 distribution, or are you using Docker Desktop?

wenzel-felix (Author) commented

Hi @elezar, I'm aware that this was not tested with WSL2; I was just investigating myself whether I could get it running.
I'm using docker-ce in WSL2, installed according to the official Docker installation instructions.

➜  ~ apt-cache policy docker-ce
docker-ce:
  Installed: 5:27.3.1-1~ubuntu.22.04~jammy
  Candidate: 5:27.3.1-1~ubuntu.22.04~jammy
  Version table:
 *** 5:27.3.1-1~ubuntu.22.04~jammy 500
        500 https://download.docker.com/linux/ubuntu jammy/stable amd64 Packages
        100 /var/lib/dpkg/status
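
To confirm it is docker-ce inside the distribution rather than Docker Desktop, these are the kinds of checks I can also provide output for (standard docker CLI queries; the template fields are what I believe the info command exposes):

# Show which engine the CLI talks to and its reported OS / default runtime
docker context show
docker info --format '{{.OperatingSystem}} / {{.DefaultRuntime}}'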

I just thought it would be an interesting use case for easily showcasing the basic DRA feature, as some people probably run Windows with WSL2.
So if you don't consider it interesting, we can also close the issue 👍🏼
