DRA driver does not pick up all GPUs on the node #32

Open
asm582 opened this issue Nov 30, 2023 · 8 comments
Assignees
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments

@asm582

asm582 commented Nov 30, 2023

I have enabled MIG mode on both GPUs on a single node, but the NAS (NodeAllocationState) object shows that one of the GPUs is not MIG-enabled:

  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    0
      Memory Bytes:             85899345920
      Mig Enabled:              true
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-1a9afbae-5932-54f8-c2c4-a863888d45bb
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    1
      Memory Bytes:             85899345920
      Mig Enabled:              false
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-713eebac-08df-c534-6c98-8d5055ca97a9

Output of nvidia-smi:

[root@nvd-srv-02 k8s-dra-driver]# nvidia-smi
Thu Nov 30 11:15:31 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:17:00.0 Off |                   On |
| N/A   35C    P0              45W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:65:00.0 Off |                   On |
| N/A   35C    P0              46W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |

Can you please share how the NAS object can be updated correctly?

@klueska
Collaborator

klueska commented Nov 30, 2023

Hmm, this is unexpected. We just pull the MIG state directly from NVML (the underlying library that nvidia-smi uses as well).

Is it possible that the plugin came online when only one GPU had MIG enabled and the other didn't? The plugin doesn't do any real-time reconciliation of the GPU state -- the only way to get it to update is to restart the plugin.

So can you try restarting the plugin?

And if that doesn't work, can you try deleting the NAS object and then restarting the plugin? This shouldn't be necessary, but I'm curious whether it resolves the issue.
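
The two steps suggested above could be scripted roughly as follows. This is a sketch, not a documented procedure: the namespace and node name are the ones that appear in the kubectl output elsewhere in this thread, the DaemonSet name nvidia-k8s-dra-driver-kubelet-plugin is an assumption inferred from the plugin pod name, and the run helper with its DRY_RUN switch is purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch only. Names below come from this thread, except PLUGIN_DS, which is an
# ASSUMED DaemonSet name inferred from the kubelet plugin pod name.
NAMESPACE="${NAMESPACE:-nvidia-dra-driver}"
PLUGIN_DS="${PLUGIN_DS:-nvidia-k8s-dra-driver-kubelet-plugin}"
NODE="${NODE:-k8s-dra-driver-cluster-worker}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: restart the kubelet plugin so that it queries the GPU state again on startup.
restart_plugin() {
  run kubectl -n "$NAMESPACE" rollout restart "daemonset/$PLUGIN_DS"
}

# Step 2 (only if step 1 does not help): delete the NAS object, then restart the plugin.
reset_nas() {
  run kubectl -n "$NAMESPACE" delete "nas/$NODE"
  restart_plugin
}
```

Running with DRY_RUN=1 first (for example, DRY_RUN=1 reset_nas) prints the kubectl commands without touching the cluster, which is a cheap way to check the assumed names before executing anything.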

@asm582
Author

asm582 commented Nov 30, 2023

Thanks, I will delete and recreate the cluster, then reinstall the DRA driver.

@asm582
Author

asm582 commented Nov 30, 2023

Update: I recreated the KinD cluster and redeployed the previously built driver image, but still no luck:

[root@nvd-srv-02 k8s-dra-driver]# kubectl get pods -n nvidia-dra-driver
NAME                                               READY   STATUS    RESTARTS   AGE
nvidia-k8s-dra-driver-controller-6d6b45756-47khb   1/1     Running   0          2m
nvidia-k8s-dra-driver-kubelet-plugin-fkz4d         1/1     Running   0          2m
Spec:
  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    1
      Memory Bytes:             85899345920
      Mig Enabled:              false
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-713eebac-08df-c534-6c98-8d5055ca97a9
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    0
      Memory Bytes:             85899345920
      Mig Enabled:              true
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-1a9afbae-5932-54f8-c2c4-a863888d45bb

@elezar
Member

elezar commented Nov 30, 2023

Could you confirm that running nvidia-smi in the kind worker node shows MIG as enabled?

elezar closed this as completed Nov 30, 2023
@asm582
Author

asm582 commented Dec 4, 2023

Thanks, could you please recommend the container image that I should use to run the command?

elezar reopened this Dec 4, 2023
@elezar
Member

elezar commented Dec 4, 2023

Running:

$ docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED       STATUS       PORTS                       NAMES
0141a7534ebf   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   3 weeks ago   Up 3 weeks   127.0.0.1:44521->6443/tcp   k8s-dra-driver-cluster-control-plane
255a4db134af   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   3 weeks ago   Up 3 weeks                               k8s-dra-driver-cluster-worker

shows the kind nodes created by the demo.

Running:

$ docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi

is equivalent to running nvidia-smi on a k8s node (in this case, the containerized kind worker node).
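
For a more compact check than the full table, the same query can be limited to the MIG mode of each GPU. The sketch below assumes the worker container name shown in the docker ps output above and uses nvidia-smi's CSV query mode; the function names query_mig_modes and normalize_mig_csv are hypothetical helpers, not part of the driver or the demo.

```shell
#!/usr/bin/env bash
# Query index, UUID, and current MIG mode for every GPU visible inside the kind worker.
# $1 is the worker container name (defaults to the one shown by docker ps in this thread).
query_mig_modes() {
  docker exec "${1:-k8s-dra-driver-cluster-worker}" \
    nvidia-smi --query-gpu=index,uuid,mig.mode.current --format=csv,noheader
}

# Normalize CSV rows such as "0, GPU-xxxx, Enabled" to "0 GPU-xxxx true",
# which matches the true/false wording of the NAS object's "Mig Enabled" field.
normalize_mig_csv() {
  awk -F', *' '{
    gsub(/[ \t\r]+$/, "", $3)   # drop trailing whitespace or carriage returns
    print $1, $2, ($3 == "Enabled" ? "true" : "false")
  }'
}
```

Usage would be query_mig_modes | normalize_mig_csv, which prints one line per GPU that can be compared by eye against the Index, Uuid, and Mig Enabled fields in the NAS object.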

@asm582
Author

asm582 commented Dec 4, 2023

Thank you for sharing the command. Below is the output:

[root@nvd-srv-02 k8s-dra-driver]# docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED              STATUS              PORTS                       NAMES
b5609ebd1675   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   About a minute ago   Up About a minute   127.0.0.1:34917->6443/tcp   k8s-dra-driver-cluster-control-plane
5ef8b180a289   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   About a minute ago   Up About a minute                               k8s-dra-driver-cluster-worker
[root@nvd-srv-02 k8s-dra-driver]# docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi
Mon Dec  4 14:29:52 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:17:00.0 Off |                   On |
| N/A   36C    P0              45W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:65:00.0 Off |                   On |
| N/A   35C    P0              46W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[root@nvd-srv-02 k8s-dra-driver]# kubectl get nodes
NAME                                   STATUS   ROLES           AGE     VERSION
k8s-dra-driver-cluster-control-plane   Ready    control-plane   2m36s   v1.27.1
k8s-dra-driver-cluster-worker          Ready    <none>          2m12s   v1.27.1
[root@nvd-srv-02 k8s-dra-driver]# kubectl describe nas/k8s-dra-driver-cluster-worker -n nvidia-dra-driver
Name:         k8s-dra-driver-cluster-worker
Namespace:    nvidia-dra-driver
Labels:       <none>
Annotations:  <none>
API Version:  nas.gpu.resource.nvidia.com/v1alpha1
Kind:         NodeAllocationState
Metadata:
  Creation Timestamp:  2023-12-04T14:29:09Z
  Generation:          4
  Owner References:
    API Version:     v1
    Kind:            Node
    Name:            k8s-dra-driver-cluster-worker
    UID:             ddb095d1-a608-4f70-a7b2-bc55ad81ed4c
  Resource Version:  587
  UID:               863efe97-f965-4f42-9816-88e5fc3bb860
Spec:
  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    0
      Memory Bytes:             85899345920
      Mig Enabled:              true
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-1a9afbae-5932-54f8-c2c4-a863888d45bb
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.0
      Index:                    1
      Memory Bytes:             85899345920
      Mig Enabled:              false
      Product Name:             NVIDIA A100 80GB PCIe
      Uuid:                     GPU-713eebac-08df-c534-6c98-8d5055ca97a9
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   1
        Start:  0
        Size:   1
        Start:  1
        Size:   1
        Start:  2
        Size:   1
        Start:  3
        Size:   1
        Start:  4
        Size:   1
        Start:  5
        Size:   1
        Start:  6
      Profile:  1g.10gb+me
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   2
        Start:  0
        Size:   2
        Start:  2
        Size:   2
        Start:  4
        Size:   2
        Start:  6
      Profile:  1g.20gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   1
        Start:  0
        Size:   1
        Start:  1
        Size:   1
        Start:  2
        Size:   1
        Start:  3
        Size:   1
        Start:  4
        Size:   1
        Start:  5
        Size:   1
        Start:  6
      Profile:  1g.10gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   2
        Start:  0
        Size:   2
        Start:  2
        Size:   2
        Start:  4
      Profile:  2g.20gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   4
        Start:  0
        Size:   4
        Start:  4
      Profile:  3g.40gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   4
        Start:  0
      Profile:  4g.40gb
    Mig:
      Parent Product Name:  NVIDIA A100 80GB PCIe
      Placements:
        Size:   8
        Start:  0
      Profile:  7g.80gb
Status:         Ready
Events:         <none>

As seen above, nvidia-smi and the NAS object do not agree.

One thing to note is that docker exec -ti k8s-dra-driver-cluster-worker nvidia-smi takes a long time to execute, about 12 seconds.
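
The disagreement described above can also be checked mechanically rather than by reading the two outputs side by side. The following is a minimal sketch, assuming the kubectl describe layout shown in this comment and nvidia-smi's CSV query mode; the helper names nas_mig_states, smi_mig_states, and report_mismatches are hypothetical.

```shell
#!/usr/bin/env bash
# Extract "uuid mig_enabled" pairs from `kubectl describe nas/<node>` output on stdin.
# Relies on "Mig Enabled:" appearing before "Uuid:" within each Gpu entry, as it does above.
nas_mig_states() {
  awk '
    $1 == "Mig" && $2 == "Enabled:" { mig = $3 }
    $1 == "Uuid:"                   { print $2, mig }
  ' | sort
}

# Extract the same pairs from the output (on stdin) of
#   nvidia-smi --query-gpu=uuid,mig.mode.current --format=csv,noheader
smi_mig_states() {
  awk -F', *' '{ print $1, ($2 == "Enabled" ? "true" : "false") }' | sort
}

# Print one line per GPU whose MIG state differs between the two sources.
# $1: file with NAS pairs, $2: file with nvidia-smi pairs. Exits non-zero on any mismatch.
report_mismatches() {
  join "$1" "$2" |
    awk '$2 != $3 { print $1 ": NAS=" $2 ", nvidia-smi=" $3; bad = 1 } END { exit bad + 0 }'
}
```

To use it, pipe kubectl describe nas/k8s-dra-driver-cluster-worker -n nvidia-dra-driver through nas_mig_states into one file, pipe the nvidia-smi CSV query (run via docker exec in the worker) through smi_mig_states into another, and pass both files to report_mismatches. Matching on UUID rather than index avoids depending on the order in which the NAS object lists the GPUs.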

klueska added the bug Issue/PR to expose/discuss/fix a bug label Jan 25, 2024
klueska self-assigned this Jan 25, 2024
@klueska
Collaborator

klueska commented Aug 27, 2024

Can this be closed?

3 participants