feat: add nvidia MIG #258

piyush-jena · 2024-11-13T12:43:34Z

Issue number:

Testing done:

Instance joined the cluster

NAME                                           STATUS   ROLES    AGE   VERSION
ip-XXXX.us-west-2.compute.internal   Ready    <none>   15h   v1.29.5-eks-1109419

Model Default:

bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}

Model Updates:

bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.device-partitioning-strategy="mig"
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.mig.profile."a100-40gb"="1g.5gb"
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "mig",
        "device-sharing-strategy": "none",
        "mig": {
          "profile": {
            "a100-40gb": "1g.5gb"
          }
        },
        "pass-device-specs": true
      }
    }
  }
}

After the instance reboots, kubectl describe node shows 56 GPUs.

Bounds check:

bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "hello"="1g.5gb"
> EOF
Failed to apply settings: Failed to PATCH settings from '-' to '/settings?tx=apiclient-apply-7NsnlaurtHEacSYL': Status 400 when PATCHing /settings?tx=apiclient-apply-7NsnlaurtHEacSYL: Json deserialize error: Unable to deserialize into NvidiaGPUModel: NVIDIA GPU Model must match '^([a-z])(\d+)\.(\d+)gb$', given: hello at line 1 column 62
bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "a100.40gb"="2"
> EOF
bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "a100.40gb"="5"
> EOF
Failed to apply settings: Failed to PATCH settings from '-' to '/settings?tx=apiclient-apply-GzUHB0axGlWNPzGw': Status 400 when PATCHing /settings?tx=apiclient-apply-GzUHB0axGlWNPzGw: Json deserialize error: Unable to deserialize into MIGProfile: MIG Profile must match '^[0-9]g\.\d+gb$', given: 5 at line 1 column 71
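
For anyone reproducing the check outside the API, the two patterns quoted in the error messages above can be approximated in shell. This is only a sketch: the real validation happens when the settings are deserialized into NvidiaGPUModel and MIGProfile, the function names below are hypothetical, and `\d` is rewritten as `[0-9]` because POSIX extended regular expressions have no `\d`. The transcript also shows the value "2" being accepted, so the real validator evidently allows more than the quoted MIG profile pattern; the sketch does not model that.

```shell
# Sketch only: mirror the two patterns quoted in the error messages using
# POSIX extended regular expressions (\d is written as [0-9]).
# Function names are hypothetical and not part of apiclient.
is_gpu_model() {
    # Quoted pattern: ^([a-z])(\d+)\.(\d+)gb$
    printf '%s\n' "$1" | grep -Eq '^[a-z][0-9]+\.[0-9]+gb$'
}

is_mig_profile() {
    # Quoted pattern: ^[0-9]g\.\d+gb$
    printf '%s\n' "$1" | grep -Eq '^[0-9]g\.[0-9]+gb$'
}

# Usage: check a few keys and values taken from the transcript.
for key in "hello" "a100.40gb"; do
    if is_gpu_model "$key"; then
        echo "model ok: $key"
    else
        echo "model rejected: $key"
    fi
done

for value in "1g.5gb" "5"; do
    if is_mig_profile "$value"; then
        echo "profile ok: $value"
    else
        echo "profile rejected: $value"
    fi
done
```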

Files generated:

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

bcressey · 2025-01-27T22:35:32Z

packages/systemd/systemd-tmpfiles.conf

@@ -1,5 +1,6 @@
 d /run/cache 0755 root root -
 d /run/lock 0755 root root -
+d /run/prairiedog 0755 root root -


The systemd package's tmpfiles snippet isn't the right place for this.

I'm also not convinced migmanager should be using prairiedog's temp directory, or that we want to route it all through prairiedog.

We could just add another program to reboot-if-required.service (like this, or via drop-in):

[Service]
Type=oneshot
ExecStart=/usr/bin/prairiedog reboot-if-required
ExecStart=-/usr/bin/nvidia-migmanager reboot-if-required
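
The drop-in variant mentioned above could look roughly like the following. It is a sketch under assumptions: the drop-in file name (10-nvidia-migmanager.conf) and the staging directory are hypothetical, and only the ExecStart line is taken from the comment. How the file would actually be packaged is left open.

```shell
# Sketch only: write a drop-in that appends a second ExecStart to
# reboot-if-required.service. The drop-in file name and the target root
# directory are hypothetical; the ExecStart line is copied from the comment.
write_migmanager_dropin() {
    unit_root="$1"   # directory standing in for the systemd unit directory
    dropin_dir="${unit_root}/reboot-if-required.service.d"
    mkdir -p "${dropin_dir}"
    cat > "${dropin_dir}/10-nvidia-migmanager.conf" <<'EOF'
[Service]
ExecStart=-/usr/bin/nvidia-migmanager reboot-if-required
EOF
}

# Usage: stage the drop-in under a scratch directory and print it.
staging="$(mktemp -d)"
write_migmanager_dropin "${staging}"
cat "${staging}/reboot-if-required.service.d/10-nvidia-migmanager.conf"
```

Because the unit is Type=oneshot, systemd allows more than one ExecStart= line, and an ExecStart= in a drop-in is appended to the existing list rather than replacing it. The leading "-" means a failure of that command is ignored, so it would not fail the unit.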