Helm chart definition makes it so that nvidia-device-plugin is scheduled no matter what, leading to CrashLoopBackOff errors. #48

Open
amanshanbhag opened this issue Jan 28, 2025 · 0 comments


In the main chart, both the neuron-device-plugin and the nvidia-device-plugin are set to enabled: true. The nvidia-device-plugin has some additional logic that is supposedly meant to schedule the DaemonSet only on specific nodes (those matching the tolerations):

  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: sagemaker.amazonaws.com/node-health-status
      operator: Equal
      value: Unschedulable
      effect: NoSchedule

This logic doesn't exist for the neuron-device-plugin in the main values.yaml file.

Digging a bit deeper, I see that the neuron-device-plugin has its own subdirectory for charts. Those charts have a much more thorough definition of which nodes the neuron-device-plugin pods are scheduled on (i.e., only instances that match the defined nodeAffinity, which is essentially the Neuron-based instance types).
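
For context, the nodeAffinity in the neuron subchart works roughly like the sketch below; this is a minimal illustration, and the instance-type values are placeholders rather than the chart's actual list:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - trn1.32xlarge
                  - inf2.48xlarge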

This is quite an odd definition for the Helm charts. The nvidia-device-plugin runs no matter what -- one pod gets scheduled per node, regardless of node type, because there is no nodeAffinity restricting it to specific (GPU) node types. It will be scheduled on any node that has passed the health check, even a node without the nvidia.com/gpu label.

Because of this, nvidia-device-plugin pods get scheduled on non-GPU instances and go into a CrashLoopBackOff state, with Kubernetes repeatedly restarting them. Relevant error:

I0127 15:32:51.363463       1 main.go:317] Retrieving plugins.
E0127 15:32:51.363574       1 factory.go:87] Incompatible strategy detected auto
E0127 15:32:51.363585       1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0127 15:32:51.363590       1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0127 15:32:51.363595       1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0127 15:32:51.363600       1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0127 15:32:51.372916       1 main.go:149] error starting plugins: error creating plugin manager: unable to create plugin manager: invalid device discovery strategy
stream closed EOF for kube-system/hyperpod-dependencies-nvidia-device-plugin-hcmdh (nvidia-device-plugin-ctr)

Can we implement a similar nodeAffinity definition for the nvidia-device-plugin too?
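
As a minimal sketch of what I mean: pass a nodeAffinity through to the nvidia-device-plugin DaemonSet that only matches GPU nodes. The instance-type values below are placeholders (the exact keys available on HyperPod nodes would need checking), and this assumes the nvidia-device-plugin subchart accepts an affinity value the same way the neuron charts do:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - p4d.24xlarge
                  - p5.48xlarge
                  - g5.48xlarge

Alternatively, if the GPU nodes carry a label such as nvidia.com/gpu.present (e.g., from GPU feature discovery), the matchExpressions could key on that label with operator: Exists instead of listing instance types.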
