Helm chart definition makes it so that nvidia-device-plugin is scheduled no matter what, leading to CrashLoopBackOff errors. #48

Open
amanshanbhag opened this issue Jan 28, 2025 · 0 comments


In the main chart, both the neuron-device-plugin and the nvidia-device-plugin are set to enabled: true. The nvidia-device-plugin has some additional logic that is supposedly meant to schedule the DaemonSet only on specific nodes (those matching the tolerations):

  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: sagemaker.amazonaws.com/node-health-status
      operator: Equal
      value: Unschedulable
      effect: NoSchedule

This logic doesn't exist for the neuron-device-plugin in the main values.yaml file.

Digging a bit deeper, I see that the neuron-device-plugin has its own subdirectory for charts. Those charts have a much more thorough definition of which nodes the neuron-device-plugin pods are scheduled on (i.e., only instances that match the defined nodeAffinity, which is essentially the Neuron-based instance types).
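
For context, the nodeAffinity in the neuron subchart works roughly like the sketch below; this is a minimal illustration, and the instance-type values are placeholders rather than the chart's actual list:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - trn1.32xlarge
                  - inf2.48xlarge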

This is quite an odd definition for the Helm charts. The nvidia-device-plugin runs no matter what -- one pod gets scheduled per node, regardless of node type, because there is no nodeAffinity restricting it to specific (GPU) node types. It will be scheduled on any node that has passed the health check, even a node without the nvidia.com/gpu label.

Because of this, nvidia-device-plugin pods get scheduled on non-GPU instances and go into a CrashLoopBackOff state, with Kubernetes repeatedly restarting them. Relevant error:

I0127 15:32:51.363463       1 main.go:317] Retrieving plugins.
E0127 15:32:51.363574       1 factory.go:87] Incompatible strategy detected auto
E0127 15:32:51.363585       1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0127 15:32:51.363590       1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0127 15:32:51.363595       1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0127 15:32:51.363600       1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0127 15:32:51.372916       1 main.go:149] error starting plugins: error creating plugin manager: unable to create plugin manager: invalid device discovery strategy
stream closed EOF for kube-system/hyperpod-dependencies-nvidia-device-plugin-hcmdh (nvidia-device-plugin-ctr)

Can we implement a similar nodeAffinity definition for the nvidia-device-plugin too?
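
As a minimal sketch of what I mean: pass a nodeAffinity through to the nvidia-device-plugin DaemonSet that only matches GPU nodes. The instance-type values below are placeholders (the exact keys available on HyperPod nodes would need checking), and this assumes the nvidia-device-plugin subchart accepts an affinity value the same way the neuron charts do:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - p4d.24xlarge
                  - p5.48xlarge
                  - g5.48xlarge

Alternatively, if the GPU nodes carry a label such as nvidia.com/gpu.present (e.g., from GPU feature discovery), the matchExpressions could key on that label with operator: Exists instead of listing instance types.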
