In the main chart, both the `neuron-device-plugin` and `nvidia-device-plugin` are set to `enabled: true`. The `nvidia-device-plugin` has some additional logic that supposedly schedules the DaemonSet only on specific nodes (those that match the tolerations):
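Roughly, it amounts to a tolerations block like this (an illustrative sketch, not the verbatim chart contents):

```yaml
nvidia-device-plugin:
  enabled: true
  # Tolerations let the pods land on tainted GPU nodes, but they do
  # NOT restrict scheduling -- untainted non-GPU nodes still match.
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```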
This logic doesn't exist for the `neuron-device-plugin` in the main `values.yaml` file.
Digging a bit deeper, I see that the `neuron-device-plugin` has its own subdirectory for charts. These charts have a much more thorough definition of which nodes to schedule the `neuron-device-plugin` pods on (i.e., schedule only on instances that match the defined `nodeAffinity`, which are essentially the Neuron-based instances).
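Roughly, the subchart pins the DaemonSet with a required `nodeAffinity` along these lines (paraphrased; the actual instance-type list in the subchart may differ):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Only schedule on Neuron-based (Trainium/Inferentia) instances
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - trn1.32xlarge
                - inf2.48xlarge
```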
This is quite an odd definition for the Helm charts. The `nvidia-device-plugin` runs no matter what -- it schedules one pod per node, regardless of node type, because there is no `nodeAffinity` restricting it to specific (GPU) node types. It will schedule on any node that passes the health check, even one without the `nvidia.com/gpu` label.
Because of this, `nvidia-device-plugin` pods get scheduled on non-GPU instances and enter a `CrashLoopBackOff` state, with the pod being restarted repeatedly. Relevant error:
```
I0127 15:32:51.363463 1 main.go:317] Retrieving plugins.
E0127 15:32:51.363574 1 factory.go:87] Incompatible strategy detected auto
E0127 15:32:51.363585 1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0127 15:32:51.363590 1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0127 15:32:51.363595 1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0127 15:32:51.363600 1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0127 15:32:51.372916 1 main.go:149] error starting plugins: error creating plugin manager: unable to create plugin manager: invalid device discovery strategy
stream closed EOF for kube-system/hyperpod-dependencies-nvidia-device-plugin-hcmdh (nvidia-device-plugin-ctr)
```
Can we implement a similar `nodeAffinity` definition for the `nvidia-device-plugin` too?
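For example, something along these lines in the `nvidia-device-plugin` values (a sketch, assuming GPU nodes can be identified by the standard instance-type label; the label and values to key on would need to match how the cluster actually labels its GPU nodes):

```yaml
nvidia-device-plugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Hypothetical gating on GPU instance families; the real
              # list/label should match the cluster's GPU node setup.
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                  - p4d.24xlarge
                  - p5.48xlarge
                  - g5.12xlarge
```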