
Karpenter runtime logs showing tons of records: incompatible with nodepool, daemonset overhead #7574

Closed
wangsic opened this issue Jan 8, 2025 · 1 comment
Labels
bug Something isn't working needs-triage Issues that need to be triaged

Comments

wangsic commented Jan 8, 2025

Description

Version:
Karpenter Version: v1.0.6
Kubernetes Version: v1.31

Context:

According to the Karpenter logs below, some pods (not DaemonSet pods) failed to schedule against our dedicated nodepools:

"incompatible with nodepool \"gpu\", daemonset overhead={\"cpu\":\"605m\",\"memory\":\"1288Mi\",\"pods\":\"12\"}, did not tolerate nvidia.com/gpu=1:NoSchedule;
incompatible with nodepool \"app\", daemonset overhead={\"cpu\":\"605m\",\"memory\":\"1288Mi\",\"pods\":\"12\"}, incompatible requirements, label \"eks.amazonaws.com/nodegroup\" does not have known values"

By design, we use node taints together with pod tolerations/nodeSelectors to keep these pods off the dedicated nodes. Karpenter still attempted to place them there; fortunately, the attempts failed.

As of now this doesn't impact our business, and thanks to the failures above everything works as intended, but the Karpenter logs are flooded with these error messages. We'd like to know how to prevent this and clear the error messages.

Other Information:

We have two nodepools: gpu and app.

The gpu nodepool has its own taint, as below:

key    = "nvidia.com/gpu"
value  = "1"
effect = "NoSchedule"

The app nodepool doesn't have taints, but it has startup_taints, as below:

key    = "node.cilium.io/agent-not-ready"
value  = "true"
effect = "NoExecute"
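For context, in a Karpenter v1 NodePool these settings correspond to spec.template.spec.taints and spec.template.spec.startupTaints. A trimmed sketch (other required fields such as nodeClassRef are omitted, and the names gpu/app are taken from this report):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      # Taint from the gpu nodepool above
      taints:
        - key: nvidia.com/gpu
          value: "1"
          effect: NoSchedule
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: app
spec:
  template:
    spec:
      # Startup taint from the app nodepool above; removed by
      # Cilium once the node's agent is ready
      startupTaints:
        - key: node.cilium.io/agent-not-ready
          value: "true"
          effect: NoExecute
```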

We also have two node groups managed by AWS ASGs: one for Karpenter itself, the other for infrastructure addons.

The karpenter node group is dedicated and only accepts Karpenter pods.

The infrastructure node group only accepts infrastructure addons; it has a taint and a label.

Taint:

key    = "node-group"
value  = "infra"
effect = "NO_SCHEDULE"

Node label:

eks.amazonaws.com/nodegroup=non-prod-uw2-blue-infra-nodegroup

All infra addons should run on the infrastructure node group.


One of the infra addons, the Istio pod, should be scheduled to the infra node group rather than the gpu/app nodepools, yet the Karpenter logs show it being evaluated against them. Istio has the following toleration and nodeSelector:

tolerations:
  - key: "node-group"
    operator: "Exists"

nodeSelector:
  eks.amazonaws.com/nodegroup: ${CLUSTER_NAME}-infra-nodegroup
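As a side note, a toleration with operator: Exists and no effect tolerates every taint with that key, regardless of effect. A tighter sketch (standard Kubernetes Pod spec fields; values assumed from the taint shown earlier) would pin both the value and the effect:

```yaml
tolerations:
  - key: "node-group"
    operator: "Equal"
    value: "infra"
    effect: "NoSchedule"

nodeSelector:
  eks.amazonaws.com/nodegroup: ${CLUSTER_NAME}-infra-nodegroup
```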


We spent some time investigating but had no luck finding the root cause, so we are raising the issue here.

We'd like to know how to prevent this and, if possible, clear these error messages.

@wangsic added the bug and needs-triage labels Jan 8, 2025
@jigisha620
Contributor

Closing since this is a duplicate of the issue that's open in upstream - kubernetes-sigs/karpenter#1904
