
Karpenter takes a long time to delete nodes that have been removed by AWS via spot interruption #3562

Closed
mrparkers opened this issue Mar 8, 2023 · 14 comments · Fixed by kubernetes-sigs/karpenter#252
Labels: bug (Something isn't working)

@mrparkers

Version

Karpenter Version: v0.26.1

Kubernetes Version: Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.10-eks-48e63af", GitCommit:"9176fb99b52f8d5ff73d67fea27f3a638f679f8a", GitTreeState:"clean", BuildDate:"2023-01-24T19:17:48Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Expected Behavior

We have some GPU-enabled spot instances that run workloads with a long termination grace period (the workloads can sometimes take up to 10 minutes to terminate gracefully). These workloads also have a PodDisruptionBudget to ensure that enough replicas are available during a rollout.
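
For concreteness, here is a sketch of the kind of pod settings being described; the names and values are illustrative, not the actual manifests from this cluster:

# Pod template fragment (illustrative; not the reporter's actual spec)
spec:
  terminationGracePeriodSeconds: 600   # workloads can take up to 10 minutes to exit cleanly
  containers:
  - name: transcriber                  # hypothetical GPU workload name
    resources:
      limits:
        nvidia.com/gpu: 1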

When AWS issues a spot interruption warning for the node, karpenter starts to drain it, but it's usually unable to completely drain before AWS reclaims the node (either due to the termination grace period or the PodDisruptionBudget). When this happens, the underlying EC2 instance is no longer visible in the AWS console, but the node still shows up when querying via kubectl get nodes.

Here is an example of the events I'm seeing:

Events:
  Type     Reason                               Age                  From                             Message
  ----     ------                               ----                 ----                             -------
  Normal   Starting                             35m                  kube-proxy
  Normal   RegisteredNode                       36m                  node-controller                  Node ip-10-3-136-12.us-west-2.compute.internal event: Registered Node ip-10-3-136-12.us-west-2.compute.internal in Controller
  Normal   NodeAllocatableEnforced              35m                  kubelet                          Updated Node Allocatable limit across pods
  Warning  InvalidDiskCapacity                  35m                  kubelet                          invalid capacity 0 on image filesystem
  Normal   Starting                             35m                  kubelet                          Starting kubelet.
  Normal   NodeHasSufficientMemory              35m (x3 over 35m)    kubelet                          Node ip-10-3-136-12.us-west-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasSufficientPID                 35m (x3 over 35m)    kubelet                          Node ip-10-3-136-12.us-west-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeHasNoDiskPressure                35m (x3 over 35m)    kubelet                          Node ip-10-3-136-12.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeReady                            35m                  kubelet                          Node ip-10-3-136-12.us-west-2.compute.internal status is now: NodeReady
  Normal   Unconsolidatable                     24m                  karpenter                        provisioner gpu-spot has consolidation disabled
  Normal   InstanceSpotRebalanceRecommendation  22m                  karpenter                        Node ip-10-3-136-12.us-west-2.compute.internal event: A spot rebalance recommendation was triggered for the node
  Warning  InstanceSpotInterrupted              19m                  karpenter                        Node ip-10-3-136-12.us-west-2.compute.internal event: A spot interruption warning was triggered for the node
  Normal   NodeNotSchedulable                   19m                  kubelet                          Node ip-10-3-136-12.us-west-2.compute.internal status is now: NodeNotSchedulable
  Warning  NodeTerminatingOnInterruption        17m (x2 over 19m)    karpenter                        Node ip-10-3-136-12.us-west-2.compute.internal event: Interruption triggered termination for the node
  Warning  InstanceTerminating                  17m                  karpenter                        Node ip-10-3-136-12.us-west-2.compute.internal event: Instance is terminating
  Normal   NodeNotReady                         17m                  node-controller                  Node ip-10-3-136-12.us-west-2.compute.internal status is now: NodeNotReady
  Warning  FailedInflightCheck                  16m                  karpenter                        Can't drain node, PDB foo/bar is blocking evictions
  Warning  FailedDraining                       96s (x10 over 19m)   karpenter                        Failed to drain node, 1 pods are waiting to be evicted
  Normal   DeletingNode                         92s (x145 over 16m)  cloud-node-lifecycle-controller  Deleting node ip-10-3-136-12.us-west-2.compute.internal because it does not exist in the cloud provider

I understand that Karpenter wants to delete nodes that receive a spot interruption gracefully. However, once AWS terminates the instance, I would expect Karpenter to stop the graceful termination and remove the node by force, similar to running kubectl delete node ip-10-3-136-12.us-west-2.compute.internal --force --grace-period=0 myself. That is what I have to run in order to remove the node, although it appears Karpenter eventually removes it after ~20 minutes.

Actual Behavior

It seems like karpenter continues to attempt to delete the node gracefully, even after it has been reclaimed by AWS.

Steps to Reproduce the Problem

  1. Schedule a pod with a PDB and/or a long termination grace period on a spot node. If the goal is simply to reproduce, you can create a PDB that never allows evictions by setting minAvailable to 1 and replicas to 1 (a minimal manifest sketch follows this list).
  2. Wait for AWS to reclaim the spot node
  3. Observe that it takes karpenter up to 20 minutes to delete the node after AWS has reclaimed it.
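
A minimal manifest for step 1, assuming a trivial placeholder workload pinned to spot capacity and a provisioner that offers spot (all names are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: never-evictable
spec:
  replicas: 1
  selector:
    matchLabels:
      app: never-evictable
  template:
    metadata:
      labels:
        app: never-evictable
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # land on a spot node
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: never-evictable
spec:
  minAvailable: 1                          # with replicas: 1, evictions are never allowed
  selector:
    matchLabels:
      app: never-evictable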

Resource Specs and Logs

$ stern -n karpenter karpenter | grep ip-10-3-136-12.us-west-2.compute.internal
+ karpenter-75b66f6bb5-qxbqg › controller
+ karpenter-75b66f6bb5-2qhl7 › controller
karpenter-75b66f6bb5-qxbqg controller {"level":"INFO","time":"2023-03-08T18:50:56.314Z","logger":"controller.provisioner.cloudprovider","message":"launched new instance","commit":"beb0a64-dirty","provisioner":"gpu-spot","id":"i-06b3734b274b683d1","hostname":"ip-10-3-136-12.us-west-2.compute.internal","instance-type":"g4dn.2xlarge","zone":"us-west-2a","capacity-type":"spot"}
karpenter-75b66f6bb5-qxbqg controller {"level":"DEBUG","time":"2023-03-08T19:07:39.045Z","logger":"controller.interruption","message":"removing offering from offerings","commit":"beb0a64-dirty","queue":"my-karpenter-interrupt-queue","messageKind":"SpotInterruptionKind","node":"ip-10-3-136-12.us-west-2.compute.internal","action":"CordonAndDrain","reason":"SpotInterruptionKind","instance-type":"g4dn.2xlarge","zone":"us-west-2a","capacity-type":"spot","ttl":"3m0s"}
karpenter-75b66f6bb5-qxbqg controller {"level":"INFO","time":"2023-03-08T19:07:39.125Z","logger":"controller.interruption","message":"deleted node from interruption message","commit":"beb0a64-dirty","queue":"my-karpenter-interrupt-queue","messageKind":"SpotInterruptionKind","node":"ip-10-3-136-12.us-west-2.compute.internal","action":"CordonAndDrain"}
karpenter-75b66f6bb5-qxbqg controller {"level":"INFO","time":"2023-03-08T19:07:39.182Z","logger":"controller.termination","message":"cordoned node","commit":"beb0a64-dirty","node":"ip-10-3-136-12.us-west-2.compute.internal"}
karpenter-75b66f6bb5-qxbqg controller {"level":"INFO","time":"2023-03-08T19:09:39.439Z","logger":"controller.interruption","message":"deleted node from interruption message","commit":"beb0a64-dirty","queue":"my-karpenter-interrupt-queue","messageKind":"StateChangeKind","node":"ip-10-3-136-12.us-west-2.compute.internal","action":"CordonAndDrain"}
karpenter-75b66f6bb5-qxbqg controller {"level":"INFO","time":"2023-03-08T19:10:35.263Z","logger":"controller.interruption","message":"deleted node from interruption message","commit":"beb0a64-dirty","queue":"my-karpenter-interrupt-queue","messageKind":"StateChangeKind","node":"ip-10-3-136-12.us-west-2.compute.internal","action":"CordonAndDrain"}
karpenter-75b66f6bb5-qxbqg controller {"level":"INFO","time":"2023-03-08T19:10:56.442Z","logger":"controller.inflightchecks","message":"Inflight check failed for node, Can't drain node, PDB foo/bar is blocking evictions","commit":"beb0a64-dirty","node":"ip-10-3-136-12.us-west-2.compute.internal"}
karpenter-75b66f6bb5-qxbqg controller {"level":"INFO","time":"2023-03-08T19:27:32.673Z","logger":"controller.termination","message":"deleted node","commit":"beb0a64-dirty","node":"ip-10-3-136-12.us-west-2.compute.internal"}

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@mrparkers mrparkers added the bug Something isn't working label Mar 8, 2023
@jonathan-innis jonathan-innis added the burning Time sensitive issues label Mar 8, 2023
@liorfranko

We experience the same issue.
Our workaround is to run the following:

kubectl get nodes -L node.kubernetes.io/instance-type,nodegroup-name,intent,lifecycle,karpenter.sh/capacity-type,kubernetes.io/os,project | grep NotReady,SchedulingDisabled

kubectl patch node <nodes> -p '{"metadata":{"finalizers":null}}' --type=merge
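
A hedged one-liner combining the two steps above (same effect; review the node list before patching, since dropping the finalizer bypasses Karpenter's normal drain):

kubectl get nodes --no-headers \
  | awk '$2 == "NotReady,SchedulingDisabled" {print $1}' \
  | xargs -r -I{} kubectl patch node {} -p '{"metadata":{"finalizers":null}}' --type=merge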

@liorfranko

What we see is this: when a spot interruption takes out more capacity than the PDB allows, and the nodes have already been reclaimed by AWS, Karpenter still tries to delete them within the PDB allowance. The pods are no longer running, but they remain stale along with the nodes because of Karpenter's finalizer.
The issue is resolved once we remove the Karpenter finalizer from the nodes.

@jonathan-innis jonathan-innis removed the burning Time sensitive issues label Mar 11, 2023
@jonathan-innis (Contributor) commented Mar 11, 2023

We don't want to ignore the graceful termination on nodes because it's difficult for us to know when EC2 will actually reclaim the capacity. I'm wondering if we can use the state change operation to detect when the instance is going into a stopped/terminated state and use that to forcefully terminate the node.
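
For context, the StateChangeKind messages in the logs above presumably correspond to EC2 instance state-change notifications delivered to the interruption queue. A representative payload, rendered here as YAML with most fields omitted, so treat the exact shape as illustrative:

detail-type: EC2 Instance State-change Notification
source: aws.ec2
region: us-west-2
detail:
  instance-id: i-06b3734b274b683d1   # the instance from the logs above, used only for illustration
  state: terminated                  # stopping/stopped/terminated are the states of interest here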

Otherwise, if we are going through a standard graceful termination flow, I think that any kind of force mechanism or a timeout is probably unwise, since each workload can be different with how long it needs to gracefully terminate.

@jonathan-innis (Contributor)

"that run workloads with a long termination grace period"

@mrparkers why do your workloads need 10 minutes to gracefully terminate?

@liorfranko

After the significant spot interruptions, in the cluster's events, we see that the cloud-node-lifecycle-controller says: Deleting node ip-10-207-156-229.ec2.internal because it does not exist in the cloud provider.
But Karpenter tries to evict the pods gracefully based on the PDB, which doesn't make sense because the pods are no longer working.
Karpenter's finalizer blocks the deletion of the nodes.
This behavior slows down the recovery after interruptions.

So Karpenter should try to detect when the node has already been taken and force-delete it.

I think you can reproduce it by creating a deployment with one pod and a PDB with minAvailable: 100%, then terminating the node from the AWS console.
The expected behavior is to detect, as soon as possible, that the node is no longer working and force-delete it.
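
If you would rather script that reproduction than use the console, terminating the instance from the AWS CLI is one way to simulate the reclaim (the instance ID is a placeholder):

aws ec2 terminate-instances --instance-ids <instance-id>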

@njtran (Contributor) commented Mar 14, 2023

I was able to reproduce the behavior as you described. I wasn't able to see the cloud-node-lifecycle-controller event in my cluster, but I do see the node eventually get forcefully terminated after about 5 minutes. Once I removed the finalizer, the node was deleted almost instantly, but the pods took about a minute to reschedule.

Karpenter could check whether a backing instance is in a terminated state and patch out the finalizer early. The only consideration is how often we should check this so as not to be throttled by the EC2 API. What's a realistic timeframe for these nodes to be deprovisioned after they've been terminated?
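
A manual approximation of that check, expressed with CLI tools rather than Karpenter's internal EC2 client (the instance ID and node name are placeholders):

state=$(aws ec2 describe-instances --instance-ids <instance-id> \
  --query 'Reservations[0].Instances[0].State.Name' --output text)
if [ "$state" = "terminated" ]; then
  # The backing capacity is gone, so drop Karpenter's finalizer and let the node object be deleted.
  kubectl patch node <node-name> -p '{"metadata":{"finalizers":null}}' --type=merge
fi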

@tzneal (Contributor) commented Mar 14, 2023

If we check only when a node is not ready, that would at least help limit the # of calls (i.e. no point checking if a backing instance is terminated if the node is still up and reporting healthy).
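
A rough sketch of that filtering, expressed with kubectl rather than Karpenter's internals, just to illustrate the idea of only considering nodes whose Ready condition is not True:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk '$2 != "True" {print $1}'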

@mrparkers (Author)

"why do your workloads need 10 minutes to gracefully terminate?"

Without going into too much detail, we run a few third-party applications that are responsible for transcribing audio streams. We prefer to run these on spot instances since these workloads require GPUs, so that saves us a bit of money. Sometimes this means they terminate early (and unexpectedly) due to AWS reclaiming spot instances, but we'd prefer for these workloads to be given the time they need to exit gracefully whenever it's possible. That's why we give them such a long grace period.

@njtran (Contributor) commented Mar 16, 2023

Got it. My understanding is that PDBs are blocking eviction for the pods, and that, given the short notice of spot interruptions, not all of them can gracefully terminate. Is this right? Discovering this condition for nodes would make the node cleanup happen minutes earlier, but like you said, the instance is already terminated, so it seems like the pods wouldn't have had the time to gracefully terminate anyway.

Just to understand: is the problem the time it takes to forcefully terminate this node and its effect on application runtime, or the time it takes to gracefully terminate? If you're able to respond to rebalance recommendations, your pods should have more time to gracefully terminate. As I understand it, the sooner you recognize that your pods need to gracefully terminate, the sooner Karpenter will provision new capacity to handle the evicted pods.

@ellistarn (Contributor) commented Mar 25, 2023

@njtran, given your theory of tolerationSeconds allowing this to block for longer than 5min, were you able to test using a large terminationGracePeriodSeconds as well? @liorfranko, are you setting either of these for your use case?

@njtran (Contributor) commented Mar 28, 2023

Yeah I did a reproduction as follows:

  • Fully blocking maxUnavailable: 0 PDB
  • Node with a terminated underlying instance

I saw the NodeLifecycle controller delete (rather than evict via the Eviction API) the pods on the node carrying the Unreachable taint once tolerationSeconds ran out, bypassing PDB requirements. I confirmed this by setting the following on my pods, as opposed to the default of tolerationSeconds: 300:

  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 10

I'm adding the termination check in the linked PR above as our finalization logic may be preventing other termination flows that we're not aware of.

@njtran (Contributor) commented Mar 28, 2023

Following the same pattern of a blocking PDB and relying on taint-manager eviction, I also changed the pod deployment spec to include terminationGracePeriodSeconds: 3000. Even though the PDB was still intact with 0 allowed disruptions, the Deployment's ReplicaSet spun up 10 more pods, and Karpenter provisioned a new node.

I was left with a mix of Running and Terminating pods.

NAME                       READY   STATUS        RESTARTS   AGE
inflate-6c854b49db-5dtqx   1/1     Running       0          14m
inflate-6c854b49db-5f9z4   1/1     Running       0          19m
inflate-6c854b49db-67n47   1/1     Running       0          14m
inflate-6c854b49db-6bcn2   1/1     Running       0          14m
inflate-6c854b49db-6xvr4   1/1     Terminating   0          40m
inflate-6c854b49db-76d8g   1/1     Terminating   0          40m
inflate-6c854b49db-bncvp   1/1     Terminating   0          39m
inflate-6c854b49db-cdzqr   1/1     Running       0          14m
inflate-6c854b49db-gfqtb   1/1     Terminating   0          40m
inflate-6c854b49db-gp5zb   1/1     Running       0          14m
inflate-6c854b49db-gx9h8   1/1     Running       0          14m
inflate-6c854b49db-hbxbb   1/1     Terminating   0          39m
inflate-6c854b49db-kv8dd   1/1     Running       0          14m
inflate-6c854b49db-ns562   1/1     Terminating   0          39m
inflate-6c854b49db-pwd8j   1/1     Terminating   0          40m
inflate-6c854b49db-q7blh   1/1     Running       0          14m
inflate-6c854b49db-vp2f8   1/1     Terminating   0          39m
inflate-6c854b49db-w79qh   1/1     Terminating   0          40m
inflate-6c854b49db-xbp5l   1/1     Running       0          14m
inflate-6c854b49db-xwglw   1/1     Terminating   0          39m

Furthermore, I changed my PDB to be maxUnavailable: 5, and saw that these terminating pods were no longer being considered for PDB calculations:

✗ k get pdb
NAME       MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
test-pdb   N/A             5                 5                     4d19h

PDBs only block eviction based on the existing healthy pods. Once the pods start terminating and have a deletionTimestamp set, they are no longer considered for PDB calculations.
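
A quick way to see which pods have already been given a deletionTimestamp (and therefore no longer count toward the PDB), as a sketch:

kubectl get pods -o custom-columns=NAME:.metadata.name,DELETION:.metadata.deletionTimestamp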

Given this, I'm curious if there's something else that shows that the pod's controller object is unable to run the desired number of replicas due to a blocking PDB. Even though the old node and pods are left around while they're terminating, due to my long terminationGracePeriodSeconds, the deployment is still able to reach its desired replica count.

@njtran (Contributor) commented Mar 31, 2023

Reopening, as this was closed when the linked PR was merged. I'll wait for the buddy PR to be merged and validated to work before closing.

@njtran njtran reopened this Mar 31, 2023
@njtran (Contributor) commented May 1, 2023

Closing this as the PR has been merged and validated to solve the problem.

@njtran njtran closed this as completed May 1, 2023