Karpenter takes a long time to delete nodes that have been removed by AWS via spot interruption #3562
Comments
We experience the same issue.
Our behavior: when there is a significant Spot interruption, larger than the percentage allowed by the PDB, once the nodes are taken by AWS, Karpenter still tries to delete them based on the PDB allowance. The pods are already not running, but they remain stale along with the nodes because of Karpenter's finalizer.
We don't want to ignore the graceful termination on nodes because it's difficult for us to know when EC2 will actually reclaim the capacity. I'm wondering if we can use the state change operation to detect when the instance is going into a stopped/terminated state and use that to forcefully terminate the node. Otherwise, if we are going through a standard graceful termination flow, I think that any kind of force mechanism or a timeout is probably unwise, since each workload can be different with how long it needs to gracefully terminate.
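As a rough illustration, here is a hedged bash sketch of the manual version of that check, assuming the node name from the issue body below and that the instance ID can be parsed from the node's .spec.providerID; the variable names and the script itself are illustrative, not part of Karpenter:

```bash
# Hypothetical manual check: if the backing EC2 instance is already gone, stop
# waiting on graceful drain and remove the node (and its finalizer) by hand.
NODE=ip-10-3-136-12.us-west-2.compute.internal
INSTANCE_ID=$(kubectl get node "$NODE" -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
STATE=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[].Instances[].State.Name' --output text)

if [ "$STATE" = "terminated" ] || [ "$STATE" = "shutting-down" ]; then
  # Request deletion first; if Karpenter's finalizer leaves the node stuck in
  # Terminating, clear the finalizers so the API server can remove it.
  kubectl delete node "$NODE" --wait=false
  kubectl patch node "$NODE" --type=merge -p '{"metadata":{"finalizers":null}}'
fi
```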
@mrparkers why do your workloads need 10 minutes to gracefully terminate?
After the significant spot interruptions, we can see it in the cluster's events. So you should try to detect when the node has already been taken and force delete it. I think you can reproduce it by creating a deployment with one pod and a PDB.
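A minimal sketch of that reproduction, with hypothetical resource names and a placeholder image rather than the reporter's actual specs:

```bash
# Hypothetical repro: a single-replica Deployment plus a PDB whose minAvailable
# equals the replica count, so any voluntary eviction is blocked.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdb-blocked-app        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pdb-blocked-app
  template:
    metadata:
      labels:
        app: pdb-blocked-app
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # placeholder workload
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-blocked-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: pdb-blocked-app
EOF
```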
I was able to reproduce the behavior like you said, though I wasn't able to see the same events. Karpenter could check whether a backing instance is in a terminated state and patch out the finalizer early. The only consideration is how often we should check this so as not to be throttled by the EC2 API. What's a realistic timeframe for these nodes to be deprovisioned after they've been terminated?
If we check only when a node is not ready, that would at least help limit the # of calls (i.e. no point checking if a backing instance is terminated if the node is still up and reporting healthy).
Without going into too much detail, we run a few third-party applications that are responsible for transcribing audio streams. We prefer to run these on spot instances since these workloads require GPUs, so that saves us a bit of money. Sometimes this means they terminate early (and unexpectedly) due to AWS reclaiming spot instances, but we'd prefer for these workloads to be given the time they need to exit gracefully whenever it's possible. That's why we give them such a long grace period.
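For context, the knob that grants this time is the pod spec's terminationGracePeriodSeconds. A hedged sketch of that kind of workload, with hypothetical names, image, and values:

```bash
# Hypothetical example of a GPU workload that is given up to 10 minutes to
# finish in-flight work before being killed.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: transcriber             # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: transcriber
  template:
    metadata:
      labels:
        app: transcriber
    spec:
      terminationGracePeriodSeconds: 600   # up to 10 minutes of graceful shutdown
      containers:
      - name: worker
        image: example.com/transcriber:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1               # schedules onto GPU capacity
EOF
```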
Got it. My understanding is that the PDBs are blocking eviction for the pods, and due to the limited time given for spot interruptions, not all of them can gracefully terminate. Is this right? Detecting this condition on nodes would make the node cleanup happen minutes earlier, but like you said, the instance is already terminated, so it seems like the pods wouldn't have the time to gracefully terminate anyway. Just to understand: is the problem the time it takes to forcefully terminate this node and get the application running again, or the time it takes to gracefully terminate? If you're able to respond to rebalance recommendations, your pods should have more time to gracefully terminate. As I understand it, the quicker you're able to understand that your pods need to gracefully terminate, the quicker Karpenter will provision new capacity to handle that evicted pod.
@njtran, given your theory of …
Yeah I did a reproduction as follows:
I saw the node lifecycle controller delete (not evict via the Eviction API) pods with the Unreachable taint when the toleration times out, bypassing PDB requirements. I confirmed this by setting the following on my pods, as opposed to the default (see the toleration sketch below).
I'm adding the termination check in the linked PR above, as our finalization logic may be preventing other termination flows that we're not aware of.
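For reference, pods normally get node.kubernetes.io/not-ready and node.kubernetes.io/unreachable NoExecute tolerations with tolerationSeconds: 300 from the DefaultTolerationSeconds admission plugin. A hedged sketch of overriding that on a Deployment (the target name and the 60-second values are hypothetical):

```bash
# Hypothetical override of the default 300s NoExecute tolerations, so pods on an
# unreachable node are deleted by the node lifecycle controller sooner (or later).
cat <<'EOF' > tolerations-patch.yaml
spec:
  template:
    spec:
      tolerations:
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
EOF
kubectl patch deployment transcriber --patch-file tolerations-patch.yaml   # hypothetical target
```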
Following the same pattern of a blocking PDB and relying on taint-manager eviction, I also changed the pod deployment spec to include the settings above. I was left with a mix of pods in different states.
Furthermore, I changed my PDB as well.
The PDBs only block eviction based on the existing healthy pods; once the pods go Terminating and have a deletionTimestamp set, they are no longer counted toward the PDB. Given this, I'm curious if there's something else that shows that the pod's controller object is not able to run the desired number of replicas due to a blocking PDB, even though the old node and pods are left around while they're terminating due to my long termination grace period.
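One way to observe that condition from the outside, as a hedged sketch that reuses the hypothetical names from the repro above:

```bash
# Hypothetical checks: the PDB's status shows how many healthy pods it currently
# counts, and the Deployment's status shows whether the desired replicas are available.
kubectl get pdb pdb-blocked-app -o jsonpath='{.status.currentHealthy}/{.status.desiredHealthy} healthy, {.status.disruptionsAllowed} disruptions allowed{"\n"}'
kubectl get deployment pdb-blocked-app -o jsonpath='{.status.availableReplicas}/{.spec.replicas} replicas available{"\n"}'
```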
Reopening, as this was auto-closed when the linked PR was merged. Will wait for the buddy PR to be merged and validated before closing.
Closing this as the PR has been merged and validated to solve the problem. |
Version
Karpenter Version: v0.26.1
Kubernetes Version:
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.10-eks-48e63af", GitCommit:"9176fb99b52f8d5ff73d67fea27f3a638f679f8a", GitTreeState:"clean", BuildDate:"2023-01-24T19:17:48Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Expected Behavior
We have some GPU-enabled spot instances that run workloads with a long termination grace period (the workloads can sometimes take up to 10 minutes to terminate gracefully). These workloads also have a PodDisruptionBudget to ensure that enough replicas are available during a rollout.

When AWS issues a spot interruption warning for the node, Karpenter starts to drain it, but it's usually unable to completely drain before AWS reclaims the node (either due to the termination grace period or the PodDisruptionBudget). When this happens, the underlying EC2 instance is no longer visible in the AWS console, but the node still shows up when querying via kubectl get nodes.

Here is an example of the events I'm seeing:
I understand that Karpenter wants to gracefully delete nodes that receive a spot interruption. However, once AWS terminates the node, I would expect Karpenter to stop the graceful termination and attempt to remove the node by force, similar to me running kubectl delete node ip-10-3-136-12.us-west-2.compute.internal --force --grace-period=0. This is what I have to run in order to remove the node, although it appears Karpenter eventually removes it on its own after ~20 minutes.

Actual Behavior
It seems like Karpenter continues to attempt to delete the node gracefully, even after it has been reclaimed by AWS.
Steps to Reproduce the Problem
Set minAvailable to 1 and replicas to 1.

Resource Specs and Logs