Node draining during instance shutdown leads to "flapping" state #7

hjacobs · 2017-02-11T14:19:23Z

Using autoscaling without a proper node shutdown sequence will kill all pods/containers without any grace period. Node draining such as kube-node-drainer.service should generally be recommended to avoid service disruptions: zalando-incubator/kubernetes-on-aws#257

Problem with node draining: the autoscaler currently goes into a "flapping" state because cordoned nodes (nodes marked as unschedulable) are compensated for. Example:

kube-aws-autoscaler figures out new DesiredCapacity and scales down from 6 to 5
ASG terminates one EC2 instance
kube-node-drainer.service on the EC2 instance calls kubectl drain
the node is therefore marked as "unschedulable"
kube-aws-autoscaler now sees 6 nodes, but one of them is cordoned, so it compensates to 7 nodes
kube-aws-autoscaler sets ASG DesiredCapacity to 7
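
A minimal sketch of how this kind of compensation can inflate the target. It assumes the autoscaler adds one replacement for every cordoned node on top of the node count it considers necessary. The function name, the node dictionaries, and the numbers are illustrative and not taken from the kube-aws-autoscaler source; in particular, this thread does not spell out how the target of 7 is computed.

```python
def compensated_capacity(required_nodes, nodes):
    """Add one replacement node for every cordoned (unschedulable) node."""
    cordoned = sum(1 for node in nodes if node["unschedulable"])
    return required_nodes + cordoned


# Six nodes are registered; the one being drained is cordoned.
nodes = [{"name": f"node-{i}", "unschedulable": False} for i in range(5)]
nodes.append({"name": "node-5", "unschedulable": True})

# Assumption: the autoscaler still treats all six registered nodes as required.
print(compensated_capacity(6, nodes))
```

Under this assumption the draining node is counted twice: once as a registered node and once as a node that needs a replacement.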


hjacobs · 2017-02-11T14:45:37Z

This is a critical issue as can be seen in the following two graphs (simple scenario: scale up nginx from 10 to 50 replicas with kubectl scale deploy nginx --replicas=50).

Number of service endpoints for nginx service (should be going up from 10 to 50):

Number of healthy worker hosts (ELB health check to kubelet on worker node):

Eventually everything is fine:

50 nginx pods are running and all are ready (registered in service as endpoint)
worker ASG runs stable with 6 nodes

But the flapping/scaling period clearly causes service disruptions (during one minute, only 7 of 50 service endpoints were available).

hjacobs · 2017-02-11T16:06:38Z

Another scenario where a single instance (i-02e6f917c5d3bd5d8) is terminated manually:

$ aws autoscaling describe-scaling-activities --auto-scaling-group-name kube-aws-test-1-WorkerAutoScaling-UUMORYB3ZVK2 --max-records=20 | jq '.Activities[].Cause' -r | tac
At 2017-02-11T15:33:29Z instance i-02e6f917c5d3bd5d8 was taken out of service in response to a user request.
At 2017-02-11T15:33:41Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 6.
At 2017-02-11T15:34:54Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 5.  At 2017-02-11T15:35:09Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 6 to 5.  At 2017-02-11T15:35:09Z instance i-0b9548c2536c04b90 was selected for termination.
At 2017-02-11T15:36:21Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 8.  At 2017-02-11T15:36:37Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 8.
At 2017-02-11T15:36:21Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 8.  At 2017-02-11T15:36:37Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 8.
At 2017-02-11T15:36:21Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 8.  At 2017-02-11T15:36:37Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 8.
At 2017-02-11T15:37:22Z a user request explicitly set group desired capacity changing the desired capacity from 8 to 6.  At 2017-02-11T15:38:05Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 8 to 6.  At 2017-02-11T15:38:05Z instance i-09eff56ae37a67833 was selected for termination.  At 2017-02-11T15:38:05Z instance i-0cb8ccba1996cc308 was selected for termination.
At 2017-02-11T15:37:22Z a user request explicitly set group desired capacity changing the desired capacity from 8 to 6.  At 2017-02-11T15:38:05Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 8 to 6.  At 2017-02-11T15:38:05Z instance i-09eff56ae37a67833 was selected for termination.  At 2017-02-11T15:38:05Z instance i-0cb8ccba1996cc308 was selected for termination.
At 2017-02-11T15:39:10Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 9.  At 2017-02-11T15:39:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 9.
At 2017-02-11T15:39:10Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 9.  At 2017-02-11T15:39:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 9.
At 2017-02-11T15:39:10Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 9.  At 2017-02-11T15:39:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 9.
At 2017-02-11T15:40:11Z a user request explicitly set group desired capacity changing the desired capacity from 9 to 6.  At 2017-02-11T15:40:31Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 9 to 6.  At 2017-02-11T15:40:31Z instance i-07a7326eda26c20cd was selected for termination.  At 2017-02-11T15:40:31Z instance i-0f0d7717d203a1ff2 was selected for termination.  At 2017-02-11T15:40:31Z instance i-0eb6db2e0e79092ad was selected for termination.
At 2017-02-11T15:40:11Z a user request explicitly set group desired capacity changing the desired capacity from 9 to 6.  At 2017-02-11T15:40:31Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 9 to 6.  At 2017-02-11T15:40:31Z instance i-07a7326eda26c20cd was selected for termination.  At 2017-02-11T15:40:31Z instance i-0f0d7717d203a1ff2 was selected for termination.  At 2017-02-11T15:40:31Z instance i-0eb6db2e0e79092ad was selected for termination.
At 2017-02-11T15:40:11Z a user request explicitly set group desired capacity changing the desired capacity from 9 to 6.  At 2017-02-11T15:40:31Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 9 to 6.  At 2017-02-11T15:40:31Z instance i-07a7326eda26c20cd was selected for termination.  At 2017-02-11T15:40:31Z instance i-0f0d7717d203a1ff2 was selected for termination.  At 2017-02-11T15:40:31Z instance i-0eb6db2e0e79092ad was selected for termination.
At 2017-02-11T15:41:11Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 7.  At 2017-02-11T15:41:30Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 7.
At 2017-02-11T15:44:49Z a user request explicitly set group desired capacity changing the desired capacity from 7 to 6.  At 2017-02-11T15:44:54Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 7 to 6.  At 2017-02-11T15:44:54Z instance i-0eca68586af353f8f was selected for termination.
At 2017-02-11T15:45:50Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 7.  At 2017-02-11T15:45:53Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 7.
At 2017-02-11T15:46:50Z a user request explicitly set group desired capacity changing the desired capacity from 7 to 5.  At 2017-02-11T15:46:51Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 7 to 5.  At 2017-02-11T15:46:51Z instance i-0086b4f3a1f073b1c was selected for termination.  At 2017-02-11T15:46:51Z instance i-047adfff288ba9f67 was selected for termination.
At 2017-02-11T15:46:50Z a user request explicitly set group desired capacity changing the desired capacity from 7 to 5.  At 2017-02-11T15:46:51Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 7 to 5.  At 2017-02-11T15:46:51Z instance i-0086b4f3a1f073b1c was selected for termination.  At 2017-02-11T15:46:51Z instance i-047adfff288ba9f67 was selected for termination.
At 2017-02-11T15:47:51Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 6.  At 2017-02-11T15:48:19Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 6.
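
The flapping is easier to see when only the desired-capacity changes are kept. The following is a rough sketch that extracts them from the `Cause` strings above. The regular expression is written against the wording of this particular output, and the function name is made up for illustration. The command prints one cause per activity, so identical lines are collapsed.

```python
import re

# Matches the "user request" sentence as it appears in the output above.
CAPACITY_CHANGE = re.compile(
    r"At (\S+) a user request explicitly set group desired capacity "
    r"changing the desired capacity from (\d+) to (\d+)\."
)


def desired_capacity_changes(causes):
    """Return unique (timestamp, old, new) tuples in first-seen order."""
    changes = []
    for cause in causes:
        for timestamp, old, new in CAPACITY_CHANGE.findall(cause):
            change = (timestamp, int(old), int(new))
            if change not in changes:
                changes.append(change)
    return changes
```

Applied to the output above, this leaves the sequence of explicit capacity requests (6 to 5, 5 to 8, 8 to 6, and so on), presumably issued by kube-aws-autoscaler.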

Similar graph as above for nginx service endpoints:

Graph for cluster nodes (from Kubernetes API):

hjacobs · 2017-02-12T09:25:11Z

Unschedulable nodes that are terminating (i.e. the unschedulable flag was set by kube-node-drainer.service during shutdown) are no longer compensated for.
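
A minimal sketch of that rule, not the actual change. It assumes each node is a dictionary with a hypothetical `terminating` flag; how that flag is derived (for example from the instance's lifecycle state in the ASG) is an assumption here and is not described in this thread.

```python
def nodes_to_compensate(nodes):
    """Return cordoned nodes that still need a replacement.

    Nodes that are cordoned because they are shutting down are skipped:
    the ASG is removing them on purpose, so replacing them would undo
    the scale-down.
    """
    return [
        node for node in nodes
        if node["unschedulable"] and not node["terminating"]
    ]


nodes = [
    {"name": "node-a", "unschedulable": False, "terminating": False},
    {"name": "node-b", "unschedulable": True, "terminating": True},   # being drained
    {"name": "node-c", "unschedulable": True, "terminating": False},  # cordoned by hand
]
print([node["name"] for node in nodes_to_compensate(nodes)])
```

Only the manually cordoned node is compensated for in this example.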

There is still the problem that the readiness of nodes is not properly evaluated, i.e. kube-aws-autoscaler only looks at the Ready condition of the kubelet, but this does not mean that the whole node can actually serve traffic.
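
For illustration, a check of the kubelet's Ready condition might look like the sketch below. It reads the `status.conditions` list of a Node object as returned by the Kubernetes API; the function name is made up, and this is not necessarily how kube-aws-autoscaler implements the check.

```python
def is_kubelet_ready(node):
    """Return True if the node reports a Ready condition with status "True"."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status") == "True"
    # No Ready condition reported at all: treat the node as not ready.
    return False
```

A node can pass this check while still being unable to serve traffic, which is the gap described above.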

hjacobs · 2017-03-03T21:17:07Z

Closing this; I created a follow-up issue for defining the node "readiness" concept: #23

hjacobs added the bug label Feb 11, 2017

hjacobs mentioned this issue Feb 11, 2017

Kube node drainer zalando-incubator/kubernetes-on-aws#257

Merged

hjacobs added a commit that referenced this issue Feb 11, 2017

#7 do not scale down if ASG as activity in progress

56f0ca4
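
The idea behind this commit can be sketched as follows. This is an illustration and not the code from the commit. It assumes `activities` is the `Activities` list returned by the `DescribeScalingActivities` API (for example via boto3's `describe_scaling_activities`), and that any status code other than `Successful`, `Failed`, or `Cancelled` means the activity is still running. The function names are hypothetical.

```python
TERMINAL_STATUS_CODES = {"Successful", "Failed", "Cancelled"}


def has_activity_in_progress(activities):
    """Return True if any scaling activity has not reached a terminal state."""
    return any(
        activity.get("StatusCode") not in TERMINAL_STATUS_CODES
        for activity in activities
    )


def allowed_desired_capacity(current, proposed, activities):
    """Hold back a scale-down while the ASG is still busy; allow anything else."""
    if proposed < current and has_activity_in_progress(activities):
        return current
    return proposed
```

Scale-ups are left untouched in this sketch, so pending pods are not held back while the ASG finishes earlier work.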

hjacobs added a commit that referenced this issue Feb 11, 2017

#7 allow autoscaling:DescribeScalingActivities

0f48747

hjacobs closed this as completed Mar 3, 2017
