This repository has been archived by the owner on Dec 24, 2019. It is now read-only.

Node draining during instance shutdown leads to "flapping" state #7

Closed
hjacobs opened this issue Feb 11, 2017 · 4 comments

@hjacobs
Owner

hjacobs commented Feb 11, 2017

Autoscaling without a proper node shutdown sequence kills all pods/containers on a terminated node without any grace period. Node draining (e.g. via kube-node-drainer.service) should therefore generally be recommended to avoid service disruptions: zalando-incubator/kubernetes-on-aws#257

Problem with node draining: the autoscaler currently goes into a "flapping" state because it compensates for cordoned nodes (nodes marked as unschedulable). Example:

  • kube-aws-autoscaler figures out new DesiredCapacity and scales down from 6 to 5
  • ASG terminates one EC2 instance
  • kube-node-drainer.service on the EC2 instance calls kubectl drain
  • the node is therefore marked as "unschedulable"
  • kube-aws-autoscaler now sees 6 nodes, but one of them is cordoned, so it compensates to 7 nodes
  • kube-aws-autoscaler sets ASG DesiredCapacity to 7
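
The sequence above can be reproduced with a small simulation. This is only a sketch: the `Node` class and the rule "desired capacity = observed nodes + cordoned nodes" are assumptions chosen to match the numbers in the example, not the actual kube-aws-autoscaler code.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    unschedulable: bool = False  # True once `kubectl drain` has cordoned the node


def naive_desired_capacity(nodes):
    """Assumed rule: request one extra node for every cordoned node."""
    cordoned = sum(1 for node in nodes if node.unschedulable)
    return len(nodes) + cordoned


# Six nodes are still registered; the ASG is terminating one of them,
# and kube-node-drainer.service has already cordoned it.
nodes = [Node(f"node-{i}") for i in range(5)] + [Node("node-5", unschedulable=True)]
print(naive_desired_capacity(nodes))  # 7
```

Under this assumed rule, a scale-down to 5 is answered with a scale-up to 7 as soon as the terminating node is cordoned.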
@hjacobs hjacobs added the bug label Feb 11, 2017
@hjacobs
Owner Author

hjacobs commented Feb 11, 2017

This is a critical issue, as can be seen in the following two graphs (simple scenario: scaling nginx up from 10 to 50 replicas with kubectl scale deploy nginx --replicas=50).

Number of service endpoints for nginx service (should be going up from 10 to 50):
screenshot_2017-02-11_15-37-43

Number of healthy worker hosts (ELB health check to kubelet on worker node):
screenshot_2017-02-11_15-39-01

Eventually everything is fine:

  • 50 nginx pods are running and all are ready (registered in service as endpoint)
  • worker ASG runs stable with 6 nodes

But the flapping/scaling period clearly causes service disruptions (during one minute, only 7 of 50 service endpoints were available).

@hjacobs
Owner Author

hjacobs commented Feb 11, 2017

Another scenario where a single instance (i-02e6f917c5d3bd5d8) is terminated manually:

$ aws autoscaling describe-scaling-activities --auto-scaling-group-name kube-aws-test-1-WorkerAutoScaling-UUMORYB3ZVK2 --max-records=20 | jq '.Activities[].Cause' -r | tac
At 2017-02-11T15:33:29Z instance i-02e6f917c5d3bd5d8 was taken out of service in response to a user request.
At 2017-02-11T15:33:41Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 6.
At 2017-02-11T15:34:54Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 5.  At 2017-02-11T15:35:09Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 6 to 5.  At 2017-02-11T15:35:09Z instance i-0b9548c2536c04b90 was selected for termination.
At 2017-02-11T15:36:21Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 8.  At 2017-02-11T15:36:37Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 8.
At 2017-02-11T15:36:21Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 8.  At 2017-02-11T15:36:37Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 8.
At 2017-02-11T15:36:21Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 8.  At 2017-02-11T15:36:37Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 8.
At 2017-02-11T15:37:22Z a user request explicitly set group desired capacity changing the desired capacity from 8 to 6.  At 2017-02-11T15:38:05Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 8 to 6.  At 2017-02-11T15:38:05Z instance i-09eff56ae37a67833 was selected for termination.  At 2017-02-11T15:38:05Z instance i-0cb8ccba1996cc308 was selected for termination.
At 2017-02-11T15:37:22Z a user request explicitly set group desired capacity changing the desired capacity from 8 to 6.  At 2017-02-11T15:38:05Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 8 to 6.  At 2017-02-11T15:38:05Z instance i-09eff56ae37a67833 was selected for termination.  At 2017-02-11T15:38:05Z instance i-0cb8ccba1996cc308 was selected for termination.
At 2017-02-11T15:39:10Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 9.  At 2017-02-11T15:39:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 9.
At 2017-02-11T15:39:10Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 9.  At 2017-02-11T15:39:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 9.
At 2017-02-11T15:39:10Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 9.  At 2017-02-11T15:39:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 9.
At 2017-02-11T15:40:11Z a user request explicitly set group desired capacity changing the desired capacity from 9 to 6.  At 2017-02-11T15:40:31Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 9 to 6.  At 2017-02-11T15:40:31Z instance i-07a7326eda26c20cd was selected for termination.  At 2017-02-11T15:40:31Z instance i-0f0d7717d203a1ff2 was selected for termination.  At 2017-02-11T15:40:31Z instance i-0eb6db2e0e79092ad was selected for termination.
At 2017-02-11T15:40:11Z a user request explicitly set group desired capacity changing the desired capacity from 9 to 6.  At 2017-02-11T15:40:31Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 9 to 6.  At 2017-02-11T15:40:31Z instance i-07a7326eda26c20cd was selected for termination.  At 2017-02-11T15:40:31Z instance i-0f0d7717d203a1ff2 was selected for termination.  At 2017-02-11T15:40:31Z instance i-0eb6db2e0e79092ad was selected for termination.
At 2017-02-11T15:40:11Z a user request explicitly set group desired capacity changing the desired capacity from 9 to 6.  At 2017-02-11T15:40:31Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 9 to 6.  At 2017-02-11T15:40:31Z instance i-07a7326eda26c20cd was selected for termination.  At 2017-02-11T15:40:31Z instance i-0f0d7717d203a1ff2 was selected for termination.  At 2017-02-11T15:40:31Z instance i-0eb6db2e0e79092ad was selected for termination.
At 2017-02-11T15:41:11Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 7.  At 2017-02-11T15:41:30Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 7.
At 2017-02-11T15:44:49Z a user request explicitly set group desired capacity changing the desired capacity from 7 to 6.  At 2017-02-11T15:44:54Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 7 to 6.  At 2017-02-11T15:44:54Z instance i-0eca68586af353f8f was selected for termination.
At 2017-02-11T15:45:50Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 7.  At 2017-02-11T15:45:53Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 6 to 7.
At 2017-02-11T15:46:50Z a user request explicitly set group desired capacity changing the desired capacity from 7 to 5.  At 2017-02-11T15:46:51Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 7 to 5.  At 2017-02-11T15:46:51Z instance i-0086b4f3a1f073b1c was selected for termination.  At 2017-02-11T15:46:51Z instance i-047adfff288ba9f67 was selected for termination.
At 2017-02-11T15:46:50Z a user request explicitly set group desired capacity changing the desired capacity from 7 to 5.  At 2017-02-11T15:46:51Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 7 to 5.  At 2017-02-11T15:46:51Z instance i-0086b4f3a1f073b1c was selected for termination.  At 2017-02-11T15:46:51Z instance i-047adfff288ba9f67 was selected for termination.
At 2017-02-11T15:47:51Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 6.  At 2017-02-11T15:48:19Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 5 to 6.
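
To make the flapping in this log easier to see, the desired-capacity changes can be extracted from the Cause strings. The following is a rough sketch: the function name and the regular expression are assumptions based on the message wording shown above, and the sample list contains only the first sentence of three of the Cause strings.

```python
import re

# Matches the "user request" sentence that records a DesiredCapacity change.
PATTERN = re.compile(
    r"At (\S+) a user request explicitly set group desired capacity "
    r"changing the desired capacity from (\d+) to (\d+)\."
)


def capacity_changes(cause_lines):
    """Return unique (timestamp, old, new) tuples in order of appearance."""
    seen = set()
    changes = []
    for line in cause_lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # e.g. a manual instance termination
        timestamp = match.group(1)
        if timestamp in seen:
            continue  # the same Cause is repeated once per affected instance
        seen.add(timestamp)
        changes.append((timestamp, int(match.group(2)), int(match.group(3))))
    return changes


sample = [
    "At 2017-02-11T15:33:29Z instance i-02e6f917c5d3bd5d8 was taken out of service in response to a user request.",
    "At 2017-02-11T15:34:54Z a user request explicitly set group desired capacity changing the desired capacity from 6 to 5.",
    "At 2017-02-11T15:36:21Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 8.",
    "At 2017-02-11T15:36:21Z a user request explicitly set group desired capacity changing the desired capacity from 5 to 8.",
]
for timestamp, old, new in capacity_changes(sample):
    print(timestamp, old, "->", new)
```

Read this way, the full log above shows ten DesiredCapacity changes between 15:34:54Z and 15:47:51Z: 6 to 5, 5 to 8, 8 to 6, 6 to 9, 9 to 6, 6 to 7, 7 to 6, 6 to 7, 7 to 5 and 5 to 6.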

A graph similar to the one above, for nginx service endpoints:
screenshot_2017-02-11_17-03-28

Graph for cluster nodes (from Kubernetes API):
screenshot_2017-02-11_17-03-16

@hjacobs
Owner Author

hjacobs commented Feb 12, 2017

Unschedulable nodes which are terminating (i.e. the unschedulable flag was set by kube-node-drainer.service during shutdown) are no longer compensated for.
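
A minimal sketch of that rule, assuming a hypothetical `terminating` flag on each node (how the autoscaler actually detects a terminating node is not described here):

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    unschedulable: bool = False  # node is cordoned
    terminating: bool = False    # hypothetical flag: instance is shutting down


def nodes_to_compensate(nodes):
    """Return cordoned nodes that still warrant a replacement."""
    return [n for n in nodes if n.unschedulable and not n.terminating]


nodes = [
    Node("node-a"),
    Node("node-b", unschedulable=True),                    # cordoned, not terminating
    Node("node-c", unschedulable=True, terminating=True),  # cordoned during shutdown
]
print([n.name for n in nodes_to_compensate(nodes)])  # ['node-b']
```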

There is still the problem that the readiness of nodes is not properly evaluated, i.e. kube-aws-autoscaler only looks at the Ready condition of the kubelet, but this does not mean that the whole node can actually serve traffic.
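
For illustration, checking only the kubelet's Ready condition amounts to roughly the following, assuming the node is available as the JSON object returned by the Kubernetes API (the function name and the sample node are hypothetical). Passing this check says nothing about whether the rest of the node can serve traffic.

```python
def kubelet_ready(node):
    """Return True if the node reports the Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


node = {
    "metadata": {"name": "worker-1"},  # hypothetical node name
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "Ready", "status": "True"},
        ]
    },
}
print(kubelet_ready(node))  # True
```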

@hjacobs
Owner Author

hjacobs commented Mar 3, 2017

Closing this; I created a follow-up issue for defining the node "readiness" concept: #23

@hjacobs hjacobs closed this as completed Mar 3, 2017