From f51f891a0077f19f0f89db08e056b6355c476d04 Mon Sep 17 00:00:00 2001
From: Jason Hall
Date: Wed, 28 Jul 2021 12:13:47 -0400
Subject: [PATCH] Mark webhook and controller as safe-to-evict

The safe-to-evict annotation tells the cluster autoscaler whether the pod
can be evicted to allow the node it's on to scale down.

This was set to false (by me!) 2 years ago in
https://github.com/tektoncd/pipeline/commit/fc6ef39101a1bbead39c8271d813ed0e70733cb1
to prevent service unreliability during scale-down events. If no webhook
replicas are available, users can't create/update/delete Tekton objects; if
no controller replicas are available, status updates from Pod events, etc.,
won't be processed.

Unfortunately, blocking node eviction means the node that the pod(s) get
scheduled to can't be scaled down. Furthermore, the nodes can't be fully
drained when updating the cluster. This can leave a cluster in a
mid-upgrade state that can make issues difficult to diagnose and reason
about.

With this change, a cluster scale-down event might cause temporary service
unreliability with the default single-replica configuration. As with #3787,
if a user/operator wants to prevent this, they should configure more
replicas for HA.
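For context, the annotation being removed lives in the pod template metadata
of each deployment; an illustrative excerpt of the pre-change state:

```yaml
# Excerpt of the deployment's pod template before this change (illustrative):
template:
  metadata:
    annotations:
      # "false" told the cluster autoscaler it must NOT evict this pod,
      # which blocked scale-down and full drains of the node hosting it.
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```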
---
 config/controller.yaml | 2 --
 config/webhook.yaml    | 2 --
 docs/enabling-ha.md    | 6 +++---
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/config/controller.yaml b/config/controller.yaml
index c1de705d7d8..9568c32c383 100644
--- a/config/controller.yaml
+++ b/config/controller.yaml
@@ -37,8 +37,6 @@ spec:
       app.kubernetes.io/part-of: tekton-pipelines
   template:
     metadata:
-      annotations:
-        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
       labels:
         app.kubernetes.io/name: controller
         app.kubernetes.io/component: controller
diff --git a/config/webhook.yaml b/config/webhook.yaml
index 63f0b20eebb..739a12d75b9 100644
--- a/config/webhook.yaml
+++ b/config/webhook.yaml
@@ -40,8 +40,6 @@ spec:
       app.kubernetes.io/part-of: tekton-pipelines
   template:
     metadata:
-      annotations:
-        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
       labels:
         app.kubernetes.io/name: webhook
         app.kubernetes.io/component: webhook
diff --git a/docs/enabling-ha.md b/docs/enabling-ha.md
index d95890a4656..38290f706e3 100644
--- a/docs/enabling-ha.md
+++ b/docs/enabling-ha.md
@@ -95,9 +95,9 @@ spec:
     minReplicas: 1
 ```
 
-By default, the Webhook deployment is configured to block a [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) from scaling down the node that's running the only replica of the deployment using the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation.
-This is configured because, while the only replica of the Webhook is unavailable, Tekton resources can't be created, updated or deleted.
-If you configure more than one replica, you can remove the annotation to allow the Cluster Autoscaler more freedom to scale down nodes, without disrupting the Webhook service.
+By default, the Webhook deployment is _not_ configured to block a [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) from scaling down the node that's running the only replica of the deployment using the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation.
+This means that during node drains, the Webhook might be unavailable temporarily, during which time Tekton resources can't be created, updated or deleted.
+To avoid this, you can add the `safe-to-evict` annotation set to `false` to block node drains during autoscaling, or, better yet, configure multiple replicas of the Webhook deployment.
 
 ### Avoiding Disruptions
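
For operators following the docs change above, a sketch of the multi-replica
mitigation (the replica count and resource names here are illustrative
assumptions, not values from this patch):

```yaml
# Illustrative: run more than one Webhook replica so a node drain that
# evicts one pod leaves others serving admission requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tekton-pipelines-webhook   # assumed name; check your install's manifests
  namespace: tekton-pipelines
spec:
  replicas: 3  # any value > 1 survives a single-node drain
```

Alternatively, re-adding `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`
to the pod template restores the old blocking behavior, at the cost of the
scale-down and drain problems this patch describes.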