
Mark webhook and controller as safe-to-evict #4124

Merged (1 commit) on Jul 29, 2021

Conversation

imjasonh
Member

The safe-to-evict annotation tells the cluster autoscaler whether the
pod can be evicted to allow the node it's on to scale down.
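
For reference, a minimal sketch of where this annotation typically sits: on the pod template of a Deployment. Only the annotation key comes from the cluster autoscaler; the Deployment name, namespace, labels, and image below are placeholders for illustration and are not copied from this PR's diff.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-controller        # hypothetical name
  namespace: example-namespace    # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-controller
  template:
    metadata:
      labels:
        app: example-controller
      annotations:
        # "true": the cluster autoscaler may evict this pod to scale the node down.
        # "false": the pod blocks scale-down of whichever node it runs on.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
        - name: controller
          image: example.com/controller:latest   # placeholder image
```

The annotation goes on the pod template rather than on the Deployment's own metadata, since the autoscaler inspects pods.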

This was set to false (by me!) 2 years ago in fc6ef39
to prevent service unreliability during scale-down events. If no
webhook replicas are available, users can't create/update/delete
Tekton objects; if no controller replicas are available, status updates
from Pod events, etc., won't be processed.

Unfortunately, blocking eviction means the node that the pod(s) are
scheduled to can't be scaled down. Furthermore, the nodes can't be fully
drained when updating the cluster. This can leave a cluster in a
mid-upgrade state that can make issues difficult to diagnose and reason
about.

With this change, a cluster scale-down event might cause temporary
service unreliability with the default single-replica configuration. As
with #3787, if a user/operator wants to prevent this, they should
configure more replicas for HA.
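
As a rough sketch of what "more replicas" could look like, the fragment below is a patch-style manifest that raises the replica count on a controller Deployment. The Deployment name and namespace are assumptions, as is the replica count of 3; docs/enabling-ha.md remains the authority on the supported HA settings.

```yaml
# Hypothetical patch; name and namespace are assumed, not taken from this PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tekton-pipelines-controller   # assumed Deployment name
  namespace: tekton-pipelines         # assumed namespace
spec:
  # With several replicas, evicting one pod during a node scale-down
  # leaves the others available to keep serving.
  replicas: 3
```

With more than one replica, whether a scale-down is disruptive also depends on how the pods are spread across nodes, which this fragment does not address.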

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • [n/a] Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

By default, the controller and webhook are now marked as safe to evict by the cluster autoscaler. See docs/enabling-ha.md for more details.

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jul 28, 2021
@tekton-robot tekton-robot requested review from dibyom and dlorenc July 28, 2021 16:25
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 28, 2021
@imjasonh
Member Author

/test tekton-pipeline-unit-tests


@imjasonh
Member Author

/test pull-tekton-pipeline-alpha-integration-tests

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester


@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2021
@pierretasci

/assign
/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 29, 2021
@tekton-robot tekton-robot merged commit 5350069 into tektoncd:main Jul 29, 2021
@vdemeester
Member

/cc @dibyom as I think triggers does the same.

@dibyom
Member

dibyom commented Aug 5, 2021

Yeah we should port this to triggers as well
