PodDisruptionBudget causing inability to gracefully drain a node with tekton-pipelines-webhook pod #3654

Closed
r0bj opened this issue Jan 5, 2021 · 13 comments · Fixed by #3784 or #3787
Labels: kind/bug

Comments

r0bj commented Jan 5, 2021

Expected Behavior

Gracefully draining nodes that contain Tekton pods should be possible.

Actual Behavior

With a single running tekton-pipelines-webhook pod (which is possible because the tekton-pipelines-webhook HorizontalPodAutoscaler has spec.minReplicas: 1), it's impossible to gracefully drain the node hosting it, because the PodDisruptionBudget is set to minAvailable: 80%.
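
For reference, the budget shipped with the release looks roughly like this (a sketch; the exact name, namespace, and selector labels are assumptions based on a standard install, not copied from release.yaml):

```yaml
# Sketch of the shipped PodDisruptionBudget (names and labels assumed).
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: tekton-pipelines-webhook
  namespace: tekton-pipelines
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app.kubernetes.io/name: webhook
      app.kubernetes.io/part-of: tekton-pipelines
```

With one matching pod, 80% rounds up to 1 pod that must stay available, so the budget never allows a voluntary eviction and the drain cannot complete.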

Steps to Reproduce the Problem

  1. Run tekton-pipeline: kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/previous/v0.19.0/release.yaml
  2. Make sure there is a single tekton-pipelines-webhook pod (scaled down)
  3. Try to drain the node running the tekton-pipelines-webhook pod

Additional Info

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:41:49Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
Client version: 0.15.0
Pipeline version: v0.19.0
Triggers version: v0.10.0
r0bj added the kind/bug label Jan 5, 2021
@imjasonh (Member)

This definitely does sound like something we need to document and communicate to operators better.

The PodDisruptionBudget was put in place to prevent users from being unable to create PipelineRuns/TaskRuns/etc. when the webhook Pod is unavailable, e.g., during an upgrade. Unfortunately, the way it did this basically turned into "never let Kubernetes make the Pod unavailable" 😬

There are basically two options, I think:

  1. If an operator can tolerate some downtime (a few seconds), they can remove the PodDisruptionBudget -- either always, or only just before the upgrade, then put it back when they're done.
  2. Scale up the webhook deployment -- again, either always, or only just before an upgrade, and back down again after. Since minAvailable is 80%, you'd need 5 replicas to be able to safely take one down, and Kubernetes would do a rolling upgrade through one at a time.

I think the fix for this bug is, at the very least, to document this behavior for operators. We might want to make minAvailable 50% so that only two replicas are needed to trigger the rolling upgrade behavior.
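
For example, at 50% the relevant part of the budget would just be (fragment sketch):

```yaml
spec:
  # With two healthy replicas, ceil(0.5 * 2) = 1 pod must stay available,
  # so one voluntary eviction at a time is allowed and drains can proceed.
  minAvailable: 50%
```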

cc @vdemeester @nikhil-thomas

@imjasonh (Member)

The PodDisruptionBudget was put in place in #3391 cc @raballew

imjasonh added this to the Pipelines 0.22 milestone Feb 12, 2021
@bitsofinfo

1 and/or 2 are what I did to get around it, but it burned a day or so with our k8s PaaS support trying to triage why the cluster upgrade was halted/stuck because of this.

@bitsofinfo

Agree, it's more of a documentation heads-up / clear-warning kind of thing.

r0bj commented Feb 12, 2021

In my opinion, defaults should be reasonable. The current default values cause trouble operating the cluster and require manual intervention during normal operations like draining nodes.

In this case I see options:

  • PodDisruptionBudget is optional, not enabled by default
  • HorizontalPodAutoscaler is set with a higher minReplicas to meet the PodDisruptionBudget demands (see the sketch below)
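
A sketch of that second option, assuming the HPA shipped with the release is named tekton-pipelines-webhook in the tekton-pipelines namespace (autoscaling/v1 shown for brevity; the release may use a newer API version and different target metrics):

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tekton-pipelines-webhook
  namespace: tekton-pipelines
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tekton-pipelines-webhook
  minReplicas: 5   # with minAvailable: 80%, ceil(0.8 * 5) = 4, so one pod may be evicted
  maxReplicas: 5
  targetCPUUtilizationPercentage: 100
```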

@imjasonh (Member)

In my opinion, defaults should be reasonable. The current default values cause trouble operating the cluster and require manual intervention during normal operations like draining nodes.

I agree, the default configuration should be safe and easy to use for basic usage.

cc @afrittoli

Worth noting that Knative seems to have the same configuration (autoscaling, HPA); I wonder whether they've got the same issue, or whether they have a solution: https://github.com/knative/serving/blob/master/config/core/deployments/webhook-hpa.yaml added in knative/serving#9444

@nikhil-thomas (Member)

Worth noting that Knative seems to have the same configuration (autoscaling, HPA); I wonder whether they've got the same issue, or whether they have a solution: https://github.com/knative/serving/blob/master/config/core/deployments/webhook-hpa.yaml added in knative/serving#9444

The Knative team is handling this in the operator, by patching the PDB on the fly:
https://github.com/openshift-knative/serverless-operator/blob/master/openshift-knative-operator/hack/003-serving-pdb.patch
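
In effect that patch pins the budget to a single pod; the resulting spec amounts to roughly this (a sketch, not the literal patch contents):

```yaml
spec:
  minAvailable: 1   # allows eviction as soon as a second healthy replica exists
```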

@imjasonh (Member)

This issue also seems to be related: knative/operator#376

If we change our PDB to minAvailable:1 as in that patch, will the HPA scale up a second pod on another node, thus allowing the first pod to go down? I don't know enough about how HPA+PDB work together.

@imjasonh (Member)

@nak3 seems to be responsible for both knative/operator#376 and the minAvailable:1 patch added in openshift-knative/serverless-operator#751, I wonder if he could advise on how best to approach this in Tekton as well.

In general I'd prefer to solve this by making Tekton's default configuration easy to use even if it means being slightly less HA-capable, and then document HA and maybe make operators responsible for automating it. Anyway, curious for @nak3's thoughts.

nak3 commented Feb 16, 2021

I'd prefer to solve this by making Tekton's default configuration easy to use even if it means being slightly less HA-capable,

Yes, we use minAvailable:1 for the same reason. Although it is not a perfect solution, as you know, it is easy to use.
You can refer to https://issues.redhat.com/browse/SRVKS-670 (sorry, the link is Red Hat internal only) for the other opinions from when we had the discussion.

@nikhil-thomas (Member)

@abhinavkhanna-sf
A quick solution is to change the PDB resource in your tekton-pipelines installation,
e.g.: https://github.com/tektoncd/operator/blob/main/cmd/openshift/kodata/tekton-pipeline/0.19.0/00_tektoncd-pipeline-v0.19.0.yaml#L1826

@imjasonh (Member)

Turns out, #3784 probably won't actually fix the issue. 😢

Based on investigation from @nikhil-thomas (🙇‍♂️), even with minAvailable:1, Kubernetes won't scale up another replica before taking the existing one down to drain the node.

We basically have two options in the near-term:

  1. Always run at least 2 replicas (replicas:2 in the deployment, minReplicas:2 in the HPA), with pod anti-affinity so they're always on different nodes (see the sketch after this list), and they'll take turns going down when nodes drain/restart. This means requiring double the resources, even when one replica is enough in the normal steady state.
    • A downside to this, as pointed out by @nikhil-thomas, is that if the cluster has one node (minikube, small test clusters), we'll still end up in a state where nodes can't drain because it'll mean violating the PDB. Getting around this is a bit more involved (see below), and possibly out of scope since single-node test clusters probably don't care about HA that much 🤷‍♂️
  2. Disable the PDB by default, and instruct users on how to re-enable it. This could be as easy as minAvailable:0 in the PDB, with a comment and markdown docs in docs/enabling-ha.md explaining that, to enable HA for the webhook, you need to set replicas:2 (or more) and minAvailable:1.
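
For option (1), the webhook Deployment would need something along these lines (a sketch; the label keys are assumptions based on a standard install):

```yaml
# Option (1) sketch: two replicas spread across nodes via pod anti-affinity,
# so evicting one of them never violates a minAvailable:1 budget.
spec:
  replicas: 2
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: webhook
                  app.kubernetes.io/part-of: tekton-pipelines
```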

Given those options, I think (2) is the least surprising/painful to users, and I'll send a PR today to Make It So.
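
For option (2), the shipped default could be as small as this (sketch; metadata omitted):

```yaml
spec:
  minAvailable: 0   # budget is effectively disabled; drains are never blocked
# To opt back into HA, bump the webhook to replicas: 2 (or more) and set
# minAvailable: 1 here, as the enabling-ha docs would describe.
```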

Longer-term, this is something the operator can handle for us, in a couple possible ways:

  1. Expose a setting like enableHA: true, which turns on/off the minReplicas:2/minAvailable:1 described in the docs above and makes it easier to enable/disable and harder to mess up.
    • This setting could possibly also detect a single-node situation and refuse to enable HA, and surface a warning that HA is impossible in single-node scenarios.
  2. Possibly detect that nodes are draining and temporarily set replicas:2 to scale/move the replica, to unblock draining.

In any case, these operator changes will be more involved than just documentation, so they probably deserve a TEP of their own, and won't happen soon.

sbose78 commented Mar 5, 2021

Great adventure!

Expose a setting like enableHA: true, which turns on/off the minReplicas:2/minAvailable:1 described in the docs above and makes it easier to enable/disable and harder to mess up.
This setting could possibly also detect a single-node situation and refuse to enable HA, and surface a warning that HA is impossible in single-node scenarios.

This looks like a good first step!
