
Workflow-controller non-leader replicas are unhealthy #5525

Closed · Fixed by #5540
vbarbaresi opened this issue Mar 26, 2021 · 9 comments

Labels: area/controller (Controller issues, panics), type/bug
Assignee: sarabala1979
vbarbaresi commented Mar 26, 2021

Summary

Non-leader workflow-controller pods are considered unhealthy because the metrics server is not running.
They are periodically restarted and end up in a CrashLoopBackOff state, which defeats the point of running extra replicas for high availability.

Diagnostics

What Kubernetes provider are you using?
Company hosted Kubernetes cluster based on v1.19.7

What version of Argo Workflows are you running?
v3.0.0-rc9

I'm using this default deployment template for the workflow-controller:
https://github.com/argoproj/argo-workflows/blob/v3.0.0-rc9/manifests/base/workflow-controller/workflow-controller-deployment.yaml
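For context, that manifest's liveness probe targets the metrics endpoint, roughly like this (an illustrative excerpt, not a verbatim copy of the manifest):

  livenessProbe:
    httpGet:
      path: /metrics
      port: 9090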

The problem happens after I run:
kubectl scale deploy/workflow-controller --replicas=2

The problem seems to come from here:
https://github.com/argoproj/argo-workflows/blob/v3.0.0-rc9/workflow/controller/controller.go#L237-L248
It looks like the metrics server only starts once the pod is elected leader.

If you agree, I can propose a patch that moves the metrics server start earlier in the initialization.
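To illustrate the shape of the change I mean: start the metrics HTTP endpoint on every replica, before entering leader election, instead of inside the leader callbacks. A rough, self-contained sketch using client-go's leader election, not the actual controller code (runControllerLoops is a placeholder):

// Rough sketch, not the actual Argo controller code: serve /metrics on every
// replica before leader election so standby replicas pass their liveness probe.
package main

import (
	"context"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	ctx := context.Background()

	// Start the metrics endpoint unconditionally: leader or standby, the
	// liveness probe on :9090/metrics will always get an answer.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		_ = http.ListenAndServe(":9090", nil)
	}()

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "workflow-controller", Namespace: os.Getenv("NAMESPACE")},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	// Only the elected leader runs the actual controller loops.
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   5 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { runControllerLoops(ctx) },
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}

// runControllerLoops is a placeholder for the workflow and pod workers.
func runControllerLoops(ctx context.Context) { <-ctx.Done() }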

kubectl describe output for the non-leader pod replica:

  Normal   Created    6h9m (x4 over 6h14m)   kubelet            Created container workflow-controller
  Normal   Started    6h9m (x4 over 6h14m)   kubelet            Started container workflow-controller
  Normal   Killing    6h9m (x3 over 6h12m)   kubelet            Container workflow-controller failed liveness probe, will be restarted
  Warning  Unhealthy  6h8m (x11 over 6h13m)  kubelet            Liveness probe failed: Get "http://10.132.138.134:9090/metrics": dial tcp 10.132.138.134:9090: connect: connection refused

Workflow controller logs: nothing notable, except that it is not the leader:

time="2021-03-26T19:36:45.580Z" level=info msg="config map" name=workflow-controller-configmap
time="2021-03-26T19:36:45.630Z" level=info msg="Get configmaps 200"
time="2021-03-26T19:36:45.639Z" level=info msg="Configuration:\nartifactRepository: {}\ncontainerRuntimeExecutor: k8sapi\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\npodSpecLogStrategy: {}\ntelemetryConfig: {}\n"
time="2021-03-26T19:36:45.639Z" level=info msg="Persistence configuration disabled"
time="2021-03-26T19:36:45.641Z" level=info msg="Starting Workflow Controller" version=v3.0.0-rc9
time="2021-03-26T19:36:45.641Z" level=info msg="Workers: workflow: 32, pod: 32, pod cleanup: 4"
time="2021-03-26T19:36:45.649Z" level=info msg="List workflowtemplates 200"
time="2021-03-26T19:36:45.652Z" level=info msg="Watch workflowtemplates 200"
time="2021-03-26T19:36:45.653Z" level=info msg="List configmaps 200"
time="2021-03-26T19:36:45.654Z" level=info msg="List pods 200"
time="2021-03-26T19:36:45.660Z" level=info msg="Watch configmaps 200"
time="2021-03-26T19:36:45.664Z" level=info msg="Watch configmaps 200"
time="2021-03-26T19:36:45.688Z" level=info msg="List workflows 200"
time="2021-03-26T19:36:45.782Z" level=info msg="Watch pods 200"
time="2021-03-26T19:36:45.970Z" level=info msg="Watch workflows 200"
time="2021-03-26T19:36:46.050Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-03-26T19:36:46.056Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-03-26T19:36:46.066Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-03-26T19:36:46.066Z" level=warning msg="Controller doesn't have RBAC access for ClusterWorkflowTemplates"
time="2021-03-26T19:36:46.113Z" level=info msg="List workflows 200"
time="2021-03-26T19:36:46.158Z" level=info msg="Manager initialized successfully"
I0326 19:36:46.158921       1 leaderelection.go:243] attempting to acquire leader lease  mortar/workflow-controller...
time="2021-03-26T19:36:46.166Z" level=info msg="Get leases 200"
time="2021-03-26T19:36:46.167Z" level=info msg="new leader" id=workflow-controller-569549784b-2kctg leader=workflow-controller-569549784b-t9xdg
time="2021-03-26T19:36:54.755Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:01.668Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:06.792Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:17.591Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:26.364Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:33.937Z" level=info msg="Get leases 200"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

sarabala1979 (Member) commented Mar 26, 2021

@vbarbaresi do you have a metrics server port configuration in your workflow-controller-configmap?

sarabala1979 self-assigned this Mar 26, 2021
vbarbaresi (Author) commented:

I don't have that; my workflow-controller-configmap only contains:

data:
  config: |
    containerRuntimeExecutor: k8sapi
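(For reference, an explicit metrics section in this ConfigMap would look roughly like the following; as far as I know, the port and path shown are the defaults used when metricsConfig is omitted:)

data:
  config: |
    containerRuntimeExecutor: k8sapi
    metricsConfig:
      enabled: true
      path: /metrics
      port: 9090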

I edited my initial bug description: everything works fine with 1 replica. The problem only happens on new replicas that aren't the leader, after I run:
kubectl scale deploy/workflow-controller --replicas=2

sarabala1979 (Member) commented:

@vbarbaresi I will look into it

alexec (Contributor) commented Mar 26, 2021

Does the metrics endpoint start if we are not leader? @terrytangyuan ?

sarabala1979 (Member) commented:

@alexec It will not start

sarabala1979 (Member) commented:

I am fixing it. I am able to reproduce locally.

terrytangyuan (Member) commented:

> Does the metrics endpoint start if we are not leader? @terrytangyuan ?

Nope, currently it only starts when the leader starts leading, not for the other instances. We should also add some tests to make sure the non-leading replicas are all healthy.
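A sketch of the kind of check such a test could make (assuming the test process can reach the pods; the argo namespace and the app=workflow-controller selector below are assumptions, not the actual e2e suite):

// Sketch of a health check over all controller replicas, leader and standby.
package main

import (
	"context"
	"fmt"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List every workflow-controller replica.
	pods, err := client.CoreV1().Pods("argo").List(context.Background(),
		metav1.ListOptions{LabelSelector: "app=workflow-controller"})
	if err != nil {
		panic(err)
	}

	for _, p := range pods.Items {
		// Each replica should answer on its metrics port, not just the leader.
		resp, err := http.Get(fmt.Sprintf("http://%s:9090/metrics", p.Status.PodIP))
		if err != nil {
			fmt.Printf("replica %s is NOT healthy: %v\n", p.Name, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("replica %s answered with HTTP %d\n", p.Name, resp.StatusCode)
	}
}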

vbarbaresi (Author) commented:

We're seeing this issue happen again on v3.3.1 (after upgrading from 3.0.3 to 3.3.1)

The leader replica is healthy.
Standby replicas are unhealthy: their liveness probe is failing because the metrics server doesn't seem to be up:

Warning Unhealthy 102s (x10 over 6m12s) kubelet Liveness probe failed: Get "http://10.131.254.165:9090/metrics": dial tcp 10.131.254.165:9090: connect: connection refused

vbarbaresi (Author) commented:

The readiness issue looks fixed in v3.3.2 by 283f6b5 (related to #8283)

agilgur5 added the area/controller label on Oct 6, 2023