
Workflow-controller non-leader replicas are unhealthy #5525

Closed · Fixed by #5540
vbarbaresi opened this issue Mar 26, 2021 · 9 comments

Labels: area/controller (Controller issues, panics), type/bug
Assignee: sarabala1979
vbarbaresi commented Mar 26, 2021

Summary

Non-leader workflow-controller pods are considered unhealthy because the metrics server is not running.
They are periodically restarted and end up in a CrashLoopBackOff state, which defeats the point of running extra replicas for high availability.

Diagnostics

What Kubernetes provider are you using?
Company hosted Kubernetes cluster based on v1.19.7

What version of Argo Workflows are you running?
v3.0.0-rc9

I'm using this default deployment template for the workflow-controller:
https://github.com/argoproj/argo-workflows/blob/v3.0.0-rc9/manifests/base/workflow-controller/workflow-controller-deployment.yaml
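For context, that manifest's liveness probe targets the metrics endpoint, roughly like this (an illustrative excerpt, not a verbatim copy of the manifest):

  livenessProbe:
    httpGet:
      path: /metrics
      port: 9090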

The problem happens after I run:
kubectl scale deploy/workflow-controller --replicas=2

The problem seems to come from here:
https://github.com/argoproj/argo-workflows/blob/v3.0.0-rc9/workflow/controller/controller.go#L237-L248
It looks like the metrics server only starts once the pod is elected leader.

If you agree, I can propose a patch that moves the metrics server start earlier in the initialization.
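To illustrate the shape of the change I mean: start the metrics HTTP endpoint on every replica, before entering leader election, instead of inside the leader callbacks. A rough, self-contained sketch using client-go's leader election, not the actual controller code (runControllerLoops is a placeholder):

// Rough sketch, not the actual Argo controller code: serve /metrics on every
// replica before leader election so standby replicas pass their liveness probe.
package main

import (
	"context"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	ctx := context.Background()

	// Start the metrics endpoint unconditionally: leader or standby, the
	// liveness probe on :9090/metrics will always get an answer.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		_ = http.ListenAndServe(":9090", nil)
	}()

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "workflow-controller", Namespace: os.Getenv("NAMESPACE")},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	// Only the elected leader runs the actual controller loops.
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   5 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { runControllerLoops(ctx) },
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}

// runControllerLoops is a placeholder for the workflow and pod workers.
func runControllerLoops(ctx context.Context) { <-ctx.Done() }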

kubectl describe output for the non-leader pod replica:

  Normal   Created    6h9m (x4 over 6h14m)   kubelet            Created container workflow-controller
  Normal   Started    6h9m (x4 over 6h14m)   kubelet            Started container workflow-controller
  Normal   Killing    6h9m (x3 over 6h12m)   kubelet            Container workflow-controller failed liveness probe, will be restarted
  Warning  Unhealthy  6h8m (x11 over 6h13m)  kubelet            Liveness probe failed: Get "http://10.132.138.134:9090/metrics": dial tcp 10.132.138.134:9090: connect: connection refused

Workflow controller logs: nothing notable, except that it is not the leader:

time="2021-03-26T19:36:45.580Z" level=info msg="config map" name=workflow-controller-configmap
time="2021-03-26T19:36:45.630Z" level=info msg="Get configmaps 200"
time="2021-03-26T19:36:45.639Z" level=info msg="Configuration:\nartifactRepository: {}\ncontainerRuntimeExecutor: k8sapi\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\npodSpecLogStrategy: {}\ntelemetryConfig: {}\n"
time="2021-03-26T19:36:45.639Z" level=info msg="Persistence configuration disabled"
time="2021-03-26T19:36:45.641Z" level=info msg="Starting Workflow Controller" version=v3.0.0-rc9
time="2021-03-26T19:36:45.641Z" level=info msg="Workers: workflow: 32, pod: 32, pod cleanup: 4"
time="2021-03-26T19:36:45.649Z" level=info msg="List workflowtemplates 200"
time="2021-03-26T19:36:45.652Z" level=info msg="Watch workflowtemplates 200"
time="2021-03-26T19:36:45.653Z" level=info msg="List configmaps 200"
time="2021-03-26T19:36:45.654Z" level=info msg="List pods 200"
time="2021-03-26T19:36:45.660Z" level=info msg="Watch configmaps 200"
time="2021-03-26T19:36:45.664Z" level=info msg="Watch configmaps 200"
time="2021-03-26T19:36:45.688Z" level=info msg="List workflows 200"
time="2021-03-26T19:36:45.782Z" level=info msg="Watch pods 200"
time="2021-03-26T19:36:45.970Z" level=info msg="Watch workflows 200"
time="2021-03-26T19:36:46.050Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-03-26T19:36:46.056Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-03-26T19:36:46.066Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-03-26T19:36:46.066Z" level=warning msg="Controller doesn't have RBAC access for ClusterWorkflowTemplates"
time="2021-03-26T19:36:46.113Z" level=info msg="List workflows 200"
time="2021-03-26T19:36:46.158Z" level=info msg="Manager initialized successfully"
I0326 19:36:46.158921       1 leaderelection.go:243] attempting to acquire leader lease  mortar/workflow-controller...
time="2021-03-26T19:36:46.166Z" level=info msg="Get leases 200"
time="2021-03-26T19:36:46.167Z" level=info msg="new leader" id=workflow-controller-569549784b-2kctg leader=workflow-controller-569549784b-t9xdg
time="2021-03-26T19:36:54.755Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:01.668Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:06.792Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:17.591Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:26.364Z" level=info msg="Get leases 200"
time="2021-03-26T19:37:33.937Z" level=info msg="Get leases 200"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

sarabala1979 (Member) commented Mar 26, 2021

@vbarbaresi do you have a metrics server port configuration in your workflow-controller-configmap?

sarabala1979 self-assigned this Mar 26, 2021
vbarbaresi (Author) commented:

I don't have that; my workflow-controller-configmap only contains:

data:
  config: |
    containerRuntimeExecutor: k8sapi
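(For reference, an explicit metrics section in this ConfigMap would look roughly like the following; as far as I know, the port and path shown are the defaults used when metricsConfig is omitted:)

data:
  config: |
    containerRuntimeExecutor: k8sapi
    metricsConfig:
      enabled: true
      path: /metrics
      port: 9090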

I edited my initial bug description: everything works fine with 1 replica. The problem only happens on new replicas that aren't the leader, after I run:
kubectl scale deploy/workflow-controller --replicas=2

sarabala1979 (Member) commented:

@vbarbaresi I will look into it

alexec (Contributor) commented Mar 26, 2021

Does the metrics endpoint start if we are not leader? @terrytangyuan ?

sarabala1979 (Member) commented:

@alexec It will not start

sarabala1979 (Member) commented:

I am fixing it. I am able to reproduce locally.

terrytangyuan (Member) commented:

> Does the metrics endpoint start if we are not leader? @terrytangyuan ?

Nope, currently it only starts when the leader starts leading, not for the other instances. We should also add some tests to make sure the non-leading replicas are all healthy.
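A sketch of the kind of check such a test could make (assuming the test process can reach the pods; the argo namespace and the app=workflow-controller selector below are assumptions, not the actual e2e suite):

// Sketch of a health check over all controller replicas, leader and standby.
package main

import (
	"context"
	"fmt"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List every workflow-controller replica.
	pods, err := client.CoreV1().Pods("argo").List(context.Background(),
		metav1.ListOptions{LabelSelector: "app=workflow-controller"})
	if err != nil {
		panic(err)
	}

	for _, p := range pods.Items {
		// Each replica should answer on its metrics port, not just the leader.
		resp, err := http.Get(fmt.Sprintf("http://%s:9090/metrics", p.Status.PodIP))
		if err != nil {
			fmt.Printf("replica %s is NOT healthy: %v\n", p.Name, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("replica %s answered with HTTP %d\n", p.Name, resp.StatusCode)
	}
}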

vbarbaresi (Author) commented:

We're seeing this issue happen again on v3.3.1 (after upgrading from 3.0.3 to 3.3.1)

The leader replica is healthy.
Standby replicas are unhealthy: their liveness probe is failing because the metrics server doesn't seem to be up:

Warning Unhealthy 102s (x10 over 6m12s) kubelet Liveness probe failed: Get "http://10.131.254.165:9090/metrics": dial tcp 10.131.254.165:9090: connect: connection refused

vbarbaresi (Author) commented:

The readiness issue looks fixed in v3.3.2 by 283f6b5 (related to #8283)

agilgur5 added the area/controller label on Oct 6, 2023