Workflow-controller non-leader replicas are unhealthy #5525
Comments
@vbarbaresi do you have a metrics server port configured in your workflow-controller-configmap?
I don't have this; my workflow-controller-configmap only contains:
I edited my initial bug description:
@vbarbaresi I will look into it.
Does the metrics endpoint start if we are not leader? @terrytangyuan?
@alexec It will not start.
I am fixing it. I am able to reproduce locally.
Nope, currently it only starts when the leader starts leading, but not for the other instances. We should also add some tests to make sure the non-leading replicas are all healthy.
We're seeing this issue happen again on v3.3.1 (after upgrading from 3.0.3 to 3.3.1). The leader replica is healthy.
Summary
Non-leader workflow-controller pods are considered unhealthy because the metrics server is not running.
They are periodically restarted and end up in CrashLoopBackOff state, which undermines high availability.
Diagnostics
What Kubernetes provider are you using?
Company-hosted Kubernetes cluster based on v1.19.7
What version of Argo Workflows are you running?
v3.0.0-rc9
I'm using this default deployment template for the workflow-controller:
https://github.com/argoproj/argo-workflows/blob/v3.0.0-rc9/manifests/base/workflow-controller/workflow-controller-deployment.yaml
The problem happens after I run:
kubectl scale deploy/workflow-controller --replicas=2
It seems to come from here:
https://github.com/argoproj/argo-workflows/blob/v3.0.0-rc9/workflow/controller/controller.go#L237-L248
It seems that the metrics server only starts if the pod is elected leader?
I can propose a patch and move the metrics server start earlier in the initialization if you agree.
kubectl describe the non-leader pod replica:
Workflow controller logs: nothing particular happening except that it is not leader
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.