Jobs error at random in UI (with stripped log output) #13380
Comments
Same problem here after upgrading the Kubernetes cluster to 1.24.9. Updating to the latest AWX version didn't help either.
I might have found the reason. Setting

```yaml
- name: Collect facts
  ansible.builtin.setup:
    gather_subset:
      - 'system'
      - 'local'
```

helps to resolve the issue. Yet we have another playbook which suffers from the same issue: it returns a large log in one step which makes an API call.
+1 here. The problem seems to occur after updating K8s, too.
We have the same problem on AWS EKS 1.24 with all versions that we managed to test, 21.7.0 and 21.11.0. AWX 21.7.0 was working fine on AWS EKS 1.20. This ticket is a duplicate of #13469, and the solution is to add the `RECEPTOR_KUBE_SUPPORT_RECONNECT` environment variable to your AWX Operator config.
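A minimal sketch of what that AWX Operator spec change might look like, assuming the variable is injected via the operator's `ee_extra_env` field (verify the exact field name against your operator version before applying):

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  # Assumption: ee_extra_env passes extra env vars to the control-plane
  # execution environment container, where receptor reads this setting.
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: disabled
```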
Can confirm that this receptor reconnect modification is working for me. Thank you for the working solution!
How do I set `RECEPTOR_KUBE_SUPPORT_RECONNECT` to `disabled` for a custom pod spec?
Please confirm the following
Bug Summary
For some days now, some jobs don't echo the log properly in the UI and fail with `Job terminated due to error`. Restarting sometimes works, sometimes not; it's like rolling a die.

Other jobs which produce a lot of output (e.g. output from the `uri` module) can only be run with `no_log: true`, as they face the same "output freeze" otherwise. The job in the screenshot, though, already errors during "Gathering facts".
Important: I am aware of the log rotation issues on k8s like #13376 and others. The cluster already runs with `containerLogMaxSize: "100Mi"`, hence I assume this is not the issue. It also would not explain the errors already during the "Gathering facts" step. Furthermore, AFAIK the log rotation issues should have been addressed in the AWX version we are using (besides the required log size increase in the kubelet).

Important 2: Jobs continue to run on the cluster and finish "normally" there, yet they are marked as "error" in AWX.
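For reference, the kubelet log size increase mentioned above is set in the node's KubeletConfiguration; a minimal fragment, with example values, might look like:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Raise the per-container log size before the kubelet rotates the file;
# pods streaming large AWX job output otherwise hit rotation mid-stream.
containerLogMaxSize: "100Mi"
# Number of rotated log files kept per container (example value).
containerLogMaxFiles: 5
```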
We've recently updated the cluster from 1.23 to 1.24, though I can't fully confirm that the issues started after this upgrade. What I can say is that we didn't have any issues when we were on k8s 1.23 + AWX 21.8.0.

Overall this is a really big issue for us, as the deployments only succeed at random and a lot of manual time is needed. This is especially problematic for jobs within a workflow.
AWX version
21.10.2
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
I can't provide a simple way to reproduce this.
Expected results
The job does not error and mirrors the output from the pod.
Actual results
The job is stuck during job output and reports a false-positive error.
Additional information
`quay.io/ansible/ansible-runner:latest`, ~2 weeks old.

Happy to provide additional information that would help solve this issue.