AWX job fails but it completes successfully #14288
Comments
Same problem here; we got this error on a big inventory with a lot of lines at once (skipped status).
Do you have an error message in your awx-ee container around the time the job fails? Are you using the Receptor reconnect feature? Did you bump up the max container log size for your cluster? Can you also provide the output?
@fosterseth Hi! Sorry for the late response.
In the container there are no errors; everything runs to the end successfully.
No, but we tested this feature when you asked about it.
After that we started to get failing jobs with no relevant output. Jobs fail immediately, after 1-5 seconds of gathering facts, when we set it. When we reverted all the changes (removed the extra env), everything went back to usual. The interesting thing is that when we set this var in the instance group (the job pod template, not in the operator), everything stays the same: the AWX UI job fails, but the container keeps working.
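For reference, here is a minimal sketch of the two placements being compared: the operator-level env versus the instance group pod spec override. The ee_extra_env field and the worker container name are assumptions based on awx-operator defaults, and the enabled value is illustrative rather than taken from this deployment:

```yaml
# Sketch only: operator-level placement (assumed ee_extra_env field of the AWX CR).
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
---
# Sketch only: instance group "pod spec override", which applies to the
# automation job pods rather than the control plane EE container.
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
spec:
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest
      env:
        - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
          value: enabled
```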
Yes, we increased these parameters for the kubelet earlier, with no luck. In sum, we reverted all the related changes. We tried several approaches, and nothing helped.
Sure, here is the JSON output:
What version of the kubelet do you have (kubectl get node should show it)? RECEPTOR_KUBE_SUPPORT_RECONNECT requires a certain minimum version, which you can find on the PR ansible/receptor#683. It still feels like you might be running into log rotation problems; 100Mi is still pretty low given your inventory/task size. Maybe bump this up? It would need to accommodate the entire stdout of the automation job pod.
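For anyone following along, a minimal sketch of the kubelet log-rotation settings being referred to here, assuming a standalone KubeletConfiguration file is in use; the field names are the standard kubelet ones and the values are only illustrative (1Gi mirrors what gets tried later in the thread):

```yaml
# Sketch of the kubelet log rotation knobs discussed above; values are illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 1Gi    # max size of a container log file before rotation (default 10Mi)
containerLogMaxFiles: 5     # number of rotated log files kept per container (default 5)
```

The reasoning in the comment above is that the entire stdout of the automation job pod has to fit before rotation kicks in; rotating the log mid-job is what appears to break the stream AWX reads back.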
@fosterseth hi! I bumped the logs up to 1Gi and will test it, thanks for the tip. Appreciate your assistance!
Tested with the logs bumped up to 1Gi, same result: the UI says the job failed, but the pod keeps working and giving me actual logs.
Same issue here. My last post: #12685
I'm here with updates for you.
Hi @EsDmitrii @jollyekb, could you share your solution, in case you got it working? @fosterseth, any other data/logs I should check that could be helpful here? The AWX job output has only ~150 lines at the time it fails. For me, after upgrading, what used to be

"job_explanation": "Job terminated due to error",
...
"result_traceback": "Finished",

is now replaced with

"job_explanation": "The running ansible process received a shutdown signal.",
...
"result_traceback": "",

and the AWX jobs fail after about ~4 minutes. Kubelet version:
The logs from the task pod don't say much for job id 29842:
Hi @fivetran-joliveira!
Hi @EsDmitrii,
Hi, same for us after upgrading k3s to 1.27.5 (AWX 21.14.0); before the k3s upgrade (from 1.24) we didn't have this problem.
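For the k3s deployments mentioned here, the same log-rotation flags are typically forwarded to the embedded kubelet through k3s's kubelet-arg passthrough; a sketch, assuming /etc/rancher/k3s/config.yaml is used and with illustrative values:

```yaml
# /etc/rancher/k3s/config.yaml (sketch) -- forwards flags to the embedded kubelet.
# These are the CLI-flag counterparts of containerLogMaxSize / containerLogMaxFiles.
kubelet-arg:
  - "container-log-max-size=1Gi"
  - "container-log-max-files=5"
```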
I noticed that the playbook container was in completed status and that it was deleted before AWX could retrieve the logs, so the job actually succeeds but is marked as an error in AWX with an incomplete stream.
In my case, the streaming stops because of a big file with many lines (80k+); capturing its content freezes the stream.
@EsDmitrii
Hi all! |
Please confirm the following
I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)
Bug Summary
Hi!
We have a weird issue with AWX.
Our jobs fail, but all tasks in the Ansible log complete without errors.
There are also no errors in the AWX pod (we run it inside k8s).
The AWX log freezes and we can't see any updates until the AWX pod finishes the task.
After all that, the AWX job fails with the error "Job terminated due to error".
AWX version
22.3.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
ansible-core==2.15.0 ansible-runner==2.3.2
Operating system
CentOS 7
Web browser
Chrome
Steps to reproduce
Run a huge task on a huge inventory :)
We really don't know the root cause of this issue.
Expected results
The job succeeds without errors.
Actual results
The job fails with the error "Job terminated due to error".
Additional information
We found a lot of relevant issues with no updates:
#12297 (open with no updates)
#12530 (closed, with a claim that the issue was fixed in 21.14.0)
#13376 (closed but there is a new comment with the same error)
#12400 (open)
etc.