Job failed with just error and without log output #13469
Comments
I compiled a debug image: quay.io/haoliu/awx-ee:debug-stdout
It adds a simple change that logs the output from the streamReader, so we can see whether receptor was actually able to read from the kube-apiserver. |
@Halytskyi DMed me a gist containing the result. It does seem that we were able to read from the stdout stream. |
Further debugging, after preserving the WorkUnit directory, found that the stdout file is empty. |
On Matrix, @Halytskyi told me that he's on EKS with Kubernetes 1.23.15. |
Disabling kube reconnection support works around the problem, which indicates that our new code change to receptor may have contributed to it. |
Hi. logs from awx-ee container:
|
I can confirm that disabling reconnect support fixed my job failures - I don't have anything that runs over 4 hours at this time. |
We dug further into this issue and found that
the k8s version detection code did not function correctly and incorrectly enabled the fix. |
Another problem that we found during the investigation:
does not contain the fix, since long log messages still have a timestamp inserted in the middle of the message
|
Hi @yuliym @iuvooneill, can you run a test for us?
In the job log, check whether there are any random timestamps in the messages. Also, please provide the output from |
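For anyone checking a job log for the symptom described above, the search can be automated. This is only a minimal sketch: it assumes the stray timestamps look like RFC 3339 timestamps (the format is not shown in this thread), and `find_stray_timestamps` is a hypothetical helper, not part of AWX or receptor.

```python
import re
from typing import List, Tuple

# Assumed format: RFC 3339 timestamp with optional fractional seconds,
# e.g. 2023-01-25T10:00:00.123456789Z. Adjust if your logs differ.
TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})"
)


def find_stray_timestamps(line: str) -> List[Tuple[int, str]]:
    """Return (offset, text) for each timestamp that is not at the start of the line."""
    return [
        (match.start(), match.group())
        for match in TIMESTAMP.finditer(line)
        if match.start() > 0  # a timestamp at offset 0 is a normal line prefix
    ]


def scan_log(lines: List[str]) -> List[Tuple[int, int, str]]:
    """Return (line_number, offset, timestamp) for every stray timestamp in a log."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for offset, text in find_stray_timestamps(line):
            hits.append((number, offset, text))
    return hits
```

Any hit is worth reporting here together with the surrounding line, since a timestamp in the middle of a message is the sign being asked about.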
I noticed something interesting on a fresh EKS cluster:
the k8s API server and kubelet versions do not match. |
The chatty_payload.yml playbook succeeded; however, some of our other jobs failed with Error. UPD: |
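Since the API server and the kubelets can run different versions, a version check that only looks at the API server can enable the reconnect behaviour on nodes that lack the fix. The sketch below shows one way to check every component. It is not receptor's detection code: the function names are hypothetical, and the values in `ILLUSTRATIVE_MINIMUMS` are placeholders that should be verified against the receptor documentation before use.

```python
import re
from typing import Dict, Iterable, Tuple

Version = Tuple[int, int, int]

# Placeholder thresholds (assumption, not taken from this thread):
# maps (major, minor) to the first patch release assumed to contain the fix.
ILLUSTRATIVE_MINIMUMS: Dict[Tuple[int, int], int] = {
    (1, 23): 14,
    (1, 24): 8,
    (1, 25): 4,
}


def parse_version(raw: str) -> Version:
    """Parse strings such as 'v1.23.15-eks-49d8fe8' into (1, 23, 15)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", raw)
    if match is None:
        raise ValueError(f"unrecognised version string: {raw!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def has_fix(version: Version, minimums: Dict[Tuple[int, int], int]) -> bool:
    """True if this version is at or above the minimum patch for its minor release."""
    major, minor, patch = version
    floor = minimums.get((major, minor))
    if floor is None:
        # Unlisted minor release: treat as fixed only if newer than every listed one.
        return (major, minor) > max(minimums)
    return patch >= floor


def reconnect_is_safe(
    api_server: str, kubelets: Iterable[str], minimums: Dict[Tuple[int, int], int]
) -> bool:
    """Require the fix on the API server AND on every kubelet, not just one of them."""
    versions = [parse_version(api_server)]
    versions.extend(parse_version(k) for k in kubelets)
    return all(has_fix(v, minimums) for v in versions)
```

The kubelet versions can be read from the VERSION column of `kubectl get nodes`, and the API server version from `kubectl version`.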
Hi @TheRealHaoLiu, I'm also facing the same issue as described by @Halytskyi. I'm running AWX on Kubernetes (GKE), installed using the AWX Operator. We have DB refresh jobs (< 4 hrs) that are erroring out without log output. Sleep command playbook
#Sleep for 10 mins
Partial output of the automation-job pod
AWX: v21.0.0 (installed using AWX Operator) |
For anyone else running into this, I was able to resolve the issue on our EKS cluster by adjusting the Default Instance Group Pod specification to ensure containers launched are assigned to particular Nodes by following the instructions here. Our EKS cluster has two Node Groups, one of which was on Kubernetes version Viewing which Nodes Pods were being assigned to by using I hope this helps others that are running into this issue with a similar setup. |
Unfortunately, that is not our case. We are running AWX on AKS and the versions are aligned
|
@TheRealHaoLiu I found another issue with reconnect partly related to #13161
|
Hi @TheRealHaoLiu, this issue is duplicated here. Let me know whether I can close it or should keep it open (the duplicate issue has the needs_triage label). |
@yuliym do you know if the retry-not-resetting problem only occurs for tasks that don't emit new stdout within the 5 minute period (i.e. if there is a sleep 650 task)? It feels like this line should be resetting the count back to 5 after a successful write: https://github.com/ansible/receptor/blob/4addde85f132cc555331041e9a6f7963519c542c/pkg/workceptor/kubernetes.go#L299 |
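To make the suggestion above concrete, here is a minimal sketch of a retry budget that is reset after every successful write. It is not the receptor code (which is Go): `stream_with_retries`, `read_chunk` and `write_chunk` are hypothetical names, and the use of `ConnectionError` for a transient failure is an assumption made for illustration.

```python
from typing import Callable


def stream_with_retries(
    read_chunk: Callable[[], bytes],
    write_chunk: Callable[[bytes], object],
    max_retries: int = 5,
) -> bool:
    """Copy chunks until end of stream, tolerating up to max_retries
    CONSECUTIVE read failures. Returns True on clean end of stream,
    False once the retry budget is exhausted.

    Assumed contract: read_chunk returns b"" at end of stream and raises
    ConnectionError on a transient failure.
    """
    retries_left = max_retries
    while True:
        try:
            chunk = read_chunk()
        except ConnectionError:
            if retries_left == 0:
                return False  # budget exhausted, give up
            retries_left -= 1
            continue
        if not chunk:
            return True  # end of stream
        write_chunk(chunk)
        # Progress was made, so earlier failures no longer count.
        retries_left = max_retries
```

Without the final reset, failures accumulate over the whole lifetime of the job, so a long-running job that reconnects successfully several times would still fail eventually.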
Please confirm the following
Bug Summary
On a clean deployment, I run the "Demo Job Template". The job fails with no output except "Finished":
In the "awx-ee" container logs I see:
"awx-task" log:
"/var/lib/awx/venv/awx/bin/receptorctl --socket /var/run/receptor/receptor.sock work list" output:
"/var/lib/awx/venv/awx/bin/receptorctl --socket /var/run/receptor/receptor.sock work results 4Pmqfss9" (empty output):
"/var/lib/awx/venv/awx/bin/receptorctl --socket /var/run/receptor/receptor.sock status":
Part of logs from "automation-job-6-njfz8":
"api/v2/jobs/6/":
It looks like the job runs without errors, but it doesn't return a successful status and is marked as failed.
P.S. I saw similar reports, but they mostly relate to jobs running >4h. In my case it happens within seconds.
Is there any way to debug this to find the issue?
AWX version
21.11.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Issue (with logs) described above.
Expected results
No errors, see output for job.
Actual results
Job failed without output (except "Finished").
Additional information
No response