Job error without log #12297
Same for us.
+1. In the awx_task log there is just:
and nothing in stdout.
We can also experience this, or at least a similar issue, on AWX 21.1.0.
When capturing the log of the K8s job, the following is seen:
It seems AWX itself passes this function as cancel_callback: https://github.com/ansible/awx/blob/21.1.0/awx/main/tasks/callback.py#L177
As rc is not used of …
Found the cause (of our problem at least): the load balancer cut the connection, as no data was being transferred. Another cause can be the AWX setting … @ AWX maintainers reading this. Edit, additional info:
I know that this (awx/main/utils/update_model.py, line 43 at 29d6084)
However, I have still usually seen a traceback in the logs that happened as a side effect of this. It could be, as comments here suggest, that we get …
I think it's related to #11805. In the screenshot there is a 4 hour 7 minute difference between the start and the error of the playbook.
Issue still present in AWX 21.5.0.
I think I have seen this myself recently, but the cause is still unclear. It would be good if anyone could get receptorctl information after it happens: look up the work_unit from the API and inspect it in a shell. I've wondered if this might be hitting #12089, which still needs work to update. Even if that were happening, we would only get a very slightly better message from ansible-runner, possibly that the message isn't JSON, which actually could be the same root cause as the BadZipFile errors seen elsewhere.
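For anyone who hits this and wants to gather the receptorctl information mentioned above, a rough sketch of the inspection (the socket path and container names are assumptions; adjust them to your deployment):

```shell
# Inside the task pod (container name is an assumption):
kubectl exec -it deploy/awx -c awx-task -- \
  receptorctl --socket /var/run/receptor/receptor.sock status

# List work units, then fetch results for the unit referenced by the job's
# API detail page:
receptorctl --socket /var/run/receptor/receptor.sock work list
receptorctl --socket /var/run/receptor/receptor.sock work results <work_unit_id>
```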
I've lost the logs from the case I saw, but I believe it was due to running into the postgres max connection limit.
This issue started in our environment a couple of days back and is still persistent. Restarting the AWX pod worked a few days ago, but now it seems to make no difference: all jobs end in the ERROR state. Is any workaround known?
Hi, I don't know if it is the same issue, but I have two jobs that finished at the same time, with an error, no logs, and a short runtime. The job output is not complete. AWX 21.6.0. In the AWX logs there is nothing except "Job terminated due to error".
Nothing in the ansible-runner logs.
Hello, I'm having the same error. Some jobs are finishing with the error "Job terminated due to error", with incomplete output and no logs. AWX 21.8.0 on k8s.
I resolved my problem by increasing the pod's max log size. Edit the kubelet config:
Add or edit the containerLogMaxSize parameter and choose a new value (the default is 10Mi):
Restart kubelet for the change to take effect:
All failed jobs now work again.
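For reference, a minimal sketch of the kubelet change described above. The file path and the chosen size are assumptions (they vary by distribution); verify against your cluster:

```yaml
# /var/lib/kubelet/config.yaml (path varies by distribution)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 100Mi   # default is 10Mi
containerLogMaxFiles: 5
```

Then restart kubelet on each node, e.g. `systemctl restart kubelet`.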
Hello, we have the same issue (AWX 21.10.2 in Google Kubernetes Engine). Unfortunately, we cannot change containerLogMaxSize. Is there any other option? Thank you.
Hello. (I have no option to downgrade EKS/K8s, and downgrading AWX is also not a viable option, but I tested a few of the latest AWX versions and the results are the same.) I was looking at the reported issues and followed the "containerLogMaxSize" size increase, but that makes no change to the job output:
EXEC /bin/sh -c 'rm -f -r /home/runner/.ansible/tmp/ansible-tmp-1673484679.613122-71-189407760331966/ > /dev/null 2>&1 && sleep 0'
The job status is the same for multiple attempts, and I quote: "Job terminated due to error". In the task container log I can see only this: Job 3151 is the one that is failing, but it is also part of a workflow, so I get some more output for the failed workflow as well, though I consider that unrelated. I can't recall which version of AWX we had that was working fine, but I feel this is more likely the result of a K8s upgrade on top of the AWX upgrades (it was around 6 minor versions of AWX). Please advise on how I can proceed. Is there any test I can perform? I'll appreciate any guidance.
In addition to my previous comment: I added "no_log: true" to the failing job and it is now working, even though "containerLogMaxSize" has been set to 500MB for two weeks now. I will test a few more workflow runs with the "no_log" option over the next few days to see if this really resolves the problem.
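For anyone wanting to try the same mitigation: no_log can be set per task (or per play) to suppress output in the job events. A minimal sketch (the task and command are hypothetical placeholders):

```yaml
- name: Verbose task whose output may overflow the container log
  ansible.builtin.command: /usr/local/bin/generate-big-report  # hypothetical
  register: report
  no_log: true   # suppress this task's output in job events/logs
```

Note that no_log also hides the output from anyone debugging the job, so it is a diagnostic workaround rather than a fix.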
My problem was solved with the "Job Slicing: 5" parameter in the template's settings (God bless Telegram chats).
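Job slicing splits a single job template run into N parallel jobs, each handling a subset of the inventory, which also divides the log volume produced per job. A rough illustration of the partitioning idea in plain Python (not AWX's actual implementation):

```python
def slice_inventory(hosts, slice_count):
    """Distribute hosts round-robin across slice_count slices,
    mimicking how job slicing partitions an inventory."""
    slices = [[] for _ in range(slice_count)]
    for i, host in enumerate(hosts):
        slices[i % slice_count].append(host)
    return slices

hosts = [f"host{n:02d}" for n in range(1, 11)]  # 10 example hosts
for n, s in enumerate(slice_inventory(hosts, 5), start=1):
    print(f"slice {n}: {s}")
```

With 5 slices each sliced job sees only a fifth of the hosts, so each container's log stream stays smaller.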
Could you please share the chat URL? :)
The very good one: https://t.me/pro_ansible (just in case: the primary chat language is Russian).
Depending on what version of Kubernetes is used, you may run into this problem. Take a look at #13380 (comment) for a possible temporary solution until the fix is released with a later AWX version.
Related issue in receptor: ansible/receptor#736
@yzhivkov, thanks for pointing out the potential solution: it is working! I had been tracking multiple issues around this problem on the AWX GitHub for around a month already and was trying a patched receptor version (a dev one), but applying this receptor option as an AWX operator modification is easy to implement, and again, it is working. Thank you!
But this workaround will revert the fix for long-running jobs that result in an error after 4 hours, I believe.
Same for us.
Just want to share that we faced these kinds of job errors after some time, and the solution we found was to increase a few Linux kernel settings on our Kubernetes worker nodes:
It appears that AWX relies very heavily on the inotify API. I haven't seen many comments indicating that other folks were able to resolve their erroring-job issues with these kernel parameter adjustments, so I wanted to publicize this info. Personally I know very little about the inotify API, but I would be curious whether someone here knows if AWX might be rather wasteful with inotify watches and instances (especially instances). AWX version: 22.7.0, but the same issue was present in 22.3.0 and lower.
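The specific settings were lost from the comment above, but the usual inotify knobs look like the following sketch. The values are examples only; tune them to your node size:

```shell
# /etc/sysctl.d/99-inotify.conf on each worker node (example values)
fs.inotify.max_user_watches = 1048576
fs.inotify.max_user_instances = 8192
```

Apply without a reboot with `sysctl --system`, and check the current values with `sysctl fs.inotify`.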
Unfortunately, that didn't help in my case. :(
Hi all, has anyone solved the issue? I faced the same on AWX 22.3.0, and after upgrading to 23.0.0 I still have the same issue. :(
Same here
For me, the issue happens on many servers even when running playbooks against a single host, whereas until a few months ago we had playbooks running against hundreds of servers without being interrupted. Could anyone on the AWX team just tell us which version we should downgrade to until a stable one is available, since this seems to be an issue with the more recent versions?
Hi, same for us after upgrading k3s to 1.27.5 (AWX 21.14.0); before the k3s upgrade (from 1.24) we didn't have this problem. I have tried disabling reconnect, and the log max size is configured to 500Mi. In the logs I don't see the problem: the status of the job is successful, but the streaming is not complete, with "Job terminated due to error" in the events and an EOF error in awx-EE.
So in our case it looks like setting the kubelet args correctly in config.yaml (rke2) solved the issue.
Then you can check the rke2-agent service to see whether your arguments were passed correctly. Hope it helps.
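A sketch of the rke2 configuration described above (the size values are assumptions; pick ones that fit your workloads):

```yaml
# /etc/rancher/rke2/config.yaml
kubelet-arg:
  - "container-log-max-size=100Mi"
  - "container-log-max-files=5"
```

After editing, restart the agent (`systemctl restart rke2-agent`) and verify the flags were picked up, e.g. with `journalctl -u rke2-agent | grep container-log-max-size`.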
@KillJohn it is already configured in kubelet to 500, and the job does not exceed it.
So, in my case, AWX is built on top of AWS EKS. The problem appeared after an upgrade of the underlying EKS platform (Kubernetes 1.23 to 1.24). This was somehow related to Kubernetes' migration from Docker to containerd. There was no easy or possible way to upgrade to a newer Kubernetes version, as suggested on different threads...
but that doesn't solve the problem. It seems the issue was related to the AWX "receptor" and its log stream. I can't find the other git issue where it was discussed, but there was a potential solution in a developer patch to AWX/receptor. If any of you are still facing this issue with AWX built on top of Kubernetes, try upgrading k8s to at least 1.25.x or newer, as that could be a solution (it was in my case), together with the latest version of AWX (at least around March this year or newer). Apologies for the lack of details; it was solved more than 6 months ago in my case. Ah, and just as a side note: in my case, before the actual remediation, limiting log output in particular Ansible tasks (for example adding "no_log: true" as a countermeasure wherever I could) was helpful in getting jobs to finish. That served as a confirmation/PoC that any task generating text output was causing the job to fail (so I used multiple different hacks to limit text output to the minimum). This suggestion can be used to prove/confirm that the text output / log stream is causing the issue...
It would be best if any comments could reference a specific error message, or an overview of the situation explaining why you could not find a specific error message.
Architecturally, if the control node for a job is shut down, that job is lost. Depending on your deployment details, K8s can do that. This new message (about a "shutdown signal") is helpful: something shut down services while that job was running. You can 👍 the work to change that at #13848. There are a lot of different issues mixed into these comments. This type of stability is a big topic, and it is a priority for us. Keeping the discussion focused on the particular error mechanism will help get action.
Hi all!
Please confirm the following
Summary
Job finishes with error status without the complete log
AWX version
21.0.0
Select the relevant components
Installation method
kubernetes
Modifications
yes
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Run job
Expected results
Job finishes with error or success status, with the complete log
Actual results
Job ends in error status without the complete log
Additional information
Job Output
https://gist.github.com/ybucci/6db652089175de2feb8e1c28492d018f
Failed with error