Job error without log #12297
Same for us.
+1. In the awx_task log there is just:
and nothing in stdout.
We can also experience this, or at least a similar issue, on AWX 21.1.0.
When capturing the log of the K8s job, the following is seen:
It seems AWX itself passes this function as cancel_callback: https://github.com/ansible/awx/blob/21.1.0/awx/main/tasks/callback.py#L177
As rc is not used of …
Found the cause (of our problem at least): the load balancer cut the connection, as no data was being transferred. Another cause can be the AWX setting … @ AWX maintainers reading this. Edit, additional info:
I know that this (awx/main/utils/update_model.py, line 43 at 29d6084)
However, I have still usually seen a traceback in the logs that happened as a side effect of this. It could be, as comments here suggest, that we get …
I think it's related to #11805. In the screenshot there is a 4 hour 7 minute difference between the start and the error of the playbook.
Issue still present in AWX 21.5.0.
I think I have seen this myself recently, but the cause is still unclear. It would be good if anyone could get receptorctl information after it happens: look up the work_unit from the API and inspect it in a shell. I've wondered if this might be hitting #12089, which still needs work to update. Even if that were happening, we would only get a very slightly better message from ansible-runner, possibly that the message isn't JSON, which actually could be the same root cause as the BadZipFile errors seen elsewhere.
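For anyone who hits this and wants to gather the receptorctl information mentioned above, a rough sketch of the inspection (the socket path and container names are assumptions; adjust them to your deployment):

```shell
# Inside the task pod (container name is an assumption):
kubectl exec -it deploy/awx -c awx-task -- \
  receptorctl --socket /var/run/receptor/receptor.sock status

# List work units, then fetch results for the unit referenced by the job's
# API detail page:
receptorctl --socket /var/run/receptor/receptor.sock work list
receptorctl --socket /var/run/receptor/receptor.sock work results <work_unit_id>
```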
I've lost the logs from the case I saw, but I believe it was due to running into the postgres max connection limit.
This issue started in our environment a couple of days back and is still persistent. Restarting the AWX pod worked a few days ago, but now it seems to make no difference: all jobs end in the ERROR state. Is any workaround known?
Hi, I don't know if it is the same issue, but I have two jobs that finished at the same time, with an error, no logs, and a short runtime. The job output is not complete. AWX 21.6.0. In the AWX logs there is nothing except "Job terminated due to error".
Nothing in the ansible-runner logs.
Hello, I'm having the same error. Some jobs are finishing with the error "Job terminated due to error", with incomplete output and no logs. AWX 21.8.0 on k8s.
I resolved my problem by increasing the pod's max log size. Edit the kubelet config:
Add or edit the containerLogMaxSize parameter and choose a new value (the default is 10Mi):
Restart kubelet for the change to take effect:
All failed jobs now work again.
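For reference, a minimal sketch of the kubelet change described above. The file path and the chosen size are assumptions (they vary by distribution); verify against your cluster:

```yaml
# /var/lib/kubelet/config.yaml (path varies by distribution)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 100Mi   # default is 10Mi
containerLogMaxFiles: 5
```

Then restart kubelet on each node, e.g. `systemctl restart kubelet`.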
Hello, we have the same issue (AWX 21.10.2 in Google Kubernetes Engine). Unfortunately, we cannot change containerLogMaxSize. Is there any other option? Thank you.
Hello. (I have no option to downgrade EKS/K8s, and downgrading AWX is also not a viable option, but I tested a few of the latest AWX versions and the results are the same.) I was looking at the reported issues and followed the "containerLogMaxSize" size increase, but that makes no change to the job output:
EXEC /bin/sh -c 'rm -f -r /home/runner/.ansible/tmp/ansible-tmp-1673484679.613122-71-189407760331966/ > /dev/null 2>&1 && sleep 0'
The job status is the same for multiple attempts, and I quote: "Job terminated due to error". In the task container log I can see only this: Job 3151 is the one that is failing, but it is also part of a workflow, so I get some more output for the failed workflow as well, though I consider that unrelated. I can't recall which version of AWX we had that was working fine, but I feel this is more likely the result of a K8s upgrade on top of the AWX upgrades (it was around 6 minor versions of AWX). Please advise on how I can proceed. Is there any test I can perform? I'll appreciate any guidance.
In addition to my previous comment: I added "no_log: true" to the failing job and it is now working, even though "containerLogMaxSize" has been set to 500MB for two weeks now. I will test a few more workflow runs with the "no_log" option over the next few days to see if this really resolves the problem.
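For anyone wanting to try the same mitigation: no_log can be set per task (or per play) to suppress output in the job events. A minimal sketch (the task and command are hypothetical placeholders):

```yaml
- name: Verbose task whose output may overflow the container log
  ansible.builtin.command: /usr/local/bin/generate-big-report  # hypothetical
  register: report
  no_log: true   # suppress this task's output in job events/logs
```

Note that no_log also hides the output from anyone debugging the job, so it is a diagnostic workaround rather than a fix.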
My problem was solved with the "Job Slicing: 5" parameter in the template's settings (God bless Telegram chats).
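Job slicing splits a single job template run into N parallel jobs, each handling a subset of the inventory, which also divides the log volume produced per job. A rough illustration of the partitioning idea in plain Python (not AWX's actual implementation):

```python
def slice_inventory(hosts, slice_count):
    """Distribute hosts round-robin across slice_count slices,
    mimicking how job slicing partitions an inventory."""
    slices = [[] for _ in range(slice_count)]
    for i, host in enumerate(hosts):
        slices[i % slice_count].append(host)
    return slices

hosts = [f"host{n:02d}" for n in range(1, 11)]  # 10 example hosts
for n, s in enumerate(slice_inventory(hosts, 5), start=1):
    print(f"slice {n}: {s}")
```

With 5 slices each sliced job sees only a fifth of the hosts, so each container's log stream stays smaller.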
Could you please share the chat URL? :)
The very good one: https://t.me/pro_ansible (just in case: the primary chat language is Russian).
Depending on what version of Kubernetes is used, you may run into this problem. Take a look at #13380 (comment) for a possible temporary solution until the fix is released with a later AWX version.
Related issue in receptor: ansible/receptor#736
@yzhivkov, thanks for pointing out the potential solution: it is working! I had been tracking multiple issues around this problem on the AWX GitHub for around a month already and was trying a patched receptor version (a dev one), but applying this receptor option as an AWX operator modification is easy to implement, and again, it is working. Thank you!
But this workaround will revert the fix for long-running jobs that result in an error after 4 hours, I believe.
Same for us.
Just want to share that we faced these kinds of job errors after some time, and the solution we found was to increase a few Linux kernel settings on our Kubernetes worker nodes:
It appears that AWX relies very heavily on the inotify API. I haven't seen many comments indicating that other folks were able to resolve their erroring-job issues with these kernel parameter adjustments, so I wanted to publicize this info. Personally I know very little about the inotify API, but I would be curious whether someone here knows if AWX might be rather wasteful with inotify watches and instances (especially instances). AWX version: 22.7.0, but the same issue was present in 22.3.0 and lower.
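The specific settings were lost from the comment above, but the usual inotify knobs look like the following sketch. The values are examples only; tune them to your node size:

```shell
# /etc/sysctl.d/99-inotify.conf on each worker node (example values)
fs.inotify.max_user_watches = 1048576
fs.inotify.max_user_instances = 8192
```

Apply without a reboot with `sysctl --system`, and check the current values with `sysctl fs.inotify`.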
Unfortunately, that didn't help in my case. :(
Hi all, has anyone solved the issue? I faced the same on AWX 22.3.0, and after upgrading to 23.0.0 I still have the same issue. :(
Same here
For me, the issue happens on many servers even when running playbooks against a single host, whereas until a few months ago we had playbooks running against hundreds of servers without being interrupted. Could anyone on the AWX team just tell us which version we should downgrade to until a stable one is available, since this seems to be an issue with the more recent versions?
Hi, same for us after upgrading k3s to 1.27.5 (AWX 21.14.0); before the k3s upgrade (from 1.24) we didn't have this problem. I have tried disabling reconnect, and the log max size is configured to 500Mi. In the logs I don't see the problem: the status of the job is successful, but the streaming is not complete, with "Job terminated due to error" in the events and an EOF error in awx-EE.
So in our case it looks like setting the kubelet args correctly in config.yaml (rke2) solved the issue.
Then you can check the rke2-agent service to see whether your arguments were passed correctly. Hope it helps.
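A sketch of the rke2 configuration described above (the size values are assumptions; pick ones that fit your workloads):

```yaml
# /etc/rancher/rke2/config.yaml
kubelet-arg:
  - "container-log-max-size=100Mi"
  - "container-log-max-files=5"
```

After editing, restart the agent (`systemctl restart rke2-agent`) and verify the flags were picked up, e.g. with `journalctl -u rke2-agent | grep container-log-max-size`.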
@KillJohn it is already configured in kubelet to 500, and the job does not exceed it.
So, in my case, AWX is built on top of AWS EKS. The problem appeared after an upgrade of the underlying EKS platform (Kubernetes 1.23 to 1.24). This was somehow related to Kubernetes' migration from Docker to containerd. There was no easy or possible way to upgrade to a newer Kubernetes version, as suggested on different threads...
but that doesn't solve the problem. It seems the issue was related to the AWX "receptor" and its log stream. I can't find the other git issue where it was discussed, but there was a potential solution in a developer patch to AWX/receptor. If any of you are still facing this issue with AWX built on top of Kubernetes, try upgrading k8s to at least 1.25.x or newer, as that could be a solution (it was in my case), together with the latest version of AWX (at least around March this year or newer). Apologies for the lack of details; it was solved more than 6 months ago in my case. Ah, and just as a side note: in my case, before the actual remediation, limiting log output in particular Ansible tasks (for example adding "no_log: true" as a countermeasure wherever I could) was helpful in getting jobs to finish. That served as a confirmation/PoC that any task generating text output was causing the job to fail (so I used multiple different hacks to limit text output to the minimum). This suggestion can be used to prove/confirm that the text output / log stream is causing the issue...
It would be best if any comments could reference a specific error message, or an overview of the situation explaining why you could not find a specific error message.
Architecturally, if the control node for a job is shut down, that job is lost. Depending on your deployment details, K8s can do that. This new message (about a "shutdown signal") is helpful: something shut down services while that job was running. You can 👍 the work to change that at #13848. There are a lot of different issues mixed into these comments. This type of stability is a big topic, and it is a priority for us. Keeping the discussion focused on the particular error mechanism will help get action.
Hi all!
Please confirm the following
Summary
Job finishes with error status without the complete log
AWX version
21.0.0
Select the relevant components
Installation method
kubernetes
Modifications
yes
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Run job
Expected results
Job finishes with error or success status, with the complete log
Actual results
Job ends in error status without the complete log
Additional information
Job Output
https://gist.github.com/ybucci/6db652089175de2feb8e1c28492d018f
Failed with error