Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job error without log #12297

Open
5 of 6 tasks
ybucci opened this issue May 31, 2022 · 35 comments
Open
5 of 6 tasks

Job error without log #12297

ybucci opened this issue May 31, 2022 · 35 comments
Labels
component:api component:ui needs_refinement Apply to items in 'Backlog' that need refinement. type:bug

Comments

@ybucci
Copy link

ybucci commented May 31, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Summary

Job finishs with status error without all complete log

AWX version

21.0.0

Select the relevant components

  • UI
  • API
  • Docs

Installation method

kubernetes

Modifications

yes

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Run job

Expected results

Finish job with error or success with complete log

Actual results

Job with error status without complete log

Additional information

Job Output

https://gist.github.com/ybucci/6db652089175de2feb8e1c28492d018f

Failed with error

image

@Tioborto
Copy link
Contributor

Same for us

@adpavlov
Copy link

adpavlov commented Jun 2, 2022

+1 in awx_task log just:

awx.main.dispatch job 1216 (error) encountered an error (rc=None), please see task stdout for details.

and nothing in stdout

@moonrail
Copy link

moonrail commented Jun 9, 2022

We can also experience this or at least a similiar issue on AWX 21.1.0.
It happens on Inventory Updates and regular Playbook runs.
It happens with different Ansible Core versions (2.10, 2.11 and 2.12) based completely on the official ansible-runner Execution Environments.
We cannot find the common thread on to why though, as these Jobs just fail without any interaction.
API reports for these jobs e.g.:

    "status": "error",
    "execution_environment": 5,
    "failed": true,
    "started": "2022-06-09T12:27:22.691919Z",
    "finished": "2022-06-09T12:28:43.090729Z",
    "canceled_on": null,
    "elapsed": 80.399,
(...)
    "job_explanation": "Job terminated due to error",`

When capturing the log of the K8s job, the following is seen: {"status": "canceled", "runner_ident": "2313176"}
ansible-runner also outputs a base64 encoded ZIP-file. When investigating its contents, following can be found:

  • status: canceled
  • rc: 254
  • stderr is empty
  • stdout shows all stdout also visible in AWX UI Job log, so no discrepancies there

This seems like ansible-runner receives a True as result from cancel_callback:
https://github.com/ansible/ansible-runner/blob/2.2.0/ansible_runner/runner.py#L321

AWX itself passes this function as cancel_callback: https://github.com/ansible/awx/blob/21.1.0/awx/main/tasks/callback.py#L177

task log contains something ambigious like: 2022-06-09 12:28:43,413 WARNING [ab304382f64e4f8eb5cfb1b9e0a3ec48] awx.main.dispatch job 2313176 (error) encountered an error (rc=None), please see task stdout for details.
The rc is missing in this message, even though we can confirm it being returned by ansible-runner.
Also the error raised is of type TaskError based on the message format and wording:
https://github.com/ansible/awx/blob/21.1.0/awx/main/exceptions.py#L22
https://github.com/ansible/awx/blob/21.1.0/awx/main/tasks/jobs.py#L588

As rc is not used of ansible-runner, TaskError is used instead of TaskCancel and Job status is error there seems to be a bug e.g. in AWXReceptorJob that causes a cancelling of ansible-runner but an error in AWXReceptorJob.

@moonrail
Copy link

moonrail commented Jun 9, 2022

Found the cause (to our problem at least):
Connections were terminated by the Loadbalancer before the K8s cluster, AWX was launching pods in.
We have AWX run in Cluster A while Jobs are being run in Cluster B.

The Loadbalancer cut the connection, as no data was being transfered.
After increasing the timeouts for client & server to (for our use-case) acceptable values, Jobs were completed.

Another cause can be the AWX setting DEFAULT_JOB_IDLE_TIMEOUT that is passed to ansible-runner as idle_timeout.
If it is 0, ansible-runner sets a default of 600 seconds:
https://ansible-runner.readthedocs.io/en/stable/intro/?highlight=idle_timeout#env-settings-settings-for-runner-itself
We've increased this also and now our Jobs are running just fine.

@ AWX Maintainers reading this
I would not call this issue to be solved yet, because the main issue is, that ansible-runner - whysoever - thinks its being canceled and AWX just errors out without any explanation.
As connection issues can arise from a multitude of causes, AWX should catch and explain them to its users to avoid headaches.
Also a keepalive between receptor and ansible-runner could help here.

Edit, additional info:
Our instance of this issue came up, when no data was being transferred by ansible-runner due to modules running long before returning output. In our case the Loadbalancer has an idle timeout for client & server and both have to be longer than our expected runtime of these Ansible modules. It can of course be replicated by a ansible.builtin.command with sleep 3600 or something similiar.

@AlanCoding
Copy link
Member

I know that this update_model method used in the cancel callback has an annoying behavior of returning None when a database error happens and doesn't recover.

logger.error('Failed to update %s after %d retries.', model._meta.object_name, _attempt)

However, I have still usually seen a traceback in the logs that happened as a side effect of this. In awx.main.tasks.jobs, it will try to do something with self.instance, and it'll give some kind of error because that is None. This handling isn't ideal, it would be preferable to state that the database connection was lost. However, I don't see any documented evidence of this here.

It could be, as comments here suggest, that we get CallbackError from ansible-runner, and in that case I'm not sure how it would play out.

@fosterseth
Copy link
Member

@ybucci Hi, did @moonrail 's suggestion help solve this bug for you?

@adpavlov
Copy link

I think it's related to #11805. In screenshot theres is 4 hours and 7 minutes difference between start and error of playbook

@adpavlov
Copy link

Issue still present in AWX 21.5.0

@AlanCoding
Copy link
Member

I think I have seen this myself recently, but still unclear of cause. It would be good if anyone could get receptorctl information after it happens, look up the work_unit from the API and in shell see receptorctl work status <unit_id>, which may require flipping the setting to not release work. That could be hard.

I've wondered if this might be hitting #12089, which still needs work to update. Even if this were happening, we would only get a very slightly better message from ansible-runner, possibly that the message isn't JSON... which actually could be the same root cause as the BadZipFile errors seen elsewhere.

@AlanCoding
Copy link
Member

I've lost the logs from the case I saw, but I believe it was due to running into the postgres max connection limit.

@lals1
Copy link

lals1 commented Sep 28, 2022

This issue started in our environment since a couple of day back and still persistent. Restarting AWX pod did work few days ago but now it seems to not making any difference as all jobs are resulting in ERROR state. Is this any workaround known?

@chris93111
Copy link
Contributor

chris93111 commented Oct 27, 2022

Hi i don't know if is the same issue , but i have 2 jobs finish in same time, with error and no logs, and short time

The output job is not complete

AWX 21.6.0

In awx logs they are nothing expect Job terminated due to error

2022-10-27T02:01:01.62302983+02:00 stderr F 2022-10-27 00:01:01,622 WARNING  [2e34bef877db4d96bd5915404082b806] awx.main.dispatch job 671164 (error) encountered an error (rc=None), please see task stdout for details.
2022-10-27T02:01:01.627768355+02:00 stderr F 2022-10-27 00:01:01,627 DEBUG    [2e34bef877db4d96bd5915404082b806] awx.main.dispatch task add05b22-34f2-4989-9b44-cf7df76e5889 starting awx.main.tasks.system.handle_work_error(*['add05b22-34f2-4989-9b44-cf7df76e5889'])
2022-10-27T02:01:01.627797983+02:00 stderr F 2022-10-27 00:01:01,627 DEBUG    [2e34bef877db4d96bd5915404082b806] awx.main.tasks.system Executing error task id add05b22-34f2-4989-9b44-cf7df76e5889, subtasks: [{'type': 'job', 'id': 671164}]
2022-10-27T02:01:01.651078816+02:00 stderr F 2022-10-27 00:01:01,650 DEBUG    [2e34bef877db4d96bd5915404082b806] awx.main.dispatch task add05b22-34f2-4989-9b44-cf7df76e5889 starting awx.main.tasks.system.handle_work_success(*[])

Nothing in ansible-runner logs

Capture d’écran 2022-10-27 à 14 30 34

Capture d’écran 2022-10-27 à 14 35 29

@lbr38
Copy link

lbr38 commented Nov 10, 2022

Hello

I'm having the same error. Some jobs are finishing with error "Job terminated due to error" with incomplete output and no logs.

AWX 21.8.0 on k8s
CentOS7

@lbr38
Copy link

lbr38 commented Nov 16, 2022

Hello

I'm having the same error. Some jobs are finishing with error "Job terminated due to error" with incomplete output and no logs.

AWX 21.8.0 on k8s CentOS7

I resolved my problem by increasing pod's max log size.

Edit kubelet config:

vim /var/lib/kubelet/config.yaml

Add or edit containerLogMaxSize param and choose a new value (default is 10Mi):

containerLogMaxSize: 100Mi

Restart kubelet to take into account:

systemctl restart kubelet

All failed jobs now work again.

@rsindar
Copy link

rsindar commented Jan 9, 2023

Hello,

We have the same issue (AWX 21.10.2 in Google Kubernetes Engine). Unfortunately, we cannot change the containerLogMaxSize. Is there any other option?

Thank you

@nan0viol3t
Copy link

nan0viol3t commented Jan 12, 2023

Hello,
I'm also facing problem with job logging.
Issue kick in after upgrade of Kubernetes (AWS EKS Platform) with combination of AWX Updates.
Currently on latest AWX (multiple version tested past last month) and k8 is 1.24
Its probably due to K8 switch to "containerd".

(I have no option to downgrade EKS/K8 and downgrading AWX is also not an viable option, but I tested few latest version of AWX and results are the same)

I was looking at reported issues and followed ""containerLogMaxSize"" size increase but that does not make any changes to job output.
I'm pretty sure that my action is failing due to AWS/botocore issues but I have no indication of it from AWX logs.
Output I get in web UI on 0 verbosity is just blank.
With verbosity set to 3 - I see some job details but it ends on:

" EXEC /bin/sh -c 'rm -f -r /home/runner/.ansible/tmp/ansible-tmp-1673484679.613122-71-189407760331966/ > /dev/null 2>&1 && sleep 0'"

Job status is same for multiple attempts, and I quote -> "Job terminated due to error"

In the task container log I can see only this:
2023-01-12T00:51:29.225191663Z stderr F 2023-01-12 00:51:29,224 INFO [4ea4c01a8d0c4b8b952845d19f698f0f] awx.analytics.job_lifecycle job-3151 finalize run
2023-01-12T00:51:29.255313496Z stderr F 2023-01-12 00:51:29,254 WARNING [4ea4c01a8d0c4b8b952845d19f698f0f] awx.main.dispatch job 3151 (error) encountered an error (rc=None), please see task stdout for details.

Job 3151 - is the one which is failing but its also part of workflow, so I get some more output for failed workflow as well, but consider as not related.
Also, same workflow is working on my really old prod instance (10.0.0) which is kept for such events/fails as I'm facing now.

I cant recall now which version of AWX we had, which was working fine, but I feel this is more likely to be the result of K8 upgrade on top of AWX upgrades (it was around 6 minor versions of AWX).
After upgrade, I had problems with passing "containerLogMaxSize" to kubelet configuration and for now its done manually on worker node, by editing kubelet systemd policy and kubelet restart.
Systemctl status is showing that kubelet has new setting for "containerLogMaxSize" , but it takes no effect in AWX/job output.

Please advise on how I can proceed, Any test that I can perform? I'll appreciate any guidance.

@nan0viol3t
Copy link

nan0viol3t commented Jan 12, 2023

In addition to previous comment,

I added "no_log: true" to failing job and it is now working, besides "containerLogMaxSize" was set to 500MB for two weeks now.
So, it seems it is related to amount of logs but not on the container storage level.
Not sure, but still I can be of service in case of any tests you would want me to perform.

I will test few more workflow runs with "no_log" option for next few days to see if this is really resolution of the problem.

@rsindar
Copy link

rsindar commented Jan 13, 2023

My problem was solved with "Job Slicing: 5" parameter in the template's settings (God bless Telegram chats)

@adpavlov
Copy link

My problem was solved with "Job Slicing: 5" partameter in the template's settings (God bless Telegram chats)

Could you please charge chat url?:)

@rsindar
Copy link

rsindar commented Jan 13, 2023

the very good one - https://t.me/pro_ansible

*just in case - the primary chat language is russian

@yzhivkov
Copy link

Depending on what version of Kubernetes is used you may run into this problem. Take a look at #13380 (comment) for a possible temporary solution until the fix is released with a later AWX version

@AlanCoding
Copy link
Member

Related issue in receptor: ansible/receptor#736

@nan0viol3t
Copy link

@yzhivkov - Thanks for pointing potential solution - it is working! I was tracking multiple issues around this problem on AWX github for around month already and was trying receptor patched version (dev one), but having this specified receptor option as AWX operator modification is easy to implement and again, it is working. Thank you!

@adpavlov
Copy link

Depending on what version of Kubernetes is used you may run into this problem. Take a look at #13380 (comment) for a possible temporary solution until the fix is released with a later AWX version

But this WA will revert the fix for long running jobs that result in error after 4 hours I believe

@EsDmitrii
Copy link

EsDmitrii commented Jul 27, 2023

same for us
awx ver 22.3.0

@jrgoldfinemiddleton
Copy link

jrgoldfinemiddleton commented Sep 8, 2023

Just want to share that we faced these kinds of job error situations after some time and the solution we found was to increase a few Linux kernel settings on our Kubernetes worker nodes:

fs.inotify.max_user_instances = 81920
fs.inotify.max_user_watches = 819200

It appears that AWX relies very heavily on the inotify API. Haven't seen a lot of comments indicating that other folks were able to resolve their erroring job issues with these kernel param adjustments so wanted to publicize this info.

Personally I know very little about the inotify API but would be curious if someone here knows whether AWX might perhaps be rather wasteful with inotify watches and instances (especially instances).

AWX version: 22.7.0, but same issue was present in 22.3.0 and lower.

@EsDmitrii
Copy link

EsDmitrii commented Sep 11, 2023

**jrgoldfinemiddleton ** commented Sep 8, 2023

Unfortunately not helped in my case(
I upgraded to 22.7.0 earlier
Tried to set params you mentioned.
In my case default params were fs.inotify.max_user_instances = 128 and fs.inotify.max_user_watches = 16777216, I set fs.inotify.max_user_instances = 81920 and fs.inotify.max_user_watches = 167772160
Jobs still fail with these params (checked that it applied)

@KillJohn
Copy link

Hi all, does some of you solve the issue? I faced the same in awx 22.3.0. currently upgraded to 23.0.0 and have the same issue. :(

@fivetran-joliveira
Copy link

Hi all, does some of you solve the issue? I faced the same in awx 22.3.0. currently upgraded to 23.0.0 and have the same issue. :(

Same here

For me, after upgrading from 22.3.0 to 22.7.0 the error on /api/v2/jobs/

From Job terminated due to error to The running ansible process received a shutdown signal

For me even running playbooks against a single host the issue happens for many servers, whereas until few months ago we had playbooks running against hundreds of servers without being interrupted.

Could anyone on AWX team just tell us which version we should downgrade to until a stable one is available, since this seems to be an issue with more recent versions?

@chris93111
Copy link
Contributor

chris93111 commented Sep 18, 2023

Hi same for us after upgrade k3s to 1.27.5 , awx 21.14.0 , before upgrade k3s (1.24) we don't have this problem, i have try to disable reconnect, the log mag size is configured to 500Mi

On logs i don't see the problem , the status of the job is successful, but streaming is not complete , Job terminated due to error in events, and EOF error in awx-EE

@KillJohn
Copy link

KillJohn commented Sep 18, 2023

So in our case it looks like setup kubelet-args in config.yaml (rke2) correctly solve the issue.

kubelet-arg:
  - container-log-max-files=9
  - container-log-max-size=50Mi
 or kubelet-arg: [ container-log-max-files=9, container-log-max-size=50Mi ]

then you can check rke2-agent service if your arguments has been passed correctly. hope it helps.

@chris93111
Copy link
Contributor

@KillJohn is already configured in kubelet to 500, the job do not exceed

@nan0viol3t
Copy link

nan0viol3t commented Sep 19, 2023

So, in my case, AWX is built on top of AWS EKS. Problem appeared after upgrade of underlining EKS platform (Kubernetes from 1.23 to 1.24). This was somehow related to Kubernetes migration from Docker to ContainerD. There was no easy/possible way to upgrade to newer Kubernetes version, as suggested on different threads...
Of course I added kubelet arges on worker nodes:

  • container-log-max-files=9
  • container-log-max-size=50Mi

but that doesnt solve the problem.

It seems the issue was related to AWX "receptor" and its log stream. I cant find the other git issue where it was discussed but there was potential solution in a developer patch to AWX/receptor.
This was true for AWX version from around February 2023.
Eventually new Kubernetes and new AWX version (with patched receptor) appear around late Feb, and upgrades did solve my problem. It seems its related to log streaming of AWX but its also platform dependent.

If any of you who still facing this issue, built AWX on top of Kubernetes, try upgrade k8s to at least 1.25.x or newer, as that could be a solution (it was in my case), including latest version of AWX (at least around March this year or newer).

Apologize for lack of details, it was solved more then 6 months ago in my case.
Still on AWX, all working fine...

Ahh and just as a side note, in my case, before actual remediation/resolution, limiting log output in particular Ansible tasks, like adding "no log = true" as example countermeasure whenever I could, was helpful to finish jobs - just as a confirmation/poc, that any task which was generating text output was causing job to fail (so I used multiple/different hacks to limit/strict to minimum text output). This suggestion can be used to prove/confirm, that text output/ logs stream are causing this issue...

@AlanCoding
Copy link
Member

It would be best if any comments could reference a specific error message, or an overview of the situation explaining why you could not find a specific error message.

From Job terminated due to error to The running ansible process received a shutdown signal

Architecturally, if the control node for a job is shut down, that job is lost. Depending on your deployment details, K8S can do that. This new message (about "shutdown signal") is helpful - something shut down services while that job was running. You can 👍 work to change that at #13848

There are a lot of different issues mixed in these comments. This type of stability is a big topic, and it is a priority for us. Keeping the discussion focused to the particular error mechanism will help get action.

@EsDmitrii
Copy link

Hi all!
Sorry for the late response.
My team and me divided huge jobs to a several smaller ones
Now all works well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:api component:ui needs_refinement Apply to items in 'Backlog' that need refinement. type:bug
Projects
None yet
Development

No branches or pull requests