
Jobs error at random in UI (with stripped log output) #13380

Open
4 of 9 tasks
pat-s opened this issue Dec 28, 2022 · 7 comments
Comments

@pat-s

pat-s commented Dec 28, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

For the past few days, some jobs have not been echoing their log output properly in the UI and error with Job terminated due to error.
Restarting sometimes works, sometimes not; it's like rolling a die.

Other jobs that produce a lot of output (e.g. output from the uri module) can only be run with no_log: true, as they face the same "output freeze" otherwise.

The job in the screenshot though already errors during "Gathering facts".

Important: I am aware of the log rotation issues on k8s such as #13376 and others.
The cluster already runs with containerLogMaxSize: "100Mi", hence I assume this is not the issue. It also would not explain errors that occur already during the "Gathering facts" step. Furthermore, AFAIK the log rotation issues have been addressed on the AWX side in the version we are using (besides the required log size increase in the kubelet).
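For context, the kubelet setting mentioned above lives in the node's KubeletConfiguration; a minimal sketch (only the log-rotation fields are shown, and containerLogMaxFiles is included merely for completeness):

```yaml
# KubeletConfiguration sketch; value mirrors the one quoted above.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "100Mi"   # kubelet default is 10Mi
containerLogMaxFiles: 5        # kubelet default, shown for completeness
```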

Important - 2: Jobs continue to run on the cluster and finish "normally" there, yet they are marked as "error" in AWX.

We've recently updated the cluster from 1.23 to 1.24 though I can't fully confirm that the issues started after this upgrade. What I can say is that we didn't have any issues when we were on k8s 1.23 + AWX 21.8.0.

Overall this is a really big issue for us, as deployments only succeed at random and a lot of manual time is needed. This is especially problematic for jobs within a workflow.

[Screenshot: job errored during the "Gathering facts" task]

AWX version

21.10.2

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

I can't provide a simple way to reproduce this.

Expected results

The job does not error and mirrors the output from the pod.

Actual results

The job is stuck during job output and reports a false-positive error.

Additional information

  • EKS 1.24
  • EE: Based on quay.io/ansible/ansible-runner:latest, ~ 2 weeks old

Happy to provide any additional information that would help solve this issue.

@K3ndu

K3ndu commented Dec 30, 2022

Same problem here after upgrading the Kubernetes cluster to 1.24.9. Updating to the latest AWX version didn't help either.

@pat-s

pat-s commented Jan 7, 2023

I might have found the reason: whenever I have gather_facts: true in a playbook, the log stalls. Fact gathering produces a large amount of log output, and this seems to cause the stalling; it might be either a single very long line or the total output volume that AWX is unable to handle.
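To check which of the two it is, one could scan a saved job log for oversized lines. This is a hypothetical diagnostic helper, not part of AWX; the threshold is an arbitrary assumption:

```python
# Hypothetical diagnostic: scan a job log (saved as text) for oversized
# lines, to test the theory that a single very long line stalls the UI.
import io


def longest_line_stats(log_text: str, threshold: int = 100_000):
    """Return (max_line_length, count_of_lines_over_threshold)."""
    max_len = 0
    over = 0
    for line in io.StringIO(log_text):
        n = len(line.rstrip("\n"))
        max_len = max(max_len, n)
        if n > threshold:
            over += 1
    return max_len, over


if __name__ == "__main__":
    sample = "short line\n" + "x" * 250_000 + "\nanother line\n"
    print(longest_line_stats(sample))  # (250000, 1)
```

If the maximum line length is modest but the total volume is huge, the volume theory would be the more likely one.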

Setting gather_facts: false and only gathering a subset of facts via

   - name: Collect facts
     ansible.builtin.setup:
       gather_subset:
         - 'system'
         - 'local'

helps to resolve the issue.

Yet we have another playbook which suffers from the same issue. This playbook returns a large log in one step that makes an API call via the uri module. This strengthens our assumption that a long log line, or too much log output within a certain duration, causes the AWX log to stall.
A workaround is to set no_log: true to prevent the job from erroring, but this is of course not a nice one, as you can't see what is happening and what is potentially changing.
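The workaround above, sketched as a task (the URL, options, and variable name are placeholders, not the actual playbook contents):

```yaml
# Sketch of the no_log workaround; url and register name are placeholders.
- name: Call API (output suppressed to avoid the log stall)
  ansible.builtin.uri:
    url: "https://example.org/api/endpoint"  # placeholder
    return_content: true
  register: api_result
  no_log: true   # keeps the large response body out of the job log
```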

PS: container-log-max-size is set to 500 MB in our kubelet config.

@MaxBidlingmaier

+1 here. The problem seems to have occurred after updating K8s here, too.

@yzhivkov

yzhivkov commented Feb 12, 2023

We have the same problem on AWS EKS 1.24 with all versions that we managed to test with - 21.7.0 and 21.11.0. AWX 21.7.0 was working alright on AWS EKS 1.20.

This ticket is a duplicate of #13469 and a solution to the problem is to add the following environment variable to your AWX Operator config:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  …
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: disabled

@pat-s

pat-s commented Feb 12, 2023

@yzhivkov Thanks for the link and help! I can confirm it fixes the issues we had.

This ticket is a duplicate of #13469

Isn't it rather the reverse, i.e. #13469 being a duplicate of this one? This ticket is considerably older.

@nan0viol3t

I can confirm that the receptor reconnect modification is working for me. Thank you for the working solution!

@emoshaya

How do I set RECEPTOR_KUBE_SUPPORT_RECONNECT to disabled for a custom pod spec?
