
Jobs error at random in UI (with stripped log output) #13380

Open
4 of 9 tasks
pat-s opened this issue Dec 28, 2022 · 7 comments
Comments

@pat-s

pat-s commented Dec 28, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

For the past few days, some jobs have not been echoing their log output properly in the UI and error with Job terminated due to error.
Restarting sometimes works, sometimes not; it's like rolling a die.

Other jobs that produce a lot of output (e.g. output from the uri module) can only be run with no_log: true, as they face the same "output freeze" otherwise.

The job in the screenshot though already errors during "Gathering facts".

Important: I am aware of the log rotation issues on k8s such as #13376 and others.
The cluster already runs with containerLogMaxSize: "100Mi", hence I assume this is not the issue. It also would not explain errors that occur already during the "Gathering facts" step. Furthermore, AFAIK the log rotation issues have been addressed on the AWX side in the version we are using (besides the required log size increase in the kubelet).
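For context, the kubelet setting mentioned above lives in the node's KubeletConfiguration; a minimal sketch (only the log-rotation fields are shown, and containerLogMaxFiles is included merely for completeness):

```yaml
# KubeletConfiguration sketch; value mirrors the one quoted above.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "100Mi"   # kubelet default is 10Mi
containerLogMaxFiles: 5        # kubelet default, shown for completeness
```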

Important - 2: Jobs continue to run on the cluster and finish "normally" there, yet they are marked as "error" in AWX.

We've recently updated the cluster from 1.23 to 1.24 though I can't fully confirm that the issues started after this upgrade. What I can say is that we didn't have any issues when we were on k8s 1.23 + AWX 21.8.0.

Overall this is a really big issue for us, as deployments only succeed at random and a lot of manual time is needed. This is especially problematic for jobs within a workflow.

[Screenshot: job errored during the "Gathering facts" task]

AWX version

21.10.2

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

I can't provide a simple way to reproduce this.

Expected results

The job does not error and mirrors the output from the pod.

Actual results

The job is stuck during job output and reports a false-positive error.

Additional information

  • EKS 1.24
  • EE: Based on quay.io/ansible/ansible-runner:latest, ~ 2 weeks old

Happy to provide any additional information that would help solve this issue.

@K3ndu

K3ndu commented Dec 30, 2022

Same problem here after upgrading the Kubernetes cluster to 1.24.9. Updating to the latest AWX version didn't help either.

@pat-s

pat-s commented Jan 7, 2023

I might have found the reason: whenever I have gather_facts: true in a playbook, the log stalls. Fact gathering produces a large amount of log output, and this seems to cause the stalling; it might be either a single very long line or the total output volume that AWX is unable to handle.
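To check which of the two it is, one could scan a saved job log for oversized lines. This is a hypothetical diagnostic helper, not part of AWX; the threshold is an arbitrary assumption:

```python
# Hypothetical diagnostic: scan a job log (saved as text) for oversized
# lines, to test the theory that a single very long line stalls the UI.
import io


def longest_line_stats(log_text: str, threshold: int = 100_000):
    """Return (max_line_length, count_of_lines_over_threshold)."""
    max_len = 0
    over = 0
    for line in io.StringIO(log_text):
        n = len(line.rstrip("\n"))
        max_len = max(max_len, n)
        if n > threshold:
            over += 1
    return max_len, over


if __name__ == "__main__":
    sample = "short line\n" + "x" * 250_000 + "\nanother line\n"
    print(longest_line_stats(sample))  # (250000, 1)
```

If the maximum line length is modest but the total volume is huge, the volume theory would be the more likely one.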

Setting gather_facts: false and only gathering a subset of facts via

   - name: Collect facts
     ansible.builtin.setup:
       gather_subset:
         - 'system'
         - 'local'

helps to resolve the issue.

Yet we have another playbook which suffers from the same issue. This playbook returns a large log in one step that makes an API call via the uri module. This strengthens our assumption that a long log line, or too much log output within a certain duration, causes the AWX log to stall.
A workaround is to set no_log: true to prevent the job from erroring, but this is of course not a nice one, as you can't see what is happening and what is potentially changing.
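The workaround above, sketched as a task (the URL, options, and variable name are placeholders, not the actual playbook contents):

```yaml
# Sketch of the no_log workaround; url and register name are placeholders.
- name: Call API (output suppressed to avoid the log stall)
  ansible.builtin.uri:
    url: "https://example.org/api/endpoint"  # placeholder
    return_content: true
  register: api_result
  no_log: true   # keeps the large response body out of the job log
```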

PS: container-log-max-size is set to 500 MB in our kubelet config.

@MaxBidlingmaier

+1 here. The problem seems to have occurred after updating K8s here, too.

@yzhivkov

yzhivkov commented Feb 12, 2023

We have the same problem on AWS EKS 1.24 with all versions that we managed to test with - 21.7.0 and 21.11.0. AWX 21.7.0 was working alright on AWS EKS 1.20.

This ticket is a duplicate of #13469 and a solution to the problem is to add the following environment variable to your AWX Operator config:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  …
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: disabled

@pat-s

pat-s commented Feb 12, 2023

@yzhivkov Thanks for the link and help! I can confirm it fixes the issues we had.

This ticket is a duplicate of #13469

Isn't it rather the reverse, i.e. #13469 being a duplicate of this one? This ticket is considerably older.

@nan0viol3t

I can confirm that the receptor reconnect modification is working for me. Thank you for the working solution!

@emoshaya

How do I set RECEPTOR_KUBE_SUPPORT_RECONNECT to disabled for a custom pod spec?
