k3s max container log size leading to job error #14057
Comments
What k8s cluster are you using (EKS, kind, k3s, etc.)? What is your AWX spec file? How many job events get produced before the job errors out? Is it consistent for each job failure?
k3s 1.24.4
It's stock 22.3.0; I don't really know what you mean by "awx spec file", could you elaborate?
It seems to die at around 4300 lines fairly consistently. Could you try reproducing it in your environment? Here's the playbook if you just want to clone it.
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  hostname: redacted
  service_type: ClusterIP
  admin_user: admin
  admin_email: redacted
  admin_password_secret: redacted
Reproducing this in a brand new environment (same K3s and AWX versions and everything is "stock"):
Same simple playbook: I set it to count to 6k and it only got to about 4300 before "Error: Job terminated due to error".
Can you check the kubelet container-log-max-size configuration for your k3s cluster? One possibility is that the job's container log exceeds that size and triggers log rotation in Kubernetes, in which case you will need to enable RECEPTOR_KUBE_SUPPORT_RECONNECT, described in
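For reference, k3s passes kubelet flags through its own configuration. A minimal sketch of overriding the log rotation settings, assuming the default k3s config file location and using an arbitrary illustrative size (neither value comes from this thread):

```yaml
# /etc/rancher/k3s/config.yaml (assumed default k3s config path)
kubelet-arg:
  # Illustrative value only; the upstream kubelet default is 10Mi.
  - "container-log-max-size=50Mi"
  # Number of rotated log files kept per container (upstream default is 5).
  - "container-log-max-files=5"
```

Raising the limit would only postpone rotation for a sufficiently long job, so it is a way to test the theory rather than a fix. k3s needs a restart for the change to take effect.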
Back-of-napkin calculation: each run of by default, the kubelet container-log-max-size is 10MiB according to https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/. Unless k3s configures the kubelet differently and overrides container-log-max-size, this playbook seems like it WOULD run into log rotation and would require
I think it's I'll try
Can you please update the name of the issue to reflect the actual problem? "Task was destroyed but it is pending" seems to be coming from the asyncio library, which is used in the wsrelay component (websocket), so it is unlikely to be related.
I need to bump k3s to try the Receptor argument: https://github.com/ansible/receptor/pull/683/files#diff-792611baeb730234abface9c9bc33204ea86453469e7fcdfbe8de8d4e04f2598R659 I'll report back once I've done that and tested.
I think this issue is resolved; the 16k-line run is complete (both in the UI and the downloaded logfile).
❯ wc -l ~/Downloads/job_36.txt
16022 /Users/mmercado/Downloads/job_36.txt
The job took nearly 6h, which makes me feel good as well. To summarize, I bumped k3s to 1.24.14 (for the Receptor argument) and am now running
Lastly, I added:
ee_extra_env: |
  - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
    value: enabled
to the AWX spec.
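Put together with the spec posted earlier in the thread, the result would look roughly like the following. This is a sketch that simply merges the two fragments from this issue; the redacted values are placeholders as in the original.

```yaml
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  hostname: redacted
  service_type: ClusterIP
  admin_user: admin
  admin_email: redacted
  admin_password_secret: redacted
  # ee_extra_env is written as a block scalar (note the "|"), so the
  # list below is passed as a string, as in the comment above.
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```

As noted above, this setting was only tried after bumping k3s to 1.24.14.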
If I run with RECEPTOR_KUBE_SUPPORT_RECONNECT enabled, it does reconnect, but it seems to "lose" content. For the reproducer by @mamercad, I see in the log of a running job:
Maybe related: #14158
Details of the two executions:
There are ~140 minutes between those two during which the logs did not make it into AWX.
Bug Summary
Seeing "Job terminated due to error" in the UI on a very simple playbook (below).
In the UI it looks like this:
In the logs around the end of this job I see this:
In the UI job output it looks like this (note that this should count to 8000):
Here's the playbook:
AWX version
22.3.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
Whatever the AWX 22.3.0 EE is running
Operating system
Kubernetes 1.24.4
Web browser
Chrome
Steps to reproduce
Running this very simple playbook; it should be easy to reproduce.
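The playbook itself is not shown here. As a rough idea of the kind of reproducer described, a minimal sketch might look like the following; this is a hypothetical playbook, not the one from the report, and the variable name `count_to` and the use of `debug` in a loop are assumptions.

```yaml
---
# Hypothetical reproducer; NOT the playbook from the original report.
- name: Count to a large number to generate a lot of job output
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    count_to: 8000  # assumption: mirrors the "count to 8000" run described above
  tasks:
    - name: Print each number
      ansible.builtin.debug:
        msg: "{{ item }}"
      loop: "{{ range(1, count_to + 1) | list }}"
```

Any playbook that emits enough output to push the job pod's container log past the kubelet rotation threshold should presumably behave the same way.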
Expected results
That the playbook completes without error.
Actual results
The playbook ends with "Job terminated due to error".
Additional information
I'd urge that this be triaged with high priority; as it stands, it would be impossible to justify running this in production. The reason I'm running such a trivial playbook is that we have some non-trivial ones in production that run for a long time.