Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

k3s max container log size leading to job error #14057

Closed
5 of 11 tasks
mamercad opened this issue May 27, 2023 · 15 comments
Closed
5 of 11 tasks

k3s max container log size leading to job error #14057

mamercad opened this issue May 27, 2023 · 15 comments

Comments

@mamercad
Copy link
Contributor

mamercad commented May 27, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

Seeing "Job terminated due to error" in the UI on a very simple playbook (below).

In the UI it looks like this:

image

In the logs around the end of this job I see this:

awx-task 2023-05-27 14:52:07,144 INFO     [2d7014133ee34ad9b12564981b1764b2] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 93
awx-task 2023-05-27 14:52:07,207 INFO     [2d7014133ee34ad9b12564981b1764b2] awx.analytics.job_lifecycle job-93 post run
awx-task 2023-05-27 14:52:07,357 INFO     [2d7014133ee34ad9b12564981b1764b2] awx.analytics.job_lifecycle job-93 finalize run
awx-task 2023-05-27 14:52:07,440 WARNING  [2d7014133ee34ad9b12564981b1764b2] awx.main.dispatch job 93 (error) encountered an error (rc=None), please see task stdout for details.
awx-task Task was destroyed but it is pending!
awx-rsyslog 2023-05-20 18:23:01,640 INFO success: awx-rsyslog-configurer entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
awx-rsyslog 2023-05-20 18:23:01,640 INFO success: awx-rsyslog-configurer entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
awx-rsyslog 2023-05-20 18:23:03,643 INFO success: awx-rsyslogd entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
awx-rsyslog 2023-05-20 18:23:03,643 INFO success: awx-rsyslogd entered RUNNING state, process has stayed up for > than 30 seconds (startsecs)
awx-task task: <Task pending name='Task-1076076' coro=<RedisConnection._read_data() done, defined at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:18
0> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f7d55077520>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /var/lib/awx/venv/awx/lib64/python3.9/sit
e-packages/aioredis/connection.py:168]>
awx-task Task was destroyed but it is pending!
awx-task task: <Task pending name='Task-1080753' coro=<RedisConnection._read_data() done, defined at /var/lib/awx/venv/awx/lib64/python3.9/site-packages/aioredis/connection.py:18
0> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f7d54f9bb80>()]> cb=[RedisConnection.__init__.<locals>.<lambda>() at /var/lib/awx/venv/awx/lib64/python3.9/sit
e-packages/aioredis/connection.py:168]>
awx-task Task was destroyed but it is pending!

In the UI job output it looks like this (note that this should count to 8000):

image

Here's the playbook:

---
- name: long_log_lines
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    how_many: 8000
    delay_secs: 1

  tasks:
    - name: Create a list
      ansible.builtin.command:
        cmd: seq {{ how_many }}
      register: the_list

    - name: Set a fact for the list
      ansible.builtin.set_fact:
        the_list: "{{ the_list.stdout_lines }}"

    - name: Ping list times
      ansible.builtin.shell:
        cmd: |
          echo {{ item }}
          sleep {{ delay_secs }}
      loop: "{{ the_list }}"

AWX version

22.3.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

Whatever the AWX 22.3.0 EE is running

Operating system

Kubernetes 1.24.4

Web browser

Chrome

Steps to reproduce

Running this very simple playbook; it should be easy to reproduce.

Expected results

That the playbook completes without error.

Actual results

The playbook ends with "Job terminated due to error".

Additional information

I'd urge that this is triaged pretty highly, it'd be impossible justifying running this in production. The reason that I'm running such a trivial playbook is that we have some in production (non-trivial) that run for quite a long amount of time.

@mamercad
Copy link
Contributor Author

mamercad commented May 27, 2023

I've reproduced this many times now:

image

@fosterseth
Copy link
Member

what k8s cluster are using? (eks, kind, k3s, etc)

what is your awx spec file?

how many job events get produced before the job errors out? Is it consistent for each job failure?

@mamercad
Copy link
Contributor Author

mamercad commented Jun 1, 2023

what k8s cluster are using? (eks, kind, k3s, etc)

k3s 1.24.4

what is your awx spec file?

It's stock 22.3.0; I don't really know what you mean by "awx spec file", could you elaborate?

how many job events get produced before the job errors out? Is it consistent for each job failure?

Seems like it generally dies at around 4300 lines.

Could you try reproducing it in your environment?

Here's the playbook if you just want to clone it.

@mamercad
Copy link
Contributor Author

mamercad commented Jun 3, 2023

what is your awx spec file?

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  hostname: redacted
  service_type: ClusterIP
  admin_user: admin
  admin_email: redacted
  admin_password_secret: redacted

@mamercad
Copy link
Contributor Author

mamercad commented Jun 6, 2023

Reproducing this in a brand new environment (same K3s and AWX versions and everything is "stock"):

2023-06-06 14:17:44,743 INFO     [53284703eef94916ba127ed768d7580b] awx.analytics.job_lifecycle job-14 waiting
2023-06-06 14:17:44,970 INFO     [53284703eef94916ba127ed768d7580b] awx.analytics.job_lifecycle job-14 pre run
2023-06-06 14:17:45,620 INFO     [53284703eef94916ba127ed768d7580b] awx.analytics.job_lifecycle job-14 preparing playbook
2023-06-06 14:17:45,744 INFO     [53284703eef94916ba127ed768d7580b] awx.analytics.job_lifecycle job-14 running playbook
2023-06-06 14:17:46,803 INFO     [53284703eef94916ba127ed768d7580b] awx.analytics.job_lifecycle job-14 work unit id received
2023-06-06 14:17:46,868 INFO     [53284703eef94916ba127ed768d7580b] awx.analytics.job_lifecycle job-14 work unit id assigned
2023-06-06 14:17:50,888 INFO     [-] awx.main.wsrelay Starting producer for 10.42.0.28-schedules-changed
2023-06-06 14:17:56,448 ERROR    [2871ad68ee404c6abab9a4315553b734] awx.main.dispatch Worker failed to run task awx.main.tasks.system.purge_old_stdout_files(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 332, in purge_old_stdout_files
    for f in os.listdir(settings.JOBOUTPUT_ROOT):
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/awx/job_status'
2023-06-06 14:18:00,894 INFO     [-] awx.main.wsrelay Producer 10.42.0.28-schedules-changed has no subscribers, shutting down.
2023-06-06 14:35:25,562 INFO     [-] awx.main.wsrelay Starting producer for 10.42.0.28-schedules-changed
2023-06-06 14:35:35,570 INFO     [-] awx.main.wsrelay Producer 10.42.0.28-schedules-changed has no subscribers, shutting down.

Same simple playbook, set it to count to 6k and it only got to about 4300 before "Error: Job terminated due to error".

@TheRealHaoLiu TheRealHaoLiu self-assigned this Jun 6, 2023
@TheRealHaoLiu
Copy link
Member

can you check the kubelet container-log-max-size configuration for your k3s cluster?

one possibility is due to log size and cause log rotation in kube which you will need to enable RECEPTOR_KUBE_SUPPORT_RECONNECT described in

ansible/receptor#683

@TheRealHaoLiu
Copy link
Member

back of napkin calculation each run of Ping list times produce about 2000Bytes of data in stdout
8000 run of it produce about 17.25 MiB of data

by default kubelet container-log-max-size is 10MiB according to https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

unless k3s configures the kubelet differently and overrides container-log-max-size this playbook seems like it WOULD run into the log rotation and would require RECEPTOR_KUBE_SUPPORT_RECONNECT

@mamercad
Copy link
Contributor Author

mamercad commented Jun 7, 2023

can you check the kubelet container-log-max-size configuration for your k3s cluster?

I think it's 10Mi by default with k3s, I'm testing with k3s server --kubelet-arg container-log-max-files=4 --kubelet-arg container-log-max-size=50Mi currently and will report back.

I'll try RECEPTOR_KUBE_SUPPORT_RECONNECT as well.

@mamercad
Copy link
Contributor Author

mamercad commented Jun 7, 2023

Good news, the job completed with the updated kubelet log arguments:

image

For the next test, I'm going to try 9000 lines and I'm going to add RECEPTOR_KUBE_SUPPORT_RECONNECT.

@TheRealHaoLiu
Copy link
Member

can you please update the name for the issue to reflect the actual problem

"Task was destoryed but it is pending" seem to be coming from the asyncio library which is used in the wsrelay component (websocket) unlikely its related

@fosterseth fosterseth changed the title Task was destroyed but it is pending k3s max container log size leading to job error Jun 7, 2023
@mamercad
Copy link
Contributor Author

mamercad commented Jun 7, 2023

I need to bump k3s to try the Receptor argument: https://github.com/ansible/receptor/pull/683/files#diff-792611baeb730234abface9c9bc33204ea86453469e7fcdfbe8de8d4e04f2598R659

I can do that, and, I'll report back when I've done that and tested.

@mamercad
Copy link
Contributor Author

mamercad commented Jun 7, 2023

I need to bump k3s to try the Receptor argument: https://github.com/ansible/receptor/pull/683/files#diff-792611baeb730234abface9c9bc33204ea86453469e7fcdfbe8de8d4e04f2598R659

I can do that, and, I'll report back when I've done that and tested.

k3s bumped to 1.24.14, Receptor flag set, trying with 16k lines ...

@mamercad
Copy link
Contributor Author

mamercad commented Jun 8, 2023

I think this issue is resolved, 16k lines is complete (both in the UI and the downloaded logfile).

image

❯ wc -l ~/Downloads/job_36.txt
   16022 /Users/mmercado/Downloads/job_36.txt

The job took nearly 6h which makes me feel good as well.

To summarize, I bumped k3s to 1.24.14 (for the Receptor argument) and am now running k3s server --kubelet-arg container-log-max-files=4 --kubelet-arg container-log-max-size=50Mi.

Lastly, I added:

        ee_extra_env: |
          - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
            value: enabled

to the AWX spec.

@Klaas-
Copy link
Contributor

Klaas- commented Jun 22, 2023

if I run with RECEPTOR_KUBE_SUPPORT_RECONNECT enabled, it does reconnect, but it seems to "lose" content.

So for the reproducer by @mamercad I see in the log of a running job:

changed: [localhost] => (item=13111)
changed: [localhost] => (item=13112)
changed: [localhost] => (item=19886)
changed: [localhost] => (item=19887)

maybe related #14158

@Klaas-
Copy link
Contributor

Klaas- commented Jun 22, 2023

Details of the two executions:

  "cmd": "echo 13112\nsleep 1\n",
  "start": "2023-06-22 10:55:45.976451",
  "end": "2023-06-22 10:55:46.983008",

  "cmd": "echo 19886\nsleep 1\n",
  "start": "2023-06-22 13:22:35.641397",
  "end": "2023-06-22 13:22:36.649521",

~140 minutes between those two where the logs have not made it into AWX

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants