Recovery of running job after its control process is lost #13848

Open
AlanCoding opened this issue Apr 12, 2023 · 1 comment

@AlanCoding
Member

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Feature type

New Feature

Feature Summary

Scenario:

  • A job that sleeps for several minutes is launched on an execution node, so controller_node != execution_node
  • The job starts successfully and begins producing events
  • Mid-run, services are restarted on the job's controller node

Right now, this job will be reaped. We hate it when jobs get reaped.

Details of proposed solution, subject to change (a rough sketch follows this list):

  • replace the reap method called from cluster_node_heartbeat with a different reconciliation method called from awx_receptor_workunit_reaper. This method has access to receptorctl status.
    • in addition, this method will be given access to the process list, worker_tasks
    • in addition, this method may pull the list of running jobs from the database, as needed
  • In the event that the database status is "running" but the process is missing (timing issues assumed to be worked out), it will send a message back to the dispatcher main process if the receptor status is still active.
  • When the dispatcher gets a message that an active job is orphaned, it will launch RunJob or its equivalent.
  • Instead of starting the job, it will pick up processing from the last line processed, which can be ascertained from the saved events or some other means of tracking.
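
Purely as a sketch of the flow above, with every name hypothetical — reconcile_running_jobs, notify_dispatcher_resume, job.work_unit_id, and the receptor StateName values are illustrative assumptions, not existing AWX code:

# Hypothetical sketch only - not AWX code. The real reconciliation would live
# alongside awx_receptor_workunit_reaper and the dispatcher.

def reconcile_running_jobs(receptor_work_units, worker_tasks, running_jobs):
    """Decide which 'running' jobs have lost their control process but are
    still alive in receptor, instead of reaping them outright.

    receptor_work_units: dict of work unit id -> status from receptorctl
    worker_tasks: ids of jobs currently owned by a local dispatcher worker
    running_jobs: jobs whose database status is 'running' on this node
    """
    orphaned = []
    for job in running_jobs:
        unit = receptor_work_units.get(job.work_unit_id)
        if unit is None:
            continue  # receptor no longer knows the work unit; reap as before
        receptor_active = unit.get("StateName") in ("Pending", "Running")
        has_control_process = job.id in worker_tasks
        if receptor_active and not has_control_process:
            orphaned.append(job.id)  # alive remotely, but orphaned locally
    return orphaned


def notify_dispatcher_resume(job_id, last_event_counter):
    """Placeholder: message the dispatcher main process so it re-launches
    RunJob (or an equivalent) in a resume mode that picks up event
    processing after last_event_counter."""
    ...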

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Steps to reproduce

See feature summary

Current results

Reaper message, job canceled

Suggested feature result

Jobs should never be reaped - just have processing resumed.

Receptor becomes the ultimate source of truth.

Additional information

No response

@sirjaren commented Nov 13, 2023

Apologies for the tag, but since this issue is fairly old, I figured a ping would be acceptable

@AlanCoding

Version: awx 22.6.0

I suspect I'm running into this behavior but have not yet identified the cause.
Specifically, I see this happening when trying to perform maintenance when there are workflow jobs running.

5 node K8s cluster example:

3 AWX control nodes
2 DB nodes (primary and replica)

In order to have a bit more resiliency with AWX, the following is configured:

Dispatcher DB Toleration

The dispatcher is configured to wait 10 minutes instead of the default 40 seconds.
Useful for instances where DB HA is not available and clustered DBs are not supported in AWX.

DISPATCHER_DB_DOWNTOWN_TOLLERANCE = 600  # awx < 23.0.0
DISPATCHER_DB_DOWNTIME_TOLERANCE = 600   # awx >= 23.0.0
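
My mental model of the tolerance (just an illustration of the concept, not AWX's actual dispatcher code) is a deadline on reconnect attempts before the process gives up and lets supervisord restart it:

# Conceptual illustration only; function and parameter names are made up.
import time

DB_DOWNTIME_TOLERANCE = 600  # seconds; the value set above

def wait_for_db(connect, tolerance=DB_DOWNTIME_TOLERANCE, interval=5):
    """Keep retrying the database connection until it succeeds or the
    tolerance window is exceeded, then give up so the supervisor restarts us."""
    deadline = time.monotonic() + tolerance
    while True:
        try:
            return connect()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # exceeded tolerance; stop retrying
            time.sleep(interval)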

Receptor Reconnect

As I'm running a newer version of Kubernetes, I opted to include this setting in the event it helps (but I'm not positive it applies here):

RECEPTOR_KUBE_SUPPORT_RECONNECT = enabled

Supervisord Retries

All supervisord managed programs in the awx-task, awx-web and awx-rsyslogd containers have a new option applied to retry more often.

Useful for instances where DB HA is not available and clustered DB's are not supported in AWX.

IDEA | Would be nice to have this be configurable via AWX Operator (Applied via template to the supervisord configurations).

Currently, I have the startretries option applied to all program stanzas; these are applied via ConfigMap to each relevant container, overwriting the default configurations.

An example (heavily truncated):

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: supervisord-conf
data:
  supervisord_task.conf: |
    [supervisord]
    nodaemon = True
    umask = 022
    logfile = /dev/stdout
    logfile_maxbytes = 0
    pidfile = /var/run/supervisor/supervisor.task.pid

    [program:dispatcher]
    command = awx-manage run_dispatcher
    ...
    startretries = 200

    [program:wsrelay]
    command = awx-manage run_wsrelay
    ...
    startretries = 200

    [program:callback-receiver]
    command = awx-manage run_callback_receiver
    ...
    startretries = 200

    ...

  supervisord_web.conf: |
    [supervisord]
    nodaemon = True
    umask = 022
    logfile = /dev/stdout
    logfile_maxbytes = 0
    pidfile = /var/run/supervisor/supervisor.web.pid

    [program:nginx]
    command = nginx -g "daemon off;"
    ...
    startretries = 200

    [program:uwsgi]
    command = /var/lib/awx/venv/awx/bin/uwsgi /etc/tower/uwsgi.ini
    ...
    startretries = 200

    [program:daphne]
    command = /var/lib/awx/venv/awx/bin/daphne -b 127.0.0.1 -p 8051 --websocket_timeout -1 awx.asgi:channel_layer
    ...
    startretries = 200

    [program:ws-heartbeat]
    command = awx-manage run_ws_heartbeat
    ...
    startretries = 200

    [program:awx-cache-clear]
    command = awx-manage run_cache_clear
    ...
    startretries = 200

    ...

  supervisord_rsyslog.conf: |
    [supervisord]
    nodaemon = True
    umask = 022
    logfile = /dev/stdout
    logfile_maxbytes = 0
    pidfile = /var/run/supervisor/supervisor.rsyslog.pid

    [program:awx-rsyslogd]
    command = rsyslogd -n -i /var/run/awx-rsyslog/rsyslog.pid -f /var/lib/awx/rsyslog/rsyslog.conf
    ...
    startretries = 200

    [program:awx-rsyslog-configurer]
    command = awx-manage run_rsyslog_configurer
    ...
    startretries = 200

    ...

Workflow Job Management Challenges

I'm using Ansible to manage the maintenance of the AWX K8s nodes (patch, reboots, etc).

When bringing an AWX node down for maintenance it's challenging to not kill workflow jobs since they are not directly exposed to K8s or the underlying OS.

For example, for regular jobs, we can see the automation-pod-* K8s pods which represent AWX job template jobs (non-workflow). A combination of K8s node cordoning, AWX instance disabling, node draining, and node reboot does not necessarily mean the workflow jobs (managed by the Dispatcher? or the AWX Task Manager?) have reconciled and moved to other AWX nodes.

I'll try and lay out an architecture of what I'm seeing:

9 workflow jobs, each containing 3 workflow nodes, with each workflow node running a job that pauses for 3 hours

Below, we can see the job template jobs which are run from the workflow:

node1
  automation-pod-1
  automation-pod-2
  automation-pod-3
node2
  automation-pod-4
  automation-pod-5
  automation-pod-6
node3
  automation-pod-7
  automation-pod-8
  automation-pod-9

Let's assume maintenance will be applied on node1.

After disabling the AWX instance on node1, cordoning node1, we can either wait for all jobs to clear or go in and cancel them. If we cancel them, we are left with the following:

node1 (disabled)
node2
  automation-pod-4
  automation-pod-5
  automation-pod-6
  automation-pod-10
node3
  automation-pod-7
  automation-pod-8
  automation-pod-9
  automation-pod-11
  automation-pod-12

Above, the next 3 workflow nodes are executed in the 3 workflows whose first workflow nodes were managing automation-pod-1, automation-pod-2, and automation-pod-3.

If I were to drain node1, wait for pods to evict, and bring down node1 for maintenance, then depending on "something" (Redis? Dispatcher? AWX Task Manager? Receptor?), it's possible that jobs get killed (on other nodes) with the dreaded Job terminated due to error message.

Every so often, I'll get non-Ansible job output in the AWX UI job output screen with just a single message:
Finished

No tracebacks, or JSON-like errors to show.

Sometimes, jobs can survive and I wonder if it's because the workflow is managed on a node that was not taken down.
If jobs are killed, this could make our example look like the following:

node1 (disabled)
node2
  automation-pod-4
  automation-pod-5
  automation-pod-6
  automation-pod-13
  automation-pod-15
node3
  automation-pod-7
  automation-pod-8
  automation-pod-9
  automation-pod-14

Here, the newly created jobs from before (automation-pod-10, automation-pod-11, and automation-pod-12) are killed, presumably because "something" on node1 was still responsible for them, and bringing node1 down for maintenance killed those pods (which live on other nodes).

Finally, the last workflow nodes are executed creating new job pods on the surviving nodes, eg: automation-pod-13, automation-pod-14, and automation-pod-15.

Wrap-Up

It's at this point that I'm lost on how to move forward, but I'm happy to do any debugging if needed.

Something I just thought about, and can try playing around with, is changing these supervisord configuration entries:

stopasgroup=false
killasgroup=false

These may help in this case.

Additionally, looking at your previous comments on AWX issues, you mention using receptorctl.
Our EEs do not have this installed, but we do include Receptor via:

- COPY --from=quay.io/ansible/receptor:v1.4.2 /usr/bin/receptor /usr/bin/receptor
- RUN mkdir -p /var/run/receptor

I'll work on getting the receptorctl command into these EEs, and maybe it can help me understand a bit more whether this is indeed a Receptor-related issue.
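
Once receptorctl is installed (pip install receptorctl), I'm assuming something like this would let me poke at the work units from Python; the socket path and the status field names are assumptions based on the default AWX receptor configuration and may differ in other deployments:

# Sketch: inspect receptor work units via receptorctl's Python interface.
# Socket path and field names are assumptions; verify against your receptor
# version and receptor.conf.
from receptorctl.socket_interface import ReceptorControl

ctl = ReceptorControl("/var/run/receptor/receptor.sock")

print(ctl.simple_command("status"))           # mesh / node status
work_units = ctl.simple_command("work list")  # dict: unit id -> details
for unit_id, info in work_units.items():
    print(unit_id, info.get("StateName"), info.get("Detail"))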
