Recovery of running job after its control process is lost #13848
Apologies for the tag, but since this issue is fairly old, I figured a ping would be acceptable.

Version:
I suspect I'm running into this behavior but have not yet identified the cause. 5 node K8s cluster example:

In order to have a bit more resiliency with AWX, the following is configured:

Dispatcher DB Toleration
The dispatcher is configured to wait 10 minutes instead of the default 40 seconds.

Receptor Reconnect
As I'm running a newer version of Kubernetes, I opted to include this setting in the event it helps (but I'm not positive it applies here):
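The exact snippets aren't shown here, but as a rough illustration only (and purely my assumption about which settings these map to: DISPATCHER_DB_DOWNTIME_TOLERANCE for the dispatcher DB toleration and RECEPTOR_KUBE_SUPPORT_RECONNECT for the reconnect behavior), they could be carried through the operator's extra_settings roughly like this:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  extra_settings:
    - setting: DISPATCHER_DB_DOWNTIME_TOLERANCE   # assumed setting name; default is 40 (seconds)
      value: "600"                                # tolerate 10 minutes of DB downtime
    - setting: RECEPTOR_KUBE_SUPPORT_RECONNECT    # assumed setting name
      value: "'enabled'"                          # string values need embedded quotes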
Supervisord Retries: All
Useful for instances where DB HA is not available and clustered DBs are not supported in AWX.

IDEA | It would be nice to have this be configurable via the AWX Operator (applied via template to the ...). Currently, I have the option ...

An example (heavily truncated output, but given as an example):
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: supervisord-conf
data:
  supervisord_task.conf: |
    [supervisord]
    nodaemon = True
    umask = 022
    logfile = /dev/stdout
    logfile_maxbytes = 0
    pidfile = /var/run/supervisor/supervisor.task.pid

    [program:dispatcher]
    command = awx-manage run_dispatcher
    ...
    startretries = 200

    [program:wsrelay]
    command = awx-manage run_wsrelay
    ...
    startretries = 200

    [program:callback-receiver]
    command = awx-manage run_callback_receiver
    ...
    startretries = 200
    ...

  supervisord_web.conf: |
    [supervisord]
    nodaemon = True
    umask = 022
    logfile = /dev/stdout
    logfile_maxbytes = 0
    pidfile = /var/run/supervisor/supervisor.web.pid

    [program:nginx]
    command = nginx -g "daemon off;"
    ...
    startretries = 200

    [program:uwsgi]
    command = /var/lib/awx/venv/awx/bin/uwsgi /etc/tower/uwsgi.ini
    ...
    startretries = 200

    [program:daphne]
    command = /var/lib/awx/venv/awx/bin/daphne -b 127.0.0.1 -p 8051 --websocket_timeout -1 awx.asgi:channel_layer
    ...
    startretries = 200

    [program:ws-heartbeat]
    command = awx-manage run_ws_heartbeat
    ...
    startretries = 200

    [program:awx-cache-clear]
    command = awx-manage run_cache_clear
    ...
    startretries = 200
    ...

  supervisord_rsyslog.conf: |
    [supervisord]
    nodaemon = True
    umask = 022
    logfile = /dev/stdout
    logfile_maxbytes = 0
    pidfile = /var/run/supervisor/supervisor.rsyslog.pid

    [program:awx-rsyslogd]
    command = rsyslogd -n -i /var/run/awx-rsyslog/rsyslog.pid -f /var/lib/awx/rsyslog/rsyslog.conf
    ...
    startretries = 200

    [program:awx-rsyslog-configurer]
    command = awx-manage run_rsyslog_configurer
    ...
    startretries = 200
    ...
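On the IDEA above (making this configurable via the AWX Operator): as a stopgap, I imagine the ConfigMap could be wired in through the operator's extra_volumes and task_extra_volume_mounts / web_extra_volume_mounts fields. This is only a sketch under my assumptions; in particular, the in-image paths of the stock supervisord configs (shown here as /etc/supervisord_task.conf and /etc/supervisord_web.conf) would need to be verified before overriding them with subPath mounts:

spec:   # same AWX custom resource as above
  extra_volumes: |
    - name: supervisord-conf
      configMap:
        name: supervisord-conf
  task_extra_volume_mounts: |
    - name: supervisord-conf
      mountPath: /etc/supervisord_task.conf   # assumed path of the stock task config
      subPath: supervisord_task.conf
  web_extra_volume_mounts: |
    - name: supervisord-conf
      mountPath: /etc/supervisord_web.conf    # assumed path of the stock web config
      subPath: supervisord_web.conf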
Workflow Job Management Challenges
I'm using Ansible to manage the maintenance of the AWX K8s nodes (patching, reboots, etc.). When bringing an AWX node down for maintenance, it's challenging not to kill workflow jobs, since they are not directly exposed to K8s or the underlying OS. For example, for regular jobs we can see the ...

I'll try and lay out an architecture of what I'm seeing: 9 workflow jobs containing 3 workflow nodes, with each job node running a task that pauses for 3 hours. Below, we can see the job template jobs which are run from the workflow:
Let's assume maintenance will be applied on ...

After disabling the AWX instance on ...
Above, the next 3 workflow nodes are executed in the 3 workflows which had the first workflow nodes managing ...

If I were to drain ...

Every so often, I'll get non-Ansible job output in the AWX UI job output screen with just a single message: ... No tracebacks or JSON-like errors to show. Sometimes, jobs can survive, and I wonder if it's because the workflow is managed on a node that was not taken down.
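For context, the "disabling the AWX instance" step I mentioned is driven from Ansible along these lines (a rough sketch; the URL, token handling, and instance hostname are placeholders):

- name: Disable an AWX instance ahead of node maintenance (illustrative sketch)
  hosts: localhost
  gather_facts: false
  vars:
    awx_url: https://awx.example.com                # placeholder controller URL
    awx_token: "{{ lookup('env', 'AWX_TOKEN') }}"   # placeholder token source
    target_instance: awx-task-node1                 # placeholder instance hostname
  tasks:
    - name: Look up the instance id by hostname
      ansible.builtin.uri:
        url: "{{ awx_url }}/api/v2/instances/?hostname={{ target_instance }}"
        headers:
          Authorization: "Bearer {{ awx_token }}"
        return_content: true
      register: instance_lookup

    - name: Disable the instance so no new jobs are assigned to it
      ansible.builtin.uri:
        url: "{{ awx_url }}/api/v2/instances/{{ instance_lookup.json.results[0].id }}/"
        method: PATCH
        body_format: json
        body:
          enabled: false
        headers:
          Authorization: "Bearer {{ awx_token }}"
        status_code: 200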
Where the newly created jobs from before (...) ...

Finally, the last workflow nodes are executed, creating new job pods on the surviving nodes, e.g.: ...

Wrap-Up
It's at this point that I'm lost on how to continue forward, but I'm happy to do any debugging if needed. Something I just thought about, which I can try playing around with, is changing the supervisord configuration entries of: ...
This may be something that could help in this case. Additionally, looking at your previous comments on AWX issues, you mention using ...

I'll work on getting the ...
Please confirm the following
Feature type
New Feature
Feature Summary
Scenario: controller_node != execution_node

Right now, this job will be reaped. We hate it when jobs get reaped.
Details of proposed solution, subject to change:

Replace the reap method called from cluster_node_heartbeat with a different reconciliation method called from awx_receptor_workunit_reaper. This method has access to receptorctl status. ... worker_tasks ... RunJob or its equivalent. ... processing from the last line processed, which can be ascertained from the saved events or some other means of tracking.

Select the relevant components
Steps to reproduce
See feature summary
Current results
Reaper message, job canceled
Suggested feature result
Jobs should never be reaped - just have processing resumed.
Receptor becomes source of ultimate truth.
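Not part of the proposal itself, just an illustration of the data Receptor already exposes: the work-unit list such a reconciliation method could consult can be eyeballed today from the awx-ee container of a control pod. The namespace, pod name, and socket path below are placeholders/assumptions:

- name: Inspect Receptor work units on a control pod (illustrative only)
  kubernetes.core.k8s_exec:
    namespace: awx                      # placeholder namespace
    pod: awx-task-xxxxxxxxxx-yyyyy      # placeholder task/control pod name
    container: awx-ee
    command: receptorctl --socket /var/run/receptor/receptor.sock work list
  register: receptor_work

- name: Show the raw work unit listing
  ansible.builtin.debug:
    var: receptor_work.stdout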
Additional information
No response