Increase resiliency when application crashes #11955

shanemcd · 2022-03-24T16:23:56Z

SUMMARY

Prior to this change, the control plane pod would remain running if a service crashed.

The commits should be self-explanatory, but at a high-level, this PR:

Adds a new event listener that causes the pod to exit if any process crashes.
Fixes a race condition where the callback receiver would crash if code in the dispatcher didnt run first.

ISSUE TYPE

Bugfix Pull Request

COMPONENT NAME

API

tools/ansible/roles/dockerfile/templates/Dockerfile.j2

rooftopcellist

Other than this small nitpick, LGTM.

AlanCoding · 2022-03-29T02:41:53Z

tools/ansible/roles/dockerfile/files/launch_awx.sh

@@ -17,4 +17,13 @@ set -e

 wait-for-migrations

-supervisord -c /etc/supervisord.conf
+awx-manage provision_instance


I want @jbradberry eyes on this

At a first glance I think I like it.

I never liked --hostname being a mandatory "optional" flag anyway.

The other thing I'm chewing on is this being in an entry-point kind of script, but I now realize this pattern is more similar to the development environment

https://github.com/ansible/awx/blob/devel/tools/docker-compose/bootstrap_development.sh

I still feel like something's suspiciously missing. Isn't this a function of the AWX_AUTO_DEPROVISION_INSTANCES setting? If we register in this script, what are we doing here?

awx/awx/main/managers.py

Lines 172 to 173 in ac6a82e

if settings.IS_K8S:

registered = self.register(ip_address=pod_ip, node_type='control', uuid=settings.SYSTEM_UUID)

Already, that raises another point that if we do register here, we should register as a control node. Also with the correct uuid.

Also, what's the reason for moving it here to begin with?

After this change, instances can still get registered here, and I'm unsure if it should be removed (but lean toward removing)

awx/awx/main/dispatch/reaper.py

Line 38 in f5c1767

(changed, me) = Instance.objects.get_or_register()

What's the behavior if you run awx-manage provision_instance on a non-K8S environment? Will it give a sensible error?

After this change, instances can still get registered here, and I'm unsure if it should be removed (but lean toward removing)

I lean towards keeping it. If for some reason the instance gets deprovisioned while the pod is running, the system would not recover. I am worried about reintroducing the bug that was getting solved by #4829. I dont think it's going hurt anything by keeping it.

What's the behavior if you run awx-manage provision_instance on a non-K8S environment? Will it give a sensible error?

I dont see a reason to throw an error. But I have updated the help text to reflect what will happen.

This fix appears to be very strongly related to #11927

I think this needs more careful thought about expectations. If a missing instance is going to result in an error (as one would expect if registration is done by the entrypoint or installer) then fine, that's not a bad decision. But care needs to be taken to carry that out in a way where other changes won't be fighting it.

tools/ansible/roles/dockerfile/files/launch_awx.sh

AlanCoding · 2022-03-29T17:55:26Z

shanemcd#60

There was a race condition because the callback reciever tried to run this code: File "/awx_devel/awx/main/management/commands/run_callback_receiver.py", line 31, in handle CallbackBrokerWorker(), File "/awx_devel/awx/main/dispatch/worker/callback.py", line 49, in __init__ self.subsystem_metrics = s_metrics.Metrics(auto_pipe_execute=False) File "/awx_devel/awx/main/analytics/subsystem_metrics.py", line 156, in __init__ self.instance_name = Instance.objects.me().hostname Before get_or_register was being called by the dispatcher.

I didnt see a reason for the weird inconsistencies here.

dumb-init is more actively maintained, available on pypi, and already used for both upstream and downstream EEs

AlanCoding · 2022-03-29T18:29:24Z

tools/ansible/roles/dockerfile/templates/supervisor.conf.j2

@@ -12,9 +12,10 @@ directory = /awx_devel
 {% else %}
 command = nginx -g "daemon off;"
 {% endif %}
-autostart = true


removed because it's the default... just want to get that stated as a sanity check.

AlanCoding · 2022-03-29T18:36:36Z

tools/docker-compose/supervisor.conf

 autorestart = true
-stopwaitsecs = 1
-stopsignal=KILL
+startsecs = 30


Looks like the default for stopwaitsecs is 10. This would be my preference.

The dispatcher with running jobs may not respond well to SIGTERM right now. This is something I packed into #11745, along with a supervisor config change which is almost identical to the change you make here (for stopwaitsecs and stopsignal).

But that's getting kicked out until next release. If you want, I could look into a targeted change to just process SIGTERM signals to cancel a job so that it's not left running orphaned on the execution node.

AlanCoding · 2022-03-29T18:38:06Z

I'd like to make a note to port changes to startsecs (at least) to the traditional installer.

AlanCoding · 2022-03-29T18:45:04Z

This behavior could still use some cleanup.

$ awx-manage provision_instance
Traceback (most recent call last):
  File "/usr/local/bin/awx-manage", line 9, in <module>
    load_entry_point('awx', 'console_scripts', 'awx-manage')()
  File "/awx_devel/awx/__init__.py", line 170, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/usr/lib64/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/awx_devel/awx/main/management/commands/provision_instance.py", line 37, in handle
    self._register_hostname(options.get('hostname'), options.get('node_type'), options.get('uuid'))
  File "/awx_devel/awx/main/management/commands/provision_instance.py", line 25, in _register_hostname
    (changed, instance) = Instance.objects.get_or_register()
  File "/awx_devel/awx/main/managers.py", line 182, in get_or_register
    return (False, self.me())
  File "/awx_devel/awx/main/managers.py", line 109, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

I'm tempted to just not run this in non-OCP environments. This seems the "safest" from a maintainability mindset.

diff --git a/awx/main/management/commands/provision_instance.py b/awx/main/management/commands/provision_instance.py
index b873cb2fd5..aeae496306 100644
--- a/awx/main/management/commands/provision_instance.py
+++ b/awx/main/management/commands/provision_instance.py
@@ -1,8 +1,9 @@
 # Copyright (c) 2015 Ansible, Inc.
 # All Rights Reserved
 
-from django.core.management.base import BaseCommand
+from django.core.management.base import BaseCommand, CommandError
 from django.db import transaction
+from django.conf import settings
 
 from awx.main.models import Instance
 
@@ -22,6 +23,8 @@ class Command(BaseCommand):
 
     def _register_hostname(self, hostname, node_type, uuid):
         if not hostname:
+            if not settings.AWX_AUTO_DEPROVISION_INSTANCES:
+                raise CommandError('Registering with values from settings only intended for use in K8s installs')
             (changed, instance) = Instance.objects.get_or_register()
         else:
             (changed, instance) = Instance.objects.register(hostname=hostname, node_type=node_type, uuid=uuid)

AlanCoding · 2022-03-29T18:56:36Z

^ that might be better done in followup work, because it could cascade to some unused code deletions.

shanemcd added 2 commits March 23, 2022 18:37

Remove awx-config-watcher from kube dev env

dbcdb82

Kill all control plane processes if one of them fails to boot

270497e

awxbot added type:bug component:api labels Mar 24, 2022

shanemcd mentioned this pull request Mar 24, 2022

The dispatcher service fails to ever recover from a restart of the postgres pod #11960

Closed

6 tasks

AlanCoding requested a review from rooftopcellist March 24, 2022 17:50

shanemcd mentioned this pull request Mar 24, 2022

django.db.utils.OperationalError: server closed the connection unexpectedly for long jobs #11391

Closed

3 tasks

shanemcd force-pushed the fail-better branch 3 times, most recently from d6e4b2b to 9344cd5 Compare March 28, 2022 15:57

shanemcd marked this pull request as ready for review March 28, 2022 16:23

shanemcd force-pushed the fail-better branch 2 times, most recently from 9be56a0 to 1c16c7c Compare March 28, 2022 16:51

rooftopcellist reviewed Mar 28, 2022

View reviewed changes

tools/ansible/roles/dockerfile/templates/Dockerfile.j2 Show resolved Hide resolved

rooftopcellist approved these changes Mar 28, 2022

View reviewed changes

AlanCoding reviewed Mar 29, 2022

View reviewed changes

tools/ansible/roles/dockerfile/files/launch_awx.sh Show resolved Hide resolved

shanemcd force-pushed the fail-better branch from b937669 to a223f64 Compare March 29, 2022 17:56

shanemcd and others added 7 commits March 29, 2022 14:07

Use the same settings on each supervisor entry

0cbc802

I didnt see a reason for the weird inconsistencies here.

Fix race condition causing rsyslog to crash

dbdc529

Replace tini with dumb-init

39905b3

dumb-init is more actively maintained, available on pypi, and already used for both upstream and downstream EEs

Ensure stop signals are forwarded to the correct processes

4446434

Update help text for provision_instance command

b36af5d

Specifically abort the reaper if instance not registered

fe5736d

shanemcd force-pushed the fail-better branch from a9f68b6 to fe5736d Compare March 29, 2022 18:09

AlanCoding reviewed Mar 29, 2022

View reviewed changes

AlanCoding approved these changes Mar 29, 2022

View reviewed changes

Only allow provision_instances without options to be used on k8s

28fa90e

shanemcd merged commit ef0f6ca into ansible:devel Mar 30, 2022

This was referenced Mar 30, 2022

Delete dead code from get_or_register, move, and test #11980

Merged

rsyslog.conf is missing at container startup #11888

Closed

AlanCoding mentioned this pull request Jun 29, 2022

Error with rejoining node to cluster after lost connection to postgres #4294

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase resiliency when application crashes #11955

Increase resiliency when application crashes #11955

shanemcd commented Mar 24, 2022 •

edited

Loading

rooftopcellist left a comment •

edited

Loading

AlanCoding Mar 29, 2022

jbradberry Mar 29, 2022

jbradberry Mar 29, 2022

AlanCoding Mar 29, 2022

AlanCoding Mar 29, 2022

AlanCoding Mar 29, 2022

shanemcd Mar 29, 2022

shanemcd Mar 29, 2022

AlanCoding Mar 29, 2022

AlanCoding commented Mar 29, 2022

AlanCoding Mar 29, 2022

AlanCoding Mar 29, 2022 •

edited

Loading

AlanCoding commented Mar 29, 2022

AlanCoding commented Mar 29, 2022

AlanCoding commented Mar 29, 2022

	if settings.IS_K8S:
	registered = self.register(ip_address=pod_ip, node_type='control', uuid=settings.SYSTEM_UUID)

Increase resiliency when application crashes #11955

Increase resiliency when application crashes #11955

Conversation

shanemcd commented Mar 24, 2022 • edited Loading

SUMMARY

ISSUE TYPE

COMPONENT NAME

rooftopcellist left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlanCoding commented Mar 29, 2022

Choose a reason for hiding this comment

AlanCoding Mar 29, 2022 • edited Loading

Choose a reason for hiding this comment

AlanCoding commented Mar 29, 2022

AlanCoding commented Mar 29, 2022

AlanCoding commented Mar 29, 2022

shanemcd commented Mar 24, 2022 •

edited

Loading

rooftopcellist left a comment •

edited

Loading

AlanCoding Mar 29, 2022 •

edited

Loading