
Increase resiliency when application crashes #11955

Merged (10 commits) on Mar 30, 2022

Conversation


@shanemcd (Member) commented Mar 24, 2022

SUMMARY

Prior to this change, the control plane pod would remain running if a service crashed.

The commits should be self-explanatory, but at a high level, this PR:

  • Adds a new event listener that causes the pod to exit if any process crashes (a rough sketch of this pattern is included after the summary).
  • Fixes a race condition where the callback receiver would crash if code in the dispatcher didn't run first.
ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME
  • API
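As a rough illustration of the event-listener approach (not necessarily the exact listener added in this PR), a supervisord event listener that reacts to PROCESS_STATE_FATAL events could be a small Python script along these lines:

#!/usr/bin/env python3
# Illustrative sketch only: exit the pod when any supervised process enters
# the FATAL state. This is an assumed shape, not the literal listener in this PR.
import os
import signal
import sys


def main():
    while True:
        # Tell supervisord we are ready for the next event.
        sys.stdout.write('READY\n')
        sys.stdout.flush()

        # The event header is a single line of "key:value" tokens.
        header = dict(token.split(':') for token in sys.stdin.readline().split())
        sys.stdin.read(int(header['len']))  # read (and ignore) the event payload

        if header['eventname'] == 'PROCESS_STATE_FATAL':
            # Killing supervisord (our parent process) brings the container down,
            # so Kubernetes can restart the pod.
            os.kill(os.getppid(), signal.SIGTERM)

        # Acknowledge the event so supervisord keeps dispatching.
        sys.stdout.write('RESULT 2\nOK')
        sys.stdout.flush()


if __name__ == '__main__':
    main()

Such a script would be registered via an [eventlistener] section in supervisord.conf subscribed to PROCESS_STATE_FATAL (again, an assumption about the general pattern, not the exact configuration in this PR).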


@rooftopcellist (Member) left a comment


Other than this small nitpick, LGTM.

@@ -17,4 +17,13 @@ set -e

wait-for-migrations

supervisord -c /etc/supervisord.conf
awx-manage provision_instance
Member

I want @jbradberry's eyes on this.

Contributor

At first glance, I think I like it.

Contributor

I never liked --hostname being a mandatory "optional" flag anyway.

Member

The other thing I'm chewing on is that this lives in an entrypoint-style script, but I now realize this pattern is more similar to the development environment:

https://github.com/ansible/awx/blob/devel/tools/docker-compose/bootstrap_development.sh

I still feel like something's suspiciously missing. Isn't this a function of the AWX_AUTO_DEPROVISION_INSTANCES setting? If we register in this script, what are we doing here?

awx/awx/main/managers.py

Lines 172 to 173 in ac6a82e

if settings.IS_K8S:
registered = self.register(ip_address=pod_ip, node_type='control', uuid=settings.SYSTEM_UUID)

That raises another point: if we do register here, we should register as a control node, and with the correct uuid.

Also, what's the reason for moving it here to begin with?

Member

After this change, instances can still get registered here, and I'm unsure if it should be removed (but lean toward removing)

(changed, me) = Instance.objects.get_or_register()

Member

What's the behavior if you run awx-manage provision_instance on a non-K8S environment? Will it give a sensible error?

Member Author

After this change, instances can still get registered here, and I'm unsure if it should be removed (but lean toward removing)

I lean towards keeping it. If for some reason the instance gets deprovisioned while the pod is running, the system would not recover. I am worried about reintroducing the bug that was solved by #4829. I don't think keeping it will hurt anything.

Member Author

What's the behavior if you run awx-manage provision_instance on a non-K8S environment? Will it give a sensible error?

I don't see a reason to throw an error, but I have updated the help text to reflect what will happen.

Member

This fix appears to be very strongly related to #11927

I think this needs more careful thought about expectations. If a missing instance is going to result in an error (as one would expect if registration is done by the entrypoint or installer) then fine, that's not a bad decision. But care needs to be taken to carry that out in a way where other changes won't be fighting it.

@AlanCoding
Member

shanemcd#60

shanemcd and others added 7 commits March 29, 2022 14:07
There was a race condition because the callback receiver tried to run this code:

  File "/awx_devel/awx/main/management/commands/run_callback_receiver.py", line 31, in handle
    CallbackBrokerWorker(),
  File "/awx_devel/awx/main/dispatch/worker/callback.py", line 49, in __init__
    self.subsystem_metrics = s_metrics.Metrics(auto_pipe_execute=False)
  File "/awx_devel/awx/main/analytics/subsystem_metrics.py", line 156, in __init__
    self.instance_name = Instance.objects.me().hostname

This happened before get_or_register had been called by the dispatcher.
I didn't see a reason for the weird inconsistencies here.
dumb-init is more actively maintained, available on PyPI, and already used for both upstream and downstream EEs.
@@ -12,9 +12,10 @@ directory = /awx_devel
{% else %}
command = nginx -g "daemon off;"
{% endif %}
autostart = true
Member

removed because it's the default... just want to get that stated as a sanity check.

autorestart = true
stopwaitsecs = 1
stopsignal=KILL
startsecs = 30
@AlanCoding (Member) commented Mar 29, 2022


Looks like the default for stopwaitsecs is 10. This would be my preference.

The dispatcher with running jobs may not respond well to SIGTERM right now. This is something I packed into #11745, along with a supervisor config change which is almost identical to the change you make here (for stopwaitsecs and stopsignal).

But that's getting pushed out until the next release. If you want, I could look into a targeted change that processes SIGTERM to cancel a running job so it isn't left running orphaned on the execution node (a rough sketch of that idea follows).
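A minimal sketch of that kind of SIGTERM handling, assuming a hypothetical cancel_running_jobs() helper (this is not actual AWX code, just the general shape):

# Illustrative only: cancel in-flight jobs on SIGTERM instead of leaving them
# orphaned on the execution node. cancel_running_jobs() is a hypothetical helper.
import signal
import sys


def cancel_running_jobs():
    # Placeholder: mark any jobs this node is running as canceled so the
    # execution node can stop them cleanly.
    pass


def handle_sigterm(signum, frame):
    cancel_running_jobs()
    sys.exit(0)


signal.signal(signal.SIGTERM, handle_sigterm)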

@AlanCoding
Member

I'd like to make a note to port changes to startsecs (at least) to the traditional installer.

@AlanCoding
Member

This behavior could still use some cleanup.

$ awx-manage provision_instance
Traceback (most recent call last):
  File "/usr/local/bin/awx-manage", line 9, in <module>
    load_entry_point('awx', 'console_scripts', 'awx-manage')()
  File "/awx_devel/awx/__init__.py", line 170, in manage
    execute_from_command_line(sys.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/usr/lib64/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/awx_devel/awx/main/management/commands/provision_instance.py", line 37, in handle
    self._register_hostname(options.get('hostname'), options.get('node_type'), options.get('uuid'))
  File "/awx_devel/awx/main/management/commands/provision_instance.py", line 25, in _register_hostname
    (changed, instance) = Instance.objects.get_or_register()
  File "/awx_devel/awx/main/managers.py", line 182, in get_or_register
    return (False, self.me())
  File "/awx_devel/awx/main/managers.py", line 109, in me
    raise RuntimeError("No instance found with the current cluster host id")
RuntimeError: No instance found with the current cluster host id

I'm tempted to just not run this in non-OCP environments. That seems the "safest" option from a maintainability standpoint.

diff --git a/awx/main/management/commands/provision_instance.py b/awx/main/management/commands/provision_instance.py
index b873cb2fd5..aeae496306 100644
--- a/awx/main/management/commands/provision_instance.py
+++ b/awx/main/management/commands/provision_instance.py
@@ -1,8 +1,9 @@
 # Copyright (c) 2015 Ansible, Inc.
 # All Rights Reserved
 
-from django.core.management.base import BaseCommand
+from django.core.management.base import BaseCommand, CommandError
 from django.db import transaction
+from django.conf import settings
 
 from awx.main.models import Instance
 
@@ -22,6 +23,8 @@ class Command(BaseCommand):
 
     def _register_hostname(self, hostname, node_type, uuid):
         if not hostname:
+            if not settings.AWX_AUTO_DEPROVISION_INSTANCES:
+                raise CommandError('Registering with values from settings only intended for use in K8s installs')
             (changed, instance) = Instance.objects.get_or_register()
         else:
             (changed, instance) = Instance.objects.register(hostname=hostname, node_type=node_type, uuid=uuid)

@AlanCoding
Member

^ That might be better done in follow-up work, because it could cascade into some unused-code deletions.
