Increase resiliency when application crashes #11955
Conversation
Other than this small nitpick, LGTM.
@@ -17,4 +17,13 @@ set -e

wait-for-migrations

supervisord -c /etc/supervisord.conf
awx-manage provision_instance
I want @jbradberry's eyes on this.
At first glance, I think I like it.
I never liked --hostname being a mandatory "optional" flag anyway.
The other thing I'm chewing on is this being in an entry-point kind of script, but I now realize this pattern is more similar to the development environment:
https://github.com/ansible/awx/blob/devel/tools/docker-compose/bootstrap_development.sh
I still feel like something's suspiciously missing. Isn't this a function of the AWX_AUTO_DEPROVISION_INSTANCES setting? If we register in this script, then what is this code still doing here?
Lines 172 to 173 in ac6a82e:

if settings.IS_K8S:
    registered = self.register(ip_address=pod_ip, node_type='control', uuid=settings.SYSTEM_UUID)
That already raises another point: if we do register here, we should register as a control node, and with the correct uuid.
Also, what's the reason for moving it here to begin with?
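(For illustration only: if the entrypoint did register explicitly, the invocation would presumably need to carry the node type and uuid through, something like the hypothetical call below. The --node-type and --uuid flag names are assumed from _register_hostname's signature in the diff further down, and MY_POD_IP stands in for however the pod address gets resolved.)

awx-manage provision_instance --hostname="$MY_POD_IP" --node-type=control --uuid="$SYSTEM_UUID"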
After this change, instances can still get registered here, and I'm unsure if it should be removed (but lean toward removing):
awx/main/dispatch/reaper.py, line 38 in f5c1767:

(changed, me) = Instance.objects.get_or_register()
What's the behavior if you run awx-manage provision_instance on a non-K8S environment? Will it give a sensible error?
After this change, instances can still get registered here, and I'm unsure if it should be removed (but lean toward removing)
I lean towards keeping it. If for some reason the instance gets deprovisioned while the pod is running, the system would not recover. I am worried about reintroducing the bug that was solved by #4829. I don't think keeping it is going to hurt anything.
What's the behavior if you run awx-manage provision_instance on a non-K8S environment? Will it give a sensible error?
I don't see a reason to throw an error, but I have updated the help text to reflect what will happen.
This fix appears to be very strongly related to #11927.
I think this needs more careful thought about expectations. If a missing instance is going to result in an error (as one would expect if registration is done by the entrypoint or installer), then fine; that's not a bad decision. But care needs to be taken to carry that out in a way where other changes won't be fighting it.
There was a race condition: the callback receiver tried to run this code before get_or_register was called by the dispatcher:

File "/awx_devel/awx/main/management/commands/run_callback_receiver.py", line 31, in handle
    CallbackBrokerWorker(),
File "/awx_devel/awx/main/dispatch/worker/callback.py", line 49, in __init__
    self.subsystem_metrics = s_metrics.Metrics(auto_pipe_execute=False)
File "/awx_devel/awx/main/analytics/subsystem_metrics.py", line 156, in __init__
    self.instance_name = Instance.objects.me().hostname
I didn't see a reason for the weird inconsistencies here.
dumb-init is more actively maintained, available on PyPI, and already used for both upstream and downstream EEs.
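(As a rough sketch of what the swap amounts to in the entrypoint, assuming the supervisord invocation from the hunk above; not necessarily the exact change in this PR:)

pip install dumb-init
exec dumb-init -- supervisord -c /etc/supervisord.conf

dumb-init then runs as PID 1, forwarding signals to the child and reaping zombies, which is the minimal init behavior wanted here.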
@@ -12,9 +12,10 @@ directory = /awx_devel
{% else %}
command = nginx -g "daemon off;"
{% endif %}
autostart = true
removed because it's the default... just want to get that stated as a sanity check.
autorestart = true
stopwaitsecs = 1
stopsignal=KILL
startsecs = 30
Looks like the default for stopwaitsecs is 10. This would be my preference.
The dispatcher with running jobs may not respond well to SIGTERM right now. This is something I packed into #11745, along with a supervisor config change which is almost identical to the change you make here (for stopwaitsecs and stopsignal).
But that's getting kicked out until next release. If you want, I could look into a targeted change to just process SIGTERM signals to cancel a job so that it's not left running orphaned on the execution node.
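(For reference, that preference amounts to leaning on supervisor's stock shutdown behavior instead of the KILL/1s combination above; a minimal sketch, with a hypothetical program name:)

[program:awx-dispatcher]
autorestart = true
startsecs = 30
; supervisor defaults, spelled out: send SIGTERM, allow 10s before escalating to SIGKILL
stopsignal = TERM
stopwaitsecs = 10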
I'd like to make a note to port changes to …
This behavior could still use some cleanup.
I'm tempted to just not run this in non-OCP environments. This seems the "safest" from a maintainability mindset.

diff --git a/awx/main/management/commands/provision_instance.py b/awx/main/management/commands/provision_instance.py
index b873cb2fd5..aeae496306 100644
--- a/awx/main/management/commands/provision_instance.py
+++ b/awx/main/management/commands/provision_instance.py
@@ -1,8 +1,9 @@
 # Copyright (c) 2015 Ansible, Inc.
 # All Rights Reserved

-from django.core.management.base import BaseCommand
+from django.core.management.base import BaseCommand, CommandError
 from django.db import transaction
+from django.conf import settings

 from awx.main.models import Instance
@@ -22,6 +23,8 @@ class Command(BaseCommand):
     def _register_hostname(self, hostname, node_type, uuid):
         if not hostname:
+            if not settings.AWX_AUTO_DEPROVISION_INSTANCES:
+                raise CommandError('Registering with values from settings only intended for use in K8s installs')
             (changed, instance) = Instance.objects.get_or_register()
         else:
             (changed, instance) = Instance.objects.register(hostname=hostname, node_type=node_type, uuid=uuid)
^ That might be better done in follow-up work, because it could cascade to some unused code deletions.
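(For what it's worth, with that sketch applied, running the command bare on a non-K8s install should fail along these lines; output approximate:)

$ awx-manage provision_instance
CommandError: Registering with values from settings only intended for use in K8s installs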
SUMMARY
Prior to this change, the control plane pod would remain running if a service crashed.
The commits should be self-explanatory, but at a high level, this PR: runs awx-manage provision_instance from the container entrypoint so the instance is registered before services start, adjusts the supervisor program settings (autorestart, startsecs, stop behavior), and switches the container init to dumb-init.
ISSUE TYPE
COMPONENT NAME