"could not reach dispatcher" errors when canceling jobs #12740
Comments
It seems like this was a problem introduced in #12573. Here is what I've learned so far:
awx/awx/main/dispatch/__init__.py Line 34 in 4fbf5e9
But if you look at
So the first thing I tried was:
diff --git a/awx/main/dispatch/__init__.py b/awx/main/dispatch/__init__.py
index 7fa4bd06f1..970e9bfc09 100644
--- a/awx/main/dispatch/__init__.py
+++ b/awx/main/dispatch/__init__.py
@@ -16,6 +16,7 @@ def get_local_queuename():
 class PubSub(object):
     def __init__(self, conn):
+        assert conn.autocommit, "Connection must be in autocommit mode."
         self.conn = conn
     def listen(self, channel):
@@ -31,9 +32,6 @@ class PubSub(object):
             cur.execute('SELECT pg_notify(%s, %s);', (channel, payload))
     def events(self, select_timeout=5, yield_timeouts=False):
-        if not pg_connection.get_autocommit():
-            raise RuntimeError('Listening for events can only be done in autocommit mode')
-
         while True:
             if select.select([self.conn], [], [], select_timeout) == NOT_READY:
                 if yield_timeouts:
@@ -73,6 +71,8 @@ def pg_bus_conn(new_connection=False):
             raise RuntimeError('Unexpectedly could not connect to postgres for pg_notify actions')
         conn = pg_connection.connection
+    assert conn.autocommit, "Connection must be in autocommit mode."
+
     pubsub = PubSub(conn)
     yield pubsub
     if new_connection:
But that now causes:
I'm starting to think it may be pretty dangerous to reuse the connection here. What are the implications of trying to reuse a connection that is not in autocommit mode? Is it safe to enable autocommit mode on an existing connection? I wouldn't think so. The code that is blowing up is in the task manager, which runs inside a transaction...
It's fine generally; it just happens to conflict with certain things that we do. For task notifiers there's no problem. For task listeners we have reasons that it doesn't work for us:
Absolutely not. We should never do that.
In your diff, because you re-inserted the
Because I understand the purpose of the task manager, I know that it submits tasks, but it never uses this for listening over pg_notify. The only exception might be when it cancels jobs inside of a workflow... even then, it would be most correct to create a new connection and do a catch of any
Otherwise I feel pretty confident that this was a fairly isolated bug and would prefer to move ahead with #12769 as the fix.
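The "create a new connection and catch failures" idea from this comment can be sketched roughly as follows. This is only an illustration, not AWX code: `fresh_connection`, `try_cancel`, `connect`, `send_cancel`, and `FakeConn` are all hypothetical names, and the broad `except Exception` is a placeholder because the thread does not say which exception class should be caught.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def fresh_connection(connect):
    """Open a dedicated connection and always close it afterwards.

    `connect` is a hypothetical zero-argument factory returning an object
    with a close() method; it is not an AWX or psycopg2 function.
    """
    conn = connect()
    try:
        yield conn
    finally:
        conn.close()


def try_cancel(connect, send_cancel):
    """Return True if the cancel message went out, False on any failure.

    `send_cancel` is a hypothetical callable that takes the connection.
    Exception is a broad placeholder for whatever error should be caught.
    """
    try:
        with fresh_connection(connect) as conn:
            send_cancel(conn)
        return True
    except Exception:
        logger.exception("could not send cancel over a fresh connection")
        return False


class FakeConn:
    """In-memory stand-in so the sketch runs without a database."""

    def __init__(self):
        self.closed = False
        self.sent = []

    def close(self):
        self.closed = True


opened = []


def connect():
    # Record every connection so the caller can inspect it later.
    conn = FakeConn()
    opened.append(conn)
    return conn


ok = try_cancel(connect, lambda conn: conn.sent.append("cancel"))
```

Because the connection is created only for this one action, it never shares transaction state with the caller, and a failure is logged instead of propagating into the task manager.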
This seems to conflict with what it says here: https://www.psycopg.org/docs/advanced.html?highlight=autocommit
"Because of the way sessions interact with notifications (see NOTIFY documentation), you should keep the connection in autocommit mode if you wish to receive or send notifications in a timely manner."
Will pull your new code and play around with it.
Acknowledged about the warning, and I'm ok with this, because our prior solution to this problem was to put the NOTIFY actions in Django on_commit methods, which was less timely than doing it in the transaction. For a demonstration case, say the task manager takes 1 minute to run and a job is dispatched at the 30-second mark. It would then take 30 seconds before the NOTIFY is received by another client. That fails the doc's definition of a "timely manner", but it is what we want.
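The timing in that demonstration case can be reproduced with a toy model. This is a sketch under stated assumptions, not PostgreSQL or psycopg2 code: `ToyTransactionalBus` and `delivery_delays` are hypothetical names, and the class only mimics the documented behavior that a NOTIFY issued inside a transaction is delivered when the transaction commits.

```python
class ToyTransactionalBus:
    """Toy stand-in for a connection that queues NOTIFY until commit.

    Not a PostgreSQL or psycopg2 API; it only models the timing.
    """

    def __init__(self):
        self.pending = []    # (sent_at, payload) queued in the open transaction
        self.delivered = []  # (sent_at, delivered_at, payload) seen by listeners

    def notify(self, payload, now):
        # Inside a transaction the notification is only queued.
        self.pending.append((now, payload))

    def commit(self, now):
        # On commit, every queued notification is delivered at once.
        for sent_at, payload in self.pending:
            self.delivered.append((sent_at, now, payload))
        self.pending.clear()


def delivery_delays(bus):
    """Return the seconds between notify() and delivery for each message."""
    return [delivered_at - sent_at for sent_at, delivered_at, _ in bus.delivered]


bus = ToyTransactionalBus()
bus.notify("job dispatched", now=30)  # task manager is 30 s into its run
visible_before_commit = list(bus.delivered)
bus.commit(now=60)                    # task manager finishes after 60 s
print(delivery_delays(bus))           # [30]
```

Nothing is visible to listeners before the commit, and the one message arrives 30 seconds after it was sent, which matches the numbers in the comment.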
Linking fallout from the fix for this: #13017. Not a huge deal, but it needs some tweaking.
Please confirm the following
Bug Summary
I noticed this in my logs when canceling jobs:
I was seeing this immediately, not after 5 seconds.
I changed this from error to exception:
awx/awx/main/models/unified_jobs.py Line 1410 in 017e474
Now I see:
AWX version
21.5.1.dev1+gcbef8717a8.d20220825
Select the relevant components
Installation method
N/A
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Try canceling a job
Expected results
No errors
Actual results
An error
Additional information
No response