
Rework job canceling to avoid polling the database #11464

Closed
wants to merge 3 commits from the cancel_rework branch

Conversation

AlanCoding
Member

SUMMARY

This changes the way we inform a running dispatcher process that a job was canceled.

  • Before: we set the job's cancel_flag to True in the database, and the job process polls that flag once every second.
  • Proposed: a cancel task is submitted to and received by the parent dispatcher process on the node running the job, which then sends SIGTERM to the job's worker process. The worker still polls in a separate thread to see whether the signal arrived (otherwise it would block on the read). A rough sketch of this flow follows below.

The cancel_flag is still kept, because the job could be canceled while in the "waiting" status, before it reaches the node that runs it. Before the job starts, both the database flag and the SIGTERM flag are checked; after that, it is safe to check only the SIGTERM flag.

If a user wishes, the cancel task can be resubmitted multiple times, but this should not be necessary.
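As a rough illustration of the proposed flow (the function and registry names below are hypothetical, not the actual AWX dispatcher API): the control side submits a cancel message to the dispatcher queue for the node running the job, and the parent dispatcher process looks up the worker PID and delivers SIGTERM.

import os
import signal

def handle_cancel_message(running_jobs, job_id):
    # Hypothetical dispatcher-side handler: running_jobs is assumed to map
    # job id -> worker PID for jobs currently executing on this node.
    pid = running_jobs.get(job_id)
    if pid is None:
        # The job has not reached this node yet; the DB cancel_flag covers that case.
        return
    os.kill(pid, signal.SIGTERM)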

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
AWX VERSION
19.5.0
ADDITIONAL INFORMATION

I'm marking this as a draft because I may still attempt the one other major change that was considered.

Instead of running the receptorctl work cancel <unit-id> command at the end of the AWXReceptorJob class, we could do that in an independent task, similar to cancel_unified_job. However, I'm worried about the case where the dispatcher has a full queue. In the current state, issuing the SIGTERM signal jumps the line: it happens before the task lines up in the multiprocessing queue (which is good). Ideally we would first cancel the receptor job and then send the SIGTERM, but if the dispatcher is overloaded that might end up in a deadlock. Maybe we could have a special condition to handle that, but I don't love that idea.
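For reference, a minimal sketch of issuing the receptor cancel from Python; the wrapper function and error handling are illustrative only, while the CLI invocation is the one quoted above.

import subprocess

def cancel_receptor_work_unit(unit_id):
    # Illustrative wrapper around the receptorctl command quoted above.
    subprocess.run(
        ['receptorctl', 'work', 'cancel', unit_id],
        check=True,
        capture_output=True,
    )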

def __init__(self):
    self.sigterm_flag = False
    # Register a handler for each signal of interest; set_flag flips sigterm_flag
    for s in self.SIGNALS:
        signal.signal(s, self.set_flag)
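For context, a self-contained sketch of what such a signal watcher might look like; the class name, the SIGNALS tuple, and cancel_callback are assumptions for illustration, not code lifted from this diff.

import signal

class SigtermWatcher:
    # Illustrative sketch: record receipt of termination signals in a flag that
    # the running job can poll cheaply instead of querying the database.

    SIGNALS = (signal.SIGTERM, signal.SIGINT)  # assumed set of signals

    def __init__(self):
        self.sigterm_flag = False
        for s in self.SIGNALS:
            signal.signal(s, self.set_flag)

    def set_flag(self, signum, frame):
        # Signal handlers should do as little as possible; just set the flag.
        self.sigterm_flag = True

    def cancel_callback(self):
        # Polled periodically by the running job to decide whether to stop.
        return self.sigterm_flag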
Member Author

The only thing I might still be a little worried about is that I don't restore the original signal handlers here. I can do that if anyone thinks I should. The only slightly tricky part is that I have to track which job "owns" this, since the watcher is passed to local dependencies, like project syncs.
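If restoring the previous handlers turns out to be desirable, the standard pattern is to keep what signal.signal returns and put it back once the owning job finishes. A sketch, building on the illustrative watcher class above (not part of this PR):

class RestoringSigtermWatcher(SigtermWatcher):
    # Variant that remembers the prior handlers so they can be restored later.

    def __init__(self):
        self.sigterm_flag = False
        # signal.signal returns the previously installed handler for each signal.
        self.original_handlers = {
            s: signal.signal(s, self.set_flag) for s in self.SIGNALS
        }

    def restore(self):
        # Called once the job that owns the watcher has finished.
        for s, handler in self.original_handlers.items():
            signal.signal(s, handler)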

@AlanCoding marked this pull request as ready for review on December 16, 2021 15:00
@AlanCoding force-pushed the cancel_rework branch 3 times, most recently from 69e604f to c61fa4f on January 26, 2022 22:04
If a cancel was attempted before, still allow attempting another cancel;
in this case, attempt to send the SIGTERM signal again.
Keep clicking, you might help!

Use queue name to cancel task call

Replace other cancel_callbacks with sigterm watcher
  adapt special inventory mechanism for this too

Pass watcher to any dependent local tasks
@AlanCoding
Member Author

Digging into some old issues, I've come upon some new information.

If a job is canceled via receptorctl work cancel, it seems to somehow obtain the correct status and complete correctly. I do not entirely know how this is happening, but it's an exciting development. A local PoC seems to verify that we can rely on receptor for this and completely eliminate the threading used for processing the results.

That is tremendously appealing, so I am abandoning the approach here in the hope that the new implementation works out.
