Rework job canceling to avoid polling the database #11464
Conversation
Force-pushed a716d81 to 6116a97
awx/main/tasks.py (Outdated)

    def __init__(self):
        self.sigterm_flag = False
        for s in self.SIGNALS:
            signal.signal(s, self.set_flag)
The only thing I might still be a little worried about is the fact that I don't set the original signal handlers back here. I can do that if anyone thinks I should. The only slightly tricky part is that I have to track which job "owns" this, as it is passed to the local dependencies, like project syncs.
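For what it's worth, restoring the previous handlers is straightforward because signal.signal returns the handler it replaces. A minimal sketch of that idea (the class and method names here are illustrative stand-ins, not the PR's actual code):

```python
import signal


class SigtermWatcher:
    """Sets a flag on SIGTERM/SIGINT and can restore the original handlers.

    Illustrative sketch only; names do not match the PR's code.
    """

    SIGNALS = (signal.SIGTERM, signal.SIGINT)

    def __init__(self):
        self.sigterm_flag = False
        # signal.signal returns the previously installed handler, so we
        # save each one for later restoration.
        self._original = {s: signal.signal(s, self.set_flag) for s in self.SIGNALS}

    def set_flag(self, signum, frame):
        self.sigterm_flag = True

    def restore(self):
        # Put back whatever handlers were installed before __init__ ran.
        for s, handler in self._original.items():
            signal.signal(s, handler)
```

The job that "owns" the watcher could call restore() in a finally block once it and any dependent local tasks have finished.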
Force-pushed 6116a97 to aea4c05
Force-pushed 97d949b to 04a6d06
Force-pushed 69e604f to c61fa4f
- If cancel was attempted before, still allow attempting another cancel; in this case, attempt to send the SIGTERM signal again. Keep clicking, you might help!
- Use queue name to cancel task call
- Replace other cancel_callbacks with sigterm watcher; adapt special inventory mechanism for this too
- Pass watcher to any dependent local tasks
Force-pushed c61fa4f to c9dff2e
Digging into some old issues, I've come upon some new information. If a job is canceled via That is tremendously appealing, and I abandon the approach here in hopes of the new implementation working.
SUMMARY
This changes the way we inform a running dispatcher process that a job was canceled. The old behavior sets canceled_flag to True, and the process polls this once every second. The cancel_flag is still kept here, because the job could still be canceled while in the "waiting" status, before it makes it to the node which runs it. Before the job starts, both the DB flag and the sigterm flag are checked. After that, the logic goes, it's safe to just check the sigterm flag. If a user wishes to, the task can be resubmitted multiple times, but this shouldn't be necessary.
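The dual check described above could look something like this sketch (should_cancel, cancel_flag as an attribute, and the watcher object are hypothetical stand-ins, not the PR's exact code):

```python
from types import SimpleNamespace


def should_cancel(job, watcher, started):
    """Hypothetical sketch of the dual check described above.

    Before the job starts it may still be canceled while "waiting", so
    both the DB flag and the in-process sigterm flag are consulted;
    once the job is running, the sigterm flag alone is enough.
    """
    if not started:
        return bool(job.cancel_flag or watcher.sigterm_flag)
    return bool(watcher.sigterm_flag)


# Example with stand-in objects instead of real model/watcher instances:
job = SimpleNamespace(cancel_flag=True)
watcher = SimpleNamespace(sigterm_flag=False)
print(should_cancel(job, watcher, started=False))  # the DB flag alone suffices pre-start
```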
ISSUE TYPE
COMPONENT NAME
AWX VERSION
ADDITIONAL INFORMATION
I'm marking this as draft because I may still attempt the one other major change that was considered. Instead of running the receptorctl work cancel <unit-id> command at the end of the AWXReceptorJob class, we could do that in an independent task similar to cancel_unified_job. However, I'm worried about when the dispatcher has a full queue. In the current state, issuing the SIGTERM signal jumps the line: it happens before the task lines up in the multiprocessing queue (which is good). Ideally we would first cancel the receptor job and then send the SIGTERM, but if the dispatcher is overloaded it might end up in a deadlock. Maybe we could have a special condition to handle that, but I don't love that idea.
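To illustrate the "jumps the line" point: a cancel dispatched as an ordinary task sits behind everything already queued, while a signal is delivered out of band. A self-contained toy demo (single process, with a plain queue.Queue standing in for the dispatcher's multiprocessing queue):

```python
import os
import signal
from queue import Queue

sigterm_seen = []
signal.signal(signal.SIGTERM, lambda signum, frame: sigterm_seen.append(signum))

backlog = Queue()
for i in range(3):
    backlog.put(f"job-{i}")
backlog.put("cancel-task")  # a cancel submitted as a task waits behind the backlog

os.kill(os.getpid(), signal.SIGTERM)  # a signal, by contrast, is handled right away

processed = []
while not backlog.empty():
    processed.append(backlog.get())

print(sigterm_seen)   # the SIGTERM was seen before any queued work drained
print(processed[-1])  # the queued cancel is processed last: "cancel-task"
```

In a loaded dispatcher, a cancel submitted as a task would similarly wait behind the backlog, which is the deadlock concern described above.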