
Rework job canceling to avoid polling the database #11464

Closed
wants to merge 3 commits from the cancel_rework branch

Conversation

AlanCoding
Member

SUMMARY

This changes the way we inform a running dispatcher process that a job was canceled.

  • Before: we set the job's cancel_flag to True in the database, and the job process polls that flag once every second.
  • Proposed: a cancel task is submitted to and received by the parent dispatcher process on the node running the job, which then sends SIGTERM to the job's worker process. The worker still polls in a separate thread to see whether the signal arrived (otherwise it would block on the read). A rough sketch of this flow follows below.

The cancel_flag is still kept, because the job could be canceled while in the "waiting" status, before it reaches the node that runs it. Before the job starts, both the database flag and the SIGTERM flag are checked; after that, it is safe to check only the SIGTERM flag.

If a user wishes, the cancel task can be resubmitted multiple times, but this should not be necessary.
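As a rough illustration of the proposed flow (the function and registry names below are hypothetical, not the actual AWX dispatcher API): the control side submits a cancel message to the dispatcher queue for the node running the job, and the parent dispatcher process looks up the worker PID and delivers SIGTERM.

import os
import signal

def handle_cancel_message(running_jobs, job_id):
    # Hypothetical dispatcher-side handler: running_jobs is assumed to map
    # job id -> worker PID for jobs currently executing on this node.
    pid = running_jobs.get(job_id)
    if pid is None:
        # The job has not reached this node yet; the DB cancel_flag covers that case.
        return
    os.kill(pid, signal.SIGTERM)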

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
AWX VERSION
19.5.0
ADDITIONAL INFORMATION

I'm marking this as a draft because I may still attempt the one other major change that was considered.

Instead of running the receptorctl work cancel <unit-id> command at the end of the AWXReceptorJob class, we could do that in an independent task, similar to cancel_unified_job. However, I'm worried about the case where the dispatcher has a full queue. In the current state, issuing the SIGTERM signal jumps the line: it happens before the task lines up in the multiprocessing queue (which is good). Ideally we would first cancel the receptor job and then send the SIGTERM, but if the dispatcher is overloaded that might end up in a deadlock. Maybe we could have a special condition to handle that, but I don't love that idea.
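For reference, a minimal sketch of issuing the receptor cancel from Python; the wrapper function and error handling are illustrative only, while the CLI invocation is the one quoted above.

import subprocess

def cancel_receptor_work_unit(unit_id):
    # Illustrative wrapper around the receptorctl command quoted above.
    subprocess.run(
        ['receptorctl', 'work', 'cancel', unit_id],
        check=True,
        capture_output=True,
    )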

def __init__(self):
    self.sigterm_flag = False
    # Register a handler for each signal of interest; set_flag flips sigterm_flag
    for s in self.SIGNALS:
        signal.signal(s, self.set_flag)
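For context, a self-contained sketch of what such a signal watcher might look like; the class name, the SIGNALS tuple, and cancel_callback are assumptions for illustration, not code lifted from this diff.

import signal

class SigtermWatcher:
    # Illustrative sketch: record receipt of termination signals in a flag that
    # the running job can poll cheaply instead of querying the database.

    SIGNALS = (signal.SIGTERM, signal.SIGINT)  # assumed set of signals

    def __init__(self):
        self.sigterm_flag = False
        for s in self.SIGNALS:
            signal.signal(s, self.set_flag)

    def set_flag(self, signum, frame):
        # Signal handlers should do as little as possible; just set the flag.
        self.sigterm_flag = True

    def cancel_callback(self):
        # Polled periodically by the running job to decide whether to stop.
        return self.sigterm_flag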
Member Author

The only thing I might still be a little worried about is that I don't restore the original signal handlers here. I can do that if anyone thinks I should. The only slightly tricky part is that I have to track which job "owns" this, since the watcher is passed to local dependencies, like project syncs.
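If restoring the previous handlers turns out to be desirable, the standard pattern is to keep what signal.signal returns and put it back once the owning job finishes. A sketch, building on the illustrative watcher class above (not part of this PR):

class RestoringSigtermWatcher(SigtermWatcher):
    # Variant that remembers the prior handlers so they can be restored later.

    def __init__(self):
        self.sigterm_flag = False
        # signal.signal returns the previously installed handler for each signal.
        self.original_handlers = {
            s: signal.signal(s, self.set_flag) for s in self.SIGNALS
        }

    def restore(self):
        # Called once the job that owns the watcher has finished.
        for s, handler in self.original_handlers.items():
            signal.signal(s, handler)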

@AlanCoding marked this pull request as ready for review on December 16, 2021 15:00
@AlanCoding force-pushed the cancel_rework branch 3 times, most recently from 69e604f to c61fa4f on January 26, 2022 22:04
If a cancel was attempted before, still allow attempting another cancel;
in this case, attempt to send the SIGTERM signal again.
Keep clicking, you might help!

Use queue name to cancel task call

Replace other cancel_callbacks with sigterm watcher
  adapt special inventory mechanism for this too

Pass watcher to any dependent local tasks
@AlanCoding
Member Author

Digging into some old issues, I've come upon some new information.

If a job is canceled via receptorctl work cancel, it seems to somehow obtain the correct status and complete correctly. I do not entirely know how this is happening, but it's an exciting development. A local PoC seems to verify that we can rely on receptor for this and completely eliminate the threading used for processing the results.

That is tremendously appealing, so I am abandoning the approach here in the hope that the new implementation works out.
