Fix job canceling when getting stuck reading or writing receptor socket #12653
Conversation
In that case, was it confirmed that the job process was really running? It would be good to get strace information to find out what specifically the control process is hanging on. I agree with your assessments, and this adds good general defense at code points that might result in hanging, but there are other scenarios for jobs stuck in running.
try:
    ansible_runner.interface.run(streamer='transmit', _output=_socket.makefile('wb'), **self.runner_params)
finally:
    # Socket must be shutdown here, or the reader will hang forever.
    _socket.shutdown(socket.SHUT_WR)
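For context, a minimal standalone sketch (not AWX code) of why that shutdown matters: the peer's read only sees EOF once the write side is shut down; until then it blocks indefinitely.

import socket

a, b = socket.socketpair()
a.sendall(b"payload")
a.shutdown(socket.SHUT_WR)   # signal EOF to the reader

chunks = []
while True:
    data = b.recv(4096)      # without the shutdown above, the final recv() would block forever
    if not data:             # b'' means the peer shut down its write side
        break
    chunks.append(data)
print(b"".join(chunks))      # b'payload'
a.close()
b.close()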
@cleanup_new_process
def submitter(self, payload_reader):
    # Prepare the submit_work kwargs before creating threads, because references to settings are not thread-safe
The comment doesn't work, because you are already in a thread at this point.
try:
    receptor_ctl.simple_command(f"work release {self.unit_id}")
    self.receptor_ctl.simple_command(f"work release {unit_id}")
If the concept is that, generally, receptor commands are not safe from hangs, then I expect that would apply to receptorctl work release as well. I believe I've had it hang here before.
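One hedged way to guard such a call, sketched below. This is not the PR's actual implementation; the helper name and the timeout value are assumptions, and the commented-out usage line borrows receptor_ctl and unit_id from the diff above.

import concurrent.futures

def run_with_timeout(func, *args, timeout=30):
    # Run a potentially hanging call in a worker thread and stop waiting after `timeout`.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return None  # the call may still be stuck in the worker; we just stop waiting on it
    finally:
        pool.shutdown(wait=False)  # do not block on a possibly hung worker thread

# e.g. run_with_timeout(receptor_ctl.simple_command, f"work release {unit_id}")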
if res and res.status == "canceled":
    return res, unit_id

resultsock, resultfile = work_results_future.result()
All work_results does is return the socket. It seems somewhat extraordinary that it would be hanging.

It does read from the general communication socket to receptor, so it absolutely could hang.
payload_reader_file.close()
self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR)
self.receptor_ctl._socket.close()
self.receptor_ctl._sockfile.close()
Somewhat of a tangent - I have often thought it would be better to call self.receptor_ctl.close() to replace the 3 lines above here. It's written defensively enough, and would be more clear / DRY.
self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR) is not called in receptorctl's close, so we could keep that line. But the other 2 lines could be replaced.
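Roughly, that suggestion would look like the sketch below. It assumes receptorctl's close() takes care of the socket and its file object, as the comments above describe; this is not code from the PR.

# Keep the shutdown, since receptorctl's close() does not do it for us.
self.receptor_ctl._socket.shutdown(socket.SHUT_RDWR)
# Replaces the explicit _socket.close() and _sockfile.close() calls.
self.receptor_ctl.close()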
awx/main/tasks/receptor.py
Outdated
# This ThreadPoolExecutor runs for the duration of the job.
# The cancel_func pattern is intended to guard against any situation where we may
# end up stuck while reading or writing from the receptor socket.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
I checked out your branch and graphed database connections.

This is for running 5 jobs. The increase in database connections is 30 - 10 = 20, and 20 / 5 = 4 threads per job, as expected. This is because max_workers=3 doesn't include the main thread, so there are 4 threads in total, and without connection pooling between threads, a new connection is created for each. Currently it's 3 per job, and particularly if we're backporting, increasing this number by 1 will be disastrous for some large (or poorly balanced) deployments.

Really, it should be either 1 or 0. But for now, I'm going to pull out some of my tricks and suggest ways to mitigate what you're doing here, because watching for cancels over the transmit phase is something I know we were lacking.
Linking shanemcd#75. I may still have more changes I want to see after this. At this stage I might also try to run some tests.
Use cancel_watcher as method, not thread
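A rough illustration of that commit's idea, with assumed names (not the PR's code): instead of dedicating a watcher thread, and with it another database connection, the cancel check runs inside the loop that is already waiting on the receptor future.

import concurrent.futures

class JobCanceled(Exception):
    """Illustrative exception type, not part of the PR."""

def wait_for_result(future, cancel_watcher, poll_interval=1.0):
    # Re-check cancellation on every poll instead of running a dedicated watcher thread.
    while True:
        try:
            return future.result(timeout=poll_interval)
        except concurrent.futures.TimeoutError:
            if cancel_watcher():
                raise JobCanceled("canceled while waiting on the receptor socket")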
SUMMARY
On a recent customer case they reported not being able to cancel a running job. After looking at the code, there were 2 additional places (submit_work and get_work_results) where it is possible to get stuck on a read or write with the receptor socket. This patch introduces a new pattern that will allow us to eject whenever we might be stuck.

ISSUE TYPE
COMPONENT NAME