Close database connections while processing job output #11745
Conversation
Force-pushed from 45e3817 to 537f3c3
Force-pushed from 45ed4b7 to c4040dc
This looks good to me... what's the hold up?
Through doing #12435, I've learned more, and I should apply those lessons correctly before reconsidering this. Really quick runthrough:
- The signal callback method will only be run in the main process. This means the code in the
- If you set the callback method to raise an exception, that may or may not be respected depending on what the code is doing. File reads will ignore it, which means that we can't 100% get rid of reads in the processing step. This is something I had already discovered the last time I was playing with this. However, the line
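To make the first two points above concrete, here is a minimal, self-contained sketch of a SIGTERM callback that raises an exception; the exception class and loop body are illustrative, not the actual AWX code:

```python
import signal
import time

class CancelSignalReceived(Exception):
    """Hypothetical exception; stands in for whatever the real callback raises."""

def sigterm_handler(signum, frame):
    # As noted above, the callback only runs in the main process; worker
    # subprocesses need their own handling.
    raise CancelSignalReceived()

signal.signal(signal.SIGTERM, sigterm_handler)

try:
    while True:
        # Pure-Python work is interrupted at the next bytecode boundary; whether
        # a blocking read honors the exception depends on what that code is
        # doing, which is the caveat described above.
        time.sleep(1)
except CancelSignalReceived:
    print("received SIGTERM, shutting down cleanly")
```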
Force-pushed from e636da9 to 7018416
awx/main/tasks/system.py
Outdated
logger.info(f'Job id {unified_job_id} has been deleted, cancel aborted')
return
time.sleep(1)
logger.info(f'{unified_job.log_format} stopped before obtaining a task id, sigterm not needed')
I'm increasingly thinking this method cancel_unified_job shouldn't exist. The race condition it exists to address (pending/waiting jobs) shouldn't actually be a problem, as the cancel_flag can be relied on as the signaling method in the cases that matter.
There are much narrower race conditions to worry about, like when the job is "running" but has not yet been submitted to receptor. Again, this would be the wrong place to handle that. After we set raise_exception, we should check whether the flag has been set, and treat it as a cancel if it has. By itself, that should be a complete solution to the potential race conditions.
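A minimal sketch of that flag re-check, assuming a Django model with the cancel_flag field; the function and exception names are illustrative, not the actual implementation:

```python
class JobCanceledError(Exception):
    """Hypothetical exception; stands in for whatever the SIGTERM handler raises."""

def check_cancel_flag(unified_job):
    # Once the signal handler is installed, re-read only the flag from the
    # database; a cancel requested during the pending/waiting window is then
    # treated exactly like a delivered signal.
    unified_job.refresh_from_db(fields=['cancel_flag'])
    if unified_job.cancel_flag:
        raise JobCanceledError()
```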
I couldn't make myself happy with how I was breaking with existing code patterns in the dispatcher; now I think I can be happy with this.

The gist of the last change is that I stopped using a pseudo-task to tell the dispatcher parent process to SIGTERM a task's worker. This is replaced by the dispatcher control-and-reply mechanism. We were already using this mechanism in the

This gets us to the point of getting an acknowledgement that the quit signal was sent to the job's control process within the request-response cycle of doing a POST to the /cancel/ endpoint. After that, we still wait for the signal processing of the job to close sockets, do a receptor cancel, and save any half-way progress the job has made.
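For reference, a hedged sketch of how the control-and-reply cancel is invoked; the constructor arguments and .cancel() call mirror the code change in this PR, while the import alias and wrapper function are assumptions for illustration:

```python
import socket

# Assumed import alias; the real module lives in awx/main/dispatch/control.py.
from awx.main.dispatch.control import Control as ControlDispatcher

def request_cancel(controller_node, task_id):
    # Sends a 'cancel' control message to the dispatcher running on
    # controller_node and waits for an acknowledgement reply.
    try:
        return ControlDispatcher('dispatcher', controller_node).cancel([task_id])
    except socket.timeout:
        # No acknowledgement within the dispatcher's reply timeout.
        return False
```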
Force-pushed from 5bc6146 to 3143449
Force-pushed from ffd5188 to 9ebd7a7
At last I've gotten around to putting up #12586. See the linked PR for a reference.
looking good now?
#12653 now creates very heavy conflicts with this. It's not solving the same problem, but the same code structures definitely evolve in two different directions in these two PRs, so we need some discussion around that.
Force-pushed from 4af0973 to a4ea8fa
I had to squash commits again in order to rebase, because some intermediate commits had conflicts with devel.
Force-pushed from a4ea8fa to 224cf00
Okay, I went ahead and tested this PR and it looks really good. Following are the results:
As you can see, we had about 9 idle connections with query COMMIT and with long age (20 mins and counting).
Now we have no idle connections with query COMMIT and with long age. This solves loads of problems with connections. We have seen in the past that you can't run more than a certain number of jobs on the controller because we run out of DB connections. We have also seen that those idle connections consume resources (memory and CPU) and can cause issues. This PR addresses both of these issues. I think we should let this go through the formal testing and merge it.
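The exact query used for these results isn't shown; a hedged sketch of one way to reproduce the same observation against pg_stat_activity, with a placeholder DSN:

```python
import psycopg2

# Placeholder connection string; adjust to the actual AWX database settings.
conn = psycopg2.connect("dbname=awx user=awx host=localhost")
with conn.cursor() as cur:
    # List long-lived idle connections whose last statement was COMMIT,
    # the same pattern described in the results above.
    cur.execute(
        """
        SELECT pid, now() - state_change AS idle_age, query
        FROM pg_stat_activity
        WHERE state = 'idle' AND query = 'COMMIT'
        ORDER BY idle_age DESC;
        """
    )
    for pid, idle_age, query in cur.fetchall():
        print(pid, idle_age, query)
conn.close()
```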
Currently playing around with this PR and am seeing this in my logs when trying to cancel jobs:
awx/main/models/unified_jobs.py
Outdated
# Use control and reply mechanism to cancel and obtain confirmation
timeout = 5
canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
except (socket.timeout, RuntimeError):
It's pretty dangerous to include both of these exception types here. We should catch RuntimeError separately and use logger.exception. Right now this is eating:
tools_awx_1 | 2022-08-25 16:12:37,974 ERROR [bf11a9d4] awx.main.models.unified_jobs could not reach dispatcher on awx_1 within 5s
tools_awx_1 | Traceback (most recent call last):
tools_awx_1 | File "/awx_devel/./awx/main/models/unified_jobs.py", line 1420, in cancel_dispatcher_process
tools_awx_1 | canceled = ControlDispatcher('dispatcher', self.controller_node).cancel([self.celery_task_id])
tools_awx_1 | File "/awx_devel/./awx/main/dispatch/control.py", line 41, in cancel
tools_awx_1 | return self.control_with_reply('cancel', *args, extra_data={'task_ids': task_ids}, **kwargs)
tools_awx_1 | File "/awx_devel/./awx/main/dispatch/control.py", line 59, in control_with_reply
tools_awx_1 | for reply in conn.events(select_timeout=timeout, yield_timeouts=True):
tools_awx_1 | File "/awx_devel/./awx/main/dispatch/__init__.py", line 35, in events
tools_awx_1 | raise RuntimeError('Listening for events can only be done in autocommit mode')
tools_awx_1 | RuntimeError: Listening for events can only be done in autocommit mode
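A minimal sketch of the separate handling suggested above; the wrapper function is illustrative and assumes a ControlDispatcher-like object as in the diff:

```python
import logging
import socket

logger = logging.getLogger(__name__)

def send_cancel(control, task_id):
    # control is assumed to behave like ControlDispatcher from the diff above.
    try:
        return control.cancel([task_id])
    except socket.timeout:
        # Expected when the dispatcher does not reply in time; no traceback needed.
        logger.warning('could not reach dispatcher to cancel task %s', task_id)
    except RuntimeError:
        # Unexpected; logger.exception records the full traceback instead of eating it.
        logger.exception('error while sending cancel for task %s', task_id)
    return False
```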
Never mind. This is also happening on devel. Yikes.
yep, I'm trying to pick up the error handling changes faithfully with a rebase, and will push in a second.
Force-pushed from c26139d to bc257e7
- If canceled attempted before, still allow attempting another cancel; in this case, attempt to send the sigterm signal again. Keep clicking, you might help!
- Replace other cancel_callbacks with sigterm watcher, adapt special inventory mechanism for this too
- Get rid of the cancel_watcher method with exception in main thread
- Handle academic case of sigterm race condition
- Process cancelation as control signal
- Fully connect cancel method and run_dispatcher to control
- Never transition workflows directly to canceled, add logs
Force-pushed from bc257e7 to c59bbde
I'm now working to test AlanCoding#48, which is an evolution of #12653 rebased on top of these changes. Much of those changes are non-trivial.
I wanted to be extra sure that merging this doesn't mess anything up, so let me explain some of the testing and where this led me.

control-standalone
This scenario has been very stable on this branch, and fixes 500 errors from some scenarios that run it up to near capacity. I don't know for sure what caused us to get closer to the connection limit, but I suspect it could be the change that added 7 additional max workers.

hybrid-cluster
This looks considerably better than the reference runs, because I think the reference runs have similar 500 errors.

control-hop-to-execution-nodes cluster
This branch had 1 more failure than the reference run in a particularly concerning place. This was in a workflow that was canceled, and expected the jobs in it to be marked as canceled. I dove into this bug, and found that the problem was that canceling a job will not work as expected if the project sync is running, because the signal gets read by the project sync as a SIGTERM, but from its perspective this contradicts the

Because of that, I pushed an additional commit which I found fixes this. I developed a minimal test case to hit this issue, but I did find that even

With the additional commit, I feel safe about merging now.
SUMMARY
Replaces #11464
Most of the change here (taken from that prior PR) is to replace the cancel callback method so that it no longer needs to poll the database. Instead, it processes a SIGTERM signal to know when to close.
This closes database connections before we start the ansible-runner process work to consume output from a remote job. During processing, we need to pull from receptor and push to redis. After the job is finished, it will re-connect to the database to update status and trigger other actions.

ISSUE TYPE
COMPONENT NAME
ADDITIONAL INFORMATION
The challenging part of this is ordering of actions, race conditions, when to close stuff, and that sort of thing. This involved a lot of experimentation.
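To illustrate the ordering described here and in the SUMMARY, a hedged sketch of the close-then-process-then-reconnect flow; the helper functions are hypothetical placeholders, not the actual AWX task code:

```python
from django.db import connections

def stream_output_from_receptor_to_redis(job_id):
    # Hypothetical placeholder for the long, database-free phase that pulls job
    # events from receptor and pushes them to redis.
    pass

def update_job_status(job_id):
    # Hypothetical placeholder; the first ORM query here transparently opens a
    # fresh database connection again.
    pass

def process_job_output(job_id):
    # Close every Django DB connection before the long output-processing phase,
    # so this worker holds no idle connection while it streams output.
    connections.close_all()
    try:
        stream_output_from_receptor_to_redis(job_id)
    finally:
        update_job_status(job_id)
```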
An important detail is that when you do a supervisorctl service restart, it cancels jobs by this mechanism, because supervisor itself sends a SIGTERM signal to the process group.
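A hedged illustration of what a process-group SIGTERM amounts to; this is not supervisor's code, just the standard-library calls that produce the same effect, so every process in the group (dispatcher parent and job workers alike) receives the signal and cancels through the same path:

```python
import os
import signal

def terminate_process_group(pid):
    # Deliver SIGTERM to the whole process group that pid belongs to.
    os.killpg(os.getpgid(pid), signal.SIGTERM)
```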