Race condition that can make jobs have "failed" status due to canceling it #13017

AlanCoding · 2022-10-07T19:47:16Z

Please confirm the following

I agree to follow this project's code of conduct.
I have checked the current issues for duplicates.
I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

In some conditions, jobs will incorrectly have the following message applied to them:

Task was canceled due to receiving a shutdown signal.

After the user does a POST to /api/v2/jobs/N/cancel/.

this appears to be fallout from recent development, where we landed fixes for some other things, and this was fallout from those fixes.

AWX version

devel

Select the relevant components

Installation method

openshift

Modifications

no

Ansible version

devel

Operating system

N/A

Web browser

Chrome

Steps to reproduce

Right now I don't have concrete steps to reproduce, other than to use this hack:

diff --git a/awx/main/models/unified_jobs.py b/awx/main/models/unified_jobs.py
index a8ac64b2cc..35f63fe9ca 100644
--- a/awx/main/models/unified_jobs.py
+++ b/awx/main/models/unified_jobs.py
@@ -1481,8 +1481,12 @@ class UnifiedJob(
                     self.status = 'canceled'
                     cancel_fields.append('status')
 
+            import time
+            time.sleep(10)
+
             self.save(update_fields=cancel_fields)
 
+
         return self.cancel_flag
 
     @property

However, we have reports that this bug has been seen in real life in OCP deployments. Ping @CFSNM , the reporter.

Expected results

A user cancel should always yield a simple "canceled" status.

Actual results

Additional information

To describe what happens...

In the request, a message is sent to the dispatcher to SIGTERM the controller process. The process receives that, stops processing data, does a receptorctl cancel of the work unit, then it checks the cancel_flag to determine if it should use the above message, or just a "cancelled" status.

Meanwhile, the cancel_flag will only change to True on commit of the request, since it operates in a transaction (the job does not). So this leaves open the possibility to get this message when the request processing is slow and/or at the same time, the dispatcher and worker respond very fast.

This is academically pretty well established, so we should really just close the loop hole. This is not impossible. Some ideas I have...

Change the control-and-reply pattern for canceling to just a control signal with no reply. Then do not use the new_connection parameter when submitting the task. Thus, the message is submitted in the same transaction as the cancel_flag is flipped. This may require moving some corner-case logic out into an on_commit method or a dispatcher task to handle the cases where the status should flip directly to "cancelled".
Make this view non-atomic. Then we can do very minor re-arrangement to set the cancel_flag before we submit the task. If the view is no longer atomic, there will be no problem with this. We also won't need the new_connection flag in this case either.

The text was updated successfully, but these errors were encountered:

AlanCoding · 2022-10-20T19:06:44Z

Notes:

This issue directly came from an integration test, so the lack of the original failure may be evidence to close
Technically, I don't see a strong reason to prioritize this for backports, so I'm not filing any issues to do that work, but I would not argue against it.

github-actions bot added component:api needs_triage type:bug labels Oct 7, 2022

AlanCoding removed the needs_triage label Oct 10, 2022

AlanCoding self-assigned this Oct 13, 2022

This was referenced Oct 13, 2022

"could not reach dispatcher" errors when canceling jobs #12740

Closed

Make canceling view non-atomic to fix 500 errors with job bursts #13072

Merged

jay-steurer closed this as completed Nov 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition that can make jobs have "failed" status due to canceling it #13017

Race condition that can make jobs have "failed" status due to canceling it #13017

AlanCoding commented Oct 7, 2022

AlanCoding commented Oct 20, 2022

Race condition that can make jobs have "failed" status due to canceling it #13017

Race condition that can make jobs have "failed" status due to canceling it #13017

Comments

AlanCoding commented Oct 7, 2022

Please confirm the following

Bug Summary

AWX version

Select the relevant components

Installation method

Modifications

Ansible version

Operating system

Web browser

Steps to reproduce

Expected results

Actual results

Additional information

AlanCoding commented Oct 20, 2022