Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition that can make jobs have "failed" status due to canceling it #13017

Closed
4 of 9 tasks
AlanCoding opened this issue Oct 7, 2022 · 1 comment
Closed
4 of 9 tasks

Comments

@AlanCoding
Copy link
Member

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

In some conditions, jobs will incorrectly have the following message applied to them:

Task was canceled due to receiving a shutdown signal.

After the user does a POST to /api/v2/jobs/N/cancel/.

this appears to be fallout from recent development, where we landed fixes for some other things, and this was fallout from those fixes.

AWX version

devel

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

openshift

Modifications

no

Ansible version

devel

Operating system

N/A

Web browser

Chrome

Steps to reproduce

Right now I don't have concrete steps to reproduce, other than to use this hack:

diff --git a/awx/main/models/unified_jobs.py b/awx/main/models/unified_jobs.py
index a8ac64b2cc..35f63fe9ca 100644
--- a/awx/main/models/unified_jobs.py
+++ b/awx/main/models/unified_jobs.py
@@ -1481,8 +1481,12 @@ class UnifiedJob(
                     self.status = 'canceled'
                     cancel_fields.append('status')
 
+            import time
+            time.sleep(10)
+
             self.save(update_fields=cancel_fields)
 
+
         return self.cancel_flag
 
     @property

However, we have reports that this bug has been seen in real life in OCP deployments. Ping @CFSNM , the reporter.

Expected results

A user cancel should always yield a simple "canceled" status.

Actual results

Screenshot from 2022-10-07 15-40-41

Additional information

To describe what happens...

In the request, a message is sent to the dispatcher to SIGTERM the controller process. The process receives that, stops processing data, does a receptorctl cancel of the work unit, then it checks the cancel_flag to determine if it should use the above message, or just a "cancelled" status.

Meanwhile, the cancel_flag will only change to True on commit of the request, since it operates in a transaction (the job does not). So this leaves open the possibility to get this message when the request processing is slow and/or at the same time, the dispatcher and worker respond very fast.

This is academically pretty well established, so we should really just close the loop hole. This is not impossible. Some ideas I have...

  1. Change the control-and-reply pattern for canceling to just a control signal with no reply. Then do not use the new_connection parameter when submitting the task. Thus, the message is submitted in the same transaction as the cancel_flag is flipped. This may require moving some corner-case logic out into an on_commit method or a dispatcher task to handle the cases where the status should flip directly to "cancelled".
  2. Make this view non-atomic. Then we can do very minor re-arrangement to set the cancel_flag before we submit the task. If the view is no longer atomic, there will be no problem with this. We also won't need the new_connection flag in this case either.
@AlanCoding
Copy link
Member Author

Notes:

  • This issue directly came from an integration test, so the lack of the original failure may be evidence to close
  • Technically, I don't see a strong reason to prioritize this for backports, so I'm not filing any issues to do that work, but I would not argue against it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants