Task fails to retry correctly due to polling #3460
Comments
Left it running for 30 minutes with Py3 & Cylc latest from
Running with Py2 & Cylc 7.8.4-10-gde7d, the issue happened after the third retry.
2019-12-06T09:37:49+13:00 INFO - [flakey.20191206T0937+13] status=submitted: (received)started at 2019-12-06T09:37:47+13:00 for job(03)
2019-12-06T09:37:49+13:00 INFO - [flakey.20191206T0937+13] -health check settings: execution timeout=None, polling intervals=PT4S,...
2019-12-06T09:37:50+13:00 INFO - [stop.20191206T0937+13] status=running: (received)succeeded at 2019-12-06T09:37:48+13:00 for job(01)
2019-12-06T09:37:53+13:00 INFO - [flakey.20191206T0937+13] -poll now, (next in PT4S (after 2019-12-06T09:37:57+13:00))
2019-12-06T09:37:53+13:00 CRITICAL - [flakey.20191206T0937+13] status=running: (received)failed/EXIT at 2019-12-06T09:37:52+13:00 for job(03)
2019-12-06T09:37:53+13:00 INFO - [flakey.20191206T0937+13] -job(03) failed, retrying in PT10S (after 2019-12-06T09:38:03+13:00)
2019-12-06T09:37:54+13:00 INFO - Waiting for the command process pool to empty for shutdown
2019-12-06T09:37:54+13:00 INFO - [flakey.20191206T0937+13] status=retrying: (polled)failed at 2019-12-06T09:37:52+13:00 for job(03)
2019-12-06T09:37:54+13:00 INFO - Suite shutting down - REQUEST(CLEAN)
2019-12-06T09:37:54+13:00 INFO - DONE
Then trying again, it happened after the second retry.
Left it running a bit longer than 30 minutes again, with Cylc 8, and the issue did not happen. Hopefully helpful for others when troubleshooting. Cylc 8 appears to be immune to this bug.
Well that's good news, if slightly surprising! (Trying to recall if we've done any refactoring of relevant parts of the code, only on Cylc 8...).
Note we should check up on these kinds of issues for (or after) @oliver-sanders' PR #3423 - description includes the comment "TODO need to check this doesn't break polling logic".
Further report of this bug:
I've tested this with Cylc 8.0b2, using this workflow:
With Cylc 7 the problem occurs when a poll is triggered and then the task failure message is received before the poll returns. With Cylc 8 the following happens in this situation:
So I think we can be sure this is fixed at Cylc 8.
Closed by #3286
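For anyone trying to picture the race described in this thread (a poll is triggered, then the task's own failure message arrives before the poll returns), here is a toy Python model. It is only a sketch of one plausible way a duplicate "failed" report could turn a retry into a final failure; it is not taken from the Cylc source. The `Task` class, `handle_failed`, `simulate`, and the `guard_stale` flag are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Toy stand-in for a task proxy (hypothetical, not Cylc's real class)."""
    status: str = "running"
    submit_num: int = 1
    retries_left: int = 3
    outputs: list = field(default_factory=list)


def handle_failed(task, job_num, guard_stale):
    """Apply a 'failed' report (received or polled) for job `job_num`."""
    if guard_stale and (task.status != "running" or job_num != task.submit_num):
        # The report refers to a job that has already been dealt with: drop it.
        return "ignored"
    if task.status == "running" and task.retries_left > 0:
        # Normal path: the first failure report schedules a retry.
        task.retries_left -= 1
        task.status = "retrying"
        return "retrying"
    # No retries left, or (without the guard) a duplicate report arriving
    # while the task is already retrying: treated as a final failure.
    task.status = "failed"
    task.outputs.append("failed")
    return "failed"


def simulate(guard_stale):
    """Replay the suspected ordering: message first, late poll result second."""
    task = Task()
    # 1. A poll is issued while the job is running (its result is pending).
    # 2. The job's own failure message arrives first.
    first = handle_failed(task, job_num=1, guard_stale=guard_stale)
    # 3. The poll result for the same job arrives afterwards.
    second = handle_failed(task, job_num=1, guard_stale=guard_stale)
    return first, second, task.status


if __name__ == "__main__":
    print("without guard:", simulate(guard_stale=False))
    print("with guard:   ", simulate(guard_stale=True))
```

In this model the unguarded run ends with the task marked as failed even though retries remain, which is the kind of state that would let a downstream task such as stop trigger. Whether the real Cylc 7 scheduler follows this exact path is an assumption.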
The following workflow illustrates the problem:
The stop task should never run (which is what happens if you switch off execution polling). However, when I run this (with 7.8.4) I find the stop task runs within a few runs of the flakey task. It appears to happen when a poll is triggered and then the task failure message is received before the poll returns. Relevant log output: