
Task fails to retry correctly due to polling #3460

Closed
dpmatthews opened this issue Dec 5, 2019 · 7 comments
Labels
bug Something is wrong :(
Milestone
cylc-7.8.x
Comments

@dpmatthews
Contributor

The following workflow illustrates the problem:

[scheduling]
    initial cycle point = now
    [[dependencies]]
        [[[PT1H]]]
            graph = """
                    flakey[-PT1H] => flakey => !stop
                    flakey:fail => stop
                    """
[runtime]
    [[flakey]]
        script = "sleep 5; ((CYLC_TASK_TRY_NUMBER <= 9)) && exit 1; true"
        [[[job]]]
            execution retry delays = 10*PT10S
            execution polling intervals = PT4S
    [[stop]]
        script = "echo This should never run; cylc stop $CYLC_SUITE_NAME"

The stop task should never run (and indeed it never does if you switch off execution polling). However, when I run this (with 7.8.4) the stop task runs within a few runs of the flakey task. It appears to happen when a poll is triggered and then the task failure message is received before the poll returns. Relevant log output:

2019-12-05T19:20:58Z INFO - [flakey.20191205T1920Z] -poll now, (next in PT4S (after 2019-12-05T19:21:02Z))
2019-12-05T19:20:59Z INFO - [flakey.20191205T1920Z] status=running: (polled)failed at 2019-12-05T19:20:58Z for job(01)
2019-12-05T19:20:59Z INFO - [flakey.20191205T1920Z] -job(01) failed, retrying in PT10S (after 2019-12-05T19:21:09Z)
2019-12-05T19:20:59Z CRITICAL - [flakey.20191205T1920Z] status=retrying: (received)failed/EXIT at 2019-12-05T19:20:58Z for job(01)
2019-12-05T19:21:09Z INFO - [flakey.20191205T1920Z] -submit-num=02, owner@host=exvcylc02.metoffice.gov.uk
2019-12-05T19:21:10Z INFO - [flakey.20191205T1920Z] status=ready: (internal)submitted at 2019-12-05T19:21:10Z for job(02)
2019-12-05T19:21:10Z INFO - [flakey.20191205T1920Z] -health check settings: submission timeout=P1D, polling intervals=PT1H,...
2019-12-05T19:21:12Z INFO - [flakey.20191205T1920Z] status=submitted: (received)started at 2019-12-05T19:21:11Z for job(02)
2019-12-05T19:21:12Z INFO - [flakey.20191205T1920Z] -health check settings: execution timeout=None, polling intervals=PT4S,...
2019-12-05T19:21:16Z INFO - [flakey.20191205T1920Z] -poll now, (next in PT4S (after 2019-12-05T19:21:20Z))
2019-12-05T19:21:17Z CRITICAL - [flakey.20191205T1920Z] status=running: (received)failed/EXIT at 2019-12-05T19:21:16Z for job(02)
2019-12-05T19:21:17Z INFO - [flakey.20191205T1920Z] -job(02) failed, retrying in PT10S (after 2019-12-05T19:21:27Z)
2019-12-05T19:21:17Z INFO - [flakey.20191205T1920Z] status=retrying: (polled)failed at 2019-12-05T19:21:16Z for job(02)
2019-12-05T19:21:27Z INFO - [stop.20191205T1920Z] -submit-num=01, owner@host=exvcylc02.metoffice.gov.uk
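
The race in the log above can be sketched in a few lines of Python. This is a hypothetical model (none of these names are Cylc's actual classes): the poll result and the job's own "failed" message both report the same job failure, and if the failure handler has no de-duplication, one failure can consume more than one retry slot, or push the task straight to the failed state and fire the :fail trigger.

```python
class Task:
    """Toy model of a task with execution retries (hypothetical, not Cylc code)."""

    def __init__(self, retry_delays):
        self.retries_left = retry_delays
        self.state = "running"
        self.handled_jobs = set()  # job attempts whose failure was already handled

    def on_failed(self, job_num, source, dedupe=True):
        """Handle a 'failed' report from either a poll or a job message."""
        if dedupe and job_num in self.handled_jobs:
            return f"ignored stale {source} for job({job_num:02d})"
        self.handled_jobs.add(job_num)
        if self.retries_left:
            self.retries_left -= 1
            self.state = "retrying"
            return f"retrying after {source} for job({job_num:02d})"
        self.state = "failed"
        return f"failed ({source}): no retries left"

# Without de-duplication, the poll result and the received message each
# count as a failure, so one real failure exhausts the single retry:
task = Task(retry_delays=1)
task.on_failed(1, "polled", dedupe=False)    # -> retrying
task.on_failed(1, "received", dedupe=False)  # -> failed: no retries left

# With de-duplication, the second report for the same job is ignored:
task2 = Task(retry_delays=1)
task2.on_failed(1, "polled")
print(task2.on_failed(1, "received"))  # -> ignored stale received for job(01)
```
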
@dpmatthews dpmatthews added the bug Something is wrong :( label Dec 5, 2019
@dpmatthews dpmatthews added this to the cylc-7.8.x milestone Dec 5, 2019
@kinow
Member

kinow commented Dec 5, 2019

Left it running for 30 minutes with Py3 & Cylc latest from master (8.0a2-dev, I guess), but could not reproduce the issue there.

Running with Py2 & Cylc 7.8.4-10-gde7d the issue happened after the third retry.

2019-12-06T09:37:49+13:00 INFO - [flakey.20191206T0937+13] status=submitted: (received)started at 2019-12-06T09:37:47+13:00 for job(03)
2019-12-06T09:37:49+13:00 INFO - [flakey.20191206T0937+13] -health check settings: execution timeout=None, polling intervals=PT4S,...
2019-12-06T09:37:50+13:00 INFO - [stop.20191206T0937+13] status=running: (received)succeeded at 2019-12-06T09:37:48+13:00 for job(01)
2019-12-06T09:37:53+13:00 INFO - [flakey.20191206T0937+13] -poll now, (next in PT4S (after 2019-12-06T09:37:57+13:00))
2019-12-06T09:37:53+13:00 CRITICAL - [flakey.20191206T0937+13] status=running: (received)failed/EXIT at 2019-12-06T09:37:52+13:00 for job(03)
2019-12-06T09:37:53+13:00 INFO - [flakey.20191206T0937+13] -job(03) failed, retrying in PT10S (after 2019-12-06T09:38:03+13:00)
2019-12-06T09:37:54+13:00 INFO - Waiting for the command process pool to empty for shutdown
2019-12-06T09:37:54+13:00 INFO - [flakey.20191206T0937+13] status=retrying: (polled)failed at 2019-12-06T09:37:52+13:00 for job(03)
2019-12-06T09:37:54+13:00 INFO - Suite shutting down - REQUEST(CLEAN)
2019-12-06T09:37:54+13:00 INFO - DONE

Then trying again, it happened after the second retry.

@kinow
Member

kinow commented Dec 5, 2019

Left it running a bit longer than 30 minutes again with Cylc 8, and the issue did not happen. Hopefully that's helpful for others when troubleshooting; Cylc 8 appears to be immune to this bug.

@hjoliver
Member

hjoliver commented Dec 6, 2019

Well, that's good news, if slightly surprising! (Trying to recall whether we've done any refactoring of the relevant parts of the code on Cylc 8 only...)

@hjoliver
Member

hjoliver commented Dec 6, 2019

Note we should check up on these kinds of issues for (or after) @oliver-sanders' PR #3423 - its description includes the comment "TODO need to check this doesn't break polling logic".

@dpmatthews dpmatthews changed the title Failure trigger activated incorrectly for a retrying task due to polling Task fails to retry correctly due to polling Feb 4, 2021
@dpmatthews
Contributor Author

Further report of this bug:
https://cylc.discourse.group/t/execution-retry-delays-and-execution-job-polling
I've updated the issue description - the lack of retrying is the real issue here.

@dpmatthews
Contributor Author

I've tested this with Cylc 8.0b2 using this workflow:

[scheduling]
    initial cycle point = now
    [[graph]]
        PT1H = "flakey[-PT1H] => flakey"
[runtime]
    [[flakey]]
        platform = ??? # Use a remote platform to slow down how quickly the polling returns
        script = "sleep 5; ((CYLC_TASK_TRY_NUMBER <= 9)) && exit 1; true"
        execution retry delays = 10*PT5S
        execution polling intervals = PT3S

With Cylc 7 the problem occurs when a poll is triggered and then the task failure message is received before the poll returns. With Cylc 8 the following happens in this situation:

2021-08-13T13:52:50+01:00 INFO - [flakey.20210813T1546+0100] -poll now, (next in PT3S (after 2021-08-13T13:52:53+01:00))
2021-08-13T13:52:52+01:00 CRITICAL - [flakey.20210813T1546+0100] status=running: (received)failed/ERR at 2021-08-13T13:52:51+01:00  for job(05) flow(e)
2021-08-13T13:52:52+01:00 INFO - [flakey.20210813T1546+0100] -job(05) failed, retrying in PT5S (after 2021-08-13T13:52:57+01:00)
2021-08-13T13:52:53+01:00 WARNING - [flakey.20210813T1546+0100] status=waiting: (polled-ignored)failed at 2021-08-13T13:52:51+01:00  for job(05) flow(e)

So I think we can be sure this is fixed at Cylc 8.
I think it was actually fixed in 8.0a1 by #3286 ("Ignore job poll message, when task is already in a retrying state").
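
The essence of that fix can be sketched as a state guard (a hypothetical illustration, not Cylc's real code): a polled job status is only acted on while the task is still in an active state, and a poll result that lands after the task has already been put back into a waiting/retrying state is treated as stale, matching the "(polled-ignored)" line in the log above.

```python
# States in which a poll result still carries new information
# (these names are illustrative, not Cylc's actual state constants).
ACTIVE_STATES = {"submitted", "running"}

def handle_polled_status(task_state, polled_status):
    """Decide what to do with a polled job status given the task's state."""
    if task_state not in ACTIVE_STATES:
        # The received "failed" message has already been handled and the
        # task is queued for retry, so the late poll result is stale.
        return "polled-ignored"
    return f"apply: {polled_status}"

print(handle_polled_status("running", "failed"))  # -> apply: failed
print(handle_polled_status("waiting", "failed"))  # -> polled-ignored
```

With this guard, only whichever report arrives first (poll or message) triggers the retry logic; the other is discarded instead of being treated as a second failure.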

@hjoliver
Member

Closed by #3286
