
Task fails to retry correctly due to polling #3460

Closed
dpmatthews opened this issue Dec 5, 2019 · 7 comments
Labels
bug Something is wrong :(
Milestone
cylc-7.8.x
Comments

@dpmatthews
Contributor

The following workflow illustrates the problem:

[scheduling]
    initial cycle point = now
    [[dependencies]]
        [[[PT1H]]]
            graph = """
                    flakey[-PT1H] => flakey => !stop
                    flakey:fail => stop
                    """
[runtime]
    [[flakey]]
        script = "sleep 5; ((CYLC_TASK_TRY_NUMBER <= 9)) && exit 1; true"
        [[[job]]]
            execution retry delays = 10*PT10S
            execution polling intervals = PT4S
    [[stop]]
        script = "echo This should never run; cylc stop $CYLC_SUITE_NAME"

The stop task should never run (and indeed it never does if you switch off execution polling). However, when I run this (with 7.8.4) the stop task runs within a few runs of the flakey task. It appears to happen when a poll is triggered and then the task failure message is received before the poll returns. Relevant log output:

2019-12-05T19:20:58Z INFO - [flakey.20191205T1920Z] -poll now, (next in PT4S (after 2019-12-05T19:21:02Z))
2019-12-05T19:20:59Z INFO - [flakey.20191205T1920Z] status=running: (polled)failed at 2019-12-05T19:20:58Z for job(01)
2019-12-05T19:20:59Z INFO - [flakey.20191205T1920Z] -job(01) failed, retrying in PT10S (after 2019-12-05T19:21:09Z)
2019-12-05T19:20:59Z CRITICAL - [flakey.20191205T1920Z] status=retrying: (received)failed/EXIT at 2019-12-05T19:20:58Z for job(01)
2019-12-05T19:21:09Z INFO - [flakey.20191205T1920Z] -submit-num=02, owner@host=exvcylc02.metoffice.gov.uk
2019-12-05T19:21:10Z INFO - [flakey.20191205T1920Z] status=ready: (internal)submitted at 2019-12-05T19:21:10Z for job(02)
2019-12-05T19:21:10Z INFO - [flakey.20191205T1920Z] -health check settings: submission timeout=P1D, polling intervals=PT1H,...
2019-12-05T19:21:12Z INFO - [flakey.20191205T1920Z] status=submitted: (received)started at 2019-12-05T19:21:11Z for job(02)
2019-12-05T19:21:12Z INFO - [flakey.20191205T1920Z] -health check settings: execution timeout=None, polling intervals=PT4S,...
2019-12-05T19:21:16Z INFO - [flakey.20191205T1920Z] -poll now, (next in PT4S (after 2019-12-05T19:21:20Z))
2019-12-05T19:21:17Z CRITICAL - [flakey.20191205T1920Z] status=running: (received)failed/EXIT at 2019-12-05T19:21:16Z for job(02)
2019-12-05T19:21:17Z INFO - [flakey.20191205T1920Z] -job(02) failed, retrying in PT10S (after 2019-12-05T19:21:27Z)
2019-12-05T19:21:17Z INFO - [flakey.20191205T1920Z] status=retrying: (polled)failed at 2019-12-05T19:21:16Z for job(02)
2019-12-05T19:21:27Z INFO - [stop.20191205T1920Z] -submit-num=01, owner@host=exvcylc02.metoffice.gov.uk
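
The race in the log above can be sketched in a few lines of Python. This is a hypothetical model (none of these names are Cylc's actual classes): the poll result and the job's own "failed" message both report the same job failure, and if the failure handler has no de-duplication, one failure can consume more than one retry slot, or push the task straight to the failed state and fire the :fail trigger.

```python
class Task:
    """Toy model of a task with execution retries (hypothetical, not Cylc code)."""

    def __init__(self, retry_delays):
        self.retries_left = retry_delays
        self.state = "running"
        self.handled_jobs = set()  # job attempts whose failure was already handled

    def on_failed(self, job_num, source, dedupe=True):
        """Handle a 'failed' report from either a poll or a job message."""
        if dedupe and job_num in self.handled_jobs:
            return f"ignored stale {source} for job({job_num:02d})"
        self.handled_jobs.add(job_num)
        if self.retries_left:
            self.retries_left -= 1
            self.state = "retrying"
            return f"retrying after {source} for job({job_num:02d})"
        self.state = "failed"
        return f"failed ({source}): no retries left"

# Without de-duplication, the poll result and the received message each
# count as a failure, so one real failure exhausts the single retry:
task = Task(retry_delays=1)
task.on_failed(1, "polled", dedupe=False)    # -> retrying
task.on_failed(1, "received", dedupe=False)  # -> failed: no retries left

# With de-duplication, the second report for the same job is ignored:
task2 = Task(retry_delays=1)
task2.on_failed(1, "polled")
print(task2.on_failed(1, "received"))  # -> ignored stale received for job(01)
```
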
@dpmatthews dpmatthews added the bug Something is wrong :( label Dec 5, 2019
@dpmatthews dpmatthews added this to the cylc-7.8.x milestone Dec 5, 2019
@kinow
Member

kinow commented Dec 5, 2019

Left it running for 30 minutes with Py3 & Cylc latest from master (8.0a2-dev, I guess), but could not reproduce the issue there.

Running with Py2 & Cylc 7.8.4-10-gde7d the issue happened after the third retry.

2019-12-06T09:37:49+13:00 INFO - [flakey.20191206T0937+13] status=submitted: (received)started at 2019-12-06T09:37:47+13:00 for job(03)
2019-12-06T09:37:49+13:00 INFO - [flakey.20191206T0937+13] -health check settings: execution timeout=None, polling intervals=PT4S,...
2019-12-06T09:37:50+13:00 INFO - [stop.20191206T0937+13] status=running: (received)succeeded at 2019-12-06T09:37:48+13:00 for job(01)
2019-12-06T09:37:53+13:00 INFO - [flakey.20191206T0937+13] -poll now, (next in PT4S (after 2019-12-06T09:37:57+13:00))
2019-12-06T09:37:53+13:00 CRITICAL - [flakey.20191206T0937+13] status=running: (received)failed/EXIT at 2019-12-06T09:37:52+13:00 for job(03)
2019-12-06T09:37:53+13:00 INFO - [flakey.20191206T0937+13] -job(03) failed, retrying in PT10S (after 2019-12-06T09:38:03+13:00)
2019-12-06T09:37:54+13:00 INFO - Waiting for the command process pool to empty for shutdown
2019-12-06T09:37:54+13:00 INFO - [flakey.20191206T0937+13] status=retrying: (polled)failed at 2019-12-06T09:37:52+13:00 for job(03)
2019-12-06T09:37:54+13:00 INFO - Suite shutting down - REQUEST(CLEAN)
2019-12-06T09:37:54+13:00 INFO - DONE

Then trying again, it happened after the second retry.

@kinow
Member

kinow commented Dec 5, 2019

Left it running a bit longer than 30 minutes again with Cylc 8, and the issue did not happen. Hopefully that's helpful for others when troubleshooting; Cylc 8 appears to be immune to this bug.

@hjoliver
Member

hjoliver commented Dec 6, 2019

Well, that's good news, if slightly surprising! (Trying to recall whether we've done any refactoring of the relevant parts of the code on Cylc 8 only...)

@hjoliver
Member

hjoliver commented Dec 6, 2019

Note we should check up on these kinds of issues for (or after) @oliver-sanders' PR #3423 - its description includes the comment "TODO need to check this doesn't break polling logic".

@dpmatthews dpmatthews changed the title Failure trigger activated incorrectly for a retrying task due to polling Task fails to retry correctly due to polling Feb 4, 2021
@dpmatthews
Contributor Author

Further report of this bug:
https://cylc.discourse.group/t/execution-retry-delays-and-execution-job-polling
I've updated the issue description - the lack of retrying is the real issue here.

@dpmatthews
Contributor Author

I've tested this with Cylc 8.0b2 using this workflow:

[scheduling]
    initial cycle point = now
    [[graph]]
        PT1H = "flakey[-PT1H] => flakey"
[runtime]
    [[flakey]]
        platform = ??? # Use a remote platform to slow down how quickly the polling returns
        script = "sleep 5; ((CYLC_TASK_TRY_NUMBER <= 9)) && exit 1; true"
        execution retry delays = 10*PT5S
        execution polling intervals = PT3S

With Cylc 7 the problem occurs when a poll is triggered and then the task failure message is received before the poll returns. With Cylc 8 the following happens in this situation:

2021-08-13T13:52:50+01:00 INFO - [flakey.20210813T1546+0100] -poll now, (next in PT3S (after 2021-08-13T13:52:53+01:00))
2021-08-13T13:52:52+01:00 CRITICAL - [flakey.20210813T1546+0100] status=running: (received)failed/ERR at 2021-08-13T13:52:51+01:00  for job(05) flow(e)
2021-08-13T13:52:52+01:00 INFO - [flakey.20210813T1546+0100] -job(05) failed, retrying in PT5S (after 2021-08-13T13:52:57+01:00)
2021-08-13T13:52:53+01:00 WARNING - [flakey.20210813T1546+0100] status=waiting: (polled-ignored)failed at 2021-08-13T13:52:51+01:00  for job(05) flow(e)

So I think we can be sure this is fixed at Cylc 8.
I think it was actually fixed in 8.0a1 by #3286 ("Ignore job poll message, when task is already in a retrying state").
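
The essence of that fix can be sketched as a state guard (a hypothetical illustration, not Cylc's real code): a polled job status is only acted on while the task is still in an active state, and a poll result that lands after the task has already been put back into a waiting/retrying state is treated as stale, matching the "(polled-ignored)" line in the log above.

```python
# States in which a poll result still carries new information
# (these names are illustrative, not Cylc's actual state constants).
ACTIVE_STATES = {"submitted", "running"}

def handle_polled_status(task_state, polled_status):
    """Decide what to do with a polled job status given the task's state."""
    if task_state not in ACTIVE_STATES:
        # The received "failed" message has already been handled and the
        # task is queued for retry, so the late poll result is stale.
        return "polled-ignored"
    return f"apply: {polled_status}"

print(handle_polled_status("running", "failed"))  # -> apply: failed
print(handle_polled_status("waiting", "failed"))  # -> polled-ignored
```

With this guard, only whichever report arrives first (poll or message) triggers the retry logic; the other is discarded instead of being treated as a second failure.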

@hjoliver
Member

Closed by #3286
