
cylc pause #4076

Merged
merged 34 commits into cylc:master
Mar 19, 2021
Conversation

MetRonnie
Member

@MetRonnie MetRonnie commented Feb 15, 2021

These changes close #3897

  • Implement cylc pause
  • cylc play resumes a paused workflow
  • Decouple cylc hold/release from workflow state, so they only hold/release tasks (or set/remove the hold cycle point for all tasks*)

*The [scheduling]hold after cycle point setting, cylc hold --after and cylc play --hold-after have remained pretty much as they were, contrary to some of the discussion in #3897. However, they no longer affect the status of the task pool or the scheduler; they just set a point, and any spawned task beyond that point is held.
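A minimal sketch of the behaviour described above (hypothetical, not the cylc implementation; integer cycle points stand in for real date-time points): the hold point is simply stored, and each spawned task is compared against it.

```python
from typing import Optional


def should_hold(task_point: int, hold_point: Optional[int]) -> bool:
    """Return True if a spawned task at task_point lies beyond the hold point.

    Setting a hold point does not touch scheduler or task pool state;
    the check only happens as tasks are spawned.
    """
    return hold_point is not None and task_point > hold_point


# Tasks at or before the hold point run; tasks beyond it are held.
print(should_hold(2, 3))   # not held
print(should_hold(5, 3))   # held
print(should_hold(5, None))  # no hold point set: never held
```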

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • (master branch) I have opened a documentation PR at Cylc play/pause cylc-doc#204
  • No dependency changes.

@MetRonnie MetRonnie added this to the cylc-8.0b0 milestone Feb 15, 2021
@MetRonnie MetRonnie self-assigned this Feb 15, 2021

@MetRonnie
Member Author

Before I go through any more tests, can I ask @oliver-sanders and/or @hjoliver to review the changes to scheduler.py so far? 5a71c4d#diff-90711b005fe27b9f894e6cd09f08816d847dcf853518894c6b7e2139784cbcb8

@oliver-sanders
Member

Looks sensible to me, nice.

The only thing I'm unsure about is whether is_paused should cause should_process_tasks to return False, or whether it should block task processing in process_task_pool itself, as there are other things going on in that method.

@MetRonnie
Member Author

@hjoliver What do you say?

    def should_process_tasks(self) -> bool:
        """Return True if waiting tasks are ready."""
        # do we need to do a pass through the main task processing loop?
        process = False
        if self.is_paused:
            return False
        # New-style xtriggers.
        self.xtrigger_mgr.check_xtriggers(self.pool.get_tasks())
        if self.xtrigger_mgr.pflag:
            process = True
            self.xtrigger_mgr.pflag = False  # reset
        # Old-style external triggers.
        self.broadcast_mgr.add_ext_triggers(self.ext_trigger_queue)
        for itask in self.pool.get_tasks():
            if (itask.state.external_triggers and
                    self.broadcast_mgr.match_ext_trigger(itask)):
                process = True
        if self.task_events_mgr.pflag:
            # This flag is turned on by commands that change task state
            process = True
            self.task_events_mgr.pflag = False  # reset
        if self.task_job_mgr.task_remote_mgr.ready:
            # This flag is turned on when a host init/select command completes
            process = True
            self.task_job_mgr.task_remote_mgr.ready = False  # reset
        broadcast_mgr = self.task_events_mgr.broadcast_mgr
        broadcast_mgr.add_ext_triggers(self.ext_trigger_queue)
        for itask in self.pool.get_tasks():
            # External trigger matching and task expiry must be done
            # regardless, so they need to be in separate "if ..." blocks.
            if broadcast_mgr.match_ext_trigger(itask):
                process = True
            if self.pool.set_expired_task(itask, time()):
                process = True
            if all(itask.is_ready()):
                process = True
        if (
            self.config.run_mode('simulation') and
            self.pool.sim_time_check(self.message_queue)
        ):
            process = True
        return process

I'm not sure what the side effects of this method are

@hjoliver
Member

(I still need to think about that question, ran out of time today sorry)

@oliver-sanders
Member

oliver-sanders commented Feb 18, 2021

Cross-posting from element, here's the process_task_pool method:

    def process_task_pool(self):
        """Process ALL TASKS whenever something has changed that might
        require renegotiation of dependencies, etc"""
        LOG.debug("BEGIN TASK PROCESSING")
        time0 = time()
        if self._get_events_conf(self.EVENT_INACTIVITY_TIMEOUT):
            self.set_suite_inactivity_timer()
        if self.stop_mode is None and self.auto_restart_time is None:
            itasks = self.pool.get_ready_tasks()
            if itasks:
                self.is_updated = True
                self.task_job_mgr.task_remote_mgr.rsync_includes = (
                    self.config.get_validated_rsync_includes())
            for itask in self.task_job_mgr.submit_task_jobs(
                self.suite,
                itasks,
                self.curve_auth,
                self.client_pub_key_dir,
                self.config.run_mode('simulation')
            ):
                # TODO log itask.flow_label here (beware effect on ref tests)
                LOG.info('[%s] -triggered off %s',
                         itask, itask.state.get_resolved_dependencies())
        self.broadcast_mgr.expire_broadcast(self.pool.get_min_point())
        self.xtrigger_mgr.housekeep()
        self.suite_db_mgr.put_xtriggers(self.xtrigger_mgr.sat_xtrig)
        LOG.debug("END TASK PROCESSING (took %s seconds)" % (time() - time0))

Here's my suggestion:

    def process_task_pool(self):
        """Process ALL TASKS whenever something has changed that might
        require renegotiation of dependencies, etc"""
        LOG.debug("BEGIN TASK PROCESSING")
        time0 = time()
        if self._get_events_conf(self.EVENT_INACTIVITY_TIMEOUT):
            self.set_suite_inactivity_timer()
        if not self.is_paused and self.stop_mode is None and self.auto_restart_time is None:
            # ^ skip all this stuff if paused
            # ...
        self.broadcast_mgr.expire_broadcast(self.pool.get_min_point())
        self.xtrigger_mgr.housekeep()
        self.suite_db_mgr.put_xtriggers(self.xtrigger_mgr.sat_xtrig)
        LOG.debug("END TASK PROCESSING (took %s seconds)" % (time() - time0))

This would allow the other things in that function to work as I think they currently would for a "held" workflow.

@hjoliver
Member

Yep, I agree with @oliver-sanders 👍

@MetRonnie MetRonnie mentioned this pull request Feb 19, 2021
7 tasks
MetRonnie added a commit to MetRonnie/cylc-flow that referenced this pull request Feb 19, 2021
@MetRonnie
Member Author

In this

    # Dependency negotiation etc. will run after these commands
    PROC_CMDS = (
        'release_suite',
        'release_tasks',
        'kill_tasks',
        'force_spawn_children',
        'force_trigger_tasks',
        'reload_suite'
    )

release_suite and release_tasks are not used as command names, even on master, see

    def release(self, tasks=None):
        """Release (un-hold) the workflow."""
        self.schd.command_queue.put((
            "release",
            (),
            filter_none({
                'ids': tasks
            })
        ))
        return (True, 'Command queued')

Should I change them to

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 4bb79bfa0..027a2374c 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -156,8 +156,7 @@ class Scheduler:
     # Dependency negotiation etc. will run after these commands
     PROC_CMDS = (
-        'release_suite',
-        'release_tasks',
+        'release',
         'kill_tasks',
         'force_spawn_children',

MetRonnie added a commit to MetRonnie/cylc-flow that referenced this pull request Feb 19, 2021
@hjoliver
Member

Should I change them to ...

I think so, yes, but want to look at the associated code tomorrow...

MetRonnie added a commit to MetRonnie/cylc-flow that referenced this pull request Feb 22, 2021
@MetRonnie MetRonnie marked this pull request as ready for review March 11, 2021 18:29
Member

@hjoliver hjoliver left a comment


Approving after a somewhat rushed review, but LGTM 🎉


Member

@oliver-sanders oliver-sanders left a comment


Code review good, going to do some end-to-end testing, not expecting any issues.

[Review comment on cylc/flow/tui/data.py (outdated, resolved)]
Member

@oliver-sanders oliver-sanders left a comment


All good on the end-to-end testing 👍.

One small conflict and MyPy is a bit angry since recent changes?

@MetRonnie
Member Author

Yes, the attribute declarations like pool: Optional[TaskPool] = None are not playing well with MyPy. I'm trying to figure out the best solution.

    data_store_mgr: DataStoreMgr
    suite_db_mgr: SuiteDatabaseManager
    broadcast_mgr: BroadcastMgr
    xtrigger_mgr: XtriggerManager
Member Author


If these attributes have no defaults, they have to be declared before the attributes with defaults in an @dataclass class
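To illustrate the @dataclass constraint mentioned above (a simplified sketch, not the real Scheduler class):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulerSketch:
    # Fields without defaults must be declared before fields with
    # defaults; reversing the order raises at class-definition time:
    # TypeError: non-default argument follows default argument
    suite: str
    pool: Optional[str] = None


schd = SchedulerSketch(suite="demo")
print(schd)
```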

@MetRonnie
Member Author

MetRonnie commented Mar 18, 2021

Strange that tests/f/platforms/07-localhost-set-if-not-in-globalcfg is failing on MacOS consistently while passing on Linux.

ok 1 - 07-localhost-set-if-not-in-globalcfg-validate
not ok 2 - 07-localhost-set-if-not-in-globalcfg-run
ok 3 - job-grep-ok

Suite log:

2021-03-18T22:42:10 DEBUG - PT5M suite inactivity timer starts NOW: 2021-03-18T22:42:10
2021-03-18T22:42:10 DEBUG - END TASK PROCESSING (took 0.00095367431640625 seconds)
2021-03-18T22:47:11 WARNING - suite timed out after inactivity for PT5M
2021-03-18T22:47:11 ERROR - Suite shutting down - Abort on suite inactivity is set
2021-03-18T22:47:11 WARNING - Orphaned task jobs:
	* foo.1 (submitted)

There is no job.err for foo.

Whereas on my local VLD I get

DEBUG - PT5M suite inactivity timer starts NOW: 2021-03-18T17:46:47Z
DEBUG - END TASK PROCESSING (took 0.0009641647338867188 seconds)
DEBUG - version: b'1.0', request_id: b'1', domain: '', address: 'xx.xxx.xx.xxx', identity: b'', mechanism: b'CURVE'
DEBUG - ALLOWED (CURVE) domain=* client_key=b'XXXXXXX'
DEBUG - ZAP reply code=b'200' text=b'OK'
INFO - [foo.1] status=submitted: (received)started at 2021-03-18T17:46:50Z  for job(01) flow(D)

@oliver-sanders can you take a look?

@oliver-sanders
Member

Sure.

@oliver-sanders
Member

oliver-sanders commented Mar 19, 2021

That's because it sets "job runner = at", which we wouldn't expect to work on MacOS (the at daemon needs to be authenticated and enabled before it can work). Now, why did this test not fail on master? Or perhaps it did but no one noticed.

This should do it - https://github.com/cylc/cylc-flow/pull/4136/files
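A quick way to see part of the problem (a hypothetical helper, not cylc code): even checking that the at binary exists on PATH is only a necessary condition, since on macOS the at daemon (atrun) is disabled by default and queued jobs never run.

```python
import shutil


def at_on_path() -> bool:
    """Return True if the `at` command is on PATH.

    Necessary but not sufficient for "job runner = at" to work: on
    macOS, atrun must additionally be enabled before jobs execute.
    """
    return shutil.which("at") is not None


print("at on PATH:", at_on_path())
```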

@MetRonnie
Member Author

Shouldn't be any need to re-run the tests

@oliver-sanders
Member

Fair'nough

@oliver-sanders oliver-sanders merged commit 4947568 into cylc:master Mar 19, 2021
@MetRonnie MetRonnie deleted the cylc-pause branch March 19, 2021 15:49
@MetRonnie MetRonnie added the db change Change to the workflow database structure label Jan 21, 2022
Successfully merging this pull request may close these issues.

cli: play|pause|stop