Auto shutdown & restart #988

dpmatthews · 2014-06-19T12:28:17Z

The fact that the cylc suite daemon needs to keep running for the entire duration of a suite can cause problems in some circumstances, for instance:

- Some platforms where users want to run suites have been known to automatically kill long-running processes. In this case it would be good if you could configure the suite to automatically shut down and restart, e.g. every 24 hours.
- Some platforms have scheduled reboots, e.g. every month. In this case it would be good if the suite could check on a regular basis whether the current host is still present in a list of permitted suite hosts and, if not, shut down and restart on a new host.

Both of these could probably be done using a special task in a suite but it would be better if this support could be built into cylc.

arjclark · 2015-12-31T10:14:35Z

Having a think about this, how about something along these lines:

site.rc

...
[cylc]
    invalid-hosts=hostX # List of hosts that suites should not be run on
    host-check=P1D # Interval at which to check whether the current host is still permitted
    [[event hooks]]
        invalid-host=<some command> # site configurable default command
...

The key thing here, I guess, being that the site can set the appropriate action to take on a suite detecting its host as being "invalid".

For cylc alone a sensible set of commands is easy enough, but when you factor in something like Rose handling the running of suites etc., you need to be able to specify something a bit smarter (e.g. to cope with the change in $ROSE_ORIG_HOST), so I think it's best to keep it configurable.

Whether you'd want users to be able to override the event hook for this is up for debate I guess.

I don't think specifying "valid" hosts is trivial though, as you'd need to be able to specify regexes to account for things like being able to run on any user's desktop (e.g. dtp*). A "get off these specific machines" list feels more straightforward. As far as the restart goes, you'd probably just specify the method for selecting a valid host under your invalid-host handler.

So, e.g. in a Rose context you might end up with something like this:

site.rc

...
[cylc]
    invalid-hosts=cylcserver02
    host-check=P1D
    [[event hooks]]
        invalid-host=rose suite-shutdown --name=$SUITE_ID -y -- -v; rose suite-restart --name=$SUITE_ID
...

(Though that might bork the setting of $ROSE_ORIG_HOST which is a different problem)
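
To make the "get off these specific machines" check concrete, here is a minimal Python sketch. It is an illustration only, not cylc code: the function name `host_is_invalid` is hypothetical, and it assumes the `invalid-hosts` entries are treated as shell-style glob patterns (so that something like `dtp*` would work) rather than full regexes.

```python
import fnmatch
import socket


def host_is_invalid(hostname, invalid_patterns):
    """Return True if hostname matches any pattern in invalid_patterns.

    Patterns are shell-style globs (an assumption for this sketch).
    Both the full name and the short name (text before the first dot)
    are tested, so "cylcserver02" also matches "cylcserver02.example.com".
    """
    short_name = hostname.split(".", 1)[0]
    return any(
        fnmatch.fnmatchcase(hostname, pattern)
        or fnmatch.fnmatchcase(short_name, pattern)
        for pattern in invalid_patterns
    )


if __name__ == "__main__":
    # Check the machine this script is running on against the example list.
    print(host_is_invalid(socket.gethostname(), ["cylcserver02"]))
```

A suite would presumably run a check like this once per `host-check` interval and fire the `invalid-host` hook when it returns True.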

hjoliver · 2016-06-15T09:14:24Z

[meeting] - a suite daemon can shut itself down, but (obviously) can't revive itself when dead. So, at least in the first instance, we should just provide appropriate early shutdown options and allow the user to handle the restarts e.g. via cron. It would be easy enough to allow shutdowns to be ordered after:

- a single iteration through the scheduling loop
- N iterations through the scheduling loop
- some wallclock interval

Probably needs to be "shutdown --now".

Could we automatically edit the user's crontab to arrange restarts at the right intervals? This might be appropriate if we can un-edit the crontab once the suite has run to completion, and/or we store completed status in the suite DB so that attempted restarts can be aborted immediately.
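
The shutdown conditions listed above (after N iterations, or after some wallclock interval) could be tracked with something like the following Python sketch. The class name `EarlyShutdown` and its parameters are hypothetical and are not part of cylc; the injectable `clock` argument is only there so the logic can be exercised without waiting.

```python
import time


class EarlyShutdown:
    """Decide when a main loop should stop early.

    Stops after max_iterations passes through the loop, or once
    max_seconds of wallclock time have elapsed, whichever comes first.
    A limit of None disables that condition.
    """

    def __init__(self, max_iterations=None, max_seconds=None,
                 clock=time.monotonic):
        self.max_iterations = max_iterations
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self._iterations = 0

    def tick(self):
        """Record one loop iteration; return True when it is time to stop."""
        self._iterations += 1
        if (self.max_iterations is not None
                and self._iterations >= self.max_iterations):
            return True
        if (self.max_seconds is not None
                and self._clock() - self._start >= self.max_seconds):
            return True
        return False
```

With `max_iterations=1` this covers the "single iteration" case; the scheduling loop would call `tick()` at the end of each pass and trigger the shutdown when it returns True.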

arjclark · 2016-06-15T09:16:49Z

> Probably needs to be "shutdown --now"

I think it has to be, to prevent some stuck task somewhere gumming up the system or holding up a timely shutdown.

hjoliver · 2016-06-15T10:26:03Z

@dpmatthews reports he was primarily interested in this use-case:

A running suite daemon should periodically check (e.g. the site config file as per @arjclark's comment above) that it is still running on a "valid" suite server. If not, it should shut itself down, and as a final hurrah, cause itself to be restarted on a valid server.

In this case, an external means of restarting (e.g. cron) would not be required.

This would be a great help for site cylc server maintenance. Quite a high priority but is somewhat dependent on #1885 ("rose suite-run" migration).

hjoliver · 2016-06-15T10:34:04Z

Another bullet point for the revive-from-the-dead use case:

- on start-up, continue until there is nothing to do (e.g. waiting on a clock-trigger for the next cycle), then shut down.

This would allow operational suites to not exist for some time between cycles, rather than staying alive in a purely waiting state. Each new cycle would presumably be kicked off with cron.

oliver-sanders · 2018-07-19T15:59:36Z

Options for checking whether the suite needs to shut down and restart:

1. Check whether the server the suite is running on is still listed as a valid host (Incorporate suite host selection logic #2693).

   This means if we remove a host from the list, all suites running on that host will jump ship. This might not be the intended effect; for instance, we very occasionally remove hosts from the pool if they are unhealthy or if we want them to drain naturally. Probably better to come up with some sort of "condemned host" setting to make this more explicit and prevent unintended consequences.

   Once a host is marked as "condemned", all suites will try to jump ship at once. This is a problem because:
   - If using a ranking method other than "random", this is likely to result in one host in the pool getting swamped as it picks up all the suites from the "condemned" host.
   - The suite startup process can be quite heavy for suites with large configurations. Many suites starting up simultaneously could cause a server slowdown.

   To get around this we could have suites wait for a random interval before shutting down; not pretty but quite functional. A few minutes' lag isn't going to be a problem?

2. Instruct suites to shut down using HTTP(s) comms.

   Write a simple script to call the existing Cylc interfaces. Very little work, but it would have to be run by a sysadmin (in order to get access to the suites' passphrase files).

   The admin could send shutdown messages across port ranges in chunks to avoid all suites jumping ship at the same time.

3. Flag file

   Much the same as 1. but with a file serving as the trigger to jump ship.

I believe we were leaning towards option 1? Does the configurable random delay seem like a sensible way forward?
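
For illustration, the configurable random delay could look something like this Python sketch. The function name `stagger_delay` and the `max_delay_seconds` setting are hypothetical; this only shows the idea of each suite drawing its own wait time so that restarts are spread over a window.

```python
import random


def stagger_delay(max_delay_seconds, rng=random):
    """Return a random wait in seconds, between 0 and max_delay_seconds.

    Each suite draws its own delay, so suites leaving a condemned host
    spread their restarts over the window instead of moving all at once.
    A non-positive maximum disables the delay.
    """
    if max_delay_seconds <= 0:
        return 0.0
    return rng.uniform(0, max_delay_seconds)


if __name__ == "__main__":
    # Example: spread restarts over a five minute window.
    print(f"waiting {stagger_delay(300):.1f}s before shutting down")
```

Passing a seeded `random.Random` instance as `rng` makes the delay reproducible for testing.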

matthewrmshin · 2018-07-20T15:00:07Z

Option 1 is preferred. The global configuration can be reloaded at health check intervals (the current default is PT10M or something like that). This should then give the suite an indication of whether its host is being drained or not. Agree that it may be sensible to stagger suite restarts (if they are not staggered already due to individual health check intervals).

dpmatthews added this to the later milestone Jun 19, 2014

matthewrmshin modified the milestones: soon, later Mar 2, 2017

matthewrmshin self-assigned this Mar 2, 2017

This was referenced Jan 24, 2018

scan port range and run port range #2323

Closed

rose suite-run: pgrep for suite processes? metomi/rose#2142

Closed

matthewrmshin mentioned this issue Mar 19, 2018

Lazy loading of global.rc #2601

Merged

sadielbartholomew mentioned this issue Jun 6, 2018

Incorporate suite host selection logic #2693

Merged

7 tasks

oliver-sanders self-assigned this Jul 19, 2018

oliver-sanders unassigned matthewrmshin Jul 26, 2018

oliver-sanders modified the milestones: soon, next release Oct 12, 2018

This was referenced Oct 15, 2018

Formalise Suite Parameters #2799

Closed

Auto stop-restart #2809

Merged

kinow mentioned this issue Nov 21, 2018

Propagate host info when socket errors occur #2875

Closed

hjoliver closed this as completed in #2809 Nov 22, 2018
