Auto shutdown & restart #988

Closed
dpmatthews opened this issue Jun 19, 2014 · 7 comments · Fixed by #2809

@dpmatthews
Contributor

The fact that the cylc suite daemon needs to keep running for the entire duration of a suite can cause problems in some circumstances, for instance:

  • Some platforms where users want to run suites have been known to automatically kill long running processes. In this case it would be good if you could configure the suite to automatically shut-down & restart, e.g. every 24 hours.
  • Some platforms have scheduled reboots, e.g. every month. In this case it would be good if the suite could check on a regular basis whether the current host is still present in a list of permitted suite hosts and, if not, shut down and restart on a new host.

Both of these could probably be done using a special task in a suite but it would be better if this support could be built into cylc.

@dpmatthews added this to the "later" milestone Jun 19, 2014

@arjclark
Contributor

Having a think about this, how about something along these lines:

site.rc

...
[cylc]
invalid-hosts=hostX # List of hosts that suites should not run on
host-check=P1D # Interval at which to check whether the current host is still valid
[[event hooks]]
invalid-host=<some command> # site configurable default command
...

The key thing here, I guess, being that the site can set the appropriate action to take on a suite detecting its host as being "invalid".

For cylc alone a sensible set of commands is easy enough, but when you factor in something like Rose handling the running of suites, you need to be able to specify something a bit smarter (e.g. to cope with the change in $ROSE_ORIG_HOST), so I think it's best to keep it configurable.

Whether you'd want users to be able to override the event hook for this is up for debate I guess.

I don't think specifying "valid" hosts is trivial, though, as you'd need to be able to specify regexes to account for things like being able to run on any user's desktop (e.g. dtp*). A "get off these specific machines" list feels more straightforward. As far as the restart goes, you'd probably just specify the method for selecting a valid host under your invalid-host handler.
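
To make the valid-list versus invalid-list trade-off concrete, here is a minimal sketch of the host check. It assumes shell-style glob patterns (like the dtp* example) rather than full regexes, and the function names are hypothetical; none of this is existing cylc code.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional


def host_matches(host: str, patterns: Iterable[str]) -> bool:
    """Return True if the host name matches any shell-style pattern."""
    return any(fnmatchcase(host, pattern) for pattern in patterns)


def host_is_allowed(
    host: str,
    invalid_patterns: Iterable[str],
    valid_patterns: Optional[Iterable[str]] = None,
) -> bool:
    """Decide whether a suite may keep running on this host.

    The invalid list always wins. The valid list is optional:
    if it is omitted, every host not listed as invalid is allowed.
    """
    if host_matches(host, invalid_patterns):
        return False
    if valid_patterns is None:
        return True
    return host_matches(host, valid_patterns)


# A suite on cylcserver02 should move; one on a desktop (dtp*) may stay.
print(host_is_allowed("cylcserver02", ["cylcserver02"]))
print(host_is_allowed("dtp123", ["cylcserver02"], ["dtp*"]))
```

With only an invalid list the site never has to enumerate desktops at all, which is the main argument for the "get off these specific machines" approach.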

So, e.g. in a Rose context you might end up with something like this:

site.rc

...
[cylc]
invalid-hosts=cylcserver02
host-check=P1D 
[[event hooks]]
invalid-host=rose suite-shutdown --name=$SUITE_ID -y -- -v; rose suite-restart --name=$SUITE_ID
...

(Though that might bork the setting of $ROSE_ORIG_HOST which is a different problem)

@hjoliver
Member

hjoliver commented Jun 15, 2016

[meeting] - a suite daemon can shut itself down, but (obviously) can't revive itself when dead. So, at least in the first instance, we should just provide appropriate early shutdown options and allow the user to handle the restarts e.g. via cron. It would be easy enough to allow shutdowns to be ordered after:

  • a single iteration through the scheduling loop
  • N iterations through the scheduling loop
  • some wallclock interval

Probably needs to be "shutdown --now".
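
The three shutdown triggers listed above could be sketched roughly as follows. This is only an illustration: ShutdownPolicy and run_loop are hypothetical names, the step callable stands in for one pass through the scheduling loop, and the clock is injectable purely so the behaviour can be checked without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ShutdownPolicy:
    """Hypothetical early-shutdown rule (not an existing cylc setting)."""

    max_iterations: Optional[int] = None  # 1 means "a single iteration"
    max_seconds: Optional[float] = None   # wallclock limit

    def should_stop(self, iterations_done: int, elapsed_seconds: float) -> bool:
        """Return True once either configured limit has been reached."""
        if self.max_iterations is not None and iterations_done >= self.max_iterations:
            return True
        if self.max_seconds is not None and elapsed_seconds >= self.max_seconds:
            return True
        return False


def run_loop(
    policy: ShutdownPolicy,
    step: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step() repeatedly until the policy asks for shutdown.

    Returns the number of iterations completed. With neither limit
    set, the loop never stops by itself.
    """
    start = clock()
    iterations = 0
    while True:
        step()
        iterations += 1
        if policy.should_stop(iterations, clock() - start):
            return iterations


# Stop after three passes through a (dummy) scheduling loop.
print(run_loop(ShutdownPolicy(max_iterations=3), step=lambda: None))
```

The limits are checked after each pass, so a wallclock limit can be overrun by up to one iteration.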

Could we automatically edit the user's crontab to arrange restarts at the right intervals? This might be appropriate if we can un-edit the crontab once the suite has run to completion, and/or we store completed status in the suite DB so that attempted restarts can be aborted immediately.
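
As a rough sketch of the "store completed status in the suite DB" idea: a cron-driven restart could check a flag before doing anything else. The suite_status table and both function names here are hypothetical and do not reflect the real cylc database schema; an in-memory SQLite database stands in for the suite DB.

```python
import sqlite3


def mark_completed(conn: sqlite3.Connection) -> None:
    """Record that the suite has run to completion (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS suite_status "
        "(key TEXT PRIMARY KEY, value TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO suite_status (key, value) VALUES ('completed', '1')"
    )
    conn.commit()


def is_completed(conn: sqlite3.Connection) -> bool:
    """Return True if a previous run recorded completion.

    A database without the status table is treated as "not completed".
    """
    has_table = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'suite_status'"
    ).fetchone()
    if has_table is None:
        return False
    row = conn.execute(
        "SELECT value FROM suite_status WHERE key = 'completed'"
    ).fetchone()
    return row is not None and row[0] == "1"


conn = sqlite3.connect(":memory:")
if is_completed(conn):
    print("suite already completed; aborting restart")
else:
    print("suite not completed; restart may proceed")
```

A restart wrapper invoked from cron would call is_completed first and exit immediately if it returns True, so a stale crontab entry does no harm.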

@arjclark
Contributor

Probably needs to be "shutdown --now"

I think it has to be, to prevent some stuck task somewhere gumming up the system or holding up a timely shutdown.

@hjoliver
Member

@dpmatthews reports he was primarily interested in this use-case:

  • A running suite daemon should periodically check (e.g. the site config file as per @arjclark's comment above) that it is still running on a "valid" suite server. If not, it should shut itself down, and as a final hurrah, cause itself to be restarted on a valid server.

In this case, an external means of restarting (e.g. cron) would not be required.

This would be a great help for site cylc server maintenance. It is quite a high priority, but it is somewhat dependent on #1885 ("rose suite-run" migration).

@hjoliver
Member

Another bullet point for the revive-from-the-dead use case:

  • on start-up, continue until there is nothing to do (e.g. waiting on a clock-trigger for the next cycle), then shut down.

This would allow operational suites to not exist for some time between cycles, rather than staying alive in a purely waiting state. Each new cycle would presumably be kicked off with cron.

@oliver-sanders
Member

Options for checking whether the suite needs to shutdown/restart:

  1. Check whether the server the suite is running on is still listed as a valid host (Incorporate suite host selection logic #2693).

    This means that if we remove a host from the list, all suites running on that host will jump ship. This might not be the intended effect; for instance, we very occasionally remove hosts from the pool if they are unhealthy or if we want them to drain naturally. It is probably better to come up with some sort of "condemned host" setting to make this more explicit and prevent unintended consequences.

    Once a host is marked as "condemned" all suites will try to jump ship at once. This is a problem because:

    • If using a ranking method other than "random" this is likely to result in one host in the pool getting swamped as it picks up all the suites from the "condemned" host.
    • The suite startup process can be quite heavy for suites with large configurations. The impact of many suites starting up simultaneously could cause a server slow down.

    To get around this we could have suites wait for a random interval before shutting down. Not pretty, but quite functional. A few minutes' lag isn't going to be a problem?

  2. Instruct suites to shut down using HTTP(s) comms.

    Write a simple script that calls the existing Cylc interfaces. This is very little work, but the script would have to be run by a sysadmin (in order to get access to the suites' passphrase files).

    The admin could send shutdown messages across port ranges in chunks, so that the suites do not all jump ship at the same time.

  3. Flag file

    Much the same as 1. but with a file serving as the trigger to jump ship.

I believe we were leaning towards option 1? Does the configurable random delay seem like a sensible way forward?
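
For illustration, the configurable random delay could be as simple as the sketch below. The function name and the idea of a single "maximum delay" setting are assumptions, not an existing cylc option; the random generator is injectable only so the behaviour is reproducible when testing.

```python
import random
from typing import Optional


def restart_delay(
    max_delay_seconds: float, rng: Optional[random.Random] = None
) -> float:
    """Pick a random wait, in seconds, before a suite leaves a condemned host.

    Each suite draws its own delay uniformly from 0 to max_delay_seconds,
    so suites leaving the same host are spread out over that window
    rather than all restarting at once.
    """
    if max_delay_seconds < 0:
        raise ValueError("max_delay_seconds must not be negative")
    if rng is None:
        rng = random.Random()
    return rng.uniform(0.0, max_delay_seconds)


# Each suite would sleep for its own delay before shutting down,
# e.g. with a five-minute window:
delay = restart_delay(300.0)
print(f"waiting {delay:.0f}s before shutdown and restart")
```

This spreads out the start-up load, but it does not by itself stop a non-random ranking method from sending every suite to the same new host.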

@matthewrmshin
Contributor

Option 1 is preferred. The global configuration can be reloaded at health check intervals (the current default is PT10M or something like that). This should then give a suite an indication of whether its host is being drained or not. Agree that it may be sensible to stagger suite restarts (if they are not staggered already due to individual health check intervals).
