Auto shutdown & restart #988
Having a think about this, how about something along these lines: site.rc
The key thing here, I guess, is that the site can set the appropriate action to take when a suite detects that its host is "invalid". For cylc alone a sensible set of commands is easy enough, but when you factor in something like Rose handling the running of suites etc. you need to be able to specify something a bit smarter (e.g. to cope with the change in $ROSE_ORIG_HOST), so I think it's best to keep it configurable. Whether you'd want users to be able to override the event hook for this is up for debate, I guess. I don't think specifying "valid" hosts is trivial though, as you'd need to be able to specify regexes to account for things like being able to run on any user's desktop (e.g. dtp*). A "get off these specific machines" list feels more straightforward. As far as the restart goes, you'd probably just specify the method for selecting a valid host under your invalid-host handler. So, e.g. in a Rose context you might end up with something like this: site.rc
(Though that might bork the setting of $ROSE_ORIG_HOST which is a different problem)
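To illustrate the trade-off discussed above between a "valid hosts" pattern list and a "get off these specific machines" list, here is a minimal Python sketch. It is not cylc code: the function names and the example host names are hypothetical, and shell-style globs (such as `dtp*`) stand in for the regexes mentioned.

```python
from fnmatch import fnmatchcase


def host_is_condemned(hostname, condemned_hosts):
    """Blocklist approach: the host is invalid only if explicitly listed."""
    return hostname in condemned_hosts


def host_is_valid(hostname, valid_patterns):
    """Allowlist approach: the host must match at least one glob pattern."""
    return any(fnmatchcase(hostname, pattern) for pattern in valid_patterns)


# Hypothetical host names, for illustration only.
condemned = {"server02"}
valid = ["dtp*", "server01"]

print(host_is_condemned("server02", condemned))
print(host_is_valid("dtp123", valid))
```

The blocklist needs only exact names, whereas the allowlist has to anticipate every acceptable host through patterns, which is the extra complexity the comment points at.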
[meeting] - a suite daemon can shut itself down, but (obviously) can't revive itself when dead. So, at least in the first instance, we should just provide appropriate early shutdown options and allow the user to handle the restarts e.g. via cron. It would be easy enough to allow shutdowns to be ordered after:
Probably needs to be "shutdown --now". Could we automatically edit the user's crontab to arrange restarts at the right intervals? This might be appropriate if we can un-edit the crontab once the suite has run to completion, and/or we store completed status in the suite DB so that attempted restarts can be aborted immediately.
I think it has to be, to prevent some stuck task somewhere gumming up the system or holding up timely shutdown.
@dpmatthews reports he was primarily interested in this use-case:
In this case, an external means of restarting (e.g. cron) would not be required. This would be a great help for site cylc server maintenance. It is quite a high priority, but is somewhat dependent on #1885 ("rose suite-run" migration).
Another bullet point for the revive-from-the-dead use case:
This would allow operational suites to not exist for some time between cycles, rather than staying alive in a purely waiting state. Each new cycle would presumably be kicked off with cron.
Options for checking whether the suite needs to shut down/restart:
I believe we were leaning towards option 1? Does the configurable random delay seem like a sensible way forward?
Option 1 is preferred. The global configuration can be reloaded at health check intervals (the current default is PT10M or something like that). This should then give the suite an indication of whether its host is being drained or not. Agree that it may be sensible to stagger suite restarts (if they are not staggered already due to individual health check intervals).
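The check-then-stagger idea discussed above could look roughly like the following Python sketch. It is an assumption-laden illustration rather than cylc's implementation: the function name, the `condemned_hosts` collection (imagined as coming from the reloaded global configuration) and the `max_delay_seconds` parameter are all hypothetical.

```python
import random


def restart_delay_if_condemned(hostname, condemned_hosts, max_delay_seconds,
                               rng=random):
    """Return a random delay in seconds before restarting elsewhere,
    or None if the current host is not being drained.

    Intended to be called once per health check, after the (hypothetical)
    condemned host list has been re-read from the global configuration.
    """
    if hostname not in condemned_hosts:
        return None
    # A uniform random delay spreads restarts out, so that suites sharing
    # a drained host do not all migrate at the same moment.
    return rng.uniform(0, max_delay_seconds)


# Hypothetical values, for illustration only.
delay = restart_delay_if_condemned("server02", {"server02"}, 300)
if delay is not None:
    print(f"host is being drained; restart in {delay:.0f}s")
```

If health check intervals already differ enough between suites, `max_delay_seconds` could be set to 0 to disable the extra staggering.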
The fact that the cylc suite daemon needs to keep running for the entire duration of a suite can cause problems in some circumstances, for instance:
Both of these could probably be done using a special task in a suite but it would be better if this support could be built into cylc.