Currently the network client(s) raise `WorkflowStopped` on failure to load the contact file; however, this could also mean that there is no such workflow to load the contact file for.
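On a single host, the two meanings can be separated with a local check before raising. Below is a minimal sketch of that idea. It assumes the run directory is `<cylc-run>/<workflow_id>` and that the contact file is a `KEY=VALUE` file at `.service/contact` inside it. `WorkflowNotFound` is a hypothetical exception, and `WorkflowStopped` is defined locally here as a stand-in for the real one.

```python
from pathlib import Path


class WorkflowStopped(Exception):
    """Stand-in for cylc's WorkflowStopped, defined locally for this sketch."""


class WorkflowNotFound(Exception):
    """Hypothetical error: the workflow does not exist on this host."""


def load_contact_file(workflow_id: str, cylc_run: Path) -> dict:
    """Load the contact file, raising an error that says *why* it failed.

    Assumes the layout <cylc_run>/<workflow_id>/.service/contact.
    """
    run_dir = cylc_run / workflow_id
    if not run_dir.is_dir():
        # No run directory at all: no such workflow (on this host).
        raise WorkflowNotFound(workflow_id)
    contact = run_dir / ".service" / "contact"
    if not contact.is_file():
        # Run directory exists but no contact file: installed, not running.
        raise WorkflowStopped(workflow_id)
    data = {}
    for line in contact.read_text().splitlines():
        # Assumed KEY=VALUE format; lines without "=" are skipped.
        key, sep, value = line.partition("=")
        if sep:
            data[key.strip()] = value.strip()
    return data
```

This only answers the question for the host it runs on, so it cannot help when the workflow exists on one host but not the other.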
Unfortunately, it is hard to tell the two possibilities apart in all cases:

- The workflow might exist on the scheduler host but not on the remote.
- The workflow might exist on the remote but not on the scheduler host.
We keep remotes in sync through the contact file; however, because it isn't needed, we do not sync the contact file to "polling" remote hosts. We would need to extend this signalling mechanism to polling hosts to be able to differentiate these cases. We used to have a "contact 2" file; although I don't think it was used for this purpose, it would work for this case.
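A rough sketch of what extending the signalling to polling hosts could look like, in the spirit of the old "contact 2" file. Everything here is hypothetical: the marker path `.service/running` and the `remote_init` / `remote_tidy` helpers are illustrative names, not existing cylc code. The scheduler would create the marker when it initialises the remote and remove it when it tidies the remote, so a client on that remote can tell "no such workflow" from "installed but stopped".

```python
from enum import Enum
from pathlib import Path


class RemoteState(Enum):
    NOT_FOUND = "no such workflow on this host"
    STOPPED = "workflow installed here but not running"
    RUNNING = "scheduler has signalled that the workflow is running"


# Hypothetical marker file, relative to the workflow run directory.
MARKER = Path(".service") / "running"


def classify_remote(run_dir: Path) -> RemoteState:
    """Client side: interpret the (hypothetical) signal on a polling remote."""
    if not run_dir.is_dir():
        return RemoteState.NOT_FOUND
    if (run_dir / MARKER).exists():
        return RemoteState.RUNNING
    return RemoteState.STOPPED


def remote_init(run_dir: Path) -> None:
    """Scheduler side (hypothetical): signal that the workflow has started."""
    marker = run_dir / MARKER
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()


def remote_tidy(run_dir: Path) -> None:
    """Scheduler side (hypothetical): clear the signal on shutdown."""
    (run_dir / MARKER).unlink(missing_ok=True)  # Python 3.8+
```

Note that `RUNNING` only means the scheduler *said* it was running; if the scheduler dies or remote tidy never reaches the host, the marker goes stale.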
Note that this signalling mechanism is not perfect and can fail:

- The scheduler gets killed.
- Remote tidy fails due to network issues.
On the scheduler hosts we can perform an SSH/process listing to check whether the workflow is still alive, and tidy up the contact file if not. We can't pull off these tricks from polling remote hosts (and probably shouldn't try from TCP/SSH+TCP remote hosts, although it's likely we do).
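For illustration, here is a simplified, local-only version of that liveness check. It is a sketch, not how cylc does it: it skips the SSH step and any check of the process command line, and assumes the contact file is `KEY=VALUE` with the scheduler PID under `CYLC_WORKFLOW_PID`. It uses POSIX `os.kill(pid, 0)`, which tests for a process's existence without sending a signal.

```python
import errno
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists on *this* host (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            return False  # no such process
        if exc.errno == errno.EPERM:
            return True  # process exists but belongs to another user
        raise
    return True


def tidy_if_stale(contact_file: Path) -> bool:
    """Remove the contact file if its scheduler PID is dead.

    Returns True if the file was removed. Assumes the scheduler ran on
    this host; a real check would SSH to the host named in the file.
    """
    fields = {}
    for line in contact_file.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    pid = int(fields["CYLC_WORKFLOW_PID"])  # assumed key name
    if pid_is_alive(pid):
        return False
    contact_file.unlink()
    return True
```

A bare PID check can be fooled by PID reuse, which is one reason to also compare the process's command line when doing this for real.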
Reporting a non-existent workflow as stopped can be somewhat confusing, e.g.:

- `cylc tui`: TUI: non-existent workflows are reported as stopped #4715
- `cylc stop` (and other "live" commands): stop: fix tracebacks #4776 (review)
See also: #4776 (comment)
Pull requests welcome!