client: tell apart "no such workflow" from "workflow stopped" #4798

oliver-sanders · 2022-04-04T09:17:20Z

Currently the network client(s) raise WorkflowStopped when they fail to load the contact file. However, a missing contact file could also mean that there is no such workflow to load the contact file for.

This can be somewhat confusing, e.g.

cylc tui - TUI: non-existent workflows are reported as stopped #4715
cylc stop (and other "live" commands) - stop: fix tracebacks #4776 (review)
workflow porting to polling job hosts - stop: fix tracebacks #4776 (comment)
etc
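To illustrate the distinction being asked for, here is a minimal sketch for the simple case where the run directory is visible on the local filesystem. It is not the actual client code: `WorkflowNotFound` and `load_contact_file` are hypothetical names, `WorkflowStopped` is a stand-in class, and the `.service/contact` location and `KEY=VALUE` file format are assumptions.

```python
from pathlib import Path


class WorkflowStopped(Exception):
    """Stand-in: the workflow is installed but has no contact file."""


class WorkflowNotFound(Exception):
    """Hypothetical: no workflow with this ID is installed here."""


def load_contact_file(run_dir_root: Path, workflow_id: str) -> dict:
    """Load the contact file, telling "no such workflow" from "stopped".

    Assumes the contact file lives at <run dir>/.service/contact and
    holds KEY=VALUE lines (both are assumptions for this sketch).
    """
    run_dir = run_dir_root / workflow_id
    if not run_dir.is_dir():
        # Nothing installed under this ID: not the same thing as "stopped".
        raise WorkflowNotFound(workflow_id)

    contact = run_dir / ".service" / "contact"
    if not contact.is_file():
        # The run directory exists but no scheduler has left a contact file.
        raise WorkflowStopped(workflow_id)

    fields = {}
    for line in contact.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```

A caller could then report the two failures differently, e.g. "workflow not found" versus "workflow is not running". This only works where the run directory can be inspected directly.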

Unfortunately it is hard to tell the two possibilities apart for all cases:

Workflow might exist on the scheduler host but not on the remote.
Workflow might exist on the remote but not on the scheduler host.

We keep remotes in sync through the contact file. However, because it is not needed there, we do not sync the contact file to "polling" remote hosts. We would need to extend this signalling mechanism to polling hosts in order to differentiate these cases. We used to have a "contact 2" file; although I don't think it was used for this purpose, it would work for this case.

Note that this signalling mechanism is not perfect and can fail:

Scheduler gets killed.
Remote tidy fails due to network issues.

On the scheduler hosts we can perform an SSH/process listing to check if the workflow is still alive and tidy up the contact file if not. We can't pull off these tricks from polling remote hosts (and probably shouldn't try from TCP/SSH+TCP remote hosts, although it's likely we do).
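The liveness check and tidy-up described above could look roughly like the following sketch. It is a simplification under stated assumptions, not the existing implementation: it only checks a local PID on a POSIX system (no SSH), `tidy_stale_contact_file` and `pid_is_alive` are hypothetical helpers, and the `CYLC_WORKFLOW_PID` key and `KEY=VALUE` format are assumed.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def tidy_stale_contact_file(
    contact: Path, pid_key: str = "CYLC_WORKFLOW_PID"
) -> bool:
    """Remove the contact file if the PID recorded in it is dead.

    Returns True if the file was removed. The key name is an assumption.
    """
    pid = None
    for line in contact.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == pid_key:
            pid = int(value.strip())

    if pid is None or pid_is_alive(pid):
        # No PID recorded, or the scheduler still appears to be running.
        return False

    contact.unlink()
    return True
```

A bare PID check can be fooled by PID reuse, so a real check would also want to compare the process command line, and would need to run on the scheduler host. Neither is available from a polling remote host, which is the limitation noted above.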
