Allow Client reconnect without loss of progression #5667
Labels: enhancement, networking, stability
Real-world network infrastructure is not as reliable as one would hope, and short network disconnects do happen. This is particularly true for mobile clients and home networks, but it is also not uncommon in busy professional or cloud networks.
Currently, our entire networking code interprets a disconnecting or broken Comm as a dead remote server and handles it accordingly. If a Client <-> Scheduler connection breaks, both sides react drastically: the Client cancels all of its futures and the Scheduler releases all of that client's tasks.
If both the Client and the Scheduler are still alive, allowing them to reestablish the connection and restore the system to its original, functional state without losing progress can save time and money in ad-hoc execution scenarios and can significantly increase stability for (semi-)automated workflows.
The current drastic behaviour is necessary so that either side can clean up its state in case the remote actually vanishes.
Allowing a graceful reconnect requires us to change the behaviour on both the Scheduler and the Client side.
Scheduler:
ClientState
Usages: Scheduler.send_all, Scheduler.report, Scheduler.cancel_key, Scheduler.restart
Note: A similar message problem appears in worker reconnect scenarios.
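To make the Scheduler-side idea concrete, here is a minimal sketch of a per-client record that buffers outgoing messages while the connection is down, replays them on reconnect, and only reports the client as gone after a grace period. This is an illustration under assumptions, not distributed's actual implementation: `ClientSession`, `grace_period`, `outbox`, and the `on_disconnect` / `on_reconnect` hooks are hypothetical names, and the comm is assumed to be any object with a `send(msg)` method.

```python
import time


class ClientSession:
    """Hypothetical per-client record; NOT distributed's ClientState."""

    def __init__(self, grace_period, clock=time.monotonic):
        self.grace_period = grace_period  # seconds to wait before cleanup (assumed setting)
        self.clock = clock                # injectable clock, makes the class easy to test
        self.comm = None                  # any object with .send(msg), or None if disconnected
        self.outbox = []                  # messages buffered while disconnected
        self.disconnected_at = None       # timestamp of the last disconnect, if any

    def send(self, msg):
        # Deliver immediately when connected, otherwise buffer for later replay.
        if self.comm is not None:
            self.comm.send(msg)
        else:
            self.outbox.append(msg)

    def on_disconnect(self):
        # Keep all state; only remember when the connection was lost.
        self.comm = None
        self.disconnected_at = self.clock()

    def on_reconnect(self, comm):
        # Attach the new comm and replay buffered messages in their original order.
        self.comm = comm
        self.disconnected_at = None
        pending, self.outbox = self.outbox, []
        for msg in pending:
            comm.send(msg)

    def expired(self):
        # True once the client has been gone for longer than the grace period,
        # i.e. the point at which the existing cleanup path would run.
        return (
            self.disconnected_at is not None
            and self.clock() - self.disconnected_at > self.grace_period
        )
```

With something like this, the call sites listed above would go through `send` instead of writing to the comm directly, and the existing cleanup would be triggered only once `expired()` returns true. How large the buffer may grow and how long the grace period should be are open questions.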
Client:
The Client needs to distinguish three different scenarios
All three scenarios should log appropriate information to the user.
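On the Client side, the reconnect attempt itself could look roughly like the sketch below: retry with exponential backoff and log every outcome. It does not model the individual scenarios. `reconnect` and the `connect` argument are hypothetical names; `connect` is assumed to be a coroutine function that raises `OSError` (or a subclass such as `ConnectionResetError`) on failure, which may not match what distributed's comms actually raise.

```python
import asyncio
import logging

logger = logging.getLogger("reconnect-sketch")


async def reconnect(connect, retries=5, base_delay=0.1, max_delay=5.0):
    """Call the hypothetical coroutine function `connect` until it succeeds.

    Returns whatever `connect` returns. Re-raises the last OSError if all
    attempts fail, so the caller can fall back to the current cleanup.
    """
    delay = base_delay
    for attempt in range(1, retries + 1):
        try:
            comm = await connect()
        except OSError as exc:
            logger.warning("Reconnect attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt == retries:
                raise
            # Wait before the next attempt, doubling the delay up to max_delay.
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)
        else:
            logger.info("Reconnected after %d attempt(s)", attempt)
            return comm
```

Only when the final attempt fails would the Client fall back to today's behaviour of cancelling its futures.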
Expected behaviour
Related issues:
The current behaviour is not only fragile but also confusing because of the exception messages it raises; see #5666.