Allow Client reconnect without loss of progression #5667

Open
fjetter opened this issue Jan 18, 2022 · 0 comments
Labels: enhancement, networking, stability


fjetter commented Jan 18, 2022

Real-world network infrastructure is not as reliable as one would hope, and short network disconnects do happen. This is particularly true for mobile clients, home networks, etc., but it is also not uncommon in busy professional or cloud networks.

Currently, our entire networking code interprets a disconnecting or broken Comm as a dead remote server and handles it accordingly. If a Client <-> Scheduler connection breaks, both sides react drastically: the Client cancels all futures and the Scheduler releases all tasks.

If both the Client and Scheduler are still alive, allowing them to reestablish the connection and recover the system to its original, functional state without losing progress can save time and money in ad-hoc execution scenarios and can significantly increase stability for (semi-)automated workflows.

The current drastic behaviour is necessary to allow either side to clean up its state in case the remote actually vanishes.

Allowing a graceful reconnect requires us to change the behaviour on both the Scheduler and the Client side.

Scheduler

  • The scheduler should allow for a certain grace period during which a Client is allowed to reconnect.
  • During this period, the client is not operational. ClientState currently does not distinguish between states, so we would likely need to introduce some, e.g. "running" and "lost", and verify all ClientState usages.
  • We will need to buffer all messages sent to this client during that period and resend them once the connection is reestablished. Examples: Scheduler.send_all, Scheduler.report, Scheduler.cancel_key, Scheduler.restart.

Note: A similar message problem appears in worker reconnect scenarios.
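
To make the scheduler-side points more concrete, here is a minimal, self-contained sketch of a client record with a "running"/"lost" status, a grace period, and a message buffer. It is not code from distributed: `BufferedClientState`, `ClientStatus`, `mark_lost`, `reconnect`, the `transport` callable and the 30 second default are all hypothetical names and values chosen for illustration.

```python
import time
from collections import deque
from enum import Enum


class ClientStatus(Enum):
    RUNNING = "running"
    LOST = "lost"


class BufferedClientState:
    """Hypothetical stand-in for a ClientState that tracks a status."""

    def __init__(self, client_key, grace_period=30.0, clock=time.monotonic):
        self.client_key = client_key
        self.grace_period = grace_period  # seconds; illustrative default
        self.clock = clock
        self.status = ClientStatus.RUNNING
        self.lost_since = None
        self.outbox = deque()  # messages held back while the client is lost

    def mark_lost(self):
        """Called when the comm breaks; starts the grace period."""
        self.status = ClientStatus.LOST
        self.lost_since = self.clock()

    def send(self, msg, transport):
        """Deliver right away if running, otherwise buffer for later."""
        if self.status is ClientStatus.RUNNING:
            transport(msg)
        else:
            self.outbox.append(msg)

    def grace_period_expired(self):
        return (
            self.status is ClientStatus.LOST
            and self.clock() - self.lost_since > self.grace_period
        )

    def reconnect(self, transport):
        """Replay buffered messages in order; return False if too late."""
        if self.grace_period_expired():
            # The caller would now release the client's tasks as it does today.
            self.outbox.clear()
            return False
        self.status = ClientStatus.RUNNING
        self.lost_since = None
        while self.outbox:
            transport(self.outbox.popleft())
        return True
```

In this sketch every call site that talks to a client would go through `send`, which is why all ClientState usages would need to be reviewed. An unbounded buffer is a simplification; a real implementation would probably need a size limit or message coalescing.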

Client

The Client needs to distinguish three different scenarios:

  • Reconnect attempt fails. Cancel all futures. Close.
  • New scheduler appears. Again, we need to cancel and remove all local futures.
  • Same scheduler reconnects. Restore state to before connection failure.

All three scenarios should log appropriate information to the user.
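
The three client-side scenarios can be sketched as a small decision function. This assumes the client can compare some stable scheduler identity before and after the reconnect; how that identity is obtained is left open here. `ReconnectOutcome`, `classify_reconnect` and `handle_reconnect` are hypothetical names, not part of the Client API.

```python
import logging
from enum import Enum

logger = logging.getLogger("reconnect-sketch")


class ReconnectOutcome(Enum):
    FAILED = "failed"
    NEW_SCHEDULER = "new-scheduler"
    SAME_SCHEDULER = "same-scheduler"


def classify_reconnect(known_scheduler_id, reconnected_scheduler_id):
    """Map a reconnect attempt to one of the three scenarios.

    ``reconnected_scheduler_id`` is None when no connection could be made.
    """
    if reconnected_scheduler_id is None:
        return ReconnectOutcome.FAILED
    if reconnected_scheduler_id != known_scheduler_id:
        return ReconnectOutcome.NEW_SCHEDULER
    return ReconnectOutcome.SAME_SCHEDULER


def handle_reconnect(known_scheduler_id, reconnected_scheduler_id, future_keys):
    """Return the outcome and the set of future keys that stay valid."""
    outcome = classify_reconnect(known_scheduler_id, reconnected_scheduler_id)
    if outcome is ReconnectOutcome.SAME_SCHEDULER:
        logger.warning(
            "Reconnected to scheduler %s; keeping %d futures",
            known_scheduler_id,
            len(future_keys),
        )
        return outcome, set(future_keys)
    if outcome is ReconnectOutcome.NEW_SCHEDULER:
        logger.warning(
            "Connected to a different scheduler %s; cancelling %d stale futures",
            reconnected_scheduler_id,
            len(future_keys),
        )
    else:
        logger.error(
            "Reconnect failed; cancelling %d futures and closing",
            len(future_keys),
        )
    # In both remaining scenarios no local future survives.
    return outcome, set()
```

Only the same-scheduler case keeps any futures; the other two differ in whether the client stays open afterwards and in what gets logged.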

Expected behaviour

  • The Client can reestablish a connection to the same scheduler and continue its progress if that scheduler is still alive.
  • If the Client can only reconnect to a new scheduler, it needs to invalidate all stale futures.
  • If the Client is gone for good, the scheduler needs to release all tasks, as it does right now.
  • During the outage / grace period, no new futures may be created on the client side.
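
from distributed.utils_test import gen_cluster, inc


@gen_cluster(client=True)
async def test_reconnect_same_scheduler(c, s, a, b):
    f1 = c.submit(inc, 1, key="f1")

    # Simulate an external network blip by aborting the client's comm
    c.scheduler_comm.abort()

    # TODO: Assert log messages about disconnect on level >=WARNING

    # The future should still resolve once the client has reconnected
    assert await f1 == 2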
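
The last expected-behaviour point, no new futures during the outage, could be enforced with a simple guard on the submit path. The following is only a sketch with hypothetical names (`SubmitGuard`, `ClientDisconnectedError`); it does not reflect how Client.submit is implemented.

```python
class ClientDisconnectedError(RuntimeError):
    """Hypothetical error raised when submitting while disconnected."""


class SubmitGuard:
    """Tracks connection status and rejects new work during an outage."""

    def __init__(self):
        self.connected = True
        self.keys = set()  # keys of futures created so far

    def on_disconnect(self):
        self.connected = False

    def on_reconnect(self):
        self.connected = True

    def submit(self, key):
        if not self.connected:
            raise ClientDisconnectedError(
                f"Cannot create future {key!r} while the scheduler "
                "connection is down"
            )
        self.keys.add(key)
        return key
```

Futures created before the outage stay registered in `keys`, so they remain usable after `on_reconnect`. Raising is one option; blocking until the connection is back would be another.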

Related issues:

The current behaviour is not only brittle; the exceptions it raises are also confusing, see #5666.
