Unclear user feedback in case Client-Scheduler connection breaks up #5666

Closed
fjetter opened this issue Jan 18, 2022 · 2 comments
Labels
bug (Something is broken), diagnostics, good second issue (Clearly described, educational, but less trivial than "good first issue")

Comments

@fjetter
Member

fjetter commented Jan 18, 2022

Description

When the connection between Client and Scheduler breaks, the log messages are not very helpful for the typical user; rather than aiding debugging, they cause more confusion. Feedback is sometimes delayed (only triggered after a timeout), causation is not transparent, and log messages are missing or emitted at the wrong level.

A) Confusing CancelledError

The code below raises a CancelledError after a timeout of 30s. This is a cryptic exception that most users cannot relate to. If a persist is part of the chain of commands, the timeout is not even logged; only a CancelledError is raised, so the disconnect is entirely obscured.

Instead of a CancelledError, I would expect a more user-friendly exception telling the user that the client is not in a running state, which is most likely caused by a disconnect from the scheduler.

CancelledError after timeout

```python
import dask.array as da
import distributed

# The client outlives the cluster: when the `with` block exits, the
# scheduler shuts down while the client still points at it.
with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)

x = da.ones(10, chunks=5)
x.compute()
```

Causes

```
distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
```
CancelledError with persist

```python
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# The persisted data lived on the now-closed cluster; using it afterwards
# surfaces only a bare CancelledError.
y = x.sum()
y.compute()
```

B) Client reconnecting to (new) scheduler

If a reconnect finishes successfully, this can go entirely unnoticed:

```python
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1, scheduler_port=5678) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# New cluster, same address: the client silently reconnects to it,
# but the new scheduler knows nothing about x.
cluster = distributed.LocalCluster(n_workers=1, scheduler_port=5678)
assert client.status == "running"

x.sum().compute()  # Boom
```

[screenshot of the resulting traceback]

Expected behaviour

  • Clear logging stating that a Scheduler<->Client disconnect happened, at a log level high enough that it is not dropped in common setups (e.g. Jupyter Lab / default config); see the sketch below for the kind of message meant here
  • Clear exception messages indicating what is wrong and where to look
  • Both persist and compute calls should provide a clear message

Note:
When only using persist, the log only appears after a timeout of 30s (after giving up on the reconnect)
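
As an illustration of the kind of message these bullets ask for, here is a minimal sketch using a plain `logging` call. Both the wording and the emit site are hypothetical; nothing like this exists in distributed today:

```python
import logging

logger = logging.getLogger("distributed.client")

# Hypothetical example only: illustrates the level and content of the
# feedback the bullet points above ask for, not an existing log message.
logger.warning(
    "Lost connection to scheduler at %s; all pending futures have been "
    "cancelled. Re-submit persist()/compute() calls once the client has "
    "reconnected.",
    "tcp://127.0.0.1:5678",
)
```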

fjetter added the bug, diagnostics, and good second issue labels on Jan 18, 2022
graingert added a commit to graingert/distributed that referenced this issue Jan 20, 2022
In dask#4617 and dask#5666 an asyncio.gather call isn't correctly awaited and logs the following unhandled asyncio error:

```
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
```

This exception happens because on reconnect `_close` cancels itself before calling gather:
https://github.com/dask/distributed/blob/feac52b49292781e78beff8226407f3a5f2e563e/distributed/client.py#L1335-L1343
`_handle_report()` calls `_reconnect()`, which calls `_close()`, which then cancels itself.

`self.coroutines` can only ever contain one task, `_handle_report`, and so can be removed in favour of explicitly tracking the `_handle_report` task.
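
A rough sketch of that idea follows; this is a simplified illustration under assumed names (`MiniClient`), not the actual patch:

```python
import asyncio
import contextlib

# Simplified sketch, not the actual patch: track the single _handle_report
# task explicitly instead of keeping a self.coroutines list, so _close can
# cancel and await exactly that task.
class MiniClient:
    def __init__(self):
        self._handle_report_task = None

    async def start(self):
        self._handle_report_task = asyncio.ensure_future(self._handle_report())

    async def _handle_report(self):
        while True:
            # stand-in for processing messages from the scheduler
            await asyncio.sleep(1)

    async def _close(self):
        task = self._handle_report_task
        if task is not None and not task.done():
            task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await task
```
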
graingert added a commit to graingert/distributed that referenced this issue Jan 21, 2022 (same commit message as above)
fjetter pushed a commit that referenced this issue Jan 25, 2022 (same commit message as above)
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this issue Feb 1, 2022 (…5672) (same commit message as above)
@davidkell

davidkell commented Apr 3, 2022

Hey @fjetter and @hayesgb, taking a look at this one. It's an issue I've experienced first-hand running Dask on GKE, e.g. when nodes are pre-empted or we force a re-deploy. IMO fixing it would be a big win, especially for new users.

Why it happens

When a client loses the connection to the scheduler, and attempts to reconnect, all futures are cancelled and removed from the client.futures dict.

If the client later attempts to work with these futures (e.g. futures referenced elsewhere in user code), it cannot find them in client.futures, and the default strategy in the codebase is to throw a CancelledError without further information.
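
Roughly, the pattern in question looks like this; a simplified illustration, not the actual distributed source:

```python
from asyncio import CancelledError

# Simplified illustration, not the actual distributed source: a key that
# is missing from client.futures (because it was dropped during a
# reconnect) surfaces only as a bare CancelledError.
def lookup(client, key):
    state = client.futures.get(key)
    if state is None:
        raise CancelledError(key)
    return state
```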

Interestingly, for the two examples above this condition is reached in different places across the dask and distributed codebases:

  • cancelled after timeout: distributed.client.Client._gather
  • cancelled after persist: dask.highlevelgraph.Layer.__dask_distributed_pack__

Short term proposal

As an initial fix, we could introduce logging and an exception when this condition is reached, explaining to the user that the future does not exist in the client and that, most likely, this happened because their client re-connected (e.g. due to a short network disconnect).
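
A minimal sketch of what that could look like, assuming a hypothetical helper wrapped around the client.futures lookup; `FutureLostError` and `lookup_future` are illustrative names, not part of the distributed API:

```python
import logging

logger = logging.getLogger("distributed.client")


class FutureLostError(RuntimeError):
    """Hypothetical exception: the client no longer tracks this future."""


def lookup_future(client, key):
    # Hypothetical helper, not the actual distributed API: turn a missing
    # client.futures entry into an explanatory error instead of a bare
    # CancelledError.
    try:
        return client.futures[key]
    except KeyError:
        msg = (
            f"Future {key!r} is no longer known to this client. This usually "
            "means the client lost its connection to the scheduler and "
            "reconnected, which cancels and drops all previously submitted "
            "futures."
        )
        logger.error(msg)
        raise FutureLostError(msg) from None
```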

We'd need to introduce it to at least two points (see above), and possibly more - I see dictionary lookups on client.futures in 20+ places across the distributed and dask codebases. (Or I guess we write a custom dictionary class that inherits from default dict to implement this? Feels nasty)

We'd also need to be confident that this is the only likely explanation for why a future is not in client.futures. Reading through the code, I think that is a safe assumption, but for completeness here are the other reasons it can happen:

  1. The future was created by another client. This is handled by a separate exception.
  2. The reference count on the future went to 0. Given how reference counting works, that shouldn't be an issue.
  3. The user or a client library mutates client.futures manually. Not much we can do about that...

It would work ... however, it does feel like a band-aid fix. The need to add logic in multiple places usually suggests there's a deeper issue. In this case, it would be better to have guarantees on when a future will be in client.futures.

Longer term solution?

As discussed in #5663, just keeping old futures around will break things, due to inconsistency between the client and scheduler. For example, the client might get stuck waiting on the scheduler for a future that it doesn't know about.

However, in #5667, @fjetter has explored adding logic for reconnecting without wiping client.futures on short network disconnects. TLDR: It is doable, we just need to think carefully about retaining consistency.

I think that if that gets implemented successfully, it would automatically solve this issue as well. Why? It would enable us to guarantee that while the client is open or trying to reconnect, any future will be in client.futures; and when the client is closed, client.futures is empty.

Taking the "cancelled after timeout" example, the behaviour you'd get is:

  • client hangs for 30 seconds trying to reconnect
  • client closes and fails

It's now transparent to the user what went wrong, and we got that for free.

Now #5667 is a much bigger piece of work - so perhaps the right strategy is the short-term solution at a few key points (as a quick UX win), while working on the long-term solution that eventually supersedes this. Thoughts?

@fjetter
Member Author

fjetter commented Jul 16, 2024

closed by #8705

xref #8690

fjetter closed this as completed Jul 16, 2024