Unclear user feedback in case Client-Scheduler connection breaks up #5666

Closed
fjetter opened this issue Jan 18, 2022 · 2 comments
Labels
bug (Something is broken), diagnostics, good second issue (Clearly described, educational, but less trivial than "good first issue")

Comments

@fjetter
Member

fjetter commented Jan 18, 2022

Description

When the connection between Client and Scheduler breaks, the log messages are not very helpful for the typical user; rather than aiding debugging, they cause more confusion. Feedback is sometimes delayed (only triggered after a timeout), causation is not transparent, and log messages are missing or emitted at the wrong level.

A) Confusing CancelledError

The code below raises a CancelledError after a timeout of 30s. This is a cryptic exception that most users cannot relate to. If a persist is part of the chain of commands, the timeout is not even logged; only a CancelledError is raised, so the disconnect is entirely obscured.

Instead of a CancelledError, I would expect a more user-friendly exception telling the user that the client is not in a running state, which is most likely caused by a disconnect from the scheduler.

CancelledError after timeout

```python
import dask.array as da
import distributed

# The client outlives the cluster: when the `with` block exits, the
# scheduler shuts down while the client still points at it.
with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)

x = da.ones(10, chunks=5)
x.compute()
```

Causes

```
distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
```
CancelledError with persist

```python
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# The persisted data lived on the now-closed cluster; using it afterwards
# surfaces only a bare CancelledError.
y = x.sum()
y.compute()
```

B) Client reconnecting to (new) scheduler

If a reconnect finishes successfully, this can go entirely unnoticed:

```python
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1, scheduler_port=5678) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# New cluster, same address: the client silently reconnects to it,
# but the new scheduler knows nothing about x.
cluster = distributed.LocalCluster(n_workers=1, scheduler_port=5678)
assert client.status == "running"

x.sum().compute()  # Boom
```

[screenshot of the resulting traceback]

Expected behaviour

  • Clear logging stating that a Scheduler<->Client disconnect happened, at a log level high enough that it is not dropped in common setups (e.g. Jupyter Lab / default config); see the sketch below for the kind of message meant here
  • Clear exception messages indicating what is wrong and where to look
  • Both persist and compute calls should provide a clear message

Note:
When only using persist, the log only appears after a timeout of 30s (after giving up on the reconnect)
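
As an illustration of the kind of message these bullets ask for, here is a minimal sketch using a plain `logging` call. Both the wording and the emit site are hypothetical; nothing like this exists in distributed today:

```python
import logging

logger = logging.getLogger("distributed.client")

# Hypothetical example only: illustrates the level and content of the
# feedback the bullet points above ask for, not an existing log message.
logger.warning(
    "Lost connection to scheduler at %s; all pending futures have been "
    "cancelled. Re-submit persist()/compute() calls once the client has "
    "reconnected.",
    "tcp://127.0.0.1:5678",
)
```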

fjetter added the bug, diagnostics, and good second issue labels on Jan 18, 2022
graingert added a commit to graingert/distributed that referenced this issue Jan 20, 2022
In dask#4617 and dask#5666 an asyncio.gather call isn't correctly awaited and logs the following unhandled asyncio error:

```
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
```

This exception happens because on reconnect `_close` cancels itself before calling gather:
https://github.com/dask/distributed/blob/feac52b49292781e78beff8226407f3a5f2e563e/distributed/client.py#L1335-L1343
`_handle_report()` calls `_reconnect()`, which calls `_close()`, which then cancels itself.

`self.coroutines` can only ever contain one task, `_handle_report`, and so can be removed in favour of explicitly tracking the `_handle_report` task.
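
A rough sketch of that idea follows; this is a simplified illustration under assumed names (`MiniClient`), not the actual patch:

```python
import asyncio
import contextlib

# Simplified sketch, not the actual patch: track the single _handle_report
# task explicitly instead of keeping a self.coroutines list, so _close can
# cancel and await exactly that task.
class MiniClient:
    def __init__(self):
        self._handle_report_task = None

    async def start(self):
        self._handle_report_task = asyncio.ensure_future(self._handle_report())

    async def _handle_report(self):
        while True:
            # stand-in for processing messages from the scheduler
            await asyncio.sleep(1)

    async def _close(self):
        task = self._handle_report_task
        if task is not None and not task.done():
            task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await task
```
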
graingert added a commit to graingert/distributed that referenced this issue Jan 21, 2022 (same commit message as above)
fjetter pushed a commit that referenced this issue Jan 25, 2022 (same commit message as above)
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this issue Feb 1, 2022 (…5672) (same commit message as above)
@davidkell

davidkell commented Apr 3, 2022

Hey @fjetter and @hayesgb, taking a look at this one. It's an issue I've experienced first-hand running Dask on GKE, e.g. when nodes are pre-empted or we force a re-deploy. IMO fixing it would be a big win, especially for new users.

Why it happens

When a client loses the connection to the scheduler, and attempts to reconnect, all futures are cancelled and removed from the client.futures dict.

If the client later attempts to work with these futures (e.g. futures referenced elsewhere in user code), it cannot find them in client.futures, and the default strategy in the codebase is to throw a CancelledError without further information.
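
Roughly, the pattern in question looks like this; a simplified illustration, not the actual distributed source:

```python
from asyncio import CancelledError

# Simplified illustration, not the actual distributed source: a key that
# is missing from client.futures (because it was dropped during a
# reconnect) surfaces only as a bare CancelledError.
def lookup(client, key):
    state = client.futures.get(key)
    if state is None:
        raise CancelledError(key)
    return state
```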

Interestingly, for the two examples above this condition is reached in different places across the dask and distributed codebases:

  • cancelled after timeout: distributed.client.Client._gather
  • cancelled after persist: dask.highlevelgraph.Layer.__dask_distributed_pack__

Short term proposal

As an initial fix, we could introduce logging and an exception when this condition is reached, explaining to the user that the future does not exist in the client and that, most likely, this happened because their client re-connected (e.g. due to a short network disconnect).
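
A minimal sketch of what that could look like, assuming a hypothetical helper wrapped around the client.futures lookup; `FutureLostError` and `lookup_future` are illustrative names, not part of the distributed API:

```python
import logging

logger = logging.getLogger("distributed.client")


class FutureLostError(RuntimeError):
    """Hypothetical exception: the client no longer tracks this future."""


def lookup_future(client, key):
    # Hypothetical helper, not the actual distributed API: turn a missing
    # client.futures entry into an explanatory error instead of a bare
    # CancelledError.
    try:
        return client.futures[key]
    except KeyError:
        msg = (
            f"Future {key!r} is no longer known to this client. This usually "
            "means the client lost its connection to the scheduler and "
            "reconnected, which cancels and drops all previously submitted "
            "futures."
        )
        logger.error(msg)
        raise FutureLostError(msg) from None
```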

We'd need to introduce it to at least two points (see above), and possibly more - I see dictionary lookups on client.futures in 20+ places across the distributed and dask codebases. (Or I guess we write a custom dictionary class that inherits from default dict to implement this? Feels nasty)

We'd also need to be confident that this is the only likely explanation for why a future is not in client.futures. Reading through the code, I think that is a safe assumption, but for completeness here are the other reasons it can happen:

  1. The future was created by another client. This is handled by a separate exception.
  2. The reference count on the future went to 0. Given how reference counting works, that shouldn't be an issue.
  3. The user or a client library mutates client.futures manually. Not much we can do about that...

It would work ... however, it does feel like a band-aid fix. The need to add logic in multiple places usually suggests there's a deeper issue. In this case, it would be better to have guarantees on when a future will be in client.futures.

Longer term solution?

As discussed in #5663, just keeping old futures around will break things, due to inconsistency between the client and scheduler. For example, the client might get stuck waiting on the scheduler for a future that it doesn't know about.

However, in #5667, @fjetter has explored adding logic for reconnecting without wiping client.futures on short network disconnects. TLDR: It is doable, we just need to think carefully about retaining consistency.

I think that if that gets implemented successfully, it would automatically solve this issue as well. Why? It would enable us to guarantee that while the client is open or trying to reconnect, any future will be in client.futures; and when the client is closed, client.futures is empty.

Taking the "cancelled after timeout" example, the behaviour you'd get is:

  • client hangs for 30 seconds trying to reconnect
  • client closes and fails

It's now transparent to the user what went wrong, and we got that for free.

Now #5667 is a much bigger piece of work - so perhaps the right strategy is the short-term solution at a few key points (as a quick UX win), while working on the long-term solution that eventually supersedes this. Thoughts?

@fjetter
Member Author

fjetter commented Jul 16, 2024

closed by #8705

xref #8690

fjetter closed this as completed Jul 16, 2024