Unclear user feedback in case Client-Scheduler connection breaks up #5666
Comments
From the fix referenced in dask#5672:

In dask#4617 and dask#5666 an `asyncio.gather` call isn't correctly waited on and logs the following unhandled asyncio error:

```
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
```

This exception happens because on reconnect `_close` cancels itself before calling `gather`: https://github.com/dask/distributed/blob/feac52b49292781e78beff8226407f3a5f2e563e/distributed/client.py#L1335-L1343

`_handle_report()` calls `_reconnect()`, which calls `_close()`, which then cancels itself. `self.coroutines` can only ever contain one task, `_handle_report`, and so it can be removed in favour of explicitly tracking the `_handle_report` task.
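A minimal, self-contained sketch of the pattern that description suggests: track the single background task directly and avoid cancel-then-gather from inside the task itself. Attribute and method names here are illustrative assumptions, not the actual distributed implementation.

```python
import asyncio


class SketchClient:
    """Sketch: track the one report-handler task explicitly instead of
    keeping it in a list like ``self.coroutines``."""

    def __init__(self):
        self._handle_report_task = None

    async def _start(self):
        # Keep a direct reference to the single background task.
        self._handle_report_task = asyncio.create_task(self._handle_report())

    async def _handle_report(self):
        while True:
            await asyncio.sleep(1)  # stand-in for reading scheduler reports

    async def _close(self):
        task = self._handle_report_task
        if task is not None and task is not asyncio.current_task():
            # Only cancel-and-await when we are not running *inside*
            # _handle_report itself; otherwise the gather is cancelled
            # along with us and its CancelledError is never retrieved.
            task.cancel()
            await asyncio.gather(task, return_exceptions=True)
```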
Hey @fjetter and @hayesgb, taking a look at this one. It's an issue I've experienced first hand running dask on GKE, e.g. when nodes are pre-empted or we force a re-deploy. Fixing it would IMO be a big win, especially for new users.

Why it happens

When a client loses the connection to the scheduler and attempts to reconnect, all futures are cancelled and removed from the client's futures mapping. If the client later attempts to work with these futures, as defined elsewhere, it cannot find them there. Interestingly, for the two examples above this condition is reached in different parts of the codebase.
Short term proposal

As an initial fix, we could introduce logging and an exception when this condition is reached, explaining to the user that the future does not exist in the client and that, most likely, this happened because their client reconnected (e.g. due to a short network disconnect). We'd need to introduce it at at least two points (see above), and possibly more; there are dictionary lookups on the client's futures mapping in several places. We'd also need to be confident that this is the only likely explanation for why a future is missing from that mapping. A sketch of what such a check could look like is shown below.
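A minimal sketch of that check, assuming the client's futures live in a `futures` dict; the error class and helper function are hypothetical, not existing distributed API:

```python
import logging

logger = logging.getLogger("distributed.client")


class FutureLostOnReconnectError(KeyError):
    """Hypothetical error explaining *why* a future is unknown."""


def lookup_future(client, key):
    # Sketch: wrap the bare dictionary lookup that today surfaces as an
    # opaque KeyError/CancelledError with a descriptive message.
    try:
        return client.futures[key]
    except KeyError:
        msg = (
            f"Future {key!r} is not known to this client. This usually "
            "means the client reconnected to the scheduler (e.g. after a "
            "brief network disconnect), which cancels and drops all futures."
        )
        logger.error(msg)
        raise FutureLostOnReconnectError(msg) from None
```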
It would work ... however, it does feel like a band-aid fix. The need to add logic in multiple places usually suggests there's a deeper issue. In this case, it would be better to have guarantees on when a future will be present in the client's futures mapping.

Longer term solution?

As discussed in #5663, just keeping old futures around will break things, due to inconsistency between the client and scheduler. For example, the client might get stuck waiting on the scheduler for a future that it doesn't know about. However, in #5667, @fjetter has explored adding logic for reconnecting without wiping the client's state. I think that if that gets implemented successfully, it would automatically solve this issue as well. Why? It would enable us to guarantee that while the client is open, or trying to reconnect, any future will be present in the mapping. Taking the "cancelled after timeout" example:
It would now be transparent to the user what went wrong, and we'd get that for free. Now, #5667 is a much bigger piece of work, so perhaps the right strategy is the short-term solution at a few key points (as a quick UX win), while working on the long-term solution that eventually supersedes it. Thoughts?
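To make that guarantee concrete, here is a speculative sketch, explicitly not the implementation in #5667, of a client that keeps its futures mapping intact while reconnecting; the `connect` coroutine and the 30s give-up window are assumptions:

```python
import asyncio


class ReconnectingClient:
    """Speculative sketch: keep the futures mapping across reconnect
    attempts instead of wiping it on disconnect."""

    def __init__(self):
        self.futures = {}        # key -> future; never wiped on disconnect
        self.status = "running"  # "running" | "connecting" | "closed"

    async def _reconnect(self, connect, timeout=30):
        # While status is "connecting", lookups in self.futures still
        # succeed, so the client can report precisely what is pending.
        self.status = "connecting"
        try:
            await asyncio.wait_for(connect(), timeout)
            self.status = "running"  # futures survived the reconnect
        except asyncio.TimeoutError:
            self.status = "closed"
            for key in list(self.futures):
                # Replace the live handle with a descriptive tombstone so
                # a later lookup explains the failure instead of raising a
                # bare KeyError or CancelledError.
                self.futures[key] = f"lost: no reconnect within {timeout}s"
```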
Description
When the connection between Client and Scheduler breaks, the log messages are not very helpful for the typical user: rather than aiding debugging, they cause more confusion. Feedback is sometimes delayed (only triggered after a timeout), causation is not transparent, and log messages are missing or emitted at the wrong level.
A) CancelledError confusing
The below code raises a CancelledError after a timeout of 30s. This is typically a cryptic exception most users cannot relate to. If a `persist` is in the chain of commands, the timeout is not even logged; only a CancelledError is raised, so the disconnect is entirely obfuscated.

Instead of a CancelledError, I would expect a more user-friendly exception telling the user about the non-running status of the client, which is most likely caused by a disconnect from the scheduler.

CancelledError after timeout

Causes

CancelledError with persist
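The issue's original example snippets ("CancelledError after timeout", "Causes", "CancelledError with persist") are collapsed in this capture. Below is a hedged sketch of the kind of reproduction the titles suggest; the local cluster setup and the submitted function are assumptions, not the original code:

```python
import asyncio
from dask.distributed import Client, LocalCluster


async def main():
    # Start a local scheduler/worker pair and connect a client to it.
    cluster = await LocalCluster(n_workers=1, asynchronous=True)
    client = await Client(cluster, asynchronous=True)

    # Submit work, then kill the scheduler to break the connection.
    future = client.submit(lambda x: x + 1, 1)
    await cluster.scheduler.close()

    # After ~30s of failed reconnect attempts this surfaces as a bare
    # CancelledError rather than a message explaining the disconnect.
    await future


asyncio.run(main())
```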
B) Client reconnecting to (new) scheduler
In case a reconnect finishes successfully, this can go entirely unnoticed.
Expected behaviour
Note: when only using persist, the log only appears after a timeout of 30s, i.e. after the client gives up on the reconnect.
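For completeness, a sketch of how that window might be widened; whether the 30s give-up is governed by the `distributed.comm.timeouts.connect` setting is an assumption here, and the scheduler address is a placeholder:

```python
import dask
from dask.distributed import Client

# Assumption: the reconnect give-up window follows the comm connect
# timeout. Raising it gives the client longer to find the scheduler.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

# The Client constructor also accepts an explicit timeout.
client = Client("tcp://scheduler-address:8786", timeout="60s")
```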