`get_worker` can return the wrong worker with an async local cluster #4959

Comments
Thanks for the detailed description and example test @gjoseph92! As mentioned offline, when … (see distributed/worker.py, lines 3561 to 3567 at 9f4165a) … In the case that …
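The inlined snippet didn't survive this capture. For context, `get_worker` at the time looked roughly like this (a reconstruction, not the exact lines at 9f4165a; `thread_state`, `first`, and `Status` are distributed internals assumed in scope):

```python
# Rough reconstruction of distributed.worker.get_worker around this issue
# (illustrative; see worker.py at 9f4165a for the real code).
def get_worker() -> "Worker":
    try:
        # Fast path: a task thread records its worker in thread-local state.
        return thread_state.execution_state["worker"]
    except AttributeError:
        # Fallback: scan the global instance list. With several workers in
        # one process, this returns whichever running worker comes first.
        try:
            return first(w for w in Worker._instances if w.status == Status.running)
        except StopIteration:
            raise ValueError("No workers found")
```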
Just adding a note here since I didn't see it written down: setting …
Fixes dask#4959 `get_client` was calling the private `Worker._get_client` method when it ran within a task. `_get_client` should really have been called `_make_client`, since it created a new client every time. The simplest correct thing to do instead would have been to use the `Worker.client` property, which caches this instance. In order to pass the `timeout` parameter through though, I changed `Worker.get_client` to actually match its docstring and always return the same instance.
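A minimal, self-contained sketch of the caching behavior this describes (hypothetical names and simplified bodies, not the actual distributed source):

```python
class Client:
    pass


class Worker:
    def __init__(self):
        self._client = None

    def _make_client(self, timeout=None):
        # What the old ``_get_client`` effectively did: a new Client per call.
        return Client()

    def get_client(self, timeout=None):
        # After the fix: create the client once, then always return the
        # same cached instance, matching the docstring's promise.
        if self._client is None:
            self._client = self._make_client(timeout=timeout)
        return self._client


w = Worker()
assert w.get_client() is w.get_client()  # same instance every time
```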
This is probably a good idea in general (xref dask#4959), but it particularly helps with performance deserializing Futures, which have a fastpath through `Client.current` that bypasses a number of unnecessarily slow things that `get_client` does before it checks `Client.current`.
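For reference, the fastpath in question is the `Client.current()` lookup; a tiny illustration (assuming an in-process cluster, just for demonstration):

```python
from distributed import Client

client = Client(processes=False)  # in-process cluster, for demonstration only
assert Client.current() is client  # direct lookup, no worker/instance scanning
client.close()
```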
Split out from #4937 (comment).
What happened:

When using a local cluster in async mode, `get_worker` always returns the same Worker instance, no matter which worker it's being called within.

What you expected to happen:

`get_worker` returns the worker it's actually being called within.
Minimal Complete Verifiable Example:
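The original reproducer isn't preserved here; a sketch of the kind of `@gen_cluster` test that demonstrates the problem (illustrative, not the original example):

```python
from distributed import get_worker
from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
async def test_get_worker_returns_caller(c, s, a, b):
    async def whoami():
        # Async tasks run on the worker's event loop, where the thread-local
        # fast path isn't set, so get_worker falls back to scanning
        # Worker._instances and can report the wrong worker.
        return get_worker().address

    x = await c.submit(whoami, workers=[a.address], pure=False)
    y = await c.submit(whoami, workers=[b.address], pure=False)
    assert x == a.address
    assert y == b.address  # fails when the same worker is returned for both
```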
Anything else we need to know?:
This probably almost never affects users directly. But since most tests use an async local cluster with `@gen_cluster`, I'm concerned about what edge-case behavior we might be testing incorrectly.

Also, note that the same issue affects `get_client`. This feels a tiny bit less bad (at least it's always the right client, unlike `get_worker`), but it can still have some strange effects. In multiple places, worker code updates the default Client instance assuming it's in a separate process. With multiple workers trampling the default client, I wonder if this affects tests around advanced `secede`/client-within-task workloads.

I feel like the proper solution here would be to set a contextvar for the current worker that's updated as we context-switch in and out of that worker. Identifying the points where those switches have to happen seems tricky, though.
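A minimal, self-contained sketch of the contextvar idea (hypothetical; not distributed's implementation, and it glosses over exactly where the real switch points would go):

```python
import contextvars

_current_worker = contextvars.ContextVar("current_worker")


class Worker:
    def __init__(self, name):
        self.name = name

    def run_task(self, fn, *args):
        # Set the contextvar while "inside" this worker; restore it after.
        token = _current_worker.set(self)
        try:
            return fn(*args)
        finally:
            _current_worker.reset(token)


def get_worker():
    try:
        return _current_worker.get()
    except LookupError:
        raise ValueError("No worker found") from None


a, b = Worker("a"), Worker("b")
assert a.run_task(get_worker) is a
assert b.run_task(get_worker) is b
```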
I also think it would be reasonable for `get_worker` to error if `len(Worker._instances) > 1`.
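A sketch of that guard (hypothetical, self-contained toy; the real logic would live in the `Worker._instances` fallback inside `distributed.worker.get_worker`):

```python
class Worker:
    _instances = []

    def __init__(self):
        Worker._instances.append(self)


def get_worker():
    running = list(Worker._instances)
    if len(running) > 1:
        # Ambiguous: multiple in-process workers and nothing to disambiguate.
        # Failing loudly beats silently returning the wrong worker.
        raise ValueError("Multiple Workers in this process; caller is ambiguous")
    if not running:
        raise ValueError("No workers found")
    return running[0]
```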
Environment: