dask-worker process remains alive after Nanny exception on plugins= #6320

Closed

pentschev opened this issue May 10, 2022 · 2 comments · Fixed by #6363

Comments

pentschev (Member) commented May 10, 2022
What happened:

Since #5910, an exception raised by a plugin passed via the Nanny's plugins= no longer causes the dask-worker process to terminate; instead, the process hangs indefinitely.

What you expected to happen:

dask-worker process to terminate

Minimal Complete Verifiable Example:

Passing plugins= is not currently exposed by the dask-worker CLI, so this cannot be triggered directly from the command line, but the patch below reproduces it:

diff --git a/distributed/cli/dask_worker.py b/distributed/cli/dask_worker.py
index 376f2a1c..4f4172a9 100755
--- a/distributed/cli/dask_worker.py
+++ b/distributed/cli/dask_worker.py
@@ -36,6 +36,11 @@ logger = logging.getLogger("distributed.dask_worker")
 pem_file_option_type = click.Path(exists=True, resolve_path=True)


+class MyPlugin:
+    def setup(self, workers=None):
+        raise ValueError("Kill dask-worker")
+
+
 @click.command(context_settings=dict(ignore_unknown_options=True))
 @click.argument("scheduler", type=str, required=False)
 @click.option(
@@ -444,6 +449,7 @@ def main(
             host=host,
             dashboard=dashboard,
             dashboard_address=dashboard_address,
+            plugins={MyPlugin()},
             name=name
             if n_workers == 1 or name is None or name == ""
             else str(name) + "-" + str(i),

To reproduce, run the following commands:

# Scheduler
$ dask-scheduler

# Worker
$ dask-worker --nanny 127.0.0.1:8786
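
For reference, roughly the same failure can be provoked from the Python API without patching the CLI. The following is a minimal sketch, assuming Nanny forwards the plugins= keyword to the underlying Worker the same way the patched CLI call above relies on; it is not code from the report.

# repro_nanny_plugin.py -- illustrative sketch only
import asyncio

from distributed import Nanny, Scheduler


class FailingPlugin:
    def setup(self, worker=None):
        # Raising here is expected to terminate the worker process;
        # after #5910 the process reportedly hangs instead.
        raise ValueError("Kill dask-worker")


async def main():
    async with Scheduler(port=8786) as scheduler:
        # The Nanny should fail to start because the plugin raises during setup.
        async with Nanny(scheduler.address, plugins={FailingPlugin()}):
            await asyncio.sleep(5)


if __name__ == "__main__":
    asyncio.run(main())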

Anything else we need to know?:

This behavior is relied upon in Dask-CUDA. #5910 caused https://github.com/rapidsai/dask-cuda/blob/7de73c72ca52239c6af87e483a20af3c8896bf0d/dask_cuda/tests/test_dask_cuda_worker.py#L220-L228 to hang indefinitely.

Environment:

  • Dask version: 2022.5.0+0.gc11c8ee4
  • Python version: 3.8
  • Operating System: Linux
  • Install method (conda, pip, source): source

cc @fjetter who authored #5910 .

rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this issue May 12, 2022
Xfailing the test is needed for now due to dask/distributed#6320 . This should not have any major impact unless the user specifies a module that cannot be imported via `--pre-import`, in which case the worker process will hang indefinitely.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - https://github.com/jakirkham

URL: #908
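
For context, the workaround described in the commit above amounts to marking the affected Dask-CUDA test as an expected failure. A sketch of such a marker follows; the test name and body here are hypothetical, not copied from dask-cuda.

import pytest


@pytest.mark.xfail(
    reason="dask-worker hangs instead of exiting on a failing plugin, "
    "see dask/distributed#6320"
)
def test_pre_import_not_found():  # hypothetical test name
    ...
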
gjoseph92 (Collaborator) commented May 17, 2022

Probably not related, but FYI: #6357

fjetter (Member) commented May 17, 2022

This ticket is caused by this line

await _close_on_failure(exc)

We are now trying to close the server instance if an error occurred, to ensure all resources are properly cleaned up. I suspect the worker is stuck trying to communicate with the scheduler or something similar (i.e. report=True). Fixing this is a bit nasty since we're calling Server.close even though all subclasses change the close signature :/ I would prefer not to introduce a timeout here. I will check whether there is an easy way to clean this up.
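
To illustrate the shape of the problem, here is a simplified sketch, not distributed's actual code: the base class can only invoke close() with its own signature, while subclasses add arguments such as report=, whose defaults can make cleanup block on scheduler communication.

# Simplified sketch of the signature mismatch described above.
import asyncio


class Server:
    async def start(self):
        try:
            await self._do_start()  # may raise, e.g. from a plugin's setup()
        except Exception:
            # Cleanup on failure can only use the base-class signature,
            # so subclass-specific arguments (e.g. report=False) are not passed.
            await self.close()
            raise

    async def _do_start(self):
        raise ValueError("Kill dask-worker")

    async def close(self):
        pass


class Worker(Server):
    # The subclass changes the signature; with the default report=True,
    # close() tries to talk to the scheduler, which can hang if no
    # connection was ever established.
    async def close(self, report=True, timeout=30):
        if report:
            await asyncio.Event().wait()  # stand-in for an RPC that never returns
        await super().close()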

fjetter added a commit to fjetter/distributed that referenced this issue May 19, 2022
fjetter added a commit to fjetter/distributed that referenced this issue May 19, 2022
fjetter added a commit to fjetter/distributed that referenced this issue May 20, 2022