[FEA] Support RMM arguments for cluster initialization #6

Closed · randerzander opened this issue Feb 6, 2019 · 6 comments

@randerzander (Contributor)

See how this example notebook is setting up RMM using client.run.

Is this something we can have LocalCUDACluster handle during startup?
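For context, the client.run pattern being referred to looks roughly like the following. This is a minimal sketch, not the notebook's exact code: the pool-allocator call uses the present-day rmm.reinitialize API, whereas the 2019 notebooks went through the older librmm_cffi interface.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)

def setup_rmm():
    # Runs inside each worker process; set up an RMM memory pool there.
    import rmm
    rmm.reinitialize(pool_allocator=True)

# client.run executes the function once on every worker.
client.run(setup_rmm)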

@randerzander (Contributor, Author)

@paulhendricks fyi

@mrocklin (Contributor) commented Feb 6, 2019

In principle this is the sort of thing that a project like LocalCUDACluster could handle; however, I'm somewhat concerned that it seems very RAPIDS-specific, and also perhaps fairly unstable. Someone using this project for PyTorch might not appreciate this change. My guess from looking at the code is that it would change within the next month, or would change depending on which libraries someone wanted to use. My hope is that LocalCUDACluster stays general beyond just RAPIDS work and avoids baking in code tailored to a specific workflow.

Instead, I think that we might finish up dask/distributed#2453 and use that, or perhaps use worker preload scripts (see http://distributed.dask.org/en/latest/setup.html#customizing-initialization). Short term, the preload scripts are probably the easiest approach. You would add that function to a small script and then call something like LocalCUDACluster(preload='myscript.py'); that script would then be run on all of the workers when they start up. This would also work if, for example, the workers were restarted.
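As a rough sketch of the preload approach (the file name is taken from the comment above, and the RMM call is illustrative only), the script can define the dask_setup hook that distributed invokes on each worker at startup:

# myscript.py
def dask_setup(worker):
    # Called by distributed once per worker at startup.
    import rmm
    rmm.reinitialize(pool_allocator=True)  # illustrative RMM pool initialization

Whether LocalCUDACluster forwards a preload keyword to its workers may depend on the version in use; the configuration route described below works regardless.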

You can also avoid having to specify this keyword argument by putting the script into your config at distributed.worker.preload, for example by adding the following to ~/.config/dask/rapids.yaml:

distributed:
  worker:
    preload: /path/to/myscript.py

@randerzander (Contributor, Author)

Understood about the goal of keeping this a general utility; the preload scripts will work.

However, it's worth noting that, while its adoption is currently mostly within RAPIDS projects, RMM aims to make shared memory-pool management easier across the GPU ecosystem. It might be worth revisiting this once RMM is adopted more widely.

@harrism fyi

@mrocklin (Contributor) commented Feb 6, 2019

I think that the preload script solution is probably the right level of customization for this problem.

We can make a script with your startup commands, put that into the configuration, and it will be run on any dask worker that people set up. That config file and script will be able to adapt much more nimbly than the dask-cuda project.

@mrocklin (Contributor) commented Feb 6, 2019

@kkraus14, what do you think is the right medium-term way to solve the RMM initialization issue on the Python usability side?

@mrocklin (Contributor) commented Feb 6, 2019

> It might be worth revisiting this once RMM is adopted more widely.

+1
