[FEA] Support RMM arguments for cluster initialization #6

Closed · randerzander opened this issue Feb 6, 2019 · 6 comments

@randerzander (Contributor)

See how this example notebook is setting up RMM using client.run.

Is this something we can have LocalCUDACluster handle during startup?
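For context, the client.run pattern being referred to looks roughly like the following. This is a minimal sketch, not the notebook's exact code: the pool-allocator call uses the present-day rmm.reinitialize API, whereas the 2019 notebooks went through the older librmm_cffi interface.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)

def setup_rmm():
    # Runs inside each worker process; set up an RMM memory pool there.
    import rmm
    rmm.reinitialize(pool_allocator=True)

# client.run executes the function once on every worker.
client.run(setup_rmm)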

@randerzander (Contributor, Author)

@paulhendricks fyi

@mrocklin (Contributor) commented Feb 6, 2019

In principle this is the sort of thing that a project like LocalCUDACluster could handle; however, I'm somewhat concerned that it seems very RAPIDS-specific, and also perhaps fairly unstable. Someone using this project for PyTorch might not appreciate this change. My guess from looking at the code is that it would change within the next month, or would change depending on which libraries someone wanted to use. My hope is that LocalCUDACluster stays general beyond just RAPIDS work and avoids baking in code tailored to a specific workflow.

Instead, I think that we might finish up dask/distributed#2453 and use that, or perhaps use worker preload scripts (see http://distributed.dask.org/en/latest/setup.html#customizing-initialization). Short term, the preload scripts are probably the easiest approach. You would add that function to a small script and then call something like LocalCUDACluster(preload='myscript.py'); that script would then be run on all of the workers when they start up. This would also work if, for example, the workers were restarted.
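As a rough sketch of the preload approach (the file name is taken from the comment above, and the RMM call is illustrative only), the script can define the dask_setup hook that distributed invokes on each worker at startup:

# myscript.py
def dask_setup(worker):
    # Called by distributed once per worker at startup.
    import rmm
    rmm.reinitialize(pool_allocator=True)  # illustrative RMM pool initialization

Whether LocalCUDACluster forwards a preload keyword to its workers may depend on the version in use; the configuration route described below works regardless.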

You can also avoid having to specify this keyword argument by putting the script into your config at distributed.worker.preload, for example by adding the following to ~/.config/dask/rapids.yaml:

distributed:
  worker:
    preload: /path/to/myscript.py

@randerzander (Contributor, Author)

Understood about the goal of keeping this a general utility; the preload scripts will work.

However, it's worth noting that, while its adoption is currently mostly within RAPIDS projects, RMM aims to make shared memory-pool management easier across the GPU ecosystem. It might be worth revisiting this once RMM is adopted more widely.

@harrism fyi

@mrocklin (Contributor) commented Feb 6, 2019

I think that the preload script solution is probably the right level of customization for this problem.

We can make a script with your startup commands, put that into the configuration, and it will be run on any dask worker that people set up. That config file and script will be able to adapt much more nimbly than the dask-cuda project.

@mrocklin (Contributor) commented Feb 6, 2019

@kkraus14, what do you think is the right medium-term way to solve the RMM initialization issue on the Python usability side?

@mrocklin (Contributor) commented Feb 6, 2019

> It might be worth revisiting this once RMM is adopted more widely.

+1
