Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for CUDA streams #96

Open
pentschev opened this issue Jul 24, 2019 · 21 comments
Open

Support for CUDA streams #96

pentschev opened this issue Jul 24, 2019 · 21 comments

Comments

@pentschev
Copy link
Member

When writing CUDA applications, an important aspect for keeping GPUs busy is the use of streams to enqueue operations asynchronously from the host.

Libraries such as Numba and CuPy offer support for CUDA streams, but today we don't know to what extent they're functional with Dask.

I believe CUDA streams will be beneficial to leverage higher performance, particularly in the case of several small operations, streams may help Dask keeping on dispatching work asynchronously, while GPUs do the work.

We should check what's the correct way of using them with Dask and how/if they provide performance improvements.

cc @mrocklin @jakirkham @jrhemstad

@mrocklin
Copy link
Contributor

There is no correct way today. Every Python GPU library has their own Python wrapping of the CUDA streams API, so there is no consistent way to do this today. Also cc'ing @kkraus14

@jakirkham
Copy link
Member

FWIW recently stumbled across this bug ( cupy/cupy#2159 ) on the CuPy issue tracker. Not sure if it is directly related, but figured it was worth being aware of that issue.

@leofang
Copy link
Member

leofang commented Jan 9, 2020

Possibly related: numba/numba#4797

@jakirkham
Copy link
Member

One potentially interesting idea to consider would be to create streams and store them in thread local storage. This way we could grab the stream for the current thread and apply it to operations within that thread relatively easily. Ideally this would give us the benefits of compiling with --default-stream per-thread without having to necessarily do that. The question then is where should this functionality live? dask-cuda seems like a reasonable place, but we could also imagine Numba, CuPy, or others being reasonable candidates. Thoughts?

cc @kkraus14

@pentschev
Copy link
Member Author

I've never really used --default-stream per-thread, what's the behavior if you happen to access the same data on multiple threads, do you need explicit synchronization? If so, how would that be handled automatically without requiring user interference?

@leofang
Copy link
Member

leofang commented Feb 7, 2020

One potentially interesting idea to consider would be to create streams and store them in thread local storage. This way we could grab the stream for the current thread and apply it to operations within that thread relatively easily. Ideally this would give us the benefits of compiling with --default-stream per-thread without having to necessarily do that. The question then is where should this functionality live? dask-cuda seems like a reasonable place, but we could also imagine Numba, CuPy, or others being reasonable candidates. Thoughts?

Just wanna add that CuPy is already doing exactly what John said (thread-local streams), see https://github.com/cupy/cupy/blob/096957541c36d4f2a3f2a453e88456fffde1c8bf/cupy/cuda/stream.pyx#L8-L46.

@leofang
Copy link
Member

leofang commented Feb 7, 2020

For @pentschev's question: I think users who write a multi-threading GPU program are responsible to take care of synchronization as they would do in CUDA C/C++. Not sure if there's an easy way out.

By the way, a discussion on per-thread default streams is recently brought up in Numba: numba/numba#5137.

@pentschev
Copy link
Member Author

It seems like --default-stream per-thread could then be helpful if we assign multiple threads per worker (currently we only use one in dask-cuda). My original idea was to have the same thread working on multiple streams, but I think this may be too complex of a use case for Dask. However, multiple threads per worker on the same GPU may limit the problem size due to a potential larger memory footprint, so we need to really test this out and see if this is something that we could work with in Dask.

@jakirkham
Copy link
Member

Thanks for the info Leo! So maybe this is less a question of how should dask-cuda handle this (specifically) and more a question of how we should work together across libraries to achieve a shared stream solution.

@kkraus14
Copy link

kkraus14 commented Feb 8, 2020

I'd also note that --default-stream-per-thread is currently incompatible with RMM pool mode which is planned to be addressed in the future, but likely wont be for a while.

@jakirkham
Copy link
Member

Updating this issue a bit. It is possible to enable --default-stream-per-thread with RMM's default CNMeM pool ( rapidsai/rmm#354 ), but it comes with some tradeoffs. There is also a new RMM pool where --default-stream-per-thread is being worked on in PR ( rapidsai/rmm#425 ).

@jakirkham
Copy link
Member

Asking about PTDS support with CuPy in issue ( cupy/cupy#3755 ).

@jakirkham
Copy link
Member

Also we would need to make a change to Distributed to support PTDS. Have submitted a draft PR ( dask/distributed#4034 ) to make these changes.

@jakirkham
Copy link
Member

I'd also note that --default-stream-per-thread is currently incompatible with RMM pool mode which is planned to be addressed in the future, but likely wont be for a while.

Also PR ( rapidsai/rmm#466 ) will switch us to the new RMM pool, which does support PTDS.

@pentschev
Copy link
Member Author

I began with some testing and changes to enable PTDS support in rapidsai/rmm#633 and cupy/cupy#4322 .

@github-actions
Copy link

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@jakirkham
Copy link
Member

There is active work here though it’s not really tracked in this issue as a few libraries are involved

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jakirkham
Copy link
Member

This is ongoing work. See issue ( #517 ) for more up-to-date status

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests

5 participants