[FEA] Provide device_resources_manager for easy generation of device_resources #1716
Conversation
I took a cursory look through the code. The intent is pretty clear and I think the implementation looks good. I'll take a closer look tomorrow and read through the description more thoroughly. I suggest we name this `device_resources_manager`.
Also implement a destructor to ensure that resources created by the manager are appropriately destroyed before the CUDA context is destroyed.
Edit: Figured it out!
Thanks for this PR! It's shaping up really well, and I am very happy with your descriptions as well as with how well-commented the code is.
I finally got a chance to review this more thoroughly. Mostly, I'd like to separate the concerns a little better between individual `device_resources` and the `device_resources_manager`. From the RAFT perspective, `device_resources` is not just an API for invoking RAFT's APIs: it's supposed to be responsible for maintaining the state assigned to an individual device. In many ways, this `device_resources_manager` implementation is side-stepping the individual resource management when we could be using this as an opportunity to reuse the existing centralized resource management infrastructure.
Other than that, I think the design is great. RAFT has been in need of something that can manage multiple devices (e.g., instances of `device_resources` for different devices) across different threads.
I'm even okay with stepping down a layer and using `raft::resources` directly, if that makes the experience cleaner, but I'd like to integrate and reuse the `raft::resource` APIs as much as possible.
Could you be more specific? The intention of this object is to ensure that …
Refactored to cache the generated `device_resources`.
Marking my comment about splitting the logger code into `logger-inl` and `logger-ext` as a request for changes 👍
https://github.com/rapidsai/raft/pull/1716/files#r1286897448
Updated everything requested except the docs. I'll finish those up tomorrow.
One more move of `log_level_setter` requested. Sorry, should have given this a bit more thought in the previous comment.
Thanks for moving `log_level_setter` to `logger-ext.cuh`. LGTM!
LGTM. Thanks @wphicks!
/merge
Edit: I have renamed the object to `raft::device_resources_manager` on the (excellent) suggestion of @cjnolet. Substitute the name accordingly in the description below.
Summary
This PR introduces a new utility (`raft::resource_manager`) to RAFT that helps downstream applications correctly generate the `raft::device_resources` they need to interact with the RAFT API.

Purpose
As more vector search applications have begun integrating RAFT, it has become apparent that correctly managing CUDA resources like streams, stream pools, and device memory can be a challenge in a codebase that has previously focused exclusively on CPU execution. As a specific example, these applications are generally highly multi-threaded on the host. As they begin to use RAFT, they typically make use of the default `device_resources` constructor to generate the requisite `device_resources` object for the API. Because this default constructor makes use of the default stream per thread, the application uses as many streams as there are host threads. This behavior can lead to exhaustion of device resources, because all of those host threads simultaneously launch work on independent streams.

In a CUDA-aware codebase, we might expect the application to manage its own limited pool of streams, but requiring this creates an unnecessary barrier to RAFT adoption. Instead, the `resource_manager` provides a straightforward method for limiting streams and other CUDA resources to sensible values in a highly multi-threaded application.

Usage
To use the `resource_manager`, the host application makes use of setters which provide control over various device resources. For instance, to limit the total number of streams used by the application to 16 per device, the application would call the following during its startup:
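(A minimal sketch of what this might look like, written against the renamed `raft::device_resources_manager`; the header path and the setter name `set_streams_per_device` are assumptions based on the description above.)

```cpp
#include <raft/core/device_resources_manager.hpp>

int main() {
  // During startup, before any thread calls get_device_resources():
  // draw all primary streams from a shared pool of 16 per device.
  raft::device_resources_manager::set_streams_per_device(16);
  // ... remainder of application setup and execution ...
}
```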
After startup, if the application wishes to make a RAFT API call using a `raft::device_resources` object, it may call the following:
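(Again a sketch; the accessor name `get_device_resources` comes from the description, while the return-by-reference signature is an assumption.)

```cpp
#include <raft/core/device_resources_manager.hpp>

void invoke_raft_api() {
  // Returns device_resources built on the stream assigned to this host
  // thread; repeated calls from this thread reuse the same stream.
  auto const& res = raft::device_resources_manager::get_device_resources();
  // ... pass `res` to any RAFT API call ...
}
```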
If the same host thread calls `get_device_resources()` again in another function, it will retrieve a `device_resources` object based on the exact same stream it got with the previous call. This is similar in spirit to the way that the default CUDA stream per thread is used, but it draws from a limited pool of streams. This also means that while each host thread is associated with one CUDA stream, that stream may be associated with multiple host threads.

In addition to drawing the `device_resources` primary stream from a limited pool, we can share a pool of stream pools among host threads:
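(A sketch of the described call; the setter name `set_stream_pools_per_device` and its two-argument form, pools per device and streams per pool, are assumptions.)

```cpp
// During startup: maintain 4 stream pools per device, each containing
// 8 streams, shared among host threads like the primary streams.
raft::device_resources_manager::set_stream_pools_per_device(4, 8);
```

Besides streams and stream pools, the resource manager optionally allows initialization of an RMM memory pool for device allocations:

```cpp
// During startup: request an RMM memory pool on each device with an
// initial size of 1 GiB, allowed to grow to at most 4 GiB. The setter
// names set_init_mem_pool_size/set_max_mem_pool_size are assumed here.
raft::device_resources_manager::set_init_mem_pool_size(std::size_t{1} << 30);
raft::device_resources_manager::set_max_mem_pool_size(std::size_t{4} << 30);
```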
For downstream consumers who know they want a memory pool but are not sure what sizes to pick, the following convenience function just sets up the memory pool with RMM defaults:
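(A sketch, assuming that convenience function is named `set_mem_pool`.)

```cpp
// During startup: set up the per-device memory pool using RMM's
// default initial and maximum sizes.
raft::device_resources_manager::set_mem_pool();
```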
If no memory pool related options are set or if the maximum memory pool size is set to 0, no memory pool will be created or used. Furthermore, if the type of the current device memory resource is non-default, no memory pool will be created or used, and a warning will be emitted. We assume that if the application has already set a non-default memory resource, this was done intentionally and should not be overwritten.
Design
This object is designed with the following priorities:
- ease of use for codebases that are not yet CUDA-aware
- performance: fast, non-blocking retrieval of resources
- exposing as many `device_resources` options as possible

If a downstream application needs complete control over `device_resources` creation and memory resource initialization, that codebase is probably already CUDA-aware and will not benefit from the resource manager. Therefore, we do not insist that every possible configuration of resources be available through the manager. Nevertheless, codebases may grow more CUDA-aware with time, so we provide access to as many options as possible (with sensible defaults) to provide an on-ramp to more sophisticated manipulation of device resources.

In terms of performance, the goal is to make retrieval of `device_resources` as fast as possible and to avoid blocking other host threads. To that end, the design of `resource_manager` includes a layer of indirection that ensures that each host thread needs to acquire a lock only on its first `get_device_resources` call for any given device. The `resource_manager` singleton maintains an internal configuration object that may be updated using the setter methods until the first call to `get_device_resources` on any thread. After that, a warning will be emitted and no changes will be made on any subsequent setter calls.
Within the `get_device_resources` call, each thread keeps track of which devices it has already retrieved resources for. If it has not yet retrieved resources for that device, it acquires a lock. It then marks the configuration as finalized and checks to see if any thread has initialized the shared resources for that device. If no other thread has, it initializes those resources itself. It then updates its own thread_local list of devices it has retrieved resources for, so that it does not need to reacquire the lock on subsequent calls. A simplified sketch of this scheme follows.
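(Illustrative only: every name below is invented, and the real implementation differs in its details. The sketch shows how a thread_local record of already-initialized devices limits locking to each thread's first call per device.)

```cpp
#include <array>
#include <mutex>
#include <optional>
#include <set>
#include <string>

namespace sketch {
constexpr int max_devices = 16;
std::mutex resource_lock;
// Fixed-size storage so lock-free reads of an already-initialized slot
// are safe; the real code uses a layer of indirection to similar effect.
std::array<std::optional<std::string>, max_devices> shared_resources;

std::string const& get_device_resources(int device_id) {
  // Each thread tracks the devices it has already retrieved resources for.
  thread_local std::set<int> initialized_devices;
  if (initialized_devices.count(device_id) == 0) {
    std::lock_guard<std::mutex> guard{resource_lock};
    // First call from this thread for this device: finalize configuration
    // (omitted here) and initialize the shared resources only if no other
    // thread has done so yet.
    if (!shared_resources[device_id]) {
      shared_resources[device_id] =
        "resources for device " + std::to_string(device_id);
    }
    initialized_devices.insert(device_id);
  }
  // Subsequent calls from this thread reach here without taking the lock.
  return *shared_resources[device_id];
}
}  // namespace sketch
```

Because each slot is written only once, under the lock, and each thread's first access to a slot happens inside its own locked section, the later lock-free reads are safe.

Questions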
This PR is still a WIP. It is being posted here now primarily to gather feedback on its public API. A thorough review of the implementation can happen once tests have been added and it is moved out of WIP.

Edit: This PR is now ready for review. Implementation feedback welcome!

One question I have is about two convenience functions I added right at the end: `synchronize_work_from_this_thread` and `synchronize_work_from_all_threads`. The idea behind these functions is that, for the target audience of this feature, it may be helpful to provide synchronization helpers that clearly indicate what their execution means in relation to the more familiar synchronization requirements of the host code. The first is intended to communicate that it is guaranteed to block host execution on this thread until all work submitted to the device from that thread is completed. The hope is that this can increase confidence around questions of synchronicity that developers unfamiliar with CUDA streams sometimes have. Because this helper is not strictly required for the core functionality of `resource_manager`, however, it would be useful to have feedback on whether others think it is worthwhile.

I am even less certain about `synchronize_work_from_all_threads`. The idea behind this is that it provides a way to block host thread execution until all work submitted on streams under the `resource_manager`'s control has completed. A natural question is why we would not just perform a device synchronization at that point. My justification is that the application may also be integrating other libraries which have their own stream management and synchronization infrastructure. In that case, it may be desirable not to synchronize work submitted to the device by calls to the other library. It would be useful to hear if others think this might be useful for the target audience of this PR, or if we should just push them toward a device synchronization to avoid unpleasant edge cases.
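(For concreteness, a hypothetical usage sketch of the two helpers; their names come from this PR, but the `raft::device_resources_manager::` scoping and zero-argument signatures are assumptions.)

```cpp
#include <raft/core/device_resources_manager.hpp>

void after_submitting_work() {
  // Block this host thread until all device work submitted from this
  // thread (on manager-provided streams) has completed.
  raft::device_resources_manager::synchronize_work_from_this_thread();
}

void before_global_checkpoint() {
  // Block until device work submitted from *all* host threads on
  // manager-controlled streams has completed, without synchronizing
  // work submitted by other libraries on their own streams.
  raft::device_resources_manager::synchronize_work_from_all_threads();
}
```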