[FEA] Provide device_resources_manager for easy generation of device_resources (#1716)

Edit: I have renamed the object to raft::device_resources_manager on the (excellent) suggestion of @cjnolet. Substitute the name accordingly in the description below.

### Summary

This PR introduces a new utility (`raft::resource_manager`) to RAFT that helps downstream applications correctly generate the `raft::device_resources` they need to interact with the RAFT API.

### Purpose

As more vector search applications have begun integrating RAFT, it has become apparent that correctly managing CUDA resources like streams, stream pools, and device memory can be a challenge in a codebase that has previously focused exclusively on CPU execution.

As a specific example, these applications are generally highly multi-threaded on the host. As they begin to use RAFT, they typically use the default `device_resources` constructor to generate the requisite `device_resources` object for the API. Because this default constructor makes use of the default stream per thread, the application uses as many streams as there are host threads. This behavior can lead to exhaustion of device resources because all of those host threads simultaneously launch work on independent streams.

In a CUDA-aware codebase, we might expect the application to manage its own limited pool of streams, but requiring this creates an unnecessary barrier for RAFT adoption. Instead, the `resource_manager` provides a straightforward method for limiting streams and other CUDA resources to sensible values in a highly multi-threaded application.

### Usage

To use the `resource_manager`, the host application calls setters that control various device resources.
For instance, to limit the total number of streams used by the application to 16 per device, the application would call the following during its startup:

```
raft::resource_manager::set_streams_per_device(16);
```

After startup, if the application wishes to make a RAFT API call using a `raft::device_resources` object, it may call the following:

```
auto res = raft::resource_manager::get_device_resources();
some_raft_call(res);
```

If the same host thread calls `get_device_resources()` again in another function, it will retrieve a `device_resources` object based on the exact same stream it got with the previous call. This is similar in spirit to the way that the default CUDA stream per thread is used, but it draws from a limited pool of streams. This also means that while each host thread is associated with one CUDA stream, that stream may be associated with multiple host threads.

In addition to drawing the `device_resources` primary stream from a limited pool, we can share a pool of stream pools among host threads:

```
// Share 4 stream pools among host threads, with the same pool always assigned to any given thread
raft::resource_manager::set_stream_pools_per_device(4);
```

Besides streams and stream pools, the resource manager optionally allows initialization of an RMM memory pool for device allocations:

```
// Start the pool with only 2048 bytes
raft::resource_manager::set_init_mem_pool_size(2048);
// Allow the pool to use all available device memory
raft::resource_manager::set_max_mem_pool_size(std::nullopt);
```

For downstream consumers who know they want a memory pool but are not sure what sizes to pick, the following convenience function just sets up the memory pool with RMM defaults:

```
raft::resource_manager::set_mem_pool();
```

If no memory pool related options are set or if the maximum memory pool size is set to 0, no memory pool will be created or used.
A memory pool is also skipped, with a warning, if the type of the current device memory resource is non-default. We assume that if the application has already set a non-default memory resource, this was done intentionally and should not be overwritten.

### Design

This object is designed with the following priorities:

- Ease of use
- Thread safety
- Performance
- (Lastly) Access to as many `device_resources` options as possible

If a downstream application needs complete control over `device_resources` creation and memory resource initialization, that codebase is probably already CUDA-aware and will not benefit from the resource manager. Therefore, we do not insist that every possible configuration of resources be available through the manager. Nevertheless, codebases may grow more CUDA-aware with time, so we provide access to as many options as possible (with sensible defaults) to provide an on-ramp to more sophisticated manipulation of device resources.

In terms of performance, the goal is to make retrieval of `device_resources` as fast as possible and avoid blocking other host threads. To that end, the design of `resource_manager` includes a layer of indirection that ensures that each host thread needs to acquire a lock only on its first `get_device_resources` call for any given device.

The `resource_manager` singleton maintains an internal configuration object that may be updated using the setter methods until the first call to `get_device_resources` on any thread. After that, a warning will be emitted and no changes will be made on any subsequent setter calls.

Within the `get_device_resources` call, each thread keeps track of which devices it has already retrieved resources for. If it has not yet retrieved resources for that device, it acquires a lock. It then marks the configuration as finalized and checks to see if any thread has initialized the shared resources for that device.
If no other thread has, it initializes those resources itself. It then updates its own `thread_local` list of devices it has retrieved resources for so that it does not need to reacquire the lock on subsequent calls.

### Questions

~This PR is still a WIP. It is being posted here now primarily to gather feedback on its public API. A thorough review of the implementation can happen once tests have been added and it is moved out of WIP.~

Edit: This PR is now ready for review. Implementation feedback welcome!

One question I have is about two convenience functions I added right at the end: `synchronize_work_from_this_thread` and `synchronize_work_from_all_threads`. The idea behind these functions is that for the target audience of this feature, it may be helpful to provide synchronization helpers that clearly indicate what their execution means in relation to the more familiar synchronization requirements of the host code.

The first is intended to communicate that it is guaranteed to block host execution on this thread until all work submitted to the device from that thread is completed. The hope is that this can increase confidence around questions of synchronicity that developers unfamiliar with CUDA streams sometimes have. Because this helper is not strictly required for the core functionality of `resource_manager`, however, it would be useful to have feedback on whether others think it is worthwhile.

I am even less certain about `synchronize_work_from_all_threads`. The idea behind this is that it provides a way to block host thread execution until all work submitted on streams under the `resource_manager`'s control has completed. A natural question is why we would not just perform a device synchronization at that point. My justification for that is that the application may also be integrating other libraries which have their own stream management and synchronization infrastructure.
In that case, it may be desirable _not_ to synchronize work submitted to the device by calls to the other library. It would be useful to hear if others think this might be useful for the target audience of this PR or if we should just push them toward a device synchronization to avoid unpleasant edge cases.

Authors:
- William Hicks (https://github.com/wphicks)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Allard Hendriksen (https://github.com/ahendriksen)
- Corey J. Nolet (https://github.com/cjnolet)

URL: #1716
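
For readers who want to see the first-call locking from the Design section in code, the following is a rough, CPU-only sketch under stated assumptions. `manager_sketch`, `shared_device_state`, and `lock_acquisitions` are hypothetical names invented for illustration; the actual RAFT implementation may be organized quite differently. The sketch only shows the pattern: a mutex taken on a thread's first lookup per device, a `thread_local` record that lets later lookups skip the lock, and setters that stop taking effect once any thread has retrieved resources.

```cpp
#include <cstddef>
#include <map>
#include <mutex>

// Hypothetical stand-in for the shared per-device resources.
struct shared_device_state {
  int streams_per_device;
};

class manager_sketch {
 public:
  // Returns false once the configuration is finalized
  // (the description says the real utility emits a warning instead).
  static bool set_streams_per_device(int n)
  {
    auto& self = instance();
    std::lock_guard<std::mutex> guard{self.mtx_};
    if (self.finalized_) { return false; }
    self.config_streams_ = n;
    return true;
  }

  static shared_device_state const& get(int device_id)
  {
    // Devices this thread has already retrieved; hits here need no lock.
    thread_local std::map<int, shared_device_state const*> seen;
    auto cached = seen.find(device_id);
    if (cached != seen.end()) { return *cached->second; }

    auto& self = instance();
    std::lock_guard<std::mutex> guard{self.mtx_};
    ++self.lock_acquisitions_;
    self.finalized_ = true;
    // insert() leaves an existing entry untouched, so only the first
    // thread to arrive initializes the shared state for this device.
    auto pos = self.per_device_.insert({device_id, shared_device_state{self.config_streams_}}).first;
    seen[device_id] = &pos->second;  // std::map nodes have stable addresses
    return pos->second;
  }

  // Number of locked lookups so far; exposed only to observe the sketch.
  static std::size_t lock_acquisitions()
  {
    auto& self = instance();
    std::lock_guard<std::mutex> guard{self.mtx_};
    return self.lock_acquisitions_;
  }

 private:
  static manager_sketch& instance()
  {
    static manager_sketch m;
    return m;
  }

  std::mutex mtx_;
  bool finalized_ = false;
  int config_streams_ = 1;  // hypothetical default
  std::size_t lock_acquisitions_ = 0;
  std::map<int, shared_device_state> per_device_;
};
```

Calling `manager_sketch::get(0)` twice from the same thread takes the lock only once, and a `set_streams_per_device` call made after the first `get` is rejected rather than applied.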