
[FEA] Provide device_resources_manager for easy generation of device_resources #1716

Merged: 27 commits into rapidsai:branch-23.10 on Aug 15, 2023

Conversation

wphicks (Contributor) commented Aug 3, 2023

Edit: I have renamed the object to raft::device_resources_manager on the (excellent) suggestion of @cjnolet. Substitute the name accordingly in the description below.

Summary

This PR introduces a new utility (raft::resource_manager) to RAFT that is used to help downstream applications correctly generate the raft::device_resources they need to interact with the RAFT API.

Purpose

As more vector search applications have begun integrating RAFT, it has become apparent that correctly managing CUDA resources like streams, stream pools, and device memory can be a challenge in a codebase that has previously focused exclusively on CPU execution. As a specific example, these applications are generally highly multi-threaded on the host. As they begin to use RAFT, they typically make use of the default device_resources constructor to generate the requisite device_resources object for the API. Because this default constructor makes use of the default stream per thread, the application uses as many streams as there are host threads. This behavior can lead to exhaustion of device resources because all of those host threads simultaneously launch work on independent streams.

In a CUDA-aware codebase, we might expect the application to manage its own limited pool of streams, but requiring this creates an unnecessary barrier for RAFT adoption. Instead, the resource_manager provides a straightforward method for limiting streams and other CUDA resources to sensible values in a highly multi-threaded application.

Usage

To use the resource_manager, the host application will make use of setters which can provide control over various device resources. For instance, to limit the total number of streams used by the application to 16 per device, the application would call the following during its startup:

raft::resource_manager::set_streams_per_device(16);

After startup, if the application wishes to make a RAFT API call using a raft::device_resources object, it may call the following:

auto res = raft::resource_manager::get_device_resources();
some_raft_call(res);

If the same host thread calls get_device_resources() again in another function, it will retrieve a device_resources object based on the exact same stream it got with the previous call. This is similar in spirit to the way that the default CUDA stream per thread is used, but it draws from a limited pool of streams. This also means that while each host thread is associated with one CUDA stream, that stream may be associated with multiple host threads.
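The thread-to-stream mapping described above can be illustrated with a minimal standard-C++ sketch. This is not the RAFT implementation; the names `get_thread_stream` and `kStreamsPerDevice` are hypothetical, and a plain `int` stands in for a CUDA stream handle. The key idea is that a `thread_local` assignment from a bounded pool makes repeated lookups from one thread stable while keeping the total stream count fixed, so several host threads may map onto one stream.

```cpp
#include <atomic>

// Hypothetical fixed pool size (stands in for the streams-per-device limit).
constexpr int kStreamsPerDevice = 4;

// Each host thread is assigned one pool slot on its first call and keeps that
// slot for all later calls. With more threads than slots, slots are reused,
// so a single "stream" may serve multiple host threads.
int get_thread_stream() {
  static std::atomic<int> next_slot{0};
  thread_local int slot = next_slot.fetch_add(1) % kStreamsPerDevice;
  return slot;
}
```

The same pattern extends naturally to the stream-pool-of-pools case below: the thread's slot index simply selects from a container of `rmm::cuda_stream_pool` objects instead of a single stream.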

In addition to drawing the device_resources primary stream from a limited pool, we can share a pool of stream pools among host threads:

// Share 4 stream pools among host threads, with the same pool always assigned to any given thread
raft::resource_manager::set_stream_pools_per_device(4);

Besides streams and stream pools, the resource manager optionally allows initialization of an RMM memory pool for device allocations:

// Start the pool with only 2048 bytes
raft::resource_manager::set_init_mem_pool_size(2048);
// Allow the pool to use all available device memory
raft::resource_manager::set_max_mem_pool_size(std::nullopt);

For downstream consumers who know they want a memory pool but are not sure what sizes to pick, the following convenience function just sets up the memory pool with RMM defaults:

raft::resource_manager::set_mem_pool();

If no memory pool related options are set or if the maximum memory pool size is set to 0, no memory pool will be created or used. Furthermore, if the type of the current device memory resource is non-default, no memory pool will be created or used, and a warning will be emitted. We assume that if the application has already set a non-default memory resource, this was done intentionally and should not be overwritten.
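The memory-pool decision rule just described can be sketched as a small predicate. This is an illustration only: `should_create_pool` and its parameters are hypothetical names, and the real check against the current RMM device memory resource is reduced here to a boolean flag.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical condensation of the rule above: create a pool only if pool
// options were set, the maximum size is nonzero, and the current device
// memory resource is still the default one. std::nullopt for the max size
// means "all available device memory".
bool should_create_pool(bool pool_options_set,
                        std::optional<std::size_t> max_pool_size,
                        bool current_resource_is_default) {
  if (!pool_options_set) return false;                     // nothing requested
  if (max_pool_size && *max_pool_size == 0) return false;  // explicitly disabled
  if (!current_resource_is_default) return false;          // respect the user's resource
                                                           // (the real code also warns here)
  return true;
}
```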

Design

This object is designed with the following priorities:

  • Ease of use
  • Thread safety
  • Performance
  • (Lastly) Access to as many device_resources options as possible

If a downstream application needs complete control over device_resources creation and memory resource initialization, that codebase is probably already CUDA-aware and will not benefit from the resource manager. Therefore, we do not insist that every possible configuration of resources be available through the manager. Nevertheless, codebases may grow more CUDA-aware with time, so we provide access to as many options as possible (with sensible defaults) to provide an on-ramp to more sophisticated manipulation of device resources.

In terms of performance, the goal is to make retrieval of device_resources as fast as possible and avoid blocking other host threads. To that end, the design of resource_manager includes a layer of indirection that ensures that each host thread needs to acquire a lock only on its first get_device_resources call for any given device. The resource_manager singleton maintains an internal configuration object that may be updated using the setter methods until the first call to get_device_resources on any thread. After that, a warning will be emitted and no changes will be made on any subsequent setter calls.

Within the get_device_resources call, each thread keeps track of which devices it has already retrieved resources for. If it has not yet retrieved resources for that device, it acquires a lock. It then marks the configuration as finalized and checks to see if any thread has initialized the shared resources for that device. If no other thread has, it initializes those resources itself. It then updates its own thread_local list of devices it has retrieved resources for so that it does not need to reacquire the lock on subsequent calls.
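The lock-avoidance scheme in the two paragraphs above can be sketched in standard C++. Everything here is illustrative, not RAFT code: `manager_sketch`, `lock_acquisitions`, and the integer device IDs are stand-ins. The point is that a `thread_local` record of already-seen devices lets a thread skip the mutex on every call after its first for a given device (the real manager is a singleton, so a function-level `thread_local` is a faithful analogue).

```cpp
#include <mutex>
#include <unordered_set>

// Hypothetical sketch of the locking scheme: a thread takes the lock only the
// first time it requests resources for a given device.
struct manager_sketch {
  std::mutex mtx;
  bool config_finalized = false;                // setters would warn once true
  std::unordered_set<int> initialized_devices;  // shared state, guarded by mtx
  int lock_acquisitions = 0;                    // instrumentation for this example only

  void get_device_resources(int device_id) {
    // Per-thread record of devices already set up; read without the lock.
    thread_local std::unordered_set<int> seen_devices;
    if (seen_devices.count(device_id) == 0) {
      std::lock_guard<std::mutex> guard(mtx);
      ++lock_acquisitions;
      config_finalized = true;  // no further configuration changes allowed
      // First thread to arrive initializes the shared per-device resources.
      initialized_devices.insert(device_id);
      seen_devices.insert(device_id);  // later calls from this thread skip the lock
    }
    // ... return the cached device_resources for device_id ...
  }
};
```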

Questions

This PR is still a WIP. It is being posted here now primarily to gather feedback on its public API. A thorough review of the implementation can happen once tests have been added and it is moved out of WIP. Edit: This PR is now ready for review. Implementation feedback welcome!

One question I have concerns two convenience functions added right at the end: synchronize_work_from_this_thread and synchronize_work_from_all_threads. The idea is that, for the target audience of this feature, it may be helpful to provide synchronization helpers that clearly indicate what their execution means in relation to the more familiar synchronization requirements of host code. The first is guaranteed to block host execution on the calling thread until all work submitted to the device from that thread has completed. The hope is that this can increase confidence around questions of synchronicity that developers unfamiliar with CUDA streams sometimes have. Because this helper is not strictly required for the core functionality of resource_manager, however, it would be useful to have feedback on whether others think it is worthwhile.

I am even less certain about synchronize_work_from_all_threads. It provides a way to block host thread execution until all work submitted on streams under the resource_manager's control has completed. A natural question is why we would not just perform a device synchronization at that point. My justification is that the application may also integrate other libraries with their own stream management and synchronization infrastructure; in that case, it may be desirable not to synchronize work those libraries have submitted to the device. It would be useful to hear whether others think this is valuable for the target audience of this PR, or whether we should just push them toward a device synchronization to avoid unpleasant edge cases.

@github-actions github-actions bot added the cpp label Aug 3, 2023
@wphicks wphicks added non-breaking Non-breaking change feature request New feature or request labels Aug 3, 2023
cjnolet (Member) commented Aug 4, 2023

I took a cursory look through the code. The intent is pretty clear and I think the implementation looks good. I'll take a closer look tomorrow and read through the description more thoroughly.

I suggest we name this device_resource_manager or even device_resources_manager so that it's consistent with all the other naming conventions in RAFT- it becomes blatantly clear just through the naming that it's specific to managing device resources.

(Commit note: Also implement destructor to ensure that resources created by the manager are appropriately destroyed before the CUDA context.)
@wphicks wphicks changed the title [FEA] Provide resource_manager for easy generation of device_resources [FEA] Provide device_resources_manager for easy generation of device_resources Aug 4, 2023
wphicks (Contributor, Author) commented Aug 4, 2023

Looks like my auto-signing configuration for git is not working. I'll rebase to add signatures to all commits before final review.

Edit: Figured it out!

divyegala (Member) left a comment:

Thanks for this PR! It's shaping really well and I am very happy with your descriptions as well how well-commented the code is.

cpp/include/raft/core/device_resources_manager.hpp (3 resolved review threads)
@github-actions github-actions bot added the CMake label Aug 7, 2023
@wphicks wphicks marked this pull request as ready for review August 7, 2023 21:31
@wphicks wphicks requested review from a team as code owners August 7, 2023 21:31
cjnolet (Member) left a comment:

I finally got a chance to review this more thoroughly. Mostly, I'd like to separate the concerns a little better between individual device_resources and the device_resources_manager. From the RAFT perspective, device_resources is not just a handle for invoking RAFT's APIs; it is supposed to be responsible for maintaining the state assigned to an individual device. In many ways, this device_resources_manager implementation is side-stepping the individual resource management when we could be using this as an opportunity to reuse the existing centralized resource management infrastructure.

Other than that, I think the design is great. RAFT has been in need of something that can manage multiple devices (e.g. instances of device_resources for different devices) across different threads.

I'm even okay stepping down a layer and using raft::resources directly, if that makes the experience more clean, but I'd like to integrate and reuse the raft::resource APIs as much as possible.

cpp/include/raft/core/device_resources_manager.hpp (2 resolved review threads)
wphicks (Contributor, Author) commented Aug 8, 2023

In many ways, this device_resources_manager implementation is side-stepping the individual resource management when we could be using this as an opportunity to reuse the existing centralized resource management infrastructure.

Could you be more specific? The intention of this object is to ensure that device_resources objects are constructed from a fixed pool of resources: for example, multiple device_resources objects may share a single stream in order to control the total number of streams used by the application per device. If you are saying that the device_resources objects themselves should be solely responsible for ownership of the underlying streams, there is no way to achieve that. If you are suggesting that we should create initial device_resources objects to distribute their resources among the final device_resources objects returned, this seems unnecessary when all we really want is a collection of streams (adequately encapsulated by rmm::cuda_stream_pool) and a collection of pools (adequately encapsulated by any container of rmm::cuda_stream_pool).

wphicks (Contributor, Author) commented Aug 8, 2023

Refactored to cache the generated device_resources. Still to finalize:

  • Fix setting of workspace memory resource
  • Fix docs
  • Move the log_level_setter implementation

ahendriksen (Contributor) left a comment:

Marking my comment about splitting the logger code into logger-inl and logger-ext as a request for changes 👍

https://github.com/rapidsai/raft/pull/1716/files#r1286897448

wphicks (Contributor, Author) commented Aug 10, 2023

Updated everything requested except the docs. I'll finish those up tomorrow.

ahendriksen (Contributor) left a comment:

One more move of log_level_setter requested. Sorry, should have given this a bit more thought in the previous comment.

cpp/include/raft/core/logger-inl.hpp (resolved review thread)
@wphicks wphicks requested review from ahendriksen and cjnolet August 11, 2023 21:33
ahendriksen (Contributor) left a comment:

Thanks for moving log_level_setter to logger-ext.cuh. LGTM!

cjnolet (Member) left a comment:

LGTM. Thanks @wphicks!

cjnolet (Member) commented Aug 15, 2023

/merge

@rapids-bot rapids-bot bot merged commit 7aded12 into rapidsai:branch-23.10 Aug 15, 2023
Labels
CMake cpp feature request New feature or request non-breaking Non-breaking change
4 participants