-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add system memory resource #1581
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved trivial CMake changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly documentation nits, but one unsigned integer arithmetic issue that wants fixed.
|
||
namespace detail { | ||
struct sam { | ||
static bool is_supported() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please document that this function returns whether something is supported on the currently active device (or, perhaps better, allow passing a cuda_device_id
in)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added parameter.
namespace rmm::mr { | ||
|
||
namespace detail { | ||
struct sam { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: What is sam
an abbreviation of? System accessible memory
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
System allocated memory. Added a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should spell it out in the name of the struct. I like to avoid unnecessary abbreviation in code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the struct.
* By default if no parameters are specified, this memory resource pass through to malloc/free | ||
* for allocation/deallocation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* By default if no parameters are specified, this memory resource pass through to malloc/free | |
* for allocation/deallocation. | |
* By default if no parameters are specified, this memory resource passes through to malloc/free | |
* for allocation/deallocation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed.
|
||
if (bytes >= threshold_size_) { | ||
auto const free = rmm::available_device_memory().first; | ||
auto const allocatable = std::max(free - headroom_size_, 0UL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: this subtraction is between two unsigned integers, and therefore wraps whenever free
is less than headroom_size_
. So exactly when you want to not allocate on the GPU, you will end up determining that you can allocate all of aligned
bytes on the GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, fixed.
RMM_CUDA_TRY(cudaMemAdvise(static_cast<char*>(ptr) + gpu_portion, | ||
cpu_portion, | ||
cudaMemAdviseSetPreferredLocation, | ||
cudaCpuDeviceId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: What do we want this interface to look like in a post cuda-12.2 world where cudaMemAdvice_v2
exists and we can specify numa affinity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. At least in GH200, there is only one NUMA node on the CPU side, so it doesn't really matter.
* However, when GPU memory is over-subscribed, system allocated memory will migrate to the GPU | ||
* and cannot migrate back, thus causing other CUDA calls to fail with out-of-memory errors. To | ||
* work around this problem, we can reserve some GPU memory as headroom for other CUDA calls. | ||
* Doing this check can be expensive, so only large buffer above the given threshold will be | ||
* checked. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think I quite follow this documentation. The joining However
conjunction doesn't seem to follow from the previous paragraph.
For example, what is "this check"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved, removed "however".
* Two cuda_memory_resources always compare equal, because they can each | ||
* deallocate memory allocated by the other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: This isn't a cuda_memory_resource
though.
I think for equality, the requirement RMM tends to have is that pointers allocated by memory resource A can be deallocated by memory resource B. But here we are stricter, since we also require that the headroom and threshold are identical. Is this desired?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, two minor docstring suggestions from me.
/ok to test |
Does this mean @rongou doesnt have commit signing set up? |
Looks like it, yes. @roungou: these docs describe how to set this up https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification |
/ok to test |
Ok, signed the commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. One design discussion we should have, and a bunch of details.
* be expensive, checking whether to set the preferred location is only done above the specified | ||
* allocation threshold size. | ||
* | ||
* Note that if threshold size is non-zero, then an application making many small allocations can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of this, I wonder if instead of a threshold parameter, you should using a binning_memory_resource
, and encode the threshold as the maximum allocation size for the lower bin (or second highest, if more than two bins). All allocations in the larger bin (above the threshold) will do the headroom check. All allocations in the smaller bin(s) won't. You could even tune the smaller bins by using multiple bins, some or all of which might not even use SAM (so they won't eat into the headroom).
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think we can leave the threshold checking to the caller. At least in the current use case, we want to configure the cupy allocator to use this, and it'd only be used for large buffers.
*/ | ||
void* do_allocate(std::size_t bytes, [[maybe_unused]] cuda_stream_view stream) override | ||
{ | ||
auto const aligned{rmm::align_up(bytes, rmm::CUDA_ALLOCATION_ALIGNMENT)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For SAM, shouldn't we be using HOST alignment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and did some experiments with both HMM and ATS, looks like we don't actually need to do additional alignments. Removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got more recent advice from the CUDA team and they suggested we should use CUDA alignment if it will be accessed on the device. If nothing else, for cache line alignment this is important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we do this alignment? Do we need to add some padding as in aligned_host_allocate
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to do host MR style alignment, but with CUDA alignment. PTAL
namespace rmm::mr { | ||
|
||
namespace detail { | ||
struct sam { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should spell it out in the name of the struct. I like to avoid unnecessary abbreviation in code.
|
||
namespace detail { | ||
/** @brief Struct to check if system allocated memory (SAM) is supported. */ | ||
struct sam { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: does this need to be a struct? It looks like it could just be a static function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's just a single function for now. Removed the struct.
tests/mr/device/system_mr_tests.cu
Outdated
mr.deallocate(ptr, size_mb); | ||
} | ||
|
||
TEST(SAMHeadroomAdaptorTest, ThrowIfNotSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: perhaps group this with the other ThrowIfNotSupported test above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Co-authored-by: Mark Harris <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One copy-pasto
friend void get_property(system_memory_resource const&, cuda::mr::device_accessible) noexcept {} | ||
|
||
/** | ||
* @brief Enables the `cuda::mr::device_accessible` property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief Enables the `cuda::mr::device_accessible` property | |
* @brief Enables the `cuda::mr::host_accessible` property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing all of my feedback. Approving, with a couple of comments.
explicit sam_headroom_resource_adaptor(Upstream* upstream, std::size_t headroom) | ||
: upstream_{upstream}, headroom_{headroom} | ||
{ | ||
static_assert(std::is_same_v<system_memory_resource, Upstream>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this could work with any prefetchable upstream, including a SAM MR that is already wrapped with another adaptor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is specifically designed to address the current shortcomings of SAM, so I'm not sure if it's useful for other things (e.g. managed memory). You are right about the adaptors though, but I'm leaning towards being more strict to avoid misuse. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, OK. We can always loosen it if needed.
tests/mr/device/mr_ref_test.hpp
Outdated
{ | ||
try { | ||
void* ptr = ref.allocate(bytes); | ||
EXPECT_NE(nullptr, ptr); | ||
EXPECT_TRUE(is_properly_aligned(ptr)); | ||
EXPECT_TRUE(is_device_accessible_memory(ptr)); | ||
if (not is_system_mr) { EXPECT_TRUE(is_device_accessible_memory(ptr)); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not device accessible memory, you shouldn't even be running these tests on it. If it's device accessible but is_device_accessible_memory
returns false, then we should investigate why it returns false and fix it. I don't really like the extra boolean parameter, the allocation tests should be generic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
try { | ||
return rmm::detail::aligned_host_allocate( | ||
bytes, CUDA_ALLOCATION_ALIGNMENT, [](std::size_t size) { return ::operator new(size); }); | ||
} catch (std::bad_alloc const&) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this, is it just changing the exception type? I think you should try to incorporate the what()
from the bad_alloc
into the new exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* more information, see | ||
* https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/. | ||
*/ | ||
class system_memory_resource final : public device_memory_resource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a use case where we want a system_memory_resource that is not limited to systems that support paging. So we don't want to limit it to is_system_memory_supported(device)
. Instead, we want to limit the device_accessible
property to be true on systems where the memory is device accessible. I'm currently checking whether the device_accessible property can be used as a dynamic property. I don't think so. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess technically you could have multiple GPUs on a system and only some of them support SAM (e.g. a Turing and a Pascal), so the dynamic property also needs a device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harrism can this be merged now or do you still want to get the device_accessible property resolved first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still waiting, have pinged @jrhemstad for help. I think since MRs are implicitly tied to the current device, it can query the current device, doesn't need to take a parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
device_accessible
is not a dynamic property.
For this kind of functionality, we'd need to add a new property like maybe_device_accessible
where get_property(mr, cuda::maybe_device_accessible)
returns a bool
for if it is device accessible or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we merge this now? We can always enhance it when we have a dynamic device_accessible
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrhemstad that would make this MR unusable with a resource_ref that expects device_accessible
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rongou Perhaps. I'm troubled that something called a system_memory_resource
is only usable on systems that support paging memory to device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a system_memory_resource
under the device
namespace, so it's probably reasonable to assume it only works on environments that support accessing system allocated memory from a device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no device
namespace, and device_memory_resource
and host_memory_resource
base classes are planned for removal.
@@ -32,6 +32,7 @@ INSTANTIATE_TEST_SUITE_P(ResourceTests, | |||
mr_factory{"CUDA_Async", &make_cuda_async}, | |||
#endif | |||
mr_factory{"Managed", &make_managed}, | |||
mr_factory{"System", &make_system}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this will cause the tests to fail on systems that don't support pageable memory access?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are skipped if SAM is not supported.
/merge |
Follow up on #1581 to add access to the system memory resource in python. Fixes #1622 Authors: - Rong Ou (https://github.com/rongou) Approvers: - Mark Harris (https://github.com/harrism) - Vyas Ramasubramani (https://github.com/vyasr) URL: #1605
Description
Adds a new device memory resource that uses system allocated memory. Works around some existing issues when GPU memory is oversubscribed.
closes #1580
Checklist