Add rmm::prefetch() and DeviceBuffer.prefetch() #1573
Conversation
I don't think this should be a member function. The fact that the member function is only useful for specific upstream resource types is a big code smell to me. Instead, I think this should be a freestanding function like:
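The code snippet that accompanied this comment did not survive extraction. Below is a minimal sketch of what such a freestanding function could look like; the exact signature and error handling are assumptions pieced together from the rest of this thread and the PR summary, not the reviewer's original snippet:

```cpp
// Hypothetical sketch of a freestanding prefetch (not the reviewer's exact
// code). Assumes the void const*/bytes signature discussed later in this
// thread and the cudaErrorInvalidValue handling described in the PR summary.
#include <rmm/cuda_device.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/detail/error.hpp>

#include <cuda_runtime_api.h>

namespace rmm {

void prefetch(void const* ptr,
              std::size_t bytes,
              rmm::cuda_device_id device,
              rmm::cuda_stream_view stream)
{
  auto result = cudaMemPrefetchAsync(ptr, bytes, device.value(), stream.value());
  // Prefetch is a no-op on unmanaged memory: cudaMemPrefetchAsync returns
  // cudaErrorInvalidValue in that case, which we deliberately ignore.
  if (result != cudaErrorInvalidValue) { RMM_CUDA_TRY(result); }
}

}  // namespace rmm
```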
I actually had the same thought over the weekend. But there are a few other factors to consider.
include/rmm/prefetch.hpp
* @param stream The stream to use for the prefetch
*/
template <typename T>
void prefetch(cuda::std::span<T> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
This function shouldn't be mutating any of the data.
- void prefetch(cuda::std::span<T> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
+ void prefetch(cuda::std::span<T const> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
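A brief illustration of why `span<T const>` works for callers on both sides of constness: a span of mutable elements converts implicitly to a span of const elements, so the const-correct overload still accepts both. This is a minimal sketch, not code from the PR:

```cpp
#include <cuda/std/span>

void example(int* ptr, std::size_t n)
{
  cuda::std::span<int> mutable_view{ptr, n};
  // Implicit, safe widening conversion: a caller holding a span<int> can
  // still call an API that takes span<int const>.
  cuda::std::span<int const> const_view = mutable_view;
  (void)const_view;
}
```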
include/rmm/prefetch.hpp
* @param ptr The pointer to the memory to prefetch
* @param size The number of bytes to prefetch
* @param device The device to prefetch to
* @param stream The stream to use for the prefetch
*/
template <typename T>
void prefetch(T* ptr, std::size_t size, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
If `ptr` is typed, then `size` shouldn't be bytes; it should be elements.
Agreed. If this takes `T* ptr`, it should use `size * sizeof(T)` to compute the bytes. Or, if this is really designed for the case of device buffers, it could just use `void* ptr` and accept `size` in bytes.
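As a concrete illustration of the first option, a typed overload could compute the byte count and forward to a byte-based overload. This is a sketch only, assuming the `void const*`/bytes signature settled on below:

```cpp
// Sketch: a typed overload that takes an element count and computes the
// byte size, forwarding to a (hypothetical) void const*/bytes overload.
template <typename T>
void prefetch(T const* ptr,
              std::size_t count,  // number of elements, not bytes
              rmm::cuda_device_id device,
              rmm::cuda_stream_view stream)
{
  prefetch(static_cast<void const*>(ptr), count * sizeof(T), device, stream);
}
```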
I'm switching it back to `void const*` because then we can use `span::size_bytes()` in the span function. Someone suggested the `T*` version during discussions but I can't remember who or why. If there is a good reason, I'm all ears.
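That forwarding could look roughly like this (a sketch under the assumptions above, not the merged code):

```cpp
// Sketch: the span overload forwards to the void const*/bytes overload,
// which is what makes span::size_bytes() convenient here.
template <typename T>
void prefetch(cuda::std::span<T const> data,
              rmm::cuda_device_id device,
              rmm::cuda_stream_view stream)
{
  prefetch(static_cast<void const*>(data.data()), data.size_bytes(), device, stream);
}
```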
def test_rmm_device_buffer_prefetch(pool, managed):
    rmm.reinitialize(pool_allocator=pool, managed_memory=managed)
    db = rmm.DeviceBuffer.to_device(np.zeros(256, dtype="u1"))
    db.prefetch()  # just test that it doesn't throw
You might be able to test that a prefetch call was issued with `cudaMemRangeGetAttribute(void* data, size_t dataSize, cudaMemRangeAttribute attribute, const void* devPtr, size_t count)`, with attribute `cudaMemRangeAttributeLastPrefetchLocation`. It ought to be possible to call this via cuda-python (API docs).
Also, per the documentation: "Note that this simply returns the last location that the application requested to prefetch the memory range to. It gives no indication as to whether the prefetch operation to that location has completed or even begun."
You wouldn't know if the prefetch completed or not, but you could verify that the prefetch request was issued.
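For reference, a verification along those lines might look like this in C++ (a sketch; the helper name and exact check are illustrative, not the PR's test code):

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>

// Returns true if the last prefetch request for [ptr, ptr + bytes) targeted
// expected_device. This only confirms the request was issued, not that it
// completed (per the CUDA documentation quoted above).
bool last_prefetch_location_is(void const* ptr, std::size_t bytes, int expected_device)
{
  int location{};  // cudaMemRangeAttributeLastPrefetchLocation yields an int device id
  auto err = cudaMemRangeGetAttribute(&location,
                                      sizeof(location),
                                      cudaMemRangeAttributeLastPrefetchLocation,
                                      ptr,
                                      bytes);
  return err == cudaSuccess && location == expected_device;
}
```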
...of course, as soon as I scrolled down, I see you did this exact thing in the C++ tests, at Vyas's request. It would be nice to have a corresponding Python API test, since it should be quick to write with cuda-python.
OK...
CUDA Python results in very ugly code due to its non-pythonic error handling. But I've done what you asked...
Devcontainer build failures are due to (I think) not having NVIDIA/cccl#1836.
I don't understand; this was working previously. I think this broke with the update to CCCL 2.5. Is this a regression in 2.5? @miscco any idea why this was working fine before? I tried to go back to your godbolt example, but godbolt only supports up to CCCL 2.2. Note that this only affects `device_scalar`; `device_uvector` converts to a span no problem. I confirmed that adding begin/end to `device_scalar` fixes the conversion.
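For context, once `device_scalar` exposes the members a span needs, prefetching it through the span overload could look like this (a sketch; it assumes `device_scalar::data()` and the `size()` member added in this PR):

```cpp
// Sketch: viewing a device_scalar as a one-element span for prefetch.
rmm::device_scalar<int> scalar{stream};
cuda::std::span<int const> view{scalar.data(), scalar.size()};  // size() == 1
rmm::prefetch(view, rmm::get_current_cuda_device(), stream);
```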
It seems the doc-build pipeline failed but there's no log. Is it possible to retrigger it?
/merge
device : optional
    The CUDA device to which to prefetch the memory for this buffer.
    Defaults to the current CUDA device. To prefetch to the CPU, pass
    `~cuda.cudart.cudaCpuDeviceId` as the device.
It seems the doc build did not render it right, from here:
https://downloads.rapids.ai/ci/rmm/pull-request/1573/5165889/docs/rmm/html/
Possibly because the default role in RMM is "cpp" instead of "py". If so, the fix would be :py:`~cuda.cudart.cudaCpuDeviceId`.
I am certain that the correct reference is there. It can be checked as follows:
$ python -m sphinx.ext.intersphinx https://nvidia.github.io/cuda-python/objects.inv | grep "cuda.cudart.cudaCpuDeviceId"
cuda.cudart.cudaCpuDeviceId : module/cudart.html#cuda.cudart.cudaCpuDeviceId
Did this get fixed or should this be raised in a new issue?
I don’t think this was seen after merging. Issue filed: #1635 (comment)
Fix: #1636 (comment)
Thanks Bradley! 🙏
This adds two `rmm::prefetch()` functions in C++:

1. `rmm::prefetch(void *ptr, size_t bytes, device, stream)`
2. `rmm::prefetch<T>(cuda::std::span<T> data, device, stream)`

Item 2 enables prefetching the containers that RMM provides (`device_uvector`, `device_scalar`) that support conversion to `cuda::std::span`. In order to enable that, `device_scalar::size()` is added. Note that `device_buffer`s must be prefetched using item 1 because you can't create a `span<void>`.

In Python, this adds `DeviceBuffer.prefetch()` because that's really the only RMM Python data type to prefetch. There is *one* Cython use of `device_uvector` in cuDF `join` that we might need to add prefetch support for later.

`prefetch` is a no-op on non-managed memory. Rather than querying the type of memory, it just catches `cudaErrorInvalidValue` from `cudaMemPrefetchAsync`.

Authors:
- Mark Harris (https://github.com/harrism)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Rong Ou (https://github.com/rongou)
- Jake Hemstad (https://github.com/jrhemstad)
- Michael Schellenberger Costa (https://github.com/miscco)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #1573
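To make the two entry points concrete, usage might look roughly like this (a sketch; the `mr` and `stream` parameters and the explicit span construction are illustrative assumptions, not code from the PR):

```cpp
#include <rmm/cuda_device.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/prefetch.hpp>
#include <rmm/resource_ref.hpp>

void example(rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr)
{
  auto device = rmm::get_current_cuda_device();

  // Overload 1: raw pointer plus size in bytes. This is the only option for
  // device_buffer, since a span<void> cannot be created.
  rmm::device_buffer buf{4096, stream, mr};
  rmm::prefetch(buf.data(), buf.size(), device, stream);

  // Overload 2: a span of elements, e.g. over a device_uvector.
  rmm::device_uvector<float> vec{1000, stream, mr};
  cuda::std::span<float const> view{vec.data(), vec.size()};
  rmm::prefetch(view, device, stream);
}
```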