UCM/CUDA: Add support for CUDA virtual memory with cudaMallocAsync
#8623
Conversation
cc @Akshay-Venkatesh for awareness
@pentschev The general question is: what are the performance expectations for mallocAsync memory?
This is a good point. Is there a mechanism for distinguishing "sub-types", or would we have to introduce a new type, e.g., UCS_MEMORY_TYPE_CUDA_VIRTUAL?
This is a very difficult question to answer. In a perfect world we would, of course, want to have the same performance as we do for CUDA "default" memory (PA-backed memory allocated via cudaMalloc). In the future I would say it would be worth having performance on par with that as well.
@pentschev IMO adding another memory type is pretty intrusive. How is it different from CUDA pinned memory?
A completely different API set is required for handling both virtual and stream-ordered memory. Virtual and stream-ordered memory are not PA-backed like regular pinned memory, so supporting them properly has different requirements. To be fair, I do not know the internals of UCX at this level that well, so there's definitely a chance that adding a new memory type is indeed not required. My reasoning is currently that virtual/stream-ordered memory will have different requirements than pinned memory and will most likely require a different code path. I'm certainly open to not having another memory type if we can manage the reasons I pointed out above in a different way.
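For readers less familiar with the allocator being discussed, a minimal sketch of plain CUDA runtime usage of the stream-ordered allocator (not UCX code; requires CUDA 11.2+): allocations and frees are ordered with respect to a stream and served from a driver-managed memory pool.

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaStream_t stream;
    void *ptr = NULL;

    cudaStreamCreate(&stream);
    cudaMallocAsync(&ptr, 1 << 20, stream);  /* 1 MiB, ordered on "stream" */
    /* ... kernels/copies using ptr would be enqueued on the same stream ... */
    cudaFreeAsync(ptr, stream);              /* returned to the pool, stream-ordered */
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    printf("stream-ordered alloc/free done\n");
    return 0;
}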
@pentschev based on an offline discussion, @yosefe suggested treating virtual memory as managed memory, so that the same protocols used with managed memory will be used with mallocAsync for the time being. This would give the desired performance you see by setting UCX_RNDV_FRAG_MEM_TYPE=cuda, and drop all additions of CUDA_VIRTUAL.
That seems fine for now, but how do we add dedicated support for it later on?
I now pushed changes removing the new memory type. Also replying to @yosefe's points:
Sounds reasonable to me. The only regression from the original PR is that we now can't test that path in ucx/test/gtest/common/mem_buffer.cc (lines 163 to 170 in c5d5b0c).
Are there any other existing ways we could test that besides having a separate memory type, or would that have to be done by adding a new "allocation method" as you're suggesting for perftest?
That would be great; it's very useful to be able to test and verify performance of the stream-ordered allocator, as it is very different under the hood from pinned/managed memory.
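As a rough illustration of the "allocation method" idea (the names and structure below are hypothetical, not existing UCX test code), a test helper could dispatch on a run-time selector so the same test exercises the default, managed and stream-ordered allocators:

#include <cuda_runtime.h>

typedef enum {
    ALLOC_METHOD_CUDA,         /* cudaMalloc */
    ALLOC_METHOD_CUDA_MANAGED, /* cudaMallocManaged */
    ALLOC_METHOD_CUDA_ASYNC    /* cudaMallocAsync (stream-ordered), CUDA >= 11.2 */
} alloc_method_t;

static cudaError_t test_alloc(alloc_method_t method, void **ptr, size_t size,
                              cudaStream_t stream)
{
    switch (method) {
    case ALLOC_METHOD_CUDA:
        return cudaMalloc(ptr, size);
    case ALLOC_METHOD_CUDA_MANAGED:
        return cudaMallocManaged(ptr, size, cudaMemAttachGlobal);
    case ALLOC_METHOD_CUDA_ASYNC:
        return cudaMallocAsync(ptr, size, stream);
    default:
        return cudaErrorInvalidValue;
    }
}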
attr_type[3] = CU_POINTER_ATTRIBUTE_CONTEXT;
attr_data[3] = &cuda_mem_ctx;
@Akshay-Venkatesh cudaMallocAsync first appeared in 11.2.0, but CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE only appeared in 11.3.0, therefore I changed the check here, which should still be equivalent based on offline conversations with the CUDA team; see the updated code comments just below. Also, CU_POINTER_ATTRIBUTE_CONTEXT has existed since at least CUDA 10.0, so I believe we don't need any extra preprocessor checks here. What do you think?
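For illustration, a hedged sketch of the kind of attribute query being discussed (not the exact cuda_md.c code; the helper name and the exact decision rule are assumptions based on the comment above): query the memory type together with the owning context, and treat a device pointer whose CU_POINTER_ATTRIBUTE_CONTEXT comes back NULL as pool/stream-ordered memory, avoiding CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE, which requires CUDA 11.3.

#include <cuda.h>
#include <stddef.h>

static int is_pool_backed(CUdeviceptr ptr)
{
    CUpointer_attribute attrs[2] = {CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                    CU_POINTER_ATTRIBUTE_CONTEXT};
    unsigned int mem_type        = 0;
    CUcontext ctx                = NULL;
    void *data[2]                = {&mem_type, &ctx};

    if (cuPointerGetAttributes(2, attrs, data, ptr) != CUDA_SUCCESS) {
        return 0;
    }

    /* device memory with no owning context: allocated from a memory pool */
    return (mem_type == CU_MEMORYTYPE_DEVICE) && (ctx == NULL);
}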
src/ucm/cuda/cudamem.c
Outdated
#if CUDA_VERSION >= 11020
    UCM_CUDA_FUNC_ENTRY(cuMemFreeAsync),
#endif
    {{NULL}, NULL}};
for completeness, can you add hooks below for the runtime APIs as well? It might be useful when bistro hooks fail.
Done.
src/ucm/cuda/cudamem.c
Outdated
@@ -58,6 +58,21 @@
    return ret; \
}

/* Create a body of CUDA memory release replacement function */
#define UCM_CUDA_FREE_ASYNC_FUNC(_name, _retval, _ptr_type, _stream_type, \
can we extend UCM_CUDA_FREE_FUNC to have var-args instead of _stream_type?
see for example how UCM_CUDA_ALLOC_FUNC is using varargs
I've been trying to do that but have failed so far. I tried two different approaches, described below.
The first one would be to just add _args_format and ..., but that would then require existing functions to also pass those empty arguments, and it seems problematic to handle empty var-args; I'm not sure if there are any preprocessor macros currently available to handle that case, plus it would pollute the existing UCM_CUDA_FREE_FUNC calls with empty parameters.
The second was to pass everything as var-args, including the pointer. We could then extract the first argument (the pointer) with UCS_PP_TUPLE_0 and make use of it the way the explicit parameter is currently used. However, this would then require passing __VA_ARGS__ to ucm_trace, which would need type-casting of the individual variables.
I'm not very experienced writing var-arg macros, so maybe you could suggest other approaches?
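For what it's worth, a tiny standalone sketch of the second approach (plain C, nothing UCX-specific; FIRST_ARG/GET_FIRST stand in for UCS_PP_TUPLE_0-style helpers, and the hook and function names are made up):

#include <stdio.h>

/* expands to the first element of a non-empty argument list */
#define FIRST_ARG(_first, ...) _first
#define GET_FIRST(...)         FIRST_ARG(__VA_ARGS__, unused)

/* body of a hypothetical free hook where all wrapped-function arguments are
 * passed as varargs and the released pointer is recovered as the first one */
#define HOOK_FREE_BODY(_name, ...)                                          \
    do {                                                                    \
        printf(#_name ": releasing %p\n", (void*)GET_FIRST(__VA_ARGS__));  \
    } while (0)

int main(void)
{
    void *buf    = (void*)0x1000; /* pretend device pointer */
    void *stream = NULL;          /* stands in for a CUstream handle */

    (void)stream;
    HOOK_FREE_BODY(cuMemFreeAsync, buf, stream);
    return 0;
}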
src/ucm/cuda/cudamem.c
Outdated
    {{NULL}, NULL}
};
static ucm_cuda_func_t ucm_cuda_driver_funcs[] =
    {UCM_CUDA_FUNC_ENTRY(cuMemAlloc),
can we avoid changing the layout/indentation of the whole array and just add new items?
Done. As per our conversation offline, this may be ignored.
@@ -20,10 +20,18 @@ CUresult ucm_cuMemAllocPitch(CUdeviceptr *dptr, size_t *pPitch,
CUresult ucm_cuMemAllocPitch_v2(CUdeviceptr *dptr, size_t *pPitch,
                                size_t WidthInBytes, size_t Height,
                                unsigned int ElementSizeBytes);
#if CUDA_VERSION >= 11020
the #if is probably not needed here since it's just a func prototype; it's not calling any CUDA API here.
Done.
The lack of #if fails for older CUDA because cudaMemPool_t/CUmemoryPool only appeared in CUDA 11.2. It seems like we need those directives anyway, or is there some better way of handling that case?
For the time being I reinstated the #ifs. I guess one alternative would be to define cudaMemPool_t/CUmemoryPool ourselves for older CUDA versions, but it feels like that would pollute the code more, so I'm not sure this is a great alternative.
Ok
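For reference, a minimal sketch of the guarded declarations being discussed (the ucm_ prototypes below illustrate the pattern and are not copied from the PR): only declarations that use types introduced in CUDA 11.2, such as CUmemoryPool, need to sit behind the version check.

#include <stddef.h>
#include <cuda.h>

#if CUDA_VERSION >= 11020
/* these use CUmemoryPool / stream-ordered entry points from CUDA 11.2 */
CUresult ucm_cuMemAllocAsync(CUdeviceptr *dptr, size_t size, CUstream hStream);
CUresult ucm_cuMemAllocFromPoolAsync(CUdeviceptr *dptr, size_t size,
                                     CUmemoryPool pool, CUstream hStream);
CUresult ucm_cuMemFreeAsync(CUdeviceptr dptr, CUstream hStream);
#endif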
src/ucm/cuda/cudamem.h
Outdated
CUresult ucm_cuMemFree(CUdeviceptr dptr);
CUresult ucm_cuMemFree_v2(CUdeviceptr dptr);
CUresult ucm_cuMemFreeHost(void *p);
CUresult ucm_cuMemFreeHost_v2(void *p);
#if CUDA_VERSION >= 11020
same here
Done.
src/uct/cuda/base/cuda_md.c
Outdated
if (is_managed) {
/* cuMemGetAddress range does not support managed memory so use provided
 * address and length as base address and alloc length respectively */
if (is_managed || !cuda_mem_ctx) {
(cuda_mem_ctx == NULL)
Done.
src/uct/ib/rc/accel/rc_mlx5_iface.c
Outdated
if (mlx5_config->tm.seg_size > UCT_IB_MLX5_MP_RQ_BYTE_CNT_MASK) {
    ucs_error("TM segment size is too big %ld, it must not exceed %d",
              mlx5_config->tm.seg_size, UCT_IB_MLX5_MP_RQ_BYTE_CNT_MASK);
    return UCS_ERR_INVALID_PARAM;
merge artifact?
Indeed, sorry for that; it should be gone now.
test/gtest/ucm/cuda_hooks.cc
Outdated
int supported_cuda_version;
CUresult cu_err;

cu_err = cuDriverGetVersion(&supported_cuda_version);
ASSERT_EQ(cu_err, CUDA_SUCCESS);

if (supported_cuda_version < 11020) {
    GTEST_SKIP_("cuMemAllocAsync not available");
}
Originally I tried to do this with cuGetProcAddress, but cuGetProcAddress only got introduced in CUDA 11.3. I guess we could skip tests for older versions as well, but that feels like introducing an artificial cap, so checking the latest CUDA version supported by the driver seems like a better alternative here.
@pentschev i've pushed another commit to this PR for #8623 (comment)
Awesome, thanks so much @yosefe! I got so close, but the magic I was missing was arg0. Also addressed all other comments. Finally, it seems like the tests are all failing with the same error.
I'm not really sure how that happens; it seems to occur for all CUDA versions, including those that do support cudaMallocAsync.
arg0 is the first argument in the vararg list. It means that the released pointer (that we need to pass down to mem hooks) is the first argument to the CUDA function. We have a similar thing with allocation functions, where we provide the expression to calculate the allocated buffer size based on the arguments.
There was a missing definition for cudaMallocAsync that I added in my commit. Hopefully the tests will pass now.
UCM_CUDA_FREE_FUNC(cuMemFree, UCS_MEMORY_TYPE_CUDA, CUresult, arg0,
                   "ptr=0x%llx", CUdeviceptr)
UCM_CUDA_FREE_FUNC(cuMemFree_v2, UCS_MEMORY_TYPE_CUDA, CUresult, arg0,
                   "ptr=0x%llx", CUdeviceptr)
@yosefe I don't see cuMemFree or cuMemFree_v2 with UCS_MEMORY_TYPE_CUDA_MANAGED for cases that allocate with cuMemAllocManaged. We should be covering that case as well. Will this result in potentially attempting to remove UCS_MEMORY_TYPE_CUDA ranges from the memtype cache for memory that was allocated with cuMemAllocManaged?
@Akshay-Venkatesh actually it doesn't matter which memory type is used for the memory free event.
We can remove memtype from free altogether (pass UCS_MEM_TYPE_UNKNOWN to the event), but since that "issue" existed before this PR, IMO it should not block this PR for v1.14.
                   cudaMemPool_t, cudaStream_t)
#endif
UCM_CUDA_FREE_FUNC(cudaFree, UCS_MEMORY_TYPE_CUDA, cudaError_t, arg0,
                   "devPtr=%p", void*)
@yosefe same question as for cuMemFree. We could be trying to free a CUDA managed memory region with the pinned CUDA free. Not sure if removing range entries in memtype_cache ignores memtype.
it does: ucs_memtype_cache_update_internal uses the "mem_type" parameter only if action == UCS_MEMTYPE_CACHE_ACTION_SET_MEMTYPE.
When called from the memory free hook, the action is UCS_MEMTYPE_CACHE_ACTION_REMOVE.
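To make that concrete, a simplified illustration of the behavior described above (not the actual UCX implementation; names and types are stand-ins): the memory type is only honored when an entry is set, and removal keys purely on the address range, so whatever type a free hook reports is irrelevant.

#include <stddef.h>
#include <stdio.h>

typedef enum {
    CACHE_ACTION_SET_MEMTYPE, /* analogous to UCS_MEMTYPE_CACHE_ACTION_SET_MEMTYPE */
    CACHE_ACTION_REMOVE       /* analogous to UCS_MEMTYPE_CACHE_ACTION_REMOVE */
} cache_action_t;

static void memtype_cache_update(void *address, size_t size, int mem_type,
                                 cache_action_t action)
{
    if (action == CACHE_ACTION_REMOVE) {
        /* mem_type is ignored: the range is dropped regardless of how the
         * caller labeled the memory being freed */
        printf("remove range %p size %zu\n", address, size);
        return;
    }

    printf("insert range %p size %zu with mem_type %d\n", address, size,
           mem_type);
}

int main(void)
{
    char buf[64];

    memtype_cache_update(buf, sizeof(buf), /* mem_type */ 1,
                         CACHE_ACTION_SET_MEMTYPE);
    memtype_cache_update(buf, sizeof(buf), /* mem_type ignored */ 0,
                         CACHE_ACTION_REMOVE);
    return 0;
}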
This passes now, thanks so much for all the help in getting everything fixed @yosefe!
@pentschev can you pls squash the commits to one commit?
@yosefe done. Is the commit name right, or should it also include anything else?
Thanks @yosefe for the help in getting things correct here!
Referenced by a downstream commit (…1012): This change requires UCX 1.14 for async support from openucx/ucx#8623 and UVM fixes from openucx/ucx#8754. Authors: Peter Andreas Entschev (https://github.com/pentschev), Mads R. B. Kristensen (https://github.com/madsbk). Approvers: Benjamin Zaitlen (https://github.com/quasiben). URL: #1012
What
Adding support for CUDA virtual/stream-ordered memory with cudaMallocAsync.
Why?
For proper CUDA virtual memory identification, add the required hooking for cudaMallocAsync. Although this currently adds support for cudaMallocAsync/cudaMallocFromPoolAsync, the stream-ordered memory allocator is part of Virtual Memory Management in CUDA, which is why its special case was named UCS_MEMORY_TYPE_CUDA_VIRTUAL for the time being.
This change was successfully tested with Dask+UCX-Py using bounce buffers (setting UCX_RNDV_FRAG_MEM_TYPE=cuda) and behaves correctly so far, also providing performance similar to CUDA managed memory under the same conditions.