
UCM/CUDA: Add support for CUDA virtual memory with cudaMallocAsync #8623

Merged: 1 commit into openucx:master on Oct 18, 2022

Conversation

pentschev (Contributor)

What

Adding support for CUDA virtual/stream-ordered memory with cudaMallocAsync.

Why?

For proper CUDA virtual memory identification, add the required hooking for cudaMallocAsync. Although this currently adds support for cudaMallocAsync/cudaMallocFromPoolAsync, the stream-ordered memory allocator is part of CUDA's Virtual Memory Management, which is why its special case was named UCS_MEMORY_TYPE_CUDA_VIRTUAL for the time being.

This change was successfully tested with Dask+UCX-Py using bounce buffers (setting UCX_RNDV_FRAG_MEM_TYPE=cuda); it has behaved correctly so far and provides performance similar to CUDA managed memory under the same conditions.
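
For context, here is a minimal sketch (not part of the PR) of the stream-ordered allocation pattern being hooked, assuming CUDA >= 11.2:

#include <cuda_runtime.h>

int main(void)
{
    cudaStream_t stream;
    void *ptr;

    cudaStreamCreate(&stream);
    /* stream-ordered allocation: this is the call UCM intercepts */
    cudaMallocAsync(&ptr, 1 << 20, stream);
    /* ... enqueue work using ptr on the same stream ... */
    cudaFreeAsync(ptr, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}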

@pentschev (Author)

cc @Akshay-Venkatesh for awareness

@Akshay-Venkatesh (Contributor)

Akshay-Venkatesh commented Oct 12, 2022

@pentschev Looks like the cuda_copy transport gains the capability of detecting virtual memory, and UCP happens to handle virtual pinned memory the same way as managed memory. When mallocAsync supports managed memory, the changes introduced through this PR would also need to distinguish sub-types within UCS_MEMORY_TYPE_CUDA_VIRTUAL.

The general question is: what are the performance expectations for mallocAsync memory?

  1. If the expectation is that staging through pinned memory for both intra-node and inter-node paths provides enough performance, then we can rework without introducing UCS_MEMORY_TYPE_CUDA_VIRTUAL, instead adding a mempool attribute to CUDA memory (whether it be pinned mallocAsync or, in the future, managed mallocAsync); see the detection sketch after this list.
  2. If the expectation is that mallocAsync memory needs to use the IPC capabilities introduced for mempools, then we would need to introduce new paths in the cuda_ipc transport.
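
As a rough illustration of the mempool-attribute idea in point 1, a sketch of how a pointer's pool backing can be queried with the driver API (this is not code from the PR, and CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE requires CUDA >= 11.3):

#include <cuda.h>

/* returns non-zero if ptr was allocated from a memory pool,
 * i.e. via cuMemAllocAsync/cudaMallocAsync */
static int ptr_is_mempool_backed(CUdeviceptr ptr)
{
    CUmemoryPool pool = NULL;

    if (cuPointerGetAttribute(&pool, CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE,
                              ptr) != CUDA_SUCCESS) {
        return 0;
    }
    return pool != NULL;
}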

@pentschev (Author)

> @pentschev Looks like the cuda_copy transport gains the capability of detecting virtual memory, and UCP happens to handle virtual pinned memory the same way as managed memory. When mallocAsync supports managed memory, the changes introduced through this PR would also need to distinguish sub-types within UCS_MEMORY_TYPE_CUDA_VIRTUAL.

This is a good point. Is there a mechanism for distinguishing "sub-types", or would we have to introduce a new type, e.g., UCS_MEMORY_TYPE_CUDA_VIRTUAL_MANAGED?

> The general question is: what are the performance expectations for mallocAsync memory?
>
>   1. If the expectation is that staging through pinned memory for both intra-node and inter-node paths provides enough performance, then we can rework without introducing UCS_MEMORY_TYPE_CUDA_VIRTUAL, instead adding a mempool attribute to CUDA memory (whether it be pinned mallocAsync or, in the future, managed mallocAsync).
>   2. If the expectation is that mallocAsync memory needs to use the IPC capabilities introduced for mempools, then we would need to introduce new paths in the cuda_ipc transport.

This is a very difficult question to answer. In a perfect world we would, of course, want the same performance as for "default" CUDA memory (PA-backed memory allocated via cudaMalloc), but that is generally infeasible in practice. In the tests we have run so far, using the staging buffers provides performance very similar to PA-backed CUDA allocations; the penalty we observed was roughly 10%, which is definitely acceptable for now. However, those tests involved a single-node DGX-1 only, so I can't say whether this performance will scale.

In the future I would say it is worth having cuda_ipc support as well. As we push towards newer systems with much higher cuda_ipc bandwidth limits, like the H100, staging allocations may impose a higher overall penalty, particularly when transferring large amounts of data at once: the time spent staging buffers may eventually become too high and force us to dedicate more and more memory to the staging buffers to work around that. Please take this with a grain of salt; I do not have numbers at this time to back or falsify my theory, just trying to make an educated guess here.

@yosefe (Contributor)

yosefe commented Oct 13, 2022

@pentschev IMO adding another memory type is pretty intrusive. How is it different from CUDA pinned memory?

@pentschev (Author)

> @pentschev IMO adding another memory type is pretty intrusive. How is it different from CUDA pinned memory?

A completely different API set is required for handling both virtual and stream-ordered memory. Virtual and stream-ordered memory are not PA-backed like regular pinned memory, so supporting cuda_ipc or verbs will have different requirements than for pinned memory, though not cuda_copy or staging buffers, which this PR already supports.

To be fair, I do not know the internals of UCX at this level that well, so there's definitely a chance that adding a new memory type is indeed not required. My current reasoning is that virtual/stream-ordered memory has different requirements than pinned memory and will most likely need a different code path for cuda_ipc and verbs. Plus, the way testing is organized in UCX requires having a different memory type; for example, passing ucx_perftest -m cuda-virtual would not be possible without one, and it would definitely be useful to be able to test/profile virtual/stream-ordered memory individually. Finally, as Akshay pointed out above, at some point we will have "virtual managed" memory, integrating both concepts, but it currently isn't clear to me whether the requirements will differ.

I'm certainly open to not having another memory type if we can address the reasons above in a different way.

@Akshay-Venkatesh (Contributor)

> @pentschev IMO adding another memory type is pretty intrusive. How is it different from CUDA pinned memory?
>
> A completely different API set is required for handling both virtual and stream-ordered memory. Virtual and stream-ordered memory are not PA-backed like regular pinned memory, so supporting cuda_ipc or verbs will have different requirements than for pinned memory, though not cuda_copy or staging buffers, which this PR already supports.
>
> To be fair, I do not know the internals of UCX at this level that well, so there's definitely a chance that adding a new memory type is indeed not required. My current reasoning is that virtual/stream-ordered memory has different requirements than pinned memory and will most likely need a different code path for cuda_ipc and verbs. Plus, the way testing is organized in UCX requires having a different memory type; for example, passing ucx_perftest -m cuda-virtual would not be possible without one, and it would definitely be useful to be able to test/profile virtual/stream-ordered memory individually. Finally, as Akshay pointed out above, at some point we will have "virtual managed" memory, integrating both concepts, but it currently isn't clear to me whether the requirements will differ.
>
> I'm certainly open to not having another memory type if we can address the reasons above in a different way.

@pentschev Based on an offline discussion, @yosefe suggested treating virtual memory as managed memory, so that the same protocols used with managed memory are used with mallocAsync for the time being. This would give the desired performance you see by setting UCX_RNDV_FRAG_MEM_TYPE=cuda. So effectively, you'd set here

mem_info->type = UCS_MEMORY_TYPE_CUDA_MANAGED;

and drop all additions of CUDA_VIRTUAL.
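
A contextual sketch of what that suggestion amounts to in the detection path (simplified and hypothetical; is_stream_ordered stands in for whatever check the MD code actually performs):

if (is_managed || is_stream_ordered) {
    /* treat stream-ordered (cudaMallocAsync) memory as managed for now,
     * so that the same protocols apply */
    mem_info->type = UCS_MEMORY_TYPE_CUDA_MANAGED;
} else {
    mem_info->type = UCS_MEMORY_TYPE_CUDA;
}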

@pentschev (Author)

> @pentschev Based on an offline discussion, @yosefe suggested treating virtual memory as managed memory, so that the same protocols used with managed memory are used with mallocAsync for the time being. This would give the desired performance you see by setting UCX_RNDV_FRAG_MEM_TYPE=cuda. So effectively, you'd set here
>
> mem_info->type = UCS_MEMORY_TYPE_CUDA_MANAGED;
>
> and drop all additions of CUDA_VIRTUAL.

That seems fine for now, but how do we add support for cudaMallocAsync in ucx_perftest, e.g., here? Can we still do that?

@yosefe (Contributor)

yosefe commented Oct 13, 2022

> That seems fine for now, but how do we add support for cudaMallocAsync in ucx_perftest, e.g., here? Can we still do that?

  1. It could be in another PR, to expedite merging this PR for v1.14.x.
  2. We can consider adding an "allocation method" parameter to perftest, in addition to memory type. Even pinned memory can be allocated by more than one API, not to mention host memory...

@pentschev (Author)

I now pushed changes removing the new memory type.

Also replying to @yosefe's points:

> It could be in another PR, to expedite merging this PR for v1.14.x.

Sounds reasonable to me. The only regression from the original PR is that we now can't test cudaMallocAsync in:

#if HAVE_CUDA
    case UCS_MEMORY_TYPE_CUDA:
        CUDA_CALL(cudaMalloc(&ptr, size), ": size=" << size);
        return ptr;
    case UCS_MEMORY_TYPE_CUDA_MANAGED:
        CUDA_CALL(cudaMallocManaged(&ptr, size), ": size=" << size);
        return ptr;
#endif

Are there any other existing ways we could test that there, besides having a separate memory type, or would it have to be done by adding a new "allocation method" as you're suggesting for perftest?

> We can consider adding an "allocation method" parameter to perftest, in addition to memory type. Even pinned memory can be allocated by more than one API, not to mention host memory...

That would be great; it's very useful to be able to test and verify performance of the stream-ordered allocator, as it is very different under the hood from pinned/managed memory.
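
For illustration, a hypothetical shape such a case could take in the switch quoted above; the memory-type name is invented and the default-stream usage is an assumption, not existing UCX code (requires CUDA >= 11.2):

#if HAVE_CUDA && (CUDART_VERSION >= 11020)
    case UCS_MEMORY_TYPE_CUDA_ASYNC: /* hypothetical name */
        CUDA_CALL(cudaMallocAsync(&ptr, size, 0), ": size=" << size);
        /* the allocation is ordered on the default stream; synchronize so
         * the pointer is immediately safe to use */
        CUDA_CALL(cudaStreamSynchronize(0), ": size=" << size);
        return ptr;
#endif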

Comment on lines +131 to +136
attr_type[3] = CU_POINTER_ATTRIBUTE_CONTEXT;
attr_data[3] = &cuda_mem_ctx;
pentschev (Author)

@Akshay-Venkatesh cudaMallocAsync first appeared in CUDA 11.2.0, but CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE only appeared in 11.3.0, so I changed the check here; it should still be equivalent, based on offline conversations with the CUDA team (see the updated code comments just below). Also, CU_POINTER_ATTRIBUTE_CONTEXT has existed since at least CUDA 10.0, so I believe we don't need any extra preprocessor checks here. What do you think?
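
In other words, the check relies on stream-ordered allocations having no owning context. A standalone sketch of that equivalence (an assumption based on this thread, not code from the PR):

#include <cuda.h>

/* device memory whose owning context resolves to NULL is assumed to be
 * mempool/stream-ordered memory (cuMemAllocAsync/cudaMallocAsync) */
static int is_stream_ordered(CUdeviceptr ptr)
{
    CUcontext ctx = NULL;

    if (cuPointerGetAttribute(&ctx, CU_POINTER_ATTRIBUTE_CONTEXT, ptr) !=
        CUDA_SUCCESS) {
        return 0;
    }
    return ctx == NULL;
}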

#if CUDA_VERSION >= 11020
UCM_CUDA_FUNC_ENTRY(cuMemFreeAsync),
#endif
{{NULL}, NULL}};
Contributor

For completeness, can you add hooks below for the runtime APIs as well? It might be useful when bistro hooks fail.

pentschev (Author)

Done.

@@ -58,6 +58,21 @@
return ret; \
}

/* Create a body of CUDA memory release replacement function */
#define UCM_CUDA_FREE_ASYNC_FUNC(_name, _retval, _ptr_type, _stream_type, \
Contributor

Can we extend UCM_CUDA_FREE_FUNC to take var-args instead of _stream_type? See, for example, how UCM_CUDA_ALLOC_FUNC uses var-args.

pentschev (Author)

I've been trying to do that but have failed so far. I tried two different approaches, described below.

The first one was to just add _args_format and ..., but that would require existing functions to also pass those (empty) arguments, and it seems problematic to handle empty var-args; I'm not sure there are any preprocessor macros currently available to handle that case, and it would pollute the existing UCM_CUDA_FREE_FUNC calls with empty parameters.

The second was to pass everything as var-args, including the pointer. We could then extract the first argument (the pointer) with UCS_PP_TUPLE_0 and use it like the explicit parameter is used today. However, this would require passing __VA_ARGS__ to ucm_trace, which would need type-casting of the individual variables.

I'm not very experienced writing var-arg macros, so maybe you could suggest other approaches?
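
For reference, a toy example of the first-argument extraction the second approach describes (illustrative only; this is not UCX's UCS_PP_TUPLE_0):

#include <stdio.h>

#define FIRST_ARG(_first, ...) _first

#define TRACE_FREE(...)                                                 \
    do {                                                                \
        /* assume the released pointer is the first var-arg */          \
        printf("free ptr=%p\n", (void*)FIRST_ARG(__VA_ARGS__));         \
    } while (0)

int main(void)
{
    void *p = NULL;
    TRACE_FREE(p, /* stream */ NULL);
    return 0;
}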

{{NULL}, NULL}
};
static ucm_cuda_func_t ucm_cuda_driver_funcs[] =
{UCM_CUDA_FUNC_ENTRY(cuMemAlloc),
Contributor

Can we avoid changing the layout/indentation of the whole array and just add new items?

pentschev (Author)

Done; as per our offline conversation, this may be ignored.

@@ -20,10 +20,18 @@ CUresult ucm_cuMemAllocPitch(CUdeviceptr *dptr, size_t *pPitch,
CUresult ucm_cuMemAllocPitch_v2(CUdeviceptr *dptr, size_t *pPitch,
size_t WidthInBytes, size_t Height,
unsigned int ElementSizeBytes);
#if CUDA_VERSION >= 11020
Contributor

The #if is probably not needed here since it's just a function prototype; it's not calling any CUDA API.

pentschev (Author)

Done.

pentschev (Author)

Removing the #if fails to build for older CUDA because cudaMemPool_t/CUmemoryPool only appeared in CUDA 11.2. It seems we need those directives anyway, or is there some better way of handling that case?

pentschev (Author)

For the time being I reinstated the #ifs. I guess one alternative would be to define cudaMemPool_t/CUmemoryPool ourselves for older CUDA versions, but it feels like that would pollute the code more, so I'm not sure it's a great alternative.
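
For reference, the typedef alternative would look roughly like this; it assumes both handles are pointers to the same opaque pool struct, as the CUDA headers declare them:

#if CUDA_VERSION < 11020
/* fallback declarations for pre-11.2 toolkits (assumed to mirror the
 * opaque handle declarations in the CUDA headers) */
typedef struct CUmemPoolHandle_st *CUmemoryPool;
typedef struct CUmemPoolHandle_st *cudaMemPool_t;
#endif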

Contributor

Ok

CUresult ucm_cuMemFree(CUdeviceptr dptr);
CUresult ucm_cuMemFree_v2(CUdeviceptr dptr);
CUresult ucm_cuMemFreeHost(void *p);
CUresult ucm_cuMemFreeHost_v2(void *p);
#if CUDA_VERSION >= 11020
Contributor

Same here.

pentschev (Author)

Done.

-    if (is_managed) {
     /* cuMemGetAddressRange does not support managed memory so use provided
      * address and length as base address and alloc length respectively */
+    if (is_managed || !cuda_mem_ctx) {
Contributor

(cuda_mem_ctx == NULL)

pentschev (Author)

Done.

Comment on lines 758 to 776
if (mlx5_config->tm.seg_size > UCT_IB_MLX5_MP_RQ_BYTE_CNT_MASK) {
    ucs_error("TM segment size is too big %ld, it must not exceed %d",
              mlx5_config->tm.seg_size, UCT_IB_MLX5_MP_RQ_BYTE_CNT_MASK);
    return UCS_ERR_INVALID_PARAM;
Contributor

Merge artifact?

pentschev (Author)

Indeed, sorry for that; it should be gone now.

@pentschev force-pushed the cuda-malloc-async branch 2 times, most recently from a0375cc to bbe2c6d on October 17, 2022 09:09
Comment on lines 29 to 38
int supported_cuda_version;
CUresult cu_err;

cu_err = cuDriverGetVersion(&supported_cuda_version);
ASSERT_EQ(cu_err, CUDA_SUCCESS);

if (supported_cuda_version < 11020) {
    GTEST_SKIP_("cuMemAllocAsync not available");
}
pentschev (Author) commented Oct 17, 2022

Originally I tried to do this with cuGetProcAddress, but cuGetProcAddress was only introduced in CUDA 11.3. I guess we could skip tests for older versions as well, but that feels like introducing an artificial cap, so checking the latest CUDA version supported by the driver seems like the better alternative here.


@yosefe (Contributor)

yosefe commented Oct 17, 2022

@pentschev I've pushed another commit to this PR for #8623 (comment).

@pentschev (Author)

> @pentschev I've pushed another commit to this PR for #8623 (comment).

Awesome, thanks so much @yosefe! I got so close, but the magic I was missing was arg0. Would you mind explaining, or pointing me to, how arg0 gets expanded there?

Also addressed all other comments. Finally, it seems like the tests are all failing with the following error:

./test/gtest/gtest: symbol lookup error: /__w/1/s/ucx_src_52619/build-test/src/ucs/.libs/ucx/libucm_cuda.so.0: undefined symbol: ucm_cudaMallocAsync

I'm not really sure how that happens; it seems to occur for all CUDA versions, including those that do support cudaMallocAsync. Do you have any clue as to why the symbols are not being found in CI? Locally the tests pass for me with CUDA 11.5.

@yosefe (Contributor)

yosefe commented Oct 17, 2022

> Awesome, thanks so much @yosefe! I got so close, but the magic I was missing was arg0. Would you mind explaining, or pointing me to, how arg0 gets expanded there?

arg0 is the first argument in the var-arg list. It means that the released pointer (which we need to pass down to the memory hooks) is the first argument of the CUDA function. We have a similar thing with allocation functions, where we provide an expression to calculate the allocated buffer size based on the arguments.
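
A toy version of that naming scheme, greatly simplified compared to UCM_CUDA_FREE_FUNC (all names here are illustrative):

#include <stdio.h>

static int real_freeAsync(void *ptr, void *stream)
{
    (void)ptr; (void)stream;
    return 0;
}

/* the macro names its parameters arg0, arg1, ..., so the caller can state
 * which expression is the released pointer (here arg0, the first argument) */
#define DEFINE_FREE_HOOK(_name, _retval, _ptr_expr, _type0, _type1)     \
    static _retval hook_##_name(_type0 arg0, _type1 arg1)               \
    {                                                                   \
        printf(#_name ": releasing ptr=%p\n", (void*)(_ptr_expr));      \
        return real_##_name(arg0, arg1);                                \
    }

DEFINE_FREE_HOOK(freeAsync, int, arg0, void*, void*)

int main(void)
{
    return hook_freeAsync(NULL, NULL);
}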

> I'm not really sure how that happens; it seems to occur for all CUDA versions, including those that do support cudaMallocAsync. Do you have any clue as to why the symbols are not being found in CI? Locally the tests pass for me with CUDA 11.5.

There was a missing definition for cudaMallocAsync that I added in my commit. Hopefully the tests will pass now.
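
(For reference, the missing piece was a declaration of the runtime replacement function, in the same style as the driver prototypes shown earlier; the exact form in the commit may differ:)

#if CUDART_VERSION >= 11020
cudaError_t ucm_cudaMallocAsync(void **devPtr, size_t size,
                                cudaStream_t stream);
cudaError_t ucm_cudaFreeAsync(void *devPtr, cudaStream_t stream);
#endif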

yosefe previously approved these changes Oct 17, 2022
UCM_CUDA_FREE_FUNC(cuMemFree, UCS_MEMORY_TYPE_CUDA, CUresult, arg0,
                   "ptr=0x%llx", CUdeviceptr)
UCM_CUDA_FREE_FUNC(cuMemFree_v2, UCS_MEMORY_TYPE_CUDA, CUresult, arg0,
                   "ptr=0x%llx", CUdeviceptr)
Contributor

@yosefe I don't see cuMemFree or cuMemFree_v2 with UCS_MEMORY_TYPE_CUDA_MANAGED for cases that allocate with cuMemAllocManaged. We should be covering that case as well. Will this potentially result in attempting to remove, as UCS_MEMORY_TYPE_CUDA, memtype-cache ranges that were actually allocated with cuMemAllocManaged?

Contributor

@Akshay-Venkatesh Actually, it doesn't matter which memory type is used for the memory free event. We could remove memtype from free altogether (pass UCS_MEMORY_TYPE_UNKNOWN to the event), but since that "issue" existed before this PR, IMO it should not block this PR for v1.14.

                   cudaMemPool_t, cudaStream_t)
#endif
UCM_CUDA_FREE_FUNC(cudaFree, UCS_MEMORY_TYPE_CUDA, cudaError_t, arg0,
                   "devPtr=%p", void*)
Contributor

@yosefe Same question as for cuMemFree: we could be trying to free a CUDA managed memory region via the pinned CUDA free. I'm not sure whether removing range entries in the memtype_cache ignores the memtype.

Contributor

It does: ucs_memtype_cache_update_internal uses the "mem_type" parameter only if action == UCS_MEMTYPE_CACHE_ACTION_SET_MEMTYPE. When called from the memory free hook, the action is UCS_MEMTYPE_CACHE_ACTION_REMOVE.
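
A simplified, self-contained sketch of that behavior (not the actual UCX code):

typedef enum {
    ACTION_SET_MEMTYPE, /* stand-ins for the real UCS_MEMTYPE_CACHE_ACTION_* values */
    ACTION_REMOVE
} cache_action_t;

typedef struct {
    int mem_type; /* stand-in for ucs_memory_type_t */
    int valid;
} cache_region_t;

static void cache_update(cache_region_t *region, int mem_type,
                         cache_action_t action)
{
    if (action == ACTION_REMOVE) {
        region->valid = 0;           /* mem_type is ignored on removal */
        return;
    }
    region->mem_type = mem_type;     /* consulted only when setting */
    region->valid    = 1;
}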

@pentschev (Author)

This passes now. Thanks so much for all the help in getting everything fixed, @yosefe!

@yosefe (Contributor)

yosefe commented Oct 18, 2022

@pentschev can you please squash the commits into one?

@pentschev (Author)

@yosefe Done. Is the commit title right, or should it also include UCT/TEST?

@yosefe yosefe enabled auto-merge October 18, 2022 10:16
@yosefe yosefe merged commit a557b84 into openucx:master Oct 18, 2022
@pentschev (Author)

Thanks @yosefe for the help in getting things correct here!
