
UCM/CUDA: Add support for CUDA virtual memory with cudaMallocAsync #8623

Merged: 1 commit into openucx:master on Oct 18, 2022

Conversation

pentschev (Contributor)

What

Adding support for CUDA virtual/stream-ordered memory with cudaMallocAsync.

Why?

For proper CUDA virtual memory identification, add the required hooking for cudaMallocAsync. Although this currently adds support for cudaMallocAsync/cudaMallocFromPoolAsync, the stream-ordered memory allocator is part of CUDA's Virtual Memory Management, which is why its special case was named UCS_MEMORY_TYPE_CUDA_VIRTUAL for the time being.

This change was successfully tested with Dask+UCX-Py using bounce buffers (setting UCX_RNDV_FRAG_MEM_TYPE=cuda); it has behaved correctly so far and provides performance similar to CUDA managed memory under the same conditions.
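
For context, here is a minimal sketch (not part of the PR) of the stream-ordered allocation pattern being hooked, assuming CUDA >= 11.2:

#include <cuda_runtime.h>

int main(void)
{
    cudaStream_t stream;
    void *ptr;

    cudaStreamCreate(&stream);
    /* stream-ordered allocation: this is the call UCM intercepts */
    cudaMallocAsync(&ptr, 1 << 20, stream);
    /* ... enqueue work using ptr on the same stream ... */
    cudaFreeAsync(ptr, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}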

@pentschev (Author)

cc @Akshay-Venkatesh for awareness

@Akshay-Venkatesh (Contributor)

Akshay-Venkatesh commented Oct 12, 2022

@pentschev Looks like the cuda_copy transport gains the capability of detecting virtual memory, and UCP happens to handle virtual pinned memory the same way as managed memory. When mallocAsync supports managed memory, the changes introduced through this PR would also need to distinguish sub-types within UCS_MEMORY_TYPE_CUDA_VIRTUAL.

The general question is: what are the performance expectations for mallocAsync memory?

  1. If the expectation is that staging through pinned memory for both intra-node and inter-node paths provides enough performance, then we can rework without introducing UCS_MEMORY_TYPE_CUDA_VIRTUAL, instead adding a mempool attribute to CUDA memory (whether it be pinned mallocAsync or, in the future, managed mallocAsync); see the detection sketch after this list.
  2. If the expectation is that mallocAsync memory needs to use the IPC capabilities introduced for mempools, then we would need to introduce new paths in the cuda_ipc transport.
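
As a rough illustration of the mempool-attribute idea in point 1, a sketch of how a pointer's pool backing can be queried with the driver API (this is not code from the PR, and CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE requires CUDA >= 11.3):

#include <cuda.h>

/* returns non-zero if ptr was allocated from a memory pool,
 * i.e. via cuMemAllocAsync/cudaMallocAsync */
static int ptr_is_mempool_backed(CUdeviceptr ptr)
{
    CUmemoryPool pool = NULL;

    if (cuPointerGetAttribute(&pool, CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE,
                              ptr) != CUDA_SUCCESS) {
        return 0;
    }
    return pool != NULL;
}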

@pentschev (Author)

> @pentschev Looks like the cuda_copy transport gains the capability of detecting virtual memory, and UCP happens to handle virtual pinned memory the same way as managed memory. When mallocAsync supports managed memory, the changes introduced through this PR would also need to distinguish sub-types within UCS_MEMORY_TYPE_CUDA_VIRTUAL.

This is a good point. Is there a mechanism for distinguishing "sub-types", or would we have to introduce a new type, e.g., UCS_MEMORY_TYPE_CUDA_VIRTUAL_MANAGED?

> The general question is: what are the performance expectations for mallocAsync memory?
>
>   1. If the expectation is that staging through pinned memory for both intra-node and inter-node paths provides enough performance, then we can rework without introducing UCS_MEMORY_TYPE_CUDA_VIRTUAL, instead adding a mempool attribute to CUDA memory (whether it be pinned mallocAsync or, in the future, managed mallocAsync).
>   2. If the expectation is that mallocAsync memory needs to use the IPC capabilities introduced for mempools, then we would need to introduce new paths in the cuda_ipc transport.

This is a very difficult question to answer. In a perfect world we would, of course, want the same performance as for "default" CUDA memory (PA-backed memory allocated via cudaMalloc), but that is generally infeasible in practice. In the tests we have run so far, using the staging buffers provides performance very similar to PA-backed CUDA allocations; the penalty we observed was roughly 10%, which is definitely acceptable for now. However, those tests involved a single-node DGX-1 only, so I can't say whether this performance will scale.

In the future I would say it is worth having cuda_ipc support as well. As we push towards newer systems with much higher cuda_ipc bandwidth limits, like the H100, staging allocations may impose a higher overall penalty, particularly when transferring large amounts of data at once: the time spent staging buffers may eventually become too high and force us to dedicate more and more memory to the staging buffers to work around that. Please take this with a grain of salt; I do not have numbers at this time to back or falsify my theory, just trying to make an educated guess here.

@yosefe (Contributor)

yosefe commented Oct 13, 2022

@pentschev IMO adding another memory type is pretty intrusive. How is it different from CUDA pinned memory?

@pentschev (Author)

> @pentschev IMO adding another memory type is pretty intrusive. How is it different from CUDA pinned memory?

A completely different API set is required for handling both virtual and stream-ordered memory. Virtual and stream-ordered memory are not PA-backed like regular pinned memory, so supporting cuda_ipc or verbs will have different requirements than for pinned memory, though not cuda_copy or staging buffers, which this PR already supports.

To be fair, I do not know the internals of UCX at this level that well, so there's definitely a chance that adding a new memory type is indeed not required. My current reasoning is that virtual/stream-ordered memory has different requirements than pinned memory and will most likely need a different code path for cuda_ipc and verbs. Plus, the way testing is organized in UCX requires having a different memory type; for example, passing ucx_perftest -m cuda-virtual would not be possible without one, and it would definitely be useful to be able to test/profile virtual/stream-ordered memory individually. Finally, as Akshay pointed out above, at some point we will have "virtual managed" memory, integrating both concepts, but it currently isn't clear to me whether the requirements will differ.

I'm certainly open to not having another memory type if we can address the reasons above in a different way.

@Akshay-Venkatesh (Contributor)

> @pentschev IMO adding another memory type is pretty intrusive. How is it different from CUDA pinned memory?
>
> A completely different API set is required for handling both virtual and stream-ordered memory. Virtual and stream-ordered memory are not PA-backed like regular pinned memory, so supporting cuda_ipc or verbs will have different requirements than for pinned memory, though not cuda_copy or staging buffers, which this PR already supports.
>
> To be fair, I do not know the internals of UCX at this level that well, so there's definitely a chance that adding a new memory type is indeed not required. My current reasoning is that virtual/stream-ordered memory has different requirements than pinned memory and will most likely need a different code path for cuda_ipc and verbs. Plus, the way testing is organized in UCX requires having a different memory type; for example, passing ucx_perftest -m cuda-virtual would not be possible without one, and it would definitely be useful to be able to test/profile virtual/stream-ordered memory individually. Finally, as Akshay pointed out above, at some point we will have "virtual managed" memory, integrating both concepts, but it currently isn't clear to me whether the requirements will differ.
>
> I'm certainly open to not having another memory type if we can address the reasons above in a different way.

@pentschev Based on an offline discussion, @yosefe suggested treating virtual memory as managed memory, so that the same protocols used with managed memory are used with mallocAsync for the time being. This would give the desired performance you see by setting UCX_RNDV_FRAG_MEM_TYPE=cuda. So effectively, you'd set here

mem_info->type = UCS_MEMORY_TYPE_CUDA_MANAGED;

and drop all additions of CUDA_VIRTUAL.
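
A contextual sketch of what that suggestion amounts to in the detection path (simplified and hypothetical; is_stream_ordered stands in for whatever check the MD code actually performs):

if (is_managed || is_stream_ordered) {
    /* treat stream-ordered (cudaMallocAsync) memory as managed for now,
     * so that the same protocols apply */
    mem_info->type = UCS_MEMORY_TYPE_CUDA_MANAGED;
} else {
    mem_info->type = UCS_MEMORY_TYPE_CUDA;
}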

@pentschev (Author)

> @pentschev Based on an offline discussion, @yosefe suggested treating virtual memory as managed memory, so that the same protocols used with managed memory are used with mallocAsync for the time being. This would give the desired performance you see by setting UCX_RNDV_FRAG_MEM_TYPE=cuda. So effectively, you'd set here
>
> mem_info->type = UCS_MEMORY_TYPE_CUDA_MANAGED;
>
> and drop all additions of CUDA_VIRTUAL.

That seems fine for now, but how do we add support for cudaMallocAsync in ucx_perftest, e.g., here? Can we still do that?

@yosefe (Contributor)

yosefe commented Oct 13, 2022

> That seems fine for now, but how do we add support for cudaMallocAsync in ucx_perftest, e.g., here? Can we still do that?

  1. It could be in another PR, to expedite merging this PR for v1.14.x.
  2. We can consider adding an "allocation method" parameter to perftest, in addition to memory type. Even pinned memory can be allocated by more than one API, not to mention host memory...

@pentschev (Author)

I now pushed changes removing the new memory type.

Also replying to @yosefe's points:

> It could be in another PR, to expedite merging this PR for v1.14.x.

Sounds reasonable to me. The only regression from the original PR is that we now can't test cudaMallocAsync in:

#if HAVE_CUDA
    case UCS_MEMORY_TYPE_CUDA:
        CUDA_CALL(cudaMalloc(&ptr, size), ": size=" << size);
        return ptr;
    case UCS_MEMORY_TYPE_CUDA_MANAGED:
        CUDA_CALL(cudaMallocManaged(&ptr, size), ": size=" << size);
        return ptr;
#endif

Are there any other existing ways we could test that there, besides having a separate memory type, or would it have to be done by adding a new "allocation method" as you're suggesting for perftest?

> We can consider adding an "allocation method" parameter to perftest, in addition to memory type. Even pinned memory can be allocated by more than one API, not to mention host memory...

That would be great; it's very useful to be able to test and verify performance of the stream-ordered allocator, as it is very different under the hood from pinned/managed memory.
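
For illustration, a hypothetical shape such a case could take in the switch quoted above; the memory-type name is invented and the default-stream usage is an assumption, not existing UCX code (requires CUDA >= 11.2):

#if HAVE_CUDA && (CUDART_VERSION >= 11020)
    case UCS_MEMORY_TYPE_CUDA_ASYNC: /* hypothetical name */
        CUDA_CALL(cudaMallocAsync(&ptr, size, 0), ": size=" << size);
        /* the allocation is ordered on the default stream; synchronize so
         * the pointer is immediately safe to use */
        CUDA_CALL(cudaStreamSynchronize(0), ": size=" << size);
        return ptr;
#endif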

Comment on lines +131 to +136
attr_type[3] = CU_POINTER_ATTRIBUTE_CONTEXT;
attr_data[3] = &cuda_mem_ctx;
pentschev (Author)

@Akshay-Venkatesh cudaMallocAsync first appeared in CUDA 11.2.0, but CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE only appeared in 11.3.0, so I changed the check here; it should still be equivalent, based on offline conversations with the CUDA team (see the updated code comments just below). Also, CU_POINTER_ATTRIBUTE_CONTEXT has existed since at least CUDA 10.0, so I believe we don't need any extra preprocessor checks here. What do you think?
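
In other words, the check relies on stream-ordered allocations having no owning context. A standalone sketch of that equivalence (an assumption based on this thread, not code from the PR):

#include <cuda.h>

/* device memory whose owning context resolves to NULL is assumed to be
 * mempool/stream-ordered memory (cuMemAllocAsync/cudaMallocAsync) */
static int is_stream_ordered(CUdeviceptr ptr)
{
    CUcontext ctx = NULL;

    if (cuPointerGetAttribute(&ctx, CU_POINTER_ATTRIBUTE_CONTEXT, ptr) !=
        CUDA_SUCCESS) {
        return 0;
    }
    return ctx == NULL;
}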

#if CUDA_VERSION >= 11020
UCM_CUDA_FUNC_ENTRY(cuMemFreeAsync),
#endif
{{NULL}, NULL}};
Contributor

For completeness, can you add hooks below for the runtime APIs as well? It might be useful when bistro hooks fail.

pentschev (Author)

Done.

@@ -58,6 +58,21 @@
return ret; \
}

/* Create a body of CUDA memory release replacement function */
#define UCM_CUDA_FREE_ASYNC_FUNC(_name, _retval, _ptr_type, _stream_type, \
Contributor

Can we extend UCM_CUDA_FREE_FUNC to take var-args instead of _stream_type? See, for example, how UCM_CUDA_ALLOC_FUNC uses var-args.

pentschev (Author)

I've been trying to do that but have failed so far. I tried two different approaches, described below.

The first one was to just add _args_format and ..., but that would require existing functions to also pass those (empty) arguments, and it seems problematic to handle empty var-args; I'm not sure there are any preprocessor macros currently available to handle that case, and it would pollute the existing UCM_CUDA_FREE_FUNC calls with empty parameters.

The second was to pass everything as var-args, including the pointer. We could then extract the first argument (the pointer) with UCS_PP_TUPLE_0 and use it like the explicit parameter is used today. However, this would require passing __VA_ARGS__ to ucm_trace, which would need type-casting of the individual variables.

I'm not very experienced writing var-arg macros, so maybe you could suggest other approaches?
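
For reference, a toy example of the first-argument extraction the second approach describes (illustrative only; this is not UCX's UCS_PP_TUPLE_0):

#include <stdio.h>

#define FIRST_ARG(_first, ...) _first

#define TRACE_FREE(...)                                                 \
    do {                                                                \
        /* assume the released pointer is the first var-arg */          \
        printf("free ptr=%p\n", (void*)FIRST_ARG(__VA_ARGS__));         \
    } while (0)

int main(void)
{
    void *p = NULL;
    TRACE_FREE(p, /* stream */ NULL);
    return 0;
}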

{{NULL}, NULL}
};
static ucm_cuda_func_t ucm_cuda_driver_funcs[] =
{UCM_CUDA_FUNC_ENTRY(cuMemAlloc),
Contributor

Can we avoid changing the layout/indentation of the whole array and just add new items?

pentschev (Author)

Done; as per our offline conversation, this may be ignored.

@@ -20,10 +20,18 @@ CUresult ucm_cuMemAllocPitch(CUdeviceptr *dptr, size_t *pPitch,
CUresult ucm_cuMemAllocPitch_v2(CUdeviceptr *dptr, size_t *pPitch,
size_t WidthInBytes, size_t Height,
unsigned int ElementSizeBytes);
#if CUDA_VERSION >= 11020
Contributor

The #if is probably not needed here since it's just a function prototype; it's not calling any CUDA API.

pentschev (Author)

Done.

pentschev (Author)

Removing the #if fails to build for older CUDA because cudaMemPool_t/CUmemoryPool only appeared in CUDA 11.2. It seems we need those directives anyway, or is there some better way of handling that case?

pentschev (Author)

For the time being I reinstated the #ifs. I guess one alternative would be to define cudaMemPool_t/CUmemoryPool ourselves for older CUDA versions, but it feels like that would pollute the code more, so I'm not sure it's a great alternative.
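
For reference, the typedef alternative would look roughly like this; it assumes both handles are pointers to the same opaque pool struct, as the CUDA headers declare them:

#if CUDA_VERSION < 11020
/* fallback declarations for pre-11.2 toolkits (assumed to mirror the
 * opaque handle declarations in the CUDA headers) */
typedef struct CUmemPoolHandle_st *CUmemoryPool;
typedef struct CUmemPoolHandle_st *cudaMemPool_t;
#endif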

Contributor

Ok

CUresult ucm_cuMemFree(CUdeviceptr dptr);
CUresult ucm_cuMemFree_v2(CUdeviceptr dptr);
CUresult ucm_cuMemFreeHost(void *p);
CUresult ucm_cuMemFreeHost_v2(void *p);
#if CUDA_VERSION >= 11020
Contributor

Same here.

pentschev (Author)

Done.

-    if (is_managed) {
     /* cuMemGetAddressRange does not support managed memory so use provided
      * address and length as base address and alloc length respectively */
+    if (is_managed || !cuda_mem_ctx) {
Contributor

(cuda_mem_ctx == NULL)

pentschev (Author)

Done.

Comment on lines 758 to 776
if (mlx5_config->tm.seg_size > UCT_IB_MLX5_MP_RQ_BYTE_CNT_MASK) {
    ucs_error("TM segment size is too big %ld, it must not exceed %d",
              mlx5_config->tm.seg_size, UCT_IB_MLX5_MP_RQ_BYTE_CNT_MASK);
    return UCS_ERR_INVALID_PARAM;
Contributor

Merge artifact?

pentschev (Author)

Indeed, sorry for that; it should be gone now.

@pentschev force-pushed the cuda-malloc-async branch 2 times, most recently from a0375cc to bbe2c6d on October 17, 2022 09:09
Comment on lines 29 to 38
int supported_cuda_version;
CUresult cu_err;

cu_err = cuDriverGetVersion(&supported_cuda_version);
ASSERT_EQ(cu_err, CUDA_SUCCESS);

if (supported_cuda_version < 11020) {
    GTEST_SKIP_("cuMemAllocAsync not available");
}
pentschev (Author) commented Oct 17, 2022

Originally I tried to do this with cuGetProcAddress, but cuGetProcAddress was only introduced in CUDA 11.3. I guess we could skip tests for older versions as well, but that feels like introducing an artificial cap, so checking the latest CUDA version supported by the driver seems like the better alternative here.


@yosefe (Contributor)

yosefe commented Oct 17, 2022

@pentschev I've pushed another commit to this PR for #8623 (comment).

@pentschev (Author)

> @pentschev I've pushed another commit to this PR for #8623 (comment).

Awesome, thanks so much @yosefe! I got so close, but the magic I was missing was arg0. Would you mind explaining, or pointing me to, how arg0 gets expanded there?

Also addressed all other comments. Finally, it seems like the tests are all failing with the following error:

./test/gtest/gtest: symbol lookup error: /__w/1/s/ucx_src_52619/build-test/src/ucs/.libs/ucx/libucm_cuda.so.0: undefined symbol: ucm_cudaMallocAsync

I'm not really sure how that happens; it seems to occur for all CUDA versions, including those that do support cudaMallocAsync. Do you have any clue as to why the symbols are not being found in CI? Locally the tests pass for me with CUDA 11.5.

@yosefe (Contributor)

yosefe commented Oct 17, 2022

> Awesome, thanks so much @yosefe! I got so close, but the magic I was missing was arg0. Would you mind explaining, or pointing me to, how arg0 gets expanded there?

arg0 is the first argument in the var-arg list. It means that the released pointer (which we need to pass down to the memory hooks) is the first argument of the CUDA function. We have a similar thing with allocation functions, where we provide an expression to calculate the allocated buffer size based on the arguments.
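
A toy version of that naming scheme, greatly simplified compared to UCM_CUDA_FREE_FUNC (all names here are illustrative):

#include <stdio.h>

static int real_freeAsync(void *ptr, void *stream)
{
    (void)ptr; (void)stream;
    return 0;
}

/* the macro names its parameters arg0, arg1, ..., so the caller can state
 * which expression is the released pointer (here arg0, the first argument) */
#define DEFINE_FREE_HOOK(_name, _retval, _ptr_expr, _type0, _type1)     \
    static _retval hook_##_name(_type0 arg0, _type1 arg1)               \
    {                                                                   \
        printf(#_name ": releasing ptr=%p\n", (void*)(_ptr_expr));      \
        return real_##_name(arg0, arg1);                                \
    }

DEFINE_FREE_HOOK(freeAsync, int, arg0, void*, void*)

int main(void)
{
    return hook_freeAsync(NULL, NULL);
}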

> I'm not really sure how that happens; it seems to occur for all CUDA versions, including those that do support cudaMallocAsync. Do you have any clue as to why the symbols are not being found in CI? Locally the tests pass for me with CUDA 11.5.

There was a missing definition for cudaMallocAsync that I added in my commit. Hopefully the tests will pass now.
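
(For reference, the missing piece was a declaration of the runtime replacement function, in the same style as the driver prototypes shown earlier; the exact form in the commit may differ:)

#if CUDART_VERSION >= 11020
cudaError_t ucm_cudaMallocAsync(void **devPtr, size_t size,
                                cudaStream_t stream);
cudaError_t ucm_cudaFreeAsync(void *devPtr, cudaStream_t stream);
#endif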

yosefe previously approved these changes Oct 17, 2022
UCM_CUDA_FREE_FUNC(cuMemFree, UCS_MEMORY_TYPE_CUDA, CUresult, arg0,
                   "ptr=0x%llx", CUdeviceptr)
UCM_CUDA_FREE_FUNC(cuMemFree_v2, UCS_MEMORY_TYPE_CUDA, CUresult, arg0,
                   "ptr=0x%llx", CUdeviceptr)
Contributor

@yosefe I don't see cuMemFree or cuMemFree_v2 with UCS_MEMORY_TYPE_CUDA_MANAGED for cases that allocate with cuMemAllocManaged. We should be covering that case as well. Will this potentially result in attempting to remove, as UCS_MEMORY_TYPE_CUDA, memtype-cache ranges that were actually allocated with cuMemAllocManaged?

Contributor

@Akshay-Venkatesh Actually, it doesn't matter which memory type is used for the memory free event. We could remove memtype from free altogether (pass UCS_MEMORY_TYPE_UNKNOWN to the event), but since that "issue" existed before this PR, IMO it should not block this PR for v1.14.

                   cudaMemPool_t, cudaStream_t)
#endif
UCM_CUDA_FREE_FUNC(cudaFree, UCS_MEMORY_TYPE_CUDA, cudaError_t, arg0,
                   "devPtr=%p", void*)
Contributor

@yosefe Same question as for cuMemFree: we could be trying to free a CUDA managed memory region via the pinned CUDA free. I'm not sure whether removing range entries in the memtype_cache ignores the memtype.

Contributor

It does: ucs_memtype_cache_update_internal uses the "mem_type" parameter only if action == UCS_MEMTYPE_CACHE_ACTION_SET_MEMTYPE. When called from the memory free hook, the action is UCS_MEMTYPE_CACHE_ACTION_REMOVE.
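
A simplified, self-contained sketch of that behavior (not the actual UCX code):

typedef enum {
    ACTION_SET_MEMTYPE, /* stand-ins for the real UCS_MEMTYPE_CACHE_ACTION_* values */
    ACTION_REMOVE
} cache_action_t;

typedef struct {
    int mem_type; /* stand-in for ucs_memory_type_t */
    int valid;
} cache_region_t;

static void cache_update(cache_region_t *region, int mem_type,
                         cache_action_t action)
{
    if (action == ACTION_REMOVE) {
        region->valid = 0;           /* mem_type is ignored on removal */
        return;
    }
    region->mem_type = mem_type;     /* consulted only when setting */
    region->valid    = 1;
}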

@pentschev (Author)

This passes now. Thanks so much for all the help in getting everything fixed, @yosefe!

@yosefe (Contributor)

yosefe commented Oct 18, 2022

@pentschev can you please squash the commits into one?

@pentschev (Author)

@yosefe Done. Is the commit title right, or should it also include UCT/TEST?

@yosefe yosefe enabled auto-merge October 18, 2022 10:16
@yosefe yosefe merged commit a557b84 into openucx:master Oct 18, 2022
@pentschev (Author)

Thanks @yosefe for the help in getting things correct here!
