[FEA] Multiple buffer copy kernel #7076
Comments
This sounds like something more general than libcudf; maybe it should live in RMM or somewhere else that's more general?
I filed it here since I believe libcudf already has similar batch-copy code (in cuio and contiguous_split, IIRC). It might be easy to refactor that into something externally callable. However, I don't really care where it lives as long as we can expose a Java interface to it. RMM is probably a more appropriate place if this kernel would be useful in other RAPIDS libs.
I would put it in libcudf unless and until it is needed elsewhere. Unnecessary baggage for RMM if it is not.
Only other thought would be in RAFT if it would have any use for cuml / cugraph / etc.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
I would still love to see this functionality.
This issue has been labeled
Still would love to see this, as the use cases are still valid.
@jlowe My assumption here is that both source and destination addresses might be arbitrarily aligned. Is that correct?
In the use cases I can think of so far, the source addresses would be aligned but the destinations would not necessarily be, e.g. post batch compression where we need to gather
Seems like the safe thing to do would be to plan for the worst. Shouldn't be too bad.
The core piece of functionality needed here is a function like:
It uses the group … Work on this functionality is already in progress internally, so this feature should wait until that is done.
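For concreteness, here is a minimal sketch of what such a batched-copy entry point could look like; the name, parameter layout, and use of rmm::cuda_stream_view below are illustrative assumptions, not the actual in-progress design referenced in the comment:

```cpp
#include <cstddef>

#include <rmm/cuda_stream_view.hpp>

// Illustrative sketch only (not the in-progress design mentioned above):
// copies sizes[i] bytes from srcs[i] to dsts[i] for each i in
// [0, num_buffers) using a single kernel launch on `stream`, instead of
// num_buffers separate cudaMemcpyAsync calls. Source and destination
// addresses may be arbitrarily aligned.
void batched_memcpy(void const* const* srcs,
                    void* const* dsts,
                    std::size_t const* sizes,
                    std::size_t num_buffers,
                    rmm::cuda_stream_view stream);
```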
The thing about this, though, is that a single-buffer memcpy ends up scaling badly when called many times. It's the same thing as with
The function I described is a
Depends on NVIDIA/cccl#944
Closing this as this feature does not belong in libcudf. Instead, working on it as a CUB algorithm here: NVIDIA/cub#297
FYI, PR NVIDIA/cub#359 landed. Looks like this will be part of CUB 2.1.0, so it could be used if this is still of interest.
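For anyone picking this up, a minimal usage sketch of that CUB algorithm, assuming CUB 2.1.0+ with the pointer and size arrays already resident on the device (the wrapper name below is made up and error checking is omitted):

```cpp
#include <cstddef>
#include <cstdint>

#include <cub/device/device_memcpy.cuh>
#include <cuda_runtime.h>

// Sketch: copies d_sizes[i] bytes from d_srcs[i] to d_dsts[i] for every
// buffer with a single CUB algorithm invocation on `stream`.
void batched_copy(void** d_srcs,
                  void** d_dsts,
                  std::size_t* d_sizes,
                  std::uint32_t num_buffers,
                  cudaStream_t stream)
{
  // First call with a null temp-storage pointer only queries the size needed.
  void* d_temp_storage           = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes,
                             d_srcs, d_dsts, d_sizes, num_buffers, stream);

  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Second call performs all of the copies.
  cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes,
                             d_srcs, d_dsts, d_sizes, num_buffers, stream);

  cudaFree(d_temp_storage);
}
```

The two-phase call is the standard CUB device-algorithm pattern; in a libcudf/Java integration the temporary storage would presumably come from RMM rather than raw cudaMalloc.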
Is your feature request related to a problem? Please describe.
During Spark shuffle there are cases where we need to copy multiple buffers simultaneously. For example, after partitioning a task's data into 200 parts we use nvcomp's LZ4 to compress the 200 buffers in a batch operation, producing 200 output buffers that are typically oversized (as we have to estimate the output size when allocating the buffer before compression occurs). To release the unused memory we reallocate them, copying the 200 buffers to "right-sized" allocations, and this is currently performed with 200 separate cudaMemcpyAsync calls. It's much more efficient to invoke a kernel that performs the 200 copies in parallel.
Similarly, during UCX shuffle send we need to copy partitions into the registered memory buffers (i.e., bounce buffers), and we often pack the transfer with multiple partitions, leading to another situation where we need to copy N buffers simultaneously. On the receiving end, we need to copy the data out of the receive bounce buffer into separate allocations, which is yet another N-buffer copy situation.
Describe the solution you'd like
libcudf could provide a multi-buffer copy API that takes the following inputs:
- the source buffer addresses
- the corresponding destination buffer addresses
- the size of each buffer
- the rmm::cuda_stream_view to use for the copy kernel

The libcudf API would copy the source buffers to the corresponding destination addresses using a single CUDA kernel rather than invoking a separate cudaMemcpyAsync operation for each one.
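For illustration only, a naive sketch of the kind of kernel this implies (one thread block per buffer, byte-granularity copies so arbitrarily aligned addresses are tolerated); a production version would need wider aligned accesses and better load balancing across differently sized buffers, which is part of what a dedicated algorithm like the CUB one mentioned above is designed to handle:

```cuda
#include <cstddef>
#include <cstdint>

// Naive illustration: block b copies buffer b, with its threads striding over
// the bytes. Byte-wise accesses tolerate arbitrary alignment at the cost of
// bandwidth, and one very large buffer would leave most blocks idle.
__global__ void multi_buffer_copy(std::uint8_t const* const* srcs,
                                  std::uint8_t* const* dsts,
                                  std::size_t const* sizes,
                                  std::size_t num_buffers)
{
  std::size_t const b = blockIdx.x;
  if (b >= num_buffers) { return; }

  std::uint8_t const* src = srcs[b];
  std::uint8_t* dst       = dsts[b];
  std::size_t const n     = sizes[b];

  for (std::size_t i = threadIdx.x; i < n; i += blockDim.x) {
    dst[i] = src[i];
  }
}

// Launch sketch: one block per buffer on the caller-provided stream, e.g.
//   multi_buffer_copy<<<num_buffers, 256, 0, stream.value()>>>(
//       d_srcs, d_dsts, d_sizes, num_buffers);
```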