Point-to-point communications in NCCL? #212
We've had plans to implement point-to-point communication for some time now, in the form of two new primitives: ncclSend and ncclRecv. By combining ncclSend and ncclRecv with ncclGroupStart and ncclGroupEnd, users could then build any Alltoall, Scatter, Gather, or neighbor collective. So it would look like point-to-point, but with the idea of implementing a collective alltoallv operation within the NCCL communicator -- which would follow the same rules as the current collective operations, i.e. operations are serialized on the communicator. The main difference with MPI is that this would still be a blocking call on the GPU side; there is no Isend/Irecv. We are indeed interested in hearing about use cases, to determine precisely whether users need blocking send/receive operations (alltoallv) or send/receive operations that are asynchronous with respect to collective operations (which NCCL cannot provide due to CUDA kernel semantics).
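As a rough illustration of how these primitives could compose into an all-to-all, here is a minimal sketch. It assumes the ncclSend/ncclRecv prototypes that eventually shipped in NCCL 2.7; at the time of this discussion the API was still a proposal, so treat the exact signatures and the helper name as assumptions.

```c
// Sketch: an all-to-all composed from grouped ncclSend/ncclRecv calls.
// Assumes NCCL 2.7-style ncclSend/ncclRecv signatures (not final at the
// time of this thread). allToAll is an illustrative helper, not a NCCL API.
#include <nccl.h>
#include <cuda_runtime.h>

ncclResult_t allToAll(const void* sendbuff, void* recvbuff, size_t countPerRank,
                      ncclDataType_t type, size_t typeSize,
                      ncclComm_t comm, cudaStream_t stream) {
  int nRanks;
  ncclCommCount(comm, &nRanks);
  ncclGroupStart();  // fuse all sends/recvs so they progress as one operation
  for (int peer = 0; peer < nRanks; peer++) {
    ncclSend((const char*)sendbuff + peer * countPerRank * typeSize,
             countPerRank, type, peer, comm, stream);
    ncclRecv((char*)recvbuff + peer * countPerRank * typeSize,
             countPerRank, type, peer, comm, stream);
  }
  return ncclGroupEnd();  // executes as a single blocking operation on the stream
}
```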
Thank you for the feedback! And apologies for the delayed reply. The current plan sounds feasible for our use case, since we do not need non-blocking operations at the moment.
Hi, I'm not sure whether the work has already started, since this thread was opened more than a month ago. But we are also looking for ways to do direct point-to-point communication using the NCCL library, to support one of our distributed training algorithms. The algorithm is a pair-wise binary tree reduction. We have already implemented it using blocking MPI send/recv, and we believe NCCL would deliver an even better performance boost. Blocking ncclSend and ncclRecv would be sufficient for our use case, so I'm really looking forward to hearing about the roadmap for this feature. Please let me know if this has been planned. Thanks in advance!
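For reference, a minimal sketch of such a pair-wise binary tree reduction with blocking MPI send/recv might look like the following; the function name and buffer handling are illustrative, not taken from the actual implementation.

```c
// Sketch: binary tree reduction with blocking MPI send/recv. After the loop,
// rank 0 holds the reduced result. Illustrative only, assuming float data.
#include <mpi.h>

void tree_reduce(float* data, float* tmp, int count, int rank, int nranks) {
  for (int step = 1; step < nranks; step <<= 1) {
    if (rank % (2 * step) == 0) {
      int peer = rank + step;
      if (peer < nranks) {
        MPI_Recv(tmp, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < count; i++) data[i] += tmp[i];  // local combine
      }
    } else {
      MPI_Send(data, count, MPI_FLOAT, rank - step, 0, MPI_COMM_WORLD);
      break;  // this rank has handed off its partial result
    }
  }
}
```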
Hi @sjeaugey, could you provide any insight into the plan to support blocking NCCL send and recv? Looking forward to collaborating with the NCCL devs. Thanks!
Hi @Tixxx. I don't see Send/Recv coming in the near future, as we are still focusing on allreduce and its variants, and this is a large feature which needs a significant amount of work, with a lot of preparation and refactoring to be done beforehand. For example (and among other things), we are trying to rewrite the topology detection and ring/tree creation to make it less ring-focused and more general, which is one of the steps needed before we can start on point-to-point.
Hi @sjeaugey - the gap in point-to-point communication in a high-level library like NCCL impacts applications I work on that don't need collectives, just integrated, efficient data transfer (glorified memcpy) across GPUs, from the inter-thread to the inter-node case (with GPUDirect support). Messaging semantics would be nice at times as well, but the need there is lesser than for an RMA-like operation. Is there any sort of timeline, or actively worked items, in support of the point-to-point communication pattern? Your messages here and in #270 indicate significant redesigns first, which makes me think it's a good 1+ years out - and that doesn't work for me. Is there anything that can be done to make it work in the next few months? The worst part is that I really want to give NCCL a try, but none of the existing operations seem like a workable fit for my problem - simulation halo exchanges, a fairly common application.
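For context, a halo exchange on a 1-D decomposition could map onto the proposed ncclSend/ncclRecv roughly as sketched below; the signatures, the grouped-call pattern, and the buffer layout are assumptions based on the API discussed earlier in this thread.

```c
// Sketch: 1-D halo exchange with the proposed ncclSend/ncclRecv, grouped so
// both directions progress concurrently. Hypothetical helper, illustrative only.
#include <nccl.h>
#include <cuda_runtime.h>

void haloExchange(const float* leftSend, float* leftRecv,
                  const float* rightSend, float* rightRecv,
                  size_t haloCount, int leftPeer, int rightPeer,
                  ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  if (leftPeer >= 0) {  // exchange boundary slab with the left neighbor
    ncclSend(leftSend, haloCount, ncclFloat, leftPeer, comm, stream);
    ncclRecv(leftRecv, haloCount, ncclFloat, leftPeer, comm, stream);
  }
  if (rightPeer >= 0) {  // exchange boundary slab with the right neighbor
    ncclSend(rightSend, haloCount, ncclFloat, rightPeer, comm, stream);
    ncclRecv(rightRecv, haloCount, ncclFloat, rightPeer, comm, stream);
  }
  ncclGroupEnd();  // blocking with respect to other NCCL operations on the communicator
}
```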
Hi @nevion, hopefully this will arrive sooner. We are actively working on it now; the goal is to post a preview branch late next month, so that users can give it a try and provide feedback. Would that work for you?
@sjeaugey yes, that does indeed work for me. |
Looking forward to applying the P2P functions to make my project more powerful!
Any progress on this issue? Gather/scatter/alltoall are important to recommendation models (e.g. https://github.com/facebookresearch/dlrm/blob/master/dlrm_s_pytorch.py#L426), which still cannot utilize the GPU very well today.
The p2p preview has been posted to the "p2p" branch, and PR #316 has been created for discussion/feedback.
Thanks, that helps a lot. |
Great job! |
When I attended GTC last month, I went to the session by Mr. J. Kraus on multi-GPU programming and heard from him that there were plans for point-to-point communication support in NCCL, and that perhaps bumping the development team with an issue would help get their attention.
While I did feel the idea was somewhat controversial (since this is a collective communication library), it would be great if point-to-point communication were indeed supported. However, I also feel this is a niche request, so I wouldn't expect it to roll out anytime soon.
Has there been a discussion on supporting point-to-point communication? And if so, is there a roadmap toward supporting it? Any response would be very helpful.
Thanks in advance.