Improvement in the documentation of the plugin interface. #309

Open

capcah opened this issue Mar 23, 2020 · 6 comments

capcah commented Mar 23, 2020

Hello all,

I'm implementing a plugin using the nccl_net.h API and I found the documentation of the functions pretty hard to understand. If you folks don't mind, I'll start sending PRs to improve the documentation there, based on my current understanding of the interface.

Additionally, there are some things that are not clear to me even after reading the code. For example: in test(), can the plugin disregard size if the request is not complete, or should it contain the size of partially downloaded data?

Let me know if you want me to work on this.

sjeaugey (Member) commented Mar 23, 2020

The size only needs to be set when the request is complete.
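
A minimal sketch of what that looks like in a plugin's test(), assuming a hypothetical myRequest type (only the function signature comes from nccl_net.h):

```c
/* myRequest is a hypothetical request type; only the signature of myTest()
 * matches nccl_net.h: ncclResult_t (*test)(void* request, int* done, int* size). */
typedef struct {
  int completed;      /* set by the transport once the operation finishes */
  int bytesReceived;  /* total size of the completed transfer */
} myRequest;

int myTest(void* request, int* done, int* size) {
  myRequest* req = (myRequest*)request;
  *done = req->completed;
  /* size only has to be valid once the request is complete; it can be
   * left untouched while the request is still in flight. */
  if (req->completed && size != NULL) *size = req->bytesReceived;
  return 0; /* ncclSuccess */
}
```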

Indeed the documentation is not as good as it could be; it's pretty much only the comments in nccl_net.h, the rest being in the implementation. The discovery / topology detection functions are used by src/graph/topo.cc, while the connection/communication part only happens in src/transport/net.cc (communication is all inside send/recv proxy functions).

We never put a lot of effort into documenting this, since it was not intended to be used by end users but rather by vendors/partners, whom we usually work more closely with. Better documentation never hurts though, so clarifying things is welcome.

Also please note the API evolves a bit from version to version. Make sure you check the v3 structure here for NCCL 2.6: https://github.com/NVIDIA/nccl/blob/v2.6/src/include/nccl_net.h.

We're interested in knowing who is developing and maintaining plugins so that we can give a heads-up and ping you when we make the API evolve (like in 2.6).

capcah (Author) commented Mar 30, 2020

Hey Sylvain,

Thanks for the clarification! I'll reach out over email with some contact points on our side.

Regarding the API, there are a couple more things that are not clear:

  1. How exactly does the handle argument in connect() and listen() work? As I understand it, it is initialized during listen() and accessed during connect(), but that seems odd: how is the code calling connect() aware of what the handle is supposed to be?
  2. For RMA memory registration, AFAIK, NCCL calls regMr() with a block of memory to be registered for remote access. Is there any way of giving NCCL a pointer to a memory region to use for its buffers instead, or some way of controlling those addresses from the plugin?

Thanks again!

sjeaugey (Member) commented Mar 30, 2020

  1. Between listen() and connect(), NCCL exchanges the handle through a side channel (sockets); see the sketch after this list.
  2. Not currently, but not a bad idea. We already have some special code for Infiniband (to support fork()) that makes sure the memory we allocate is page-aligned. That said, that would only be used for CPU memory allocation; if we want to use GPU Direct RDMA, the buffer would be on the GPU, hence allocated through cudaMalloc. Also, any CPU memory we allocate would need to be registered on the GPU (the GPU needs to access it, writing data directly into it), so you would need to make sure cudaHostRegister works on that memory allocation. In any case, we could imagine adding a memAlloc() function to the net plugin. @rashikakheria would that help the libfabrics plugin?
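
To make the handle flow concrete, a rough sketch under the v3 API, with myListen/myConnect and the address struct being made-up illustration names (NCCL_NET_HANDLE_MAXSIZE is real and comes from nccl_net.h):

```c
#define NCCL_NET_HANDLE_MAXSIZE 64 /* from nccl_net.h */

/* Hypothetical address blob; it must fit inside the opaque handle. */
typedef struct { char data[NCCL_NET_HANDLE_MAXSIZE]; } myAddress;

/* listen() writes whatever connect() will need into the opaque handle;
 * NCCL then ships those bytes to the peer over its bootstrap sockets. */
int myListen(int dev, void* opaqueHandle, void** listenComm) {
  myAddress* addr = (myAddress*)opaqueHandle;
  /* ... bind a listening endpoint on device dev, store its address in *addr,
   * and return the listening state through *listenComm ... */
  (void)dev; (void)addr; (void)listenComm;
  return 0; /* ncclSuccess */
}

/* On the remote side, connect() receives the same bytes and dials back. */
int myConnect(int dev, void* opaqueHandle, void** sendComm) {
  myAddress* addr = (myAddress*)opaqueHandle;
  /* ... connect to the address the listener advertised ... */
  (void)dev; (void)addr; (void)sendComm;
  return 0; /* ncclSuccess */
}
```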

capcah (Author) commented Mar 30, 2020

  1. Thanks! That makes a lot more sense now. Could you point me towards the code that does the OOB transport?

  2. That seems reasonable: since the plugin is already able to tell whether or not it can handle GPU pointers, it would make sense to have it allocate the memory it needs and deal with any NIC-specific quirks. If this sounds good to you and Rashika, I'd be glad to give it a shot.

sjeaugey (Member) commented Mar 30, 2020

https://github.com/NVIDIA/nccl/blob/master/src/init.cc#L402
This is the function that connects peers, basically calling:
selectTransport .. transport->{send,recv}Setup .. net->listen
bootstrapSend/Recv
transport->{send,recv}Connect .. net->connect

The OOB communication happens inside bootstrapSend/bootstrapRecv, in bootstrap.cc.
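
As a commented sequence (my reading of the chain above, not actual NCCL code; the handle buffer is the same opaque blob as in the earlier sketch):

```c
/* Sequence sketch only; the real logic lives in src/init.cc,
 * src/transport/net.cc and bootstrap.cc. */
void connectPeersSketch(void) {
  char handle[64]; /* NCCL_NET_HANDLE_MAXSIZE-sized opaque blob */
  (void)handle;

  /* Setup: transport->recvSetup() reaches net->listen(), which writes
   * the plugin's connection info into handle. */

  /* Exchange: bootstrapSend()/bootstrapRecv() (bootstrap.cc) carry
   * handle to the peer over NCCL's own sockets. */

  /* Connect: transport->sendConnect() reaches net->connect(), which
   * reads handle and establishes the plugin-level connection. */
}
```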

rashikakheria (Contributor) commented

> In any case, we could imagine adding a memAlloc() function to the net plugin. @rashikakheria would that help the libfabrics plugin?

I think that might help libfabrics as well. It would provide an opportunity to do fewer memory registrations and to reuse memory regions once they have been registered.
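
For illustration, one purely hypothetical shape such a memAlloc() extension could take (nothing like this exists in nccl_net.h; the name and signature are made up):

```c
#include <stdlib.h>

/* Hypothetical extension: the plugin hands NCCL page-aligned CPU memory
 * it can register once with the NIC and keep reusing. */
int myMemAlloc(int dev, size_t size, void** ptr) {
  (void)dev;
  /* Page alignment keeps the allocation compatible with the existing
   * fork()-safe Infiniband path and with cudaHostRegister(). */
  if (posix_memalign(ptr, 4096, size) != 0) return 1;
  /* ... register the region with the NIC here, so later regMr() calls on
   * this buffer become cheap or unnecessary ... */
  return 0;
}
```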
