Using NCCL 2.4.6.
During the initialization of a communicator in ncclCommInitRank(...), each rank checks locally whether IB is set up and available and, if so, uses IB for communication.
If, for the same communicator initialization, a participant on another node does not have IB set up, that participant uses TCP instead.
In this mixed environment, the ncclCommInitRank call hangs forever.
Are we not doing a global check to make sure peers can use IB before picking it?
Here are the stack traces where it hangs:
rank 0 (TCP):
#0 0x00007fffa99437f8 in accept () from /usr/lib64/libpthread.so.0
#1 0x00007fff86bdb958 in ncclSocketAccept (listenComm=0x1b040e180, recvComm=0x7ffffd203040) at transport/net_socket.cc:153
#2 0x00007fff86bc159c in bootstrapNetAccept (recvComm=0x7ffffd203040, listenComm=0x1b040e180) at bootstrap.cc:20
#3 bootstrapRecv (commState=0x1b03f9980, peer=<optimized out>, data=0x7ffffd203100, size=<optimized out>) at bootstrap.cc:332
#4 0x00007fff86bb02c4 in p2pSetup (comm=0x1b1fd39c0, channel=0x1b1fd3bc0, nrecv=1, peerRecv=0x1b1fd3bc0, nsend=1, peerSend=0x1b1fd3bc4) at init.cc:667
#5 0x00007fff86bbb3d0 in initTransportsRank (commId=0x7ffffd203840, comm=0x1b1fd39c0) at init.cc:814
#6 ncclCommInitRankSync (newcomm=0x1b1fbae80, nranks=2, commId=..., myrank=0) at init.cc:950
#7 0x00007fff86bbbecc in ncclCommInitRank (newcomm=0x1b1fbae80, nranks=<optimized out>, commId=..., myrank=0) at init.cc:989
rank 1 (IB):
#0 0x00007fff9ca139f8 in recv () from /lib64/libpthread.so.0
#1 0x00007fff79cadc64 in socketProgress (op=<optimized out>, offset=<optimized out>, size=<optimized out>, ptr=<optimized out>, fd=<optimized out>) at include/socket.h:401
#2 socketWait (fd=<optimized out>, ptr=0x7fffe0fad260, size=48, offset=0x7fffe0fad230, op=1) at include/socket.h:422
#3 0x00007fff79cb1e74 in socketReceive (size=48, ptr=0x7fffe0fad260, fd=<optimized out>) at include/socket.h:434
#4 ncclIbAccept (listenComm=0x1df57a850, recvComm=0x13ec6caf8) at transport/net_ib.cc:467
#5 0x00007fff79ca5e00 in ncclNetAccept (recvComm=<optimized out>, listenComm=0x1df57a850) at include/net.h:28
#6 netRecvConnect (connectInfo=<optimized out>, recv=<optimized out>) at transport/net.cc:384
#7 0x00007fff79c8046c in p2pSetup (comm=0x1df573780, channel=0x1df573780, nrecv=1, peerRecv=0x1df573780, nsend=1, peerSend=0x1df573784) at init.cc:678
#8 0x00007fff79c8b3d0 in initTransportsRank (commId=0x7fffe0fadb60, comm=0x1df573780) at init.cc:814
#9 ncclCommInitRankSync (newcomm=0x1df3fbe00, nranks=2, commId=..., myrank=1) at init.cc:950
#10 0x00007fff79c8becc in ncclCommInitRank (newcomm=0x1df3fbe00, nranks=<optimized out>, commId=..., myrank=1) at init.cc:989
A workaround is to set NCCL_IB_DISABLE=1 in those mixed environments, but it would be great if NCCL could figure it out by itself.
Please let me know if you need more info.
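For applications that cannot rely on the launcher to export the variable, the workaround above can also be applied from inside the process. The following is a minimal sketch, assuming NCCL reads NCCL_IB_DISABLE when it initializes its network transports, so the variable has to be set before the first NCCL call. The helper name disable_nccl_ib is hypothetical and not part of NCCL.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not an NCCL API): set NCCL_IB_DISABLE=1 in this
 * process's environment so that NCCL falls back to its socket transport.
 * Assumption: NCCL reads the variable during initialization, so this must
 * run before the first NCCL call such as ncclCommInitRank().
 * Returns 0 on success, -1 on failure. */
int disable_nccl_ib(void) {
    /* overwrite=1 replaces any value inherited from the launcher */
    if (setenv("NCCL_IB_DISABLE", "1", 1) != 0) {
        perror("setenv(NCCL_IB_DISABLE)");
        return -1;
    }
    return 0;
}
```

Every rank of the communicator would need to call this (or have the variable exported) so that all of them pick the same transport.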
Hello @sjeaugey,
Would implementing a global check to pick a non-conflicting transport between the ranks be feasible?
If not, could we time out and display an error instead of hanging?
Thank you.
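On the timeout suggestion: rank 0 in the trace above is blocked in accept(). One generic way to bound such a wait is to poll() the listening socket first and only call accept() once a connection is pending. This is a sketch of that general POSIX technique, not NCCL's code; the function names wait_readable and accept_with_timeout are hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>

/* Wait until fd becomes readable or timeout_ms elapses.
 * Returns 1 if poll reported an event, 0 on timeout, -1 on error.
 * Note: if a signal interrupts poll, the wait restarts with the full
 * timeout, so the total wait can exceed timeout_ms. */
int wait_readable(int fd, int timeout_ms) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc < 0) return -1;
    return rc > 0 ? 1 : 0;
}

/* Accept a connection, giving up after timeout_ms milliseconds.
 * Returns the new socket on success; on timeout returns -1 with
 * errno set to ETIMEDOUT; on other errors returns -1 with errno
 * set by poll or accept. */
int accept_with_timeout(int listen_fd, int timeout_ms) {
    int ready = wait_readable(listen_fd, timeout_ms);
    if (ready < 0) return -1;
    if (ready == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    return accept(listen_fd, NULL, NULL);
}
```

The same wait_readable check could precede the recv() that rank 1 is blocked in, so that both sides report an error instead of hanging.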
Indeed, since IB and sockets have similar bootstrap protocols (although they were never designed to interoperate), the behavior when one connects to the other is undefined and might hang.
Factoring out that code might be feasible and would help check that both sides are trying to connect over the same type of network. Maybe we could do that with a small patch.
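One way such a check could look is a small common header that both transports exchange before any transport-specific data. This is only an illustrative sketch: the struct, the magic value, and the function names are hypothetical and do not come from the NCCL source. It also assumes both peers share the same byte order; a real implementation would convert fields to network order.

```c
#include <stdint.h>

/* Hypothetical transport tags, for illustration only. */
enum net_kind { NET_SOCKET = 1, NET_IB = 2 };

/* "NCCL" in ASCII; an arbitrary marker chosen for this sketch. */
#define HANDSHAKE_MAGIC 0x4e43434cu

/* First bytes each side would send on a new connection. */
struct handshake {
    uint32_t magic;  /* identifies the message as a handshake */
    uint32_t kind;   /* transport the sender intends to use */
};

/* Fill a handshake announcing the local transport. */
void handshake_fill(struct handshake *h, enum net_kind kind) {
    h->magic = HANDSHAKE_MAGIC;
    h->kind = (uint32_t)kind;
}

/* Validate a handshake received from a peer.
 * Returns 0 if the peer uses the same transport,
 * -1 if the message is not a handshake at all,
 * -2 if the peer announced a different transport. */
int handshake_check(const struct handshake *h, enum net_kind local_kind) {
    if (h->magic != HANDSHAKE_MAGIC) return -1;
    if (h->kind != (uint32_t)local_kind) return -2;
    return 0;
}
```

With a check like this, a socket rank connecting to an IB rank would get an explicit mismatch error rather than waiting for bytes that never arrive.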
Longer term, we're thinking about a more generic solution in which each node would advertise which type of network it will use, how many NICs it has, and which network domain (e.g. netmask for sockets) applies to each network.
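To make the idea concrete, here is a rough sketch of what an advertised descriptor and a global selection could look like. Everything in it is an assumption for illustration: the net_desc layout, the enum values, and the rule "use IB only if every rank advertises it, otherwise fall back to sockets" (which presumes that every rank can always use sockets). It does not describe any actual NCCL design.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical network types a rank can advertise. */
enum net_type { NET_TYPE_SOCKET = 0, NET_TYPE_IB = 1 };

/* Hypothetical per-rank descriptor, gathered from all ranks
 * (for example over the bootstrap network) before picking a transport. */
struct net_desc {
    enum net_type type;  /* best network available locally */
    int nics;            /* number of NICs for that network */
    uint32_t netmask;    /* network domain, e.g. netmask for sockets;
                            carried for illustration, unused below */
};

/* Pick one network type for the whole communicator.
 * Returns NET_TYPE_IB only if every rank advertised IB with at least
 * one NIC; otherwise returns NET_TYPE_SOCKET as the common fallback. */
enum net_type choose_common_net(const struct net_desc *descs, size_t nranks) {
    if (nranks == 0) return NET_TYPE_SOCKET;
    for (size_t i = 0; i < nranks; i++) {
        if (descs[i].type != NET_TYPE_IB || descs[i].nics < 1) {
            return NET_TYPE_SOCKET;
        }
    }
    return NET_TYPE_IB;
}
```

In the two-rank case from this issue, one descriptor would advertise sockets and the other IB, so both ranks would settle on sockets.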