InfiniBand is picked for transport even if it is not available on the other nodes #234

Open
nvcastet opened this issue Jun 21, 2019 · 3 comments

@nvcastet

nvcastet commented Jun 21, 2019

Using NCCL 2.4.6.
During communicator initialization in ncclCommInitRank(...), each rank checks locally whether IB is set up and available and, if so, uses IB for communication.
If another participant in the same communicator initialization, on a different node, does not have IB set up, that rank uses TCP instead.
In this mixed environment, the ncclCommInitRank call hangs forever.
Is there no global check to make sure all peers can use IB before it is picked?
Here are the stack traces at the point where it hangs:
rank 0 (TCP):

#0  0x00007fffa99437f8 in accept () from /usr/lib64/libpthread.so.0
#1  0x00007fff86bdb958 in ncclSocketAccept (listenComm=0x1b040e180, recvComm=0x7ffffd203040) at transport/net_socket.cc:153
#2  0x00007fff86bc159c in bootstrapNetAccept (recvComm=0x7ffffd203040, listenComm=0x1b040e180) at bootstrap.cc:20
#3  bootstrapRecv (commState=0x1b03f9980, peer=<optimized out>, data=0x7ffffd203100, size=<optimized out>) at bootstrap.cc:332
#4  0x00007fff86bb02c4 in p2pSetup (comm=0x1b1fd39c0, channel=0x1b1fd3bc0, nrecv=1, peerRecv=0x1b1fd3bc0, nsend=1, peerSend=0x1b1fd3bc4) at init.cc:667
#5  0x00007fff86bbb3d0 in initTransportsRank (commId=0x7ffffd203840, comm=0x1b1fd39c0) at init.cc:814
#6  ncclCommInitRankSync (newcomm=0x1b1fbae80, nranks=2, commId=..., myrank=0) at init.cc:950
#7  0x00007fff86bbbecc in ncclCommInitRank (newcomm=0x1b1fbae80, nranks=<optimized out>, commId=..., myrank=0) at init.cc:989

rank 1 (IB):

#0  0x00007fff9ca139f8 in recv () from /lib64/libpthread.so.0
#1  0x00007fff79cadc64 in socketProgress (op=<optimized out>, offset=<optimized out>, size=<optimized out>, ptr=<optimized out>, fd=<optimized out>) at include/socket.h:401
#2  socketWait (fd=<optimized out>, ptr=0x7fffe0fad260, size=48, offset=0x7fffe0fad230, op=1) at include/socket.h:422
#3  0x00007fff79cb1e74 in socketReceive (size=48, ptr=0x7fffe0fad260, fd=<optimized out>) at include/socket.h:434
#4  ncclIbAccept (listenComm=0x1df57a850, recvComm=0x13ec6caf8) at transport/net_ib.cc:467
#5  0x00007fff79ca5e00 in ncclNetAccept (recvComm=<optimized out>, listenComm=0x1df57a850) at include/net.h:28
#6  netRecvConnect (connectInfo=<optimized out>, recv=<optimized out>) at transport/net.cc:384
#7  0x00007fff79c8046c in p2pSetup (comm=0x1df573780, channel=0x1df573780, nrecv=1, peerRecv=0x1df573780, nsend=1, peerSend=0x1df573784) at init.cc:678
#8  0x00007fff79c8b3d0 in initTransportsRank (commId=0x7fffe0fadb60, comm=0x1df573780) at init.cc:814
#9  ncclCommInitRankSync (newcomm=0x1df3fbe00, nranks=2, commId=..., myrank=1) at init.cc:950
#10 0x00007fff79c8becc in ncclCommInitRank (newcomm=0x1df3fbe00, nranks=<optimized out>, commId=..., myrank=1) at init.cc:989

A workaround is to set NCCL_IB_DISABLE=1 in those mixed environments, but it would be great if NCCL could figure this out by itself.

Please let me know if you need more info.
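
For reference, the workaround above could be applied from a launcher script along these lines. This is only a minimal sketch: NCCL_IB_DISABLE is the variable named above, while the helper names `env_with_ib_disabled` and `run_with_ib_disabled` are hypothetical and not part of NCCL. The variable has to be set for every rank, before the process that calls ncclCommInitRank starts.

```python
import os
import subprocess
import sys


def env_with_ib_disabled(base_env=None):
    """Return a copy of the environment with NCCL_IB_DISABLE=1 set.

    Hypothetical helper: the input mapping is copied, never mutated.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_IB_DISABLE"] = "1"
    return env


def run_with_ib_disabled(cmd, base_env=None):
    """Run `cmd` (a list of arguments) in a child process with IB disabled."""
    return subprocess.run(cmd, env=env_with_ib_disabled(base_env), check=True)


if __name__ == "__main__":
    # Demo only: the child prints the variable it inherited.
    # A real launch would start the NCCL program here instead.
    child = [sys.executable, "-c",
             "import os; print(os.environ['NCCL_IB_DISABLE'])"]
    run_with_ib_disabled(child)
```

Setting the variable in the child's environment rather than in the parent's keeps the workaround scoped to the jobs that actually run in a mixed environment.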

@sjeaugey
Member

This is indeed a limitation that has existed since NCCL 2.0.

Beyond the IB vs. sockets choice, the order of the IB or IP interfaces also needs to match between ranks when there are parallel independent networks.
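
To illustrate the ordering constraint, here is a small sketch of a check that could be run on interface lists gathered from every rank. It assumes each rank can describe its interfaces as an ordered list of network labels; the function name `first_order_mismatch` and the labels are hypothetical, and this is not how NCCL itself represents interfaces.

```python
from itertools import zip_longest


def first_order_mismatch(per_rank_networks):
    """Compare every rank's ordered network list against rank 0.

    Returns (rank, index) for the first position where a rank disagrees
    with rank 0, or None when all ranks list the same networks in the
    same order. A missing entry (shorter list) counts as a mismatch.
    """
    reference = per_rank_networks[0]
    for rank, nets in enumerate(per_rank_networks[1:], start=1):
        for index, (expected, actual) in enumerate(zip_longest(reference, nets)):
            if expected != actual:
                return rank, index
    return None
```

For example, `first_order_mismatch([["netA", "netB"], ["netB", "netA"]])` reports rank 1 at index 0, which is the case where two independent networks are enumerated in a different order on the two nodes.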

@nvcastet
Author

Hello @sjeaugey,
Would implementing a global check to pick a non-conflicting transport between the ranks be feasible?
If not, could we time out and display an error instead of hanging?
Thank you.
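
To make the timeout idea concrete, here is a minimal sketch using plain TCP sockets. It is not NCCL code: the function name `accept_with_timeout` and the error text are hypothetical, and it only shows the general pattern of bounding a blocking accept() and the receives that follow it.

```python
import socket


def accept_with_timeout(listener, timeout_s):
    """Accept one connection, or raise TimeoutError after `timeout_s` seconds.

    The same timeout is applied to the accepted socket, so a later recv()
    on it cannot block forever either.
    """
    listener.settimeout(timeout_s)
    try:
        conn, _addr = listener.accept()
    except socket.timeout:
        raise TimeoutError(
            f"no peer connected within {timeout_s}s; "
            "peers may have selected different transports"
        ) from None
    conn.settimeout(timeout_s)
    return conn
```

With something like this, a rank stuck waiting for a peer that chose a different transport would fail with an explicit error instead of hanging in accept() or recv().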

@sjeaugey
Member

Indeed, since IB and sockets have similar bootstrap protocols (although they were never designed to interoperate), the behavior when one connects to the other is undefined, and the connection might hang.

Factoring out that common code might be feasible and would help check that both sides are trying to connect over the same type of network. Maybe we could do that with a small patch.
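
As a rough illustration of such a check, the sketch below has each side send a small tagged header before anything else and verify the peer's tag. Everything in it is an assumption made for illustration: the magic value, the transport ids, the wire layout, and the function names are hypothetical and do not reflect NCCL's actual bootstrap protocol.

```python
import socket
import struct

MAGIC = b"NETH"                          # hypothetical handshake marker
TRANSPORT_IDS = {"socket": 1, "ib": 2}   # hypothetical transport ids
HELLO = struct.Struct("!4sB")            # magic (4 bytes) + transport id (1 byte)


def send_hello(sock, transport):
    """Announce the local transport type as the first bytes on the connection."""
    sock.sendall(HELLO.pack(MAGIC, TRANSPORT_IDS[transport]))


def recv_exact(sock, n):
    """Read exactly n bytes, failing if the peer closes the connection early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection during handshake")
        buf += chunk
    return buf


def check_hello(sock, transport):
    """Verify that the peer announced the same transport type as ours."""
    magic, peer_id = HELLO.unpack(recv_exact(sock, HELLO.size))
    if magic != MAGIC:
        raise ConnectionError("peer did not send the expected handshake header")
    if peer_id != TRANSPORT_IDS[transport]:
        raise ConnectionError(
            f"transport mismatch: local side uses {transport!r}, "
            f"peer announced id {peer_id}"
        )
```

Because the header is exchanged before any transport-specific data, a mismatch surfaces as an immediate error on connect rather than as undefined behavior later in the setup.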

Longer term, we're thinking about a more generic solution in which each node would advertise which type of network it will use, how many NICs it has, and which network domain (e.g. netmask for sockets) applies to each network.
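
A sketch of how such advertised information might be used to pick a transport that every rank supports is shown below. It assumes the per-rank descriptions have already been gathered (for example over the bootstrap network); the `NetworkInfo` type, its fields, the preference order, and `pick_common_network` are all hypothetical names for illustration, not an NCCL design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkInfo:
    kind: str     # network type, e.g. "ib" or "socket"
    nics: int     # number of NICs on this network
    domain: str   # network domain, e.g. a subnet for sockets


def pick_common_network(per_rank, preference=("ib", "socket")):
    """Pick the preferred (kind, domain) pair that every rank advertises.

    `per_rank` holds one list of NetworkInfo per rank. Raises RuntimeError
    when no network type and domain is shared by all ranks.
    """
    if not per_rank:
        raise ValueError("no ranks given")
    common = {(n.kind, n.domain) for n in per_rank[0]}
    for nets in per_rank[1:]:
        common &= {(n.kind, n.domain) for n in nets}
    for kind in preference:
        domains = sorted(d for k, d in common if k == kind)
        if domains:
            return kind, domains[0]
    raise RuntimeError("no network type and domain is shared by all ranks")
```

In the mixed setup from this issue, a rank advertising only a socket network and a rank advertising both IB and the same socket network would agree on sockets, and a setup with no shared network would fail with an error instead of hanging.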
