NCCL hangs in P2P #275

AHEADer · 2019-12-10T03:28:40Z

In init.cc, NCCL wraps the connect operation in NCCLCHECK. However, a failure is only reported in the logs, and ncclSuccess is still returned to the caller. If this failure occurs in p2p.cc, its return value is not propagated to the user application, which makes the application hang for a long time (maybe forever?).

sjeaugey · 2019-12-10T17:27:01Z

This is not normal. The error should be reported to the original ncclCommInitRank call or the ncclGroupEnd call, if used. Can you share an example and a log where there is an error but you get a success?

AHEADer · 2019-12-11T01:19:33Z

I ran TensorFlow with Horovod and NCCL, but the GPUs (1080 Ti) in my cluster already had too many P2P connections. My job then hung there, printing only "failed to open CUDA IPC handle".

AHEADer · 2019-12-11T02:23:26Z

Actually, you can look at which function invokes the send/connect functions. That function just does NCCLCHECK for this ncclInternalError but does not report it to the functions that invoke it; on the contrary, it returns ncclSuccess even if one of its NCCLCHECK calls fails.

sjeaugey · 2020-01-28T00:31:29Z

Sorry for responding late. I'm not sure where we return ncclSuccess instead of an error.

If cudaIpcOpenMemHandle fails, it should return ncclUnhandledCudaError. See p2pSendConnect here: https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L244 and p2pRecvConnect here: https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L275.

NCCLCHECK should then propagate the error all the way up to the ncclCommInitRank/ncclCommInitAll call. Setting NCCL_DEBUG=INFO, you should see the whole backtrace until the ncclCommInit* call.
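To illustrate the propagation described above, here is a minimal, self-contained sketch of a check-and-return macro. It is not the NCCL source: the enum, the `SKETCH_CHECK` macro, and the three `sketch*` functions are hypothetical stand-ins, and the real NCCLCHECK and call chain may differ in detail. The point is only that each level logs its own file and line and then returns the error code, so the outermost call sees the failure rather than a success.

```cpp
#include <cstdio>

// Hypothetical stand-ins for illustration only; not the NCCL definitions.
enum sketchResult_t { sketchSuccess = 0, sketchUnhandledCudaError = 1 };

// Simplified check macro (an assumption about the general pattern):
// on failure, log file:line and return the error code to the caller.
#define SKETCH_CHECK(call) do { \
  sketchResult_t res_ = (call); \
  if (res_ != sketchSuccess) { \
    std::fprintf(stderr, "%s:%d -> %d\n", __FILE__, __LINE__, (int)res_); \
    return res_; \
  } \
} while (0)

// Lowest level: pretend that opening the CUDA IPC handle failed.
static sketchResult_t sketchP2pConnect(bool ipcOpenFails) {
  if (ipcOpenFails) return sketchUnhandledCudaError;
  return sketchSuccess;
}

// Middle level: the macro returns early with the error from below.
static sketchResult_t sketchTransportSetup(bool ipcOpenFails) {
  SKETCH_CHECK(sketchP2pConnect(ipcOpenFails));
  return sketchSuccess;  // reached only if every checked call succeeded
}

// Top level: stands in for the init call the application makes.
static sketchResult_t sketchCommInit(bool ipcOpenFails) {
  SKETCH_CHECK(sketchTransportSetup(ipcOpenFails));
  return sketchSuccess;
}
```

With this pattern, `sketchCommInit(true)` returns `sketchUnhandledCudaError` and prints one line per level on stderr, which corresponds to the kind of backtrace expected with NCCL_DEBUG=INFO. A hang with a success code would suggest that some call in the chain is not wrapped in the check, or that its result is discarded.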

aokomoriuta mentioned this issue Jan 29, 2020

Fix wrong variable name "slice" to "chunk" #288

Merged
