NCCL hangs in P2P #275

Open
AHEADer opened this issue Dec 10, 2019 · 4 comments


@AHEADer

AHEADer commented Dec 10, 2019

In init.cc, NCCL wraps the connect operation in NCCLCHECK. However, the error is only reported in the logs, and ncclSuccess is still returned to the caller. If the failure occurs in p2p.cc, its return value is not propagated to the user application, which makes the application hang for a long time (maybe forever?).

@sjeaugey
Member

This is not normal. The error should be reported to the original ncclCommInitRank call, or to the ncclGroupEnd call if one is used. Can you share an example and a log where there is an error but you get a success?

@AHEADer
Author

AHEADer commented Dec 11, 2019

I ran TensorFlow with Horovod and NCCL, but the GPUs (1080 Ti) in my cluster already had too many P2P connections. My job then hung, printing only "failed to open CUDA IPC handle".

@AHEADer
Author

AHEADer commented Dec 11, 2019

Actually, you can look at which function invokes the send/connect functions. That function just does NCCLCHECK for this ncclInternalError but does not report it to its own callers; on the contrary, it returns ncclSuccess even when one of its NCCLCHECK calls fails.
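To make the suspected pattern concrete, here is a minimal sketch of a caller that logs a failure but still reports success, next to one that hands the error back. Every name in it (`demoResult_t`, `demoConnect`, `setupSwallowing`, `setupPropagating`) is hypothetical and made up for illustration; this is not NCCL's code, and whether NCCL contains such a path is exactly what is in question here.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical result codes for illustration only; not NCCL's enum.
enum demoResult_t { demoSuccess = 0, demoInternalError = 1 };

// Hypothetical connect step that can be made to fail.
demoResult_t demoConnect(bool simulateFailure) {
  return simulateFailure ? demoInternalError : demoSuccess;
}

// Suspected pattern: the failure is logged, but the caller still sees success.
demoResult_t setupSwallowing(bool simulateFailure) {
  demoResult_t res = demoConnect(simulateFailure);
  if (res != demoSuccess) {
    std::fprintf(stderr, "connect failed: %d\n", (int)res);
  }
  return demoSuccess;  // the error is dropped here
}

// Expected pattern: the failure is logged and returned to the caller.
demoResult_t setupPropagating(bool simulateFailure) {
  demoResult_t res = demoConnect(simulateFailure);
  if (res != demoSuccess) {
    std::fprintf(stderr, "connect failed: %d\n", (int)res);
    return res;
  }
  return demoSuccess;
}

// Small usage check: only the propagating variant lets the caller notice.
void compareVariants() {
  assert(setupSwallowing(true) == demoSuccess);
  assert(setupPropagating(true) == demoInternalError);
}
```

With the first variant, an application has nothing to react to and would go on to wait for a connection that was never established, which would look like a hang.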

@sjeaugey
Member

Sorry for responding late. I'm not sure where we return ncclSuccess instead of an error.

If cudaIpcOpenMemHandle fails, the connect function should return ncclUnhandledCudaError. See p2pSendConnect here: https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L244 and p2pRecvConnect here: https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L275.

NCCLCHECK should then propagate the error all the way up to the ncclCommInitRank/ncclCommInitAll call. With NCCL_DEBUG=INFO set, you should see the whole backtrace up to the ncclCommInit* call.
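As a rough illustration of that propagation, the sketch below uses a check-and-return macro at each level of a small call chain. It is a simplified stand-in, not NCCL's actual macro or call graph: `SKETCH_CHECK`, `sketchResult_t`, `openPeerHandle`, `connectPeer` and `initComm` are all hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical result codes for illustration only; not NCCL's enum.
enum sketchResult_t { sketchSuccess = 0, sketchUnhandledCudaError = 1 };

// Check-and-return macro: log the call site, then return the error to the caller.
#define SKETCH_CHECK(call) do { \
  sketchResult_t res_ = (call); \
  if (res_ != sketchSuccess) { \
    std::fprintf(stderr, "%s:%d -> %d\n", __FILE__, __LINE__, (int)res_); \
    return res_; \
  } \
} while (0)

// Innermost step: stands in for opening a peer's IPC handle, which may fail.
sketchResult_t openPeerHandle(bool simulateFailure) {
  return simulateFailure ? sketchUnhandledCudaError : sketchSuccess;
}

// Middle level: stands in for a transport connect function.
sketchResult_t connectPeer(bool simulateFailure) {
  SKETCH_CHECK(openPeerHandle(simulateFailure));
  return sketchSuccess;
}

// Top level: stands in for the communicator init call the application makes.
sketchResult_t initComm(bool simulateFailure) {
  SKETCH_CHECK(connectPeer(simulateFailure));
  return sketchSuccess;
}

// Small usage check: the innermost failure reaches the top-level caller.
void runSketch() {
  assert(initComm(false) == sketchSuccess);
  assert(initComm(true) == sketchUnhandledCudaError);
}
```

In this sketch, `initComm(true)` prints one `file:line -> code` line to stderr for each `SKETCH_CHECK` the error passes through and returns `sketchUnhandledCudaError`, so the application can react instead of waiting.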
