NCCL hangs in P2P #275
Comments
This is not normal. The error should be reported by the original ncclCommInitRank call, or by the ncclGroupEnd call if one is used. Can you share an example and a log where there is an error but you get a success?
I ran TensorFlow with Horovod and NCCL, but the GPUs (1080 Ti) in my cluster already had too many P2P connections. My job then hung, printing only "failed to open CUDA IPC handle".
Actually, you can look at which function invokes the send/connect functions. That function wraps the send/connect calls in NCCLCHECK for this ncclInternalError, but it does not report the error to its own callers. On the contrary, it returns ncclSuccess even if one of its NCCLCHECK calls fails.
Sorry for responding late. I'm not sure where we return ncclSuccess instead of an error. If NCCLCHECK should then propagate the error all the way up to the
In init.cc, NCCL performs the connect operation under NCCLCHECK. However, it only reports the error in the logs and still returns ncclSuccess to the caller. If this failure occurs in p2p.cc, its return value is not propagated to the user application, which makes the application hang for a long time (maybe forever?).
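To make the difference concrete, here is a minimal sketch of the two behaviors being discussed: a wrapper that logs a failed connect but still returns success, and one that propagates the error to its caller. This is not NCCL's actual code. `Result`, `CHECK`, `connectPeer`, `setupSwallow`, `setupPropagate`, and `initPropagate` are all hypothetical names made up for illustration, and `CHECK` only imitates the general log-and-return style of a macro like NCCLCHECK.

```cpp
#include <cassert>  // used by the usage checks shown after this block
#include <cstdio>

// Hypothetical stand-ins for result codes; NOT the real nccl.h definitions.
enum Result { kSuccess = 0, kInternalError = 1 };

// Hypothetical check macro: log the failure and return it to the caller.
#define CHECK(call)                                                   \
  do {                                                                \
    Result r_ = (call);                                               \
    if (r_ != kSuccess) {                                             \
      std::fprintf(stderr, "%s:%d -> error %d\n", __FILE__, __LINE__, \
                   static_cast<int>(r_));                             \
      return r_;                                                      \
    }                                                                 \
  } while (0)

// Simulated connect step; fails when `ok` is false
// (think of an IPC handle that could not be opened).
Result connectPeer(bool ok) { return ok ? kSuccess : kInternalError; }

// Swallowing variant: the failure is logged, but the caller sees success.
Result setupSwallow(bool ok) {
  Result r = connectPeer(ok);
  if (r != kSuccess) std::fprintf(stderr, "connect failed (ignored)\n");
  return kSuccess;  // the error is lost here
}

// Propagating variant: the failure is returned to the caller.
Result setupPropagate(bool ok) {
  CHECK(connectPeer(ok));
  return kSuccess;
}

// One more level up, standing in for an init-style entry point.
Result initPropagate(bool ok) {
  CHECK(setupPropagate(ok));
  return kSuccess;
}
```

With these definitions, `assert(setupSwallow(false) == kSuccess)` and `assert(initPropagate(false) == kInternalError)` both hold. In the first case the application has no way to learn that the connect failed, so it can only keep waiting. In the second it can abort or retry.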