NCCL hang on socket recv() #238
Comments
Hi @wangsiyu, can you please try with NCCL 2.4.7 and see if it fixes the issue? Thanks!
I have the same issue. @wangsiyu, did you solve it? Thanks in advance.
I worked around this problem by setting NCCL_LL_THRESHOLD=0. It seems to be a bug in 2.4.2. I did not upgrade the NCCL version, but I think this bug has been fixed in 2.4.6 according to its release notes. Thanks very much!
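For anyone hitting the same hang, here is a minimal sketch of that workaround (an illustration, not an official fix). The key point is that the environment variable must be set before whatever library initializes NCCL; the Horovod/TensorFlow stack below is only an assumed setup.

```python
# Minimal sketch of the NCCL_LL_THRESHOLD=0 workaround described above.
# The Horovod/TensorFlow stack here is an assumption; any framework that
# initializes NCCL works the same way, as long as the variable is set
# before NCCL is initialized.
import os

os.environ["NCCL_LL_THRESHOLD"] = "0"        # disable the low-latency (LL) protocol
os.environ.setdefault("NCCL_DEBUG", "WARN")  # also surface NCCL warnings in the logs

import horovod.tensorflow as hvd  # hypothetical training setup

hvd.init()
# ... build the model and run training as usual ...
```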
I can reproduce it on 2.4.7.
Can you explain in more detail what you could reproduce with 2.4.7? Do you have a hang which can be solved by setting NCCL_LL_THRESHOLD=0? Is it a hang showing one thread blocked in recv()? Many things can cause hangs, so it might be good to open another bug to make sure it is the same issue. Also note that NCCL 2.5 is the latest stable version you might want to try, and a preview of NCCL 2.6 is also available on the v2.6 branch: https://github.com/nvidia/nccl/tree/v2.6. Feel free to give it a try. Thanks!
I use
Sorry that I cannot paste the full trace. By the way,
Can you describe the problem you are facing in a bit more detail? Do you have a hang? Does it reproduce every time, immediately or after some time? Also, pasting the NCCL debug output would help (at least set NCCL_DEBUG=WARN to double-check the NCCL version and catch any warnings).
It always hangs during the training, but at different steps each time. I set
@372046933 Could you be using different versions of NCCL on different nodes, or have you compiled NCCL from files of different versions? I just noticed that your error message shows different NCCL home paths. As mentioned by @sjeaugey, it would be nice to confirm the NCCL version on the different nodes. Another possibility is having set different values for some environment variables on different nodes. For example, one could have set
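One way to confirm which NCCL build each node actually loads is to query ncclGetVersion() at runtime, for example with a small ctypes snippet like the sketch below and run it once per node/container. The library name "libnccl.so.2" and the version decoding for 2.4.x releases are assumptions on my side.

```python
# Sketch: print the NCCL version actually loaded on this node by calling
# ncclGetVersion() through ctypes. The library name "libnccl.so.2" is an
# assumption; point it at whatever NCCL build your framework links against.
import ctypes

nccl = ctypes.CDLL("libnccl.so.2")
version = ctypes.c_int()
ret = nccl.ncclGetVersion(ctypes.byref(version))  # 0 == ncclSuccess
if ret != 0:
    raise RuntimeError(f"ncclGetVersion failed with error {ret}")

# NCCL 2.x releases before 2.9 encode the version as major*1000 + minor*100 + patch,
# so 2407 means 2.4.7.
v = version.value
print(f"NCCL {v // 1000}.{(v % 1000) // 100}.{v % 100}")
```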
Well, I run the script in Docker; both containers have the same image and environment variables set by Kubernetes. I have checked that the Docker image checksum is the same on all nodes. By the way, TensorFlow 2.0 is compiled together with NCCL 2.4.7.
Thanks for the check. The next thing I would check is whether the ranks make the collective call with different sizes.
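A quick way to check that is to log the element count on every rank just before the collective call and compare the logs across nodes. The sketch below assumes a Horovod/TensorFlow setup purely for illustration; the same idea applies to any framework driving NCCL.

```python
# Sketch: wrap the allreduce so every rank logs the size it is about to reduce.
# If the hang comes from mismatched collective arguments, one rank will print a
# different element count for the same step. Horovod/TensorFlow is an assumed
# stack here; adapt the logging to whatever actually drives NCCL in your job.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

def checked_allreduce(tensor, step):
    # In eager mode tf.size() can be converted to a Python int directly.
    num_elements = int(tf.size(tensor))
    print(f"step {step} rank {hvd.rank()}: allreduce over {num_elements} elements",
          flush=True)
    return hvd.allreduce(tensor)
```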
Hi, I am an engineer at Alibaba Group and have run into some problems using NCCL in recent days.

When I run NCCL 2.4.2 across multiple nodes, the program hangs randomly. nvidia-smi shows that GPU utilization is always 100%, and CPU utilization is not 0. When I attach gdb -p to the hung process on each node, I find some threads stopped in a socket recv(). At the same time, NCCL_DEBUG=WARN does not show any more useful information: it only reports that IB detection failed, but the hang happened after training for a long time. Is this a bug in NCCL, or does inappropriate usage lead to this hang?