NCCL hang on socket recv() #238

Open
wangsiyu opened this issue Jul 10, 2019 · 11 comments

Comments

@wangsiyu

wangsiyu commented Jul 10, 2019

Hi, I am an engineer at Alibaba Group and have run into a problem using NCCL recently.
When I run NCCL 2.4.2 across multiple nodes, the program hangs randomly. nvidia-smi shows GPU utilization stuck at 100%.
[screenshot: nvidia-smi output showing 100% GPU utilization]
CPU utilization is not 0 either.
[screenshot: top output showing non-zero CPU utilization]

I attached gdb -p to the hung process on each node and found some threads blocked in a socket recv() call.
[screenshot: gdb backtrace stopped in recv()]
Meanwhile, NCCL_DEBUG=WARN does not show any additional useful information. It only reports that IB detection failed, but the hang happens after training has been running for a long time.
[screenshot: NCCL_DEBUG=WARN output]
Is this a bug in NCCL, or does some inappropriate usage lead to this hang?
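
For reference, the gdb inspection described above can be scripted so that every node dumps its thread backtraces the same way (a minimal sketch, not part of the original report; it assumes gdb is installed and the PID of the hung worker is known):

# Sketch: attach gdb in batch mode to a hung process, print every
# thread's backtrace, then detach. Requires permission to ptrace the PID.
import subprocess
import sys

def dump_backtraces(pid: int) -> str:
    result = subprocess.run(
        ["gdb", "-batch", "-p", str(pid), "-ex", "thread apply all bt"],
        capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    print(dump_backtraces(int(sys.argv[1])))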

@kwen2501
Contributor

Hi @wangsiyu, can you please try NCCL 2.4.7 and see if it fixes the issue? Thanks!

@Mykheievskyi

I have the same issue. @wangsiyu, did you solve it? Thanks in advance.

@wangsiyu
Author

I worked around this problem by setting NCCL_LL_THRESHOLD=0. It seems to be a bug in 2.4.2. Although I did not upgrade the NCCL version, I think this bug has been fixed in 2.4.6 according to its release notes. Thanks very much!
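
For anyone applying the same workaround from a Python launcher, a minimal sketch (assuming the variable has to be exported before the framework that loads NCCL is imported, and identically on every node; the commented-out training entry point is hypothetical):

# Sketch: NCCL reads its environment variables at initialization time,
# so set them before importing/initializing the framework that uses NCCL,
# and make sure every node/rank uses the same values.
import os

os.environ["NCCL_LL_THRESHOLD"] = "0"   # the workaround discussed above
os.environ["NCCL_DEBUG"] = "WARN"       # surface NCCL warnings in the logs

# import tensorflow as tf   # hypothetical: import the framework only now
# train()                   # hypothetical training entry point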

@372046933

I can reproduce it on 2.4.7

@sjeaugey
Member

sjeaugey commented Mar 2, 2020

Can you explain in more detail what you could reproduce with 2.4.7? Do you have a hang which can be solved by setting NCCL_LL_THRESHOLD=0? Is it a hang showing one thread blocked in recv()?

Many things could cause hangs, so maybe it would be good to open another bug to make sure it is the same issue. Also note that NCCL 2.5 is the latest stable version you might want to try, and a preview of NCCL 2.6 is also available on the v2.6 branch: https://github.com/nvidia/nccl/tree/v2.6. Feel free to give it a try. Thanks!

@372046933

I used gdb to print the stack trace. It's blocked at:

#0 syscall ... x86_64/syscall.S:38
#1 nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec)
from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so

Sorry that I cannot paste the full trace. By the way, NCCL_LL_THRESHOLD=0 fixed the hang in the three-node setting, but 10 nodes still fail occasionally.

@sjeaugey
Member

sjeaugey commented Mar 3, 2020

Can you describe the problem you are facing in a bit more detail? Do you have a hang? Does it reproduce every time, immediately or after some time?

Also, pasting the NCCL debug output would help (at least set NCCL_DEBUG=WARN to double-check the NCCL version and catch any warnings).

@372046933

It always hangs during training, but at a different step each time. I set NCCL_DEBUG=INFO and got the following messages before the hang:

[0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/SOCKET : message truncated : receiving 1048576 bytes instead of 32768
[0] NCCL INFO bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:34 -> 3
[0] NCCL INFO external/nccl_archive/src/transport/net.cc:533 -> 3
[0] NCCL INFO external/nccl_archive/src/transport.cc:163 -> 3 [Proxy Thread]

@kwen2501
Contributor

kwen2501 commented Mar 4, 2020

@372046933 Could you be using different versions of NCCL on different nodes, or have you compiled NCCL from files of different versions? I just noticed that your error messages contain different NCCL home paths. As mentioned by @sjeaugey, it would be nice to confirm the NCCL version on the different nodes.
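
One way to confirm the loaded NCCL version programmatically is the public ncclGetVersion() API (a sketch, not from this thread; it assumes a dynamically linked libnccl.so.2, whereas here NCCL appears to be statically built into TensorFlow, in which case the NCCL_DEBUG output is the more reliable check):

# Sketch: query the NCCL version on this node via ncclGetVersion() using ctypes.
import ctypes
import socket

nccl = ctypes.CDLL("libnccl.so.2")        # assumes libnccl is on the loader path
version = ctypes.c_int()
ret = nccl.ncclGetVersion(ctypes.byref(version))  # ncclResult_t ncclGetVersion(int*)
assert ret == 0, f"ncclGetVersion failed with ncclResult_t={ret}"

v = version.value
# NCCL 2.4/2.5 encode the version as major*1000 + minor*100 + patch (2407 -> 2.4.7)
print(f"{socket.gethostname()}: NCCL {v // 1000}.{(v % 1000) // 100}.{v % 100}")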

Another possibility is having set different values for some environment variables on different nodes. For example, one could have set NCCL_LL_THRESHOLD=0 on one node while not doing so on the other nodes. From the message-truncation warning you are getting, that is likely the cause: 1048576 = 1 MB is the message size used for the Simple algorithm (when NCCL_LL_THRESHOLD is forced to 0), whereas 32768 = 32 KB could be the message size used for the LL algorithm (when NCCL_LL_THRESHOLD is not set).
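
One way to rule out such a mismatch is to have every worker print its NCCL-related environment at startup and diff the output across nodes (a sketch, not from this thread; run it at the top of the training script on every node):

# Sketch: print this node's NCCL_* environment so the lines can be
# compared across nodes; a variable set on only one node (e.g.
# NCCL_LL_THRESHOLD) would make ranks pick different protocols.
import os
import socket

host = socket.gethostname()
nccl_env = {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}
if nccl_env:
    for key in sorted(nccl_env):
        print(f"{host} {key}={nccl_env[key]}")
else:
    print(f"{host} (no NCCL_* variables set)")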

@372046933

Well, I run the script in Docker; the containers all use the same image and the same ENV set by Kubernetes. I have checked that the Docker image checksum is the same on all nodes. By the way, TensorFlow 2.0 is compiled together with NCCL 2.4.7.

@kwen2501
Contributor

kwen2501 commented Mar 4, 2020

Thanks for the check. The next thing I would check is whether the ranks make the collective call with different sizes.
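
One way to check that is to have each rank log the buffer it is about to reduce right before the collective call and compare the logs afterwards (a sketch around a hypothetical allreduce_fn stand-in, not from this thread; the truncation warning above corresponds to one rank sending 1048576 bytes while a peer expects 32768):

# Sketch: log what each rank passes to the collective so sizes can be
# compared across ranks. `allreduce_fn` is a hypothetical stand-in for
# whatever collective the framework actually issues (e.g. an NCCL all-reduce).
import socket

def logged_allreduce(tensor, allreduce_fn, step):
    host = socket.gethostname()
    # assumes a numpy-like tensor exposing shape/dtype/nbytes
    print(f"{host} step={step} allreduce shape={tuple(tensor.shape)} "
          f"dtype={tensor.dtype} nbytes={tensor.nbytes}")
    return allreduce_fn(tensor)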
