NCCL hang on socket recv() #238

Open
wangsiyu opened this issue Jul 10, 2019 · 11 comments

Comments

@wangsiyu

wangsiyu commented Jul 10, 2019

Hi, I am an engineer at Alibaba Group and have run into a problem using NCCL recently.
When I run NCCL 2.4.2 across multiple nodes, the program hangs randomly. nvidia-smi shows GPU utilization stuck at 100%.
[screenshot: nvidia-smi output showing 100% GPU utilization]
CPU utilization is not 0 either.
[screenshot: top output showing non-zero CPU utilization]

I attached gdb -p to the hung process on each node and found some threads blocked in a socket recv() call.
[screenshot: gdb backtrace stopped in recv()]
Meanwhile, NCCL_DEBUG=WARN does not show any additional useful information. It only reports that IB detection failed, but the hang happens after training has been running for a long time.
[screenshot: NCCL_DEBUG=WARN output]
Is this a bug in NCCL, or does some inappropriate usage lead to this hang?
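
For reference, the gdb inspection described above can be scripted so that every node dumps its thread backtraces the same way (a minimal sketch, not part of the original report; it assumes gdb is installed and the PID of the hung worker is known):

# Sketch: attach gdb in batch mode to a hung process, print every
# thread's backtrace, then detach. Requires permission to ptrace the PID.
import subprocess
import sys

def dump_backtraces(pid: int) -> str:
    result = subprocess.run(
        ["gdb", "-batch", "-p", str(pid), "-ex", "thread apply all bt"],
        capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    print(dump_backtraces(int(sys.argv[1])))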

@kwen2501
Contributor

Hi @wangsiyu, can you please try NCCL 2.4.7 and see if it fixes the issue? Thanks!

@Mykheievskyi

I have the same issue. @wangsiyu, did you solve it? Thanks in advance.

@wangsiyu
Author

I worked around this problem by setting NCCL_LL_THRESHOLD=0. It seems to be a bug in 2.4.2. Although I did not upgrade the NCCL version, I think this bug has been fixed in 2.4.6 according to its release notes. Thanks very much!
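
For anyone applying the same workaround from a Python launcher, a minimal sketch (assuming the variable has to be exported before the framework that loads NCCL is imported, and identically on every node; the commented-out training entry point is hypothetical):

# Sketch: NCCL reads its environment variables at initialization time,
# so set them before importing/initializing the framework that uses NCCL,
# and make sure every node/rank uses the same values.
import os

os.environ["NCCL_LL_THRESHOLD"] = "0"   # the workaround discussed above
os.environ["NCCL_DEBUG"] = "WARN"       # surface NCCL warnings in the logs

# import tensorflow as tf   # hypothetical: import the framework only now
# train()                   # hypothetical training entry point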

@372046933

I can reproduce it on 2.4.7

@sjeaugey
Member

sjeaugey commented Mar 2, 2020

Can you explain in more detail what you could reproduce with 2.4.7? Do you have a hang which can be solved by setting NCCL_LL_THRESHOLD=0? Is it a hang showing one thread blocked in recv()?

Many things could cause hangs, so maybe it would be good to open another bug to make sure it is the same issue. Also note that NCCL 2.5 is the latest stable version you might want to try, and a preview of NCCL 2.6 is also available on the v2.6 branch: https://github.com/nvidia/nccl/tree/v2.6. Feel free to give it a try. Thanks!

@372046933

I used gdb to print the stack trace. It's blocked at:

#0 syscall ... x86_64/syscall.S:38
#1 nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec)
from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so

Sorry that I cannot paste the full trace. By the way, NCCL_LL_THRESHOLD=0 fixed the hang in the three-node setting, but 10 nodes still fail occasionally.

@sjeaugey
Member

sjeaugey commented Mar 3, 2020

Can you describe the problem you are facing in a bit more detail? Do you have a hang? Does it reproduce every time, immediately or after some time?

Also, pasting the NCCL debug output would help (at least set NCCL_DEBUG=WARN to double-check the NCCL version and catch any warnings).

@372046933

It always hangs during training, but at a different step each time. I set NCCL_DEBUG=INFO and got the following messages before the hang:

[0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/SOCKET : message truncated : receiving 1048576 bytes instead of 32768
[0] NCCL INFO bazel-out/k8-py2-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:34 -> 3
[0] NCCL INFO external/nccl_archive/src/transport/net.cc:533 -> 3
[0] NCCL INFO external/nccl_archive/src/transport.cc:163 -> 3 [Proxy Thread]

@kwen2501
Contributor

kwen2501 commented Mar 4, 2020

@372046933 Could you be using different versions of NCCL on different nodes, or have you compiled NCCL from files of different versions? I just noticed that your error messages contain different NCCL home paths. As mentioned by @sjeaugey, it would be nice to confirm the NCCL version on the different nodes.
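
One way to confirm the loaded NCCL version programmatically is the public ncclGetVersion() API (a sketch, not from this thread; it assumes a dynamically linked libnccl.so.2, whereas here NCCL appears to be statically built into TensorFlow, in which case the NCCL_DEBUG output is the more reliable check):

# Sketch: query the NCCL version on this node via ncclGetVersion() using ctypes.
import ctypes
import socket

nccl = ctypes.CDLL("libnccl.so.2")        # assumes libnccl is on the loader path
version = ctypes.c_int()
ret = nccl.ncclGetVersion(ctypes.byref(version))  # ncclResult_t ncclGetVersion(int*)
assert ret == 0, f"ncclGetVersion failed with ncclResult_t={ret}"

v = version.value
# NCCL 2.4/2.5 encode the version as major*1000 + minor*100 + patch (2407 -> 2.4.7)
print(f"{socket.gethostname()}: NCCL {v // 1000}.{(v % 1000) // 100}.{v % 100}")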

Another possibility is having set different values for some environment variables on different nodes. For example, one could have set NCCL_LL_THRESHOLD=0 on one node while not doing so on the other nodes. From the message-truncation warning you are getting, that is likely the cause: 1048576 = 1 MB is the message size used for the Simple algorithm (when NCCL_LL_THRESHOLD is forced to 0), whereas 32768 = 32 KB could be the message size used for the LL algorithm (when NCCL_LL_THRESHOLD is not set).
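
One way to rule out such a mismatch is to have every worker print its NCCL-related environment at startup and diff the output across nodes (a sketch, not from this thread; run it at the top of the training script on every node):

# Sketch: print this node's NCCL_* environment so the lines can be
# compared across nodes; a variable set on only one node (e.g.
# NCCL_LL_THRESHOLD) would make ranks pick different protocols.
import os
import socket

host = socket.gethostname()
nccl_env = {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}
if nccl_env:
    for key in sorted(nccl_env):
        print(f"{host} {key}={nccl_env[key]}")
else:
    print(f"{host} (no NCCL_* variables set)")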

@372046933

Well, I run the script in Docker; the containers all use the same image and the same ENV set by Kubernetes. I have checked that the Docker image checksum is the same on all nodes. By the way, TensorFlow 2.0 is compiled together with NCCL 2.4.7.

@kwen2501
Contributor

kwen2501 commented Mar 4, 2020

Thanks for the check. The next thing I would check is whether the ranks make the collective call with different sizes.
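
One way to check that is to have each rank log the buffer it is about to reduce right before the collective call and compare the logs afterwards (a sketch around a hypothetical allreduce_fn stand-in, not from this thread; the truncation warning above corresponds to one rank sending 1048576 bytes while a peer expects 32768):

# Sketch: log what each rank passes to the collective so sizes can be
# compared across ranks. `allreduce_fn` is a hypothetical stand-in for
# whatever collective the framework actually issues (e.g. an NCCL all-reduce).
import socket

def logged_allreduce(tensor, allreduce_fn, step):
    host = socket.gethostname()
    # assumes a numpy-like tensor exposing shape/dtype/nbytes
    print(f"{host} step={step} allreduce shape={tuple(tensor.shape)} "
          f"dtype={tensor.dtype} nbytes={tensor.nbytes}")
    return allreduce_fn(tensor)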
