
Tensorflow processes with horovod(NCCL) get stuck during the training. #306

Open
jianyuheng opened this issue Mar 19, 2020 · 16 comments

@jianyuheng

I found that the TensorFlow processes get stuck after training for almost 30 hours, and this problem can be reproduced every time. All GPUs show 100% utilization when the processes hang.

I have raised an issue at horovod/horovod#1799 (comment), but I am not sure this is a Horovod problem, so I think this issue is also necessary.

@sjeaugey
Member

sjeaugey commented Mar 19, 2020

We've had similar problems in the past with LL, so disabling it (NCCL_PROTO=^LL) is definitely the first thing to try. More generally, see whether one protocol/algorithm is responsible for the hang. But of course, since it takes 30 hours to reproduce, it's a bit hard to debug.
Also make sure to set NCCL_DEBUG=WARN to have warnings printed, and confirm the NCCL version you are using is the one you think it is.

Beyond that, you could also try NCCL 2.6 (https://github.com/NVIDIA/nccl/tree/v2.6).

On a more general note, looking at your startup script, it would be good to avoid setting NCCL environment variables unless you really need them.
Taking your list, these are normal user/system-specific config:

  • NCCL_SOCKET_IFNAME=eth1, NCCL_IB_GID_INDEX=3, NCCL_IB_HCA=$net_devices, NCCL_IB_SL=3
  • NCCL_DEBUG=INFO this is OK too, although sometimes setting it to WARN can help you spot the important warnings in the middle of the rest.

However, setting the ones below has a significant potential for causing issues, like crashes, hangs and data corruption (because we don't heavily test all the possible settings, only the default settings):

  • NCCL_BUFFSIZE=4194304 this should not be needed, it is the default. A typo on the value could cause misalignment and bugs.
  • NCCL_IB_CUDA_SUPPORT=1, NCCL_P2P_DISABLE=0, NCCL_SHM_DISABLE=0, NCCL_IB_DISABLE=0, NCCL_CHECKS_DISABLE=1, these are the default values. Setting them should not hurt, unless there is a bug which causes the behavior to change.
  • NCCL_NET_GDR_READ=1, NCCL_P2P_LEVEL=5, NCCL_IB_GDR_LEVEL=2, NCCL_NET_GDR_LEVEL=2 these should be detected automatically, and should therefore not be set unless you see a performance improvement. If so, we'd like to hear about it. NCCL might disable some protocols automatically, if they could cause data corruption, for example.
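
As a sketch of the trimmed-down configuration suggested above — keeping only the site-specific variables plus the two debugging settings — the environment could look like this (the IB device name mlx5_0 stands in for $net_devices and is a placeholder):

```shell
# Keep only site-specific settings; let NCCL auto-detect the rest.
export NCCL_SOCKET_IFNAME=eth1   # site-specific: network interface
export NCCL_IB_HCA=mlx5_0        # placeholder for $net_devices
export NCCL_IB_GID_INDEX=3       # site-specific: RoCE GID index
export NCCL_IB_SL=3              # site-specific: service level
export NCCL_DEBUG=WARN           # surface warnings without full INFO noise
export NCCL_PROTO='^LL'          # debugging step: disable the LL protocol
```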

@jianyuheng
Author

Hi, @sjeaugey
Following your advice, I updated my startup script.

nohup mpirun --allow-run-as-root -np $gpu_num -hostfile $hosts_list --map-by slot \
--bind-to none \
--mca btl_openib_want_cuda_gdr 1 --mca coll_fca_enable 0 --mca btl_openib_if_include $net_devices \
--report-bindings --display-map --mca btl_openib_rroce_enable 1 --mca pml ob1 --mca btl ^openib \
--mca btl_openib_cpc_include rdmacm  --mca coll_hcoll_enable 0  --mca plm_rsh_no_tree_spawn 1 \
-x NCCL_SOCKET_IFNAME=eth1 -x NCCL_DEBUG=WARN -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=$net_devices -x NCCL_IB_SL=3 \
-x NCCL_PROTO=^LL \
-x HOROVOD_HIERARCHICAL_ALLREDUCE=0 -x HOROVOD_CROSS_SIZE=32 -x HOROVOD_HIERARCHICAL_THRESHOLD=20236324 -x HOROVOD_FUSION_THRESHOLD=0 \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl/build/lib/:/usr/local/cuda/extras/CUPTI/lib64/ \

I restarted my job using this new script, with NCCL v2.6.2 compiled from https://github.com/NVIDIA/nccl/tree/v2.6 and Horovod 0.19.1 installed via HOROVOD_NCCL_HOME=/usr/local/nccl/build HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod, but the training speed dropped by almost 40%, and I'm confused by this large performance loss.

@sjeaugey
Member

Sorry, we realized yesterday that there seemed to be a bug on 2.6 not using GPU Direct RDMA for reads. We'll try to fix the 2.6 branch soon.

@sjeaugey
Member

I pushed NCCL 2.6.4 to the v2.6 branch. Let us know if you still see a significant performance degradation. Thanks !

@jianyuheng
Author

Hi, @sjeaugey
I tried v2.6.4, but performance still drops by about 40%.

@qianzhang613

qianzhang613 commented Mar 23, 2020

@jyhengcoder can you try setting the env var NCCL_LAUNCH_MODE to GROUP, with PR #297?

We also hit the training hang, and solved it this way.
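
For illustration, the variable could be set and forwarded through mpirun like this (the training command is a placeholder; see PR #297 for the underlying change):

```shell
# GROUP launch mode changes how NCCL launches its kernels and has worked
# around launch-related hangs in some multi-GPU setups.
export NCCL_LAUNCH_MODE=GROUP
# Forward it to every rank, e.g.:
#   mpirun ... -x NCCL_LAUNCH_MODE python train.py   # train.py: placeholder
```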

@jianyuheng
Author

jianyuheng commented Mar 23, 2020

@qianzhang613 Thanks for your suggestion. I tried NCCL v2.6.4 last weekend, and the training hang didn't appear.

@sjeaugey
It is notable that this time I didn't disable the low-latency (LL) protocol in NCCL, and I still got a large performance degradation.

@sjeaugey
Member

@jyhengcoder thanks for the feedback. If you'd like to investigate the speed degradation, I would suggest running the NCCL performance tests (https://github.com/nvidia/nccl-tests) with both 2.5 and 2.6 and see whether the performance degradation is visible with the tests.
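
One way to run that comparison, sketched below with assumed install paths (/opt/nccl-*): build nccl-tests once, then switch LD_LIBRARY_PATH between the two NCCL builds at run time:

```shell
# Build nccl-tests (paths are assumptions for this sketch).
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda NCCL_HOME=/opt/nccl-2.5.7/build

# Run the identical single-node test against each NCCL build.
for nccl in /opt/nccl-2.5.7/build /opt/nccl-2.6.4/build; do
  echo "=== $nccl ==="
  LD_LIBRARY_PATH="$nccl/lib:$LD_LIBRARY_PATH" \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
done
```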

@jianyuheng
Author

jianyuheng commented Mar 24, 2020

@sjeaugey I succeeded in running nccl-tests with v2.4.7 and v2.5.7, but hit an error with v2.6.4:

$ NCCL_DEBUG=WARN ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid  47863 on f9cca5e6-1d56-400f-976a-54ec071e6a7b device  0 [0x1a] Tesla V100-SXM2-32GB
#   Rank  1 Pid  47863 on f9cca5e6-1d56-400f-976a-54ec071e6a7b device  1 [0x1b] Tesla V100-SXM2-32GB
#   Rank  2 Pid  47863 on f9cca5e6-1d56-400f-976a-54ec071e6a7b device  2 [0x3d] Tesla V100-SXM2-32GB
#   Rank  3 Pid  47863 on f9cca5e6-1d56-400f-976a-54ec071e6a7b device  3 [0x3e] Tesla V100-SXM2-32GB
#   Rank  4 Pid  47863 on f9cca5e6-1d56-400f-976a-54ec071e6a7b device  4 [0x88] Tesla V100-SXM2-32GB
#   Rank  5 Pid  47863 on f9cca5e6-1d56-400f-976a-54ec071e6a7b device  5 [0x89] Tesla V100-SXM2-32GB
#   Rank  6 Pid  47863 on f9cca5e6-1d56-400f-976a-54ec071e6a7b device  6 [0xb1] Tesla V100-SXM2-32GB
#   Rank  7 Pid  47863 on f9cca5e6-1d56-400f-976a-54ec071e6a7b device  7 [0xb2] Tesla V100-SXM2-32GB
NCCL version 2.6.4+cuda10.0

f9cca5e6-1d56-400f-976a-54ec071e6a7b:47863:47884 [7] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device

f9cca5e6-1d56-400f-976a-54ec071e6a7b:47863:47881 [4] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device

f9cca5e6-1d56-400f-976a-54ec071e6a7b:47863:47882 [5] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device

f9cca5e6-1d56-400f-976a-54ec071e6a7b:47863:47879 [2] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device

f9cca5e6-1d56-400f-976a-54ec071e6a7b:47863:47883 [6] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device

f9cca5e6-1d56-400f-976a-54ec071e6a7b:47863:47880 [3] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device

f9cca5e6-1d56-400f-976a-54ec071e6a7b:47863:47878 [1] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device

f9cca5e6-1d56-400f-976a-54ec071e6a7b:47863:47877 [0] misc/ibvwrap.cc:284 NCCL WARN Call to ibv_modify_qp failed with error No such device
f9cca5e6-1d56-400f-976a-54ec071e6a7b: Test NCCL failure common.cu:775 'unhandled system error'

@sjeaugey
Member

sjeaugey commented Mar 24, 2020

Hmm, this is weird. Can you set NCCL_DEBUG=INFO to get the full log and the backtrace of the WARN?

Also, I would suggest compiling the tests with MPI support (make ... MPI=1) and launching them with mpirun, exactly like with Horovod.
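
A sketch of that MPI build and launch, reusing the /data/openmpi path that appears later in the thread's LD_LIBRARY_PATH (the rank count and hostfile are placeholders):

```shell
# Rebuild the tests with MPI support so ranks can span nodes like Horovod.
cd nccl-tests
make clean
make MPI=1 MPI_HOME=/data/openmpi \
     CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/nccl/build

# One rank per GPU across the cluster, mirroring the Horovod launch.
mpirun -np 128 -hostfile hosts ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```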

@jianyuheng
Author

jianyuheng commented Mar 25, 2020

Hi @sjeaugey, I ran the NCCL performance test on a cluster of 16 nodes × 8 NVIDIA V100 GPUs.

Startup script

nohup mpirun --allow-run-as-root -np $gpu_num -hostfile $hosts_list --map-by slot \
--bind-to none \
--mca btl_openib_want_cuda_gdr 1 --mca coll_fca_enable 0 --mca btl_openib_if_include $net_devices \
--report-bindings --display-map --mca btl_openib_rroce_enable 1 --mca pml ob1 --mca btl ^openib \
--mca btl_openib_cpc_include rdmacm  --mca coll_hcoll_enable 0  --mca plm_rsh_no_tree_spawn 1 \
-x NCCL_SOCKET_IFNAME=eth1 -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=$net_devices -x NCCL_IB_SL=3 -x NCCL_CHECKS_DISABLE=1 \
-x HOROVOD_HIERARCHICAL_ALLREDUCE=0 -x HOROVOD_CROSS_SIZE=32 -x HOROVOD_HIERARCHICAL_THRESHOLD=20236324 \
-x HOROVOD_FUSION_THRESHOLD=0 \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl/build/lib/:/data/openmpi/lib:/usr/local/cuda/extras/CUPTI/lib64/ \
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 \
1>traceglob.log 2>&1 &

Results

v2.5.7
# Out of bounds values : 0 OK 
# Avg bus bandwidth    : 2.61599

v2.6.4
# Out of bounds values : 0 OK 
# Avg bus bandwidth    : 1.41663

@kwen2501
Contributor

Do you know which virtual lane NCCL_IB_SL=3 corresponds to? Are you using RoCEv2?
Starting with NCCL 2.6.4, we enabled adaptive routing. But it must be on a virtual lane that supports it.

Setting NCCL_IB_AR_THRESHOLD above NCCL_BUFFSIZE (4194304) will disable adaptive routing completely. It may be worth trying that to see if it restores the performance of 2.5.7.
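
As a concrete sketch of that experiment (the value matches the NCCL_BUFFSIZE from the original startup script; the exact effect is version-specific):

```shell
# Raise the adaptive-routing threshold to the buffer size so that no
# message takes the AR code path.
export NCCL_IB_AR_THRESHOLD=4194304
```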

@sjeaugey
Member

@jyhengcoder I was hoping you could post the full output for each version, i.e. performance from 8B to 128MB. The average confirms something is clearly wrong here, but I can't tell from the average alone what to look for next.

@kwen2501 Adaptive routing is an InfiniBand-only feature; it does not apply to RoCE. But the code change for AR could still impact RoCE performance, so setting NCCL_IB_AR_THRESHOLD=4194304 is still something interesting to try.

@jianyuheng
Author

@sjeaugey

Results

v2.5.7

#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    108.7    0.00    0.00  1e-06    106.3    0.00    0.00  5e-07
          16             4   float     sum    99.51    0.00    0.00  7e-07    101.3    0.00    0.00  7e-07
          32             8   float     sum    109.0    0.00    0.00  1e-06    104.5    0.00    0.00  1e-06
          64            16   float     sum    105.8    0.00    0.00  1e-06    105.7    0.00    0.00  1e-06
         128            32   float     sum    104.7    0.00    0.00  1e-06    107.4    0.00    0.00  1e-06
         256            64   float     sum    115.1    0.00    0.00  1e-06    113.9    0.00    0.00  1e-06
         512           128   float     sum    118.2    0.00    0.01  1e-06    111.6    0.00    0.01  7e-07
        1024           256   float     sum    119.7    0.01    0.02  1e-06    117.6    0.01    0.02  1e-06
        2048           512   float     sum    119.5    0.02    0.03  1e-06    125.5    0.02    0.03  1e-06
        4096          1024   float     sum    132.5    0.03    0.06  1e-06    128.6    0.03    0.06  1e-06
        8192          2048   float     sum    133.5    0.06    0.12  1e-06    131.7    0.06    0.12  1e-06
       16384          4096   float     sum    169.4    0.10    0.19  1e-06    164.5    0.10    0.20  1e-06
       32768          8192   float     sum    200.0    0.16    0.33  1e-06    195.1    0.17    0.33  1e-06
       65536         16384   float     sum    258.5    0.25    0.50  1e-06    260.4    0.25    0.50  1e-06
      131072         32768   float     sum    249.0    0.53    1.04  1e-06    240.6    0.54    1.08  1e-06
      262144         65536   float     sum    283.7    0.92    1.83  1e-06    280.9    0.93    1.85  1e-06
      524288        131072   float     sum    377.6    1.39    2.76  1e-06    377.7    1.39    2.75  1e-06
     1048576        262144   float     sum    538.8    1.95    3.86  1e-06    546.9    1.92    3.80  1e-06
     2097152        524288   float     sum    761.2    2.76    5.47  1e-06    761.6    2.75    5.46  1e-06
     4194304       1048576   float     sum   2096.3    2.00    3.97  2e-06   2087.7    2.01    3.99  2e-06
     8388608       2097152   float     sum   2494.8    3.36    6.67  2e-06   2494.8    3.36    6.67  2e-06
    16777216       4194304   float     sum   3819.5    4.39    8.72  2e-06   3788.7    4.43    8.79  2e-06
    33554432       8388608   float     sum   6929.6    4.84    9.61  2e-06   6950.6    4.83    9.58  2e-06
    67108864      16777216   float     sum    13340    5.03    9.98  2e-06    13496    4.97    9.87  2e-06
   134217728      33554432   float     sum    25995    5.16   10.25  2e-06    26023    5.16   10.23  2e-06
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.61599 

v2.6.4

#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    120.4    0.00    0.00  5e-07    116.5    0.00    0.00  7e-07
          16             4   float     sum    112.3    0.00    0.00  7e-07    113.5    0.00    0.00  5e-07
          32             8   float     sum    114.9    0.00    0.00  1e-06    122.7    0.00    0.00  1e-06
          64            16   float     sum    117.1    0.00    0.00  1e-06    122.4    0.00    0.00  1e-06
         128            32   float     sum    118.8    0.00    0.00  1e-06    120.4    0.00    0.00  1e-06
         256            64   float     sum    120.9    0.00    0.00  1e-06    121.1    0.00    0.00  1e-06
         512           128   float     sum    133.7    0.00    0.01  1e-06    123.3    0.00    0.01  1e-06
        1024           256   float     sum    127.8    0.01    0.02  1e-06    130.2    0.01    0.02  1e-06
        2048           512   float     sum    139.2    0.01    0.03  1e-06    137.7    0.01    0.03  1e-06
        4096          1024   float     sum    155.3    0.03    0.05  1e-06    155.9    0.03    0.05  1e-06
        8192          2048   float     sum    150.4    0.05    0.11  1e-06    152.6    0.05    0.11  1e-06
       16384          4096   float     sum    168.3    0.10    0.19  1e-06    173.2    0.09    0.19  1e-06
       32768          8192   float     sum    202.7    0.16    0.32  1e-06    196.2    0.17    0.33  1e-06
       65536         16384   float     sum    286.8    0.23    0.45  1e-06    287.2    0.23    0.45  1e-06
      131072         32768   float     sum    455.8    0.29    0.57  1e-06    445.9    0.29    0.58  1e-06
      262144         65536   float     sum    549.1    0.48    0.95  1e-06    542.3    0.48    0.96  1e-06
      524288        131072   float     sum   1742.7    0.30    0.60  2e-06   1735.0    0.30    0.60  2e-06
     1048576        262144   float     sum   1769.2    0.59    1.18  2e-06   1760.9    0.60    1.18  2e-06
     2097152        524288   float     sum   1114.3    1.88    3.73  1e-06   1115.9    1.88    3.73  1e-06
     4194304       1048576   float     sum   1930.9    2.17    4.31  1e-06   1933.5    2.17    4.30  1e-06
     8388608       2097152   float     sum   5781.2    1.45    2.88  2e-06   5865.1    1.43    2.84  2e-06
    16777216       4194304   float     sum   6890.2    2.43    4.83  2e-06   6866.9    2.44    4.85  2e-06
    33554432       8388608   float     sum    10380    3.23    6.41  2e-06    10402    3.23    6.40  2e-06
    67108864      16777216   float     sum    18206    3.69    7.31  2e-06    18184    3.69    7.32  2e-06
   134217728      33554432   float     sum   166995    0.80    1.59  2e-06   203284    0.66    1.31  2e-06
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.41663 

@sjeaugey
Member

Thanks. So we're aiming at 10 GB/s here, but with 2.6 we see hiccups. This is typical of a RoCE fabric dropping packets. Did you configure your RoCE switch in lossless/PFC mode? Otherwise your RoCE performance will be subject to this kind of sharp degradation every time we make even tiny changes in NCCL that alter packet timing.

It would be helpful if you could confirm whether the adaptive routing support change is causing that difference, and whether NCCL_IB_AR_THRESHOLD=4194304 makes performance go back to the 2.5 numbers or not.

@jianyuheng
Author

jianyuheng commented Mar 28, 2020

@qianzhang613
I tried setting NCCL_LAUNCH_MODE to GROUP with NCCL v2.5.7, but the training still hung.

@sjeaugey @kwen2501
When I moved to NCCL v2.6.4, I set NCCL_IB_AR_THRESHOLD=4194304 and configured the RoCE switch in lossless/PFC mode, but neither improved the performance.
