
[H200: All_reduce] Random Unhandled Cuda Error #276

Closed
vitduck opened this issue Dec 16, 2024 · 2 comments
vitduck commented Dec 16, 2024

Hello,

We are observing a seemingly random CUDA error related to NCCL.

  1. HW:

    • CPU: 2x XEON(R) PLATINUM 8558
    • GPU: 8x H200 (Driver: 550.90.07)
    • IB: NDR 400 Gbs (MT4129)
    • Fabric: MLNX_OFED_LINUX-23.10-0.5.5.0
  2. SW:

    • OS: CentOS 7.9.2009
    • CUDA: 12.3
    • MPI: HPC-X 2.18.1
    • NCCL: v2.23.4
  3. Topo:

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	PIX	NODE	SYS	SYS	0-23	0		N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	PXB	NODE	SYS	SYS	0-23	0		N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	PXB	SYS	SYS	0-23	0		N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	PIX	SYS	SYS	0-23	0		N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	PXB	NODE	48-71	2		N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	PIX	NODE	48-71	2		N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	NODE	PXB	48-71	2		N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	NODE	PIX	48-71	2		N/A
NIC0	PIX	PXB	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	SYS	SYS				
NIC1	NODE	NODE	PXB	PIX	SYS	SYS	SYS	SYS	NODE	 X 	SYS	SYS				
NIC2	SYS	SYS	SYS	SYS	PXB	PIX	NODE	NODE	SYS	SYS	 X 	NODE				
NIC3	SYS	SYS	SYS	SYS	NODE	NODE	PXB	PIX	SYS	SYS	NODE	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3

  4. Required kernel modules: gdrdrv and nvidia_peermem
$ lsmod | grep nvidia 
nvidia              54132454  151 gdrdrv,nvidia_modeset,nvidia_peermem,nvidia_uvm
  5. Steps to reproduce:
$ export NCCL_DEBUG=info
$ export NCCL_ALGO=Ring 
$ mpirun --np 16 --map-by ppr:8:node nccl-tests/build/all_reduce_perf -b 4 -e 4G -f 2 -g 1 
  6. Output with unhandled CUDA error:
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   6387 on      gpu47 device  0 [0x0a] NVIDIA H200
#  Rank  1 Group  0 Pid   6388 on      gpu47 device  1 [0x18] NVIDIA H200
#  Rank  2 Group  0 Pid   6389 on      gpu47 device  2 [0x3b] NVIDIA H200
#  Rank  3 Group  0 Pid   6390 on      gpu47 device  3 [0x44] NVIDIA H200
#  Rank  4 Group  0 Pid   6393 on      gpu47 device  4 [0x87] NVIDIA H200
#  Rank  5 Group  0 Pid   6394 on      gpu47 device  5 [0x90] NVIDIA H200
#  Rank  6 Group  0 Pid   6395 on      gpu47 device  6 [0xb8] NVIDIA H200
#  Rank  7 Group  0 Pid   6396 on      gpu47 device  7 [0xc1] NVIDIA H200
#  Rank  8 Group  0 Pid  58815 on      gpu48 device  0 [0x0a] NVIDIA H200
#  Rank  9 Group  0 Pid  58816 on      gpu48 device  1 [0x18] NVIDIA H200
#  Rank 10 Group  0 Pid  58817 on      gpu48 device  2 [0x3b] NVIDIA H200
#  Rank 11 Group  0 Pid  58818 on      gpu48 device  3 [0x44] NVIDIA H200
#  Rank 12 Group  0 Pid  58819 on      gpu48 device  4 [0x87] NVIDIA H200
#  Rank 13 Group  0 Pid  58822 on      gpu48 device  5 [0x90] NVIDIA H200
#  Rank 14 Group  0 Pid  58823 on      gpu48 device  6 [0xb8] NVIDIA H200
#  Rank 15 Group  0 Pid  58824 on      gpu48 device  7 [0xc1] NVIDIA H200
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
gpu48: Test NCCL failure common.cu:1012 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. gpu48 pid 58816: Test failure common.cu:891
  7. Normal output when running all_reduce_perf back to back:
# nThread 1 nGpus 1 minBytes 4 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   5953 on      gpu47 device  0 [0x0a] NVIDIA H200
#  Rank  1 Group  0 Pid   5954 on      gpu47 device  1 [0x18] NVIDIA H200
#  Rank  2 Group  0 Pid   5955 on      gpu47 device  2 [0x3b] NVIDIA H200
#  Rank  3 Group  0 Pid   5956 on      gpu47 device  3 [0x44] NVIDIA H200
#  Rank  4 Group  0 Pid   5957 on      gpu47 device  4 [0x87] NVIDIA H200
#  Rank  5 Group  0 Pid   5960 on      gpu47 device  5 [0x90] NVIDIA H200
#  Rank  6 Group  0 Pid   5961 on      gpu47 device  6 [0xb8] NVIDIA H200
#  Rank  7 Group  0 Pid   5962 on      gpu47 device  7 [0xc1] NVIDIA H200
#  Rank  8 Group  0 Pid  58424 on      gpu48 device  0 [0x0a] NVIDIA H200
#  Rank  9 Group  0 Pid  58425 on      gpu48 device  1 [0x18] NVIDIA H200
#  Rank 10 Group  0 Pid  58426 on      gpu48 device  2 [0x3b] NVIDIA H200
#  Rank 11 Group  0 Pid  58427 on      gpu48 device  3 [0x44] NVIDIA H200
#  Rank 12 Group  0 Pid  58428 on      gpu48 device  4 [0x87] NVIDIA H200
#  Rank 13 Group  0 Pid  58431 on      gpu48 device  5 [0x90] NVIDIA H200
#  Rank 14 Group  0 Pid  58432 on      gpu48 device  6 [0xb8] NVIDIA H200
#  Rank 15 Group  0 Pid  58433 on      gpu48 device  7 [0xc1] NVIDIA H200
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           4             1     float     sum      -1    38.57    0.00    0.00      0    39.08    0.00    0.00      0
           8             2     float     sum      -1    39.32    0.00    0.00      0    39.41    0.00    0.00      0
          16             4     float     sum      -1    39.92    0.00    0.00      0    38.72    0.00    0.00      0
          32             8     float     sum      -1    42.43    0.00    0.00      0    41.98    0.00    0.00      0
          64            16     float     sum      -1    45.26    0.00    0.00      0    45.01    0.00    0.00      0
         128            32     float     sum      -1    48.23    0.00    0.00      0    47.68    0.00    0.01      0
         256            64     float     sum      -1    50.84    0.01    0.01      0    50.10    0.01    0.01      0
         512           128     float     sum      -1    50.94    0.01    0.02      0    50.13    0.01    0.02      0
        1024           256     float     sum      -1    52.72    0.02    0.04      0    51.07    0.02    0.04      0
        2048           512     float     sum      -1    51.83    0.04    0.07      0    51.89    0.04    0.07      0
        4096          1024     float     sum      -1    52.40    0.08    0.15      0    51.93    0.08    0.15      0
        8192          2048     float     sum      -1    54.11    0.15    0.28      0    53.07    0.15    0.29      0
       16384          4096     float     sum      -1    54.30    0.30    0.57      0    53.47    0.31    0.57      0
       32768          8192     float     sum      -1    56.60    0.58    1.09      0    56.31    0.58    1.09      0
       65536         16384     float     sum      -1    62.20    1.05    1.98      0    61.54    1.06    2.00      0
      131072         32768     float     sum      -1    63.86    2.05    3.85      0    63.01    2.08    3.90      0
      262144         65536     float     sum      -1    68.18    3.84    7.21      0    67.57    3.88    7.27      0
      524288        131072     float     sum      -1    86.39    6.07   11.38      0    86.32    6.07   11.39      0
     1048576        262144     float     sum      -1    139.3    7.53   14.12      0    139.7    7.51   14.08      0
     2097152        524288     float     sum      -1    133.5   15.71   29.46      0    132.8   15.79   29.61      0
     4194304       1048576     float     sum      -1    144.3   29.06   54.49      0    143.7   29.19   54.74      0
     8388608       2097152     float     sum      -1    164.0   51.13   95.88      0    163.6   51.28   96.15      0
    16777216       4194304     float     sum      -1    228.2   73.51  137.84      0    227.1   73.88  138.53      0
    33554432       8388608     float     sum      -1    388.9   86.28  161.78      0    386.9   86.73  162.62      0
    67108864      16777216     float     sum      -1    708.9   94.66  177.49      0    705.3   95.15  178.41      0
   134217728      33554432     float     sum      -1   1348.3   99.54  186.64      0   1344.9   99.80  187.12      0
   268435456      67108864     float     sum      -1   2596.6  103.38  193.84      0   2595.8  103.41  193.90      0
   536870912     134217728     float     sum      -1   5116.4  104.93  196.74      0   5123.2  104.79  196.49      0
  1073741824     268435456     float     sum      -1    10229  104.97  196.81      0    10339  103.85  194.72      0
  2147483648     536870912     float     sum      -1    20415  105.19  197.23      0    20523  104.64  196.20      0
  4294967296    1073741824     float     sum      -1    40890  105.04  196.94      0    41210  104.22  195.42      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 60.1725 
#

For 4x NDR 400 Gb/s, this is the expected output.
We only became aware of this problem after a user reported an unhandled CUDA error in PyTorch.
Due to its random nature, we are having difficulty narrowing down the origin.
Please refer to the logs produced with NCCL_DEBUG=info for the two identical runs above.
The error seems to be emitted from within transport/nvls.cc:

gpu47:4962:5097 [1] transport/nvls.cc:223 NCCL WARN Cuda failure 999 'unknown error

allreduce-H200-2N-Ring_999.log
allreduce-H200-2N-Ring_Normal.log
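For reference, a quick way to check Fabric Manager health before digging into NCCL itself. This is a sketch: the service name assumes the standard nvidia-fabricmanager packaging, and the log path is the default one, both of which may differ per installation.

```shell
# Check that the NVIDIA Fabric Manager service is running and healthy.
systemctl status nvidia-fabricmanager --no-pager

# Query per-GPU fabric state as seen by the driver; on NVSwitch systems
# each GPU reports its fabric registration state/status here.
nvidia-smi -q | grep -A 3 -i fabric

# Recent Fabric Manager log entries (default log location).
tail -n 50 /var/log/fabricmanager.log
```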

Regards.

@AddyLaddy
Copy link
Collaborator

That nvls.cc WARN is usually caused by the Fabric Manager being stopped and restarted without an intervening GPU reset.
Please refer to Section 6.2 of the Fabric Manager User Guide for details of the correct FM/GPU reset sequence:

https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf
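For completeness, that sequence boils down to roughly the following sketch; the exact steps depend on the driver/FM version and require that no processes are holding the GPUs, so treat the guide as authoritative.

```shell
# Sketch of the FM/GPU reset sequence per the Fabric Manager User Guide
# (Section 6.2); commands assume systemd packaging of Fabric Manager.

# 1. Stop all CUDA workloads, then stop Fabric Manager.
sudo systemctl stop nvidia-fabricmanager

# 2. Reset the GPUs (fails if any process still holds a device).
sudo nvidia-smi --gpu-reset

# 3. Restart Fabric Manager and confirm it comes up cleanly.
sudo systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager --no-pager
```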


vitduck commented Dec 24, 2024

@AddyLaddy

Sorry for my belated follow-up.
Fabric Manager was indeed reporting a 'non critical error'. We finally got around to resetting FM/GPU.
With the issue resolved, I would like to close this issue.

Thanks for your insight and suggestion.
Regards.
