
[Hopper/NVLINK4] Origin of failure of fabric manager manifested through NCCL-based codes #1562

vitduck opened this issue Jan 3, 2025 · 3 comments


vitduck commented Jan 3, 2025

Hi,

We operate a GPU cluster in which each node consists of either:

  • 8x A100-SXM4 GPUs interconnected with NVLink 3
  • 8x H200-SXM5 GPUs interconnected with NVLink 4

The former has been quite robust; we have rarely seen an issue involving the NVLink fabric manager.
The latter, however, has had three fabric manager failures within one month of operation.

  1. With nccl-tests:
    gpu46:78367:78492 [0] transport/nvls.cc:244 NCCL WARN Cuda failure 1 'invalid argument'
    
  2. With HPL from NGC:
    HPL ERROR: CUDART Error: cudaMemcpyAsync( ipiv_l, ipiv + step * nb, nb * sizeof(int64_t), cudaMemcpyDeviceToDevice, fact_stream) = 700 (an illegal memory access was encountered) on gpu46 at (cugetrfs_mp/getrfs_mp.cu:750)
    
  3. With TensorFlow and the NCCL backend:
    ==== backtrace (tid:  57187) ====
    0 0x0000000000042520 __sigaction()  ???:0
    1 0x000000000007e6d1 ncclGroupEnd()  ???:0
    2 0x0000000000094b43 pthread_condattr_setpshared()  ???:0
    3 0x0000000000125bb4 clone()  ???:0
    =================================
    Fatal Python error: Segmentation fault
    
  4. HPC codes without an NCCL implementation (e.g. Quantum ESPRESSO, LAMMPS, GROMACS) have shown no performance degradation despite the fabric manager issue.

Other observations:

  • NCCL still functions correctly across 2x H200 GPUs, but 4x H200 GPUs trigger the aforementioned invalid argument failure.
  • dmesg also shows an Xid 31 error, though we are uncertain whether it actually contributes to the issue:
    [Wed Dec 25 15:08:32 2024] NVRM: Xid (PCI:0000:90:00): 31, pid=43715, name=python, Ch 00000009, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
    

Perusing the nccl and nccl-tests issue trackers, this seems to be a common issue with H100/H200 GPUs.

The recommended solution so far is to either:

  • stop the fabric manager service -> reset the GPUs -> start the fabric manager service (see the sketch after this list), or
  • reboot the node if all else fails
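
For reference, a minimal sketch of that reset sequence, assuming systemd and the nvidia-fabricmanager unit name used by NVIDIA's data-center packages (adjust to your environment):

  # Sketch only: the unit name and reset scope may differ on your system.
  sudo systemctl stop nvidia-fabricmanager     # stop the fabric manager service
  sudo nvidia-smi --gpu-reset                  # reset the GPUs (node must be idle;
                                               #   some driver versions need -i <index> per GPU)
  sudo systemctl start nvidia-fabricmanager    # start the fabric manager service
  systemctl status nvidia-fabricmanager        # confirm it is active again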

If I have understood correctly, the issue is attributed to the fabric manager being forcefully restarted. In that case:

  • What could be the potential trigger, especially since the users' codes have run routinely on the A100 partitions without issue?
  • Is there a specific NCCL version requirement for NVLink 4 to mitigate this issue?
  • Is there some way we can preemptively detect this issue so that our users' ML/AI workflows are not interrupted?

The current failure rate concerns us, especially as we want to scale up our service.

Thanks.

@AddyLaddy
Collaborator

NVLink SHARP (NVLS) is a new HW feature introduced with the Hopper generation of NVLink and NVSwitches.
It offers acceleration of up to 1.3x for AllReduce operations on a single node.

You have found that the root cause of these issues is indeed the incorrect management of the FM and GPU reset sequences.
The solution is to make sure those reset sequences are followed.

If you cannot control the FM/GPU reset sequence of your system, then I can only suggest you disable NCCL NVLink SHARP use with

NCCL_NVLS_ENABLE=0

and accept that you will not be able to benefit from NVLS acceleration.
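
For example, a sketch of how to apply this per job (the launcher and binary names are illustrative), or system-wide via NCCL's /etc/nccl.conf:

  # Per-job (sketch; substitute your own launcher and application):
  export NCCL_NVLS_ENABLE=0
  mpirun -np 8 ./my_nccl_app

  # Or system-wide, by adding this line to /etc/nccl.conf:
  # NCCL_NVLS_ENABLE=0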

You could also run a simple nccl-tests:all_reduce_perf before starting your main job to see if NVLS is operating correctly. This would need to be run on >= 4 GPUs for NVLS to be used.
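
A sketch of such a pre-flight check (the paths, message sizes, and GPU count are illustrative, and it assumes nccl-tests is built on the node):

  # Run a short all-reduce on 4 GPUs in one process; NVLS needs >= 4 GPUs to engage.
  if ! ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 > /tmp/nccl_precheck.log 2>&1; then
      echo "NCCL pre-check failed; run the FM/GPU reset sequence before scheduling jobs" >&2
      exit 1
  fi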

@vitduck
Author

vitduck commented Jan 7, 2025

Hi @AddyLaddy

Thanks for mentioning NCCL_NVLS_ENABLE=0. I will monitor the status of the fabric manager without NVLS to bisect the issue.
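
The monitoring we have in mind is roughly the following (a sketch; it assumes systemd manages the fabric manager as nvidia-fabricmanager and that the default /var/log/fabricmanager.log path is used):

  # Periodic health check (sketch): service state, recent Xid events, FM log errors.
  systemctl is-active --quiet nvidia-fabricmanager || echo "fabric manager is not active" >&2
  dmesg --ctime | grep -i "NVRM: Xid" | tail -n 5
  grep -iE "error|fail" /var/log/fabricmanager.log | tail -n 5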

    You have found that the root cause of these issues is indeed the incorrect management of the FM and GPU reset sequences.
    The solution is to make sure those reset sequences are followed.

Under normal operation, we do not manually reset the GPUs/FM in between user jobs.
That's why we think something has triggered this intermittent issue.
From previous reports linked in this thread, the common denominator is the Hopper architecture.

Or do you mean that the reset sequence is performed automatically when a CUDA job finishes?

@AddyLaddy
Collaborator

Hopper is the first architecture to support NVLink SHARP (NVLS).

There is no reason for the FM to be restarted after the node has been booted.
So, either it crashed and got restarted or some tool is restarting it?
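
One way to check which of the two it was (a sketch; it assumes systemd and the nvidia-fabricmanager unit name):

  # Has the service been restarted since boot, and what does the journal say?
  systemctl show nvidia-fabricmanager -p NRestarts -p ExecMainStartTimestamp
  journalctl -u nvidia-fabricmanager -b | grep -iE "start|stop|fail"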
