
[Hopper/NVLINK4] Origin of failure of fabric manager manifested through NCCL-based codes #1562

vitduck opened this issue Jan 3, 2025 · 3 comments


vitduck commented Jan 3, 2025

Hi,

We operate a GPU cluster in which each node consists of either:

  • 8x A100-SXM4 GPUs interconnected with NVLink 3
  • 8x H200-SXM5 GPUs interconnected with NVLink 4

The former has been quite robust; we have rarely seen an issue involving the NVLink fabric manager.
The latter, however, has had three fabric manager failures within one month of operation.

  1. With nccl-tests:
    gpu46:78367:78492 [0] transport/nvls.cc:244 NCCL WARN Cuda failure 1 'invalid argument'
    
  2. With HPL from NGC:
    HPL ERROR: CUDART Error: cudaMemcpyAsync( ipiv_l, ipiv + step * nb, nb * sizeof(int64_t), cudaMemcpyDeviceToDevice, fact_stream) = 700 (an illegal memory access was encountered) on gpu46 at (cugetrfs_mp/getrfs_mp.cu:750)
    
  3. With TensorFlow and the NCCL backend:
    ==== backtrace (tid:  57187) ====
    0 0x0000000000042520 __sigaction()  ???:0
    1 0x000000000007e6d1 ncclGroupEnd()  ???:0
    2 0x0000000000094b43 pthread_condattr_setpshared()  ???:0
    3 0x0000000000125bb4 clone()  ???:0
    =================================
    Fatal Python error: Segmentation fault
    
  4. HPC codes without an NCCL implementation (e.g. Quantum ESPRESSO, LAMMPS, GROMACS) have shown no performance degradation despite the fabric manager issue.

Other observations:

  • NCCL still functions correctly across 2x H200 GPUs, but 4x H200 GPUs trigger the aforementioned invalid argument failure.
  • dmesg also shows an Xid 31 error, though we are uncertain whether it actually contributes to the issue:
    [Wed Dec 25 15:08:32 2024] NVRM: Xid (PCI:0000:90:00): 31, pid=43715, name=python, Ch 00000009, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
    

Perusing the nccl and nccl-tests issue trackers, this seems to be a common issue with H100/H200 GPUs.

The recommended solution so far is to either:

  • stop the fabric manager service -> reset the GPUs -> start the fabric manager service (see the sketch after this list), or
  • reboot the node if all else fails
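
For reference, a minimal sketch of that reset sequence, assuming systemd and the nvidia-fabricmanager unit name used by NVIDIA's data-center packages (adjust to your environment):

  # Sketch only: the unit name and reset scope may differ on your system.
  sudo systemctl stop nvidia-fabricmanager     # stop the fabric manager service
  sudo nvidia-smi --gpu-reset                  # reset the GPUs (node must be idle;
                                               #   some driver versions need -i <index> per GPU)
  sudo systemctl start nvidia-fabricmanager    # start the fabric manager service
  systemctl status nvidia-fabricmanager        # confirm it is active again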

If I have understood correctly, the issue is attributed to the fabric manager being forcefully restarted. In that case:

  • What could be the potential trigger, especially since the users' codes have run routinely on the A100 partitions without issue?
  • Is there a specific NCCL version requirement for NVLink 4 to mitigate this issue?
  • Is there some way we can preemptively detect this issue so that our users' ML/AI workflows are not interrupted?

The current failure rate concerns us, especially as we want to scale up our service.

Thanks.

@AddyLaddy
Collaborator

NVLink SHARP (NVLS) is a new HW feature introduced with the Hopper generation of NVLink and NVSwitches.
It offers acceleration of up to 1.3x for AllReduce operations on a single node.

You have found that the root cause of these issues is indeed the incorrect management of the FM and GPU reset sequences.
The solution is to make sure those reset sequences are followed.

If you cannot control the FM/GPU reset sequence of your system, then I can only suggest you disable NCCL NVLink SHARP use with

NCCL_NVLS_ENABLE=0

and accept that you will not be able to benefit from NVLS acceleration.
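
For example, a sketch of how to apply this per job (the launcher and binary names are illustrative), or system-wide via NCCL's /etc/nccl.conf:

  # Per-job (sketch; substitute your own launcher and application):
  export NCCL_NVLS_ENABLE=0
  mpirun -np 8 ./my_nccl_app

  # Or system-wide, by adding this line to /etc/nccl.conf:
  # NCCL_NVLS_ENABLE=0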

You could also run a simple nccl-tests:all_reduce_perf before starting your main job to see if NVLS is operating correctly. This would need to be run on >= 4 GPUs for NVLS to be used.
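
A sketch of such a pre-flight check (the paths, message sizes, and GPU count are illustrative, and it assumes nccl-tests is built on the node):

  # Run a short all-reduce on 4 GPUs in one process; NVLS needs >= 4 GPUs to engage.
  if ! ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 > /tmp/nccl_precheck.log 2>&1; then
      echo "NCCL pre-check failed; run the FM/GPU reset sequence before scheduling jobs" >&2
      exit 1
  fi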

@vitduck
Author

vitduck commented Jan 7, 2025

Hi @AddyLaddy

Thanks for mentioning NCCL_NVLS_ENABLE=0. I will monitor the status of the fabric manager without NVLS to bisect the issue.
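
The monitoring we have in mind is roughly the following (a sketch; it assumes systemd manages the fabric manager as nvidia-fabricmanager and that the default /var/log/fabricmanager.log path is used):

  # Periodic health check (sketch): service state, recent Xid events, FM log errors.
  systemctl is-active --quiet nvidia-fabricmanager || echo "fabric manager is not active" >&2
  dmesg --ctime | grep -i "NVRM: Xid" | tail -n 5
  grep -iE "error|fail" /var/log/fabricmanager.log | tail -n 5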

    You have found that the root cause of these issues is indeed the incorrect management of the FM and GPU reset sequences.
    The solution is to make sure those reset sequences are followed.

Under normal operation, we do not manually reset the GPUs/FM in between user jobs.
That's why we think something has triggered this intermittent issue.
From previous reports linked in this thread, the common denominator is the Hopper architecture.

Or do you mean that the reset sequence is performed automatically when a CUDA job finishes?

@AddyLaddy
Collaborator

Hopper is the first architecture to support NVLink SHARP (NVLS).

There is no reason for the FM to be restarted after the node has been booted.
So, either it crashed and got restarted or some tool is restarting it?
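
One way to check which of the two it was (a sketch; it assumes systemd and the nvidia-fabricmanager unit name):

  # Has the service been restarted since boot, and what does the journal say?
  systemctl show nvidia-fabricmanager -p NRestarts -p ExecMainStartTimestamp
  journalctl -u nvidia-fabricmanager -b | grep -iE "start|stop|fail"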
