NCCL 2.18 / Cuda 12.2 fails on H100 system with transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument' #976

walkup · 2023-08-28T18:02:39Z

All NCCL tests are currently failing on an H100 system with CUDA 12.2 and driver version 535.54.03, with the error text: transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument'. I get the same error with the master branch of NCCL (2.18.3) and the v2.18 branch (currently 2.18.5). A simple reproducer is the allreduce performance test from the NCCL Tests repository; every other NCCL use case I have tried fails in the same manner. The call stack printed with NCCL_DEBUG=INFO is:

NCCL version 2.18.5+cuda12.2
...
transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument'
NCCL INFO transport/nvls.cc:324 -> 1
NCCL INFO init.cc:1093 -> 1
NCCL INFO init.cc:1358 -> 1
NCCL INFO init.cc:1598 -> 1

The tests actually run correctly with only two processes on two GPUs, but fail during NCCL initialization with three or more processes. All of the currently failing test cases ran correctly before a software update. I suspect that the issue appeared with CUDA 12.2, but I did not have continuous access to the test system.
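For anyone triaging similar logs, here is a minimal Python sketch that scans NCCL_DEBUG=INFO output for the warning quoted above. The function name `find_nvls_failures` is hypothetical, and the pattern assumes the message format shown in this report; the source line number (165 here) may differ in other NCCL versions, so any line number in transport/nvls.cc is matched.

```python
import re

# Assumed message format, taken from the log in this report.
# The line number after "nvls.cc:" is matched loosely on purpose.
NVLS_FAILURE = re.compile(
    r"transport/nvls\.cc:\d+ NCCL WARN Cuda failure '([^']+)'"
)


def find_nvls_failures(log_text):
    """Return the CUDA error strings from NVLS warnings found in log_text."""
    return NVLS_FAILURE.findall(log_text)


if __name__ == "__main__":
    sample = "\n".join([
        "NCCL version 2.18.5+cuda12.2",
        "transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument'",
        "NCCL INFO transport/nvls.cc:324 -> 1",
    ])
    print(find_nvls_failures(sample))
```

An empty result only means this particular signature was not found; it does not rule out other initialization failures.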

sjeaugey · 2023-08-29T09:39:30Z

That's usually due to nv-fabricmanager having been manually restarted while processes were still holding resources on the GPU. Rebooting the node is usually the easiest fix. Stopping everything on the GPUs and then restarting the fabric manager should also work.
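The second option (stop everything on the GPUs, then restart the fabric manager) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the systemd unit is assumed to be called `nvidia-fabricmanager` (check your installation), and `nvidia-smi --query-compute-apps` lists compute processes only, so it may miss other holders of GPU resources. Restarting the service typically requires root.

```python
import subprocess


def parse_compute_pids(csv_output):
    """Parse PIDs from `nvidia-smi --query-compute-apps=pid --format=csv,noheader`."""
    pids = []
    for line in csv_output.splitlines():
        field = line.strip()
        if field.isdigit():
            pids.append(int(field))
    return pids


def gpu_compute_pids():
    """Ask nvidia-smi which compute processes currently hold a GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_compute_pids(result.stdout)


def restart_fabric_manager_if_idle():
    """Restart the fabric manager only when no compute process is on the GPUs."""
    pids = gpu_compute_pids()
    if pids:
        raise RuntimeError(f"GPU still in use by PIDs {pids}; stop them first")
    # Assumption: the systemd unit is named nvidia-fabricmanager on this system.
    subprocess.run(["systemctl", "restart", "nvidia-fabricmanager"], check=True)


if __name__ == "__main__":
    restart_fabric_manager_if_idle()
```

Refusing to restart while PIDs are listed mirrors the cause described above: restarting the fabric manager while processes still hold GPU resources is what leads to the failure. If in doubt, rebooting the node remains the simpler route.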

walkup · 2023-08-29T15:54:33Z

Indeed, a reboot fixed the problem ... thanks!

Flionay mentioned this issue Jun 7, 2024

🐛[BUG]: Graphcast: Error when running mpirun --allow-run-as-root -np 3 for GraphCast model, but works with -np 2 NVIDIA/modulus#539

Closed

vitduck mentioned this issue Jan 3, 2025

[Hopper/NVLINK4] Origin of failure of fabric manager manifested through NCCL-based codes #1562

Open
