All NCCL tests are currently failing on an H100 system with CUDA 12.2 (driver version 535.54.03) with the error text: transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument'. I get the same error with the master branch of NCCL (2.18.3) and the v2.18 branch (currently 2.18.5). A simple test that fails is the allreduce performance test from the NCCL Tests repository. All of the other NCCL use cases that I have tried fail in the same manner. The call stack printed with NCCL_DEBUG=INFO is:
NCCL version 2.18.5+cuda12.2
...
transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument'
NCCL INFO transport/nvls.cc:324 -> 1
NCCL INFO init.cc:1093 -> 1
NCCL INFO init.cc:1358 -> 1
NCCL INFO init.cc:1598 -> 1
The tests run correctly with only two processes on two GPUs, but fail during NCCL initialization with three or more processes. All of the currently failing test cases ran correctly before a software update. I suspect that the issue appeared with CUDA 12.2, but I did not have continuous access to the test system.
That's usually due to nv-fabricmanager having been manually restarted while processes were still holding resources on the GPU. Rebooting the node is usually the easiest fix. Stopping everything on the GPUs and then restarting the fabric manager should also work.
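For the second option, a rough sketch of "stop everything, then restart" could look like the following. It is only an illustration under a few assumptions that are not from this thread: the fabric manager runs as a systemd unit named `nvidia-fabricmanager`, `nvidia-smi` is on the PATH, and the script runs with root privileges. The helper names are hypothetical. The compute-apps query may not list every process holding GPU resources, so an empty list is a hint rather than a guarantee.

```python
import subprocess

# Assumption: the systemd unit is called "nvidia-fabricmanager" on this node.
FABRIC_MANAGER_UNIT = "nvidia-fabricmanager"


def parse_compute_pids(csv_text):
    """Extract PIDs from `nvidia-smi --query-compute-apps=pid --format=csv,noheader` output."""
    pids = []
    for line in csv_text.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not a plain integer PID.
        if line.isdigit():
            pids.append(int(line))
    return pids


def gpu_compute_pids():
    """Ask nvidia-smi which processes currently hold a compute context on any GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compute_pids(result.stdout)


def restart_fabric_manager_if_idle():
    """Restart the fabric manager only when no compute process is reported on the GPUs."""
    pids = gpu_compute_pids()
    if pids:
        print(f"GPUs still in use by PIDs {pids}; stop these processes first.")
        return False
    # Needs root; raises CalledProcessError if systemctl reports a failure.
    subprocess.run(["systemctl", "restart", FABRIC_MANAGER_UNIT], check=True)
    return True
```

Calling `restart_fabric_manager_if_idle()` on the affected node returns `False` and lists the remaining PIDs if anything is still running on the GPUs. If the error persists after a clean restart, rebooting the node remains the simpler route.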