NCCL 2.18 / Cuda 12.2 fails on H100 system with transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument' #976

Open
walkup opened this issue Aug 28, 2023 · 2 comments

Comments

@walkup

walkup commented Aug 28, 2023

All NCCL tests are currently failing on an H100 system with Cuda 12.2, Driver Version 535.54.03, with the error: transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument'. I get the same error with the master branch of NCCL (2.18.3) and the v2.18 branch (currently 2.18.5). For a simple test that fails, I can use the allreduce performance test from the NCCL Tests repository; all of the other NCCL use cases I have tried fail in the same manner. The call stack printed with NCCL_DEBUG=INFO is:

NCCL version 2.18.5+cuda12.2
...
transport/nvls.cc:165 NCCL WARN Cuda failure 'invalid argument'
NCCL INFO transport/nvls.cc:324 -> 1
NCCL INFO init.cc:1093 -> 1
NCCL INFO init.cc:1358 -> 1
NCCL INFO init.cc:1598 -> 1

The tests actually run correctly when only two processes use two GPUs, but fail with three or more processes during NCCL initialization. All of the currently failing test cases used to run correctly before a software update; I suspect the issue appeared with Cuda 12.2, but I did not have continuous access to the test system.
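For reference, a minimal single-process reproduction of the same initialization path might look like the sketch below (not taken from this thread): it creates one communicator per visible GPU with ncclCommInitAll and issues a small ncclAllReduce. The GPU count and buffer size are arbitrary illustrative choices, and the failure described above would surface during communicator creation rather than in the collective itself.

```c
// Minimal sketch: single-process allreduce across all visible GPUs,
// exercising the NCCL init path where the NVLS failure was reported.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_CUDA(cmd) do {                                   \
  cudaError_t e = (cmd);                                       \
  if (e != cudaSuccess) {                                      \
    fprintf(stderr, "CUDA error %s:%d '%s'\n",                 \
            __FILE__, __LINE__, cudaGetErrorString(e));        \
    exit(1);                                                   \
  }                                                            \
} while (0)

#define CHECK_NCCL(cmd) do {                                   \
  ncclResult_t r = (cmd);                                      \
  if (r != ncclSuccess) {                                      \
    fprintf(stderr, "NCCL error %s:%d '%s'\n",                 \
            __FILE__, __LINE__, ncclGetErrorString(r));        \
    exit(1);                                                   \
  }                                                            \
} while (0)

int main(void) {
  int ndev = 0;
  CHECK_CUDA(cudaGetDeviceCount(&ndev));   // reported failure needs >= 3 GPUs

  ncclComm_t*   comms   = (ncclComm_t*)malloc(ndev * sizeof(ncclComm_t));
  cudaStream_t* streams = (cudaStream_t*)malloc(ndev * sizeof(cudaStream_t));
  float**       sendbuf = (float**)malloc(ndev * sizeof(float*));
  float**       recvbuf = (float**)malloc(ndev * sizeof(float*));
  const size_t  count   = 1 << 20;         // arbitrary 1M floats per GPU

  for (int i = 0; i < ndev; i++) {
    CHECK_CUDA(cudaSetDevice(i));
    CHECK_CUDA(cudaMalloc((void**)&sendbuf[i], count * sizeof(float)));
    CHECK_CUDA(cudaMalloc((void**)&recvbuf[i], count * sizeof(float)));
    CHECK_CUDA(cudaStreamCreate(&streams[i]));
  }

  // NULL devlist means devices 0..ndev-1. The 'invalid argument' from
  // transport/nvls.cc was reported during communicator initialization.
  CHECK_NCCL(ncclCommInitAll(comms, ndev, NULL));

  CHECK_NCCL(ncclGroupStart());
  for (int i = 0; i < ndev; i++) {
    CHECK_NCCL(ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                             comms[i], streams[i]));
  }
  CHECK_NCCL(ncclGroupEnd());

  for (int i = 0; i < ndev; i++) {
    CHECK_CUDA(cudaSetDevice(i));
    CHECK_CUDA(cudaStreamSynchronize(streams[i]));
    CHECK_CUDA(cudaFree(sendbuf[i]));
    CHECK_CUDA(cudaFree(recvbuf[i]));
    CHECK_NCCL(ncclCommDestroy(comms[i]));
  }
  printf("allreduce across %d GPUs completed\n", ndev);
  return 0;
}
```

Built with nvcc and linked against -lnccl, a sketch like this should hit the same NVLS setup code on a multi-GPU H100 node as the NCCL Tests binaries do.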

@sjeaugey
Member

That's usually due to nv-fabricmanager having been manually restarted while processes were still holding resources on the GPU. Rebooting the node is the easiest fix; stopping everything on the GPUs and then restarting the fabric manager should also work.

@walkup
Author

walkup commented Aug 29, 2023

Indeed, a reboot fixed the problem ... thanks!
