We operate a GPU cluster in which each node consists of either:
8x A100-SXM4 interconnected with NVLink 3
8x H200-SXM5 interconnected with NVLink 4
The former is quite robust, and we have rarely seen an issue involving the NVLink fabric manager.
The latter, however, has had three instances of fabric manager failure within one month of operation.
With nccl-tests:
gpu46:78367:78492 [0] transport/nvls.cc:244 NCCL WARN Cuda failure 1 'invalid argument'
NVLink SHARP (NVLS) is a new HW feature introduced with the Hopper generation of NVLink and NVSwitches.
It offers acceleration of up to 1.3x for AllReduce operations on a single node.
We have found that the root cause of these issues is indeed the incorrect management of the FM and GPU reset sequences.
The solution is to make sure those reset sequences are followed.
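In case it is useful, a rough sketch of such a reset sequence on a bare-metal node (assuming a systemd-managed nvidia-fabricmanager service; the exact, supported procedure for your platform is described in the Fabric Manager User Guide):
# stop all CUDA workloads first, then stop the fabric manager
sudo systemctl stop nvidia-fabricmanager
# reset the GPUs; on NVSwitch-based systems the GPUs generally need to be reset together
sudo nvidia-smi -r
# bring the fabric manager back up before starting new jobs
sudo systemctl start nvidia-fabricmanager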
If you cannot control the FM/GPU reset sequence of your system, then I can only suggest you disable NCCL NVLink SHARP use with
NCCL_NVLS_ENABLE=0
and accept that you will not be able to benefit from NVLS acceleration.
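For example (the launcher and job name here are only placeholders; the variable just needs to be present in the environment of every rank):
# disable NVLink SHARP for a single launch
NCCL_NVLS_ENABLE=0 mpirun -np 8 ./my_training_job
# or export it globally in the shell / job script
export NCCL_NVLS_ENABLE=0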
You could also run a simple nccl-tests:all_reduce_perf before starting your main job to see if NVLS is operating correctly. This would need to be run on >= 4 GPUs for NVLS to be used.
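A minimal pre-flight check along those lines, assuming nccl-tests was built with MPI support and one rank per GPU (paths and rank count are placeholders):
# small AllReduce sweep over 8 local GPUs; NVLS is only exercised with 4 or more GPUs
NCCL_DEBUG=INFO mpirun -np 8 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1 2>&1 | tee nvls_precheck.log
# NCCL_DEBUG=INFO logs whether NVLS was actually selected; a run without 'Cuda failure'
# warnings suggests the fabric manager state on the node is healthy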
Thanks for mentioning NCCL_NVLS_ENABLE=0. I will monitor the status of the fabric manager without NVLS to bisect the issue.
Under normal operation, we do not manually reset the GPUs/FMs in between user jobs.
That's why we think that something has triggered this intermittent issue.
From previous reports linked in this thread, the common denominator is the Hopper architecture.
Or do you mean that the reset sequence is done automatically when a CUDA job finishes?
Hi,
The failures have occurred with several workloads: nccl-tests, hpl from NGC, and tensorflow with the NCCL backend.
Other observations:
Each case shows the same Cuda 'invalid argument' failure. dmesg also shows an Xid 31 error, which we are uncertain actually contributes to the issue.
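A quick way to look for these symptoms on a node (assuming kernel messages are available via dmesg and the fabric manager runs as the standard nvidia-fabricmanager systemd service) is something like:
# fabric manager service health and recent kernel-side GPU/NVLink errors
systemctl status nvidia-fabricmanager --no-pager
sudo dmesg -T | grep -iE 'xid|nvlink|fabric'
# on NVSwitch-based systems, nvidia-smi -q includes a Fabric section with the per-GPU fabric state
nvidia-smi -q | grep -A 3 -i Fabric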
Perusing the nccl and nccl-tests issue trackers, it seems to be a common issue with H100/H200 GPUs. The recommended solution so far is to either follow the proper FM/GPU reset sequence or disable NVLS with NCCL_NVLS_ENABLE=0.
The issue is attributed to the fabric manager being forcefully restarted, if I have understood the issue correctly. In such a case, what could be triggering these restarts on our nodes?
The current failure rate concerns us, especially if we want to scale up our service.
Thanks.