Currently, when a Link Down error occurs, the whole NCCL communication group stops and the entire training job fails with it. We have failover at the job level, but restarting the job takes a lot of extra time.
For example:
With RoCE, when the link goes down, NCCL logs the following and fails:
With IB, when the link goes down, NCCL logs the following and fails:
In RoCE, dmesg shows:
In IB, UFM shows:
A link-down event does not last long: the link usually recovers within a few dozen seconds at most, and the network topology and other context information do not change in the meantime. Could NCCL add a retry mechanism for network flaps?
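To make the request concrete, here is a minimal Python sketch of the behavior I have in mind: retry a failed operation with backoff, but only within a bounded time window, and give up once that window is exceeded. This is not NCCL code and not an existing NCCL API. `LinkDownError`, `retry_on_link_flap`, and the default of 60 seconds are hypothetical names and values chosen for illustration.

```python
import time


class LinkDownError(Exception):
    """Hypothetical stand-in for a transient link-down failure."""


def retry_on_link_flap(op, max_wait_s=60.0, initial_delay_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call op() until it succeeds or max_wait_s has elapsed.

    Returns (result, attempts). Re-raises the last LinkDownError if the
    link has not recovered within the allowed window.
    """
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return op(), attempts
        except LinkDownError:
            remaining = deadline - clock()
            if remaining <= 0:
                # The flap lasted longer than we are willing to wait.
                raise
            # Exponential backoff, capped by the time left in the window.
            sleep(min(delay, remaining))
            delay *= 2
```

The point of the bounded window is that a short flap is absorbed without restarting the job, while a real outage still surfaces as an error so job-level failover can take over.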
ProHuper changed the title from "Is is possiable for NCCL to add retry mechanism when net flap happens" to "Is is possiable for NCCL to add a retry mechanism when net flap happens" on Dec 27, 2024
ProHuper changed the title from "Is is possiable for NCCL to add a retry mechanism when net flap happens" to "Is is possible for NCCL to add a retry mechanism when net flap happens" on Dec 27, 2024