Is it possible for NCCL to add a retry mechanism when a network flap happens? #1557

ProHuper · 2024-12-27T06:17:03Z

Currently, when a Link Down error happens, the whole NCCL communication group stops and the whole training job fails with it. We have failover at the job level, but restarting the job takes a lot of extra time.

For example:
In RoCE, when a link down occurs, the NCCL log shows this and the job fails:

In IB, when a link down occurs, the NCCL log shows this and the job fails:

In RoCE, dmesg shows:

In IB, UFM shows:

A link down does not last long: the link usually recovers within a few dozen seconds at most, and the network topology and other context information do not change in the meantime. Can NCCL add a retry mechanism for when a network flap occurs?
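To make the request concrete, here is a minimal sketch of the behavior I have in mind: retry a failed operation with backoff for a bounded time, then give up and let job-level failover take over. It does not call any NCCL API. `TransientLinkError`, `retry_on_link_flap`, and all parameter names and default values are hypothetical and only illustrate the idea.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientLinkError(Exception):
    """Hypothetical error standing in for a link-down condition."""


def retry_on_link_flap(
    op: Callable[[], T],
    max_wait_s: float = 60.0,       # assumed budget: "dozens of seconds"
    initial_delay_s: float = 1.0,   # assumed first retry delay
    backoff: float = 2.0,           # assumed exponential backoff factor
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry `op` until it succeeds or `max_wait_s` has elapsed."""
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    while True:
        try:
            return op()
        except TransientLinkError:
            remaining = deadline - clock()
            if remaining <= 0:
                # The link stayed down longer than the budget:
                # re-raise so that job-level failover can take over.
                raise
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            delay *= backoff
```

The time budget is the important part: a flap that recovers within it costs only the wait, while a real outage still surfaces as an error instead of hanging the job. Inside NCCL the retried unit would presumably be the failed network operation rather than a Python callable.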

ProHuper changed the title ~~Is is possiable for NCCL to add retry mechanism when net flap happens~~ Is is possiable for NCCL to add a retry mechanism when net flap happens Dec 27, 2024

ProHuper changed the title ~~Is is possiable for NCCL to add a retry mechanism when net flap happens~~ Is is possible for NCCL to add a retry mechanism when net flap happens Dec 27, 2024
