Is it possible for NCCL to add a retry mechanism when a network flap happens? #1557

Open
ProHuper opened this issue Dec 27, 2024 · 0 comments

ProHuper commented Dec 27, 2024

Currently, when a Link Down error occurs, the whole NCCL communication group stops and the entire training job fails. We have job-level failover, but restarting the job takes a lot of extra time.

For example:
With RoCE, when a link goes down, the NCCL log shows the following and the job fails:
Image

With IB, when a link goes down, the NCCL log shows the following and the job fails:
Image

In RoCE, dmesg shows:
Image

In IB, UFM shows:
Image

A link-down event usually does not last long: the link typically recovers within a few dozen seconds at most, and during that time the network topology and other context information do not change. Could NCCL add a retry mechanism for network flaps?
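To make the request more concrete, here is a rough sketch of the behavior I have in mind, written as a generic retry loop with a bounded retry window. This is not NCCL code and does not use any NCCL API: `TransientLinkError`, `retry_on_link_flap`, and all parameter names and default values are hypothetical, chosen only for illustration.

```python
import time


class TransientLinkError(Exception):
    """Hypothetical stand-in for a link-down error reported by the transport."""


def retry_on_link_flap(op, max_wait_s=60.0, initial_delay_s=0.5, max_delay_s=8.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call op() until it succeeds or max_wait_s has elapsed.

    Only TransientLinkError is retried; any other exception propagates at once.
    Returns a tuple (result, number_of_attempts).
    """
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return op(), attempts
        except TransientLinkError:
            remaining = deadline - clock()
            if remaining <= 0:
                # The flap outlasted the retry window: give up and fail.
                raise
            # Exponential backoff, capped and never sleeping past the deadline.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay_s)
```

The retry window is bounded so that a link that never comes back still fails the job, and job-level failover can take over as it does today. How the operation would actually be resumed inside NCCL is not shown here, since that depends on NCCL internals.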
