[Feature Request] Provide a way to abort/time out ncclCommInitRank #289

Open
pritamdamania87 opened this issue Feb 4, 2020 · 1 comment

pritamdamania87 commented Feb 4, 2020

PyTorch Elastic Trainer is a project that provides elasticity for PyTorch Distributed DataParallel: it allows nodes to join and leave a distributed training job without causing the entire job to fail. PyTorch DDP lets users choose which communication backend to use for collective operations, and NCCL is the backend primarily used for GPU training. To provide elasticity we rely on ncclCommGetAsyncError to detect errors and on ncclCommAbort to abort any work that might get stuck due to node failures.
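
For context, a minimal sketch of that recovery path, assuming a dedicated watchdog thread and a communicator that has already been initialized (the function name and polling interval are illustrative, not part of any API):

```c
#include <nccl.h>
#include <unistd.h>

/* Illustrative watchdog loop: poll an already-initialized communicator
 * for asynchronous errors and abort it when one is detected. This path
 * works today, but only AFTER ncclCommInitRank has returned. */
static void watchdog_poll(ncclComm_t comm) {
    for (;;) {
        ncclResult_t async_err;
        if (ncclCommGetAsyncError(comm, &async_err) != ncclSuccess)
            break;
        if (async_err != ncclSuccess) {
            /* A peer failed: abort outstanding work so the surviving
             * ranks can tear down and rebuild the communicator. */
            ncclCommAbort(comm);
            break;
        }
        usleep(100 * 1000); /* poll every 100 ms */
    }
}
```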

However, one scenario we can't recover from is a node failure while all nodes are calling ncclCommInitRank. When this happens, the surviving nodes get stuck in ncclCommInitRank forever, and there is no way to abort the operation or have it time out. The only option currently seems to be killing the processes stuck waiting in ncclCommInitRank, as illustrated below.
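
Concretely, the hang happens inside the call sketched below. Because the output communicator is never filled in, there is nothing for ncclCommAbort to operate on (nranks, my_rank, and id stand in for values supplied by the job's rendezvous layer):

```c
#include <nccl.h>

/* Every rank must reach ncclCommInitRank for it to complete. If a peer
 * dies first, the surviving ranks block inside this call indefinitely,
 * and because `comm` is never set, ncclCommAbort cannot be used to
 * escape the hang. */
ncclComm_t setup_comm(int nranks, int my_rank, ncclUniqueId id) {
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, my_rank); /* can hang forever */
    return comm;
}
```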

It would be great if there were a way to recover from ncclCommInitRank without having to kill the process. It would be really helpful if the ncclCommInitRank API could support one of the options below (a hypothetical sketch follows the list):

  1. Allow users to specify a timeout for the operation.
  2. Add an async mode that returns a handle to the user, which the user can then use to abort the operation.
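
To make option 2 concrete, here is a purely hypothetical sketch of what such an API could look like. None of these names (ncclCommInitHandle_t, ncclCommInitRankAsync, ncclCommInitRankTest, ncclCommInitRankAbort) exist in NCCL; they are invented here only to illustrate the proposal:

```c
#include <nccl.h>

/* HYPOTHETICAL API -- nothing below exists in NCCL today; it only
 * illustrates the shape of the non-blocking variant in option 2. */
typedef struct ncclCommInitHandle* ncclCommInitHandle_t;

/* Start initialization and return immediately. */
ncclResult_t ncclCommInitRankAsync(ncclCommInitHandle_t* handle,
                                   int nranks, ncclUniqueId id, int rank);

/* Poll for completion; sets *done and, on success, fills in *comm. */
ncclResult_t ncclCommInitRankTest(ncclCommInitHandle_t handle,
                                  int* done, ncclComm_t* comm);

/* Abort a pending initialization, e.g. after a caller-side deadline. */
ncclResult_t ncclCommInitRankAbort(ncclCommInitHandle_t handle);
```

With a handle like this, PyTorch Elastic could enforce its own deadline around initialization and abort cleanly instead of killing the whole process.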

cc @sjeaugey

crccw commented Sep 25, 2020

+1. This is a problem for TensorFlow as well.
