PyTorch Elastic Trainer is a project that provides elasticity for PyTorch DistributedDataParallel: it allows nodes to join and leave a distributed training job without causing the entire job to fail. PyTorch DDP lets users choose which communication backend to use for collective operations, and NCCL is the primary backend for GPU training. To provide elasticity, we rely on ncclCommGetAsyncError to detect errors and ncclCommAbort to abort any work that might get stuck due to node failures.
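For context, here is a minimal sketch of the error-handling pattern we rely on today. It assumes comm is an already-initialized communicator with collectives in flight; the watchdog-style helper and its name are illustrative, not part of any existing library:

```c
#include <nccl.h>
#include <stdio.h>

// Illustrative helper: poll NCCL's asynchronous error state and abort the
// communicator if anything went wrong, so that hung collectives return
// instead of blocking forever. Returns 1 if an error was detected and the
// comm was aborted, 0 if healthy, -1 if the query itself failed.
static int check_and_abort_on_error(ncclComm_t comm) {
    ncclResult_t async_err;
    // ncclCommGetAsyncError reports errors from NCCL's background operations.
    if (ncclCommGetAsyncError(comm, &async_err) != ncclSuccess) {
        return -1;
    }
    if (async_err != ncclSuccess) {
        fprintf(stderr, "NCCL async error: %s\n", ncclGetErrorString(async_err));
        // ncclCommAbort unblocks any rank stuck in a collective on this comm
        // and releases its resources, letting the trainer re-form the job.
        ncclCommAbort(comm);
        return 1;
    }
    return 0;
}
```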
However, one scenario we can't recover from is a node failure while all nodes are calling ncclCommInitRank. When this happens, the surviving nodes get stuck in ncclCommInitRank forever, and there is no way to abort the operation or have it time out. The only workaround currently seems to be killing the processes stuck waiting in ncclCommInitRank.
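To illustrate why the existing abort path doesn't help here, a minimal sketch of the blocking call (the wrapper and variable names are illustrative):

```c
#include <nccl.h>

// Every rank calls this after the unique id has been exchanged out of band.
// If any peer dies before or during initialization, the surviving ranks
// block inside ncclCommInitRank indefinitely. At that point *comm has not
// been populated yet, so there is no handle to pass to ncclCommAbort, and
// the API exposes no timeout -- the only way out today is to kill the
// stuck process.
static ncclResult_t init_comm(ncclComm_t *comm, int nranks,
                              ncclUniqueId id, int rank) {
    return ncclCommInitRank(comm, nranks, id, rank);  /* may never return */
}
```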
It would be great if there were a way to recover from ncclCommInitRank without having to kill the process. It would be really helpful if the ncclCommInitRank API could support one of the options below:
Allow users to specify a timeout for this operation.
Provide an async mode where the API returns a handle to the user, which the user can then use to abort the operation (a rough sketch of what this could look like follows below).
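As a rough illustration of the second option: the handle type and init/poll/abort functions below are hypothetical and do not exist in NCCL today; only ncclCommInitRank's arguments and the ncclResult_t values are real. The sketch just shows the caller-side pattern we'd like to be able to write.

```c
#include <nccl.h>
#include <time.h>

// Hypothetical API -- ncclCommInitHandle_t, ncclCommInitRankAsync,
// ncclCommInitPoll, and ncclCommInitAbort are invented names used purely
// to illustrate an abortable, non-blocking initialization.
typedef struct ncclCommInitHandle *ncclCommInitHandle_t;
ncclResult_t ncclCommInitRankAsync(ncclCommInitHandle_t *handle, ncclComm_t *comm,
                                   int nranks, ncclUniqueId id, int rank);
ncclResult_t ncclCommInitPoll(ncclCommInitHandle_t handle, int *done);
ncclResult_t ncclCommInitAbort(ncclCommInitHandle_t handle);

// Caller-side pattern: poll with a deadline and abort instead of hanging,
// so the elastic trainer can re-run rendezvous with the surviving nodes.
static ncclResult_t init_with_timeout(ncclComm_t *comm, int nranks,
                                      ncclUniqueId id, int rank,
                                      double timeout_sec) {
    ncclCommInitHandle_t handle;
    ncclResult_t res = ncclCommInitRankAsync(&handle, comm, nranks, id, rank);
    if (res != ncclSuccess) return res;

    time_t start = time(NULL);
    int done = 0;
    while (!done) {
        res = ncclCommInitPoll(handle, &done);
        if (res != ncclSuccess) return res;
        if (!done && difftime(time(NULL), start) > timeout_sec) {
            ncclCommInitAbort(handle);   /* unblock instead of hanging forever */
            return ncclInternalError;    /* caller can retry / re-rendezvous */
        }
    }
    return ncclSuccess;
}
```

Either a timeout parameter or a handle-based abort would let the elastic trainer treat a failed initialization like any other recoverable error instead of having to kill the process.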
cc @sjeaugey