PyTorch Elastic Trainer is a project that provides elasticity for PyTorch DistributedDataParallel: it allows nodes to join and leave a distributed training job without causing the entire job to fail. PyTorch DDP lets users choose which communication backend to use for collective operations, and NCCL is the primary backend for GPU training. To provide elasticity, we rely on ncclCommGetAsyncError to detect errors and ncclCommAbort to abort any work that might get stuck due to node failures.
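For context, here is a minimal sketch of the error-handling pattern we rely on today. It assumes comm is an already-initialized communicator with collectives in flight; the watchdog-style helper and its name are illustrative, not part of any existing library:

```c
#include <nccl.h>
#include <stdio.h>

// Illustrative helper: poll NCCL's asynchronous error state and abort the
// communicator if anything went wrong, so that hung collectives return
// instead of blocking forever. Returns 1 if an error was detected and the
// comm was aborted, 0 if healthy, -1 if the query itself failed.
static int check_and_abort_on_error(ncclComm_t comm) {
    ncclResult_t async_err;
    // ncclCommGetAsyncError reports errors from NCCL's background operations.
    if (ncclCommGetAsyncError(comm, &async_err) != ncclSuccess) {
        return -1;
    }
    if (async_err != ncclSuccess) {
        fprintf(stderr, "NCCL async error: %s\n", ncclGetErrorString(async_err));
        // ncclCommAbort unblocks any rank stuck in a collective on this comm
        // and releases its resources, letting the trainer re-form the job.
        ncclCommAbort(comm);
        return 1;
    }
    return 0;
}
```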
However, one scenario we can't recover from is a node failure while all nodes are calling ncclCommInitRank. When this happens, the surviving nodes get stuck in ncclCommInitRank forever, and there is no way to abort the operation or have it time out. The only workaround currently seems to be killing the processes stuck waiting in ncclCommInitRank.
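To illustrate why the existing abort path doesn't help here, a minimal sketch of the blocking call (the wrapper and variable names are illustrative):

```c
#include <nccl.h>

// Every rank calls this after the unique id has been exchanged out of band.
// If any peer dies before or during initialization, the surviving ranks
// block inside ncclCommInitRank indefinitely. At that point *comm has not
// been populated yet, so there is no handle to pass to ncclCommAbort, and
// the API exposes no timeout -- the only way out today is to kill the
// stuck process.
static ncclResult_t init_comm(ncclComm_t *comm, int nranks,
                              ncclUniqueId id, int rank) {
    return ncclCommInitRank(comm, nranks, id, rank);  /* may never return */
}
```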
It would be great if there were a way to recover from ncclCommInitRank without having to kill the process. It would be really helpful if the ncclCommInitRank API could support one of the options below:
Allow users to specify a timeout for this operation.
Provide an async mode where the API returns a handle to the user, which the user can then use to abort the operation (a rough sketch of what this could look like follows below).
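As a rough illustration of the second option: the handle type and init/poll/abort functions below are hypothetical and do not exist in NCCL today; only ncclCommInitRank's arguments and the ncclResult_t values are real. The sketch just shows the caller-side pattern we'd like to be able to write.

```c
#include <nccl.h>
#include <time.h>

// Hypothetical API -- ncclCommInitHandle_t, ncclCommInitRankAsync,
// ncclCommInitPoll, and ncclCommInitAbort are invented names used purely
// to illustrate an abortable, non-blocking initialization.
typedef struct ncclCommInitHandle *ncclCommInitHandle_t;
ncclResult_t ncclCommInitRankAsync(ncclCommInitHandle_t *handle, ncclComm_t *comm,
                                   int nranks, ncclUniqueId id, int rank);
ncclResult_t ncclCommInitPoll(ncclCommInitHandle_t handle, int *done);
ncclResult_t ncclCommInitAbort(ncclCommInitHandle_t handle);

// Caller-side pattern: poll with a deadline and abort instead of hanging,
// so the elastic trainer can re-run rendezvous with the surviving nodes.
static ncclResult_t init_with_timeout(ncclComm_t *comm, int nranks,
                                      ncclUniqueId id, int rank,
                                      double timeout_sec) {
    ncclCommInitHandle_t handle;
    ncclResult_t res = ncclCommInitRankAsync(&handle, comm, nranks, id, rank);
    if (res != ncclSuccess) return res;

    time_t start = time(NULL);
    int done = 0;
    while (!done) {
        res = ncclCommInitPoll(handle, &done);
        if (res != ncclSuccess) return res;
        if (!done && difftime(time(NULL), start) > timeout_sec) {
            ncclCommInitAbort(handle);   /* unblock instead of hanging forever */
            return ncclInternalError;    /* caller can retry / re-rendezvous */
        }
    }
    return ncclSuccess;
}
```

Either a timeout parameter or a handle-based abort would let the elastic trainer treat a failed initialization like any other recoverable error instead of having to kill the process.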
cc @sjeaugey