Add ncclCommMarkAbort API for the convenience of the upper AI system controller #564

woodlgz · 2021-09-08T12:45:36Z

As discussed in issue 279 and issue 549, it is up to the upper AI system controller to handle errors when a rank is lost.

It is common for the upper AI system controller to call ncclCommAbort to abort communicator-related operations when a lost rank is detected.
However, in a system where multiple NCCL communicators are involved, for example Horovod with HOROVOD_NUM_NCCL_STREAMS set above 1, trying to abort or destroy all of these communicators one by one after a lost rank is detected will hang the system. The cause is CUDA implicit synchronization.
Imagine a scenario with two communicators, A and B, where the system tries to abort them in the order A, B. If the CUDA kernel belonging to communicator B keeps running and will not exit until it detects an abort, while the controller is trying to free communicator A (which involves cudaFree), the result is a deadlock.

Marking all communicators as aborted first, so that every outstanding operation gives up, and releasing the related resources only afterwards avoids this problem.
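To make the ordering concrete, here is a minimal host-only sketch of the two-phase idea. It does not call NCCL or CUDA: `MockComm`, `MockDevice`, `launchKernel`, `markAbort`, `releaseResources` and `abortAll` are hypothetical names invented for this illustration, a spinning thread stands in for a kernel waiting on a lost rank, and a bounded wait for an idle device stands in for the implicit synchronization of cudaFree. The real signature of ncclCommMarkAbort is not shown here and is not assumed.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a communicator; NOT the real ncclComm_t.
struct MockComm {
    std::atomic<bool> abortFlag{false};  // polled by the in-flight "kernel"
    std::thread kernel;                  // simulates a kernel stuck on a lost rank
};

// Hypothetical stand-in for the device: counts kernels that are still running.
struct MockDevice {
    std::atomic<int> runningKernels{0};
};

// Start a "kernel" that only exits once its communicator is marked aborted.
void launchKernel(MockDevice& dev, MockComm& comm) {
    dev.runningKernels.fetch_add(1);
    comm.kernel = std::thread([&dev, &comm] {
        while (!comm.abortFlag.load()) std::this_thread::yield();
        dev.runningKernels.fetch_sub(1);
    });
}

// Phase 1: only set the flag. No resource is released, so this never blocks.
void markAbort(MockComm& comm) { comm.abortFlag.store(true); }

// Phase 2: release resources. Like an implicitly synchronizing free, it waits
// until the whole device is idle. Returns false on timeout, which models a hang.
bool releaseResources(MockDevice& dev, MockComm& comm,
                      std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (dev.runningKernels.load() != 0) {
        if (std::chrono::steady_clock::now() > deadline) return false;
        std::this_thread::yield();
    }
    if (comm.kernel.joinable()) comm.kernel.join();
    return true;
}

// Abort numComms communicators. With markAllFirst == false each communicator
// is marked and released in turn (the order that hangs); with true, all are
// marked before any release. Returns true if no release step timed out.
bool abortAll(int numComms, bool markAllFirst) {
    MockDevice dev;
    std::vector<std::unique_ptr<MockComm>> comms;
    for (int i = 0; i < numComms; ++i) {
        comms.push_back(std::make_unique<MockComm>());
        launchKernel(dev, *comms.back());
    }

    bool ok = true;
    if (markAllFirst) {
        for (auto& c : comms) markAbort(*c);
    }
    for (auto& c : comms) {
        markAbort(*c);  // idempotent; a no-op if already marked
        if (!releaseResources(dev, *c, std::chrono::milliseconds(300))) {
            ok = false;
            break;
        }
    }

    // Cleanup so the simulation itself always terminates.
    for (auto& c : comms) {
        markAbort(*c);
        if (c->kernel.joinable()) c->kernel.join();
    }
    return ok;
}
```

In this sketch `abortAll(2, false)` times out while releasing A because B's kernel is still spinning, whereas `abortAll(2, true)` completes, since every kernel has already been told to exit before the first release begins.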

Signed-off-by: guoze.lin <[email protected]>

woodlgz · 2021-09-09T11:40:05Z

@alsrgv take a look at this please

add ncclCommMarkAbort api in convienence of upper ai system controller

8a31ce7

Signed-off-by: guoze.lin <[email protected]>
