Add ncclCommMarkAbort API for the convenience of the upper-level AI system controller #564
As discussed in issue 279 and issue 549, it is up to the upper-level AI system controller to handle errors when a rank is lost.
It is common for that controller to call ncclCommAbort to abort communicator-related operations once a lost rank is detected.
However, in a system that involves multiple NCCL communicators, such as Horovod with HOROVOD_NUM_NCCL_STREAMS set above 1, trying to abort or destroy all of these communicators after a lost rank is detected causes the system to hang. This is due to CUDA implicit synchronization.
Imagine a scenario with two communicators, A and B, where the system tries to abort them in the order A, B. The CUDA kernel for communicator B keeps running and will not exit until it detects an abort. Meanwhile the controller tries to free communicator A, which involves cudaFree. cudaFree waits for the kernel of B, but B is only aborted after the abort of A returns, so a deadlock occurs.
Marking all communicators as aborted first, so that they give up all operations, and releasing the related resources afterwards avoids this problem.
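The ordering can be illustrated with a small host-only simulation. This is only a sketch under stated assumptions: it does not use NCCL or CUDA, and `FakeComm`, `launchKernel`, `markAbort`, `releaseComm` and `abortAllTwoPhase` are hypothetical names invented for illustration. The "kernel" is a thread that spins until its communicator's abort flag is set, and `releaseComm` imitates the implicit synchronization of cudaFree by waiting for every kernel on the simulated device.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a communicator; NOT the real ncclComm_t.
struct FakeComm {
    std::atomic<bool> abortFlag{false};   // polled by the simulated kernel
    std::atomic<bool> kernelDone{false};  // set when the simulated kernel exits
    std::thread kernel;
};

using Device = std::vector<std::unique_ptr<FakeComm>>;

// Simulated collective kernel: keeps running until its communicator is marked aborted.
void launchKernel(FakeComm& c) {
    c.kernel = std::thread([&c] {
        while (!c.abortFlag.load()) std::this_thread::yield();
        c.kernelDone.store(true);
    });
}

// Phase 1 (assumed analogue of the proposed mark-abort step): set the flag, free nothing.
void markAbort(FakeComm& c) { c.abortFlag.store(true); }

// Phase 2: release resources. Like an implicitly synchronizing free, this
// waits for ALL kernels on the device, not only the one owned by `c`.
void releaseComm(FakeComm& c, const Device& device) {
    for (const auto& other : device)
        while (!other->kernelDone.load()) std::this_thread::yield();
    if (c.kernel.joinable()) c.kernel.join();
}

// Mark every communicator first, then release them one by one.
// Returns the number of communicators released.
int abortAllTwoPhase(int n) {
    Device comms;
    for (int i = 0; i < n; ++i) {
        comms.push_back(std::make_unique<FakeComm>());
        launchKernel(*comms.back());
    }
    for (auto& c : comms) markAbort(*c);       // phase 1: all kernels can now exit
    int released = 0;
    for (auto& c : comms) {                    // phase 2: safe to synchronize and free
        releaseComm(*c, comms);
        ++released;
    }
    return released;
}
```

Calling `abortAllTwoPhase(2)` corresponds to the A, B scenario. If the two loops were merged so that A is marked and released before B is marked, `releaseComm` for A would wait forever on the kernel of B, which is the hang described above.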