Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add ncclCommMarkAbort api in convienence of upper ai system controller #564

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

woodlgz
Copy link

@woodlgz woodlgz commented Sep 8, 2021

as issue 279 and issue 549 discussed, it's up to upper ai system controller to handle error when some rank losts.

it 's common for upper ai system controller to call ncclCommAbort to abort communicator related operations when rank lost situation is detected.
however, in system where multiple nccl communicators get involved, say horovod with HOROVOD_NUM_NCCL_STREAMS set to above 1, when lost rank is detected, trying to abort or destroy all these communicators will cause system hang. It's due to cuda implicit synchronization.
Image a scenario with 2 communicator (say A,B), system tries to abort all nccl communicators in sequence of A,B. If cuda kernel related to communicator B keeps running and won't exit until abort is detected while controller try to free communicator A(cudaFree involved), deadlock is met.

mark all communicator as aborted and give up all operations first, and release related resources later can avoid above problem.

@woodlgz
Copy link
Author

woodlgz commented Sep 9, 2021

@alsrgv take a look at this please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant