ncclCommGetAsyncError doesn't report errors for failures within a host. #279
Comments
Following up on this issue, I was wondering if someone from the NCCL team could provide some guidance here? I'm happy to provide any additional information that might be missing in the original issue. Thanks!
Sorry for the delay. I think there is a misunderstanding: So it is the responsibility of the application to call
Thanks for the explanation, I understand that
What sort of errors (apart from network communication errors) are reported via
The fact that But this is an exception. When GPUs communicate through shared memory, there is no such mechanism. With other networking technologies, there could be no such detection either. In case of a proper clean-up we could certainly mark the shared memory as "closed" but that would not work if, for example, one process crashes and exits without the opportunity to properly close its communication channels.
And indeed this says nothing about other ranks, nor does it guarantee that other ranks will fail as well, so you should have a separate external mechanism to monitor process errors, crashes, and so on, and handle them at a higher level (usually the process management system does that).
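As an illustration of the "separate external mechanism" described above, here is a minimal sketch of a launcher that watches every rank's process and kills the survivors as soon as one exits with a non-zero code. It is not part of NCCL or PyTorch: the `supervise` function and the two stand-in commands are hypothetical, and a real process manager would do considerably more.

```python
import subprocess
import sys
import time


def supervise(commands, poll_interval=0.05):
    """Launch one process per rank; if any exits non-zero, kill the rest.

    Returns the list of exit codes, in rank order.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            codes = [p.poll() for p in procs]
            if any(code not in (None, 0) for code in codes):
                # One rank failed. Surviving ranks may wait on it
                # indefinitely, so terminate them explicitly.
                for p in procs:
                    if p.poll() is None:
                        p.kill()
                break
            if all(code == 0 for code in codes):
                break  # every rank finished cleanly
            time.sleep(poll_interval)
    finally:
        # Reap all children so their return codes are populated.
        for p in procs:
            p.wait()
    return [p.returncode for p in procs]


if __name__ == "__main__":
    # Hypothetical stand-ins for two ranks: rank 0 crashes, rank 1 hangs.
    crash = [sys.executable, "-c", "import sys; sys.exit(3)"]
    hang = [sys.executable, "-c", "import time; time.sleep(3600)"]
    print(supervise([crash, hang]))
```

The point of the design is that failure detection does not depend on the communication path at all: the launcher sees the crashed process directly, whether the ranks talk over the network or through shared memory.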
Thanks again for the detailed explanation; it clarifies a lot about the behavior that we are seeing. I had one more question regarding the behavior of
It might see an error, but that is not guaranteed (it will actually see an error only if it is communicating with it through the network).
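Because an asynchronous error is not guaranteed to show up, a waiting rank generally needs a deadline in addition to error polling. The sketch below shows one way to combine the two. The callables `is_done` and `get_async_error` are hypothetical placeholders that the application would supply (for example, thin wrappers around a stream query and ncclCommGetAsyncError); nothing here is an actual NCCL or PyTorch binding.

```python
import time


def wait_for_collective(is_done, get_async_error, timeout_s, poll_interval_s=0.01):
    """Wait until the operation finishes, an async error appears, or time runs out.

    is_done():          hypothetical callable, True once the operation completed.
    get_async_error():  hypothetical callable, returns None while no error is pending.
    Returns a (status, error) tuple where status is "done", "error" or "timeout".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        err = get_async_error()
        if err is not None:
            return ("error", err)      # the library reported a failure
        if is_done():
            return ("done", None)      # normal completion
        if time.monotonic() >= deadline:
            return ("timeout", None)   # no error surfaced, e.g. intra-host peer died
        time.sleep(poll_interval_s)
```

On either "error" or "timeout" the application would then typically abort the communicator (NCCL provides ncclCommAbort for this) and surface the failure to the caller, rather than waiting indefinitely.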
For reference I'm running two instances of the following script:
I run two instances of the script as follows:
The problem I run into is that when I run both instances on the same host, ncclCommGetAsyncError doesn't report an error and the operation eventually times out. However, if I run them on separate hosts, ncclCommGetAsyncError does report an error and we see a separate exception being thrown. For additional context, this is the PyTorch PR that uses ncclCommGetAsyncError: pytorch/pytorch#25012.
I was wondering why ncclCommGetAsyncError doesn't report an error when we run multiple ranks within a single host and one of the ranks fails?