ncclCommGetAsyncError doesn't report errors for failures within a host. #279

Open
pritamdamania87 opened this issue Dec 27, 2019 · 6 comments

@pritamdamania87

For reference, I'm running two instances of the following script:

from __future__ import absolute_import, division, print_function, unicode_literals

import torch.distributed as c10d
import torch
import argparse
import os
import logging
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', level=logging.INFO)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='Simple script to simulate NCCL errors. The script is '
        'supposed to be run on multiple different nodes simultaneously with '
        'appropriate rank and world_size. The script runs an allreduce() on '
        'the rank 0 node and aborts all the other nodes to simulate an error '
        'in NCCL')
    parser.add_argument('addr', help='address of the master node to connect to.')
    parser.add_argument('port', help='port of the master node to connect to.')
    parser.add_argument('rank', help='rank of this node')
    parser.add_argument('world_size', help='number of nodes in process group')
    args = parser.parse_args()
    rank = int(args.rank)
    world_size = int(args.world_size)
    port = int(args.port)

    store = c10d.TCPStore(args.addr, port, world_size, rank == 0)
    process_group = c10d.ProcessGroupNCCL(store, rank, world_size)
    logging.info('Running first allreduce')
    process_group.allreduce(torch.rand(10).cuda(rank)).wait()
    if rank == 0:
        logging.info('Running second allreduce only on rank 0')
        work = process_group.allreduce(torch.rand(10).cuda(rank))
        logging.info('Waiting for allreduce to complete...')
        work.wait()
        logging.info('Second allreduce successful: {}'.format(work.is_success()))
    else:
        logging.info('Aborting all other ranks.')
        os.abort()

I run two instances of the script as follows:

NCCL_BLOCKING_WAIT=1 python test/simulate_nccl_errors.py <addr> <port> 0 2
NCCL_BLOCKING_WAIT=1 python test/simulate_nccl_errors.py <addr> <port> 1 2

The problem I run into is that when I run both instances on the same host, ncclCommGetAsyncError doesn't report an error and the operation eventually times out. However, if I run them on separate hosts, ncclCommGetAsyncError does report an error and a corresponding exception is thrown. For additional context, this is the PyTorch PR that uses ncclCommGetAsyncError: pytorch/pytorch#25012.

I was wondering why ncclCommGetAsyncError doesn't report an error when we run multiple ranks within a single host and one of the ranks fails?

@pritamdamania87
Author

Following up on this issue, I was wondering if someone from the NCCL team could provide some guidance here? I'm happy to provide any additional information that might be missing in the original issue. Thanks!

@sjeaugey
Member

sjeaugey commented Jan 9, 2020

Sorry for the delay.

I think there is a misunderstanding: ncclCommGetAsyncError is not meant to report that other ranks have failed. Stopping all ranks consistently is the responsibility of the application. ncclCommGetAsyncError is there to report errors which happen in NCCL (in particular network communication errors) and which NCCL cannot report through any other channel, since NCCL operations are CUDA kernels.
When those errors happen, NCCL could abort the current kernel, but then the application might think the operation completed successfully, which is why we let the application abort the NCCL kernel in a controlled manner.

So it is the responsibility of the application to call ncclCommAbort on all NCCL ranks if any rank experiences an error during an NCCL call, detected either through a CUDA error on the NCCL call or through ncclCommGetAsyncError.
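To make this pattern concrete, here is a minimal sketch in C against the NCCL API. It is only an illustration of the loop described above, not the PyTorch implementation; the helper name waitWithErrorCheck and the 1 ms polling interval are made up for the example.

#include <nccl.h>
#include <cuda_runtime.h>
#include <unistd.h>

/* Hypothetical helper: wait for the NCCL work enqueued on `stream` to finish,
 * polling the communicator for asynchronous errors in the meantime. If an
 * async error shows up, abort the communicator so the kernel is torn down in
 * a controlled manner instead of appearing to complete. */
static ncclResult_t waitWithErrorCheck(ncclComm_t comm, cudaStream_t stream) {
  while (cudaStreamQuery(stream) == cudaErrorNotReady) {
    ncclResult_t asyncErr;
    ncclResult_t ret = ncclCommGetAsyncError(comm, &asyncErr);
    if (ret != ncclSuccess) return ret;
    if (asyncErr != ncclSuccess) {
      ncclCommAbort(comm);   /* abort this rank's comm; recovery is up to the app */
      return asyncErr;
    }
    usleep(1000);            /* avoid busy-spinning while the kernel runs */
  }
  return ncclSuccess;
}

As noted above, the abort then has to be driven by the application on every rank of the communicator; NCCL will not propagate it on its own.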

@pritamdamania87
Author

Thanks for the explanation. I understand that ncclCommGetAsyncError wouldn't report errors when other ranks fail if we don't have any NCCL operations ongoing. But if we have two processes running ncclAllReduce and one of the processes crashes, shouldn't ncclCommGetAsyncError report an error on the process that is still alive? In the examples I posted above, one of the ranks is trying to perform ncclAllReduce while the other rank fails.

ncclCommGetAsyncError is there to report errors which happen in NCCL (in particular network communication errors) and which NCCL cannot report through any other channel, since NCCL operations are CUDA kernels.

What sort of errors (apart from network communication errors) are reported via ncclCommGetAsyncError? Also, in general, is your suggestion that we shouldn't rely on ncclCommGetAsyncError for detecting failures of participating ranks, and should instead have a separate external mechanism for this?

@sjeaugey
Member

sjeaugey commented Jan 9, 2020

The fact that ncclCommGetAsyncError reports an error when you abort another rank is because the two ranks are communicating through sockets: when one side closes the socket, the other side sees it and you get an error.

But this is an exception. When GPUs communicate through shared memory, there is no such mechanism, and with other networking technologies there may be no such detection either. In the case of a proper clean-up we could certainly mark the shared memory as "closed", but that would not work if, for example, one process crashes and exits without the opportunity to properly close its communication channels.

ncclCommGetAsyncError currently only reports networking errors, e.g. sockets being closed or any other error while communicating (it could be an InfiniBand send failing as well). You will not get any asynchronous error when running within a node.

And indeed this does not say anything about other ranks, nor does it guarantee that other ranks will fail as well, so you should have a separate external mechanism to monitor process errors, crashes, etc. and handle that at a higher level (usually the process management system does this).
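As a rough illustration of such a separate external mechanism, the following sketch runs a background thread that aborts the local communicator once a peer failure is reported. The peer_failure_reported() function is a made-up stand-in for whatever notification the process manager or job scheduler actually provides; this is not an NCCL or PyTorch API.

#include <nccl.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical: returns true once the external process manager reports that
 * some peer rank has died. Not part of NCCL. */
extern bool peer_failure_reported(void);

/* Monitor thread: waits for the external failure signal, then aborts the
 * communicator passed in as the thread argument, which tears down any NCCL
 * operation this rank is stuck in. */
static void* abort_on_peer_failure(void* arg) {
  ncclComm_t comm = (ncclComm_t)arg;
  while (!peer_failure_reported())
    usleep(10 * 1000);       /* poll the external signal every 10 ms */
  ncclCommAbort(comm);
  return NULL;
}

/* Usage sketch: pthread_create(&tid, NULL, abort_on_peer_failure, (void*)comm); */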

@pritamdamania87
Author

Thanks again for the detailed explanation; this clarifies a lot about the behavior we are seeing. I had one more question regarding the behavior of ncclCommAbort: if I have two ranks, 0 and 1, and only rank 1 calls ncclCommAbort, would rank 0 see any sort of error when it calls ncclCommGetAsyncError?

@sjeaugey
Member

sjeaugey commented Jan 9, 2020

It might see an error, but that is not guaranteed (rank 0 will effectively see an error only if it is communicating with rank 1 through the network).
