ncclCommGetAsyncError doesn't report errors for failures within a host. #279

Open
pritamdamania87 opened this issue Dec 27, 2019 · 6 comments

@pritamdamania87

For reference, I'm running two instances of the following script:

from __future__ import absolute_import, division, print_function, unicode_literals

import torch.distributed as c10d
import torch
import argparse
import os
import logging
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', level=logging.INFO)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='Simple script to simulate NCCL errors. The script is '
        'supposed to be run on multiple different nodes simultaneously with '
        'appropriate rank and world_size. The script runs an allreduce() on '
        'the rank 0 node and aborts all the other nodes to simulate an error '
        'in NCCL')
    parser.add_argument('addr', help='address of the master node to connect to.')
    parser.add_argument('port', help='port of the master node to connect to.')
    parser.add_argument('rank', help='rank of this node')
    parser.add_argument('world_size', help='number of nodes in process group')
    args = parser.parse_args()
    rank = int(args.rank)
    world_size = int(args.world_size)
    port = int(args.port)

    store = c10d.TCPStore(args.addr, port, world_size, rank == 0)
    process_group = c10d.ProcessGroupNCCL(store, rank, world_size)
    logging.info('Running first allreduce')
    process_group.allreduce(torch.rand(10).cuda(rank)).wait()
    if rank == 0:
        logging.info('Running second allreduce only on rank 0')
        work = process_group.allreduce(torch.rand(10).cuda(rank))
        logging.info('Waiting for allreduce to complete...')
        work.wait()
        logging.info('Second allreduce successful: {}'.format(work.is_success()))
    else:
        logging.info('Aborting all other ranks.')
        os.abort()

I run two instances of the script as follows:

NCCL_BLOCKING_WAIT=1 python test/simulate_nccl_errors.py <addr> <port> 0 2
NCCL_BLOCKING_WAIT=1 python test/simulate_nccl_errors.py <addr> <port> 1 2

The problem I run into is that when I run both instances on the same host, ncclCommGetAsyncError doesn't report an error and the operation eventually times out. However, if I run them on separate hosts, ncclCommGetAsyncError does report an error and a corresponding exception is thrown. For additional context, this is the PyTorch PR that uses ncclCommGetAsyncError: pytorch/pytorch#25012.

I was wondering why ncclCommGetAsyncError doesn't report an error when we run multiple ranks within a single host and one of the ranks fails?

@pritamdamania87
Author

Following up on this issue, I was wondering if someone from the NCCL team could provide some guidance here? I'm happy to provide any additional information that might be missing in the original issue. Thanks!

@sjeaugey
Member

sjeaugey commented Jan 9, 2020

Sorry for the delay.

I think there is a misunderstanding: ncclCommGetAsyncError is not meant to report that other ranks have failed. Stopping all ranks consistently is the responsibility of the application. ncclCommGetAsyncError is there to report errors which happen in NCCL (in particular network communication errors) and which NCCL cannot report through any other channel, since NCCL operations are CUDA kernels.
When those errors happen, NCCL could abort the current kernel, but then the application might think the operation completed successfully, which is why we let the application abort the NCCL kernel in a controlled manner.

So it is the responsibility of the application to call ncclCommAbort on all NCCL ranks if any rank experiences an error during an NCCL call, detected either through a CUDA error on the NCCL call or through ncclCommGetAsyncError.
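To make this pattern concrete, here is a minimal sketch in C against the NCCL API. It is only an illustration of the loop described above, not the PyTorch implementation; the helper name waitWithErrorCheck and the 1 ms polling interval are made up for the example.

#include <nccl.h>
#include <cuda_runtime.h>
#include <unistd.h>

/* Hypothetical helper: wait for the NCCL work enqueued on `stream` to finish,
 * polling the communicator for asynchronous errors in the meantime. If an
 * async error shows up, abort the communicator so the kernel is torn down in
 * a controlled manner instead of appearing to complete. */
static ncclResult_t waitWithErrorCheck(ncclComm_t comm, cudaStream_t stream) {
  while (cudaStreamQuery(stream) == cudaErrorNotReady) {
    ncclResult_t asyncErr;
    ncclResult_t ret = ncclCommGetAsyncError(comm, &asyncErr);
    if (ret != ncclSuccess) return ret;
    if (asyncErr != ncclSuccess) {
      ncclCommAbort(comm);   /* abort this rank's comm; recovery is up to the app */
      return asyncErr;
    }
    usleep(1000);            /* avoid busy-spinning while the kernel runs */
  }
  return ncclSuccess;
}

As noted above, the abort then has to be driven by the application on every rank of the communicator; NCCL will not propagate it on its own.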

@pritamdamania87
Author

Thanks for the explanation. I understand that ncclCommGetAsyncError wouldn't report errors when other ranks fail if we don't have any NCCL operations ongoing. But if we have two processes running ncclAllReduce and one of the processes crashes, shouldn't ncclCommGetAsyncError report an error on the process that is still alive? In the examples I posted above, one of the ranks is trying to perform ncclAllReduce while the other rank fails.

ncclCommGetAsyncError is there to report errors which happen in NCCL (in particular network communication errors) and which NCCL cannot report through any other channel, since NCCL operations are CUDA kernels.

What sort of errors (apart from network communication errors) are reported via ncclCommGetAsyncError? Also, in general, is your suggestion that we shouldn't rely on ncclCommGetAsyncError for detecting failures of participating ranks, and should instead have a separate external mechanism for this?

@sjeaugey
Member

sjeaugey commented Jan 9, 2020

The fact that ncclCommGetAsyncError reports an error when you abort another rank is because the two ranks are communicating through sockets: when one side closes the socket, the other side sees it and you get an error.

But this is an exception. When GPUs communicate through shared memory, there is no such mechanism, and with other networking technologies there may be no such detection either. In the case of a proper clean-up we could certainly mark the shared memory as "closed", but that would not work if, for example, one process crashes and exits without the opportunity to properly close its communication channels.

ncclCommGetAsyncError currently only reports networking errors, e.g. sockets being closed or any other error while communicating (it could be an InfiniBand send failing as well). You will not get any asynchronous error when running within a node.

And indeed this does not say anything about other ranks, nor does it guarantee that other ranks will fail as well, so you should have a separate external mechanism to monitor process errors, crashes, etc. and handle that at a higher level (usually the process management system does this).
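As a rough illustration of such a separate external mechanism, the following sketch runs a background thread that aborts the local communicator once a peer failure is reported. The peer_failure_reported() function is a made-up stand-in for whatever notification the process manager or job scheduler actually provides; this is not an NCCL or PyTorch API.

#include <nccl.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical: returns true once the external process manager reports that
 * some peer rank has died. Not part of NCCL. */
extern bool peer_failure_reported(void);

/* Monitor thread: waits for the external failure signal, then aborts the
 * communicator passed in as the thread argument, which tears down any NCCL
 * operation this rank is stuck in. */
static void* abort_on_peer_failure(void* arg) {
  ncclComm_t comm = (ncclComm_t)arg;
  while (!peer_failure_reported())
    usleep(10 * 1000);       /* poll the external signal every 10 ms */
  ncclCommAbort(comm);
  return NULL;
}

/* Usage sketch: pthread_create(&tid, NULL, abort_on_peer_failure, (void*)comm); */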

@pritamdamania87
Author

Thanks again for the detailed explanation; this clarifies a lot about the behavior we are seeing. I had one more question regarding the behavior of ncclCommAbort: if I have two ranks, 0 and 1, and only rank 1 calls ncclCommAbort, would rank 0 see any sort of error when it calls ncclCommGetAsyncError?

@sjeaugey
Member

sjeaugey commented Jan 9, 2020

It might see an error, but that is not guaranteed (rank 0 will effectively see an error only if it is communicating with rank 1 through the network).
