coll collectives segfault on CUDA buffers #12045

Open
wenduwan opened this issue Nov 3, 2023 · 5 comments

@wenduwan
Contributor

wenduwan commented Nov 3, 2023

Background information

While testing Open MPI 5 with OMB, I observed segfaults when running some collective benchmarks with CUDA buffers.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 5: https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.0.tar.bz2
OMB: http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Configure Open MPI

$ ./configure --enable-debug --with-cuda=/usr/local/cuda --with-cuda-libdir=/lib64

Configure OMB

$ ./configure --with-cuda=/usr/local/cuda --enable-cuda CC=/path/to/ompi5 CXX=/path/to/ompi5
$ PATH=/usr/local/cuda/bin:$PATH make -j install

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2 (also reproducible on Ubuntu 22.04), with CUDA 12.2 and the 535 driver installed.
  • Computer hardware: p4d.24xlarge instance with A100 GPU
  • Network type: EFA (also reproducible with pml ob1)

Details of the problem

Here is an example with osu_ireduce on 4 ranks on a single node.

$ mpirun -n 4 --mca pml ob1 --mca coll_base_verbose 1 osu-micro-benchmarks/mpi/collective/osu_ireduce -d cuda
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07273] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07272] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07271] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07270] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07273] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07272] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07271] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07270] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
# OSU MPI-CUDA Non-blocking Reduce Latency Test
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
[ip-172-31-31-62.us-west-2.compute.internal:07271] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07273] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62:07270] *** Process received signal ***
[ip-172-31-31-62:07270] Signal: Segmentation fault (11)
[ip-172-31-31-62:07270] Signal code: Invalid permissions (2)
[ip-172-31-31-62:07270] Failing at address: 0x7fb321200000
[ip-172-31-31-62:07272] *** Process received signal ***
[ip-172-31-31-62:07272] Signal: Segmentation fault (11)
[ip-172-31-31-62:07272] Signal code: Invalid permissions (2)
[ip-172-31-31-62:07272] Failing at address: 0x7fe881200000
[ip-172-31-31-62:07270] [ 0] /usr/lib/habanalabs/libhl_logger.so(_Z13signalHandleriP9siginfo_tPv+0x18e)[0x7fb32c69c7be]
[ip-172-31-31-62:07270] [ 1] /lib64/libpthread.so.0(+0x118e0)[0x7fb356ed08e0]
[ip-172-31-31-62:07270] [ 2] [ip-172-31-31-62:07272] [ 0] /usr/lib/habanalabs/libhl_logger.so(_Z13signalHandleriP9siginfo_tPv+0x18e)[0x7fe8687417be]
[ip-172-31-31-62:07272] [ 1] /lib64/libpthread.so.0(+0x118e0)[0x7fe8b60b58e0]
[ip-172-31-31-62:07272] [ 2] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x389c1c)[0x7fb35767cc1c]
[ip-172-31-31-62:07270] [ 3] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x389c1c)[0x7fe8b6861c1c]
[ip-172-31-31-62:07272] [ 3] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d3411)[0x7fb3574c6411]
[ip-172-31-31-62:07270] [ 4] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d3411)[0x7fe8b66ab411]
[ip-172-31-31-62:07272] [ 4] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d4e89)[0x7fb3574c7e89]
[ip-172-31-31-62:07270] [ 5] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d4e89)[0x7fe8b66ace89]
[ip-172-31-31-62:07272] [ 5] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(NBC_Progress+0x3bc)[0x7fb3574c77eb]
[ip-172-31-31-62:07270] [ 6] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(NBC_Progress+0x3bc)[0x7fe8b66ac7eb]
[ip-172-31-31-62:07272] [ 6] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_coll_libnbc_progress+0xc3)[0x7fb3574c508d]
[ip-172-31-31-62:07270] [ 7] /home/ec2-user/openmpi-5.0.0/install/lib/libopen-pal.so.80(opal_progress+0x30)[0x7fb3563cfcc6]
[ip-172-31-31-62:07270] [ 8] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_coll_libnbc_progress+0xc3)[0x7fe8b66aa08d]
[ip-172-31-31-62:07272] [ 7] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0xa335b)[0x7fb35739635b]
[ip-172-31-31-62:07270] [ 9] /home/ec2-user/openmpi-5.0.0/install/lib/libopen-pal.so.80(opal_progress+0x30)[0x7fe8b55b4cc6]
[ip-172-31-31-62:07272] [ 8] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0xa335b)[0x7fe8b657b35b]
[ip-172-31-31-62:07272] [ 9] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7fb3573963c4]
[ip-172-31-31-62:07270] [10] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7fe8b657b3c4]
[ip-172-31-31-62:07272] [10] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(MPI_Wait+0x138)[0x7fb3574355f1]
[ip-172-31-31-62:07270] [11] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x402a8c]
[ip-172-31-31-62:07270] [12] /lib64/libc.so.6(__libc_start_main+0xea)[0x7fb356b3313a]
[ip-172-31-31-62:07270] [13] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x40332a]
/home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(MPI_Wait+0x138)[0x7fe8b661a5f1]
[ip-172-31-31-62:07272] [11] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x402a8c]
[ip-172-31-31-62:07272] [12] [ip-172-31-31-62:07270] *** End of error message ***

Backtrace:

#0  0x00007fb35767cc1c in ompi_op_avx_2buff_add_float_avx512 (_in=0x7fb321200000, _out=0x254fbf0, count=0x7ffcb82a3fc4, dtype=0x7ffcb82a3f88, module=0x181ff30)
    at op_avx_functions.c:680
#1  0x00007fb3574c6411 in ompi_op_reduce (op=0x62c760 <ompi_mpi_op_sum>, source=0x7fb321200000, target=0x254fbf0, full_count=1, dtype=0x62e3a0 <ompi_mpi_float>)
    at ../../../../ompi/op/op.h:572
#2  0x00007fb3574c7e89 in NBC_Start_round (handle=0x25540e8) at nbc.c:539
#3  0x00007fb3574c77eb in NBC_Progress (handle=0x25540e8) at nbc.c:419
#4  0x00007fb3574c508d in ompi_coll_libnbc_progress () at coll_libnbc_component.c:445
#5  0x00007fb3563cfcc6 in opal_progress () at runtime/opal_progress.c:224
#6  0x00007fb35739635b in ompi_request_wait_completion (req=0x25540e8) at ../ompi/request/request.h:492
#7  0x00007fb3573963c4 in ompi_request_default_wait (req_ptr=0x7ffcb82a43c8, status=0x7ffcb82a43f0) at request/req_wait.c:40
#8  0x00007fb3574355f1 in PMPI_Wait (request=0x7ffcb82a43c8, status=0x7ffcb82a43f0) at wait.c:72
#9  0x0000000000402a8c in main (argc=<optimized out>, argv=<optimized out>) at osu_ireduce.c:136

It appears to be an invalid temporary buffer in libnbc; note the address target=0x254fbf0 in frame #0 of the backtrace.

@wenduwan
Contributor Author

wenduwan commented Nov 3, 2023

I'm not sure if this is a bug since technically we don't claim CUDA support in coll according to https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#what-kind-of-cuda-support-exists-in-open-mpi

@bosilca
Member

bosilca commented Nov 3, 2023

Let me correct what I said yesterday on Slack. All blocking collectives have accelerator support (not the nonblocking versions). If people are interested, the CUDA coll can be extended to provide support for the nonblocking collectives.
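
For reference, one way to check whether the cuda coll component is present in a given build is to list the coll components with ompi_info (command shown for illustration only; it prints one line per built coll component, cuda among them if it was compiled in):

$ ompi_info | grep "MCA coll"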

@wenduwan
Contributor Author

wenduwan commented Nov 3, 2023

Thanks @bosilca for the discussion on Slack.

In my understanding, HAN builds its own blocking collectives on top of non-blocking collectives from other coll components - so does that mean HAN in general does not guarantee CUDA support?

wenduwan self-assigned this Nov 7, 2023
@wenduwan
Contributor Author

wenduwan commented Nov 9, 2023

Update 11/9

So far I've been focusing on the reduction collectives, i.e. MPI_Reduce, MPI_Ireduce, MPI_Allreduce and MPI_Iallreduce. They share a common failure mode in the corresponding OMB benchmarks, e.g. osu_reduce -d cuda.

I confirmed that both blocking and non-blocking versions have this problem, depending on the coll module.

Non-blocking reduction

Both adapt and libnbc provide ireduce, and both produce segfaults similar to the one in the original post.

Blocking reduction

Both tuned and adapt produce segfaults similar to the one in the original post (illustrative commands for pinning a specific component are shown below).
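
The commands below show one way to pin a specific component while reproducing. They are illustrative only; the exact MCA variable names and priority values may need adjusting for a given build:

# Prefer adapt for the blocking reduction (priority value illustrative)
$ mpirun -n 4 --mca pml ob1 --mca coll_adapt_priority 100 osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda

# Restrict the non-blocking reduction to libnbc
$ mpirun -n 4 --mca pml ob1 --mca coll self,basic,libnbc osu-micro-benchmarks/mpi/collective/osu_ireduce -d cuda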

Cause

The ompi_op_reduce function is not accelerator-aware and assumes that both source and target are host buffers. For certain reduce operations, e.g. SUM as used in OMB, it calls into subroutines such as ompi_op_avx_2buff_sum_int8_t_avx512. In the example above this caused a segfault, since the source buffer is allocated on the CUDA device (a minimal standalone illustration follows the code excerpt below).

ompi/ompi/op/op.h

Lines 503 to 538 in b816edf

static inline void ompi_op_reduce(ompi_op_t * op, void *source,
                                  void *target, size_t full_count,
                                  ompi_datatype_t * dtype)
{
    MPI_Fint f_dtype, f_count;
    int count = full_count;

    /*
     * If the full_count is > INT_MAX then we need to call the reduction op
     * in iterations of counts <= INT_MAX since it has an `int *len`
     * parameter.
     *
     * Note: When we add BigCount support then we can distinguish between
     * a reduction operation with `int *len` and `MPI_Count *len`. At which
     * point we can avoid this loop.
     */
    if( OPAL_UNLIKELY(full_count > INT_MAX) ) {
        size_t done_count = 0, shift;
        int iter_count;
        ptrdiff_t ext, lb;
        ompi_datatype_get_extent(dtype, &lb, &ext);
        while(done_count < full_count) {
            if(done_count + INT_MAX > full_count) {
                iter_count = full_count - done_count;
            } else {
                iter_count = INT_MAX;
            }
            shift = done_count * ext;
            // Recurse one level in iterations of 'int'
            ompi_op_reduce(op, (char*)source + shift, (char*)target + shift, iter_count, dtype);
            done_count += iter_count;
        }
        return;
    }
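
As a standalone illustration of the failing access pattern (hypothetical code, not taken from OMB or Open MPI): host code that reads a cudaMalloc'd buffer directly, the way the AVX kernel in frame #0 does, faults with the same "Invalid permissions" signal code.

/* Hypothetical reproducer: host-side read of device memory.
 * Build with e.g. `nvcc repro.c -o repro` (file name illustrative). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    float *dev_buf = NULL;

    if (cudaSuccess != cudaMalloc((void **)&dev_buf, sizeof(float))) {
        return 1;
    }

    /* Dereferencing a device pointer from host code is expected to raise
     * SIGSEGV with an "invalid permissions" si_code, matching the signal
     * handler output in the original report. */
    float val = dev_buf[0];

    printf("%f\n", val);
    cudaFree(dev_buf);
    return 0;
}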

Implication

Based on the above finding, it is not straightforward to declare CUDA support in coll, due to implementation differences between the collective modules. Depending on the user tuning, e.g. which module/algorithm is selected, an application might get away with running most collectives on CUDA devices, except for non-blocking reductions; however, a change in the tuning could break the application just as easily.

Mitigation

As far as ompi_op_reduce is concerned, we could possibly introduce accelerator awareness to detect device-resident source and target buffers. This might involve additional memory copies between device and host, or some smart on-device reduction tricks. We should be wary of performance impacts, especially for the non-accelerator happy path. A rough sketch of one possible guard follows.
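
A rough sketch of what such a guard could look like, assuming the OPAL accelerator framework's check_addr and mem_copy interfaces (the latter also appears in the example at the end of this thread). Error handling is simplified, the datatype is assumed contiguous, and the target is assumed to be a host buffer, as in the libnbc temporary-buffer case above; this is not existing Open MPI code.

#include <stdlib.h>
#include "opal/mca/accelerator/accelerator.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/op/op.h"

/* Hypothetical helper, for illustration only. */
static int ompi_op_reduce_device_safe(ompi_op_t *op, void *source, void *target,
                                      size_t count, ompi_datatype_t *dtype)
{
    ptrdiff_t lb, ext;
    uint64_t flags;
    int dev_id, ret;
    void *host_src = source;

    ompi_datatype_get_extent(dtype, &lb, &ext);
    const size_t len = (size_t) ext * count;   /* contiguous datatype assumed */

    /* check_addr() is expected to return >0 for device memory, 0 for host. */
    ret = opal_accelerator.check_addr(source, &dev_id, &flags);
    if (ret < 0) {
        return ret;
    }
    if (ret > 0) {
        /* Stage the device source into a temporary host buffer (D2H copy). */
        host_src = malloc(len);
        if (NULL == host_src) {
            return OMPI_ERR_OUT_OF_RESOURCE;
        }
        ret = opal_accelerator.mem_copy(MCA_ACCELERATOR_NO_DEVICE_ID,
                                        MCA_ACCELERATOR_NO_DEVICE_ID,
                                        host_src, source, len,
                                        MCA_ACCELERATOR_TRANSFER_DTOH);
        if (OPAL_SUCCESS != ret) {
            free(host_src);
            return ret;
        }
    }

    /* The CPU reduction kernels (e.g. the AVX512 path) now only see host memory. */
    ompi_op_reduce(op, host_src, target, count, dtype);

    if (host_src != source) {
        free(host_src);
    }
    return OMPI_SUCCESS;
}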

Unknowns

I have only observed the reduction issue so far. I'm not sure what else could cause collectives to fail on CUDA.

@wenduwan
Contributor Author

An example of protecting ompi_op_reduce from illegal device memory access:

if (&ompi_mpi_op_no_op.op != op) {
    /* Cannot call ompi_op_reduce on a device buffer for non managed
     * memory. Copy into temporary buffer first */
    ret = osc_rdma_is_accel(source);
    if (0 < ret) {
        tmp_source = malloc(len);
        ret = opal_accelerator.mem_copy(MCA_ACCELERATOR_NO_DEVICE_ID, MCA_ACCELERATOR_NO_DEVICE_ID,
                                        tmp_source, source, len, MCA_ACCELERATOR_TRANSFER_DTOH);
        ompi_op_reduce (op, (void *) tmp_source, ptr, source_count, source_datatype);
        free(tmp_source);
    } else if (0 == ret) {
        /* NTH: need to cast away const for the source buffer. the buffer will not be modified by this call */
        ompi_op_reduce (op, (void *) source, ptr, source_count, source_datatype);
    } else {
        return ret;
    }
    return ompi_osc_rdma_put_contig (sync, peer, target_address, target_handle, ptr, len, request);
}
