coll collectives segfault on CUDA buffers #12045

Open
wenduwan opened this issue Nov 3, 2023 · 5 comments

@wenduwan
Contributor

wenduwan commented Nov 3, 2023

Background information

While testing Open MPI 5 with OMB, I observed segfaults when running some collective benchmarks with CUDA buffers.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 5: https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.0.tar.bz2
OMB: http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Configure Open MPI

$ ./configure --enable-debug --with-cuda=/usr/local/cuda --with-cuda-libdir=/lib64

Configure OMB

$ ./configure --with-cuda=/usr/local/cuda --enable-cuda CC=/path/to/ompi5 CXX=/path/to/ompi5
$ PATH=/usr/local/cuda/bin:$PATH make -j install

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2 (also reproducible on Ubuntu 22.04), with CUDA 12.2 and the 535 driver installed.
  • Computer hardware: p4d.24xlarge instance with A100 GPU
  • Network type: EFA (also reproducible with pml ob1)

Details of the problem

Here is an example with osu_ireduce on 4 ranks on a single node.

$ mpirun -n 4 --mca pml ob1 --mca coll_base_verbose 1 osu-micro-benchmarks/mpi/collective/osu_ireduce -d cuda
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:component_open: done!
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07273] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07272] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07271] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07270] (0/MPI_COMM_WORLD): no underlying reduce; disqualifying myself
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_init called.
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_init Tuned is in use
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:module_tuned query called
[ip-172-31-31-62.us-west-2.compute.internal:07273] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07272] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07272] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07271] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07270] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07270] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
# OSU MPI-CUDA Non-blocking Reduce Latency Test
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
[ip-172-31-31-62.us-west-2.compute.internal:07271] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07271] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62.us-west-2.compute.internal:07273] ompi_coll_tuned_barrier_intra_dec_fixed com_size 4
[ip-172-31-31-62.us-west-2.compute.internal:07273] coll:tuned:barrier_intra_do_this selected algorithm 1 topo fanin/out0
[ip-172-31-31-62:07270] *** Process received signal ***
[ip-172-31-31-62:07270] Signal: Segmentation fault (11)
[ip-172-31-31-62:07270] Signal code: Invalid permissions (2)
[ip-172-31-31-62:07270] Failing at address: 0x7fb321200000
[ip-172-31-31-62:07272] *** Process received signal ***
[ip-172-31-31-62:07272] Signal: Segmentation fault (11)
[ip-172-31-31-62:07272] Signal code: Invalid permissions (2)
[ip-172-31-31-62:07272] Failing at address: 0x7fe881200000
[ip-172-31-31-62:07270] [ 0] /usr/lib/habanalabs/libhl_logger.so(_Z13signalHandleriP9siginfo_tPv+0x18e)[0x7fb32c69c7be]
[ip-172-31-31-62:07270] [ 1] /lib64/libpthread.so.0(+0x118e0)[0x7fb356ed08e0]
[ip-172-31-31-62:07270] [ 2] [ip-172-31-31-62:07272] [ 0] /usr/lib/habanalabs/libhl_logger.so(_Z13signalHandleriP9siginfo_tPv+0x18e)[0x7fe8687417be]
[ip-172-31-31-62:07272] [ 1] /lib64/libpthread.so.0(+0x118e0)[0x7fe8b60b58e0]
[ip-172-31-31-62:07272] [ 2] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x389c1c)[0x7fb35767cc1c]
[ip-172-31-31-62:07270] [ 3] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x389c1c)[0x7fe8b6861c1c]
[ip-172-31-31-62:07272] [ 3] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d3411)[0x7fb3574c6411]
[ip-172-31-31-62:07270] [ 4] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d3411)[0x7fe8b66ab411]
[ip-172-31-31-62:07272] [ 4] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d4e89)[0x7fb3574c7e89]
[ip-172-31-31-62:07270] [ 5] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0x1d4e89)[0x7fe8b66ace89]
[ip-172-31-31-62:07272] [ 5] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(NBC_Progress+0x3bc)[0x7fb3574c77eb]
[ip-172-31-31-62:07270] [ 6] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(NBC_Progress+0x3bc)[0x7fe8b66ac7eb]
[ip-172-31-31-62:07272] [ 6] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_coll_libnbc_progress+0xc3)[0x7fb3574c508d]
[ip-172-31-31-62:07270] [ 7] /home/ec2-user/openmpi-5.0.0/install/lib/libopen-pal.so.80(opal_progress+0x30)[0x7fb3563cfcc6]
[ip-172-31-31-62:07270] [ 8] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_coll_libnbc_progress+0xc3)[0x7fe8b66aa08d]
[ip-172-31-31-62:07272] [ 7] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0xa335b)[0x7fb35739635b]
[ip-172-31-31-62:07270] [ 9] /home/ec2-user/openmpi-5.0.0/install/lib/libopen-pal.so.80(opal_progress+0x30)[0x7fe8b55b4cc6]
[ip-172-31-31-62:07272] [ 8] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(+0xa335b)[0x7fe8b657b35b]
[ip-172-31-31-62:07272] [ 9] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7fb3573963c4]
[ip-172-31-31-62:07270] [10] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(ompi_request_default_wait+0x27)[0x7fe8b657b3c4]
[ip-172-31-31-62:07272] [10] /home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(MPI_Wait+0x138)[0x7fb3574355f1]
[ip-172-31-31-62:07270] [11] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x402a8c]
[ip-172-31-31-62:07270] [12] /lib64/libc.so.6(__libc_start_main+0xea)[0x7fb356b3313a]
[ip-172-31-31-62:07270] [13] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x40332a]
/home/ec2-user/openmpi-5.0.0/install/lib/libmpi.so.40(MPI_Wait+0x138)[0x7fe8b661a5f1]
[ip-172-31-31-62:07272] [11] /home/ec2-user/osu-micro-benchmarks/mpi/collective/osu_ireduce[0x402a8c]
[ip-172-31-31-62:07272] [12] [ip-172-31-31-62:07270] *** End of error message ***

Backtrace:

#0  0x00007fb35767cc1c in ompi_op_avx_2buff_add_float_avx512 (_in=0x7fb321200000, _out=0x254fbf0, count=0x7ffcb82a3fc4, dtype=0x7ffcb82a3f88, module=0x181ff30)
    at op_avx_functions.c:680
#1  0x00007fb3574c6411 in ompi_op_reduce (op=0x62c760 <ompi_mpi_op_sum>, source=0x7fb321200000, target=0x254fbf0, full_count=1, dtype=0x62e3a0 <ompi_mpi_float>)
    at ../../../../ompi/op/op.h:572
#2  0x00007fb3574c7e89 in NBC_Start_round (handle=0x25540e8) at nbc.c:539
#3  0x00007fb3574c77eb in NBC_Progress (handle=0x25540e8) at nbc.c:419
#4  0x00007fb3574c508d in ompi_coll_libnbc_progress () at coll_libnbc_component.c:445
#5  0x00007fb3563cfcc6 in opal_progress () at runtime/opal_progress.c:224
#6  0x00007fb35739635b in ompi_request_wait_completion (req=0x25540e8) at ../ompi/request/request.h:492
#7  0x00007fb3573963c4 in ompi_request_default_wait (req_ptr=0x7ffcb82a43c8, status=0x7ffcb82a43f0) at request/req_wait.c:40
#8  0x00007fb3574355f1 in PMPI_Wait (request=0x7ffcb82a43c8, status=0x7ffcb82a43f0) at wait.c:72
#9  0x0000000000402a8c in main (argc=<optimized out>, argv=<optimized out>) at osu_ireduce.c:136

It appears to be an invalid temporary buffer in libnbc; note the address target=0x254fbf0 in frame #0 of the backtrace.

@wenduwan
Contributor Author

wenduwan commented Nov 3, 2023

I'm not sure if this is a bug since technically we don't claim CUDA support in coll according to https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#what-kind-of-cuda-support-exists-in-open-mpi

@bosilca
Member

bosilca commented Nov 3, 2023

Let me correct what I said yesterday on Slack. All blocking collectives have accelerator support (not the nonblocking versions). If people are interested, the CUDA coll can be extended to provide support for the nonblocking collectives.
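
For reference, one way to check whether the cuda coll component is present in a given build is to list the coll components with ompi_info (command shown for illustration only; it prints one line per built coll component, cuda among them if it was compiled in):

$ ompi_info | grep "MCA coll"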

@wenduwan
Contributor Author

wenduwan commented Nov 3, 2023

Thanks @bosilca for the discussion on Slack.

In my understanding, HAN builds its own blocking collectives on top of non-blocking collectives from other coll components - so does that mean HAN in general does not guarantee CUDA support?

wenduwan self-assigned this Nov 7, 2023
@wenduwan
Contributor Author

wenduwan commented Nov 9, 2023

Update 11/9

So far I've been focusing on the reduction collectives, i.e. MPI_Reduce, MPI_Ireduce, MPI_Allreduce and MPI_Iallreduce. They share a common failure mode in the corresponding OMB benchmarks, e.g. osu_reduce -d cuda.

I confirmed that both blocking and non-blocking versions have this problem, depending on the coll module.

Non-blocking reduction

Both adapt and libnbc provide ireduce, and both produce segfaults similar to the one in the original post.

Blocking reduction

Both tuned and adapt produce segfaults similar to the one in the original post (illustrative commands for pinning a specific component are shown below).
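
The commands below show one way to pin a specific component while reproducing. They are illustrative only; the exact MCA variable names and priority values may need adjusting for a given build:

# Prefer adapt for the blocking reduction (priority value illustrative)
$ mpirun -n 4 --mca pml ob1 --mca coll_adapt_priority 100 osu-micro-benchmarks/mpi/collective/osu_reduce -d cuda

# Restrict the non-blocking reduction to libnbc
$ mpirun -n 4 --mca pml ob1 --mca coll self,basic,libnbc osu-micro-benchmarks/mpi/collective/osu_ireduce -d cuda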

Cause

The ompi_op_reduce function is not accelerator-aware and assumes that both source and target are host buffers. For certain reduce operations, e.g. SUM as used in OMB, it calls into subroutines such as ompi_op_avx_2buff_sum_int8_t_avx512. In the example above this caused a segfault, since the source buffer is allocated on the CUDA device (a minimal standalone illustration follows the code excerpt below).

ompi/ompi/op/op.h

Lines 503 to 538 in b816edf

static inline void ompi_op_reduce(ompi_op_t * op, void *source,
                                  void *target, size_t full_count,
                                  ompi_datatype_t * dtype)
{
    MPI_Fint f_dtype, f_count;
    int count = full_count;

    /*
     * If the full_count is > INT_MAX then we need to call the reduction op
     * in iterations of counts <= INT_MAX since it has an `int *len`
     * parameter.
     *
     * Note: When we add BigCount support then we can distinguish between
     * a reduction operation with `int *len` and `MPI_Count *len`. At which
     * point we can avoid this loop.
     */
    if( OPAL_UNLIKELY(full_count > INT_MAX) ) {
        size_t done_count = 0, shift;
        int iter_count;
        ptrdiff_t ext, lb;
        ompi_datatype_get_extent(dtype, &lb, &ext);
        while(done_count < full_count) {
            if(done_count + INT_MAX > full_count) {
                iter_count = full_count - done_count;
            } else {
                iter_count = INT_MAX;
            }
            shift = done_count * ext;
            // Recurse one level in iterations of 'int'
            ompi_op_reduce(op, (char*)source + shift, (char*)target + shift, iter_count, dtype);
            done_count += iter_count;
        }
        return;
    }
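
As a standalone illustration of the failing access pattern (hypothetical code, not taken from OMB or Open MPI): host code that reads a cudaMalloc'd buffer directly, the way the AVX kernel in frame #0 does, faults with the same "Invalid permissions" signal code.

/* Hypothetical reproducer: host-side read of device memory.
 * Build with e.g. `nvcc repro.c -o repro` (file name illustrative). */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    float *dev_buf = NULL;

    if (cudaSuccess != cudaMalloc((void **)&dev_buf, sizeof(float))) {
        return 1;
    }

    /* Dereferencing a device pointer from host code is expected to raise
     * SIGSEGV with an "invalid permissions" si_code, matching the signal
     * handler output in the original report. */
    float val = dev_buf[0];

    printf("%f\n", val);
    cudaFree(dev_buf);
    return 0;
}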

Implication

Based on the above finding, it is not straightforward to declare CUDA support in coll, due to implementation differences between the collective modules. Depending on the user tuning, e.g. which module/algorithm is selected, an application might get away with running most collectives on CUDA devices, except for non-blocking reductions; however, a change in the tuning could break the application just as easily.

Mitigation

As far as ompi_op_reduce is concerned, we could possibly introduce accelerator awareness to detect device-resident source and target buffers. This might involve additional memory copies between device and host, or some smart on-device reduction tricks. We should be wary of performance impacts, especially for the non-accelerator happy path. A rough sketch of one possible guard follows.
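
A rough sketch of what such a guard could look like, assuming the OPAL accelerator framework's check_addr and mem_copy interfaces (the latter also appears in the example at the end of this thread). Error handling is simplified, the datatype is assumed contiguous, and the target is assumed to be a host buffer, as in the libnbc temporary-buffer case above; this is not existing Open MPI code.

#include <stdlib.h>
#include "opal/mca/accelerator/accelerator.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/op/op.h"

/* Hypothetical helper, for illustration only. */
static int ompi_op_reduce_device_safe(ompi_op_t *op, void *source, void *target,
                                      size_t count, ompi_datatype_t *dtype)
{
    ptrdiff_t lb, ext;
    uint64_t flags;
    int dev_id, ret;
    void *host_src = source;

    ompi_datatype_get_extent(dtype, &lb, &ext);
    const size_t len = (size_t) ext * count;   /* contiguous datatype assumed */

    /* check_addr() is expected to return >0 for device memory, 0 for host. */
    ret = opal_accelerator.check_addr(source, &dev_id, &flags);
    if (ret < 0) {
        return ret;
    }
    if (ret > 0) {
        /* Stage the device source into a temporary host buffer (D2H copy). */
        host_src = malloc(len);
        if (NULL == host_src) {
            return OMPI_ERR_OUT_OF_RESOURCE;
        }
        ret = opal_accelerator.mem_copy(MCA_ACCELERATOR_NO_DEVICE_ID,
                                        MCA_ACCELERATOR_NO_DEVICE_ID,
                                        host_src, source, len,
                                        MCA_ACCELERATOR_TRANSFER_DTOH);
        if (OPAL_SUCCESS != ret) {
            free(host_src);
            return ret;
        }
    }

    /* The CPU reduction kernels (e.g. the AVX512 path) now only see host memory. */
    ompi_op_reduce(op, host_src, target, count, dtype);

    if (host_src != source) {
        free(host_src);
    }
    return OMPI_SUCCESS;
}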

Unknowns

I have only observed the reduction issue so far. I'm not sure what else could cause collectives to fail on CUDA.

@wenduwan
Contributor Author

An example of protecting ompi_op_reduce from illegal device memory access:

if (&ompi_mpi_op_no_op.op != op) {
    /* Cannot call ompi_op_reduce on a device buffer for non managed
     * memory. Copy into temporary buffer first */
    ret = osc_rdma_is_accel(source);
    if (0 < ret) {
        tmp_source = malloc(len);
        ret = opal_accelerator.mem_copy(MCA_ACCELERATOR_NO_DEVICE_ID, MCA_ACCELERATOR_NO_DEVICE_ID,
                                        tmp_source, source, len, MCA_ACCELERATOR_TRANSFER_DTOH);
        ompi_op_reduce (op, (void *) tmp_source, ptr, source_count, source_datatype);
        free(tmp_source);
    } else if (0 == ret) {
        /* NTH: need to cast away const for the source buffer. the buffer will not be modified by this call */
        ompi_op_reduce (op, (void *) source, ptr, source_count, source_datatype);
    } else {
        return ret;
    }
    return ompi_osc_rdma_put_contig (sync, peer, target_address, target_handle, ptr, len, request);
}
