coll collectives segfault on CUDA buffers #12045
Comments
I'm not sure if this is a bug, since technically we don't claim CUDA support in …
Let me correct what I said yesterday on Slack. All blocking collectives have accelerator support (not the nonblocking versions). If people are interested, the CUDA coll can be extended to provide support for the nonblocking collectives.
Thanks @bosilca for the discussion on Slack. In my understanding, HAN utilizes non-blocking collectives from other coll components for its own blocking collectives, so does that mean HAN in general does not guarantee CUDA support?
Update 11/9

So far I've been focusing on reductive collectives. I confirmed that both blocking and non-blocking versions have this problem, depending on the coll module.

Non-blocking reduction

Both …

Blocking reduction

Both …

Cause

The … (see lines 503 to 538 in b816edf)
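To make the failure mode concrete, here is a minimal sketch of the pattern described above. This is not the actual libnbc code, and all names in it are hypothetical: the point is that the temporary buffer comes from a plain host malloc and the reduction operator dereferences the user's buffer on the CPU, which faults when that buffer lives in device memory.

```c
/* Hypothetical sketch of the failure pattern, not actual Open MPI code. */
#include <stdlib.h>
#include <string.h>

static void reduce_sum_double(const void *in, void *inout, int count)
{
    const double *a = (const double *)in;  /* may be a CUDA device pointer */
    double *b = (double *)inout;
    for (int i = 0; i < count; ++i)
        b[i] += a[i];  /* host-side dereference: faults on device memory */
}

void toy_reduce_step(const void *sendbuf, int count)
{
    /* Temp buffer comes from plain malloc, i.e. host memory only. */
    double *tmp = malloc(count * sizeof(double));
    memset(tmp, 0, count * sizeof(double));
    /* If sendbuf was allocated with cudaMalloc, this segfaults. */
    reduce_sum_double(sendbuf, tmp, count);
    free(tmp);
}
```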
Implication

Based on the above finding, it is not straightforward to declare CUDA support in …

Mitigation

As far as …

Unknowns

I have only observed the reduction issue so far. I'm not sure what else could cause collectives to fail on CUDA.
An example of protecting …: ompi/ompi/mca/osc/rdma/osc_rdma_accumulate.c, lines 496 to 514 in 76b91ce.
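The referenced guard is not reproduced here, but its general shape is to ask the accelerator framework what kind of memory a pointer refers to before touching it on the host. A minimal sketch, assuming the opal_accelerator check_addr interface from the v5 accelerator framework; the header path and the exact return-value semantics are my assumptions, not a quote of the osc/rdma code:

```c
/* Sketch only; the real check lives in osc_rdma_accumulate.c (see above). */
#include "opal/mca/accelerator/accelerator.h"
#include <stdbool.h>

static bool buffer_is_device_memory(const void *buf)
{
    uint64_t flags = 0;
    int dev_id = 0;

    /* check_addr is expected to return a positive value for device
     * (e.g. CUDA) memory, 0 for host memory, and a negative error
     * code on failure. */
    int rc = opal_accelerator.check_addr(buf, &dev_id, &flags);
    return rc > 0;
}
```

A component could use such a check to fall back to a host-side staging copy, or to disqualify itself for device buffers, instead of dereferencing the pointer directly.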
Background information
While testing Open MPI 5 using OMB, I observed segfaults when running some collective benchmarks with CUDA buffers.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 5: https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.0.tar.bz2
OMB: http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Configure Open MPI
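For reference, a typical CUDA-enabled Open MPI 5 build looks something like this; the install prefix and CUDA path are placeholders, not necessarily the exact line used here:

```sh
./configure --prefix=$HOME/opt/openmpi-5.0.0 --with-cuda=/usr/local/cuda
make -j
make install
```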
Configure OMB
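Similarly, a typical CUDA-enabled OMB build looks something like the following; the CUDA include and library paths are placeholders:

```sh
./configure CC=mpicc CXX=mpicxx --enable-cuda \
    --with-cuda-include=/usr/local/cuda/include \
    --with-cuda-libpath=/usr/local/cuda/lib64
make -j
```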
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running
pml ob1
Details of the problem
Here is an example with osu_ireduce on 4 ranks on a single node.
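For reference, a reproduction along these lines should trigger it; the binary location depends on how OMB was installed, -d cuda selects device (CUDA) buffers in OMB, and pml ob1 matches the setup noted above:

```sh
mpirun -np 4 --mca pml ob1 ./osu_ireduce -d cuda
```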
Backtrace:
It appears to be an invalid temp buffer in libnbc; note the address target=0x254fbf0.