MPI_Barrier hangs in 10-20h long runs #3042
Thanks for your analysis, you have correctly pinpointed the issue. During the process of adding a new proc, initiated due to a remote request, an error condition is reached and we decide to release the newly created proc. That path clearly has a bug, as the lock protecting the proc_table was not recursive. A patch is attached here. However, the real question is why we decided to drop the newly created proc? Are you seeing additional information in your output? Please test the following patch:
diff --git a/opal/mca/btl/tcp/btl_tcp_proc.c b/opal/mca/btl/tcp/btl_tcp_proc.c
index 68ea3f020c..66be012fb9 100644
--- a/opal/mca/btl/tcp/btl_tcp_proc.c
+++ b/opal/mca/btl/tcp/btl_tcp_proc.c
@@ -125,16 +125,18 @@ mca_btl_tcp_proc_t* mca_btl_tcp_proc_create(opal_proc_t* proc)
return btl_proc;
}
- do {
+ do { /* This loop is only necessary so that we can break out of the serial code */
btl_proc = OBJ_NEW(mca_btl_tcp_proc_t);
if(NULL == btl_proc) {
rc = OPAL_ERR_OUT_OF_RESOURCE;
break;
}
- btl_proc->proc_opal = proc;
-
- OBJ_RETAIN(btl_proc->proc_opal);
+ /* Retain the proc, but don't store the ref into the btl_proc just yet. This
+ * provides a way to release the btl_proc in case of failure without having to
+ * unlock the mutex.
+ */
+ OBJ_RETAIN(proc);
/* lookup tcp parameters exported by this proc */
OPAL_MODEX_RECV(rc, &mca_btl_tcp_component.super.btl_version,
@@ -184,12 +186,14 @@ mca_btl_tcp_proc_t* mca_btl_tcp_proc_create(opal_proc_t* proc)
} while (0);
if (OPAL_SUCCESS == rc) {
+ btl_proc->proc_opal = proc; /* link with the proc */
/* add to hash table of all proc instance. */
opal_proc_table_set_value(&mca_btl_tcp_component.tcp_procs,
proc->proc_name, btl_proc);
} else {
if (btl_proc) {
- OBJ_RELEASE(btl_proc);
+ OBJ_RELEASE(btl_proc); /* release the local proc */
+ OBJ_RELEASE(proc); /* and the ref on the OMPI proc */
btl_proc = NULL;
}
}
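For readers following along: the failure path calls the destructor while the creating thread still holds mca_btl_tcp_component.tcp_lock, and the destructor tries to take the same lock again. A standalone sketch of that pattern in plain pthreads (hypothetical names, not Open MPI code) shows why a non-recursive mutex hangs here:

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER; /* non-recursive by default */

static void destroy_entry(void *entry)
{
    pthread_mutex_lock(&table_lock);   /* second acquisition by the same thread: blocks forever */
    /* ... remove the entry from the table ... */
    free(entry);
    pthread_mutex_unlock(&table_lock);
}

static void *create_entry(int error_condition)
{
    void *entry = malloc(64);
    pthread_mutex_lock(&table_lock);   /* first acquisition */
    if (error_condition) {
        destroy_entry(entry);          /* deadlock: table_lock is already held */
        entry = NULL;
    }
    pthread_mutex_unlock(&table_lock);
    return entry;
}

The patch above follows the comment in its diff: by not linking btl_proc to the proc until success, the failure branch can release the new object without having to unlock (or re-take) the mutex.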
The application log and dmesg contain only the expected information. I applied the patch plus a debug printf of rc, and started the app with --mca btl_base_verbose 30. I will let you know if the application runs without deadlocks for 20h, or will provide more info about the deadlock.
As of now, the app has been running for 21h. According to the log, control never passed through the patched error path (the @@ -184,12 +186,14 @@ hunk) in btl_tcp_proc.c. There are no warnings or debug info from MPI in the log.
Is that a good or a bad sign?
The fix seems to resolve the deadlock. I also see that mca_btl_tcp_proc_create is not called for a long period of time after the first call at startup. To simulate errors, I put the following code inside the do {...} while (0) in mca_btl_tcp_proc_create:
if (fopen("/home/espetrov/error.txt", "r")) {
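The pasted snippet above is cut off; a plausible full version of such an error-injection block (the specific error code is an assumption, and the leaked FILE* does not matter for a test) would be:

if (fopen("/home/espetrov/error.txt", "r")) {
    /* force the error path: any value other than OPAL_SUCCESS
     * sends control to the release branch after the loop */
    rc = OPAL_ERR_OUT_OF_RESOURCE;
    break;
}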
I pushed the fix to master (ec4a235). @jsquyres will create the PR for v2.x. I wonder about the same question regarding mca_btl_tcp_proc_create. One possible reason is if you have multiple IPs and we open the connections as needed. Or simply because we choose a different barrier algorithm on the communicator, and therefore we need to set up new connections.
Fixed via ec4a235. Closing this issue because the problem is fixed (but feel free to keep conversing here if there are more questions / comments).
Many thanks for bearing with me.
Another question: what happens when Open MPI sees multiple interfaces? E.g. a vlan with an "inet6 scope global" address and eth1 with "inet scope global" and "inet6 scope global" addresses?
If no restrictions have been provided via MCA parameters (btl_tcp_if_include or btl_tcp_if_exclude), then Open MPI is supposed to use all non-local interfaces.
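For example (illustrative command lines; the interface names are placeholders):

# use only eth1 for the TCP BTL
mpirun --mca btl_tcp_if_include eth1 ./app

# or keep everything except loopback and the vlan interface
mpirun --mca btl_tcp_if_exclude lo,vlan0 ./app

Note that the two parameters are mutually exclusive; set one or the other.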
Does Open MPI use IPv4 and IPv6 connections simultaneously?
IIRC, IPv6 is disabled unless you configure with --enable-ipv6.
Since we do not configure with --enable-ipv6, multiple IPs are not relevant here. Then it must be that Open MPI switched to a different barrier algorithm.
Ping
The coll/tuned module selects the collective algorithm based on the communicator and message sizes.
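If you want to test that hypothesis, coll/tuned can be pinned to a single barrier algorithm at run time (the algorithm number below is only an example; ompi_info lists the valid values):

mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_barrier_algorithm 1 ./app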
Hi,
After 10-20 hours, MPI_Barrier hangs with the following stack.
It looks like some error happens during mca_btl_tcp_proc_create, and then mca_btl_tcp_proc_destruct cannot proceed because mca_btl_tcp_component.tcp_lock is not recursive.
The stack below has been captured using OMPI 2.0.2a1, and the OMPI 2.0.2 release does not seem to fix this issue.
I would be grateful for any ideas about the root cause of this error (wrong host config, etc.).
Unfortunately (as usual), the application is too large and I do not have a small reproducer.
Here is some information that seems relevant.
MPI is used to distribute jobs to CUDA cards inside one host.
There are no inter-host communications.
Each MPI process is multithreaded.
MPI is called from different threads but the calls are serialized at the application level.
MPI is built with --enable-thread-multiple, and the app calls MPI_Init_thread(MPI_THREAD_MULTIPLE); a minimal sketch of this pattern is shown after these notes.
The application uses 50-100 communicators, which are created at application launch in the master thread and stay alive for the lifetime of the application.
Each communicator contains all MPI processes.
All processes get stuck with identical stack.
I used gdb to attach to the processes after they got stuck and verified that the processes passed the correct communicator to MPI_Barrier.
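As mentioned above, a minimal sketch of the initialization and barrier pattern described in these notes (not the actual application) looks like:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* request full multi-threaded support; the application serializes its own MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* every process enters the barrier on the same communicator */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}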
EDIT: Added verbatim blocks