MPI_Barrier hangs in 10-20h long runs #3042

Closed
Evgueni-Petrov-aka-espetrov opened this issue Feb 27, 2017 · 15 comments

Evgueni-Petrov-aka-espetrov commented Feb 27, 2017

Hi,

After 10-20 hours, MPI_Barrier hangs with the following stack.

It looks like an error happens during mca_btl_tcp_proc_create, after which mca_btl_tcp_proc_destruct cannot proceed because mca_btl_tcp_component.tcp_lock is not recursive.
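For illustration only (this is not Open MPI code), a minimal sketch of why re-locking a non-recursive pthread mutex from the same thread blocks forever, which matches the proc_create -> proc_destruct path in the stack below:

#include <pthread.h>
#include <stdio.h>

/* Default pthread mutexes are not recursive: a second lock from the
 * same thread blocks forever, which is the same situation as the
 * proc_create -> proc_destruct path in the stack below. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void destruct_path(void)
{
    pthread_mutex_lock(&lock);   /* blocks: the lock is already held by create_path() */
    pthread_mutex_unlock(&lock);
}

static void create_path(void)
{
    pthread_mutex_lock(&lock);
    destruct_path();             /* the error path tears down the object while the lock is held */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    create_path();               /* never returns */
    printf("unreachable\n");
    return 0;
}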

The stack below has been captured using OMPI 2.0.2a1, and the OMPI 2.0.2 release does not seem to fix this issue.

I would be grateful for any ideas about the root cause of this error -- wrong host config, etc.

Unfortunately (as usual), the application is too large and I do not have a small reproducer.

Here is some information that seems relevant.

MPI is used to distribute jobs to CUDA cards inside one host.
There are no inter-host communications.
Each MPI process is multithreaded.
MPI is called from different threads, but the calls are serialized at the application level.
MPI is built with --enable-thread-multiple, and the app calls MPI_Init_thread(MPI_THREAD_MULTIPLE); a simplified sketch follows below.
The application uses 50-100 communicators, all created in the master thread at application launch; they stay alive while the application is running.
Each communicator contains all MPI processes.
All processes get stuck with identical stacks.
I used gdb to attach to the processes after they got stuck and verified that the processes passed the correct communicator to MPI_Barrier.
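For reference, a minimal sketch (not the actual application source) of the thread-level request, including the check that MPI_THREAD_MULTIPLE is actually granted:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;

    /* Request full multi-threading; in the real application the MPI calls
     * are additionally serialized at the application level. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}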

#0  __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f0e250c7649 in _L_lock_909 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f0e250c7470 in __GI___pthread_mutex_lock (
    mutex=0x7f0e1ad547f8 <mca_btl_tcp_component+440>)
    at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f0e1ab4f76d in mca_btl_tcp_proc_destruct ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#4  0x00007f0e1ab4fce1 in mca_btl_tcp_proc_create ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#5  0x00007f0e1ab48e2c in mca_btl_tcp_add_procs ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#6  0x00007f0e1ab50270 in mca_btl_tcp_proc_lookup ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#7  0x00007f0e1ab4ac15 in mca_btl_tcp_component_recv_handler ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#8  0x00007f0e238c4108 in event_process_active_single_queue ()
   from /home/espetrov/lib/libopen-pal.so.20
#9  0x00007f0e238c437c in event_process_active ()
   from /home/espetrov/lib/libopen-pal.so.20
#10 0x00007f0e238c49cb in opal_libevent2022_event_base_loop ()
   from /home/espetrov/lib/libopen-pal.so.20
#11 0x00007f0e23881894 in opal_progress ()
   from /home/espetrov/lib/libopen-pal.so.20
#12 0x00007f0e23886c2d in sync_wait_mt ()
   from /home/espetrov/lib/libopen-pal.so.20
#13 0x00007f0e24a042b9 in ompi_request_default_wait ()
   from /home/espetrov/lib/libmpi.so.20
#14 0x00007f0e24a5b97d in ompi_coll_base_barrier_intra_recursivedoubling ()
   from /home/espetrov/lib/libmpi.so.20
#15 0x00007f0e24a18284 in PMPI_Barrier () from /home/espetrov/lib/libmpi.so.20
bash$ netstat -i
Kernel Interface table
Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       8950 0  25787150      0      1 0       7448504      0      0      0 BMRU
lo        65536 0   1265039      0      0 0       1265039      0      0      0 LRU
vlan762    8950 0   1154868      0      0 0         91581      0      0      0 BMRU

EDIT: Added verbatim blocks

bosilca (Member) commented Feb 27, 2017

Thanks for your analysis, you have correctly pinpointed the issue. During the process of adding a new proc, initiated due to a remote request, an error condition is reached and we decide to release the newly created proc. That path clearly has a bug, as the lock protecting the proc_table was not recursive. A patch is attached here.

However, the real question is why we decided to drop the newly created proc. Are you seeing any additional information in your output?

Please test the following patch:

diff --git a/opal/mca/btl/tcp/btl_tcp_proc.c b/opal/mca/btl/tcp/btl_tcp_proc.c
index 68ea3f020c..66be012fb9 100644
--- a/opal/mca/btl/tcp/btl_tcp_proc.c
+++ b/opal/mca/btl/tcp/btl_tcp_proc.c
@@ -125,16 +125,18 @@ mca_btl_tcp_proc_t* mca_btl_tcp_proc_create(opal_proc_t* proc)
         return btl_proc;
     }

-    do {
+    do {  /* This loop is only necessary so that we can break out of the serial code */
         btl_proc = OBJ_NEW(mca_btl_tcp_proc_t);
         if(NULL == btl_proc) {
             rc = OPAL_ERR_OUT_OF_RESOURCE;
             break;
         }

-        btl_proc->proc_opal = proc;
-
-        OBJ_RETAIN(btl_proc->proc_opal);
+        /* Retain the proc, but don't store the ref into the btl_proc just yet. This
+         * provides a way to release the btl_proc in case of failure without having to
+         * unlock the mutex.
+         */
+        OBJ_RETAIN(proc);

         /* lookup tcp parameters exported by this proc */
         OPAL_MODEX_RECV(rc, &mca_btl_tcp_component.super.btl_version,
@@ -184,12 +186,14 @@ mca_btl_tcp_proc_t* mca_btl_tcp_proc_create(opal_proc_t* proc)
     } while (0);

     if (OPAL_SUCCESS == rc) {
+        btl_proc->proc_opal = proc;  /* link with the proc */
         /* add to hash table of all proc instance. */
         opal_proc_table_set_value(&mca_btl_tcp_component.tcp_procs,
                                   proc->proc_name, btl_proc);
     } else {
         if (btl_proc) {
-            OBJ_RELEASE(btl_proc);
+            OBJ_RELEASE(btl_proc);  /* release the local proc */
+            OBJ_RELEASE(proc);      /* and the ref on the OMPI proc */
             btl_proc = NULL;
         }
     }

Evgueni-Petrov-aka-espetrov (Author) commented Feb 28, 2017

The application log and dmesg contain only the expected information.
The error can only come from OPAL_MODEX_RECV or malloc.
What may go wrong in OPAL_MODEX_RECV?

I applied the patch + debug printf with rc, and started the app with --mca btl_base_verbose 30.

I will let you know if the application runs without deadlocks for 20h, or I will provide more info about the deadlock.

Evgueni-Petrov-aka-espetrov (Author) commented Mar 1, 2017

As of now, the app has been running for 21h.

According to the log, control never passed through the @@ -184,12 +186,14 @@ hunk of mca_btl_tcp_proc_create in btl_tcp_proc.c.

There are no warnings or debug info from MPI in the log.

bosilca (Member) commented Mar 1, 2017

Is that a good or a bad sign?

Evgueni-Petrov-aka-espetrov (Author) commented Mar 1, 2017

The patch seems to fix the deadlock.

To simulate errors, I put the following code inside the do {...} while (0) in mca_btl_tcp_proc_create.
Without the fix, the MPI processes hang at startup if /home/espetrov/error.txt exists.
With the fix, the MPI processes start normally even if /home/espetrov/error.txt exists.
Please confirm that errors in mca_btl_tcp_proc_create are not "fatal".

I also see that mca_btl_tcp_proc_create is not called again for a long time after the first call at startup.
Why would mca_btl_tcp_proc_create need to be called again?

    if (fopen("/home/espetrov/error.txt", "r")) {
        rc = OPAL_ERROR;
        break;
    }

bosilca (Member) commented Mar 1, 2017

I pushed the fix in master (ec4a235). @jsquyres will create the PR for v2.x.

I am wondering about the same question regarding mca_btl_tcp_proc_create. One possible reason is that you have multiple IPs and we open connections as needed. Another is simply that we chose a different barrier algorithm on the communicator and therefore need to set up new connections.

jsquyres (Member) commented Mar 2, 2017

Fixed via ec4a235. Closing this issue because the problem is fixed (but feel free to keep conversing here if there are more questions/comments).

Evgueni-Petrov-aka-espetrov (Author) commented Mar 3, 2017

Many thanks for bearing with me.
"ip address show" gives "inet scope global" and "inet6 scope global" addresses for eth1.
Is this considered multiple IPs?

Evgueni-Petrov-aka-espetrov (Author) commented:

Another question: what happens when Open MPI sees multiple interfaces? E.g. a vlan with an "inet6 scope global" address and eth1 with both "inet scope global" and "inet6 scope global" addresses?

bosilca (Member) commented Mar 3, 2017

If no restrictions have been provided via MCA parameters (btl_tcp_if_include or btl_tcp_if_exclude), then Open MPI is supposed to use all non-local interfaces.
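For example (process count and application name below are placeholders; eth1 is just the interface from the netstat output above), restricting the TCP BTL to a single interface looks like:

bash$ mpirun --mca btl_tcp_if_include eth1 -np 4 ./your_app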

Evgueni-Petrov-aka-espetrov (Author) commented:

Does Open MPI use IPv4 and IPv6 connections simultaneously?

ggouaillardet (Contributor) commented:

IIRC, IPv6 is disabled unless you configure with --enable-ipv6.

Evgueni-Petrov-aka-espetrov (Author) commented:

Since we do not build with --enable-ipv6, multiple IPs are not relevant here.

Then it must be that Open MPI switched to a different barrier algorithm.
But what could cause that?
Does Open MPI collect performance data at run time and tune itself, e.g. switch from one barrier algorithm to another?

Evgueni-Petrov-aka-espetrov (Author) commented:

Ping

ggouaillardet (Contributor) commented:

The coll/tuned module selects the collective algorithm based on communicator and message sizes.
In the case of MPI_Barrier, the decision is based only on the communicator size.
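If it helps to rule algorithm switching in or out, the tuned selection can be pinned with coll/tuned MCA parameters; the lines below are an assumed example (not from this thread), and the exact parameter names for your build can be verified with ompi_info:

bash$ ompi_info --param coll tuned --level 9 | grep barrier
bash$ mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_barrier_algorithm 1 -np 4 ./your_app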
