MPI_Barrier hangs in 10-20h long runs #3042

Closed
Evgueni-Petrov-aka-espetrov opened this issue Feb 27, 2017 · 15 comments

Evgueni-Petrov-aka-espetrov commented Feb 27, 2017

Hi,

After 10-20 hours, MPI_Barrier hangs with the following stack.

It looks like an error happens during mca_btl_tcp_proc_create, after which mca_btl_tcp_proc_destruct cannot proceed because mca_btl_tcp_component.tcp_lock is not recursive.
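For illustration only (this is not Open MPI code), a minimal sketch of why re-locking a non-recursive pthread mutex from the same thread blocks forever, which matches the proc_create -> proc_destruct path in the stack below:

#include <pthread.h>
#include <stdio.h>

/* Default pthread mutexes are not recursive: a second lock from the
 * same thread blocks forever, which is the same situation as the
 * proc_create -> proc_destruct path in the stack below. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void destruct_path(void)
{
    pthread_mutex_lock(&lock);   /* blocks: the lock is already held by create_path() */
    pthread_mutex_unlock(&lock);
}

static void create_path(void)
{
    pthread_mutex_lock(&lock);
    destruct_path();             /* the error path tears down the object while the lock is held */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    create_path();               /* never returns */
    printf("unreachable\n");
    return 0;
}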

The stack below has been captured using OMPI 2.0.2a1, and the OMPI 2.0.2 release does not seem to fix this issue.

I would be grateful for any ideas about the root cause of this error -- wrong host config, etc.

Unfortunately (as usual), the application is too large and I do not have a small reproducer.

Here is some information that seems relevant.

MPI is used to distribute jobs to CUDA cards inside one host.
There are no inter-host communications.
Each MPI process is multithreaded.
MPI is called from different threads, but the calls are serialized at the application level.
MPI is built with --enable-thread-multiple, and the app calls MPI_Init_thread(MPI_THREAD_MULTIPLE); a simplified sketch follows below.
The application uses 50-100 communicators, all created in the master thread at application launch; they stay alive while the application is running.
Each communicator contains all MPI processes.
All processes get stuck with identical stacks.
I used gdb to attach to the processes after they got stuck and verified that the processes passed the correct communicator to MPI_Barrier.
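For reference, a minimal sketch (not the actual application source) of the thread-level request, including the check that MPI_THREAD_MULTIPLE is actually granted:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;

    /* Request full multi-threading; in the real application the MPI calls
     * are additionally serialized at the application level. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}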

#0  __lll_lock_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f0e250c7649 in _L_lock_909 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f0e250c7470 in __GI___pthread_mutex_lock (
    mutex=0x7f0e1ad547f8 <mca_btl_tcp_component+440>)
    at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f0e1ab4f76d in mca_btl_tcp_proc_destruct ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#4  0x00007f0e1ab4fce1 in mca_btl_tcp_proc_create ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#5  0x00007f0e1ab48e2c in mca_btl_tcp_add_procs ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#6  0x00007f0e1ab50270 in mca_btl_tcp_proc_lookup ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#7  0x00007f0e1ab4ac15 in mca_btl_tcp_component_recv_handler ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#8  0x00007f0e238c4108 in event_process_active_single_queue ()
   from /home/espetrov/lib/libopen-pal.so.20
#9  0x00007f0e238c437c in event_process_active ()
   from /home/espetrov/lib/libopen-pal.so.20
#10 0x00007f0e238c49cb in opal_libevent2022_event_base_loop ()
   from /home/espetrov/lib/libopen-pal.so.20
#11 0x00007f0e23881894 in opal_progress ()
   from /home/espetrov/lib/libopen-pal.so.20
#12 0x00007f0e23886c2d in sync_wait_mt ()
   from /home/espetrov/lib/libopen-pal.so.20
#13 0x00007f0e24a042b9 in ompi_request_default_wait ()
   from /home/espetrov/lib/libmpi.so.20
#14 0x00007f0e24a5b97d in ompi_coll_base_barrier_intra_recursivedoubling ()
   from /home/espetrov/lib/libmpi.so.20
#15 0x00007f0e24a18284 in PMPI_Barrier () from /home/espetrov/lib/libmpi.so.20
bash$ netstat -i
Kernel Interface table
Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1       8950 0  25787150      0      1 0       7448504      0      0      0 BMRU
lo        65536 0   1265039      0      0 0       1265039      0      0      0 LRU
vlan762    8950 0   1154868      0      0 0         91581      0      0      0 BMRU

EDIT: Added verbatim blocks

bosilca (Member) commented Feb 27, 2017

Thanks for your analysis, you have correctly pinpointed the issue. During the process of adding a new proc, initiated due to a remote request, an error condition is reached and we decide to release the newly created proc. That path clearly has a bug, as the lock protecting the proc_table was not recursive. A patch is attached here.

However, the real question is why we decided to drop the newly created proc. Are you seeing any additional information in your output?

Please test the following patch:

diff --git a/opal/mca/btl/tcp/btl_tcp_proc.c b/opal/mca/btl/tcp/btl_tcp_proc.c
index 68ea3f020c..66be012fb9 100644
--- a/opal/mca/btl/tcp/btl_tcp_proc.c
+++ b/opal/mca/btl/tcp/btl_tcp_proc.c
@@ -125,16 +125,18 @@ mca_btl_tcp_proc_t* mca_btl_tcp_proc_create(opal_proc_t* proc)
         return btl_proc;
     }

-    do {
+    do {  /* This loop is only necessary so that we can break out of the serial code */
         btl_proc = OBJ_NEW(mca_btl_tcp_proc_t);
         if(NULL == btl_proc) {
             rc = OPAL_ERR_OUT_OF_RESOURCE;
             break;
         }

-        btl_proc->proc_opal = proc;
-
-        OBJ_RETAIN(btl_proc->proc_opal);
+        /* Retain the proc, but don't store the ref into the btl_proc just yet. This
+         * provides a way to release the btl_proc in case of failure without having to
+         * unlock the mutex.
+         */
+        OBJ_RETAIN(proc);

         /* lookup tcp parameters exported by this proc */
         OPAL_MODEX_RECV(rc, &mca_btl_tcp_component.super.btl_version,
@@ -184,12 +186,14 @@ mca_btl_tcp_proc_t* mca_btl_tcp_proc_create(opal_proc_t* proc)
     } while (0);

     if (OPAL_SUCCESS == rc) {
+        btl_proc->proc_opal = proc;  /* link with the proc */
         /* add to hash table of all proc instance. */
         opal_proc_table_set_value(&mca_btl_tcp_component.tcp_procs,
                                   proc->proc_name, btl_proc);
     } else {
         if (btl_proc) {
-            OBJ_RELEASE(btl_proc);
+            OBJ_RELEASE(btl_proc);  /* release the local proc */
+            OBJ_RELEASE(proc);      /* and the ref on the OMPI proc */
             btl_proc = NULL;
         }
     }

Evgueni-Petrov-aka-espetrov (Author) commented Feb 28, 2017

The application log and dmesg contain only the expected information.
The error can only come from OPAL_MODEX_RECV or malloc.
What may go wrong in OPAL_MODEX_RECV?

I applied the patch + debug printf with rc, and started the app with --mca btl_base_verbose 30.

I will let you know if the application runs without deadlocks for 20h, or I will provide more info about the deadlock.

Evgueni-Petrov-aka-espetrov (Author) commented Mar 1, 2017

As of now, the app has been running for 21h.

According to the log, control never passed through the @@ -184,12 +186,14 @@ hunk of mca_btl_tcp_proc_create in btl_tcp_proc.c.

There are no warnings or debug info from MPI in the log.

bosilca (Member) commented Mar 1, 2017

Is that a good or a bad sign?

Evgueni-Petrov-aka-espetrov (Author) commented Mar 1, 2017

The patch seems to fix the deadlock.

To simulate errors, I put the following code inside the do {...} while (0) in mca_btl_tcp_proc_create.
Without the fix, the MPI processes hang at startup if /home/espetrov/error.txt exists.
With the fix, the MPI processes start normally even if /home/espetrov/error.txt exists.
Please confirm that errors in mca_btl_tcp_proc_create are not "fatal".

I also see that mca_btl_tcp_proc_create is not called again for a long time after the first call at startup.
Why would mca_btl_tcp_proc_create need to be called again?

    if (fopen("/home/espetrov/error.txt", "r")) {
        rc = OPAL_ERROR;
        break;
    }

bosilca (Member) commented Mar 1, 2017

I pushed the fix in master (ec4a235). @jsquyres will create the PR for v2.x.

I am wondering about the same question regarding mca_btl_tcp_proc_create. One possible reason is that you have multiple IPs and we open connections as needed. Another is simply that we chose a different barrier algorithm on the communicator and therefore need to set up new connections.

jsquyres (Member) commented Mar 2, 2017

Fixed via ec4a235. Closing this issue because the problem is fixed (but feel free to keep conversing here if there are more questions/comments).

Evgueni-Petrov-aka-espetrov (Author) commented Mar 3, 2017

Many thanks for bearing with me.
"ip address show" gives "inet scope global" and "inet6 scope global" addresses for eth1.
Is this considered multiple IPs?

Evgueni-Petrov-aka-espetrov (Author) commented:

Another question: what happens when Open MPI sees multiple interfaces? E.g. a vlan with an "inet6 scope global" address and eth1 with both "inet scope global" and "inet6 scope global" addresses?

bosilca (Member) commented Mar 3, 2017

If no restrictions have been provided via MCA parameters (btl_tcp_if_include or btl_tcp_if_exclude), then Open MPI is supposed to use all non-local interfaces.
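For example (process count and application name below are placeholders; eth1 is just the interface from the netstat output above), restricting the TCP BTL to a single interface looks like:

bash$ mpirun --mca btl_tcp_if_include eth1 -np 4 ./your_app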

Evgueni-Petrov-aka-espetrov (Author) commented:

Does Open MPI use IPv4 and IPv6 connections simultaneously?

ggouaillardet (Contributor) commented:

IIRC, IPv6 is disabled unless you configure with --enable-ipv6.

Evgueni-Petrov-aka-espetrov (Author) commented:

Since we do not build with --enable-ipv6, multiple IPs are not relevant here.

Then it must be that Open MPI switched to a different barrier algorithm.
But what could cause that?
Does Open MPI collect performance data at run time and tune itself, e.g. switch from one barrier algorithm to another?

Evgueni-Petrov-aka-espetrov (Author) commented:

Ping

ggouaillardet (Contributor) commented:

The coll/tuned module selects the collective algorithm based on communicator and message sizes.
In the case of MPI_Barrier, the decision is based only on the communicator size.
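If it helps to rule algorithm switching in or out, the tuned selection can be pinned with coll/tuned MCA parameters; the lines below are an assumed example (not from this thread), and the exact parameter names for your build can be verified with ompi_info:

bash$ ompi_info --param coll tuned --level 9 | grep barrier
bash$ mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_barrier_algorithm 1 -np 4 ./your_app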
