BTL/TCP broken on master #4131

rhc54 · 2017-08-22T20:29:49Z

When trying to run a simple ring program, I am getting the following error:

[rhc002][[16379,1],2][btl_tcp.c:556:mca_btl_tcp_recv_blocking] remote peer unexpectedly closed connection while I was waiting for blocking message
[rhc002][[16379,1],3][btl_tcp.c:556:mca_btl_tcp_recv_blocking] remote peer unexpectedly closed connection while I was waiting for blocking message
--------------------------------------------------------------------------
WARNING: Open MPI failed to handshake with a connecting peer MPI
process over TCP.  This should not happen.

Your Open MPI job may now fail.

  Local host: rhc002
  PID:        209028
  Message:    did not receive entire connect ACK from peer
--------------------------------------------------------------------------

The ring still completes - does anyone know why this started happening?

bwbarrett · 2017-08-22T21:55:17Z

@rhc54, are these two processes on the same host or on different hosts?

rhc54 · 2017-08-22T22:21:03Z

The two processes reporting the warning are on the same host, which is odd because vader is also enabled. However, they are talking to two procs on another host, and that might be the connection they are complaining about.

if a short read occurs, return the number of bytes read, and only issue an error message if something was read, otherwise that might be just fine (e.g. this is how we detect a race condition)

Fixes open-mpi#4131

Signed-off-by: Gilles Gouaillardet <[email protected]>

ggouaillardet · 2017-08-23T01:31:43Z

It looks like there is an inconsistency here.

From mca_btl_tcp_recv_blocking():

int mca_btl_tcp_recv_blocking(int sd, void* data, size_t size)
{
    /* ... */
        int retval = recv(sd, ((char *)ptr) + cnt, size - cnt, 0);
        /* remote closed connection */
        if (0 == retval) {
            BTL_ERROR(("remote peer unexpectedly closed connection while I was waiting for blocking message"));
            return -1;
        }
    /* ... */

So if zero bytes are read, an error message is displayed and the function returns -1.

But in mca_btl_tcp_endpoint_recv_connect_ack():

static int mca_btl_tcp_endpoint_recv_connect_ack(mca_btl_base_endpoint_t* btl_endpoint)
{
    mca_btl_tcp_endpoint_hs_msg_t hs_msg;
    retval = mca_btl_tcp_endpoint_recv_blocking(btl_endpoint, &hs_msg, sizeof(hs_msg));

    if (sizeof(hs_msg) != retval) {
        if (0 == retval) {
            /* If we get zero bytes, the peer closed the socket. This
               can happen when the two peers started the connection
               protocol simultaneously. Just report the problem
               upstream. */
            return OPAL_ERROR;
        }
        opal_show_help("help-mpi-btl-tcp.txt", "client handshake fail",
                       true, opal_process_info.nodename,
                       getpid(), "did not receive entire connect ACK from peer");

        return OPAL_ERR_BAD_PARAM;
    }

This block of code suggests it is OK to read zero bytes (there is a known race condition when two peers connect to each other, and we know how to handle it). When zero bytes are read, retval is expected to be 0, but as seen previously, the value is set to -1 instead (and an error message has already been issued).

I made #4134 in order to fix that.
This issue is caused by some recent changes, so I'd like to have my PR reviewed before it lands in master.
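
For illustration, the behavior described for #4134 (return the number of bytes read on a short read, and only complain if something was actually read) might look roughly like the sketch below. This is not the Open MPI code: `recv_blocking_sketch` and `demo_peer_closes_after` are hypothetical names, `fprintf` stands in for `BTL_ERROR`, and the demo assumes a POSIX system with `socketpair()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not the Open MPI function: read up to `size` bytes and
 * return the number of bytes actually read (possibly fewer than `size`), or
 * -1 on a socket error. */
ssize_t recv_blocking_sketch(int sd, void *data, size_t size)
{
    char *ptr = data;
    size_t cnt = 0;

    assert(data != NULL);
    while (cnt < size) {
        ssize_t n = recv(sd, ptr + cnt, size - cnt, 0);
        if (n == 0) {
            /* Peer closed the connection. Complain only about a partial
             * message; zero bytes may simply be the connection race. */
            if (cnt > 0) {
                fprintf(stderr, "peer closed connection after %zu of %zu bytes\n",
                        cnt, size);
            }
            return (ssize_t)cnt;
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal: retry */
            }
            return -1;
        }
        cnt += (size_t)n;
    }
    return (ssize_t)cnt;
}

/* Demo (assumption: nsent <= 8): the peer sends `nsent` bytes and closes,
 * while the reader asks for 8 bytes. Returns what the reader got. */
ssize_t demo_peer_closes_after(size_t nsent)
{
    int sv[2];
    char buf[8];
    ssize_t got;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }
    if (nsent > 0 && send(sv[1], "abcdefgh", nsent, 0) != (ssize_t)nsent) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    close(sv[1]);
    got = recv_blocking_sketch(sv[0], buf, sizeof(buf));
    close(sv[0]);
    return got;
}
```

With this shape, a caller such as mca_btl_tcp_endpoint_recv_connect_ack() can tell a zero-byte read (return value 0, no message printed) apart from a truncated handshake (a short positive count, with a message printed).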

bosilca · 2017-08-23T01:35:42Z

@ggouaillardet is right: if both processes open connections to each other simultaneously, one of the connections will be dropped (the one initiated by the lowest gid). The new behavior was introduced by #3955.
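
As a rough sketch of the tie-break described above, each side can apply the same rule to a given connection and reach the same verdict. The names `proc_id_t` and `keep_connection` are hypothetical, and a plain integer stands in for the "gid"; the actual comparison in Open MPI may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the process identifier the thread calls a "gid". */
typedef uint64_t proc_id_t;

/* When two peers connect to each other at the same time, the connection
 * initiated by the lower id is dropped, so a connection is kept only if its
 * initiator has the higher id. Assumption: a process never connects to
 * itself through this path. */
bool keep_connection(proc_id_t initiator, proc_id_t acceptor)
{
    assert(initiator != acceptor);
    return initiator > acceptor;
}
```

Because the rule depends only on the two ids, both peers drop the same connection, and the side whose connection is dropped sees a zero-byte read rather than a real failure.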

mohanasudhan · 2017-10-04T00:15:04Z

closed #4295

rhc54 added the bug label Aug 22, 2017

rhc54 assigned bosilca, mohanasudhan and bwbarrett Aug 22, 2017

dalcinl mentioned this issue Aug 22, 2017

MPI_Comm_spawn(): Warnings about failing handshake #4130

Closed

ggouaillardet mentioned this issue Aug 23, 2017

btl/tcp: fix mca_btl_tcp_recv_blocking() returned value #4134

Closed

bwbarrett removed their assignment Aug 23, 2017

bwbarrett added the Target: main and Target: v3.1.x labels Sep 12, 2017

mohanasudhan closed this as completed Oct 4, 2017
