BTL/TCP broken on master #4131

Closed
rhc54 opened this issue Aug 22, 2017 · 5 comments

@rhc54
Contributor

rhc54 commented Aug 22, 2017

When trying to run a simple ring program, I am getting the following error:

[rhc002][[16379,1],2][btl_tcp.c:556:mca_btl_tcp_recv_blocking] remote peer unexpectedly closed connection while I was waiting for blocking message
[rhc002][[16379,1],3][btl_tcp.c:556:mca_btl_tcp_recv_blocking] remote peer unexpectedly closed connection while I was waiting for blocking message
--------------------------------------------------------------------------
WARNING: Open MPI failed to handshake with a connecting peer MPI
process over TCP.  This should not happen.

Your Open MPI job may now fail.

  Local host: rhc002
  PID:        209028
  Message:    did not receive entire connect ACK from peer
--------------------------------------------------------------------------

The ring still completes - does anyone know why this started happening?

@bwbarrett
Member

@rhc54, is this two processes on the same host or on different hosts?

@rhc54
Contributor Author

rhc54 commented Aug 22, 2017

The two processes reporting the warning are on the same host, which is odd because vader is also enabled. However, they are talking to two procs on another host, and that might be the connection they are complaining about.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Aug 23, 2017
if a short read occurs, return the number of bytes read,
and only issue an error message if something was read, otherwise
that might be just fine (e.g. this is how we detect a race condition)

Fixes open-mpi#4131

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Contributor

it looks like there is some inconsistency here.

from mca_btl_tcp_recv_blocking()

int mca_btl_tcp_recv_blocking(int sd, void* data, size_t size)
{
    /* ... */
        int retval = recv(sd, ((char *)ptr) + cnt, size - cnt, 0);
        /* remote closed connection */
        if (0 == retval) {
            BTL_ERROR(("remote peer unexpectedly closed connection while I was waiting for blocking message"));
            return -1;
        }
    /* ... */

so if zero bytes are read

  • an error message is displayed
  • the function returns -1

but in mca_btl_tcp_endpoint_recv_connect_ack()

static int mca_btl_tcp_endpoint_recv_connect_ack(mca_btl_base_endpoint_t* btl_endpoint)
{
    mca_btl_tcp_endpoint_hs_msg_t hs_msg;
    retval = mca_btl_tcp_endpoint_recv_blocking(btl_endpoint, &hs_msg, sizeof(hs_msg));

    if (sizeof(hs_msg) != retval) {
        if (0 == retval) {
            /* If we get zero bytes, the peer closed the socket. This
               can happen when the two peers started the connection
               protocol simultaneously. Just report the problem
               upstream. */
            return OPAL_ERROR;
        }
        opal_show_help("help-mpi-btl-tcp.txt", "client handshake fail",
                       true, opal_process_info.nodename,
                       getpid(), "did not receive entire connect ACK from peer");

        return OPAL_ERR_BAD_PARAM;
    }

this block of code

  • suggests it is ok to read zero bytes (there is a known race condition when two peers connect to each other, and we know how to handle that)
  • when zero bytes are read, retval is expected to be 0, but as seen previously, the value is set to -1 instead (and an error message was issued)
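
To make the mismatch concrete, here is a minimal sketch that mirrors only the branch structure of the caller quoted above. The enum and the function name `classify_ack` are hypothetical and exist only for this illustration; they are not part of Open MPI. It assumes the callee reports a closed peer as -1, as in the first excerpt.

```c
#include <stddef.h>

/* Hypothetical outcome labels, introduced only for this illustration. */
typedef enum {
    ACK_OK,             /* the whole handshake message arrived */
    ACK_PEER_CLOSED,    /* zero bytes: the quiet race-condition path */
    ACK_HANDSHAKE_FAIL  /* anything else: the opal_show_help() warning path */
} ack_outcome_t;

/* Same comparisons as the quoted caller: the size check first,
 * then the special case for a zero-byte read. */
static ack_outcome_t classify_ack(int retval, size_t expected)
{
    if (expected != (size_t)retval) {
        if (0 == retval) {
            return ACK_PEER_CLOSED;
        }
        return ACK_HANDSHAKE_FAIL;
    }
    return ACK_OK;
}
```

With an 8-byte message, `classify_ack(0, 8)` takes the quiet path, but `classify_ack(-1, 8)` ends up on the warning path, which matches the "did not receive entire connect ACK from peer" message in the original report.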

i made #4134 in order to fix that.
this issue is caused by some recent changes, so i'd like to have my PR reviewed before it lands in master
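
For readers following along, the behavior described in the commit message (return the number of bytes read on a short read, and only complain if something was actually read) could look roughly like the sketch below. This is not the patch in #4134: `recv_blocking_sketch` is a hypothetical stand-in, it uses `fprintf` where the real code uses `BTL_ERROR`, and it assumes a blocking socket.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical stand-in for mca_btl_tcp_recv_blocking(); assumes a
 * blocking socket. Returns the number of bytes read, or -1 on a
 * socket error. */
static int recv_blocking_sketch(int sd, void *data, size_t size)
{
    unsigned char *ptr = (unsigned char *)data;
    size_t cnt = 0;

    while (cnt < size) {
        ssize_t n = recv(sd, ptr + cnt, size - cnt, 0);
        if (0 == n) {
            /* Peer closed the socket. Only a partial read is worth a
             * message; zero bytes may just be the connection race. */
            if (cnt > 0) {
                fprintf(stderr, "peer closed connection after %zu of %zu bytes\n",
                        cnt, size);
            }
            return (int)cnt;
        }
        if (n < 0) {
            if (EINTR == errno) {
                continue;   /* interrupted by a signal, retry */
            }
            return -1;      /* real socket error */
        }
        cnt += (size_t)n;
    }
    return (int)cnt;
}
```

With this shape, a caller that compares the result against the expected size sees 0 when the peer closed before sending anything, so the zero-byte branch in mca_btl_tcp_endpoint_recv_connect_ack() becomes reachable again.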

@bosilca
Member

bosilca commented Aug 23, 2017

@ggouaillardet is right: if both processes open connections to each other simultaneously, one of the connections will be dropped (the one initiated by the lowest gid). The new behavior was introduced by #3955.
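
As a rough illustration of that tie-break, and assuming each process has a unique, totally ordered identifier (modeled here as a plain `uint64_t`, which is an assumption and not how Open MPI represents process names), the rule could be sketched as follows. The function name `drop_connection` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: when both peers connected to each other at the
 * same time, decide whether the connection opened by `initiator_id`
 * towards `acceptor_id` is the one to drop. Following the rule stated
 * above, the connection initiated by the lower identifier loses. */
static bool drop_connection(uint64_t initiator_id, uint64_t acceptor_id)
{
    return initiator_id < acceptor_id;
}
```

Because the identifiers differ, exactly one of `drop_connection(a, b)` and `drop_connection(b, a)` is true, so both sides reach the same decision without exchanging further messages. The side whose connection is dropped then sees a zero-byte read, which is the case discussed earlier in this thread.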

@bwbarrett bwbarrett removed their assignment Aug 23, 2017
@mohanasudhan
Contributor

closed #4295
