BTL/TCP broken on master #4131
Comments
@rhc54, is this two processes on the same host or different hosts?
The two processes reporting the warning are on the same host, which is odd because vader is also enabled. However, they are talking to two procs on another host, and that might be the connection they are complaining about.
if a short read occurs, return the number of bytes read, and only issue an error message if something was read, otherwise that might be just fine (e.g. this is how we detect a race condition) Fixes open-mpi#4131 Signed-off-by: Gilles Gouaillardet <[email protected]>
It looks like there is some inconsistency here. From mca_btl_tcp_recv_blocking:

    int mca_btl_tcp_recv_blocking(int sd, void* data, size_t size)
    {
        int retval = recv(sd, ((char *)ptr) + cnt, size - cnt, 0);
        /* remote closed connection */
        if (0 == retval) {
            BTL_ERROR(("remote peer unexpectedly closed connection while I was waiting for blocking message"));
            return -1;
        }

So if zero bytes are read, an error is logged and -1 is returned, but in mca_btl_tcp_endpoint_recv_connect_ack:

    static int mca_btl_tcp_endpoint_recv_connect_ack(mca_btl_base_endpoint_t* btl_endpoint)
    {
        mca_btl_tcp_endpoint_hs_msg_t hs_msg;
        retval = mca_btl_tcp_endpoint_recv_blocking(btl_endpoint, &hs_msg, sizeof(hs_msg));
        if (sizeof(hs_msg) != retval) {
            if (0 == retval) {
                /* If we get zero bytes, the peer closed the socket. This
                   can happen when the two peers started the connection
                   protocol simultaneously. Just report the problem
                   upstream. */
                return OPAL_ERROR;
            }
            opal_show_help("help-mpi-btl-tcp.txt", "client handshake fail",
                           true, opal_process_info.nodename,
                           getpid(), "did not receive entire connect ACK from peer");
            return OPAL_ERR_BAD_PARAM;
        }

this block of code treats a return value of zero as the peer closing the socket.
I made #4134 in order to fix that.
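The behavior described in the commit message (return the number of bytes read on a short read, and only log when something was actually read) could look roughly like the sketch below. This is not the code from #4134: `recv_blocking_sketch`, `classify_handshake`, and the `HS_*` names are hypothetical, and `fprintf` stands in for `BTL_ERROR`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a blocking receive helper.
 * Returns the number of bytes actually read (possibly fewer than size
 * if the peer closed the socket), or -1 on a socket error. */
static ssize_t recv_blocking_sketch(int sd, void *data, size_t size)
{
    char *ptr = data;
    size_t cnt = 0;

    assert(data != NULL);
    while (cnt < size) {
        ssize_t n = recv(sd, ptr + cnt, size - cnt, 0);
        if (0 == n) {
            /* Peer closed the connection. Only complain if part of the
             * message had already arrived; a clean zero-byte read may
             * simply be the simultaneous-connect race. */
            if (cnt > 0) {
                fprintf(stderr, "peer closed after %zu of %zu bytes\n",
                        cnt, size);
            }
            return (ssize_t)cnt;
        }
        if (n < 0) {
            if (EINTR == errno) {
                continue;   /* interrupted by a signal, retry */
            }
            return -1;      /* genuine socket error */
        }
        cnt += (size_t)n;
    }
    return (ssize_t)cnt;
}

/* Hypothetical caller-side classification of a handshake read. */
typedef enum { HS_OK, HS_PEER_CLOSED, HS_ERROR } hs_result_t;

static hs_result_t classify_handshake(ssize_t got, size_t want)
{
    if (got >= 0 && (size_t)got == want) {
        return HS_OK;
    }
    if (0 == got) {
        return HS_PEER_CLOSED;  /* benign: report upstream quietly */
    }
    return HS_ERROR;            /* partial read or socket error */
}
```

With this shape, the caller's `0 == retval` branch becomes reachable again, because a closed socket with nothing read yields 0 rather than -1.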
@ggouaillardet is right: if both processes open connections to each other simultaneously, one of the connections will be dropped (the one initiated by the lowest gid). The new behavior was introduced by #3955.
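The tie-break rule stated above can be illustrated with a small sketch. It assumes each process is identified by a plain integer, which is a simplification; `resolve_simultaneous_connect` and the `KEEP_*` names are hypothetical and do not reproduce the comparison Open MPI actually performs.

```c
#include <assert.h>
#include <stdint.h>

/* Which of the two simultaneous connections this process keeps. */
typedef enum { KEEP_OUTGOING, KEEP_INCOMING } tie_break_t;

/* Hypothetical tie-break: the connection initiated by the process with
 * the lowest id is dropped, so both sides agree on the survivor. */
static tie_break_t resolve_simultaneous_connect(uint64_t local_id,
                                                uint64_t remote_id)
{
    assert(local_id != remote_id);  /* a process does not race itself */
    /* If we have the lower id, our outgoing connection is the one to
     * drop, so we keep the peer's incoming connection. */
    return (local_id < remote_id) ? KEEP_INCOMING : KEEP_OUTGOING;
}
```

Because both peers apply the same rule, the process whose connection is dropped sees a zero-byte read during the handshake, which is why that case should not be reported as an error.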
closed #4295
When trying to run a simple ring program, I am getting the following error:
The ring still completes - does anyone know why this started happening?