"Socket closed" messages from TCP BTL #5849

Closed
jsquyres opened this issue Oct 5, 2018 · 2 comments

Comments

@jsquyres
Member

jsquyres commented Oct 5, 2018

Cisco MTT is getting a lot of "Socket closed" messages from the TCP BTL.

For example: https://mtt.open-mpi.org/index.php?do_redir=2762

That message appears to come from here:

if( MCA_BTL_TCP_FAILED == btl_endpoint->endpoint_state ) {
    mca_btl_tcp_frag_t* frag = btl_endpoint->endpoint_send_frag;
    if( NULL == frag )
        frag = (mca_btl_tcp_frag_t*)opal_list_remove_first(&btl_endpoint->endpoint_frags);
    while(NULL != frag) {
        frag->base.des_cbfunc(&frag->btl->super, frag->endpoint, &frag->base, OPAL_ERR_UNREACH);
        if( frag->base.des_flags & MCA_BTL_DES_FLAGS_BTL_OWNERSHIP ) {
            MCA_BTL_TCP_FRAG_RETURN(frag);
        }
        frag = (mca_btl_tcp_frag_t*)opal_list_remove_first(&btl_endpoint->endpoint_frags);
    }
    btl_endpoint->endpoint_send_frag = NULL;
    /* Let's report the error upstream */
    if(NULL != btl_endpoint->endpoint_btl->tcp_error_cb) {
        btl_endpoint->endpoint_btl->tcp_error_cb((mca_btl_base_module_t*)btl_endpoint->endpoint_btl, 0,
                                                 btl_endpoint->endpoint_proc->proc_opal, "Socket closed");
    }

I don't know whether this is an actual error (one the user should be informed about) or an "it's ok, we'll just ignore it and move on to the next interface in the list" kind of issue (one that should probably be reported only at a high enough verbosity). Regardless, "Socket closed" is probably not detailed enough to convey meaningful information to the end user. 😄

I don't know if this is related to #3035 or #5818, but it's worth cross-referencing them here.

@bwbarrett
Member

#5892 is a workaround for this problem. This is unrelated to #3035 or #5818 and only happens on master.

Jeff actually got the code slightly wrong. The issue is on the next line down, where the BTL calls back to the PML with the disconnect. OB1 just aborts the job on an error callback, but in this case the callback fires because readv() returned 0, which happened because the other side was in MPI_FINALIZE and had shut down the socket.
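
The distinction described above (readv() returning 0 on an orderly peer shutdown, as opposed to -1 on a real failure) can be sketched in standalone POSIX C. This is only an illustration: `read_outcome_t`, `classify_readv`, and `demo_peer_close` are made-up names, and a local `socketpair()` stands in for the TCP connection between two ranks.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical classification, not an Open MPI type. */
typedef enum {
    READ_DATA,         /* readv() returned > 0: payload available */
    READ_PEER_CLOSED,  /* readv() returned 0: peer shut the socket down */
    READ_RETRY,        /* transient condition, try again later */
    READ_ERROR         /* genuine failure worth reporting */
} read_outcome_t;

/* Map a readv() return value (and the errno seen with it) to an outcome. */
static read_outcome_t classify_readv(ssize_t rc, int err)
{
    if (rc > 0) return READ_DATA;
    if (rc == 0) return READ_PEER_CLOSED;
    if (err == EINTR || err == EAGAIN || err == EWOULDBLOCK) return READ_RETRY;
    return READ_ERROR;
}

/* Create a connected pair, close one end, then read from the other. */
static read_outcome_t demo_peer_close(void)
{
    int fds[2];
    char buf[16];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return READ_ERROR;
    }
    close(fds[1]);                      /* the "peer" goes away cleanly */
    ssize_t n = readv(fds[0], &iov, 1); /* end-of-stream: returns 0 */
    int err = errno;
    close(fds[0]);
    return classify_readv(n, err);
}
```

Here `demo_peer_close()` yields `READ_PEER_CLOSED`, the case that arguably should not be escalated as a fatal error when the remote side is simply finalizing.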

bwbarrett added a commit to bwbarrett/ompi that referenced this issue Oct 12, 2018
This reverts commit 6acebc4.

This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run.  See
open-mpi#5849 for more information.

Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett
Member

Based on MTT results, da1189d (which reverted 6acebc4) fixed the test failures. Since the other problems we're seeing all have their own open issues, I'm going to close this ticket.
