btl tcp: Work around shutdown race bug #5892
Conversation
Work around a shutdown race in the TCP BTL when one process closes the endpoint socket and the FIN arrives at the other process before the second process shuts down. In this case, the socket will close, readv() will return 0, and the TCP BTL will push the error to the PML, which will abort the job (from MPI_FINALIZE, which is just rude).

There are two commits that appear to cause this issue. First, in ed8141e (Dec 2014), we started setting the endpoint state to FAILED on a readv() returning 0. Second, in 6acebc4 we started pushing the error to the PML for any FAILED state endpoints that enter endpoint_close().

The best workaround I could think of without rewriting lots of code is to only error to the PML if there were frags in flight. There is a long comment in the commit as to why I think this is a reasonable approach for now, as well as future approaches. The future approach, I think, is to add some auto-retry to the endpoint so that a single TCP disconnect doesn't abort a job.

Signed-off-by: Brian Barrett <[email protected]>
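For readers less familiar with the TCP BTL internals, the sketch below shows the shape of the workaround described above: on endpoint close, a FAILED endpoint is only escalated to the PML when fragments were still in flight. The type and field names here (`tcp_endpoint_t`, `frags_in_flight`, `report_error_to_pml`) are hypothetical stand-ins, not the actual Open MPI structures.

```c
/*
 * Illustrative sketch only -- not the actual Open MPI source.  It shows
 * the shape of the workaround: on endpoint close, only escalate a
 * FAILED endpoint to the PML when fragments were still in flight.
 */
#include <stddef.h>

typedef enum { EP_CONNECTED, EP_FAILED, EP_CLOSED } ep_state_t;

typedef struct {
    ep_state_t state;
    size_t     frags_in_flight;   /* sends queued or partially written */
} tcp_endpoint_t;

/* hypothetical hook that forwards the error up to the PML layer */
extern void report_error_to_pml(tcp_endpoint_t *ep);

static void endpoint_close(tcp_endpoint_t *ep)
{
    /*
     * A readv() of 0 (peer sent FIN) marks the endpoint FAILED.  During
     * MPI_FINALIZE this is usually a benign shutdown race, so only
     * report an error if there was outstanding traffic that can no
     * longer be delivered.
     */
    if (EP_FAILED == ep->state && ep->frags_in_flight > 0) {
        report_error_to_pml(ep);
    }
    ep->state = EP_CLOSED;
    /* ... close the socket and release resources ... */
}
```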
I was able to replicate Jeff's stream of "Socket closed" error messages/test failures on master on a 5-node cluster with 3 IP interfaces per node, running 18 tasks per node (so 90 total), in approximately 30% of the runs. With this patch, I could no longer replicate the failure. I'm open to better ideas, but this seemed simple and clean enough. Also curious whether George remembers why he made the FAILED change; the commit message wasn't enlightening. Worth noting that the BTL error callback patch is only on master, so this particular problem does not impact the v3.0.x, v3.1.x, or v4.0.x release branches.
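As a rough illustration of the failure mode (not the actual test that was run), a program of the following shape can hit the race: ranks exchange a message and then reach MPI_Finalize at different times, so one peer's FIN can arrive before the other side has finished tearing down its endpoints.

```c
/*
 * Illustrative reproducer sketch, not the failing test itself: ranks do
 * a neighbor exchange so every TCP endpoint carries traffic, then call
 * MPI_Finalize without a barrier, so peers tear down at different times.
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                 &token, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* No barrier: ranks reach MPI_Finalize at different times, which is
     * where the FIN-before-shutdown race can appear. */
    MPI_Finalize();
    return 0;
}
```

Launching many ranks per node over multiple IP interfaces, e.g. something like `mpirun -np 90 --map-by ppr:18:node ./shutdown_race`, increases the odds of hitting the window.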
I gave it a spin and this does not work as intended in error cases.
- On the sender side, it does something reasonable: the presence of outgoing fragments is a good indicator that an MPI_Send may be blocked and that an error needs to be reported. If no such fragments are present, it may be acceptable to silently suppress the report.
- On the receiver side, however, even when an MPI_Recv may be blocking on the peer, no such fragments are present, so suppressing the report means failing to report an important event outside of MPI_Finalize.
A potential workaround could be to always let the report bubble up to the PML, and have the PML decide to suppress the error report when either no MPI operations are posted (which should be the case during the finalize shutdown storm) or the later stages of MPI_Finalize have been initiated.
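As a sketch of that alternative, the PML-side decision could look something like the following. All names here (`pml_state_t`, `pml_error_cb`, `n_posted_operations`, `finalize_started`, `abort_job`) are hypothetical and only show the decision logic, not the real PML API.

```c
/*
 * Illustrative sketch: the BTL always bubbles the error up, and the PML
 * decides whether to suppress it based on its own state.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t n_posted_operations;  /* outstanding sends/receives */
    bool   finalize_started;     /* later stages of MPI_Finalize begun */
} pml_state_t;

extern void abort_job(const char *reason);

/* callback invoked by the BTL whenever an endpoint fails */
static void pml_error_cb(pml_state_t *pml, const char *reason)
{
    /* During the finalize "shutdown storm", or when nothing is posted,
     * a peer closing its socket is not an application-visible error. */
    if (pml->finalize_started || 0 == pml->n_posted_operations) {
        return;  /* silently suppress the report */
    }
    abort_job(reason);
}
```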
@abouteiller You added the original bug when you started bubbling up errors to the PML. I don't have time to figure out how to better handle this situation in your error handling code (which doesn't work in the PML most customers use). We're going to push this patch, and you're welcome to iterate on a fix that works for everyone.
Excuse me?
@bosilca if you'd prefer, I can roll back the patch that broke master. This seems less bad.
Closed in favor of #5916. |