btl tcp: Work around shutdown race bug #5892
Conversation
Work around a shutdown race in the TCP BTL when one process closes the endpoint socket and the FIN arrives at the other process before the second process shuts down. In this case, the socket will close, readv() will return 0, and the TCP BTL will push the error to the PML, which will abort the job (from MPI_FINALIZE, which is just rude).

There are two commits that appear to cause this issue. First, in ed8141e (Dec 2014), we started setting the endpoint state to FAILED on a readv() returning 0. Second, in 6acebc4 we started pushing the error to the PML for any FAILED state endpoints that enter endpoint_close().

The best workaround I could think of without rewriting lots of code is to only error to the PML if there were frags in flight. There is a long comment in the commit as to why I think this is a reasonable approach for now, as well as future approaches. The future approach, I think, is to add some auto-retry to the endpoint so that a single TCP disconnect doesn't abort a job.

Signed-off-by: Brian Barrett <[email protected]>
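For readers less familiar with the TCP BTL internals, the sketch below shows the shape of the workaround described above: on endpoint close, a FAILED endpoint is only escalated to the PML when fragments were still in flight. The type and field names here (`tcp_endpoint_t`, `frags_in_flight`, `report_error_to_pml`) are hypothetical stand-ins, not the actual Open MPI structures.

```c
/*
 * Illustrative sketch only -- not the actual Open MPI source.  It shows
 * the shape of the workaround: on endpoint close, only escalate a
 * FAILED endpoint to the PML when fragments were still in flight.
 */
#include <stddef.h>

typedef enum { EP_CONNECTED, EP_FAILED, EP_CLOSED } ep_state_t;

typedef struct {
    ep_state_t state;
    size_t     frags_in_flight;   /* sends queued or partially written */
} tcp_endpoint_t;

/* hypothetical hook that forwards the error up to the PML layer */
extern void report_error_to_pml(tcp_endpoint_t *ep);

static void endpoint_close(tcp_endpoint_t *ep)
{
    /*
     * A readv() of 0 (peer sent FIN) marks the endpoint FAILED.  During
     * MPI_FINALIZE this is usually a benign shutdown race, so only
     * report an error if there was outstanding traffic that can no
     * longer be delivered.
     */
    if (EP_FAILED == ep->state && ep->frags_in_flight > 0) {
        report_error_to_pml(ep);
    }
    ep->state = EP_CLOSED;
    /* ... close the socket and release resources ... */
}
```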
I was able to replicate Jeff's stream of "Socket closed" error messages/test failures on master on a 5-node cluster with 3 IP interfaces per node, running 18 tasks per node (so 90 total), in approximately 30% of the runs. With this patch, I could no longer replicate the failure. I'm open to better ideas, but this seemed simple and clean enough. Also curious whether George remembers why he made the FAILED change; the commit message wasn't enlightening. Worth noting that the BTL error callback patch is only on master, so this particular problem does not impact the v3.0.x, v3.1.x, or v4.0.x release branches.
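As a rough illustration of the failure mode (not the actual test that was run), a program of the following shape can hit the race: ranks exchange a message and then reach MPI_Finalize at different times, so one peer's FIN can arrive before the other side has finished tearing down its endpoints.

```c
/*
 * Illustrative reproducer sketch, not the failing test itself: ranks do
 * a neighbor exchange so every TCP endpoint carries traffic, then call
 * MPI_Finalize without a barrier, so peers tear down at different times.
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
                 &token, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* No barrier: ranks reach MPI_Finalize at different times, which is
     * where the FIN-before-shutdown race can appear. */
    MPI_Finalize();
    return 0;
}
```

Launching many ranks per node over multiple IP interfaces, e.g. something like `mpirun -np 90 --map-by ppr:18:node ./shutdown_race`, increases the odds of hitting the window.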
I gave it a spin and this does not work as intended in error cases.
- On the sender side, it does something reasonable: the presence of outgoing fragments is a good indicator that an MPI_Send may be blocked and that an error needs to be reported. If no such fragments are present, it may be acceptable to silently suppress the report.
- On the receiver side, however, even when an MPI_Recv may be blocking on the peer, no such fragments are present, so suppressing the report means failing to report an important event outside of MPI_Finalize.
A potential workaround could be to always let the report bubble up to the PML, and have the PML decide to suppress the error report when either no MPI operations are posted (which should be the case during the finalize shutdown storm) or the later stages of MPI_Finalize have been initiated.
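As a sketch of that alternative, the PML-side decision could look something like the following. All names here (`pml_state_t`, `pml_error_cb`, `n_posted_operations`, `finalize_started`, `abort_job`) are hypothetical and only show the decision logic, not the real PML API.

```c
/*
 * Illustrative sketch: the BTL always bubbles the error up, and the PML
 * decides whether to suppress it based on its own state.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t n_posted_operations;  /* outstanding sends/receives */
    bool   finalize_started;     /* later stages of MPI_Finalize begun */
} pml_state_t;

extern void abort_job(const char *reason);

/* callback invoked by the BTL whenever an endpoint fails */
static void pml_error_cb(pml_state_t *pml, const char *reason)
{
    /* During the finalize "shutdown storm", or when nothing is posted,
     * a peer closing its socket is not an application-visible error. */
    if (pml->finalize_started || 0 == pml->n_posted_operations) {
        return;  /* silently suppress the report */
    }
    abort_job(reason);
}
```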
@abouteiller You added the original bug when you started bubbling up errors to the PML. I don't have time to figure out how to better handle this situation in your error handling code (which doesn't work in the PML most customers use). We're going to push this patch, and you're welcome to iterate on a fix that works for everyone.
Excuse me?
@bosilca if you'd prefer, I can roll back the patch that broke master. This seems less bad.
Closed in favor of #5916. |