[core/swarm] Graceful shutdown for connections, networks and swarms. #1682
Conversation
The `Network` does currently not emit events for actively closed connections, e.g. via `EstablishedConnection::close` or `ConnectedPeer::disconnect()`. As a result, when actively closing connections, `ConnectionEstablished` events are emitted without an eventually matching `ConnectionClosed` event. This seems undesirable and has the consequence that the `Swarm::ban_peer_id` feature in `libp2p-swarm` does not result in appropriate calls to `NetworkBehaviour::inject_connection_closed` and `NetworkBehaviour::inject_disconnected`. Furthermore, the `disconnect()` functionality in `libp2p-core` is currently broken, as it leaves the `Pool` in an inconsistent state.

This commit does the following:

1. When connection background tasks are dropped (i.e. removed from the `Manager`), they always terminate immediately, without attempting an orderly close of the connection.
2. An orderly close is sent to the background task of a connection as a regular command. The background task emits a `Closed` event before terminating.
3. `Pool::disconnect()` removes all connection tasks for the affected peer from the `Manager`, i.e. without an orderly close, thereby also fixing the discovered state inconsistency caused by not removing the corresponding entries in the `Pool` itself after removing them from the `Manager`.
4. A new test is added to `libp2p-swarm` that exercises the ban/unban functionality and places assertions on the number and order of calls to the `NetworkBehaviour`. In that context, some new testing utilities have been added to `libp2p-swarm`.

This addresses libp2p#1584.
Co-authored-by: Toralf Wittner <[email protected]>
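A minimal sketch of what points 1 and 2 amount to for the background task's command loop might look as follows. The `Command` and `Event` types here are simplifications for illustration, not the actual `libp2p-core` types:

```rust
use futures::channel::mpsc;
use futures::prelude::*;

/// Commands sent from the `Manager` to a connection background task
/// (simplified sketch).
enum Command {
    /// Request an orderly close of the connection.
    Close,
}

/// Events emitted by the background task back to the `Manager`
/// (simplified sketch).
#[derive(Debug)]
enum Event {
    /// Emitted after the orderly close completed, right before the
    /// task terminates.
    Closed,
}

async fn connection_task(
    mut commands: mpsc::Receiver<Command>,
    mut events: mpsc::Sender<Event>,
) {
    while let Some(cmd) = commands.next().await {
        match cmd {
            Command::Close => {
                // Flush and close the muxer/socket here, then report
                // the orderly close before terminating (point 2).
                let _ = events.send(Event::Closed).await;
                return;
            }
        }
    }
    // The command channel was dropped, i.e. the task was removed from
    // the `Manager`: terminate immediately, without attempting an
    // orderly close (point 1).
}
```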
Building on the ability to wait for connection shutdown to complete introduced in libp2p#1619, this commit extends the ability to perform graceful shutdowns in the following ways:

1. The `ConnectionHandler` (and thus also `ProtocolsHandler`) can participate in the shutdown, via new `poll_close` methods. The muxer and underlying transport connection only start closing once the connection handler signals readiness to do so.
2. A `Network` can be gracefully shut down, which involves a graceful shutdown of the underlying connection `Pool`. The `Pool` in turn proceeds with a shutdown by rejecting new connections while draining established connections.
3. A `Swarm` can be gracefully shut down, which involves a graceful shutdown of the underlying `Network` followed by polling the `NetworkBehaviour` until it returns `Poll::Pending`, i.e. it has no more output.

In particular, the following details are important:

* Analogous to new inbound and outbound connections during shutdown, while a single connection is shutting down, it rejects new inbound substreams and, by the return type of `ConnectionHandler::poll_close`, no new outbound substreams can be requested.
* The `NodeHandlerWrapper` managing the `ProtocolsHandler` always waits for already ongoing inbound and outbound substream upgrades to complete. Since the `NodeHandlerWrapper` is a `ConnectionHandler`, the previous point applies w.r.t. new inbound and outbound substreams.
* When the `connection_keep_alive` expires, a graceful shutdown is initiated.
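The shape of the new `poll_close` hook could be sketched as follows. The exact signature is an assumption for illustration; only the described behaviour (the handler signals readiness for the transport-level close, with a default that proceeds immediately) is taken from the commit message:

```rust
use std::task::{Context, Poll};

/// Sketch of a handler trait with a `poll_close` shutdown hook
/// (illustrative signature, not the exact one from this PR).
trait ConnectionHandler {
    type Error;

    /// Polled repeatedly once a graceful shutdown has been initiated.
    /// `Poll::Ready(Ok(()))` signals that the handler has finished its
    /// work and the muxer/transport may now be closed. Note that the
    /// return type offers no way to request new outbound substreams.
    fn poll_close(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        // Default: nothing to wind down, proceed with the
        // transport-level shutdown immediately.
        let _ = cx;
        Poll::Ready(Ok(()))
    }
}
```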
Shut down the `NetworkBehaviour` in tandem with the `Network` via a dedicated `poll_close` API.
core/src/connection.rs (outdated diff)
```rust
if self.state == ConnectionState::Open {
    self.handler.inject_substream(substream, SubstreamEndpoint::Listener)
} else {
    log::trace!("Inbound substream dropped. Connection is closing.")
}
```
Note that dropping a substream can cause write errors on the remote side, e.g. sending of initial data could fail when the substream is reset.
Yes, the current idea is to treat new inbound substreams analogously (as much as possible anyway) to new connections during shutdown - that is, to refuse them at the earliest possible moment. Any failures relating to attempts at creating new substreams or connections towards the peer that is shutting down should ideally result in retries with a different peer, much like without the graceful shutdown.
I am not sure the shutdown logic as currently implemented can really reduce error rates, because any error in one substream will often lead to a connection close, causing failures in other substreams. For example, the request-response handler will immediately close the connection when any inbound or outbound upgrade fails for any reason other than a timeout or protocol mismatch. This means that if the remote closes the connection and my next outbound request over this connection fails, all my ongoing requests on this connection will fail too.
That is true, unfortunately. It depends very much on how substream write errors are handled, in the specific case of `libp2p-request-response` by `RequestResponseCodec::write_request`, which is currently completely up to the user. As you say, most other protocols probably consider write errors on a substream fatal as well. The only somewhat clean solution that comes to my mind would seem to lead back to giving the substream multiplexers a configurable inbound substream limit, similar to the connection limit of the `Pool`, which can be changed during shutdown (i.e. set to 0): a limit that does not affect existing substreams but can be used to exert back-pressure in a way such that the remote knows it has reached the current substream limit on that connection, while the connection itself is not broken.
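To make that idea concrete, a hypothetical, dynamically adjustable inbound substream limit could look like the sketch below. None of these types exist in libp2p; `SubstreamLimit` and its methods are made-up names purely for illustration:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical, dynamically adjustable inbound substream limit,
/// shared between the muxer and the code driving the shutdown.
/// (Illustration only; no such type exists in libp2p.)
#[derive(Clone)]
struct SubstreamLimit(Arc<AtomicUsize>);

impl SubstreamLimit {
    fn new(limit: usize) -> Self {
        SubstreamLimit(Arc::new(AtomicUsize::new(limit)))
    }

    /// Consulted by the muxer when a new inbound substream arrives.
    /// If the limit is reached, the muxer would refuse the substream
    /// in a way the remote can distinguish from a broken connection,
    /// i.e. back-pressure rather than an error.
    fn try_acquire(&self, currently_open: usize) -> bool {
        currently_open < self.0.load(Ordering::Relaxed)
    }

    /// Called when a shutdown begins: no new inbound substreams are
    /// accepted, but existing substreams are unaffected.
    fn close_for_new_substreams(&self) {
        self.0.store(0, Ordering::Relaxed);
    }
}
```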
So apart from the technical details concerning how it is done, I guess the main point for discussion here is whether the perceived benefits of providing such shutdown options justify the added complexity. I happen to think that, while the diff is large, the actual changes are relatively simple in nature, and I find that providing "clean" shutdown options for a …
There are some unresolved but important considerations relating to the handling of new inbound substreams and substream write errors during shutdown (see the discussion above), which is why I'm putting this effort on ice for the moment.
Closing, since I don't think I will pick this up again. The main difficulty here is that a generic approach to a clean shutdown probably cannot be implemented without making too many assumptions about how concrete applications operate with the substreams on a connection; hence only a concrete application knows what can and needs to be done to gracefully cease traffic for a shutdown.
This is a proposal (at least for discussion) of enhanced graceful shutdown options for `Connection`s, `Network`s and `Swarm`s that builds on #1619. Even if #1619 is ultimately not merged, this may at least be a thread for discussing the topic of graceful shutdowns and its usefulness for `libp2p`. For a proper diff against #1619, see here.

Motivation
In the context of network I/O, a graceful shutdown is a form of shutdown that allows ongoing network traffic to cease in a controlled manner and at chosen boundaries (e.g. request/response boundaries), while at the same time rejecting new I/O streams (e.g. connections or substreams). The main purpose of a graceful shutdown of a networked application is thereby to minimise friction on restarts, incremental rollouts of updates, etc., that is, to minimise the number of failed requests, failed responses, or other errors resulting from a sudden interruption of ongoing network I/O at an arbitrary point. The more often networked nodes are restarted or redeployed, possibly as a result of frequent updates, the more errors and "hiccups" appear in the network traffic without graceful shutdowns. Graceful shutdowns usually also include redirecting traffic away from the nodes being shut down. In more classical client-server deployments this can sometimes be assisted by proxies or load balancers operating at lower network layers, but typically only the end-to-end applications and protocols know what needs to be done for a clean shutdown that avoids unnecessary errors.
How it works
This PR builds on the ability to wait for connection shutdown to complete, introduced in #1619, and extends the ability to perform graceful shutdowns in the following ways:
1. The `ConnectionHandler` (and thus also `ProtocolsHandler`) can participate in the shutdown, via new `poll_close` methods. The muxer and underlying transport connection only start closing once the connection handler signals readiness to do so. The default implementations of `ConnectionHandler::poll_close` always signal readiness to immediately proceed with the transport-level shutdown.

2. A `Network` can be gracefully shut down via `Network::close` or `Network::start_close`, which involves a graceful shutdown of the underlying connection `Pool`. The `Pool` in turn proceeds with a shutdown by rejecting new connections while draining established connections. Draining established connections just means waiting for them to close, i.e. waiting for `ConnectionHandler::poll_close` followed by `Muxer::poll_close` to complete.

3. A `Swarm` can be gracefully shut down via `Swarm::close` or `Swarm::start_close`, which involves a graceful shutdown of the underlying `Network` in tandem with the `NetworkBehaviour`, until the `Network` is closed and the `NetworkBehaviour` is ready for shutdown.
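As a usage sketch of the API described above: an application could initiate a graceful shutdown and then drive the `Swarm` to completion, relying on the `Stream`-like behaviour of yielding `None` once closed (see also the details below). The `handle_event` function and the exact method names are assumptions, not the verbatim API of this PR:

```rust
use futures::prelude::*;

// Sketch: initiate a graceful shutdown and drive it to completion.
async fn shut_down(mut swarm: Swarm<MyBehaviour>) {
    // Initiate the graceful shutdown: the underlying Network rejects
    // new connections and starts draining established ones.
    swarm.start_close();

    // Keep driving the Swarm so that connection handlers
    // (`poll_close`) and the NetworkBehaviour can finish their work.
    // `None` signals a completed clean shutdown, in the manner of an
    // exhausted `Stream`.
    while let Some(event) = swarm.next().await {
        // Events can still be emitted while closing, e.g. from
        // ongoing requests that are allowed to complete.
        handle_event(event);
    }
}
```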
Important Details

* Analogous to new inbound and outbound connections during shutdown, while a single connection is shutting down, it rejects new inbound substreams and, by the return type of `ConnectionHandler::poll_close`, no new outbound substreams can be requested.
* The `NodeHandlerWrapper` managing the `ProtocolsHandler` always waits for already ongoing inbound and outbound substream upgrades to complete. Since the `NodeHandlerWrapper` is a `ConnectionHandler`, the previous point applies w.r.t. new inbound and outbound substreams.
* While the connection handler is closing, it can still emit and receive events (e.g. to/from a `NetworkBehaviour`), but as mentioned above it no longer receives new inbound substreams and can no longer request new outbound substreams.
* When the `connection_keep_alive` expires, a graceful shutdown is initiated rather than a sudden close via returning a keep-alive "error" as is done now.
* `Connection::poll`, `Network::poll` and `Swarm::poll` now return the event in an `Option`, since `None` signals termination as a result of a clean shutdown, i.e. in the manner of a `Stream` that is exhausted.
* I added `ConnectionHandlerEvent::Close` to allow a `ConnectionHandler` to request a graceful close of the connection (as opposed to returning an error, which closes the connection without delay).
* I added `NetworkBehaviourAction::CloseConnection` and `NetworkBehaviourAction::DisconnectPeer` to allow a `NetworkBehaviour` to request a connection to be closed (gracefully) or a peer to be disconnected (immediately, i.e. not gracefully), respectively.
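To illustrate the last point of the list above, a `NetworkBehaviour` could request a graceful close from its `poll` method roughly as follows. The variant's fields and the `pending_closes` queue are illustrative assumptions, not the exact definitions from this PR:

```rust
// Sketch: a NetworkBehaviour requesting a graceful connection close
// via the proposed NetworkBehaviourAction::CloseConnection variant.
fn poll(
    &mut self,
    _cx: &mut Context<'_>,
    _params: &mut impl PollParameters,
) -> Poll<NetworkBehaviourAction<THandlerInEvent, TOutEvent>> {
    if let Some((peer_id, connection_id)) = self.pending_closes.pop() {
        // Request a graceful close: the connection handler's
        // poll_close gets a chance to finish ongoing work before the
        // muxer and transport are closed.
        return Poll::Ready(NetworkBehaviourAction::CloseConnection {
            peer_id,
            connection_id,
        });
    }
    Poll::Pending
}
```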
Usage Example

For an example of how a connection handler can make use of the ability to participate in graceful connection shutdown, I've included an implementation of `ProtocolsHandler::poll_close` for the `RequestResponseHandler` in `libp2p-request-response`. In the case of `libp2p-request-response`, these changes amount to allowing already ongoing requests to complete when a connection closes, while redirecting new requests to other connections.
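The gist of such an implementation might look like the following sketch (a simplification, not the actual code of this PR; `pending_io` and `completed` stand in for the handler's collections of in-flight work and finished results):

```rust
// Sketch of a request-response style poll_close: keep the connection
// open while requests/responses are still in flight, then signal
// readiness for the transport-level close. Field names are
// illustrative, not actual RequestResponseHandler fields.
fn poll_close(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
    // Drive the in-flight inbound and outbound request futures.
    while let Poll::Ready(Some(result)) = self.pending_io.poll_next_unpin(cx) {
        match result {
            Ok(done) => self.completed.push(done),
            Err(e) => return Poll::Ready(Err(e)),
        }
    }

    if self.pending_io.is_empty() {
        // Nothing in flight any more: the muxer may now close.
        Poll::Ready(Ok(()))
    } else {
        // Ongoing requests may still complete; new requests are
        // redirected to other connections by the behaviour.
        Poll::Pending
    }
}
```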
Tests

I have essentially changed all tests that would previously emit frequent "broken pipe" errors (more of these occur since #1619, which no longer always attempts a clean connection shutdown automatically in the background task, but only when actually requested) to perform a clean shutdown of the `Swarm`s or `Network`s used in these tests.