transport-manager: Too many file descriptors crash #282
Comments
…t` (#283)

This PR ensures that the stream implementation of `TransportContext` does not overflow its polling index. Instead, it uses a round-robin polling strategy with the index capped at the number of elements registered to the `TransportContext` (see the sketch below). While at it, a test was added to check that polling behaves according to round-robin expectations.

Discovered during: #282

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
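For illustration, here is a minimal round-robin poller in the spirit of that fix, written against the `futures` crate. `RoundRobin` and its fields are hypothetical names for this sketch, not the actual `TransportContext` implementation.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::stream::Stream;

/// Round-robin poller over a set of inner streams. The starting index is
/// always reduced modulo the number of registered streams, so it can never
/// grow past the number of elements or index out of bounds.
struct RoundRobin<S> {
    streams: Vec<S>,
    next: usize,
}

impl<S: Stream + Unpin> Stream for RoundRobin<S> {
    type Item = S::Item;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = &mut *self;
        let len = this.streams.len();
        if len == 0 {
            // Nothing registered yet; the caller is expected to re-poll
            // after adding streams.
            return Poll::Pending;
        }

        // Cap the starting index at the number of registered streams.
        this.next %= len;

        for offset in 0..len {
            let idx = (this.next + offset) % len;
            if let Poll::Ready(Some(item)) = Pin::new(&mut this.streams[idx]).poll_next(cx) {
                // Resume from the following stream on the next poll so no
                // single stream can starve the others.
                this.next = (idx + 1) % len;
                return Poll::Ready(Some(item));
            }
        }

        // All inner streams registered their wakers; wait to be woken.
        Poll::Pending
    }
}
```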
This PR handles the following:

- Align the Websocket listener with the TCP listener so it does not miss `Poll::Ready(None)` events, and enhance logging
- Add warnings whenever pending connection IDs do not exist
- ~~Converting `Poll::Ready` events into `Poll::Pending` without calling the context waker results in the scheduler never polling the task again. This was happening, for example, for both TCP and Websocket when receiving a connection-established event that had previously been canceled~~ (see the waker sketch below)

cc @paritytech/networking

Discovered during: #282

---------

Signed-off-by: Alexandru Vasile <[email protected]>
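The waker pitfall mentioned in the struck-through item can be shown with a small std-only sketch; `ListenerEvent` is a hypothetical type used only for this example.

```rust
use std::task::{Context, Poll};

// Hypothetical event type for illustration only.
enum ListenerEvent {
    Canceled,
    Inbound(u64),
}

/// Filtering an event out must not silently return `Poll::Pending`: the
/// executor only re-polls a task after its waker is woken, so swallowing a
/// `Poll::Ready` event without waking (or looping) stalls the task forever.
fn filter_event(cx: &mut Context<'_>, event: Poll<Option<ListenerEvent>>) -> Poll<Option<u64>> {
    match event {
        Poll::Ready(Some(ListenerEvent::Inbound(id))) => Poll::Ready(Some(id)),
        Poll::Ready(Some(ListenerEvent::Canceled)) => {
            // Drop the canceled event but wake ourselves so the scheduler
            // polls again for the next event instead of parking the task.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
        Poll::Ready(None) => Poll::Ready(None),
        Poll::Pending => Poll::Pending,
    }
}
```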
This PR ensures that connection IDs are properly tracked to enforce connection limits.

After roughly 1-2 days, a Substrate node running litep2p starts rejecting multiple pending socket connections. This is related to the fact that accepting an established connection is a multi-step process:

- Step 1: Can we accept the connection based on the number of connections?
- Step 2: The transport manager transitions the peer state.
- Step 3: If the transition succeeds, the connection is tracked (inserted in a hashmap).

If step 2 fails, we were previously leaking the connection ID. The connection ID also counts towards the maximum number of connections the node can sustain.

This PR effectively modifies the connection limits API: `on_connection_established` is split into two methods (sketched below):

- `can_accept_connection()`
- `accept_established_connection()`

This fixes a subtle memory leak bounded at 10000 connection IDs. More importantly, it fixes connection stability for long-running nodes.

Discovered during the investigation of:

- #282

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
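As a rough sketch of the split API described above (type and method names are illustrative, not the exact litep2p signatures):

```rust
use std::collections::HashSet;

// Hypothetical types for illustration; the real litep2p API differs in detail.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ConnectionId(usize);

struct ConnectionLimits {
    max_established: usize,
    established: HashSet<ConnectionId>,
}

impl ConnectionLimits {
    /// Step 1: check the limit without reserving anything, so a failed peer
    /// state transition cannot leak a tracked connection ID.
    fn can_accept_connection(&self) -> bool {
        self.established.len() < self.max_established
    }

    /// Step 3: called only after the peer state transition succeeded; the
    /// connection ID starts counting towards the limit here.
    fn accept_established_connection(&mut self, id: ConnectionId) {
        self.established.insert(id);
    }

    /// Release the ID once the connection closes.
    fn on_connection_closed(&mut self, id: ConnectionId) {
        self.established.remove(&id);
    }
}
```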
The first graph highlights litep2p supporting many inbound connections when Substrate is configured for 500/1k in-peers. This does not reproduce anymore. I suspect it was an artifact of having multiple Polkadot workspaces opened in VS Code with rust-analyzer while running two Substrate nodes on the same machine.
Issue
Litep2p crashes after 4-5 hours when tested with a high number of inbound/outbound connections (500 on Kusama).
Investigation
This may happen because both the TCP and WebSocket transports poll the listener socket, which produces new sockets (file descriptors) awaiting negotiation:
litep2p/src/transport/tcp/mod.rs, lines 526 to 530 at af0535c
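To make the failure mode concrete, here is a simplified std-only sketch of a listener drain loop. It is not the actual code at the referenced lines, and `max_pending` is a hypothetical guard of the kind discussed under possible solutions: without such a cap, every accepted socket is a new file descriptor sitting in the negotiation queue.

```rust
use std::net::{TcpListener, TcpStream};

/// Accept queued inbound connections until the listener would block or the
/// (hypothetical) pending-negotiation limit is reached.
fn drain_listener(listener: &TcpListener, pending: &mut Vec<TcpStream>, max_pending: usize) {
    listener
        .set_nonblocking(true)
        .expect("listener should support non-blocking mode");

    loop {
        // Guard: stop accepting once too many sockets await negotiation,
        // otherwise a burst of inbound connections can exhaust the process
        // file descriptor limit.
        if pending.len() >= max_pending {
            break;
        }
        match listener.accept() {
            Ok((socket, _addr)) => pending.push(socket),
            // No more queued connections right now.
            Err(ref e) if e.kind() == std::io::ErrorKind::WouldBlock => break,
            Err(e) => {
                eprintln!("accept failed: {e}");
                break;
            }
        }
    }
}
```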
The transport manager does not appear to be consuming the events fast enough:
In a span of 20 seconds we received roughly 1653 pending inbound connections and handled 1436.
Possible Solutions
- Make `pending_inbound_connections` configurable via `TcpConfig` and `WebSocketConfig` (a sketch of such a knob follows below).
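A hedged sketch of what such a configuration knob might look like; the field name, type layout, and default value are assumptions, not the current `TcpConfig` API.

```rust
/// Illustrative only: the real `TcpConfig` / `WebSocketConfig` fields may differ.
#[derive(Debug, Clone)]
struct TcpConfig {
    /// Maximum number of accepted sockets allowed to sit in the negotiation
    /// queue; the listener pauses accepting once this limit is reached.
    pending_inbound_connections: usize,
}

impl Default for TcpConfig {
    fn default() -> Self {
        // Arbitrary example default.
        Self { pending_inbound_connections: 1024 }
    }
}

fn main() {
    // Example: tighten the limit on a resource-constrained node.
    let config = TcpConfig { pending_inbound_connections: 256 };
    println!("{config:?}");
}
```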