
transport-manager: Too many file descriptors crash #282

Closed
lexnv opened this issue Nov 8, 2024 · 1 comment
Labels: bug Something isn't working

lexnv commented Nov 8, 2024

Issue

Litep2p crashes after 4–5 hours when tested with a high number of inbound/outbound connections (500 in Kusama).

2024-11-07 21:03:37.155 ERROR tokio-runtime-worker grandpa: GRANDPA voter error:
could not complete a round on disk:
Database error: IO error: While open a file for appending: ../dbs/2024-11-04_07-55-18-9e0cebad9f60d62c72c933340c8a3f9e3af63c76-litep2p/chains/ksmcc3/db/full/591987.log:

Too many open files   

Investigation

This may happen because both the TCP and WebSocket transports keep polling their listener sockets, which produces new sockets (file descriptors) that then sit awaiting negotiation:

impl Stream for TcpTransport {
    type Item = TransportEvent;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if let Poll::Ready(event) = self.listener.poll_next_unpin(cx) {

The transport manager, however, does not appear to be consuming the events fast enough:

2024-11-07 20:13:39.554  INFO tokio-runtime-worker litep2p::websocket: pending_inbound_connections=0

2024-11-07 20:13:59.556  INFO tokio-runtime-worker litep2p::websocket: pending_inbound_connections=217

In the span of 20 seconds we received roughly 1653 pending inbound connections and handled 1436.
[Screenshot 2024-11-08 at 14 52 17]
[Screenshot 2024-11-08 at 15 04 06]

Possible Solutions

  • Don't poll the socket listener (TCP and WebSocket) once we exceed a pending_inbound_connections limit configurable via TcpConfig and WebSocketConfig (a sketch follows this list)
  • Ensure TCP and WebSocket are robust with respect to "too many open files" errors from the listener
  • Investigate the polling of futures (transports / transport-manager) further: at first glance it looks like we were able to sustain this pace in the past
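A minimal sketch of the first option. The `max_pending_inbound_connections` field, the `pending_inbound_connections` counter, and the stand-in types below are assumptions for illustration, not litep2p's actual API:

```rust
use std::{
    pin::Pin,
    task::{Context, Poll},
};

use futures::{stream::BoxStream, Stream, StreamExt};

// Illustrative stand-ins for litep2p's real types; names and fields are assumptions.
pub struct TcpConfig {
    /// Hypothetical limit on connections accepted but not yet negotiated.
    pub max_pending_inbound_connections: usize,
}

pub enum TransportEvent {
    PendingInboundConnection,
}

pub struct TcpTransport {
    config: TcpConfig,
    pending_inbound_connections: usize,
    listener: BoxStream<'static, TransportEvent>,
}

impl Stream for TcpTransport {
    type Item = TransportEvent;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        // Only poll the listener while below the configured limit; otherwise leave new
        // sockets in the OS accept queue instead of turning them into file descriptors
        // that the transport manager cannot negotiate fast enough.
        if self.pending_inbound_connections < self.config.max_pending_inbound_connections {
            if let Poll::Ready(Some(event)) = self.listener.poll_next_unpin(cx) {
                self.pending_inbound_connections += 1;
                return Poll::Ready(Some(event));
            }
        }

        // A real implementation would also poll in-flight negotiations here and
        // decrement `pending_inbound_connections` (waking the task) as each one finishes.
        Poll::Pending
    }
}
```

The effect is back-pressure: connections the node cannot negotiate yet stay in the kernel's accept backlog rather than consuming file descriptors.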
@lexnv lexnv added the bug Something isn't working label Nov 8, 2024
@lexnv lexnv self-assigned this Nov 8, 2024
lexnv added a commit that referenced this issue Nov 12, 2024
…t` (#283)

This PR ensures that the stream implementation of `TransportContext` does not overflow its polling index. Instead, it enforces a round-robin polling strategy with the index capped at the number of elements registered to the `TransportContext` (a sketch of the idea follows below).

While at it, a test was added to ensure the polling functionality matches round-robin expectations.
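A rough illustration of the round-robin idea, not the actual `TransportContext` code: keep a starting index, advance it after every yielded event, and wrap it with the number of registered transports so it can never overflow or point past the collection.

```rust
use std::task::{Context, Poll};

use futures::{stream::BoxStream, StreamExt};

/// Simplified stand-in for the manager's per-transport event streams.
struct TransportContext<E> {
    transports: Vec<BoxStream<'static, E>>,
    /// Index of the transport to poll first on the next call.
    next_index: usize,
}

impl<E> TransportContext<E> {
    /// Poll every registered transport at most once, starting at `next_index`,
    /// so a single busy transport cannot starve the others.
    fn poll_next_event(&mut self, cx: &mut Context<'_>) -> Poll<Option<E>> {
        let len = self.transports.len();
        if len == 0 {
            return Poll::Pending;
        }

        // Cap the stored index at the number of registered transports so it
        // cannot grow without bound (the overflow mentioned above).
        self.next_index %= len;

        for offset in 0..len {
            let index = (self.next_index + offset) % len;
            if let Poll::Ready(event) = self.transports[index].poll_next_unpin(cx) {
                // Resume from the next transport on the following poll.
                self.next_index = index + 1;
                return Poll::Ready(event);
            }
        }

        Poll::Pending
    }
}
```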

Discovered during: #282

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
lexnv added a commit that referenced this issue Nov 12, 2024
This PR handles the following:
- Align the WebSocket listener with the TCP listener so it does not miss `Poll::Ready(None)` events, and enhance logging

- Add warnings whenever pending connection IDs do not exist

- ~~Filtering out `Poll::Ready` events into `Poll::Pending` without calling the context waker will result in the scheduler never polling us again. This was happening, for example, for both TCP and WebSocket when receiving a connection-established event that was previously canceled~~
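For context on the two points above, a generic sketch of the `Stream` polling contract involved (illustrative code, not litep2p's listener implementation): `Poll::Ready(None)` should be surfaced rather than swallowed, and any event that is dropped and turned into `Poll::Pending` must be followed by an explicit wake-up, otherwise the executor has no reason to poll the stream again.

```rust
use std::task::{Context, Poll};

use futures::{stream::BoxStream, StreamExt};

enum ListenerEvent {
    Connection,
    Cancelled, // e.g. an event for a connection that was previously canceled
}

struct Listener {
    inner: BoxStream<'static, ListenerEvent>,
}

impl Listener {
    fn poll_event(&mut self, cx: &mut Context<'_>) -> Poll<Option<ListenerEvent>> {
        match self.inner.poll_next_unpin(cx) {
            // Surface the end of the stream instead of silently ignoring it.
            Poll::Ready(None) => Poll::Ready(None),

            // Dropping an event must not become a bare `Poll::Pending`: the inner
            // stream already consumed our interest, so we schedule another poll.
            Poll::Ready(Some(ListenerEvent::Cancelled)) => {
                cx.waker().wake_by_ref();
                Poll::Pending
            }

            Poll::Ready(Some(event)) => Poll::Ready(Some(event)),
            Poll::Pending => Poll::Pending,
        }
    }
}
```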


cc @paritytech/networking

Discovered during: #282

---------

Signed-off-by: Alexandru Vasile <[email protected]>
lexnv added a commit that referenced this issue Nov 13, 2024
This PR ensures that connection IDs are properly tracked to enforce
connection limits.

After roughly 1–2 days, the Substrate node running litep2p starts to reject multiple pending socket connections.

This is related to the fact that accepting an established connection is a multi-step process:
- Step 1: Check whether we can accept the connection based on the current number of connections.
- Step 2: The transport manager transitions the peer state.
- Step 3: If the transition succeeds, the connection is tracked (inserted into a hashmap).

If step 2 fails, we were previously leaking the connection ID, and the connection ID also counts towards the maximum number of connections the node can sustain.

This PR effectively modifies the connection limits API: `on_connection_established` is split into two methods (see the sketch below):
  - `can_accept_connection()`
  - `accept_established_connection()`

This fixes a subtle memory leak bounded at 10,000 connection IDs. More importantly, it fixes connection stability for long-running nodes.
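A simplified sketch of the split, with names and signatures based only on the description above rather than the exact litep2p API: the cheap check no longer reserves anything, so a failed peer-state transition cannot leak a connection ID into the limit accounting.

```rust
use std::collections::HashSet;

type ConnectionId = usize;

enum LimitError {
    MaxConnectionsExceeded,
}

/// Simplified connection-limit tracker.
struct ConnectionLimits {
    max_connections: usize,
    established: HashSet<ConnectionId>,
}

impl ConnectionLimits {
    /// Step 1: a cheap check that does NOT reserve the connection ID,
    /// so nothing is leaked if the peer-state transition later fails.
    fn can_accept_connection(&self) -> Result<(), LimitError> {
        if self.established.len() >= self.max_connections {
            return Err(LimitError::MaxConnectionsExceeded);
        }
        Ok(())
    }

    /// Step 3: called only after the transport manager has successfully
    /// transitioned the peer state; the ID counts towards the limit from here on.
    fn accept_established_connection(&mut self, id: ConnectionId) {
        self.established.insert(id);
    }

    /// Closing a connection frees the slot again.
    fn on_connection_closed(&mut self, id: ConnectionId) {
        self.established.remove(&id);
    }
}
```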

Discovered during the investigation of:
- #282

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>

lexnv commented Dec 17, 2024

The first graph highlights litep2p supporting many inbound connections when Substrate is configured for 500/1k in-peers. The second graph shows a memory leak in our tracking of connections, which is included in one of our latest releases.

This does not reproduce anymore. I suspect it was an artifact of having multiple polkadot workspaces open in VS Code with rust-analyzer while running two Substrate nodes on the same machine.

@lexnv lexnv closed this as completed Dec 17, 2024