
transport-manager: Too many file descriptors crash #282

Closed
lexnv opened this issue Nov 8, 2024 · 1 comment
Labels: bug Something isn't working

lexnv commented Nov 8, 2024

Issue

Litep2p crashes after 4–5 hours when tested with a high number of inbound/outbound connections (500 in Kusama).

2024-11-07 21:03:37.155 ERROR tokio-runtime-worker grandpa: GRANDPA voter error:
could not complete a round on disk:
Database error: IO error: While open a file for appending: ../dbs/2024-11-04_07-55-18-9e0cebad9f60d62c72c933340c8a3f9e3af63c76-litep2p/chains/ksmcc3/db/full/591987.log:

Too many open files   

Investigation

This may happen because both the TCP and WebSocket transports keep polling their listener sockets, which produces new sockets (file descriptors) that then sit awaiting negotiation:

impl Stream for TcpTransport {
    type Item = TransportEvent;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if let Poll::Ready(event) = self.listener.poll_next_unpin(cx) {

The transport manager, however, does not appear to be consuming the events fast enough:

2024-11-07 20:13:39.554  INFO tokio-runtime-worker litep2p::websocket: pending_inbound_connections=0

2024-11-07 20:13:59.556  INFO tokio-runtime-worker litep2p::websocket: pending_inbound_connections=217

In the span of 20 seconds we received roughly 1653 pending inbound connections and handled 1436.
[Screenshot 2024-11-08 at 14 52 17]
[Screenshot 2024-11-08 at 15 04 06]

Possible Solutions

  • Don't poll the socket listener (TCP and WebSocket) once we exceed a pending_inbound_connections limit configurable via TcpConfig and WebSocketConfig (a sketch follows this list)
  • Ensure TCP and WebSocket are robust with respect to "too many open files" errors from the listener
  • Investigate the polling of futures (transports / transport-manager) further: at first glance it looks like we were able to sustain this pace in the past
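A minimal sketch of the first option. The `max_pending_inbound_connections` field, the `pending_inbound_connections` counter, and the stand-in types below are assumptions for illustration, not litep2p's actual API:

```rust
use std::{
    pin::Pin,
    task::{Context, Poll},
};

use futures::{stream::BoxStream, Stream, StreamExt};

// Illustrative stand-ins for litep2p's real types; names and fields are assumptions.
pub struct TcpConfig {
    /// Hypothetical limit on connections accepted but not yet negotiated.
    pub max_pending_inbound_connections: usize,
}

pub enum TransportEvent {
    PendingInboundConnection,
}

pub struct TcpTransport {
    config: TcpConfig,
    pending_inbound_connections: usize,
    listener: BoxStream<'static, TransportEvent>,
}

impl Stream for TcpTransport {
    type Item = TransportEvent;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        // Only poll the listener while below the configured limit; otherwise leave new
        // sockets in the OS accept queue instead of turning them into file descriptors
        // that the transport manager cannot negotiate fast enough.
        if self.pending_inbound_connections < self.config.max_pending_inbound_connections {
            if let Poll::Ready(Some(event)) = self.listener.poll_next_unpin(cx) {
                self.pending_inbound_connections += 1;
                return Poll::Ready(Some(event));
            }
        }

        // A real implementation would also poll in-flight negotiations here and
        // decrement `pending_inbound_connections` (waking the task) as each one finishes.
        Poll::Pending
    }
}
```

The effect is back-pressure: connections the node cannot negotiate yet stay in the kernel's accept backlog rather than consuming file descriptors.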
@lexnv lexnv added the bug Something isn't working label Nov 8, 2024
@lexnv lexnv self-assigned this Nov 8, 2024
lexnv added a commit that referenced this issue Nov 12, 2024
…t` (#283)

This PR ensures that the stream implementation of `TransportContext` does not overflow its polling index. Instead, it enforces a round-robin polling strategy with the index capped at the number of elements registered to the `TransportContext` (a sketch of the idea follows below).

While at it, a test was added to ensure the polling functionality matches round-robin expectations.
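A rough illustration of the round-robin idea, not the actual `TransportContext` code: keep a starting index, advance it after every yielded event, and wrap it with the number of registered transports so it can never overflow or point past the collection.

```rust
use std::task::{Context, Poll};

use futures::{stream::BoxStream, StreamExt};

/// Simplified stand-in for the manager's per-transport event streams.
struct TransportContext<E> {
    transports: Vec<BoxStream<'static, E>>,
    /// Index of the transport to poll first on the next call.
    next_index: usize,
}

impl<E> TransportContext<E> {
    /// Poll every registered transport at most once, starting at `next_index`,
    /// so a single busy transport cannot starve the others.
    fn poll_next_event(&mut self, cx: &mut Context<'_>) -> Poll<Option<E>> {
        let len = self.transports.len();
        if len == 0 {
            return Poll::Pending;
        }

        // Cap the stored index at the number of registered transports so it
        // cannot grow without bound (the overflow mentioned above).
        self.next_index %= len;

        for offset in 0..len {
            let index = (self.next_index + offset) % len;
            if let Poll::Ready(event) = self.transports[index].poll_next_unpin(cx) {
                // Resume from the next transport on the following poll.
                self.next_index = index + 1;
                return Poll::Ready(event);
            }
        }

        Poll::Pending
    }
}
```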

Discovered during: #282

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
lexnv added a commit that referenced this issue Nov 12, 2024
This PR handles the following:
- Align the WebSocket listener with the TCP listener so it does not miss `Poll::Ready(None)` events, and enhance logging

- Add warnings whenever pending connection IDs do not exist

- ~~Filtering out `Poll::Ready` events into `Poll::Pending` without calling the context waker will result in the scheduler never polling us again. This was happening, for example, for both TCP and WebSocket when receiving a connection-established event that was previously canceled~~
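For context on the two points above, a generic sketch of the `Stream` polling contract involved (illustrative code, not litep2p's listener implementation): `Poll::Ready(None)` should be surfaced rather than swallowed, and any event that is dropped and turned into `Poll::Pending` must be followed by an explicit wake-up, otherwise the executor has no reason to poll the stream again.

```rust
use std::task::{Context, Poll};

use futures::{stream::BoxStream, StreamExt};

enum ListenerEvent {
    Connection,
    Cancelled, // e.g. an event for a connection that was previously canceled
}

struct Listener {
    inner: BoxStream<'static, ListenerEvent>,
}

impl Listener {
    fn poll_event(&mut self, cx: &mut Context<'_>) -> Poll<Option<ListenerEvent>> {
        match self.inner.poll_next_unpin(cx) {
            // Surface the end of the stream instead of silently ignoring it.
            Poll::Ready(None) => Poll::Ready(None),

            // Dropping an event must not become a bare `Poll::Pending`: the inner
            // stream already consumed our interest, so we schedule another poll.
            Poll::Ready(Some(ListenerEvent::Cancelled)) => {
                cx.waker().wake_by_ref();
                Poll::Pending
            }

            Poll::Ready(Some(event)) => Poll::Ready(Some(event)),
            Poll::Pending => Poll::Pending,
        }
    }
}
```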


cc @paritytech/networking

Discovered during: #282

---------

Signed-off-by: Alexandru Vasile <[email protected]>
lexnv added a commit that referenced this issue Nov 13, 2024
This PR ensures that connection IDs are properly tracked to enforce
connection limits.

After roughly 1–2 days, the Substrate node running litep2p starts to reject multiple pending socket connections.

This is related to the fact that accepting an established connection is a multi-step process:
- Step 1: Check whether we can accept the connection based on the current number of connections.
- Step 2: The transport manager transitions the peer state.
- Step 3: If the transition succeeds, the connection is tracked (inserted into a hashmap).

If step 2 fails, we were previously leaking the connection ID, and the connection ID also counts towards the maximum number of connections the node can sustain.

This PR effectively modifies the connection limits API: `on_connection_established` is split into two methods (see the sketch below):
  - `can_accept_connection()`
  - `accept_established_connection()`

This fixes a subtle memory leak bounded at 10,000 connection IDs. More importantly, it fixes connection stability for long-running nodes.
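A simplified sketch of the split, with names and signatures based only on the description above rather than the exact litep2p API: the cheap check no longer reserves anything, so a failed peer-state transition cannot leak a connection ID into the limit accounting.

```rust
use std::collections::HashSet;

type ConnectionId = usize;

enum LimitError {
    MaxConnectionsExceeded,
}

/// Simplified connection-limit tracker.
struct ConnectionLimits {
    max_connections: usize,
    established: HashSet<ConnectionId>,
}

impl ConnectionLimits {
    /// Step 1: a cheap check that does NOT reserve the connection ID,
    /// so nothing is leaked if the peer-state transition later fails.
    fn can_accept_connection(&self) -> Result<(), LimitError> {
        if self.established.len() >= self.max_connections {
            return Err(LimitError::MaxConnectionsExceeded);
        }
        Ok(())
    }

    /// Step 3: called only after the transport manager has successfully
    /// transitioned the peer state; the ID counts towards the limit from here on.
    fn accept_established_connection(&mut self, id: ConnectionId) {
        self.established.insert(id);
    }

    /// Closing a connection frees the slot again.
    fn on_connection_closed(&mut self, id: ConnectionId) {
        self.established.remove(&id);
    }
}
```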

Discovered during the investigation of:
- #282

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>

lexnv commented Dec 17, 2024

The first graph highlights litep2p supporting many inbound connections when Substrate is configured for 500/1k in-peers. The second graph shows a memory leak in our tracking of connections, which is included in one of our latest releases.

This does not reproduce anymore. I suspect it was an artifact of having multiple polkadot workspaces open in VS Code with rust-analyzer while running two Substrate nodes on the same machine.

@lexnv lexnv closed this as completed Dec 17, 2024