Incorrect split between dialed and incoming peers when running a light server #1669

Closed
oneeman opened this issue Aug 12, 2021 · 0 comments · Fixed by #1670

oneeman commented Aug 12, 2021

Expected Behavior

Connection slots for eth/istanbul peers should be divided between dialed and incoming connections according to the DialRatio value (which defaults to 3, meaning 1/3 of slots for dialed connections and 2/3 for incoming).

The p2p server does not explicitly divide up slots between eth peers and LES clients. However, since LES servers don't dial LES clients (only the clients dial the servers), connection slots for LES clients should all be incoming. Therefore the correct logic is:

max_dialed = max_eth_peers / DialRatio
max_incoming = max_peers - max_dialed

where max_peers = max_eth_peers + max_les_clients
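For concreteness, here is a minimal runnable sketch of that calculation (the helper function and its names are illustrative, not code from this repo):

    package main

    import "fmt"

    // computeSlots is an illustrative helper (not code from this repo) that
    // implements the split described above: dialed slots come out of the eth
    // peer budget only, since LES client slots are incoming-only.
    func computeSlots(maxEthPeers, maxLesClients, dialRatio int) (maxDialed, maxIncoming int) {
        maxPeers := maxEthPeers + maxLesClients
        maxDialed = maxEthPeers / dialRatio
        maxIncoming = maxPeers - maxDialed
        return maxDialed, maxIncoming
    }

    func main() {
        fmt.Println(computeSlots(100, 1000, 3)) // 33 1067
    }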

Current Behavior

If running an LES server, the calculation that splits slots between dialed and incoming uses the maximum total number of peers (both eth/istanbul peers and LES clients) rather than the maximum number of eth peers:

if srv.DialRatio == 0 {
    limit = srv.MaxPeers / defaultDialRatio
} else {
    limit = srv.MaxPeers / srv.DialRatio
}

This behavior is inherited from upstream, so this is also a bug in go-ethereum.

For example, with the values recommended in our docs (--light.maxpeers 1000 --maxpeers 1100), the resulting split is 366 dialed and 734 incoming. However, the eth protocol manager has a limit of 100 peers, so the correct split would be 33 dialed and 1067 incoming (see the sketch after the list below for the arithmetic). The resulting behavior is the following:

  1. Until it reaches its limit of eth peers (100 with these flag values), it keeps trying to dial peers (since it never gets to 366 dialed peers) and also accepts incoming connections. The resulting number of dialed vs incoming connections is determined on a "first come, first served" basis, depending on how many of the dials succeed and how many peers try to connect to it.
  2. Once it's reached its limit of eth peers, the p2p server still keeps trying to dial peers and accepting incoming connections. However, these connections (unless they are static or trusted, of course) are immediately closed by the eth protocol manager because it's already at its limit of eth peers:
    if !isStaticOrTrusted && pm.peers.Len() >= pm.maxPeers && p.Peer.Server != pm.proxyServer {
        return p2p.DiscTooManyPeers
    }
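To make the arithmetic above concrete (a standalone sketch, not code from the repo):

    package main

    import "fmt"

    func main() {
        // Docs-recommended values: --maxpeers 1100, --light.maxpeers 1000,
        // so the eth peer budget is 100 and the default DialRatio is 3.
        maxPeers, maxEthPeers, dialRatio := 1100, 100, 3

        // Current (buggy) split divides the total MaxPeers by the dial ratio:
        fmt.Println(maxPeers/dialRatio, maxPeers-maxPeers/dialRatio) // 366 734

        // Correct split divides only the eth peer budget by the dial ratio:
        fmt.Println(maxEthPeers/dialRatio, maxPeers-maxEthPeers/dialRatio) // 33 1067
    }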

This behavior is undesirable for two reasons:

  1. The split between dialed and incoming eth peers is left up to chance rather than being determined by the dial ratio (1/3 to 2/3). This is likely the cause of the observed behavior of new nodes having a hard time finding peers that will let them connect. If we instead had the correct 1/3-to-2/3 split, it would help ensure that peers have open incoming slots. Of course, some peers may be behind a firewall without port forwarding and so unable to accept any incoming connections, so we can't know for sure how many open slots would result. But the current behavior prevents nodes running LES servers from respecting the dial ratio, and so is likely a big contributing factor to the problem.
  2. The node should not be trying to dial new eth peers when it already has as many eth peers as allowed, since each such connection is dialed and then immediately closed.

Steps to Reproduce Behavior

  • Run an Alfajores node with the following command. Alfajores best illustrates this problem, since it has fewer nodes, so it's easy for the node to connect to peers quickly. The lower values for maxpeers and light.maxpeers also make the problem easy to see, since it then shows up without needing to connect to as many nodes.
geth --datadir [your datadir] --light.serve 100 --alfajores --maxpeers 71 --light.maxpeers 50
  • To see the incorrect behavior, you can either add some logging to show the values of max dialed and max incoming, or let the node connect to peers and look at how many eth/istanbul peers are dialed vs incoming. Trying this just now, I had 21 dialed connections and 0 incoming connections. Since the eth peers limit is 21 (71 - 50), the correct behavior would be to allow only 7 dialed connections and reserve the rest of the slots for incoming connections.
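Plugging these flag values into the illustrative computeSlots helper sketched earlier gives the same numbers:

    // 71 total peers - 50 LES client slots = 21 eth peers.
    // computeSlots is the hypothetical helper from the sketch above.
    dialed, incoming := computeSlots(21, 50, 3)
    fmt.Println(dialed, incoming) // 7 64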

Notes for fixing

  • Fixing this will require the p2p server config to include enough information to know (a) the max number of eth peers, and (b) the max number of total peers. Just the total can't work, because the max dialed connections needs to be based on the number of eth peers allowed. Having just the max eth peers can't work either, because the overall peers limit (dialed + incoming) needs to allow for both eth peers and LES clients. One way to do this would be to also include the light maxpeers value in the config, since the max eth peers can then be calculated from that and the max total; see the sketch after this list.
  • As mentioned above, this is also a bug upstream. However, running an LES server is less common on the Ethereum network (LES is only in go-ethereum, and most nodes don't run LES servers), and they don't recommend values as large as 1000 for the max light peers number, so it hasn't caused them visible issues like it has for us. However, it is definitely a bug, and once we fix it we could open a PR against upstream as well, or file a GitHub issue. I had previously asked about it in Discord (https://discord.com/channels/482467812179181568/482478927374188555/786674697478733845) but didn't hear back.
  • There is some tangentially related behavior inherited from upstream which is confusing: the value of max light peers is added to the max peers value, unless both have been specified using the flags. That is to say: (a) if one or both of them come from a default value, the max peers value is taken to mean max eth peers, and the max light peers value is added to it to get the total max peers; but (b) if both have been specified, the max peers value is instead taken to be the total, and the light max peers value isn't added to it. This is very unintuitive and likely something we should change. However, since it is a breaking change for the flags and would have implications for the docs and node operators, I wouldn't suggest changing it as part of a PR fixing this bug. The relevant code is:
    lightPeers := ctx.GlobalInt(LightMaxPeersFlag.Name)
    if lightClient && !ctx.GlobalIsSet(LightMaxPeersFlag.Name) {
        // dynamic default - for clients we use 1/10th of the default for servers
        lightPeers /= 10
    }
    if ctx.GlobalIsSet(MaxPeersFlag.Name) {
        cfg.MaxPeers = ctx.GlobalInt(MaxPeersFlag.Name)
        if lightServer && !ctx.GlobalIsSet(LightMaxPeersFlag.Name) {
            cfg.MaxPeers += lightPeers
        }
    } else {
        if lightServer {
            cfg.MaxPeers += lightPeers
        }
        if lightClient && ctx.GlobalIsSet(LightMaxPeersFlag.Name) && cfg.MaxPeers < lightPeers {
            cfg.MaxPeers = lightPeers
        }
    }
    if !(lightClient || lightServer) {
        lightPeers = 0
    }
    ethPeers := cfg.MaxPeers - lightPeers
    if lightClient {
        ethPeers = 0
    }
    log.Info("Maximum peer count", "ETH", ethPeers, "LES", lightPeers, "total", cfg.MaxPeers)

System information

Celo
Version: 1.4.0-unstable
Git Commit: f82aa14059c9b1f210077123df2700b4c912f6f4
Git Commit Date: 20210809
Architecture: amd64
Protocol Versions: [66]
Go Version: go1.16.4
Operating System: darwin
GOPATH=/Users/or/go
GOROOT=go