
improved gossip topology #3270

Merged
merged 16 commits into from
Jun 18, 2021

Conversation

ordian
Member

@ordian ordian commented Jun 16, 2021

Fixes #3239.

Changes the parachain gossip topology to reduce the number of hops and the amount of duplicate messages for approval-assignment distribution (see the topology sketch after the checklist below).

  • resolve TODO (ordian)
  • update the guide
  • fix the tests
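Not part of the original description, but for readers unfamiliar with the approach: a minimal sketch of the kind of grid topology this PR moves toward, assuming validators are laid out row-major in a roughly sqrt(n)-wide grid and each node gossips to everyone sharing its row or column, which is what bounds the gossip diameter at two hops. Function and variable names are illustrative only.

```rust
// Illustrative sketch only, not the actual implementation in this PR:
// lay the n validators out row-major in a ceil(sqrt(n))-wide grid and
// treat everyone sharing a row or column as a gossip neighbor.
fn grid_neighbors(our_index: usize, n: usize) -> Vec<usize> {
    let width = (n as f64).sqrt().ceil() as usize;
    let (our_row, our_col) = (our_index / width, our_index % width);

    (0..n)
        .filter(|&i| i != our_index && (i / width == our_row || i % width == our_col))
        .collect()
}

fn main() {
    // With 900 validators each node has about 2 * (30 - 1) = 58 neighbors,
    // and any two nodes share a row/column peer, so two hops suffice.
    println!("{} neighbors", grid_neighbors(0, 900).len());
}
```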

@ordian ordian added B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Jun 16, 2021
@github-actions github-actions bot added the A3-in_progress Pull request is in progress. No review needed at this stage. label Jun 16, 2021
@@ -350,7 +356,11 @@ where
}
})
.collect::<Vec<PeerId>>();
let interested_peers = util::choose_random_sqrt_subset(interested_peers, MIN_GOSSIP_PEERS);
let interested_peers = util::choose_random_subset(
Member

Why do we only use a subset?

Member Author

@ordian ordian Jun 16, 2021

I expect that the subset will be limited to MIN_GOSSIP_PEERS in two cases:

  1. Small network (2 * sqrt(N) < MIN_GOSSIP_PEERS).
  2. The node has just restarted and its DHT is not yet populated. It would go back up at the next session, so there is not much point in making this logic more complex than it needs to be (see the sketch below).
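A hedged sketch of the selection logic described above (the names and signature are illustrative, not the real `util::choose_random_subset` API): always keep the peers from the topology and pad with random peers so that at least MIN_GOSSIP_PEERS end up selected on small or freshly restarted nodes.

```rust
use rand::seq::SliceRandom;

const MIN_GOSSIP_PEERS: usize = 25; // illustrative value, not necessarily the real constant

// Keep all topology ("required") peers and top up with random other peers
// until at least `min` are selected, covering the two cases above.
fn choose_subset_with_floor<T>(required: Vec<T>, mut others: Vec<T>, min: usize) -> Vec<T> {
    others.shuffle(&mut rand::thread_rng());

    let mut selected = required;
    while selected.len() < min {
        match others.pop() {
            Some(peer) => selected.push(peer),
            None => break,
        }
    }
    selected
}

fn main() {
    let grid_peers: Vec<u32> = (0..10).collect();        // e.g. 2 * sqrt(N) on a small network
    let other_known_peers: Vec<u32> = (10..100).collect();
    let selected = choose_subset_with_floor(grid_peers, other_known_peers, MIN_GOSSIP_PEERS);
    assert!(selected.len() >= MIN_GOSSIP_PEERS);
}
```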

@@ -532,6 +542,9 @@ where
// get rid of superfluous data
state.peer_views.remove(&peerid);
}
NetworkBridgeEvent::NewGossipTopology(peers) => {
Member

Shouldn't this be per relay chain head? So at session boundaries, we would gossip to different peers depending on the current head? I think that would be good for smooth session changes.

Member Author

Are you saying that during the session change we should gossip to two gossip groups (from the previous and the current topology)?

Member

No, usually we gossip per head - right? So if we are sending messages with regard to head x, we should be using gossip peers that belong to the session of head x. Otherwise we lose the guarantee that we are going to reach everybody in two hops. I mean in practice it will be fine, just because of the huge amount of redundancy.

Member Author

@ordian ordian Jun 16, 2021

  • newly added gossip peers should receive their updates, as in 12de015 (a rough sketch of this handling follows below)
  • peers should receive NewGossipTopology on the first block that has a new session index for its child, which should happen at roughly the same time for all peers. But if not, as you mentioned, redundancy helps.
  • we accept all valid incoming messages (not just from gossip peers)
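A rough, purely illustrative sketch of the first point (types and helpers are simplified stand-ins, not the actual approval-distribution code): on NewGossipTopology, remember the new gossip peers and catch any newly added neighbor up on the messages we already know, skipping ones the peer has already seen.

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins for the real network and protocol types.
type PeerId = u64;
type MessageId = u64;
struct Message { id: MessageId }

struct State {
    gossip_peers: HashSet<PeerId>,
    known_messages: Vec<Message>,
    peer_knowledge: HashMap<PeerId, HashSet<MessageId>>,
}

fn send_to_peer(_peer: &PeerId, _msg: &Message) {
    // Stub: the real code would go through the network bridge here.
}

fn handle_new_gossip_topology(state: &mut State, new_peers: HashSet<PeerId>) {
    // Peers that are gossip neighbors now but were not before.
    let newly_added: Vec<PeerId> =
        new_peers.difference(&state.gossip_peers).copied().collect();
    state.gossip_peers = new_peers;

    // Catch new neighbors up on everything we already know,
    // skipping messages a peer is already known to have.
    for peer in newly_added {
        for msg in &state.known_messages {
            let already_known = state
                .peer_knowledge
                .get(&peer)
                .map_or(false, |known| known.contains(&msg.id));
            if !already_known {
                send_to_peer(&peer, msg);
            }
        }
    }
}

fn main() {}
```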

@ordian ordian marked this pull request as ready for review June 17, 2021 15:51
@ordian ordian added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Jun 17, 2021
ordian added 2 commits June 17, 2021 17:54
* master:
  Companion #9019 (max rpc payload override) (#3276)
  Implementers' Guide: Chain Selection (#3262)
  CLI: Add missing feature checking and check if someone passes a file (#3283)
  Export 'TakeRevenue' trait. (#3278)
  Add XCM Decode Limit (#3273)
  Allow Council to Use Scheduler (#3237)
  fix xcm pallet origin (#3272)
  extract determine_new_blocks into a separate utility (#3261)
  Approval checking unit tests (#3252)
  bridges: update finality-grandpa to 0.14.1 (#3266)
  malus - mockable overseer mvp (#3224)
  use safe math (#3249)
  Companion for #8920 (Control Staking) (#3260)
  Companion for #8949 (#3216)
@rphmeier rphmeier added B1-releasenotes and removed B0-silent Changes should not be mentioned in any release notes labels Jun 17, 2021
@rphmeier
Contributor

@ordian I added the releasenotes label.

Can you add something to the PR description about the reduced bandwidth? In expectation it drops to 2·sqrt(n)/n, so for 900 validators roughly 6.5% of the current bandwidth.
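(For reference, the arithmetic behind that estimate: each message is forwarded to roughly 2·sqrt(n) peers instead of all n, so 2·sqrt(900)/900 = 60/900 ≈ 6.7%, in the ballpark of the 6.5% quoted.)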

@ordian
Member Author

ordian commented Jun 17, 2021

> @ordian I added the releasenotes label.
>
> Can you add something to the PR description about the reduced bandwidth? In expectation it drops to 2·sqrt(n)/n, so for 900 validators roughly 6.5% of the current bandwidth.

We'd send more messages for bitfield distribution and small statement distribution (2x), but fewer for approval distribution (because it's not based on active leaves, but rather on peers' known messages, and the sqrt selection is now static for the duration of the session). But the real benefit would be a guaranteed network diameter of 2, I think. Will update the description.

@ordian
Member Author

ordian commented Jun 17, 2021

> @ordian I added the releasenotes label.
> Can you add something to the PR description about the reduced bandwidth? In expectation it drops to 2·sqrt(n)/n, so for 900 validators roughly 6.5% of the current bandwidth.
>
> We'd send more messages for bitfield distribution and small statement distribution (2x), but fewer for approval distribution (because it's not based on active leaves, but rather on peers' known messages, and the sqrt selection is now static for the duration of the session). But the real benefit would be a guaranteed network diameter of 2, I think. Will update the description.

Confirmed by metrics:
the number of /polkadot/validation/1 notifications went up by less than 2x, but the bandwidth went down by more than 2x, and so did the average size of notifications.

@@ -303,3 +303,17 @@ impl Network for Arc<NetworkService<Block, Hash>> {
);
}
}

/// We assume one peer_id per authority_id.
pub async fn get_peer_id_by_authority_id<AD: AuthorityDiscovery>(
Contributor

So this would start to break if many nodes changed their PeerIds at once, without rotating their session keys.
I believe this is acceptable, as honest nodes will practically always rotate their keys before changing the machine / VM their node is running on, due to the risk of equivocation / slashing.

Member Author

If a node changes its PeerId, I'd assume it restarts, connects under the new PeerId, and disconnects under the old one. It would also publish a new record on the DHT with the new PeerId. It's true that our cache might contain the old record when that happens.
But the list of our neighbors is not necessarily connected to us; we'll use random peers if there are not enough.
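To illustrate the exchange above, a hedged sketch of what resolving a PeerId from an authority id could look like, assuming an authority-discovery service that hands back the multiaddresses published on the DHT. The trait and types below are simplified stand-ins (the real function in this diff is async and uses the actual network types).

```rust
// Simplified stand-ins; the real code uses the network's AuthorityDiscovery
// service and proper multiaddr/PeerId types.
type AuthorityDiscoveryId = [u8; 32];
type PeerId = String;
type Multiaddr = String;

trait AuthorityDiscovery {
    // Multiaddresses published on the DHT for this authority, if cached.
    fn addresses_of(&self, authority: &AuthorityDiscoveryId) -> Option<Vec<Multiaddr>>;
}

// We assume one peer id per authority id: take the `/p2p/<peer-id>` suffix
// of the first known address. If the cache is empty or stale (freshly
// restarted node, or a peer that changed its PeerId without publishing a
// new DHT record), this returns None and the caller falls back to random peers.
fn get_peer_id_by_authority_id<AD: AuthorityDiscovery>(
    discovery: &AD,
    authority: &AuthorityDiscoveryId,
) -> Option<PeerId> {
    discovery.addresses_of(authority)?.into_iter().find_map(|addr| {
        addr.rsplit_once("/p2p/").map(|(_, peer_id)| peer_id.to_string())
    })
}

fn main() {}
```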

* master:
  Set new staking limits (#3299)
  Bump derive_more from 0.99.11 to 0.99.14 (#3248)
  add revert consensus log (#3275)
  Add bridge team as codeowners of `bridges` Subtree (#3291)
  Extract and test count_no_shows method for approval voting (#3264)