-
Notifications
You must be signed in to change notification settings - Fork 38
Cache known votes in gossip #227
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable for a stop gap solution.
|
||
expired | ||
}) | ||
} | ||
|
||
#[allow(clippy::type_complexity)] | ||
fn message_allowed<'a>(&'a self) -> Box<dyn FnMut(&PeerId, MessageIntent, &B::Hash, &[u8]) -> bool + 'a> { | ||
let live_rounds = self.live_rounds.read(); | ||
let known_votes = self.known_votes.read(); | ||
Box::new(move |_who, _intent, _topic, mut data| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to how propagate() handles MessageIntent::PeriodicRebroadcast
, I'm inclined to just drop rebroadcasts. Something like
Box::new(move |_who, intent, _topic, mut data| {
// We drop rebroadcasts. If a peer doesn't know a message, the gossip state machine will
// treat a rebroadcast as a broadcast. In other words, if we do see a rebroadcast, it is
// *alwayas* an already known message.
if intent == MessageIntent::PeriodicRebroadcast {
return false;
}
I tested this with rococo-local
and it worked well. Theoretically this should give us a pretty minimal amount of gossip messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, but this doesn't guarantee that the round is concluded for everyone really. Until we have nodes requesting missing BEEFY justifications (I guess part of import queue stuff) I think we should be on the safe side of gossiping more, but without risking some sessions not being concluded.
I guess it is a potential workaround for now as well, but given this cache receiving re-broadcasted votes should be cheap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gossip state machine seems to assure, that every peer sees every message at least once as a Broadcast
. That should be good enough to conclude a round. Especially since we only need 2/3 + 1 votes and not every vote from every validator.
Whether rebroadcasting is cheap totally depends on the connectivity of the nodes. If every nodes sees every vote, we do rebroadcast n * (n - 1) times.
This PR introduces a cache for known votes (ideally I think we should be avoiding the
RwLock
stuff). Each time we receive a valid vote for a particular round, we keep it's hash in gossip to quickly approve such votes in the future if they are coming from other peers. AFAICT thesc_network_gossip
does de-duplicate incoming messages, but only if they are coming from the same peer, not necessarily if they are sent by multiple peers.My guess is that receiving & verifying the same votes from multiple peers might be causing the high CPU usage we are observing on Rococo, so this cache should alleviate the issue before we implement #212 and #182.
Also in
validate
function we quickly reject messages that are for old rounds. Previously we were always verifying and accepting such messages, but later on they were considered "expired" by the gossip subsystem, hence removed.