[GOAL2-731] Better slow peers disconnection logic #15
Conversation
…ding buffer size.
merge from master
I think there's one hole in checking for slow peers that we can fix.
network/wsNetwork.go
config.Consensus[protocol.ConsensusCurrentVersion].SoftCommitteeSize +
	config.Consensus[protocol.ConsensusCurrentVersion].CertCommitteeSize +
	config.Consensus[protocol.ConsensusCurrentVersion].NextCommitteeSize +
	config.Consensus[protocol.ConsensusCurrentVersion].LateCommitteeSize)
As a small nit, I would change "single round" to "single period" (and say that this is the total number of messages sent at once).
I don't think it makes a big difference here as it's a heuristic, but I would also add RedoCommitteeSize and DownCommitteeSize. In particular, the committee for the down votes is the largest, at 6000 possible votes. (It needs to be large because it intersects with the cert votes, which are the key committing votes.)
We also pipeline (relay) all of these votes from the next round and the next period, so it's possible that this number should be 3x as big (as in, we might pipeline 3 periods' worth of votes). On the other hand, this is a pretty unlikely situation and means that the network is experiencing extreme congestion. I think the sum of all our current committee sizes is about 20000 messages, which would make 3x about 60000 messages (so with 0.5KB votes this is 30MB).
I'll increase the size of the buffer by RedoCommitteeSize + DownCommitteeSize.
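As a hedged sketch of the change being discussed (not necessarily the PR's exact code), the buffer capacity would sum every committee size for the current consensus version; the `outgoingMessagesBufferSize` variable name below is illustrative:

```go
package main

import (
	"fmt"

	"github.com/algorand/go-algorand/config"
	"github.com/algorand/go-algorand/protocol"
)

func main() {
	// Size the outgoing buffer to absorb roughly one period's worth of votes:
	// the sum of all committee sizes for the current consensus version.
	params := config.Consensus[protocol.ConsensusCurrentVersion]
	outgoingMessagesBufferSize := params.SoftCommitteeSize +
		params.CertCommitteeSize +
		params.NextCommitteeSize +
		params.LateCommitteeSize +
		params.RedoCommitteeSize + // added per review
		params.DownCommitteeSize // the down committee is the largest (~6000 votes)
	fmt.Println("outgoing buffer size:", outgoingMessagesBufferSize)
}
```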
@@ -99,10 +108,12 @@ var networkHandleMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_ne
 var networkBroadcasts = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcasts_total", Description: "number of broadcast operations"})
 var networkBroadcastQueueMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcast_queue_micros_total", Description: "microseconds broadcast requests sit on queue"})
 var networkBroadcastSendMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcast_send_micros_total", Description: "microseconds spent broadcasting"})
-var networkBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_broadcasts_dropped_total", Description: "number of broadcast messages not sent to some peer"})
+var networkBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_broadcasts_dropped_total", Description: "number of broadcast messages not sent to any peer"})
+var networkPeerBroadcastDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_peer_broadcast_dropped_total", Description: "number of broadcast messages not sent to some peer"})
Could we have separate metrics for drops of high-priority messages and low-priority messages? It seems that high-priority drops would be much more alarming than low-priority drops (a lot of low-priority drops means that we might have a ping-pong script bug; a lot of high-priority drops means that the network could be about to stall).
That's a good idea. I'll defer this to a separate PR. Opened a JIRA issue to track this:
https://algorand.atlassian.net/browse/GOAL2-790
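As a hedged sketch of the deferred idea, the split could follow the metrics.MakeCounter pattern already used in this file; the metric names below are hypothetical and would be decided in the follow-up PR (GOAL2-790):

```go
// Hypothetical per-priority drop counters; names are illustrative only.
var networkHighPrioBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{
	Name:        "algod_network_high_prio_broadcasts_dropped_total",
	Description: "number of high-priority broadcast messages dropped before reaching a peer"})
var networkLowPrioBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{
	Name:        "algod_network_low_prio_broadcasts_dropped_total",
	Description: "number of low-priority broadcast messages dropped before reaching a peer"})
```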
To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right? Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue. I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.
Your observations are correct. If a peer is too slow to process messages, that peer will start losing messages. That's a connection-dependent message drop.
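As a hedged, self-contained illustration of this connection-dependent drop (not the PR's actual code; the type and field names are assumptions), a non-blocking send into one peer's bounded buffer drops the message for that peer only when the buffer is full:

```go
package main

import "fmt"

// Sketch only; the type and field names here are illustrative, not the PR's.
type peer struct {
	sendQueue chan []byte // bounded outgoing buffer for this peer
}

// trySend attempts a non-blocking enqueue. When the peer's buffer is full,
// the message is dropped for that peer only; other peers are unaffected.
func trySend(p *peer, msg []byte) bool {
	select {
	case p.sendQueue <- msg:
		return true
	default:
		// The real code would increment a per-peer drop counter here,
		// such as networkPeerBroadcastDropped in the diff above.
		return false
	}
}

func main() {
	p := &peer{sendQueue: make(chan []byte, 1)}
	fmt.Println(trySend(p, []byte("vote1"))) // true: buffer has room
	fmt.Println(trySend(p, []byte("vote2"))) // false: buffer full, message dropped
}
```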
merge from master
Implement return type checking
Our existing FPR could generate a large quantity of messages.
These, in turn, could cause the internal output buffer to overflow, triggering a peer disconnection.
That's not the desired behavior; instead, we want to disconnect a peer only when the messages being written to it are too old.
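A hedged sketch of that age-based check (illustrative, not the PR's actual implementation): each queued message carries its enqueue time, and the writer disconnects the peer only when a message has waited longer than a cutoff. The `maxMessageQueueDuration` name and value are assumptions:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative only: a queued broadcast remembers when it was enqueued.
type queuedMessage struct {
	data     []byte
	enqueued time.Time
}

// Assumed cutoff; the real threshold is whatever the PR chooses.
const maxMessageQueueDuration = 25 * time.Second

// shouldDisconnect reports whether the peer is too slow: the trigger is not
// "the buffer overflowed" but "the message about to be written is too old".
func shouldDisconnect(msg queuedMessage, now time.Time) bool {
	return now.Sub(msg.enqueued) > maxMessageQueueDuration
}

func main() {
	fresh := queuedMessage{data: []byte("vote"), enqueued: time.Now()}
	stale := queuedMessage{data: []byte("vote"), enqueued: time.Now().Add(-30 * time.Second)}
	fmt.Println(shouldDisconnect(fresh, time.Now())) // false: keep the peer
	fmt.Println(shouldDisconnect(stale, time.Now())) // true: disconnect the slow peer
}
```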