
[GOAL2-731] Better slow peers disconnection logic #15

Merged (8 commits) on Jun 19, 2019

Conversation

tsachiherman (Contributor)

Our existing FPR could generate large quantities of messages.
These, in turn, could cause the internal output buffer to overflow, triggering a peer disconnection.
That's not the desired behavior; instead, we want to disconnect the peer if the messages that are being written to it are too old.
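
A minimal sketch of the age-based idea, using invented names (sendEntry, maxMessageQueueDuration) rather than the actual wsPeer fields:

// Sketch only: disconnect a slow peer based on how stale its oldest queued
// message is, rather than on the outgoing buffer overflowing.
package network

import "time"

// sendEntry stands in for a queued outgoing message.
type sendEntry struct {
	data     []byte
	enqueued time.Time
}

// maxMessageQueueDuration is an assumed cutoff; the real value is set in the PR.
const maxMessageQueueDuration = 25 * time.Second

// shouldDisconnect reports whether the peer is too slow, judged by the age of
// the message at the head of its write queue.
func shouldDisconnect(head sendEntry, now time.Time) bool {
	return now.Sub(head.enqueued) > maxMessageQueueDuration
}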

CLAassistant commented Jun 13, 2019

CLA assistant check: all committers have signed the CLA.

network/wsNetwork.go: outdated review threads (resolved)
@tsachiherman tsachiherman requested a review from zeldovich June 13, 2019 19:34
network/wsPeer.go: outdated review thread (resolved)
zeldovich previously approved these changes Jun 14, 2019
@tsachiherman tsachiherman changed the title [GOAL2-720] Better slow peers disconnection logic [GOAL2-731] Better slow peers disconnection logic Jun 17, 2019
algobolson (Contributor) left a comment:

I think there's one hole in checking for slow peers that we can fix.

network/wsPeer.go: review thread (resolved)
config.Consensus[protocol.ConsensusCurrentVersion].SoftCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].CertCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].NextCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].LateCommitteeSize)
Contributor commented:

As a small nit, I would change "single round" to "single period" (and say that this is the total number of messages sent at once).

I don't think it makes a big difference here as it's a heuristic, but I would also add RedoCommitteeSize and DownCommitteeSize. In particular, the committee for the down votes is the largest, at 6000 possible votes. (It needs to be large because it intersects with the cert votes, which are the key committing votes.)

We also pipeline (relaying) all of these votes from the next round and the next period, so it's possible that this number should be 3x as big (as in, we might pipeline 3 periods' worth of votes). On the other hand, this is a pretty unlikely situation and means that the network is experiencing extreme congestion. I think with the current committee sizes, the sum of all our committee sizes is about 20000 messages, which would make 3x about 60000 messages (so with 0.5KB votes this is 30MB).

tsachiherman (Author) replied:

I'll increase the size of the buffer by RedoCommitteeSize + DownCommitteeSize.
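
A sketch of what that enlarged buffer computation might look like, extrapolated from the diff fragment above; the variable name outgoingMessagesBufferSize and the import paths are assumptions:

package network

import (
	"github.com/algorand/go-algorand/config"
	"github.com/algorand/go-algorand/protocol"
)

// outgoingMessagesBufferSize is an assumed name for the per-peer queue capacity,
// now also summing the redo and down committees as suggested in the review.
var outgoingMessagesBufferSize = int(
	config.Consensus[protocol.ConsensusCurrentVersion].SoftCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].CertCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].NextCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].LateCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].RedoCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].DownCommitteeSize)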

@@ -99,10 +108,12 @@ var networkHandleMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_ne
var networkBroadcasts = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcasts_total", Description: "number of broadcast operations"})
var networkBroadcastQueueMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcast_queue_micros_total", Description: "microseconds broadcast requests sit on queue"})
var networkBroadcastSendMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcast_send_micros_total", Description: "microseconds spent broadcasting"})
-var networkBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_broadcasts_dropped_total", Description: "number of broadcast messages not sent to some peer"})
+var networkBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_broadcasts_dropped_total", Description: "number of broadcast messages not sent to any peer"})
+var networkPeerBroadcastDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_peer_broadcast_dropped_total", Description: "number of broadcast messages not sent to some peer"})
Contributor commented:

Could we have separate metrics for drops of high-priority messages and low-priority messages? It seems that high-priority drops would be much more alarming than low-priority drops (a lot of low-priority drops means that we might have a ping-pong script bug; a lot of high-priority drops means that the network could be about to stall).
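
A hedged sketch of what split counters could look like, following the MakeCounter pattern from the diff above; the metric names and the util/metrics import path are assumptions, not what was ultimately adopted:

package network

import "github.com/algorand/go-algorand/util/metrics"

// Hypothetical per-priority drop counters, mirroring the existing pattern.
var networkHighPrioBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{
	Name:        "algod_broadcasts_dropped_high_prio_total",
	Description: "number of high-priority broadcast messages not sent to some peer"})
var networkLowPrioBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{
	Name:        "algod_broadcasts_dropped_low_prio_total",
	Description: "number of low-priority broadcast messages not sent to some peer"})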

tsachiherman (Author) replied:

That's a good idea. I'll defer this to a separate PR. Opened a JIRA issue to track this:
https://algorand.atlassian.net/browse/GOAL2-790

network/wsNetwork.go: outdated review thread (resolved)
derbear (Contributor) commented Jun 18, 2019

To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right?

Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue.

I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.

tsachiherman (Author) replied, quoting derbear's comment above:

> To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right?
>
> Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue.
>
> I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.

Your observations are correct. If the peer is too slow to process messages, that peer is going to start losing messages; that's a connection-dependent message drop.
If the message is too old and gets eliminated for all the peers, it means we're sending messages too fast for the broadcastThread to process.
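
A condensed sketch of those two drop paths, with channels standing in for the per-peer write queues; the function and parameter names are invented for illustration:

package network

import "time"

// broadcastToPeers illustrates the two drop conditions discussed above.
func broadcastToPeers(peerOutboxes []chan []byte, msg []byte, enqueued time.Time, maxQueueDuration time.Duration) {
	// (2) The message waited too long on the broadcast queue: drop it for all peers.
	if time.Since(enqueued) > maxQueueDuration {
		return // networkBroadcastsDropped would be incremented here
	}
	for _, outbox := range peerOutboxes {
		// (1) Non-blocking send: a full per-peer buffer drops the message for that peer only.
		select {
		case outbox <- msg:
		default:
			// networkPeerBroadcastDropped would be incremented here
		}
	}
}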

@zeldovich zeldovich merged commit ea85541 into algorand:master Jun 19, 2019
pzbitskiy pushed a commit to pzbitskiy/go-algorand that referenced this pull request Mar 19, 2020
shiqizng pushed a commit to shiqizng/go-algorand that referenced this pull request Apr 7, 2022