
[GOAL2-731] Better slow peers disconnection logic #15

Merged (8 commits) on Jun 19, 2019

Conversation

tsachiherman (Contributor)

Our existing FPR could generate large quantities of messages.
These, in turn, could cause the internal output buffer to overflow, triggering a peer disconnection.
That's not the desired behavior; instead, we want to disconnect the peer if the messages that are being written to it are too old.
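
A minimal sketch of the age-based idea, using invented names (sendEntry, maxMessageQueueDuration) rather than the actual wsPeer fields:

// Sketch only: disconnect a slow peer based on how stale its oldest queued
// message is, rather than on the outgoing buffer overflowing.
package network

import "time"

// sendEntry stands in for a queued outgoing message.
type sendEntry struct {
	data     []byte
	enqueued time.Time
}

// maxMessageQueueDuration is an assumed cutoff; the real value is set in the PR.
const maxMessageQueueDuration = 25 * time.Second

// shouldDisconnect reports whether the peer is too slow, judged by the age of
// the message at the head of its write queue.
func shouldDisconnect(head sendEntry, now time.Time) bool {
	return now.Sub(head.enqueued) > maxMessageQueueDuration
}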

CLAassistant commented Jun 13, 2019

CLA assistant check: all committers have signed the CLA.

network/wsNetwork.go: outdated review threads (resolved)
@tsachiherman tsachiherman requested a review from zeldovich June 13, 2019 19:34
network/wsPeer.go: outdated review thread (resolved)
zeldovich previously approved these changes Jun 14, 2019
@tsachiherman tsachiherman changed the title [GOAL2-720] Better slow peers disconnection logic [GOAL2-731] Better slow peers disconnection logic Jun 17, 2019
algobolson (Contributor) left a comment:

I think there's one hole in checking for slow peers that we can fix.

network/wsPeer.go: review thread (resolved)
config.Consensus[protocol.ConsensusCurrentVersion].SoftCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].CertCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].NextCommitteeSize +
config.Consensus[protocol.ConsensusCurrentVersion].LateCommitteeSize)
Contributor commented:

As a small nit, I would change "single round" to "single period" (and say that this is the total number of messages sent at once).

I don't think it makes a big difference here as it's a heuristic, but I would also add RedoCommitteeSize and DownCommitteeSize. In particular, the committee for the down votes is the largest, at 6000 possible votes. (It needs to be large because it intersects with the cert votes, which are the key committing votes.)

We also pipeline (relaying) all of these votes from the next round and the next period, so it's possible that this number should be 3x as big (as in, we might pipeline 3 periods' worth of votes). On the other hand, this is a pretty unlikely situation and means that the network is experiencing extreme congestion. I think with the current committee sizes, the sum of all our committee sizes is about 20000 messages, which would make 3x about 60000 messages (so with 0.5KB votes this is 30MB).

tsachiherman (Author) replied:

I'll increase the size of the buffer by RedoCommitteeSize + DownCommitteeSize.
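
A sketch of what that enlarged buffer computation might look like, extrapolated from the diff fragment above; the variable name outgoingMessagesBufferSize and the import paths are assumptions:

package network

import (
	"github.com/algorand/go-algorand/config"
	"github.com/algorand/go-algorand/protocol"
)

// outgoingMessagesBufferSize is an assumed name for the per-peer queue capacity,
// now also summing the redo and down committees as suggested in the review.
var outgoingMessagesBufferSize = int(
	config.Consensus[protocol.ConsensusCurrentVersion].SoftCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].CertCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].NextCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].LateCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].RedoCommitteeSize +
		config.Consensus[protocol.ConsensusCurrentVersion].DownCommitteeSize)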

@@ -99,10 +108,12 @@ var networkHandleMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_ne
var networkBroadcasts = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcasts_total", Description: "number of broadcast operations"})
var networkBroadcastQueueMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcast_queue_micros_total", Description: "microseconds broadcast requests sit on queue"})
var networkBroadcastSendMicros = metrics.MakeCounter(metrics.MetricName{Name: "algod_network_broadcast_send_micros_total", Description: "microseconds spent broadcasting"})
-var networkBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_broadcasts_dropped_total", Description: "number of broadcast messages not sent to some peer"})
+var networkBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_broadcasts_dropped_total", Description: "number of broadcast messages not sent to any peer"})
+var networkPeerBroadcastDropped = metrics.MakeCounter(metrics.MetricName{Name: "algod_peer_broadcast_dropped_total", Description: "number of broadcast messages not sent to some peer"})
Contributor commented:

Could we have separate metrics for drops of high-priority messages and low-priority messages? It seems that high-priority drops would be much more alarming than low-priority drops (a lot of low-priority drops means that we might have a ping-pong script bug; a lot of high-priority drops means that the network could be about to stall).
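
A hedged sketch of what split counters could look like, following the MakeCounter pattern from the diff above; the metric names and the util/metrics import path are assumptions, not what was ultimately adopted:

package network

import "github.com/algorand/go-algorand/util/metrics"

// Hypothetical per-priority drop counters, mirroring the existing pattern.
var networkHighPrioBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{
	Name:        "algod_broadcasts_dropped_high_prio_total",
	Description: "number of high-priority broadcast messages not sent to some peer"})
var networkLowPrioBroadcastsDropped = metrics.MakeCounter(metrics.MetricName{
	Name:        "algod_broadcasts_dropped_low_prio_total",
	Description: "number of low-priority broadcast messages not sent to some peer"})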

tsachiherman (Author) replied:

That's a good idea. I'll defer this to a separate PR. Opened a JIRA issue to track this:
https://algorand.atlassian.net/browse/GOAL2-790

network/wsNetwork.go: outdated review thread (resolved)
derbear (Contributor) commented Jun 18, 2019

To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right?

Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue.

I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.

tsachiherman (Author) replied, quoting derbear's comment above:

> To clarify, this commit also changes behavior for high-priority broadcasts, removing the distinction between high- and low-priority broadcasts, right?
>
> Specifically, for any message, drop a send (1) to a certain peer if the non-blocking send fails and (2) to all peers if the message took too long to leave the queue.
>
> I think this change is probably a good idea. The interesting thing is how it'll affect the agreement protocol given that before this change, vote delivery was "guaranteed" on a persistent connection. It might be a good idea to stress-test this change somehow with a private network.

Your observations are correct. If the peer is too slow to process messages, that peer is going to start losing messages; that's a connection-dependent message drop.
If the message is too old and gets eliminated for all the peers, it means we're sending messages too fast for the broadcastThread to process.
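
A condensed sketch of those two drop paths, with channels standing in for the per-peer write queues; the function and parameter names are invented for illustration:

package network

import "time"

// broadcastToPeers illustrates the two drop conditions discussed above.
func broadcastToPeers(peerOutboxes []chan []byte, msg []byte, enqueued time.Time, maxQueueDuration time.Duration) {
	// (2) The message waited too long on the broadcast queue: drop it for all peers.
	if time.Since(enqueued) > maxQueueDuration {
		return // networkBroadcastsDropped would be incremented here
	}
	for _, outbox := range peerOutboxes {
		// (1) Non-blocking send: a full per-peer buffer drops the message for that peer only.
		select {
		case outbox <- msg:
		default:
			// networkPeerBroadcastDropped would be incremented here
		}
	}
}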

@zeldovich zeldovich merged commit ea85541 into algorand:master Jun 19, 2019
pzbitskiy pushed a commit to pzbitskiy/go-algorand that referenced this pull request Mar 19, 2020
shiqizng pushed a commit to shiqizng/go-algorand that referenced this pull request Apr 7, 2022