-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
network: non-participating nodes request TX gossip only if ForceFetchTransactions: true #3918
Conversation
fix postMessagesOfInterestThraed() fix maybeSendMessagesOfInterest lock/unlock lots of debug logging
wsNetwork rename 'isParticipating' to 'wantTXGossip' to be more precise
Codecov Report
@@ Coverage Diff @@
## master #3918 +/- ##
==========================================
+ Coverage 49.78% 54.54% +4.75%
==========================================
Files 409 390 -19
Lines 69145 48541 -20604
==========================================
- Hits 34426 26477 -7949
+ Misses 30997 19834 -11163
+ Partials 3722 2230 -1492
Continue to review full report at Codecov.
|
@brianolson did you see the race condition? |
network/wsNetwork.go
Outdated
wantTXGossip := atomic.LoadUint32(&wn.wantTXGossip) | ||
if wantTXGossip != wantTXGossipNo { | ||
wn.messagesOfInterest[protocol.TxnTag] = true | ||
} else { | ||
delete(wn.messagesOfInterest, protocol.TxnTag) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic looks like belongs to the caller where wn.messagesOfInterest
is updated (added, deleted)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. wound up being able to simplify away a chunk of logic that became redundant during development.
incomingMsgSync.Unlock() | ||
} | ||
|
||
func TestWebsocketNetworkTXMessageOfInterestForceTx(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test and above are quite similar, consider generalizing into a _testWebsocket...
function
|
||
netB.OnNetworkAdvance() | ||
// TODO: better event driven thing for netB sending new MOI | ||
time.Sleep(10 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe messagesOfInterestCond.Wait()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite because postMessagesOfInterestThread()
is waiting on that and I want to wait until after that thread has had a chance to run. (But nothing that isn't test code would wait for that thread)
wn.log.Infof("ws send msgOfInterest: %v", err) | ||
} | ||
} | ||
wn.maybeSendMessagesOfInterest(peer, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not pass in wn.messagesOfInterestEnc here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because wn.messagesOfInterestEnc should only be grabbed inside wn.messagesOfInterestMu.Lock().
Having it as an argument is really just an optimization for postMessagesOfInterestThread()
which already holds the lock while doing a Cond.Wait().
I could put a lock/get/unlock three lines in front of the two other calls to maybeSendMOI() but it's slightly fewer lines of code to do it inside that? (and still reasonable separation of concerns?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks fine to me, the tests with time.Sleep could be improved since it adds potential random failure point
This reverts commit a74255a.
I think this PR is still generally ready to go except for cce's tweak dismissing the review? |
I noticed an unpredictable race detector error that went away after rerunning, which I guess means it is a flaky race issue: |
The 'race' is that wsPeer.messagesOfInterestGeneration is written by wsPeer.go code and read by wsNetwork.go code. |
Why? the variable accessed from two different go-routines without synchronization, and this is a race, isn't it? |
currently working on a sometimes-deadlock in TestWebsocketNetworkMessageOfInterest() |
…m that needed some restructuring
@@ -1677,6 +1725,17 @@ func (wn *WebsocketNetwork) OnNetworkAdvance() { | |||
wn.lastNetworkAdvanceMu.Lock() | |||
defer wn.lastNetworkAdvanceMu.Unlock() | |||
wn.lastNetworkAdvance = time.Now().UTC() | |||
if wn.nodeInfo != nil && !wn.relayMessages && !wn.config.ForceFetchTransactions { | |||
// if we're not a relay, and not participating, we don't need txn pool | |||
wantTXGossip := wn.nodeInfo.IsParticipating() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking deeper at OnNetworkAdvance, here it calls nodeInfo.IsParticipating() which contains a call to Ledger.Latest() (which takes a lock on blockQueue.mu.Lock()) and accountManager.HasLiveKeys() (which takes a lock on manager.mu.Lock()).
Before this PR, OnNetworkAdvance used to be very fast — just updating a timestamp. Now it could get stuck for a long time, I know that accountManager.DeleteOldKeys() (which also takes the manager.mu.Lock()) could take seconds in some cases.
Is it possible that OnNetworkAdvance could take too long, and what would be the impact if it did? Agreement synchronously waits for OnNetworkAdvance to finish at the end of every round, when calling the node's EnsureBlock/EnsureValidatedBlock methods. This means the performance of OnNetworkAdvance will delay the start of agreement beginning the next round.
Summary
Save bandwidth by having non-participating non-relay nodes opt-out of TX transaction gossip traffic.
Test Plan
New unit tests.
Manual testing has started local private networks to ensure that the new message-of-interest propagated. Cluster tests were run to check bandwidth usage.