
network: non-participating nodes request TX gossip only if ForceFetchTransactions: true #3918

Merged
24 commits merged into master on May 25, 2022

Conversation

@brianolson brianolson commented Apr 25, 2022

Summary

Save bandwidth by having non-participating, non-relay nodes opt out of transaction (TX) gossip traffic.

Test Plan

New unit tests.
Manual testing started local private networks to verify that the new message-of-interest message propagated. Cluster tests were run to check bandwidth usage.

fix postMessagesOfInterestThread()
fix maybeSendMessagesOfInterest lock/unlock
lots of debug logging
wsNetwork rename 'isParticipating' to 'wantTXGossip' to be more precise
@cce cce changed the title NPN no TX network: NPN no TX May 4, 2022
codecov-commenter commented May 4, 2022

Codecov Report

Merging #3918 (df15614) into master (8f3e557) will increase coverage by 4.75%.
The diff coverage is 78.46%.

@@            Coverage Diff             @@
##           master    #3918      +/-   ##
==========================================
+ Coverage   49.78%   54.54%   +4.75%     
==========================================
  Files         409      390      -19     
  Lines       69145    48541   -20604     
==========================================
- Hits        34426    26477    -7949     
+ Misses      30997    19834   -11163     
+ Partials     3722     2230    -1492     
Impacted Files Coverage Δ
data/account/participationRegistry.go 78.25% <0.00%> (-1.00%) ⬇️
data/accountManager.go 68.42% <0.00%> (+1.75%) ⬆️
node/node.go 24.96% <33.33%> (+1.76%) ⬆️
network/wsPeer.go 67.67% <40.00%> (-0.39%) ⬇️
network/wsNetwork.go 64.98% <94.11%> (+2.18%) ⬆️
catchup/peerSelector.go 98.95% <0.00%> (-1.05%) ⬇️
data/abi/abi_type.go 87.67% <0.00%> (-0.95%) ⬇️
compactcert/msgp_gen.go
crypto/compactcert/msgp_gen.go
... and 25 more


@brianolson brianolson self-assigned this May 5, 2022
@brianolson brianolson marked this pull request as ready for review May 5, 2022 00:39
@brianolson brianolson requested a review from a user May 5, 2022 10:54
cce commented May 16, 2022

@brianolson did you see the race condition?

Comment on lines 2354 to 2359
wantTXGossip := atomic.LoadUint32(&wn.wantTXGossip)
if wantTXGossip != wantTXGossipNo {
wn.messagesOfInterest[protocol.TxnTag] = true
} else {
delete(wn.messagesOfInterest, protocol.TxnTag)
}
Contributor:

this logic looks like it belongs to the caller, where wn.messagesOfInterest is updated (added to, deleted from)

Contributor (Author):

Good catch. I wound up being able to simplify away a chunk of logic that had become redundant during development.

incomingMsgSync.Unlock()
}

func TestWebsocketNetworkTXMessageOfInterestForceTx(t *testing.T) {
Contributor:

this test and the one above are quite similar; consider generalizing them into a shared _testWebsocket... helper function


netB.OnNetworkAdvance()
// TODO: better event driven thing for netB sending new MOI
time.Sleep(10 * time.Millisecond)
Contributor:

maybe messagesOfInterestCond.Wait()?

Contributor (Author):

Not quite, because postMessagesOfInterestThread() is waiting on that, and I want to wait until after that thread has had a chance to run. (But nothing outside of test code would wait for that thread.)

@cce cce requested review from winder and zeldovich May 20, 2022 14:25
wn.log.Infof("ws send msgOfInterest: %v", err)
}
}
wn.maybeSendMessagesOfInterest(peer, nil)
Contributor:

why not pass in wn.messagesOfInterestEnc here?

Contributor (Author):

Because wn.messagesOfInterestEnc should only be read while holding wn.messagesOfInterestMu.
Passing it as an argument is really just an optimization for postMessagesOfInterestThread(), which already holds the lock while doing a Cond.Wait().
I could put a lock/get/unlock three lines in front of the two other calls to maybeSendMOI(), but it's slightly fewer lines of code to do it inside the function, and still a reasonable separation of concerns.

@algorandskiy previously approved these changes May 20, 2022

@algorandskiy (Contributor) left a comment:

The code looks fine to me; the tests that use time.Sleep could be improved, since sleeping adds a potential source of random failures.

@cce cce changed the title network: NPN no TX network: non-participating nodes request TX gossip only if ForceFetchTransactions: true May 20, 2022
@algobarb algobarb self-requested a review May 20, 2022 17:30
@brianolson (Author):

I think this PR is still generally ready to go except for cce's tweak dismissing the review?


cce commented May 23, 2022

I noticed an unpredictable race detector error that went away after rerunning, which I guess means it is a flaky race issue:
error case: https://app.circleci.com/pipelines/github/algorand/go-algorand/6649/workflows/cbf48a17-5877-4409-82a1-e4cab32caae3/jobs/115645
successful rerun on the same commit https://app.circleci.com/pipelines/github/algorand/go-algorand/6649/workflows/7812f37e-038f-4dc6-8145-03a371ba3e64

@brianolson (Author):

The 'race' is that wsPeer.messagesOfInterestGeneration is written by wsPeer.go code and read by wsNetwork.go code.
I don't think this is really a race in practice, but the race detector flags any address that is written by one thread and read by another without a lock or an atomic. Adding an atomic store/load would be the fastest fix, but I'm kinda grumpy about it because I think this is a false alarm from the race detector.

@algorandskiy:

Why? The variable is accessed from two different goroutines without synchronization, and that is a data race, isn't it?

@brianolson (Author):

Currently working on an intermittent deadlock in TestWebsocketNetworkMessageOfInterest().
The production code is fine, but the test code poking into wsNetwork has a race...

@@ -1677,6 +1725,17 @@ func (wn *WebsocketNetwork) OnNetworkAdvance() {
wn.lastNetworkAdvanceMu.Lock()
defer wn.lastNetworkAdvanceMu.Unlock()
wn.lastNetworkAdvance = time.Now().UTC()
if wn.nodeInfo != nil && !wn.relayMessages && !wn.config.ForceFetchTransactions {
// if we're not a relay, and not participating, we don't need txn pool
wantTXGossip := wn.nodeInfo.IsParticipating()
@cce commented Jun 9, 2022:

Looking deeper at OnNetworkAdvance, here it calls nodeInfo.IsParticipating() which contains a call to Ledger.Latest() (which takes a lock on blockQueue.mu.Lock()) and accountManager.HasLiveKeys() (which takes a lock on manager.mu.Lock()).

Before this PR, OnNetworkAdvance used to be very fast: just updating a timestamp. Now it could get stuck for a long time; I know that accountManager.DeleteOldKeys() (which also takes manager.mu.Lock()) can take seconds in some cases.

Is it possible that OnNetworkAdvance could take too long, and what would be the impact if it did? Agreement synchronously waits for OnNetworkAdvance to finish at the end of every round, when calling the node's EnsureBlock/EnsureValidatedBlock methods. This means the performance of OnNetworkAdvance will delay the start of agreement beginning the next round.
