Add start stop e2e test #1705
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1705      +/-   ##
==========================================
+ Coverage   56.03%   56.08%   +0.05%
==========================================
  Files         606      606
  Lines       80293    80333      +40
==========================================
+ Hits        44989    45057      +68
+ Misses      31769    31737      -32
- Partials     3535     3539       +4
Continue to review full report at Codecov.
@@ -210,6 +210,7 @@ type Backend struct {
aWallets atomic.Value
wg sync.WaitGroup
Given the number of properties Backend already has, I'd consider adding a much more descriptive variable name for this waitgroup (and a small comment explaining why it may be needed): closeReadyWG, or something along those lines.
So it looks like we may just be able to remove this wg, since the go-routine it was waiting for doesn't appear to be necessary.
core/blockchain.go
Outdated
@@ -196,6 +196,7 @@ type BlockChain struct {
quit chan struct{} // blockchain quit channel
wg sync.WaitGroup // chain processing wait group for shutting down
wgMu sync.Mutex // chain processing wait group for shutting down
Is this mutex intended to make sure the Wait happens after the first Add? AFAIK sync.WaitGroup is thread safe in its use and there shouldn't be any need to synchronize its access, but I may be missing some edge case.
This mutex ensures that Wait and Add cannot be called concurrently; see golang/go#23842.
Ah, right, it's thread safe if you call Add when the count is already > 0, but if it's exactly 0 then it's not thread safe. Either a broken API or a weird use on our part.
Given that in this particular use case Add should never be called from a different goroutine without the other lock, I think we'd better encapsulate this in a new type (DynamicWaitGroup, or whatever name you can think of) with its own internal mutex, to avoid adding one more field to the blockchain type, and also to make sure that any new code written here cannot accidentally call Add without the lock.
Good suggestion, but I've actually cherry-picked a fix from upstream that was merged during the lifetime of this PR, because I think that will cause fewer problems when merging.
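For reference, a minimal sketch of the kind of wrapper suggested above might look like the following; the name DynamicWaitGroup and its API are purely illustrative, and this is not the approach that was ultimately merged (the upstream ClosableMutex fix was cherry-picked instead):

```go
package syncutil

import "sync"

// DynamicWaitGroup is a hypothetical wrapper around sync.WaitGroup that
// serialises Add and Wait, so that Add can safely race with a concurrent
// Wait even when the counter may be at zero.
type DynamicWaitGroup struct {
	mu     sync.Mutex
	wg     sync.WaitGroup
	closed bool
}

// Add registers one unit of work. It returns false if Wait has already
// started shutting the group down, in which case the caller must not
// start the work.
func (d *DynamicWaitGroup) Add() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.closed {
		return false
	}
	d.wg.Add(1)
	return true
}

// Done marks one unit of work as finished.
func (d *DynamicWaitGroup) Done() {
	d.wg.Done()
}

// Wait blocks further Adds and then waits for in-flight work to finish.
func (d *DynamicWaitGroup) Wait() {
	d.mu.Lock()
	d.closed = true
	d.mu.Unlock()
	d.wg.Wait()
}
```

Callers would only start a goroutine when Add returns true, which is the same refuse-work-after-shutdown idea that the upstream fix expresses through its closable mutex.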
trie/database.go
Outdated
@@ -778,10 +778,15 @@ func (db *Database) Commit(node common.Hash, report bool, callback func(common.H
// commit is the private locked version of Commit.
func (db *Database) commit(hash common.Hash, batch ethdb.Batch, uncacher *cleaner, callback func(common.Hash)) error {
// If the node does not exist, it's a previously committed node
db.lock.RLock()
node, ok := db.dirties[hash]
if !ok {
Isn't this the same as moving the RUnlock call right after the db.dirties[hash] read?
Yes it is :) thank you!
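The simplification being discussed is the standard pattern of holding the read lock only for the map access itself; a self-contained sketch of that pattern with generic names (not the actual trie/database.go code):

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu      sync.RWMutex
	dirties map[string]int
}

// get holds the read lock only for the map access itself, mirroring the
// suggestion of calling RUnlock right after the db.dirties[hash] read
// instead of keeping the lock held across the rest of the function.
func (s *store) get(key string) (int, bool) {
	s.mu.RLock()
	v, ok := s.dirties[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	s := &store{dirties: map[string]int{"a": 1}}
	fmt.Println(s.get("a"))
}
```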
I'm confused about some changes. Are the
Hey @hbandura, they are meant to be part of this PR. I think the commit comments make this clear.
And I would suggest just checking this PR commit by commit for a better understanding of all the individual changes that it contains. We could still split these changes out, if you think that would be better. I generally try to keep changes together with the thing that precipitated them, since that makes it clear why the changes are needed; otherwise you end up with PRs containing changes that don't seem to serve any purpose. That's why I kept all these changes together: they are all needed for this test to work.
Coverage from tests in
coverage: 51.4% of statements across all listed packages
coverage: 64.1% of statements in consensus/istanbul
coverage: 51.7% of statements in consensus/istanbul/backend
coverage: 0.0% of statements in consensus/istanbul/backend/backendtest
coverage: 50.0% of statements in consensus/istanbul/backend/internal/db
coverage: 46.0% of statements in consensus/istanbul/backend/internal/enodes
coverage: 24.3% of statements in consensus/istanbul/backend/internal/replica
coverage: 70.8% of statements in consensus/istanbul/core
coverage: 0.0% of statements in consensus/istanbul/proxy
coverage: 75.3% of statements in consensus/istanbul/uptime
coverage: 100.0% of statements in consensus/istanbul/uptime/store
coverage: 51.8% of statements in consensus/istanbul/validator
coverage: 79.2% of statements in consensus/istanbul/validator/random
CommentID: 9bd7c06e8e
@@ -549,7 +551,11 @@ func (sb *Backend) Commit(proposal istanbul.Proposal, aggregatedSeal types.Istan
return err
}
}
go sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
sb.wg.Add(1)
Doesn't this waitgroup face the same problem with the concurrent Add? Or are Commit and Close both called from the same thread?
Well spotted, this is not safe. I'm thinking of just removing the use of the goroutine. It doesn't seem to affect the tests and it makes the code simpler.
Approving it, but with some comments; feel free to address them.
}

// Close locks the mutex, then closes it.
func (cm *ClosableMutex) Close() {
Should Close work if the mutex is locked?
Hey @hernan, this mutex was added by the go-ethereum team; it's part of the fix I cherry-picked from upstream.
So they would know best, but my understanding is that it should wait to acquire the lock before closing the channel, since closing the channel prior to obtaining the lock could result in calls to Unlock panicking.
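For readers unfamiliar with the construct, here is a sketch of how a channel-based closable mutex can implement those semantics, with Close first acquiring the lock (receiving the token) and only then closing the channel. This is an illustration of the behaviour described above, not a verbatim copy of the upstream code:

```go
package syncutil

// ClosableMutex is a mutex that can be closed once; after Close it can
// never be taken again. The lock is represented by a one-element channel
// holding a single token.
type ClosableMutex struct {
	ch chan struct{}
}

func NewClosableMutex() *ClosableMutex {
	ch := make(chan struct{}, 1)
	ch <- struct{}{} // the token starts out available
	return &ClosableMutex{ch}
}

// TryLock acquires the mutex, returning false if it has been closed.
func (cm *ClosableMutex) TryLock() bool {
	_, ok := <-cm.ch
	return ok
}

// Unlock returns the token, making the mutex available again.
func (cm *ClosableMutex) Unlock() {
	select {
	case cm.ch <- struct{}{}:
	default:
		panic("Unlock of already-unlocked ClosableMutex")
	}
}

// Close waits until the mutex is free (it receives the token), then closes
// the channel. Acquiring the token first means no other goroutine still
// holds the lock, so no later Unlock can hit a closed channel.
func (cm *ClosableMutex) Close() {
	_, ok := <-cm.ch
	if !ok {
		panic("Close of already-closed ClosableMutex")
	}
	close(cm.ch)
}
```

With this design, TryLock returns false once Close has run, which is how the cherry-picked fix can refuse new chain modifications after shutdown.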
@@ -0,0 +1,64 @@
// Copyright 2021 The go-ethereum Authors
"The Celo Authors"
This is the fix cherry-picked from go-ethereum.
@@ -549,8 +549,7 @@ func (sb *Backend) Commit(proposal istanbul.Proposal, aggregatedSeal types.Istan
return err
}
}
go sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
Is running this in a goroutine no longer necessary due to the other synchronization added in the worker?
I don't know if it ever was necessary. All I did was remove the goroutine and nothing broke.
@@ -351,6 +357,8 @@ func (w *worker) mainLoop() {
var cancel context.CancelFunc
var wg sync.WaitGroup

// Ensure that block construction is complete before exiting this function.
expand on why block construction is needed before exiting, if possible
I think what I meant there was just that we wait for all started goroutines to finish before exiting, and it's the process of block construction that is starting those goroutines; it's not actually important that block construction has finished, just that all started goroutines have finished.
So what about:
// Ensure that all started go-routines finish before exiting this function.
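Schematically, the pattern being documented by that comment looks like this; a simplified, self-contained sketch of a loop that launches per-task goroutines and waits for all of them before returning (not the actual miner worker code):

```go
package main

import (
	"fmt"
	"sync"
)

func mainLoop(tasks <-chan int) {
	var wg sync.WaitGroup
	// Ensure that all started goroutines finish before exiting this function.
	defer wg.Wait()

	for task := range tasks {
		wg.Add(1) // Add runs in the loop goroutine, before the deferred Wait can run
		go func(t int) {
			defer wg.Done()
			fmt.Println("processing task", t)
		}(task)
	}
}

func main() {
	tasks := make(chan int, 3)
	for i := 0; i < 3; i++ {
		tasks <- i
	}
	close(tasks)
	mainLoop(tasks)
}
```

Because Add is called from the same goroutine that later runs the deferred Wait, the "Add before Wait" ordering mentioned in the commits below is guaranteed.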
In order to run small-scale e2e tests reliably we need to be able to let nodes sync from just one other node. This commit allows the minimum sync peers to be configured, where previously the value was hardcoded to 5.
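The shape of such a change is a config field with a fallback to the old hardcoded value; a small sketch with hypothetical names (the exact field and default used in the PR may differ):

```go
package config

// Config holds sync-related settings; MinSyncPeers is illustrative only.
type Config struct {
	// MinSyncPeers is the minimum number of peers required before the node
	// starts syncing. Previously this was a hardcoded constant of 5.
	MinSyncPeers int
}

// minSyncPeers applies the historical default when the field is unset,
// so existing callers keep their old behaviour.
func (c *Config) minSyncPeers() int {
	if c.MinSyncPeers <= 0 {
		return 5
	}
	return c.MinSyncPeers
}
```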
Now dialHistoryExpiration and inboundThrottleTime can be configured via config. This allows for the writing of fast tests.
Ensure that Add is called before Wait.
backend.Backend.Commit was starting a goroutine but not waiting for it to complete. This commit adds a wait group to wait for the goroutine to complete when backend.Backend is closed.
Also make miner wait for goroutines to end at Shutdown.
Ensure Add is called before Wait
core.current was being read via calls to core.ParentCommits from the worker's main loop, and written via calls to core.Stop coming from the miner's update loop. The fix is to add a mutex to synchronise access to current.

==================
WARNING: DATA RACE
Write at 0x00c00114dd98 by goroutine 329:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:67 +0x8e
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb

Previous read at 0x00c00114dd98 by goroutine 471:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).ParentCommits()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:204 +0x4a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal.func1()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:989 +0x38a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1019 +0x3ab
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).Prepare()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:404 +0x404
  github.com/celo-org/celo-blockchain/miner.prepareBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/block.go:89 +0x62f
  github.com/celo-org/celo-blockchain/miner.(*worker).constructAndSubmitNewBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:242 +0x84
  github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.1()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:379 +0x5c
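The fix is the usual pattern of guarding a shared field with a mutex taken by both the reader and the writer; a minimal, self-contained sketch with illustrative names (not the actual core types):

```go
package main

import (
	"fmt"
	"sync"
)

type roundState struct{ sequence uint64 }

type engine struct {
	mu      sync.RWMutex
	current *roundState // shared between the worker loop (reads) and Stop (writes)
}

// ParentCommits-style reader: takes the read lock around the field access.
func (e *engine) currentState() *roundState {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.current
}

// Stop-style writer: takes the write lock before clearing the field.
func (e *engine) stop() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.current = nil
}

func main() {
	e := &engine{current: &roundState{sequence: 1}}
	go e.stop()
	if st := e.currentState(); st != nil {
		fmt.Println("sequence:", st.sequence)
	}
}
```

The remaining race fixes in this PR follow the same recipe: take an existing or newly added mutex around the offending read or write (resendRoundChangeMessageTimer, CurrentView, Downloader.ancientLimit, Peer purposes, announceRunning).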
core.resetResendRoundChangeTimer was being called from the main engine go-routine, and core.stopResendRoundChangeTimer was being called via core.Stop from the miner's update go-routine. Both accessed resendRoundChangeMessageTimer without synchronisation. This commit adds a mutex to ensure synchronised access to resendRoundChangeMessageTimer.

==================
WARNING: DATA RACE
Read at 0x00c01377f568 by goroutine 72:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).stopResendRoundChangeTimer()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:710 +0x47
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).stopAllTimers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:719 +0x54
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:61 +0x44
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb

Previous write at 0x00c01377f568 by goroutine 312:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).resetResendRoundChangeTimer()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:780 +0x368
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).resendRoundChangeMessage()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:795 +0x7b
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleResendRoundChangeEvent()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:247 +0x353
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleEvents()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:143 +0x77e
The current round state of the engine was being read without synchronisation via core.CurrentView by the miner; the current round state was also being written in the engine's stop function. The fix is to use the pre-existing mutex to synchronise accesses through CurrentView.

==================
WARNING: DATA RACE
Read at 0x00c000bb20b8 by goroutine 506:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).CurrentView()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:195 +0x4a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.waitCoreToReachSequence()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1203 +0x24c
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal.func1()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:982 +0x130
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1019 +0x3ab
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).Prepare()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:404 +0x404
  github.com/celo-org/celo-blockchain/miner.prepareBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/block.go:89 +0x62f
  github.com/celo-org/celo-blockchain/miner.(*worker).constructAndSubmitNewBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:242 +0x84
  github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.1()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:379 +0x5c

Previous write at 0x00c000bb20b8 by goroutine 93:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:67 +0x8e
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb
The eth.chainSyncer main loop periodically calls eth.chainSyncer.startSync, which fires off a sync operation in its own goroutine that will eventually result in Downloader.Cancel being called. The eth.chainSyncer also calls Downloader.Cancel when exiting its main loop. This resulted in a data race on Downloader.ancientLimit; the solution was to add a mutex to lock over Downloader.ancientLimit in Downloader.Cancel.

==================
WARNING: DATA RACE
Write at 0x00c01a0e3ce0 by goroutine 466:
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Cancel()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:615 +0x68
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).synchronise()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:444 +0x601
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Synchronise()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:354 +0xb5
  github.com/celo-org/celo-blockchain/eth.(*ProtocolManager).doSync()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:327 +0x1ad
  github.com/celo-org/celo-blockchain/eth.(*chainSyncer).startSync.func1()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:303 +0x59

Previous write at 0x00c01a0e3ce0 by goroutine 182:
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Cancel()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:615 +0x68
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Terminate()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:635 +0xcc
  github.com/celo-org/celo-blockchain/eth.(*chainSyncer).loop()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:232 +0x65b
p2p.PurposeFlag is not threadsafe but was being accessed from many go-routines without synchronisation via p2p.Peer.HasPurpose. The solution was to lock over the access in HasPurpose using the mutex that already existed to synchronise access to a peer's purposes flag.

==================
WARNING: DATA RACE
Read at 0x00c001520680 by goroutine 291:
  github.com/celo-org/celo-blockchain/p2p.(*Peer).HasPurpose()
      /home/pierspowlesland/projects/celo-blockchain/p2p/peer.go:167 +0x5fd
  github.com/celo-org/celo-blockchain/eth.(*peer).PurposeIsSet()
      /home/pierspowlesland/projects/celo-blockchain/eth/peer.go:696 +0x516
  github.com/celo-org/celo-blockchain/eth.(*ProtocolManager).FindPeers()
      /home/pierspowlesland/projects/celo-blockchain/eth/handler.go:1055 +0x612
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*validatorPeerHandler).ReplaceValidatorPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/peer_handler.go:147 +0x201
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend/internal/enodes.(*ValidatorEnodeDB).RefreshValPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/internal/enodes/val_enode_db.go:480 +0x641
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).RefreshValPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/backend.go:846 +0x264
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StartValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:691 +0x3a5
  github.com/celo-org/celo-blockchain/miner.(*worker).start()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:210 +0x1a2
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:161 +0x6a4

Previous write at 0x00c001520680 by goroutine 343:
  github.com/celo-org/celo-blockchain/p2p.(*Peer).AddPurpose()
      /home/pierspowlesland/projects/celo-blockchain/p2p/peer.go:148 +0xd5
  github.com/celo-org/celo-blockchain/p2p.(*Server).run.func3()
      /home/pierspowlesland/projects/celo-blockchain/p2p/server.go:866 +0x2d1
  github.com/celo-org/celo-blockchain/p2p.(*Server).run()
      /home/pierspowlesland/projects/celo-blockchain/p2p/server.go:916 +0x241a
announceRunning is set in istanbul.Backend.StartAnnouncing and istanbul.Backend.StopAnnouncing, and also was being accessed without synchronisation from within istanbul.Backend.newChainHead, which is executed repeatedly from within its own go-routine. This commit read-locks over announceRunning in istanbul.Backend.newChainHead to avoid a data race. Unfortunately we can't see both sides of this data race; it seems the stack was corrupted somehow, since the first stack trace cannot be correct (how can fmt.Fscanf call TestStartStopValidators?). I've shortened the stack with an ellipsis because it was also very long.

==================
WARNING: DATA RACE
Write at 0x00c001448520 by goroutine 354:
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopAnnouncing()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:749 +0x12b
  github.com/celo-org/celo-blockchain/eth.(*Ethereum).stopAnnounce()
      /home/pierspowlesland/projects/celo-blockchain/eth/backend.go:507 +0x2ad
  github.com/celo-org/celo-blockchain/eth.(*Ethereum).Stop()
      /home/pierspowlesland/projects/celo-blockchain/eth/backend.go:575 +0x2ae
  github.com/celo-org/celo-blockchain/node.(*Node).stopServices()
      /home/pierspowlesland/projects/celo-blockchain/node/node.go:309 +0x150
  github.com/celo-org/celo-blockchain/node.(*Node).Close()
      /home/pierspowlesland/projects/celo-blockchain/node/node.go:221 +0x1bb
  github.com/celo-org/celo-blockchain/test.(*Node).Close()
      /home/pierspowlesland/projects/celo-blockchain/test/node.go:312 +0x3b8
  github.com/celo-org/celo-blockchain/test.Network.Shutdown()
      /home/pierspowlesland/projects/celo-blockchain/test/node.go:498 +0x9a
  runtime.call32()
      /usr/local/go/src/runtime/asm_amd64.s:551 +0x3d
  testing.(*T).FailNow()
      <autogenerated>:1 +0x44
  github.com/stretchr/testify/require.NoError()
      /home/pierspowlesland/go/pkg/mod/github.com/stretchr/[email protected]/require/require.go:974 +0x104
  github.com/celo-org/celo-blockchain/e2e_test_test.TestStartStopValidators()
      /home/pierspowlesland/projects/celo-blockchain/e2e_test/e2e_test.go:168 +0x15ce
  fmt.Fscanf()
      /usr/local/go/src/fmt/scan.go:143 +0xee
  fmt.Sscanf()
      /usr/local/go/src/fmt/scan.go:114 +0x191
  github.com/syndtr/goleveldb/leveldb/storage.fsParseName()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/storage/file_storage.go:643 +0xa6
  github.com/syndtr/goleveldb/leveldb/storage.(*fileStorage).List()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/storage/file_storage.go:458 +0x309
  github.com/syndtr/goleveldb/leveldb.(*DB).checkAndCleanFiles()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db_util.go:52 +0x301
  github.com/syndtr/goleveldb/leveldb.openDB()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db.go:136 +0x9bb
  fmt.(*ss).doScanf()
      /usr/local/go/src/fmt/scan.go:1230 +0x411
  fmt.Fscanf()
      ...

Previous read at 0x00c001448520 by goroutine 203:
  [failed to restore the stack]
Removing the go-routine does not seem to have any effect on the tests.
It seems that very occasionally we are hitting this error condition and it's not clear why.
…2853) This change removes misuses of sync.WaitGroup in BlockChain. Before this change, block insertion modified the WaitGroup counter in order to ensure that Stop would wait for pending operations to complete. This was racy and could even lead to crashes if Stop was called at an unfortunate time. The issue is resolved by adding a specialized 'closable' mutex, which prevents chain modifications after stopping while also synchronizing writers with each other. Co-authored-by: Felix Lange <[email protected]>
The start stop test would sometimes time out while nodes were stopped, and the Network.Shutdown method would print errors relating to trying to stop a node twice. The shutdown mechanism has been updated to allow for more flexibility in deciding which errors are printed, allowing us to avoid printing double-shutdown errors when we have manually stopped nodes.
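One way to get that flexibility is to let the shutdown path filter which errors it reports; a self-contained sketch of the idea with hypothetical helper names (not the test package's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyStopped = errors.New("node already stopped")

type node struct{ stopped bool }

func (n *node) close() error {
	if n.stopped {
		return errAlreadyStopped
	}
	n.stopped = true
	return nil
}

// shutdown closes every node and reports only the errors the caller cares
// about, so an intentional double stop does not produce noise.
func shutdown(nodes []*node, ignore func(error) bool) {
	for i, n := range nodes {
		if err := n.close(); err != nil && !ignore(err) {
			fmt.Printf("error shutting down node %d: %v\n", i, err)
		}
	}
}

func main() {
	nodes := []*node{{}, {stopped: true}} // the second node was stopped manually by the test
	shutdown(nodes, func(err error) bool {
		return errors.Is(err, errAlreadyStopped)
	})
}
```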
Description
This PR adds a test that starts and stops nodes in a small network and verifies that the network can continue to function correctly throughout the process.
A quick overview of what the test does:
Other changes
This PR makes minSyncPeers, dialHistoryExpiration and inboundThrottleTime configurable; in order for this test to run (and run quickly) these values needed to be set, and they were previously hardcoded.
This PR also ensures correct shutdown of the miner, worker and backend.Backend, and fixes a number of data races/race conditions. Each fix is described in the associated commit message; it's probably easier to review this PR commit by commit, since these changes otherwise look unrelated.
I think this PR would be best merged rather than 'squashed and merged', due to the large number of commits containing unrelated fixes.
Upstream PRs that came out of this work
Tested
I managed to run this test 200 times on my local machine with no failures.
Related issues
#1642
Backwards compatibility
Yes