Add start stop e2e test #1705
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1705      +/-   ##
==========================================
+ Coverage   56.03%   56.08%   +0.05%
==========================================
  Files         606      606
  Lines       80293    80333      +40
==========================================
+ Hits        44989    45057      +68
+ Misses      31769    31737      -32
- Partials     3535     3539       +4
Continue to review full report at Codecov.
@@ -210,6 +210,7 @@ type Backend struct {
aWallets atomic.Value
wg sync.WaitGroup
Given the number of properties Backend already has, I'd consider adding a much more descriptive variable name for this waitgroup (and a small comment explaining why it may be needed): closeReadyWG, or something along those lines.
So it looks like we may just be able to remove this wg, since the go-routine it was waiting for doesn't appear to be necessary.
core/blockchain.go
Outdated
@@ -196,6 +196,7 @@ type BlockChain struct {
quit chan struct{} // blockchain quit channel
wg sync.WaitGroup // chain processing wait group for shutting down
wgMu sync.Mutex // chain processing wait group for shutting down
Is this mutex intended to make sure the Wait happens after the first Add? AFAIK sync.WaitGroup is thread safe in its use and there shouldn't be any need to synchronize its access, but I may be missing some edge case.
This mutex ensures that Wait and Add cannot be called concurrently; see golang/go#23842.
Ah, right, it's thread safe if you call Add when the count is already > 0, but if it's exactly 0 then it's not thread safe. Either a broken API or a weird use on our part.
Given that in this particular use case Add should never be called from a different goroutine without the other lock, I think we'd better encapsulate this in a new type (DynamicWaitGroup, or whatever name you can think of) with its own internal mutex, to avoid adding one more field to the blockchain type, and also to make sure that any new code written here cannot accidentally call Add without the lock.
Good suggestion, but I've actually cherry-picked a fix from upstream that was merged during the lifetime of this PR, because I think that will cause fewer problems when merging.
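For reference, a minimal sketch of the kind of wrapper suggested above might look like the following; the name DynamicWaitGroup and its API are purely illustrative, and this is not the approach that was ultimately merged (the upstream ClosableMutex fix was cherry-picked instead):

```go
package syncutil

import "sync"

// DynamicWaitGroup is a hypothetical wrapper around sync.WaitGroup that
// serialises Add and Wait, so that Add can safely race with a concurrent
// Wait even when the counter may be at zero.
type DynamicWaitGroup struct {
	mu     sync.Mutex
	wg     sync.WaitGroup
	closed bool
}

// Add registers one unit of work. It returns false if Wait has already
// started shutting the group down, in which case the caller must not
// start the work.
func (d *DynamicWaitGroup) Add() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.closed {
		return false
	}
	d.wg.Add(1)
	return true
}

// Done marks one unit of work as finished.
func (d *DynamicWaitGroup) Done() {
	d.wg.Done()
}

// Wait blocks further Adds and then waits for in-flight work to finish.
func (d *DynamicWaitGroup) Wait() {
	d.mu.Lock()
	d.closed = true
	d.mu.Unlock()
	d.wg.Wait()
}
```

Callers would only start a goroutine when Add returns true, which is the same refuse-work-after-shutdown idea that the upstream fix expresses through its closable mutex.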
trie/database.go
Outdated
@@ -778,10 +778,15 @@ func (db *Database) Commit(node common.Hash, report bool, callback func(common.H
// commit is the private locked version of Commit.
func (db *Database) commit(hash common.Hash, batch ethdb.Batch, uncacher *cleaner, callback func(common.Hash)) error {
// If the node does not exist, it's a previously committed node
db.lock.RLock()
node, ok := db.dirties[hash]
if !ok {
Isn't this the same as moving the RUnlock call right after the db.dirties[hash] read?
Yes it is :) thank you!
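The simplification being discussed is the standard pattern of holding the read lock only for the map access itself; a self-contained sketch of that pattern with generic names (not the actual trie/database.go code):

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu      sync.RWMutex
	dirties map[string]int
}

// get holds the read lock only for the map access itself, mirroring the
// suggestion of calling RUnlock right after the db.dirties[hash] read
// instead of keeping the lock held across the rest of the function.
func (s *store) get(key string) (int, bool) {
	s.mu.RLock()
	v, ok := s.dirties[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	s := &store{dirties: map[string]int{"a": 1}}
	fmt.Println(s.get("a"))
}
```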
I'm confused about some changes. Are the
Hey @hbandura, they are meant to be part of this PR. I think the commit comments make this clear.
And I would suggest just checking this PR commit by commit for a better understanding of all the individual changes that it contains. We could still split these changes out, if you think that would be better. I generally try to keep changes together with the thing that precipitated them, since that makes it clear why the changes are needed; otherwise you end up with PRs containing changes that don't seem to serve any purpose. That's why I kept all these changes together: they are all needed for this test to work.
Coverage from tests in
coverage: 51.4% of statements across all listed packages
coverage: 64.1% of statements in consensus/istanbul
coverage: 51.7% of statements in consensus/istanbul/backend
coverage: 0.0% of statements in consensus/istanbul/backend/backendtest
coverage: 50.0% of statements in consensus/istanbul/backend/internal/db
coverage: 46.0% of statements in consensus/istanbul/backend/internal/enodes
coverage: 24.3% of statements in consensus/istanbul/backend/internal/replica
coverage: 70.8% of statements in consensus/istanbul/core
coverage: 0.0% of statements in consensus/istanbul/proxy
coverage: 75.3% of statements in consensus/istanbul/uptime
coverage: 100.0% of statements in consensus/istanbul/uptime/store
coverage: 51.8% of statements in consensus/istanbul/validator
coverage: 79.2% of statements in consensus/istanbul/validator/random
CommentID: 9bd7c06e8e
@@ -549,7 +551,11 @@ func (sb *Backend) Commit(proposal istanbul.Proposal, aggregatedSeal types.Istan
return err
}
}
go sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
sb.wg.Add(1)
Doesn't this waitgroup face the same problem with the concurrent Add? Or are Commit and Close both called from the same thread?
Well spotted, this is not safe. I'm thinking of just removing the use of the goroutine. It doesn't seem to affect the tests and it makes the code simpler.
Approving it, but with some comments; feel free to address them.
}

// Close locks the mutex, then closes it.
func (cm *ClosableMutex) Close() {
Should Close work if the mutex is locked?
Hey @hernan, this mutex was added by the go-ethereum team; it's part of the fix I cherry-picked from upstream.
So they would know best, but my understanding is that it should wait to acquire the lock before closing the channel, since closing the channel prior to obtaining the lock could result in calls to Unlock panicking.
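For readers unfamiliar with the construct, here is a sketch of how a channel-based closable mutex can implement those semantics, with Close first acquiring the lock (receiving the token) and only then closing the channel. This is an illustration of the behaviour described above, not a verbatim copy of the upstream code:

```go
package syncutil

// ClosableMutex is a mutex that can be closed once; after Close it can
// never be taken again. The lock is represented by a one-element channel
// holding a single token.
type ClosableMutex struct {
	ch chan struct{}
}

func NewClosableMutex() *ClosableMutex {
	ch := make(chan struct{}, 1)
	ch <- struct{}{} // the token starts out available
	return &ClosableMutex{ch}
}

// TryLock acquires the mutex, returning false if it has been closed.
func (cm *ClosableMutex) TryLock() bool {
	_, ok := <-cm.ch
	return ok
}

// Unlock returns the token, making the mutex available again.
func (cm *ClosableMutex) Unlock() {
	select {
	case cm.ch <- struct{}{}:
	default:
		panic("Unlock of already-unlocked ClosableMutex")
	}
}

// Close waits until the mutex is free (it receives the token), then closes
// the channel. Acquiring the token first means no other goroutine still
// holds the lock, so no later Unlock can hit a closed channel.
func (cm *ClosableMutex) Close() {
	_, ok := <-cm.ch
	if !ok {
		panic("Close of already-closed ClosableMutex")
	}
	close(cm.ch)
}
```

With this design, TryLock returns false once Close has run, which is how the cherry-picked fix can refuse new chain modifications after shutdown.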
@@ -0,0 +1,64 @@
// Copyright 2021 The go-ethereum Authors
"The Celo Authors"
This is the fix cherry-picked from go-ethereum.
@@ -549,8 +549,7 @@ func (sb *Backend) Commit(proposal istanbul.Proposal, aggregatedSeal types.Istan
return err
}
}
go sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
sb.onNewConsensusBlock(block, result.Receipts, result.Logs, result.State)
Is running this in a goroutine no longer necessary due to the other synchronization added in the worker?
I don't know if it ever was necessary. All I did was remove the goroutine and nothing broke.
@@ -351,6 +357,8 @@ func (w *worker) mainLoop() {
var cancel context.CancelFunc
var wg sync.WaitGroup

// Ensure that block construction is complete before exiting this function.
expand on why block construction is needed before exiting, if possible
I think what I meant there was just that we wait for all started goroutines to finish before exiting, and it's the process of block construction that is starting those goroutines; it's not actually important that block construction has finished, just that all started goroutines have finished.
So what about:
// Ensure that all started go-routines finish before exiting this function.
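Schematically, the pattern being documented by that comment looks like this; a simplified, self-contained sketch of a loop that launches per-task goroutines and waits for all of them before returning (not the actual miner worker code):

```go
package main

import (
	"fmt"
	"sync"
)

func mainLoop(tasks <-chan int) {
	var wg sync.WaitGroup
	// Ensure that all started goroutines finish before exiting this function.
	defer wg.Wait()

	for task := range tasks {
		wg.Add(1) // Add runs in the loop goroutine, before the deferred Wait can run
		go func(t int) {
			defer wg.Done()
			fmt.Println("processing task", t)
		}(task)
	}
}

func main() {
	tasks := make(chan int, 3)
	for i := 0; i < 3; i++ {
		tasks <- i
	}
	close(tasks)
	mainLoop(tasks)
}
```

Because Add is called from the same goroutine that later runs the deferred Wait, the "Add before Wait" ordering mentioned in the commits below is guaranteed.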
In order to run small-scale e2e tests reliably we need to be able to let nodes sync from just one other node. This commit allows the minimum sync peers to be configured, where previously the value was hardcoded to 5.
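The shape of such a change is a config field with a fallback to the old hardcoded value; a small sketch with hypothetical names (the exact field and default used in the PR may differ):

```go
package config

// Config holds sync-related settings; MinSyncPeers is illustrative only.
type Config struct {
	// MinSyncPeers is the minimum number of peers required before the node
	// starts syncing. Previously this was a hardcoded constant of 5.
	MinSyncPeers int
}

// minSyncPeers applies the historical default when the field is unset,
// so existing callers keep their old behaviour.
func (c *Config) minSyncPeers() int {
	if c.MinSyncPeers <= 0 {
		return 5
	}
	return c.MinSyncPeers
}
```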
Now dialHistoryExpiration and inboundThrottleTime can be configured via config. This allows for the writing of fast tests.
Ensure that Add is called before Wait.
backend.Backend.Commit was starting a goroutine but not waiting for it to complete. This commit adds a wait group to wait for the goroutine to complete when backend.Backend is closed.
Also make miner wait for goroutines to end at Shutdown.
Ensure Add is called before Wait
core.current was being read via calls to core.ParentCommits from the worker's main loop, and written via calls to core.Stop coming from the miner's update loop. The fix is to add a mutex to synchronise access to current.

==================
WARNING: DATA RACE
Write at 0x00c00114dd98 by goroutine 329:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:67 +0x8e
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb

Previous read at 0x00c00114dd98 by goroutine 471:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).ParentCommits()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:204 +0x4a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal.func1()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:989 +0x38a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1019 +0x3ab
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).Prepare()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:404 +0x404
  github.com/celo-org/celo-blockchain/miner.prepareBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/block.go:89 +0x62f
  github.com/celo-org/celo-blockchain/miner.(*worker).constructAndSubmitNewBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:242 +0x84
  github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.1()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:379 +0x5c
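The fix is the usual pattern of guarding a shared field with a mutex taken by both the reader and the writer; a minimal, self-contained sketch with illustrative names (not the actual core types):

```go
package main

import (
	"fmt"
	"sync"
)

type roundState struct{ sequence uint64 }

type engine struct {
	mu      sync.RWMutex
	current *roundState // shared between the worker loop (reads) and Stop (writes)
}

// ParentCommits-style reader: takes the read lock around the field access.
func (e *engine) currentState() *roundState {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.current
}

// Stop-style writer: takes the write lock before clearing the field.
func (e *engine) stop() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.current = nil
}

func main() {
	e := &engine{current: &roundState{sequence: 1}}
	go e.stop()
	if st := e.currentState(); st != nil {
		fmt.Println("sequence:", st.sequence)
	}
}
```

The remaining race fixes in this PR follow the same recipe: take an existing or newly added mutex around the offending read or write (resendRoundChangeMessageTimer, CurrentView, Downloader.ancientLimit, Peer purposes, announceRunning).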
core.resetResendRoundChangeTimer was being called from the main engine go-routine, and core.stopResendRoundChangeTimer was being called via core.Stop from the miner's update go-routine. Both accessed resendRoundChangeMessageTimer without synchronisation. This commit adds a mutex to ensure synchronised access to resendRoundChangeMessageTimer.

==================
WARNING: DATA RACE
Read at 0x00c01377f568 by goroutine 72:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).stopResendRoundChangeTimer()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:710 +0x47
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).stopAllTimers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:719 +0x54
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:61 +0x44
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb

Previous write at 0x00c01377f568 by goroutine 312:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).resetResendRoundChangeTimer()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:780 +0x368
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).resendRoundChangeMessage()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:795 +0x7b
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleResendRoundChangeEvent()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:247 +0x353
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).handleEvents()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:143 +0x77e
The current round state of the engine was being read without synchronisation via core.CurrentView by the miner; the current round state was also being written in the engine's stop function. The fix is to use the pre-existing mutex to synchronise accesses through CurrentView.

==================
WARNING: DATA RACE
Read at 0x00c000bb20b8 by goroutine 506:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).CurrentView()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/core.go:195 +0x4a
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.waitCoreToReachSequence()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1203 +0x24c
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal.func1()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:982 +0x130
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).addParentSeal()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:1019 +0x3ab
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).Prepare()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:404 +0x404
  github.com/celo-org/celo-blockchain/miner.prepareBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/block.go:89 +0x62f
  github.com/celo-org/celo-blockchain/miner.(*worker).constructAndSubmitNewBlock()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:242 +0x84
  github.com/celo-org/celo-blockchain/miner.(*worker).mainLoop.func1.1()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:379 +0x5c

Previous write at 0x00c000bb20b8 by goroutine 93:
  github.com/celo-org/celo-blockchain/consensus/istanbul/core.(*core).Stop()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/core/handler.go:67 +0x8e
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:707 +0x193
  github.com/celo-org/celo-blockchain/miner.(*worker).stop()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:220 +0xbb
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:173 +0x6fb
The eth.chainSyncer main loop periodically calls eth.chainSyncer.startSync, which fires off a sync operation in its own goroutine that will eventually result in Downloader.Cancel being called. The eth.chainSyncer also calls Downloader.Cancel when exiting its main loop. This resulted in a data race on Downloader.ancientLimit; the solution was to add a mutex to lock over Downloader.ancientLimit in Downloader.Cancel.

==================
WARNING: DATA RACE
Write at 0x00c01a0e3ce0 by goroutine 466:
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Cancel()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:615 +0x68
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).synchronise()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:444 +0x601
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Synchronise()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:354 +0xb5
  github.com/celo-org/celo-blockchain/eth.(*ProtocolManager).doSync()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:327 +0x1ad
  github.com/celo-org/celo-blockchain/eth.(*chainSyncer).startSync.func1()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:303 +0x59

Previous write at 0x00c01a0e3ce0 by goroutine 182:
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Cancel()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:615 +0x68
  github.com/celo-org/celo-blockchain/eth/downloader.(*Downloader).Terminate()
      /home/pierspowlesland/projects/celo-blockchain/eth/downloader/downloader.go:635 +0xcc
  github.com/celo-org/celo-blockchain/eth.(*chainSyncer).loop()
      /home/pierspowlesland/projects/celo-blockchain/eth/sync.go:232 +0x65b
p2p.PurposeFlag is not threadsafe but was being accessed from many go-routines without synchronisation via p2p.Peer.HasPurpose. The solution was to lock over the access in HasPurpose using the mutex that already existed to synchronise access to a peer's purposes flag.

==================
WARNING: DATA RACE
Read at 0x00c001520680 by goroutine 291:
  github.com/celo-org/celo-blockchain/p2p.(*Peer).HasPurpose()
      /home/pierspowlesland/projects/celo-blockchain/p2p/peer.go:167 +0x5fd
  github.com/celo-org/celo-blockchain/eth.(*peer).PurposeIsSet()
      /home/pierspowlesland/projects/celo-blockchain/eth/peer.go:696 +0x516
  github.com/celo-org/celo-blockchain/eth.(*ProtocolManager).FindPeers()
      /home/pierspowlesland/projects/celo-blockchain/eth/handler.go:1055 +0x612
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*validatorPeerHandler).ReplaceValidatorPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/peer_handler.go:147 +0x201
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend/internal/enodes.(*ValidatorEnodeDB).RefreshValPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/internal/enodes/val_enode_db.go:480 +0x641
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).RefreshValPeers()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/backend.go:846 +0x264
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StartValidating()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:691 +0x3a5
  github.com/celo-org/celo-blockchain/miner.(*worker).start()
      /home/pierspowlesland/projects/celo-blockchain/miner/worker.go:210 +0x1a2
  github.com/celo-org/celo-blockchain/miner.(*Miner).update()
      /home/pierspowlesland/projects/celo-blockchain/miner/miner.go:161 +0x6a4

Previous write at 0x00c001520680 by goroutine 343:
  github.com/celo-org/celo-blockchain/p2p.(*Peer).AddPurpose()
      /home/pierspowlesland/projects/celo-blockchain/p2p/peer.go:148 +0xd5
  github.com/celo-org/celo-blockchain/p2p.(*Server).run.func3()
      /home/pierspowlesland/projects/celo-blockchain/p2p/server.go:866 +0x2d1
  github.com/celo-org/celo-blockchain/p2p.(*Server).run()
      /home/pierspowlesland/projects/celo-blockchain/p2p/server.go:916 +0x241a
announceRunning is set in istanbul.Backend.StartAnnouncing and istanbul.Backend.StopAnnouncing, and also was being accessed without synchronisation from within istanbul.Backend.newChainHead, which is executed repeatedly from within its own go-routine. This commit read-locks over announceRunning in istanbul.Backend.newChainHead to avoid a data race. Unfortunately we can't see both sides of this data race; it seems the stack was corrupted somehow, since the first stack trace cannot be correct (how can fmt.Fscanf call TestStartStopValidators?). I've shortened the stack with an ellipsis because it was also very long.

==================
WARNING: DATA RACE
Write at 0x00c001448520 by goroutine 354:
  github.com/celo-org/celo-blockchain/consensus/istanbul/backend.(*Backend).StopAnnouncing()
      /home/pierspowlesland/projects/celo-blockchain/consensus/istanbul/backend/engine.go:749 +0x12b
  github.com/celo-org/celo-blockchain/eth.(*Ethereum).stopAnnounce()
      /home/pierspowlesland/projects/celo-blockchain/eth/backend.go:507 +0x2ad
  github.com/celo-org/celo-blockchain/eth.(*Ethereum).Stop()
      /home/pierspowlesland/projects/celo-blockchain/eth/backend.go:575 +0x2ae
  github.com/celo-org/celo-blockchain/node.(*Node).stopServices()
      /home/pierspowlesland/projects/celo-blockchain/node/node.go:309 +0x150
  github.com/celo-org/celo-blockchain/node.(*Node).Close()
      /home/pierspowlesland/projects/celo-blockchain/node/node.go:221 +0x1bb
  github.com/celo-org/celo-blockchain/test.(*Node).Close()
      /home/pierspowlesland/projects/celo-blockchain/test/node.go:312 +0x3b8
  github.com/celo-org/celo-blockchain/test.Network.Shutdown()
      /home/pierspowlesland/projects/celo-blockchain/test/node.go:498 +0x9a
  runtime.call32()
      /usr/local/go/src/runtime/asm_amd64.s:551 +0x3d
  testing.(*T).FailNow()
      <autogenerated>:1 +0x44
  github.com/stretchr/testify/require.NoError()
      /home/pierspowlesland/go/pkg/mod/github.com/stretchr/[email protected]/require/require.go:974 +0x104
  github.com/celo-org/celo-blockchain/e2e_test_test.TestStartStopValidators()
      /home/pierspowlesland/projects/celo-blockchain/e2e_test/e2e_test.go:168 +0x15ce
  fmt.Fscanf()
      /usr/local/go/src/fmt/scan.go:143 +0xee
  fmt.Sscanf()
      /usr/local/go/src/fmt/scan.go:114 +0x191
  github.com/syndtr/goleveldb/leveldb/storage.fsParseName()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/storage/file_storage.go:643 +0xa6
  github.com/syndtr/goleveldb/leveldb/storage.(*fileStorage).List()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/storage/file_storage.go:458 +0x309
  github.com/syndtr/goleveldb/leveldb.(*DB).checkAndCleanFiles()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db_util.go:52 +0x301
  github.com/syndtr/goleveldb/leveldb.openDB()
      /home/pierspowlesland/go/pkg/mod/github.com/syndtr/[email protected]/leveldb/db.go:136 +0x9bb
  fmt.(*ss).doScanf()
      /usr/local/go/src/fmt/scan.go:1230 +0x411
  fmt.Fscanf()
      ...

Previous read at 0x00c001448520 by goroutine 203:
  [failed to restore the stack]
Removing the go-routine does not seem to have any effect on the tests.
It seems that very occasionally we are hitting this error condition and it's not clear why.
…2853) This change removes misuses of sync.WaitGroup in BlockChain. Before this change, block insertion modified the WaitGroup counter in order to ensure that Stop would wait for pending operations to complete. This was racy and could even lead to crashes if Stop was called at an unfortunate time. The issue is resolved by adding a specialized 'closable' mutex, which prevents chain modifications after stopping while also synchronizing writers with each other. Co-authored-by: Felix Lange <[email protected]>
The start stop test would sometimes time out while nodes were stopped, and the Network.Shutdown method would print errors relating to trying to stop a node twice. The shutdown mechanism has been updated to allow for more flexibility in deciding which errors are printed, allowing us to avoid printing double-shutdown errors when we have manually stopped nodes.
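One way to get that flexibility is to let the shutdown path filter which errors it reports; a self-contained sketch of the idea with hypothetical helper names (not the test package's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyStopped = errors.New("node already stopped")

type node struct{ stopped bool }

func (n *node) close() error {
	if n.stopped {
		return errAlreadyStopped
	}
	n.stopped = true
	return nil
}

// shutdown closes every node and reports only the errors the caller cares
// about, so an intentional double stop does not produce noise.
func shutdown(nodes []*node, ignore func(error) bool) {
	for i, n := range nodes {
		if err := n.close(); err != nil && !ignore(err) {
			fmt.Printf("error shutting down node %d: %v\n", i, err)
		}
	}
}

func main() {
	nodes := []*node{{}, {stopped: true}} // the second node was stopped manually by the test
	shutdown(nodes, func(err error) bool {
		return errors.Is(err, errAlreadyStopped)
	})
}
```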
Description
This PR adds a test that starts and stops nodes in a small network and verifies that the network can continue to function correctly throughout the process.
A quick overview of what the test does:
Other changes
This PR makes minSyncPeers, dialHistoryExpiration and inboundThrottleTime configurable; in order for this test to run (and run quickly) these values needed to be set, and they were previously hardcoded.
This PR also ensures correct shutdown of the miner, worker and backend.Backend, and fixes a number of data races/race conditions. Each fix is described in the associated commit message; it's probably easier to review this PR commit by commit, since these changes otherwise look unrelated.
I think this PR would be best merged rather than 'squashed and merged', due to the large number of commits containing unrelated fixes.
Upstream PRs that came out of this work
Tested
I managed to run this test 200 times on my local machine with no failures.
Related issues
#1642
Backwards compatibility
Yes