Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: restore2TB/nodes=32 failed #31745

Closed
cockroach-teamcity opened this issue Oct 23, 2018 · 6 comments
Closed

roachtest: restore2TB/nodes=32 failed #31745

cockroach-teamcity opened this issue Oct 23, 2018 · 6 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/d3f39a5cb60eb9cea64635f78e25f67a435057a5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=restore2TB/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=980026&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1110,restore.go:244,cluster.go:1432,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-980026-restore2tb-nodes-32:1 -- ./cockroach sql --insecure -e "
						RESTORE csv.bank FROM
						'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
						WITH into_db = 'restore2tb'" returned:
		stderr:
		
		stdout:
		: signal: killed
	test.go:639,cluster.go:1453,restore.go:250: unexpected node event: 15: dead

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Oct 23, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 23, 2018
@tbg
Copy link
Member

tbg commented Oct 23, 2018

F181023 15:50:28.273905 169 storage/replica.go:5713  [n7,s11,r4529/3:/Table/53/1/1096{7455…-8347}] RHS of merge r4529:/Table/53/1/1096{7455/0-8347/0} [(n31,s31):1, (n24,s17):2, (n7,s11):3, next=4, gen=2] <- r4922:/Table/53/1/10968347{-/0} [(n31,s31):1, (n24,s17):2, (n7,s11):3, next=4, gen=1] not present on store; found r4922:{-} [<no replicas>, next=0, gen=0] in place of the RHS
goroutine 169 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc4204be500, 0xc4204be540, 0x40c1400, 0x12)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:997 +0xcf
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x49b84a0, 0xc400000004, 0x40c146d, 0x12, 0x1651, 0xc4258d8420, 0x14a)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:864 +0x8fe
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3172e20, 0xc434819da0, 0x4, 0x2, 0x2caab39, 0x48, 0xc425c2c9f8, 0x3, 0x3)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2e5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3172e20, 0xc434819da0, 0x1, 0x4, 0x2caab39, 0x48, 0xc425c2c9f8, 0x3, 0x3)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3172e20, 0xc434819da0, 0x2caab39, 0x48, 0xc425c2c9f8, 0x3, 0x3)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:172 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).acquireMergeLock(0xc43336fb00, 0x3172e20, 0xc434819da0, 0xc425d6b400, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:5713 +0x1e5
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).maybeAcquireSplitMergeLock(0xc43336fb00, 0x3172e20, 0xc434819da0, 0x1100000018, 0x2, 0xa, 0x0, 0xa, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:5648 +0x66
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc43336fb00, 0x3172e20, 0xc434819da0, 0xc42043feb8, 0x8, 0x6, 0x15, 0x1100000018, 0x2, 0xa, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:5341 +0x157f
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc43336fb00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:4363 +0x125f
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x3172e20, 0xc4278d8930, 0xc43336fb00, 0x3172e20)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3703 +0x109
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc420dfa580, 0x3172e20, 0xc4278d8930, 0xc4265af040, 0xc425c2ded0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3357 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc420dfa580, 0x3172e20, 0xc42071c4b0, 0x11b1)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3691 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc420735280, 0x3172e20, 0xc42071c4b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x3172e20, 0xc42071c4b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42028e510, 0xc42002a000, 0xc42028e4d0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

cc @benesch. Heh, you're not gonna have trouble repro'ing that one.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a4950e75f6592b8cbea217cf1392d8ef8961130f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=restore2TB/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=985074&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1110,restore.go:244,cluster.go:1432,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-985074-restore2tb-nodes-32:1 -- ./cockroach sql --insecure -e "
						RESTORE csv.bank FROM
						'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
						WITH into_db = 'restore2tb'" returned:
		stderr:
		
		stdout:
		: signal: killed
	test.go:639,cluster.go:1453,restore.go:250: unexpected node event: 6: dead

@tbg
Copy link
Member

tbg commented Oct 25, 2018

^- the current merge bug

tbg added a commit to tbg/cockroach that referenced this issue Oct 26, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this issue Oct 26, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
craig bot pushed a commit that referenced this issue Oct 26, 2018
31914: storage: fix Raft log size accounting r=petermattis a=tschottdorf

We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as #31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches #31732.
Touches #31740.
Touches #30261.
Touches #31768.
Touches #31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.

Co-authored-by: Tobias Schottdorf <[email protected]>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/fa7fb35ebd4ed4a2a4d0c1692cb1560654b73ea6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=restore2TB/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=987340&tab=buildLog

The test failed on master:
	test.go:639,cluster.go:1110,restore.go:244,cluster.go:1432,errgroup.go:58: /home/agent/work/.go/bin/roachprod run teamcity-987340-restore2tb-nodes-32:1 -- ./cockroach sql --insecure -e "
						RESTORE csv.bank FROM
						'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
						WITH into_db = 'restore2tb'" returned:
		stderr:
		
		stdout:
		: signal: killed
	test.go:639,cluster.go:1453,restore.go:250: unexpected node event: 3: dead

@tbg
Copy link
Member

tbg commented Oct 26, 2018

^- ditto

F181026 16:55:34.192020 136 storage/replica.go:5763 [n27,s30,r12252/2:/Table/53/1/1639{6870…-7790}] RHS of merge r12252:/Table/53/1/1639{6870/0-7790/0} [(n19,s17):1, (n27,s30):2, (n12,s5):3, next=4, gen=2] <- r12256:/Table/53/1/16397790{-/0} [(n19,s17):1, (n27,s30):2, (n12,s5):3, next=4, gen=1] not present on store; found r12256:{-} [, next=0, gen=0] in place of the RHS

@benesch
Copy link
Contributor

benesch commented Oct 26, 2018

The fix is bors'ing now: #31842. Closing since I didn't think to include this bug in the commit message.

@benesch benesch closed this as completed Oct 26, 2018
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

3 participants