Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: import/tpch/nodes=32 failed #31732

Closed
cockroach-teamcity opened this issue Oct 23, 2018 · 38 comments
Closed

roachtest: import/tpch/nodes=32 failed #31732

cockroach-teamcity opened this issue Oct 23, 2018 · 38 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/d3f39a5cb60eb9cea64635f78e25f67a435057a5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=980026&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Oct 23, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Oct 23, 2018
@tbg tbg assigned tbg and unassigned andreimatei Oct 23, 2018
@tbg
Copy link
Member

tbg commented Oct 23, 2018

Ugh, hoped we were done with this sort of thing for a while. A number of ranges simply seems to "freeze up". The SHA notably has #31689.

181023 10:11:08.714477 63680 storage/replica.go:3885  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] proposing REMOVE_REPLICA((n9,s4):4): updated=[(n23,s23):1 (n16,s16):2 (n25,s25):3 (n5,s15):5] next=6
181023 10:11:08.714477 63680 storage/replica.go:3885  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] proposing REMOVE_REPLICA((n9,s4):4): updated=[(n23,s23):1 (n16,s16):2 (n25,s25):3 (n5,s15):5] next=6
181023 10:11:09.242351 100766 storage/store.go:2580  [n9,replicaGC,s4,r2159/4:/Table/53/1/369{11440…-38215…}] removing replica r2159/4
181023 10:11:09.242351 100766 storage/store.go:2580  [n9,replicaGC,s4,r2159/4:/Table/53/1/369{11440…-38215…}] removing replica r2159/4
181023 10:11:09.319611 100766 storage/replica.go:863  [n9,replicaGC,s4,r2159/4:/Table/53/1/369{11440…-38215…}] removed 267187 (267179+8) keys in 73ms [clear=1ms commit=72ms]
181023 10:11:09.319611 100766 storage/replica.go:863  [n9,replicaGC,s4,r2159/4:/Table/53/1/369{11440…-38215…}] removed 267187 (267179+8) keys in 73ms [clear=1ms commit=72ms]
181023 10:12:09.079428 63831 storage/replica.go:3346  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s for proposing command GC [/Table/53/1/369114403/4,/Table/53/1/369114403/4/NULL)
181023 10:12:09.079428 63831 storage/replica.go:3346  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s for proposing command GC [/Table/53/1/369114403/4,/Table/53/1/369114403/4/NULL)
181023 10:12:30.130245 66387 storage/replica_command.go:816  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] change replicas (REMOVE_REPLICA (n25,s25):3): read existing descriptor r2159:/Table/53/1/369{114403/4-382151/3} [(n23,s23):1, (n16,s16):2, (n25,s25):3, (n5,s15):5, next=6, gen=2]
181023 10:12:30.130245 66387 storage/replica_command.go:816  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] change replicas (REMOVE_REPLICA (n25,s25):3): read existing descriptor r2159:/Table/53/1/369{114403/4-382151/3} [(n23,s23):1, (n16,s16):2, (n25,s25):3, (n5,s15):5, next=6, gen=2]
181023 10:13:30.130222 66387 storage/queue.go:791  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] change replicas of r2159 failed: aborted while in command queue: context deadline exceeded
181023 10:13:30.130222 66387 storage/queue.go:791  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] change replicas of r2159 failed: aborted while in command queue: context deadline exceeded
181023 10:14:32.058314 67379 storage/queue.go:791  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}]: unable to transfer lease to s25: context deadline exceeded
181023 10:14:32.058314 67379 storage/queue.go:791  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}]: unable to transfer lease to s25: context deadline exceeded
181023 10:15:13.602238 73950 storage/store.go:2580  [n2,replicaGC,s2,r2159/?:/Table/53/1/3{691144…-714161…}] removing replica r2159/0
181023 10:15:13.602238 73950 storage/store.go:2580  [n2,replicaGC,s2,r2159/?:/Table/53/1/3{691144…-714161…}] removing replica r2159/0
181023 10:15:13.638081 73950 storage/replica.go:863  [n2,replicaGC,s2,r2159/?:/Table/53/1/3{691144…-714161…}] removed 8 (0+8) keys in 35ms [clear=0ms commit=35ms]
181023 10:15:13.638081 73950 storage/replica.go:863  [n2,replicaGC,s2,r2159/?:/Table/53/1/3{691144…-714161…}] removed 8 (0+8) keys in 35ms [clear=0ms commit=35ms]
181023 10:16:18.923507 68830 storage/replica.go:1546  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s attempting to acquire lease
181023 10:16:18.923507 68830 storage/replica.go:1546  [n23,replicate,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s attempting to acquire lease

@tbg
Copy link
Member

tbg commented Oct 23, 2018

However, this seems across the board, for multiple tests at once:

$ rg -a --no-line-number --no-filename --no-heading 'have been waiting' *.logs | cut -c 2- | sort
181023 10:12:09.079428 63831 storage/replica.go:3346  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s for proposing command GC [/Table/53/1/369114403/4,/Table/53/1/369114403/4/NULL)
181023 10:12:09.079428 63831 storage/replica.go:3346  [n23,s23,r2159/1:/Table/53/1/369{11440…-38215…}] have been waiting 1m0s for proposing command GC [/Table/53/1/369114403/4,/Table/53/1/369114403/4/NULL)
181023 10:12:35.476713 59095 storage/replica.go:3346  [n6,s12,r900/3:/Table/53/{1/57795…-2/15679…}] have been waiting 1m0s for proposing command [txn: fae558a9], ConditionalPut [/Local/Range/Table/53/1/578185447/5/RangeDescriptor,/Min)
181023 10:12:35.476713 59095 storage/replica.go:3346  [n6,s12,r900/3:/Table/53/{1/57795…-2/15679…}] have been waiting 1m0s for proposing command [txn: fae558a9], ConditionalPut [/Local/Range/Table/53/1/578185447/5/RangeDescriptor,/Min)
181023 10:12:54.974354 62435 storage/replica.go:3346  [n4,s7,r2605/1:/Table/53/1/43{19763…-22439…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/431976384/6,/Min)
181023 10:12:54.974354 62435 storage/replica.go:3346  [n4,s7,r2605/1:/Table/53/1/43{19763…-22439…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/431976384/6,/Min)
181023 10:12:59.372869 61940 storage/replica.go:3346  [n27,s27,r450/3:/Table/53/1/502{41091…-67850…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/502410912/5,/Min)
181023 10:12:59.372869 61940 storage/replica.go:3346  [n27,s27,r450/3:/Table/53/1/502{41091…-67850…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/502410912/5,/Min)
181023 10:13:01.338531 64853 storage/replica.go:3346  [n32,s32,r2528/1:/Table/53/1/511{47459…-74131…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/511474593/1,/Min)
181023 10:13:01.338531 64853 storage/replica.go:3346  [n32,s32,r2528/1:/Table/53/1/511{47459…-74131…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/511474593/1,/Min)
181023 10:13:19.530750 60820 storage/replica.go:3346  [n6,s12,r2861/3:/Table/53/1/441{44128…-70851…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/441441284/2,/Min)
181023 10:13:19.530750 60820 storage/replica.go:3346  [n6,s12,r2861/3:/Table/53/1/441{44128…-70851…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/441441284/2,/Min)
181023 10:13:37.079546 154081 storage/replica.go:3346  [n7,s10,r2630/4:/Table/53/1/5{699299…-701968…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/569929990/8,/Min)
181023 10:13:37.079546 154081 storage/replica.go:3346  [n7,s10,r2630/4:/Table/53/1/5{699299…-701968…}] have been waiting 1m0s for proposing command TransferLease [/Table/53/1/569929990/8,/Min)
[...]

I guess the only way to really make progress on this is a live repro. Going to kick off a run (with fewer nodes).

@tbg
Copy link
Member

tbg commented Oct 23, 2018

At least this repro'ed right away:

goroutine 43709 [select, 13 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).tryExecuteWriteBatch(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3322 +0x961
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).executeWriteBatch(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3139 +0xab
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sendWithRangeID(0xc4256b6900, 0x3170500, 0xc42a762b10, 0x6e5, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:2017 +0x711
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).Send(0xc4256b6900, 0x3170500, 0xc42a762ae0, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:1965 +0x90
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Send(0xc42044d180, 0x3170500, 0xc42a762ae0, 0x156043aa68e2373d, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3022 +0x60c
github.com/cockroachdb/cockroach/pkg/storage.(*Stores).Send(0xc4208154a0, 0x3170500, 0xc42a762a80, 0x0, 0x0, 0x200000004, 0x2, 0x6e5, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/stores.go:185 +0xdb
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x3170500, 0xc42a762a80, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:988 +0x1c1
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc4207cef30, 0x3170500, 0xc42a762a80, 0x2c15619, 0x10, 0xc427e96288, 0x0, 0x0)

(among others)

@tbg
Copy link
Member

tbg commented Oct 23, 2018

@tbg
Copy link
Member

tbg commented Oct 23, 2018

Interestingly this had resolved itself when I just looked again.

From the logspy above, I can see that it talked to replicaIDs 1, 3, 5 (there aren't any responses in the 10s window)

I181023 14:56:41.442261 125 storage/raft.go:134  [n4,s2,r1765/2:/Table/53/1/406{58851…-85561…}] raft ready (must-sync=true)
  New Entry[0]: 7/403 EntryNormal [59ca2f1fd9f654d1] [104]
  Outgoing Message[0]: 2->3 MsgHeartbeat Term:7 Log:0/0 Commit:26
  Outgoing Message[1]: 2->5 MsgHeartbeat Term:7 Log:0/0
  Outgoing Message[2]: 2->1 MsgHeartbeat Term:7 Log:0/0
  Outgoing Message[3]: 2->3 MsgApp Term:7 Log:7/402 Commit:26 Entries:[7/403 EntryNormal [59ca2f1fd9f654d1] [104]]

This is interesting because that makes four members of the range in total.

The range status page shows that replicaIDs 1, 2, 5 are part of the replica now, so 3 is gone. But I'm having trouble piecing this together. We only send the MsgApp to 3, why not to the other two? We're also only sending a Commit index to 3 in the heartbeat. Does Raft perhaps think that 1 and 5 are down? But why isn't there an MsgAppResp from 3 at least?

image

@tbg
Copy link
Member

tbg commented Oct 23, 2018

cc @bdarnell in case you see something I don't.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/de005ed7e08f4edd3b95b9388596e149058d13b8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=982526&tab=buildLog

The test failed on release-2.1:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@tbg
Copy link
Member

tbg commented Oct 24, 2018

^- stuck raft proposals until the bitter end.

@tbg
Copy link
Member

tbg commented Oct 24, 2018

I kicked off another manual run (disabled merges) and that wedged too. Note that we somehow have four members in our replica set and two of them seem completely uninitialized, blocking forward progress.

image

@tbg
Copy link
Member

tbg commented Oct 24, 2018

The Raft snapshot queue is absolutely backed up:

image

and so are some of these other queues. My best guess as to why this is happening so suddenly is that the test is done with the "import" part and begins a restore part (where it actually splits and sends SSTs around). In the situation above, where two replicas need a snapshot, it would make sense that the range would just fall over. And it also makes sense that once you have thousands of pending Raft snapshots, there is big trouble.

@tbg
Copy link
Member

tbg commented Oct 24, 2018

logspying on one of the nodes for ,raftsnapshot, shows that everyone wants a snapshot, but it takes ~40s to get even a single one out.

I181024 13:41:35.836885 161068 server/debug/logspy.go:182  intercepting logs with options {Count:10000 Duration:1m0s Grep:,raftsnapshot,}
I181024 13:41:41.839521 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1151/2:/Table/53/1/305{66937…-93632…}] raft snapshot needed, enqueuing
I181024 13:41:45.345314 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1931/2:/Table/53/1/565{42704…-69402…}] raft snapshot needed, enqueuing
I181024 13:41:46.222135 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2204/2:/Table/53/6/8{342/11…-622/21…}] raft snapshot needed, enqueuing
I181024 13:41:50.603532 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1949/2:/Table/53/1/53{48500…-51035…}] raft snapshot needed, enqueuing
I181024 13:41:58.765126 167166 storage/store_snapshot.go:636  [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] sending Raft snapshot 56cdd770 at applied index 25
I181024 13:42:01.120067 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2226/2:/Table/53/6/937{0/349…-7/171…}] raft snapshot needed, enqueuing
I181024 13:42:03.176108 167166 storage/store_snapshot.go:679  [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] streamed snapshot to (n2,s2):1: kv pairs: 267151, log entries: 2, rate-limit: 8.0 MiB/sec, 39747ms
I181024 13:42:03.499741 167166 storage/queue.go:751  [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] done
I181024 13:42:03.499774 167166 storage/queue.go:667  [n4,raftsnapshot,s7,r1670/2:/Table/53/1/38{09628…-12306…}] done 40.071031219s
I181024 13:42:03.499835 167499 storage/queue.go:707  [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] processing replica
I181024 13:42:03.499871 167499 storage/queue.go:745  [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] processing
I181024 13:42:03.499879 167499 storage/raft_snapshot_queue.go:94  [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] sending raft snapshot
I181024 13:42:09.005845 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2216/2:/Table/53/6/87{09/33…-17/29…}] raft snapshot needed, enqueuing
I181024 13:42:09.882265 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2213/2:/Table/53/6/86{85/47…-93/42…}] raft snapshot needed, enqueuing
I181024 13:42:11.638441 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1157/2:/Table/53/1/32{08396…-11064…}] raft snapshot needed, enqueuing
I181024 13:42:14.272795 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2127/2:/Table/53/1/56{59614…-61666…}] raft snapshot needed, enqueuing
I181024 13:42:16.901743 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1428/2:/Table/53/1/443{68460…-95187…}] raft snapshot needed, enqueuing
I181024 13:42:19.531425 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r501/2:/Table/53/1/8{686486…-713126…}] raft snapshot needed, enqueuing
I181024 13:42:20.409178 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1814/2:/Table/53/1/377{30771…-57437…}] raft snapshot needed, enqueuing
I181024 13:42:21.285828 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1232/2:/Table/53/1/307{45625…-72234…}] raft snapshot needed, enqueuing
I181024 13:42:22.162619 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r873/2:/Table/53/1/19{08200…-10884…}] raft snapshot needed, enqueuing
I181024 13:42:26.543767 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r2231/2:/Table/53/6/99{58/13…-66/94…}] raft snapshot needed, enqueuing
I181024 13:42:29.172478 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r498/2:/Table/53/1/88{46665…-73388…}] raft snapshot needed, enqueuing
I181024 13:42:35.304600 185 storage/raft_snapshot_queue.go:76  [n4,raftsnapshot,s7,r1457/2:/Table/53/1/22{59905…-62578…}] raft snapshot needed, enqueuing
I181024 13:42:35.459626 167499 storage/store_snapshot.go:636  [n4,raftsnapshot,s7,r1686/2:/Table/53/1/3820{3114…-8406…}] sending Raft snapshot d9244af9 at applied index 20

@tbg
Copy link
Member

tbg commented Oct 24, 2018

The delay between "sending raft snapshot" and "sending Raft snapshot d9244af9 at applied index 20" of ~32s is suspicious. What's taking so long? This is a) creating the snapshot (but that's just opening an iterator) and exchanging metadata with the target store. It's not actually transferring any data yet. We see that the actual transfer is more like 5s. The 39747ms log line indicates that we do spend most of that time in sendSnapshot so that my assumption is that we spend lots of time in

	if err := stream.Send(&SnapshotRequest{Header: &header}); err != nil {
		return err
	}

Ah, of course, this will block on the receiver's snapshot apply semaphore:

case s.snapshotApplySem <- struct{}{}:

Presumably all nodes are sending snapshots to all other nodes, so everything's completely clogged.

Ok, it seems that one very fundamental badness here indeed has to do with why we're seeing all these snapshots being necessary in the first place.

I'm going to make it a priority to look into the related truncation logic discussion in #31409 (comment) to see if we're doing stupid things when AddSSTable is involved (or generally).

cc @andreimatei

@bdarnell
Copy link
Contributor

I'd start by looking at the replica GC queue. We send preemptive snapshots before adding replicas; this leaves a brief window between the preemptive snapshot and the finalization of of the replica change during which the replica can be GC'd and require an immediate raft snapshot. Previously, we relied on the fact that the replica GC queue was not responsive enough to make this scenario likely. Maybe now the replica GC is happening faster and is more likely to catch replicas during this vulnerable window

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/0dba537ae88e495ddf29b4c347b4c30ee99bd046

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=983962&tab=buildLog

The test failed on release-2.1:
	asm_amd64.s:574,panic.go:502,panic.go:63,signal_unix.go:388,log.go:172,log.go:216,cluster.go:221,cluster.go:695: runtime error: invalid memory address or nil pointer dereference

@benesch
Copy link
Contributor

benesch commented Oct 24, 2018

I'm starting to suspect this change: e436e56

@tbg
Copy link
Member

tbg commented Oct 24, 2018

I had two successful runs after "fixing" (or rather, disarming) the Raft truncation bugs, but good point about the GC queue. I'll keep an eye out.

Our goal really ought to be to get these import tests passing with zero (or barely any) Raft snapshots.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a4950e75f6592b8cbea217cf1392d8ef8961130f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=985074&tab=buildLog

The test failed on master:
	test.go:639,test.go:651: /home/agent/work/.go/bin/roachprod create teamcity-985074-import-tpch-nodes-32 -n 32 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b returned:
		stderr:
		
		stdout:
		Creating cluster teamcity-985074-import-tpch-nodes-32 with 32 nodes
		Unable to create cluster:
		in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181004 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account [email protected] --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script786808859 --project cockroach-ephemeral]
		Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - Quota 'CPUS' exceeded. Limit: 1200.0 in region us-central1.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@tbg
Copy link
Member

tbg commented Oct 25, 2018

The AddSSTable requests are ~32mb each (

var importBatchSize = settings.RegisterByteSizeSetting(
"kv.import.batch_size",
"the maximum size of the payload in an AddSSTable request",
32<<20,
)
) as we expected. I think with the log size tracking fixed, the behavior here is borderline ok:

  1. range 123 is created (empty)
  2. AddSSTable 1 (of probably 2) arrives and clears the quota pool, gets appended
  3. raftLogSize is now ~32mb
  4. AddSSTable 2/2 arrives, blocks on quota pool
  5. raft truncation queue immediately queues a truncation (using the quorum commit index instead of lowest available index. Truncation will go into the log behind the first AddSSTable, but the leader has a good chance of applying it well before one of the followers and it might apply the truncation too. I think this is probably fine because by the time that happens we've at least sent out the AddSSTable to all followers, so we're not actually going to force any of them into a snapshot.
  6. AddSSTable 1/2 and the truncation apply, quota pool releases quota only as all followers catch up (since everything is healthy)
  7. AddSSTable 2/2 clears the quota pool, the game repeats.

What I don't understand however is why things get so bad with the bug. When the next AddSSTable comes in, the Ready that would append to the log would usually "atomically" send out these new entries to followers. So for a truncation to actually require a snapshot, we need a follower to not be caught up on the previous entry (because then Raft can only send out the message in a later Ready). But when the quota pool lets the second AddSSTable proceed into the Raft log, didn't we make sure that all followers are caught up?

I might be missing something or the behavior is more subtle (for example, it might be worse when ranges are newly created, or when there are additional small proposals like lease requests involved).

I'm going to see if there's something here that I can reproduce with a bit of hacking on a local 3x cluster and kv with a large payload.

@tbg
Copy link
Member

tbg commented Oct 25, 2018

I'm running a local experiment via roachprod three nodes, where the third node has a 100ms sleep in applyRaftCommand and the workload is

./cockroach sql --insecure -e 'alter range default configure zone using range_max_bytes = 1048471142400;' && ./bin/workload run kv --cycle-length 1 --concurrency 4 --drop --init --max-ops 100 --min-block-bytes $((1024)) --max-block-bytes $((1 << 20)) --sequential

I additionally have a diff applied that makes the tracked size of the raft log always 8mb (i.e. well above what would trigger an aggressive truncation, 4mb).

I'm watching for Raft snapshots by tailing the logs and by watching crdb_internal.gossip_alerts.

I seem to be able to wedge the cluster in a stuck state:

image

The rightmost replica isn't catching up over minutes. It's clearly in need of a Raft snapshot, but this snapshot doesn't seem to receive the appropriate priority. Instead, snapshots intersecting that replica are tried in an endless loop and, of course, fail. The slow replica is sent for GC repeatedly, but of course it can't be GC'ed -- it's still in the range, but needs to catch up on the split trigger.

I181025 11:15:55.969305 6632 storage/replica_gc_queue.go:222 [n3,replicaGC,s3,r31/3:/{Table/73-Max}] not gc'able, replica is still in range descriptor: (n3,s3):3

This situation resolved itself after a few minutes, but imagine this happening with thousands of ranges at the same time.

Looking at the code that adds replicas into the raft snapshot queue, I see that the same priority (zero) is used for all.

This suggests that we could use additional smarts. For example, instead of sending a replica blocking a snapshot to GC, we should (also?) offer it to the Raft snapshot queue, ideally at a priority that's higher if we suspect that the snapshot will shrink the replica.

@tbg
Copy link
Member

tbg commented Oct 25, 2018

I can also repro a stats inconsistency (running with COCKROACH_FATAL_ON_STATS_MISMATCH=true):

F181025 11:39:16.531669 2071 storage/replica_consistency.go:119  [n3,consistencyChecker,s3,r32/2:/{Table/55-Max}] found a delta of {false 0 0 315173140 -280215 0 36 0 16307809 3 0 0 0 0}
goroutine 2071 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000056301, 0xc000056300, 0x835e100, 0x1e)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:997 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x8a99f60, 0xc000000004, 0x835e142, 0x1e, 0x77, 0xc00087c800, 0x79)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:866 +0x8e8
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x6e5f6a0, 0xc005f7b6e0, 0x4, 0x2, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x6e5f6a0, 0xc005f7b6e0, 0x1, 0x4, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x6e5f6a0, 0xc005f7b6e0, 0x675e60c, 0x14, 0xc0069779e0, 0x1, 0x1)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:172 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).CheckConsistency(0xc0008d5c00, 0x6e5f6a0, 0xc005f7b6e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_consistency.go:119 +0x6e6
github.com/cockroachdb/cockroach/pkg/storage.(*consistencyQueue).process(0xc000364260, 0x6e5f6a0, 0xc005f7b6e0, 0xc0008d5c00, 0x0, 0x0, 0x300000003)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/consistency_queue.go:117 +0x1d0
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processReplica(0xc0001fcb00, 0x6e5f6e0, 0xc0067923f0, 0xc0008d5c00, 0x0, 0x0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:747 +0x339
github.com/cockroachdb/cockroach/pkg/storage.(*baseQueue).processLoop.func1.2(0x6e5f6e0, 0xc0067923f0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/queue.go:626 +0xb8
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0001b1e60, 0x6e5f6e0, 0xc0067923f0, 0xc005622500, 0x36, 0x0, 0x0, 0xc004f3b3e0)
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:324 +0xe6
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/Users/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:319 +0x134

@tbg
Copy link
Member

tbg commented Oct 25, 2018

Ok, much simpler:

./bin/workload run kv --splits 1000 --init --drop --max-ops 1

With the truncation hack that simulates overly aggressive truncations (and the delay on n3), that gives me approximately ~700 successful Raft snapshots and ~1100 failures. Ouch.
With aggressive truncation and e436e56 reverted, it still gets ~700 successful snapshots and ~700 failures.

If I remove my hacks (i.e. run master without aggressive truncations or delay), I get typically <300 Raft snapshot successes and very few failures.

I'm just going to go ahead and fix the truncation problems to see what sticks after.

@tbg
Copy link
Member

tbg commented Oct 25, 2018

Experimenting some more, I can see that (on master with merges turned off) I still see Raft snapshots if I disable both the replicaGCQueue and the Raft log queue. I think this is due to the race in which the RHS of the split gets created via an incoming message from another replica that processed the split faster.

@tbg
Copy link
Member

tbg commented Oct 25, 2018

I tried to "hack" this problem away by disabling the Raft snapshot queue, but this leads to badness. I assume this is because once Raft asks for a snapshot, you have to give it one. You can't just forget about that request. Not 100% sure this is what's happening, but it would make sense.

@tbg
Copy link
Member

tbg commented Oct 25, 2018

I was able to hack this problem away by simply reporting all MsgSnap back to the raft group as completed. I opened my repro for this with a comment as a PR so that I don't lose track of it: #31875

@tbg
Copy link
Member

tbg commented Oct 25, 2018

Attempting to productionize @petermattis' patch https://github.com/cockroachdb/cockroach/compare/master...petermattis:pmattis/tpcc-9x64?expand=1 I went down a little rabbit hole wondering why my test saw unexpected raft log sizes. Turns we're mixing something up in log truncation. A TruncateRequest has a field Index which is supposedly the first index of the resulting log (i.e. we delete only strictly smaller indexes). But it turns out that our code actually includes Index itself in the deletion. That sounds pretty bad, but I also noticed that the Index we compute is usually (always?) one below what we could truncate (i.e. if all followers have index 100 in our logs, we're going to send Index: 100 even though we could delete 100 itself by sending Index: 101).

Digging with @nvanbenschoten.

@tbg
Copy link
Member

tbg commented Oct 25, 2018

Looks mostly like a stats issue. We're subtracting a little more (one entry worth) from the log size than we should. We generate the correct TruncatedState (in which the Index is the last deleted one, not the first undeleted one) by populating it with args.Index - 1. The actual deletion happens downstream of Raft and uses the correct range. 😓

edit: PR here https://github.com/cockroachdb/cockroach/pull/31885/files

@tbg
Copy link
Member

tbg commented Oct 25, 2018

Oh wow, are you ready for another small dumpster fire? TruncateTo for the sideloaded storage was never updated to deal with the fact that we're doing all of our writes through the RocksDB env. It seems that putting and getting from the sideloaded storage does use the env, so that works, but we're likely not cleaning these things up properly. Or maybe we are, but definitely not if you're using an in-mem engine as we are in tests.

tbg added a commit to tbg/cockroach that referenced this issue Oct 26, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/fa7fb35ebd4ed4a2a4d0c1692cb1560654b73ea6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=987340&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

tbg added a commit to tbg/cockroach that referenced this issue Oct 26, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
craig bot pushed a commit that referenced this issue Oct 26, 2018
31914: storage: fix Raft log size accounting r=petermattis a=tschottdorf

We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as #31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches #31732.
Touches #31740.
Touches #30261.
Touches #31768.
Touches #31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.

Co-authored-by: Tobias Schottdorf <[email protected]>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5ef4d2c8621fc5465f73a96221b0bd0bc5cd27aa

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990073&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/441d7134902e6bcc6a5abdff6b168003f19ce592

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=990998&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/109cf8705b773c0d3a1e7ab02ce63f764e101106

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=991714&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bbc646fc6de90b59c0253fd682667715959fb657

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=993605&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1453,import.go:122: context canceled

@tbg
Copy link
Member

tbg commented Oct 30, 2018

^- this last run has #31988.

@tbg
Copy link
Member

tbg commented Oct 30, 2018

Stuck Raft requests all the way through.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5b5738084b8cdc769d5e7973921de5cae4457380

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=995412&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1461,import.go:122: context canceled

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c196a52e8ec61851e32a062c13135bbb4111eb96

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=997430&tab=buildLog

The test failed on master:
	test.go:1037: test timed out (3h0m0s)
	test.go:639,cluster.go:1461,import.go:122: context canceled

tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked raft log size would permanently skyrocket which in turn would
lead to extremely aggressive truncations and resulted in pathological
amounts of Raft snapshots.

I'm still concerned about this logic as we're now relying on numbers
obtained from the file system to match exactly a prior in-mem
computation, and there may be other bugs that cause a divergence. But
this fixes the blatant obvious one, so it's a step in the right
direction.

The added tests highlight a likely omission in the sideloaded storage
code which doesn't access the file system through the RocksDB env as it
seems like it should, filed as cockroachdb#31913.

At this point it's unclear whether it fixes the below issues, but at the
very least it seems to address a major problem they encountered:

Touches cockroachdb#31732.
Touches cockroachdb#31740.
Touches cockroachdb#30261.
Touches cockroachdb#31768.
Touches cockroachdb#31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.
@tbg
Copy link
Member

tbg commented Dec 3, 2018

Based on the log around Nov 1st, I think this was fixed by #32053. (There's likely a large buildup in snapshots after that, but it gets processed fast enough for the test to pass). The last run doesn't have a buildup of snapshots at all, likely due to #32594.

@tbg tbg closed this as completed Dec 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

5 participants