sql/tests: TestMonotonicInserts/distsql=on failed #67802
Comments
I've stressed it with
...
Seems like a KV issue.
Another instance: https://teamcity.cockroachdb.com/viewLog.html?buildId=3200931&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Testrace If we keep hitting it, I will have to skip TestMonotonicInserts.
Chasing the deadlock, I saw that the test always gets stuck with all 6 parallel queries waiting on a lock queue. I also noticed that if I bump the timeout to 90s and run it for a while, the test time varies widely: most iterations finish in 15s, but some run for 45-60s and eventually hit the 90s threshold. I'll try to bisect whether this behaviour is new, or whether we just have more heavily stressed machines running the tests.
So I bisected the failure to this PR: #66505, which bumps Go to 1.16.5.
Could it be related to #67658? If yes, we could try bisecting among changes to
Current findings: the test runs fine unless one of the transactions gets into a bad state where it restarts indefinitely with the same txn ID while holding locks. On each iteration it obtains more and more locks for its unsuccessful writes. This can either lead to a test timeout, or, if luck strikes, txn_interceptor_heartbeater.go attempts a heartbeat at the moment the error has been sent to the client but the client hasn't retried yet (the retry runs in a tight loop); that triggers a transaction abort which kills the transaction non-retryably and lets the other writers proceed.
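For context, the tight client-side retry mentioned here is the savepoint-based retry protocol that CockroachDB clients use (e.g. via cockroach-go's ExecuteTx): rolling back to the savepoint keeps the same SQL transaction, and therefore the same txn ID and its locks, alive across attempts. Below is a minimal sketch of that pattern, assuming a plain database/sql setup and a simplified SQLSTATE check; it is not the test's literal code.

```go
// A minimal sketch of the savepoint-based client retry protocol, NOT the
// test's actual code. Rolling back to the savepoint keeps the same SQL
// transaction (same txn ID, same locks) alive across attempts, and the retry
// happens in a tight loop.
package example

import (
	"context"
	"database/sql"
	"strings"
)

func retryTight(ctx context.Context, db *sql.DB, attempt func(*sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, "SAVEPOINT cockroach_restart"); err != nil {
		_ = tx.Rollback()
		return err
	}
	for {
		err := attempt(tx)
		if err == nil {
			if _, err = tx.ExecContext(ctx, "RELEASE SAVEPOINT cockroach_restart"); err == nil {
				return tx.Commit()
			}
		}
		// SQLSTATE 40001 marks a retryable error (check simplified here). The
		// immediate retry below is the tight loop mentioned above: the window
		// in which the heartbeater can race with the client and abort the
		// transaction non-retryably.
		if strings.Contains(err.Error(), "40001") {
			if _, rerr := tx.ExecContext(ctx, "ROLLBACK TO SAVEPOINT cockroach_restart"); rerr == nil {
				continue
			}
		}
		_ = tx.Rollback()
		return err
	}
}
```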
Update: retracting the bisect conclusion about the Go update. After running the pre-1.16 build for longer, I made it fail as well. So the failure may be more frequent now, but it was definitely happening even before that change. Another observation: in the cases where we get stuck, requests try to push the transaction, and those push requests end up waiting on the same lock table object (see 0xc0006bd680 below)?
Update:
Refs: cockroachdb#67802
Reason: flaky test
Generated by bin/skip-test.
Release justification: non-production code changes
Release note: None
…ierProcessor}
Also TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces.
Refs: cockroachdb#67802, cockroachdb#68127, cockroachdb#68795, cockroachdb#68801
Release note: None
68794: roachtest: fix apt-get upgrade in Jepsen tests r=knz a=erikgrinaker

An `apt-get upgrade` command blocked on:

```
Setting up openssh-server (1:8.2p1-4ubuntu0.3) ...
debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
Configuring openssh-server
--------------------------

A new version (/tmp/fileMcydAr) of configuration file /etc/ssh/sshd_config is available, but the version installed currently has been locally modified.

  1. install the package maintainer's version
  2. keep the local version currently installed
  3. show the differences between the versions
  4. show a side-by-side difference between the versions
  5. show a 3-way difference between available versions
  6. do a 3-way merge between available versions
  7. start a new shell to examine the situation

What do you want to do about modified configuration file sshd_config?
```

This patch sets `DEBIAN_FRONTEND=noninteractive` to avoid this.

Resolves #68783. Resolves #68782. Resolves #68781. Resolves #68780. Resolves #68779. Resolves #68778. Resolves #68777. Resolves #68776. Resolves #68775. Resolves #68774. Resolves #68773. Resolves #68772. Resolves #68771. Resolves #68770. Resolves #68769. Resolves #68768. Resolves #68767. Resolves #68766. Resolves #68765. Resolves #68764. Resolves #68763. Resolves #68762. Resolves #68761. Resolves #68760. Resolves #68759. Resolves #68758. Resolves #68757. Resolves #68756. Resolves #68755. Resolves #68754. Resolves #68753. Resolves #68752. Resolves #68744. Resolves #68726. Resolves #68724. Resolves #68722. Resolves #68720. Resolves #68719. Resolves #68718.

Release note: None

68802: *: skip Test{MonotonicInserts,BackupRestoreEmpty,StreamIngestionFront… r=irfansharif a=irfansharif

…ierProcessor}

Also TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces.

Refs: #67802, #68127, #68795, #68801

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
As discussed at some point with @nvanbenschoten, we thought the problem was that we refresh to the closed timestamp, but I think it is just a matter of how the test uses cluster_logical_timestamp(), which pins the transaction to the closed timestamp from the previous try. Unless there's a guaranteed way a client could do the same thing "legally", I don't think it is a bug. So we'd rather fix the test, or at least make it respond to the situation where it gets temporarily stuck. As for the stuck time: the test can pause on the aforementioned select max(val) on all 6 workers for a few seconds and then gets unblocked by the test termination code. I think it isn't even trying to run the verification query at that time, since we are still waiting for all workers to finish.
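To make the cluster_logical_timestamp() dependence concrete, here is a hypothetical, simplified rendering of a worker's statement; the table and column names are illustrative, not the test's exact schema. The point is that the inserted row records the transaction's own commit timestamp, so a transaction that keeps restarting at a stale timestamp keeps producing stale values.

```go
// Hypothetical, simplified shape of a worker's statement (illustrative names,
// not the test's exact schema). Pairing max(val)+1 with
// cluster_logical_timestamp() pins the written row to the transaction's
// commit timestamp, which after repeated restarts can sit at or just above an
// old closed timestamp rather than near wall-clock time.
package example

const monotonicInsert = `
INSERT INTO mono (val, sts)
SELECT max(val) + 1, cluster_logical_timestamp()
FROM mono
RETURNING val, sts`
```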
OK, so I added a check: once cluster_logical_timestamp() diverges from the wall clock by more than 2 seconds, drop the transaction and start again. With that, it runs without getting stuck. It is a band-aid, but I think the problem of being unable to restart is inherent to the test, not to the way we refresh transactions.
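A minimal sketch of that band-aid, assuming the check is issued as an extra query inside the worker's transaction and that the caller handles the rollback and restart; this is not the actual patch.

```go
// A sketch of the divergence check described above (assumed helper, not the
// actual patch). cluster_logical_timestamp() returns a DECIMAL whose integer
// part is wall-clock nanoseconds since the Unix epoch; if that lags the local
// clock by more than 2s, the caller drops the transaction and starts over.
package example

import (
	"context"
	"database/sql"
	"errors"
	"strconv"
	"strings"
	"time"
)

var errStaleTxnTimestamp = errors.New("txn timestamp lags wall clock; restart the transaction")

func checkTxnTimestampFreshness(ctx context.Context, tx *sql.Tx) error {
	var hlc string
	if err := tx.QueryRowContext(
		ctx, "SELECT cluster_logical_timestamp()::STRING",
	).Scan(&hlc); err != nil {
		return err
	}
	// Keep only the integer (wall-clock nanos) part of the HLC decimal.
	if i := strings.IndexByte(hlc, '.'); i >= 0 {
		hlc = hlc[:i]
	}
	nanos, err := strconv.ParseInt(hlc, 10, 64)
	if err != nil {
		return err
	}
	if time.Since(time.Unix(0, nanos)) > 2*time.Second {
		return errStaleTxnTimestamp // caller rolls back and begins a fresh txn
	}
	return nil
}
```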
There is. Clients have access to the

Fixing the test sounds like a good idea, but I don't think we should miss this opportunity to improve upon how transactions are retried after being bumped by the closed timestamp. Right now, transactions that get bumped by the closed timestamp push their write timestamp to immediately above the closed timestamp (

This can be visualized as:
This was something we noticed in #51294. The 600ms referenced there was the period of the closed timestamp at the time. We addressed that issue by preventing transaction refreshes from being starved in such situations. But in retrospect, it's not surprising that transactions that use
I imagine that if we were to address that, this test would no longer be flaky. Even if a transaction took 3s during its first epoch due to some slowdown, it wouldn't enter this endless starvation loop where it has effectively no chance of ever succeeding in subsequent retries. Instead, it would be given a full 3s again to complete its next retry.
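A sketch of that idea, using time.Time as a stand-in for hlc.Timestamp and ignoring all of the real restart machinery; it only illustrates how much further ahead of the closed timestamp the retried transaction would be placed.

```go
// A sketch of the retry-timestamp idea from the comment above, using
// time.Time as a stand-in for hlc.Timestamp; this is not CockroachDB's
// actual implementation.
package example

import "time"

// retryTimestamp picks the timestamp for the next epoch of a transaction
// that was bumped by the closed timestamp.
func retryTimestamp(closedTS time.Time, prevEpochDuration time.Duration) time.Time {
	// Current behavior (roughly): restart immediately above the closed
	// timestamp. Since the closed timestamp trails real time, a transaction
	// whose epoch takes longer than that trailing interval gets bumped again
	// right away, and starves.
	current := closedTS.Add(1) // analogous to closedTS.Next()

	// Proposed behavior: also move forward by roughly the duration of the
	// previous epoch, so the retry gets about as much runway before the
	// closed timestamp catches up to it as the first attempt had.
	proposed := closedTS.Add(prevEpochDuration)
	if proposed.Before(current) {
		proposed = current
	}
	return proposed
}
```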
Will this change work: #69096? I'm a bit wary of changing the transaction timestamp "randomly", and I don't know how much to trust our tests to cover that part.
sql/tests.TestMonotonicInserts/distsql=on failed with artifacts on master @ 9f39af35fd81a7754e6123862b63247e3277b458:
Reproduce
To reproduce, try:
Parameters in this failure: