Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/tests: TestMonotonicInserts/distsql=on failed #67802

Closed
cockroach-teamcity opened this issue Jul 20, 2021 · 13 comments · Fixed by #69096
Closed

sql/tests: TestMonotonicInserts/distsql=on failed #67802

cockroach-teamcity opened this issue Jul 20, 2021 · 13 comments · Fixed by #69096
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Copy link
Member

sql/tests.TestMonotonicInserts/distsql=on failed with artifacts on master @ 9f39af35fd81a7754e6123862b63247e3277b458:

Slow failing tests:
TestMonotonicInserts/distsql=on - 2317.81s

Slow passing tests:
Example_demo - 210.29s
TestMergeQueueSeesNonVoters - 195.92s
TestExternalSortRandomized - 158.21s
TestAggregatorRandom - 143.75s
TestSortRandomized - 112.27s
TestIndexCleanupAfterAlterFromRegionalByRow - 110.84s
TestExternalSortMemoryAccounting - 104.40s
TestUnorderedDistinctRandom - 99.64s
TestLearnerSnapshotFailsRollback - 97.97s
TestBackupRestoreTenant - 94.55s
TestImportAvro - 83.87s
TestRestoreOldVersions - 77.22s
TestInvertedIndex - 70.65s
TestExternalDistinctSpilling - 66.81s
TestStoreRangeMergeTimestampCache - 66.38s
TestChangefeedSchemaChangeAllowBackfill - 58.81s
TestDefaultColumns - 56.85s
TestReadWriteFile - 56.74s
TestScatterRandomizeLeases - 56.47s
TestMultiNodeExportStmt - 56.22s
Reproduce

To reproduce, try:

make stressrace TESTS=TestMonotonicInserts/distsql=on PKG=./pkg/sql/tests TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

  • GOFLAGS=-json

/cc @cockroachdb/sql-queries petermattis

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jul 20, 2021
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jul 20, 2021
@yuzefovich
Copy link
Member

I've stressed it with make stress PKG=./pkg/sql/tests TESTS=TestMonotonicInserts TESTFLAGS='-test.v -show-logs' TESTTIMEOUT=30s STRESSFLAGS='-p 1', and it timed out pretty quickly (on 12th run, at 3 minute mark). And to me it looks like a deadlock, block like this one is repeated a couple of thousand times:

I210720 16:51:15.662527 20124 sql/tests/monotonic_insert_test.go:134  [-] 11005  363.002: attempt 11
I210720 16:51:15.662752 20124 sql/tests/monotonic_insert_test.go:134  [-] 11006  363.002: read max val
I210720 16:51:15.664405 20124 sql/tests/monotonic_insert_test.go:134  [-] 11007  363.002: read max row for val=368
I210720 16:51:15.666129 20124 sql/tests/monotonic_insert_test.go:134  [-] 11008  363.002: insert
I210720 16:51:15.666918 20124 sql/tests/monotonic_insert_test.go:134  [-] 11009  363.002: commit

...

I210720 16:51:35.254051 20124 sql/tests/monotonic_insert_test.go:134  [-] 26750  363.002: attempt 3158
I210720 16:51:35.254224 20124 sql/tests/monotonic_insert_test.go:134  [-] 26751  363.002: read max val
I210720 16:51:35.257515 20124 sql/tests/monotonic_insert_test.go:134  [-] 26752  363.002: read max row for val=368
I210720 16:51:35.260935 20124 sql/tests/monotonic_insert_test.go:134  [-] 26753  363.002: insert
I210720 16:51:35.261624 20124 sql/tests/monotonic_insert_test.go:134  [-] 26754  363.002: commit
I210720 16:51:35.263058 20124 sql/tests/monotonic_insert_test.go:134  [-] 26755  363.002: attempt 3159
I210720 16:51:35.263238 20124 sql/tests/monotonic_insert_test.go:134  [-] 26756  363.002: read max val
I210720 16:51:35.266559 20124 sql/tests/monotonic_insert_test.go:134  [-] 26757  363.002: read max row for val=368
I210720 16:51:35.270079 20124 sql/tests/monotonic_insert_test.go:134  [-] 26758  363.002: insert
panic: test timed out after 30s

Seems like a KV issue.

@RaduBerinde
Copy link
Member

Another instance: https://teamcity.cockroachdb.com/viewLog.html?buildId=3200931&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Testrace

If we keep hitting it, I will have to skip TestMonotonicInserts

@aliher1911
Copy link
Contributor

Chasing on the deadlock I saw that it always get stuck waiting on a lock queue for all 6 parallel queries. What I noticed is that if I bump the timeout to 90s and run it for some time test time varies widely with most iterations finishing in 15s, but some iterations running 45-60s and eventually hitting 90s threshold. I'll try to bisect if this behaviour is something new or we just have more stressed machines that run tests maybe.

@aliher1911
Copy link
Contributor

So I bisected the failure to this PR: #66505 which is bumping GO to 1.16.5

@RaduBerinde
Copy link
Member

Could it be related to #67658 ? If yes, we could try bisecting among changes to x/net/http2, crypto and net.

@aliher1911
Copy link
Contributor

Current findings:

Test runs fine unless one of transactions gets into bad shape where it indefinitely restarts with the same txnid while holding locks. On each iteration it would obtain more and more locks for unsuccessful writes. This could either lead to test time out, or if luck strikes txn_interceptor_heartbeater.go tries heartbeat at the time error is sent to client and it didn't retry yet (which is done in a tight loop) then it would trigger transaction abort that would kill transaction non-retryably and other writers could proceed.

@aliher1911
Copy link
Contributor

Could it be related to #67658 ? If yes, we could try bisecting among changes to x/net/http2, crypto and net.

Doesn't seem likely. Test runs fine in on 2cafdfb and starts failing on e3d35e0 which is a go update.

It might be it just surfaced some race in transaction retry.

@aliher1911
Copy link
Contributor

Update: retracting the bisect conclusion to Go update. After running pre 1.16 for longer I made it fail. So it may be failing more frequently, but it was definitely failing even before that change.

Another observation that I had is that in cases where we get stuck, requests tried to push transaction and push requests end up waiting for push on the same lock table object (see 0xc0006bd680 below)?

goroutine 15919 [select]:
github.com/cockroachdb/cockroach/pkg/kv/kvserver/txnwait.(*Queue).MaybeWaitForPush(0xc0006bd810, 0x9fa0b10, 0xc005b3c150, 0xc005cf9340, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/txnwait/queue.go:513 +0x10dc
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).maybeInterceptReq(0xc0006bd680, 0x9fa0b10, 0xc005b3c150, 0x0, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0xc004cfde00, 0x1, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:275 +0xa6
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard(0xc0006bd680, 0x9fa0b10, 0xc005b3c150, 0xc006ce0bd0, 0x0, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0xc004cfde00, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:187 +0xbb
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq(0xc0006bd680, 0x9fa0b10, 0xc005b3c150, 0x0, 0x0, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0xc004cfde00, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:164 +0x118
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc0056c0000, 0x9fa0b10, 0xc005b3c150, 0xc003d7ec60, 0x90a6b58, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:357 +0x2c3
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc0056c0000, 0x9fa0b10, 0xc005b3c120, 0x26, 0xc003d7ec60, 0x1, 0x1)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:107 +0x57d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:43
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc002a4e000, 0x9fa0b10, 0xc005b3c0f0, 0x169416c49f65e580, 0x0, 0x100000001, 0x1, 0x0, 0x26, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:180 +0x614
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc000ff8f50, 0x9fa0b10, 0xc005b3c0f0, 0x169416c49f65e580, 0x0, 0x100000001, 0x1, 0x0, 0x26, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:191 +0xf9
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x9fa0b10, 0xc005b3c0f0, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:890 +0x1f8
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc000be6780, 0x9fa0b10, 0xc005b3c0f0, 0x8a11930, 0x10, 0xc0038965c0, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0xb2
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000fc4000, 0x9fa0b10, 0xc005b3c0f0, 0xc003d7eb00, 0x0, 0x0, 0x1)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:878 +0x165
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000fc4000, 0x9fa0b10, 0xc005b3c0c0, 0xc003d7eb00, 0xc006867e90, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:945 +0xc5
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:462
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc000cd24c0, 0x9fa0b10, 0xc005b3c0c0, 0x1, 0x9feb028, 0xc0006c3760, 0x169416c49f65e580, 0x0, 0x100000001, 0x1, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:193 +0x157
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc000cd24c0, 0x9fa0b10, 0xc006867e90, 0x169416c49f65e580, 0x0, 0x100000001, 0x1, 0x0, 0x26, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:175 +0x17a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc00078fc00, 0x9fa0b10, 0xc006867e90, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0x0, 0x26, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1860 +0xad5
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc00078fc00, 0x9fa0b10, 0xc006867e90, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1478 +0x367
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc00078fc00, 0x9fa0b10, 0xc006867e90, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1116 +0x37f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc00078fc00, 0x9fa0b10, 0xc006867e90, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:755 +0x5db
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc000be7758, 0x9fa0b10, 0xc006867e90, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:220 +0x9a
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc000be7700, 0x9fa0b10, 0xc006867e90, 0x169416c49f65e580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:817 +0x135
github.com/cockroachdb/cockroach/pkg/kv.(*DB).send(...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:800
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill(0x9fa0b10, 0xc006867e90, 0xc00389a1b0, 0xc004815900, 0x0, 0xc00389a6c8)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:728 +0x135
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Run(0xc000be7700, 0x9fa0b10, 0xc006867e90, 0xc004815900, 0xc00389a410, 0x1)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:751 +0xa5
github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver.(*IntentResolver).MaybePushTransactions(0xc001186700, 0x9fa0b10, 0xc006867e90, 0xc00389a6c8, 0x169416c49f657c08, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go:389 +0x778
github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver.(*IntentResolver).PushTransaction(0xc001186700, 0x9fa0b10, 0xc006867e90, 0xc0063bb748, 0x169416c49f657c08, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go:297 +0x14e
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).pushLockTxn(0xc0006bd6d0, 0x9fa0b10, 0xc006867e90, 0xc0072df9e0, 0x169416c49f214b00, 0x0, 0x0, 0x0, 0xc005e4ea90, 0x1, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:434 +0x13c
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).WaitOn(0xc0006bd6d0, 0x9fa0b10, 0xc006867e90, 0xc0072df9e0, 0x169416c49f214b00, 0x0, 0x0, 0x0, 0xc005e4ea90, 0x1, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock_table_waiter.go:338 +0x305
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard(0xc0006bd680, 0x9fa0b10, 0xc006867e90, 0xc0050573b0, 0xc0072df9e0, 0x169416c49f214b00, 0x0, 0x0, 0x0, 0xc005e4ea90, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:256 +0x37f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq(0xc0006bd680, 0x9fa0b10, 0xc006867e90, 0x0, 0xc0072df9e0, 0x169416c49f214b00, 0x0, 0x0, 0x0, 0xc005e4ea90, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:164 +0x118
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc0056c0000, 0x9fa0b10, 0xc006867e90, 0xc003d7ea50, 0x90a6b40, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:357 +0x2c3
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc0056c0000, 0x9fa0b10, 0xc006867e60, 0x26, 0xc003d7ea50, 0x1, 0x1)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:103 +0x2d1
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:43
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc002a4e000, 0x9fa0b10, 0xc006867e30, 0x169416c49f214b00, 0x0, 0x100000001, 0x1, 0x0, 0x26, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:180 +0x614
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc000ff8f50, 0x9fa0b10, 0xc006867e30, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x26, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:191 +0xf9
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x9fa0b10, 0xc006867e30, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:890 +0x1f8
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc000be6780, 0x9fa0b10, 0xc006867e30, 0x8a11930, 0x10, 0xc00389c380, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:326 +0xb2
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc000fc4000, 0x9fa0b10, 0xc006867e30, 0xc003d7e8f0, 0x0, 0x0, 0x1)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:878 +0x165
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc000fc4000, 0x9fa0b10, 0xc006867e00, 0xc003d7e8f0, 0xc006e31800, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:945 +0xc5
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:462
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc005259cc0, 0x9fa0b10, 0xc006867e00, 0x1, 0x9feb028, 0xc0006c3760, 0x0, 0x0, 0x100000001, 0x1, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:193 +0x157
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc005259cc0, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x26, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:175 +0x17a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc00078fc00, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x26, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1860 +0xad5
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc00078fc00, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1478 +0x367
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc00078fc00, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1116 +0x37f
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc00078fc00, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:755 +0x5db
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0xc004815358, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82 +0x15d
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0xc004815250, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:240 +0x9b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0xc004815250, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:175 +0x20b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0xc004815180, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:255 +0x1b5
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0xc004815160, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x1fc
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0xc004814f00, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:513 +0x2f5
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0xc000be7700, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:817 +0x135
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0xc0059c8fa0, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:982 +0x152
github.com/cockroachdb/cockroach/pkg/sql/row.makeKVBatchFetcher.func1(0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:225 +0x78
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).fetch(0xc0069ca900, 0x9fa0a68, 0xc006e31800, 0x862d180, 0x4012200)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:444 +0x8d5
github.com/cockroachdb/cockroach/pkg/sql/row.(*txnKVFetcher).nextBatch(0xc0069ca900, 0x9fa0a68, 0xc006e31800, 0xc0038a11a8, 0x6337b3a, 0xc006867200, 0x5, 0x5, 0x1, 0x9f09120, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_batch_fetcher.go:573 +0x86b
github.com/cockroachdb/cockroach/pkg/sql/row.(*KVFetcher).NextKV(0xc002bba990, 0x9fa0a68, 0xc006e31800, 0x0, 0xc11892a, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/kv_fetcher.go:136 +0x104
github.com/cockroachdb/cockroach/pkg/sql/colfetcher.(*cFetcher).nextBatch(0xc003dd8960, 0x9fa0a68, 0xc006e31800, 0x0, 0x0, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/cfetcher.go:822 +0xb9
github.com/cockroachdb/cockroach/pkg/sql/colfetcher.(*cFetcher).nextAdapter(...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/cfetcher.go:804
github.com/cockroachdb/cockroach/pkg/sql/colmem.(*Allocator).PerformOperation(0xc006867140, 0x0, 0x0, 0x0, 0xc0038a16a0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colmem/allocator.go:302 +0x59
github.com/cockroachdb/cockroach/pkg/sql/colfetcher.(*cFetcher).NextBatch(...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/cfetcher.go:796
github.com/cockroachdb/cockroach/pkg/sql/colfetcher.(*ColBatchScan).Next(0xc004cfdc80, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colfetcher/colbatch_scan.go:97 +0xab
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*InvariantsChecker).Next(0xc006e31580, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:79 +0x69
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*CancelChecker).Next(0xc006e315c0, 0x160, 0xc006535b80)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/cancel_checker.go:59 +0x42
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.(*simpleProjectOp).Next(0xc003c58d20, 0x0, 0xd7021d8)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase/simple_project.go:125 +0x49
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*InvariantsChecker).Next(0xc006e31600, 0x401227b, 0xc006e31980)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:79 +0x69
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase.(*fnOp).Next(0xc006e31680, 0x378466d8, 0x28)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecbase/fn_op.go:32 +0x33
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*orderedAggregator).Next(0xc002aa9400, 0x0, 0x1)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/ordered_aggregator.go:210 +0x544
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*InvariantsChecker).Next(0xc006e31700, 0xc000cb2900, 0x1)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:79 +0x69
github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils.(*deselectorOp).Next(0xc003c59500, 0x9fa0a68, 0xc006e31800)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecutils/deselector.go:63 +0xcd
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).sendBatches.func1()
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:272 +0xdf
github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0xc0038a1d60, 0x0, 0x0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91 +0x57
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).sendBatches(0xc003d7e840, 0x9fa0b10, 0xc006867650, 0x36c6c0f0, 0xc005e4e8e0, 0xc0037812a0, 0xc005e4e860, 0xc005e4e800, 0xc003d7e840, 0xc003c1a360)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:263 +0xd3
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).runWithStream(0xc003d7e840, 0x9fa0b10, 0xc006867650, 0x36c6c0f0, 0xc005e4e8e0, 0xc0037812a0, 0xc005e4e860)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:385 +0x155
github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc.(*Outbox).Run(0xc003d7e840, 0x9fa0a68, 0xc006867650, 0x9f09fa0, 0xc001101d10, 0xe3daf53700000003, 0xc10c52b089456372, 0xc06730b4b1, 0x1, 0xc0037812a0, ...)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/colrpc/outbox.go:206 +0x42b
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreator).setupRemoteOutputStream.func1(0x9fa0a68, 0xc006e31540, 0xc0037812a0)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:718 +0xd5
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreatorHelper).accumulateAsyncComponent.func1.1(0xc006867440, 0x9fa0a68, 0xc006e31540, 0xc0037812a0, 0xc004e05770)
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:1303 +0x44
created by github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlowCreatorHelper).accumulateAsyncComponent.func1
	/Users/ali/go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:1302 +0x6f

@aliher1911
Copy link
Contributor

Update:

  • test fails once transaction is delayed for more than closed timestamp lag
  • we fix transaction time when reading cluster_logical_timestamp() and if tx falls behind it always refreshes to closed timestamp so it becomes impossible to commit as we are always fixed below closed commit timestamp
  • the reason for 3 sec pause that introduces the failure is not clear. test spins up workers as needed for 5 sec and they could all block on SELECT max(val) AS m FROM mono.mono, after that test is a toast. interestingly that increasing test run time prevents if from failing be preventing this pause somehow

irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 12, 2021
Refs: cockroachdb#67802

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 12, 2021
…ierProcessor}

Also TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces. Refs: cockroachdb#67802,
\cockroachdb#68127, cockroachdb#68795, cockroachdb#68801

Release note: None
craig bot pushed a commit that referenced this issue Aug 12, 2021
68794: roachtest: fix apt-get upgrade in Jepsen tests r=knz a=erikgrinaker

An `apt-get upgrade` command blocked on:

```
Setting up openssh-server (1:8.2p1-4ubuntu0.3) ...
debconf: unable to initialize frontend: Dialog
debconf: (Dialog frontend will not work on a dumb terminal, an emacs shell buffer, or without a controlling terminal.)
debconf: falling back to frontend: Readline
Configuring openssh-server
--------------------------

A new version (/tmp/fileMcydAr) of configuration file /etc/ssh/sshd_config is
available, but the version installed currently has been locally modified.

  1. install the package maintainer's version
  2. keep the local version currently installed
  3. show the differences between the versions
  4. show a side-by-side difference between the versions
  5. show a 3-way difference between available versions
  6. do a 3-way merge between available versions
  7. start a new shell to examine the situation

What do you want to do about modified configuration file sshd_config?
```

This patch sets `DEBIAN_FRONTEND=noninteractive` to avoid this.

Resolves #68783.
Resolves #68782.
Resolves #68781.
Resolves #68780.
Resolves #68779.
Resolves #68778.
Resolves #68777.
Resolves #68776.
Resolves #68775.
Resolves #68774.
Resolves #68773.
Resolves #68772.
Resolves #68771.
Resolves #68770.
Resolves #68769.
Resolves #68768.
Resolves #68767.
Resolves #68766.
Resolves #68765.
Resolves #68764.
Resolves #68763.
Resolves #68762.
Resolves #68761.
Resolves #68760.
Resolves #68759.
Resolves #68758.
Resolves #68757.
Resolves #68756.
Resolves #68755.
Resolves #68754.
Resolves #68753.
Resolves #68752.
Resolves #68744.
Resolves #68726.
Resolves #68724.
Resolves #68722.
Resolves #68720.
Resolves #68719.
Resolves #68718.


Release note: None

68802: *: skip Test{MonotonicInserts,BackupRestoreEmpty,StreamIngestionFront… r=irfansharif a=irfansharif

…ierProcessor}

Also TestRangefeedUpdatesHandledProperlyInTheFaceOfRaces. Refs: #67802,
\#68127, #68795, #68801

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@aliher1911
Copy link
Contributor

As discussed at some point with @nvanbenschoten we thought the problem is that we refresh to closed timestamp, but I think it is just a matter of how test uses cluster_logical_timestamp() to fix transaction to closed ts from previous try. Unless there's a guaranteed way that client could do the same "legally", I don't think it is a bug.

So we'd rather fix the test or at least make it respond to the situation where it gets temporarily stuck.

As for stuck time, what happens is that test can pause on aforementioned select max(val) on all 6 workes for few sec and then gets unblocked by test termination code. I think it is not even trying to make a verification query at that time as we are still waiting for all workers to finish.

@aliher1911
Copy link
Contributor

Ok, so I added the check once cluster_logical_timestamp() diverges from wallclock for more than 2 seconds, drop transaction and start again. Running without getting stuck. It is a band aid, but i think the problem is with unable to restart is inherent to the test, not the way we refresh transactions.
If it won't fail for me after few hours, I'll make a PR with reenabled test and maybe create a separate issue to improve the test and figure out why transaction could recover and what causes it to get stuck for seconds.

@nvanbenschoten
Copy link
Member

Unless there's a guaranteed way that client could do the same "legally", I don't think it is a bug.

There is. Clients have access to the cluster_logical_timestamp() builtin function, so they can reproduce this "legally".

Fixing the test sounds like a good idea, but I don't think we should miss this opportunity to improve upon how transactions are retried after being bumped by the closed timestamp. Right now, transactions that get bumped by the closed timestamp push their write timestamp to immediately above the closed timestamp (closedTimestamp.Next()) before refreshing/retrying. But the closed timestamp is always progressing. In fact, the recent closed timestamp rewrite made the closed timestamp progress even more incrementally. Instead of progressing periodically every 600ms, it progresses on every write on a range. This means that writers who hit the closed timestamp are almost guaranteed to be bumped by the closed timestamp again if they need to perform the same write again during a retry.

This can be visualizes as:

now = 10, txn ts = 10, closed ts = 7
// txn takes 3 seconds
now = 13, txn ts = 10, closed ts = 10
// txn writes, gets bumped by closed ts
now = 13, txn ts = 10.1, closed ts = 10
// txn retries due to ts bump, takes .2 seconds, writes, and gets bumped by closed ts
now = 13.2, txn ts = 10.3, closed ts = 10.2
closed ts = 10.2, txn ts = 10.3
// txn retries due to ts bump, takes .2 seconds, writes, and gets bumped by closed ts
now = 13.4, txn ts = 10.5, closed ts = 10.4
...
// indefinite retries

This is was something we noticed in #51294. The 600ms referenced there was the period of the closed timestamp at the time. We addressed that issue by preventing transaction refreshes from being starved in such situations. But in retrospect, it's not surprising that transactions that use cluster_logical_timestamp() and disable transaction refreshing are still susceptible to this issue. To address this case, we may need to explore one of the other proposed solutions on that issue. Namely:

if the closed timestamp bumped victims up to the present time instead of its current value, requests would be given more time to refresh and complete: 3s instead of 600ms. This would avoid a large class of these issues, in exchange for increasing the chance that refreshes fail under contention and increasing the contention footprint of transactions that get bumped by the closed timestamp.

I imagine that if we were to address that, this test would no longer be flaky. Even if a transaction took 3s during its first epoch due to some slowdown, it wouldn't enter this endless starvation loop where it has effectively no chance of ever succeeding in subsequent retries. Instead, it would be given a full 3s again to complete its next retry.

@aliher1911
Copy link
Contributor

Will this change work #69096? I'm a bit wary of changing transaction timestamp "randomly" and I don't know how much to trust our tests to cover that part.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team T-sql-queries SQL Queries Team
Projects
None yet
5 participants