
roachtest: restoreTPCCInc/nodes=10 failed #84162

Closed

cockroach-teamcity opened this issue Jul 11, 2022 · 25 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery

@cockroach-teamcity commented Jul 11, 2022

roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ 86e007dcb5cbde3501f138c1d768519db3487857:

		Wraps: (2) output in run_071656.109541392_n1_cockroach_sql
		Wraps: (3) ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'" returned
		  | stderr:
		  | ERROR: importing 12872 ranges: Get "https://storage.googleapis.com/cockroach-fixtures/tpcc-incrementals/2021/05/21-020411.00/20210521/091500.00/660334760784756739.sst": stream error: stream ID 7; INTERNAL_ERROR; received from peer
		  | Failed running "sql"
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'"
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,restore.go:453,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestore.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:453
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:896
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/bulk-io


Jira issue: CRDB-17500

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 11, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jul 11, 2022
@msbutler

This is a test infra flake.

See #84132 (comment) for explanation.

@msbutler msbutler removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 11, 2022
@msbutler commented Jul 11, 2022

keeping the master issue open to see if the flake happens again.

@msbutler msbutler reopened this Jul 11, 2022
@cockroach-teamcity

roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ 571bfa3afb3858ae84d8a8fcdbb0a38e058402a5:

		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:439
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_072421.682187968_n1_cockroach_sql
		Wraps: (3) ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'" returned
		  | stderr:
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | UNCLASSIFIED_PROBLEM: context canceled
		  | (1) UNCLASSIFIED_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ``````
		  |   | ./cockroach sql --insecure -e "
		  |   | 				RESTORE FROM '2021/05/21-020411.00' IN
		  |   | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  |   | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'"
		  |   | ``````
		  | Wraps: (3) context canceled
		  | Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	monitor.go:127,restore.go:453,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 5: dead (exit status 134)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestore.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:453
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 5: dead (exit status 134)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0


@cockroach-teamcity

roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ 687171ac6c2cd9992486bb3b8c9d252ac95ca1cd:

		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:439
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_070047.171177298_n1_cockroach_sql
		Wraps: (3) ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'" returned
		  | stderr:
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | UNCLASSIFIED_PROBLEM: context canceled
		  | (1) UNCLASSIFIED_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ``````
		  |   | ./cockroach sql --insecure -e "
		  |   | 				RESTORE FROM '2021/05/21-020411.00' IN
		  |   | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  |   | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'"
		  |   | ``````
		  | Wraps: (3) context canceled
		  | Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	monitor.go:127,restore.go:453,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 9: dead (exit status 134)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestore.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:453
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 9: dead (exit status 134)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0


@cockroach-teamcity

roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ 88d3253301457ac57820e0f4a4fab8f74bf9f38b:

		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:439
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (2) output in run_073326.003957262_n1_cockroach_sql
		Wraps: (3) ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'" returned
		  | stderr:
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | UNCLASSIFIED_PROBLEM: context canceled
		  | (1) UNCLASSIFIED_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ``````
		  |   | ./cockroach sql --insecure -e "
		  |   | 				RESTORE FROM '2021/05/21-020411.00' IN
		  |   | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  |   | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'"
		  |   | ``````
		  | Wraps: (3) context canceled
		  | Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	monitor.go:127,restore.go:453,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 134)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestore.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:453
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 2: dead (exit status 134)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0


@msbutler msbutler self-assigned this Jul 14, 2022
@msbutler

also failing because of #84396

@cockroach-teamcity

roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ e9ee21860458d997a8155734dc608cfcd050ef24:

		Wraps: (2) output in run_072630.363989245_n1_cockroach_sql
		Wraps: (3) ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'" returned
		  | stderr:
		  | ERROR: importing 12872 ranges: pebble/table: invalid table 000000 (checksum mismatch at 0/51367)
		  | Failed running "sql"
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'"
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,restore.go:453,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestore.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:453
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:896
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0


@msbutler

this error message looks scary. adding a release blocker to this roachtest thread while I investigate.

@msbutler msbutler added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 15, 2022
@msbutler

heh, this relates to the new pebbleIterator. I'll try to repro the failure with logging to understand what keys it's tripping up on:

3095 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble/internal/base.CorruptionErrorf
3096 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/internal/base/external/com_github_cockroachdb_pebble/internal/base/error.go:27
3097 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble/sstable.checkChecksum
3098 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/reader.go:2267
3099 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble/sstable.(*Reader).readBlock
3100 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/reader.go:2330
3101 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).readBlockWithStats
3102 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/reader.go:398
3103 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).loadBlock
3104 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/reader.go:380
3105 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).seekGEHelper
3106 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/reader.go:580
3107 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble/sstable.(*singleLevelIterator).SeekGE
3108 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/reader.go:513
3109 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble.(*mergingIter).seekGE
3110 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:863
3111 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble.(*mergingIter).SeekGE
3112 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:918
3113 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble/internal/keyspan.(*InterleavingIter).SeekGE
3114 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/internal/keyspan/external/com_github_cockroachdb_pebble/internal/keyspan/interleaving_iter.go:183
3115 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble.(*Iterator).SeekGEWithLimit
3116 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/iterator.go:1019
3117 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/pebble.(*Iterator).SeekGE
3118 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/iterator.go:944
3119 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).NextKey
3120 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  |    github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:442
3121 I220715 07:28:04.198980 11233 jobs/registry.go:1208 ⋮ [n1] 2957 +  | github.com/cockroachdb/cockroach/pkg/storage.(*ReadAsOfIterator).NextKey

@msbutler commented Jul 15, 2022

@erikgrinaker @jbowens I suspect the pebbleIterator is incorrectly surfacing a checksum mismatch error after a Valid() call that should actually return false, nil. I re-ran this roachtest using this commit, which adds more logging and removes restore's readAsOfIterator, to understand exactly what keys a PebbleIterator.NextKey() scan would surface when we hit this checksum error. The error happens multiple times on several pebbleIterators processing a variety of different spans, and it surfaces when the invalid key equals /Min. Here's what I see when grepping across my verbose logs:
grep -C4 'checksum' *.unredacted/cockroach.log | sed -n -e 's/^.*job//p':

=779370259110723585] 306196  	File 50: ‹/Table/54/1/717/8/-1791/1 , /Table/54/1/723/6/-679/6›
=779370259110723585] 306197  	File 51: ‹/Table/54/1/717/8/-1791/1 , /Table/54/1/723/6/-679/6›
=779370259110723585] 306198  	Last Valid Key: ‹/Table/54/1/718/2/-2900/1/0/1621602065.810516484,0›
=779370259110723585] 306199  	Invalid Key: ‹/Min›
=779370259110723585] 306200  	Valid hit error: pebble/table: invalid table 000000 (checksum mismatch at 0/52123)
=779370259110723585] 322392  	File 50: ‹/Table/54/1/265/3/-8/8 , /Table/54/1/271/2/-1754/4›
=779370259110723585] 322393  	File 51: ‹/Table/54/1/265/3/-8/8 , /Table/54/1/271/2/-1754/4›
=779370259110723585] 322394  	Last Valid Key: ‹/Table/54/1/265/8/-3004/2/0/1621606940.034169154,0›
=779370259110723585] 322395  	Invalid Key: ‹/Min›
=779370259110723585] 322396  	Valid hit error: pebble/table: invalid table 000000 (checksum mismatch at 0/51188)
=779370259110723585] 317542  	File 50: ‹/Table/54/1/358/8/-2310/3 , /Table/54/1/364/6/-1387/4›
=779370259110723585] 317543  	File 51: ‹/Table/54/1/358/8/-2310/3 , /Table/54/1/364/6/-1387/4›
=779370259110723585] 317544  	Last Valid Key: ‹/Table/54/1/359/2/-2143/1/0/1621565258.234638969,0›
=779370259110723585] 317545  	Invalid Key: ‹/Min›
=779370259110723585] 317546  	Valid hit error: pebble/table: invalid table 000000 (checksum mismatch at 0/51874)
=779370259110723585] 183357  	File 50: ‹/Table/54/1/393/8/-1495/6 , /Table/54/1/399/6/-315/2›
=779370259110723585] 183358  	File 51: ‹/Table/54/1/393/8/-1495/6 , /Table/54/1/399/6/-315/2›
=779370259110723585] 183359  	Last Valid Key: ‹/Table/54/1/394/3/-3009/3/0/1621606882.256607799,0›
=779370259110723585] 183360  	Invalid Key: ‹/Min›
=779370259110723585] 183361  	Valid hit error: pebble/table: invalid table 000000 (checksum mismatch at 0/50598)
=779370259110723585] 344044  	File 50: ‹/Table/54/1/8/10/-467/6 , /Table/54/1/11/9/-28/3›
=779370259110723585] 344045  	File 51: ‹/Table/54/1/8/10/-467/6 , /Table/54/1/11/9/-28/3›
=779370259110723585] 344046  	Last Valid Key: ‹/Table/54/1/9/5/-3013/1/0/1621607089.816615673,0›
=779370259110723585] 344047  	Invalid Key: ‹/Min›
=779370259110723585] 344048  	Valid hit error: pebble/table: invalid table 000000 (checksum mismatch at 0/51156)
=779370259110723585] 331021  	File 50: ‹/Table/54/1/300/4/-2499/8 , /Table/54/1/306/2/-1446/9›
=779370259110723585] 331022  	File 51: ‹/Table/54/1/300/4/-2499/8 , /Table/54/1/306/2/-1446/9›
=779370259110723585] 331023  	Last Valid Key: ‹/Table/54/1/300/8/-2442/7/0/1621579789.908751970,0›
=779370259110723585] 331024  	Invalid Key: ‹/Min›
=779370259110723585] 331025  	Valid hit error: pebble/table: invalid table 000000 (checksum mismatch at 0/51681)
=779370259110723585] 358966  	File 50: ‹/Table/54/1/170/8/-1031/3 , /Table/54/1/176/7/-573/3›
=779370259110723585] 358967  	File 51: ‹/Table/54/1/170/8/-1031/3 , /Table/54/1/176/7/-573/3›
=779370259110723585] 358968  	Last Valid Key: ‹/Table/54/1/171/3/-3010/11/0/1621607159.669877244,0›
=779370259110723585] 358969  	Invalid Key: ‹/Min›
=779370259110723585] 358970  	Valid hit error: pebble/table: invalid table 000000 (checksum mismatch at 0/51090)
=779370259110723585] 472236  	File 50: ‹/Table/53/1/800/1/2101 , /Table/53/1/1600/1/2101›
=779370259110723585] 472237  	File 51: ‹/Table/53/1/800/1/2101 , /Table/53/1/1600/1/2101›
=779370259110723585] 472238  	Last Valid Key: ‹/Table/53/1/812/10/2357/0/1621576026.245050377,0›
=779370259110723585] 472239  	Invalid Key: ‹/Min›
=779370259110723585] 472240  	Valid hit error: pebble/table: invalid table 000000 (checksum mismatch at 0/52920)
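
For reference, the debug scan that produced these lines is, schematically, a NextKey() loop that checks Valid() on every step. The sketch below is illustrative only: the interface merely mirrors the shape of storage.SimpleMVCCIterator, and the lastKey bookkeeping and print statements are hypothetical stand-ins for the actual debug commit's logging.

```go
package restoredebug

import "fmt"

// simpleIter mirrors the shape of storage.SimpleMVCCIterator for this sketch.
type simpleIter interface {
	Valid() (bool, error)
	UnsafeKey() string // stand-in for the MVCC key at the current position
	NextKey()
}

// scanAll steps through every key the way the instrumented restore does,
// logging the iterator state if Valid() ever returns an error.
func scanAll(it simpleIter) error {
	var lastKey string
	for {
		ok, err := it.Valid()
		if err != nil {
			// Suspected bug: at exhaustion this surfaces the checksum
			// mismatch instead of returning (false, nil), with the
			// invalid position reported as /Min.
			fmt.Printf("Last Valid Key: %s\n", lastKey)
			fmt.Printf("Invalid Key: %s\n", it.UnsafeKey())
			fmt.Printf("Valid hit error: %v\n", err)
			return err
		}
		if !ok {
			return nil // clean exhaustion
		}
		lastKey = it.UnsafeKey()
		it.NextKey()
	}
}
```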

@cockroach-teamcity

roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ e4cafeb8b1d586d091fb98e3e570650d7eeea294:

		Wraps: (2) output in run_073249.503685885_n1_cockroach_sql
		Wraps: (3) ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'" returned
		  | stderr:
		  | ERROR: importing 12872 ranges: splitting key /Table/54/1/232/8/-1484/8: change replicas of r132 failed: descriptor changed: [expected] r132:/Table/54/1/2{29/10/-1861/14-55/PrefixEnd} [(n1,s1):1VOTER_DEMOTING_LEARNER, (n5,s5):2, (n7,s7):3, (n6,s6):4VOTER_INCOMING, next=5, gen=21, sticky=1658046795.613712493,0] != [actual] r132:/Table/54/1/2{29/10/-1861/14-55/PrefixEnd} [(n6,s6):4, (n9,s9):5, (n7,s7):3VOTER_DEMOTING_LEARNER, (n3,s3):6VOTER_INCOMING, next=7, gen=29, sticky=1658046795.613712493,0]
		  | Failed running "sql"
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach sql --insecure -e "
		  | 				RESTORE FROM '2021/05/21-020411.00' IN
		  | 				'gs://cockroach-fixtures/tpcc-incrementals?AUTH=implicit'
		  | 				AS OF SYSTEM TIME '2021-05-21 14:40:22'"
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,restore.go:453,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestore.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:453
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:896
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

msbutler added a commit to msbutler/cockroach that referenced this issue Jul 19, 2022
This PR refactors all call sites of ExternalSSTReader(), to support using the
new PebbleIterator, which has baked in range key support. Most notably, this
PR replaces the multiIterator used in the restore data processor with the
PebbleSSTIterator.

This patch is part of a larger effort to teach backup and restore about MVCC
bulk operations. Next, the readAsOfIterator will need to learn how to
deal with range keys.

Informs cockroachdb#71155

This PR addresses a bug created in cockroachdb#83984: loop variables in
ExternalSSTReader were captured by reference, leading to roachtest failures
(cockroachdb#84240, cockroachdb#84162).

Fixes: cockroachdb#84240, cockroachdb#84162, cockroachdb#84181

Release note: none
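
The loop-variable pitfall this commit message describes is the classic Go gotcha (changed language-wide in Go 1.22): a range variable is a single reused location, so taking its address aliases every iteration. A minimal standalone illustration, not the actual ExternalSSTReader code:

```go
package main

import "fmt"

func main() {
	files := []string{"a.sst", "b.sst", "c.sst"}

	// Bug: &f is the address of the single loop variable, which is
	// overwritten on every iteration (pre-Go 1.22 semantics).
	var ptrs []*string
	for _, f := range files {
		ptrs = append(ptrs, &f)
	}
	for _, p := range ptrs {
		fmt.Println(*p) // prints "c.sst" three times
	}

	// Fix: re-declare the variable inside the loop body so each
	// iteration captures its own copy.
	ptrs = ptrs[:0]
	for _, f := range files {
		f := f
		ptrs = append(ptrs, &f)
	}
	for _, p := range ptrs {
		fmt.Println(*p) // prints "a.sst", "b.sst", "c.sst"
	}
}
```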
@adityamaru

Get "https://storage.googleapis.com/cockroach-fixtures/tpcc-incrementals/2021/05/21-020411.00/20210521/143000.00/660396657505009669.sst?generation=1621607465986929": stream error: stream ID 20819; INTERNAL_ERROR; received from peer

rhu713 pushed a commit to rhu713/cockroach that referenced this issue Jul 29, 2022
cloud/gcp: add custom retryer for gcs storage, retry on stream INTERNAL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#84162

Release note: None
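
For illustration, a custom retryer along these lines might be wired up as below. This is a sketch, not the actual #85024 patch; it assumes a google-cloud-go storage version (v1.19+) that exposes per-handle Retryer options, WithErrorFunc, and the ShouldRetry default predicate, and it matches the stream error by string for simplicity.

```go
package cloudretry

import (
	"strings"

	"cloud.google.com/go/storage"
)

// WithStreamRetries layers an extra predicate over the client's default
// retryable-error classification for a single object handle.
func WithStreamRetries(o *storage.ObjectHandle) *storage.ObjectHandle {
	return o.Retryer(storage.WithErrorFunc(func(err error) bool {
		if storage.ShouldRetry(err) {
			return true // keep everything the client already retries
		}
		// Additionally retry the transient HTTP/2 stream resets seen
		// in this issue.
		return err != nil &&
			strings.Contains(err.Error(), "stream error") &&
			strings.Contains(err.Error(), "INTERNAL_ERROR")
	}))
}
```

A read would then go through WithStreamRetries(bucket.Object(name)).NewReader(ctx), so the retry policy applies only to that handle.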
msbutler added a commit to msbutler/cockroach that referenced this issue Jul 29, 2022
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Jul 29, 2022
cloud/gcp: add custom retryer for gcs storage, retry on stream INTERNAL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None
craig bot pushed a commit that referenced this issue Jul 29, 2022
…85329

84975: storage: add `MVCCRangeKeyStack` for range keys r=nicktrav,jbowens a=erikgrinaker

**storage: add `MVCCRangeKeyStack` for range keys**

This patch adds `MVCCRangeKeyStack` and `MVCCRangeKeyVersion`, a new
range key representation that will be returned by `SimpleMVCCIterator`.
It is more compact, for efficiency, and comes with a set of convenience
methods to simplify common range key processing.

Resolves #83895.

Release note: None
  
**storage: return `MVCCRangeKeyStack` from `SimpleMVCCIterator`**

This patch changes `SimpleMVCCIterator.RangeKeys()` to return
`MVCCRangeKeyStack` instead of `[]MVCCRangeKeyValue`. Callers have not
been migrated to properly make use of this -- instead, they call
`AsRangeKeyValues()` and construct and use the old data structure.

The MVCC range tombstones tech note is also updated to reflect this.

Release note: None
  
**storage: migrate MVCC code to `MVCCRangeKeyStack`**

Release note: None
  
***: migrate higher-level code to `MVCCRangeKeyStack`**

Release note: None
  
**kvserver/gc: partially migrate to `MVCCRangeKeyStack`**

Some parts require invasive changes to MVCC stats helpers. These will
shortly be consolidated with other MVCC stats logic elsewhere, so the
existing logic is retained for now by using `AsRangeKeyValues()`.

Release note: None
  
**storage: remove `FirstRangeKeyAbove()` and `HasRangeKeyBetween()`**

Release note: None

85017: Revert "sql: Add database ID to sampled query log" r=THardy98 a=THardy98

Reverts: #84195
This reverts commit 307817e.

Removes the DatabaseID field from the
`SampledQuery` telemetry log due to the potential of indefinite blocking
in the case of a lease acquisition failure. Protobuf field not reserved as 
no official build was released with these changes yet.

Release note (sql change): Removes the DatabaseID field from the
`SampledQuery` telemetry log due to the potential of indefinite blocking
in the case of a lease acquisition failure.

85024: cloud/gcp: add custom retryer for gcs storage, retry on stream INTERNAL_ERROR r=rhu713 a=rhu713

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: #85217, #85216, #85204, #84162

Release note: None


85069: optbuilder: handle unnest returning a tuple r=DrewKimball a=DrewKimball

Currently, the return types of SRFs that return multiple columns are
represented as tuples with labels. The tuple labels are used to decide
whether or not to create a single output column for the SRF, or multiple.
The `unnest` function can return a single column if it has a single argument,
and the type of that column can be a tuple with labels. This could cause the
old logic to mistakenly create multiple output columns for `unnest`, which
could lead to panics down the line and incorrect behavior otherwise.

This commit adds a special case for `unnest` in the `optbuilder` to only expand
tuple return types if there is more than one argument (implying more than one
output column). Other SRFs do not have the same problem because they either
always return the same number of columns, cannot return tuples, or both.

Fixes #58438

Release note (bug fix): Fixed a bug existing since release 20.1 that could
cause a panic in rare cases when the unnest function was used with a
tuple return type.

85100: opt: perf improvements for large queries r=DrewKimball a=DrewKimball

**opt: add bench test for slow queries**

This commit adds two slow-planning queries pulled from #64793 to be used
in benchmarking the optimizer. In addition, the `ReorderJoinsLimit` has been
set to the default 8 for benchmarking tests.

**opt: add struct for tracking column equivalence sets**

Previously, the `JoinOrderBuilder` would construct a `FuncDepSet` from
scratch on each call to `addJoins` in order to eliminate redundant join
filters. This led to unnecessary large allocations because `addJoins` is
called an exponential number of times in query size.

This commit adds a struct `EquivSet` that efficiently stores equivalence
relations as `ColSets` in a slice. Rather than being constructed on each
call to `addJoins`, a `Reset` method is called that maintains slice memory.

In the future, `EquivSet` can be used to handle equivalencies within `FuncDepSet`
structs as well. This will avoid a significant number of allocations in cases with
many equivalent columns, as outlined in #83963.
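
The allocation savings come from the standard slice-reuse pattern: Reset truncates to length zero so the backing array survives the exponentially many addJoins calls. A schematic version (the field and method names follow the commit text; ColSet here is a stand-in type, not the optimizer's real bitset):

```go
package opt

// ColSet is a stand-in for the optimizer's bitset of column IDs.
type ColSet = map[int]struct{}

// EquivSet stores equivalence relations as groups of columns.
type EquivSet struct {
	groups []ColSet
}

// Reset truncates without freeing: the slice's backing array is kept and
// reused across the exponentially many addJoins calls instead of being
// reallocated each time.
func (s *EquivSet) Reset() {
	for i := range s.groups {
		s.groups[i] = nil // drop references so stale sets can be collected
	}
	s.groups = s.groups[:0]
}
```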

**opt: avoid usage of FastIntMap in optimizer hot paths**

Previously, `computeHashJoinCost` would use a `FastIntMap` to represent join
equality filters to pass to `computeFiltersCost`. In addition,
`GenerateMergeJoins` used a `FastIntMap` to look up columns among its join
equality columns. This led to unnecessary allocations since column IDs are
often large enough to exceed the small field of `FastIntMap`.

This commit modifies `computeFiltersCost` to take an anonymous function
that is used to decide whether to skip an equality condition, removing the
need for a mapping between columns.

This commit also refactors `GenerateMergeJoins` to simply perform a linear
scan of its equality columns; this avoids the allocation issue, and should be
fast in practice because the number of equalities will not generally be large.

Release note: None

85146: [backupccl] Use Expr for backup's Detached and Revision History options r=benbardin a=benbardin

This will allow us to set them to null, which will be helpful for ALTER commands.

Release note: None

85234: dev: add rewritable paths for norm tests r=mgartner a=mgartner

Tests in `pkg/sql/opt/norm` are similar to tests in `pkg/sql/opt/xform`
and `pkg/sql/opt/memo` in that they rely on fixtures in
`pkg/sql/opt/testutils/opttester/testfixtures`. This commit adds these
fixtures as rewritable paths for norm tests so that
`./dev test pkg/sql/opt/xform --rewrite` does not fail with errors like:

    open pkg/sql/opt/testutils/opttester/testfixtures/tpcc_schema: operation not permitted

Release note: None

85325: sql: fix explain gist output to show number of scan span constraints r=cucaroach a=cucaroach

If there were span constraints we would always print 1, need to actually
append them to get the count right.

Fixes: #85324

Release note: None


85327: sql: fix udf logic test r=chengxiong-ruan a=chengxiong-ruan

Fixes: #85303

Release note: None

85329: colexec: fix recent concat fix r=yuzefovich a=yuzefovich

The recent fix of the Concat operator in the vectorized engine doesn't
handle the array concatenation correctly and this is now fixed.

Fixes: #85295.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: DrewKimball <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 1, 2022
kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot

Informs cockroachdb#84635
Informs cockroachdb#84162

This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot`
status to be a form of a retriable replication change error. It then hooks
`Replica.executeAdminCommandWithDescriptor` up to consult this status in its
retry loop.

This avoids spurious errors when a split gets blocked behind a lateral replica
move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe

Release note: None
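
Schematically, the change teaches the admin-command retry loop to treat one more status as retriable. The sketch below uses hypothetical names and plain errors; it is not the kvserver code:

```go
package kvsketch

import "errors"

// Stand-ins for the error conditions discussed above.
var (
	errTargetMayNeedSnapshot = errors.New("lease transfer rejected: target may need snapshot")
	errDescriptorChanged     = errors.New("descriptor changed")
)

// isRetriableReplicationChangeError now also classifies the snapshot-gated
// lease-transfer rejection as retriable.
func isRetriableReplicationChangeError(err error) bool {
	return errors.Is(err, errDescriptorChanged) ||
		errors.Is(err, errTargetMayNeedSnapshot) // the new case
}

// executeAdminCommand sketches the retry loop in
// Replica.executeAdminCommandWithDescriptor consulting the classification.
func executeAdminCommand(attempt func() error, maxRetries int) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = attempt(); err == nil || !isRetriableReplicationChangeError(err) {
			return err
		}
	}
	return err
}
```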
@adityamaru

The latest failures should all be fixed by #85024. We did see one instance of the failure addressed by #85405, but recently all failures have been GCS stream internal errors.

rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 3, 2022
Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Retry these errors as suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 9, 2022
cloud/gcp: add custom retryer for gcs storage, retry on stream INTERNAL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None

Release justification: add retries for temporary errors that were causing
roachtests to fail.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 30, 2022
kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot

Informs cockroachdb#84635
Informs cockroachdb#84162
Fixes cockroachdb#85449.
Fixes cockroachdb#83174.

This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot`
status to be a form of a retriable replication change error. It then hooks
`Replica.executeAdminCommandWithDescriptor` up to consult this status in its
retry loop.

This avoids spurious errors when a split gets blocked behind a lateral replica
move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe

Release note: None

Release justification: Low risk.
craig bot pushed a commit that referenced this issue Aug 31, 2022
85405: kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot r=shralex a=nvanbenschoten

Informs #84635.
Informs #84162.
Fixes #85449.
Fixes #83174.

This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot`
status to be a form of a retriable replication change error. It then hooks
`Replica.executeAdminCommandWithDescriptor` up to consult this status in its
retry loop.

This avoids spurious errors when a split gets blocked behind a lateral replica
move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe

Release note: None

Release justification: Low risk, resolves flaky test.

87137: storage: default to TableFormatPebblev1 in backups r=itsbilal,dt a=jbowens

If the v22.2 upgrade has not yet been finalized, so we're not permitted
to use the new TableFormatPebblev2 sstable format, default to
TableFormatPebblev1 which is the format used by v22.1 internally.

This change is intended to allow us to remove code for understanding the
old RocksDB table format version sooner (eg, v23.1).

Release justification: low-risk updates to existing functionality
Release note: None

87152: sql: encode either 0 or 1 spans in scan gists r=mgartner a=mgartner

#### dev: add rewritable paths for pkg/sql/opt/exec/explain tests

This commit adds fixtures in
`pkg/sql/opt/testutils/opttester/testfixtures` as rewritable paths for
tests in `pkg/sql/opt/exec/explain`. This prevents
`dev test pkg/sql/opt/exec/explain` from erring when the `--rewrite`
flag is used.

Release justification: This is a test-only change.

Release note: None

#### sql: encode either 0 or 1 spans in scan gists

In plan gists, we no longer encode the exact number of spans for scans
so that two queries with the same plan but a different number of spans
have the same gist.

In addition, plan gists are now decoded with the `OnlyShape` flag which
prints any non-zero number of spans as "1+ spans" and removes attributes
like "missing stats" from scans.

Fixes #87138

Release justification: This is a minor change that makes plan gist
instrumentation more scalable.

Release note (bug fix): The Explain Tab inside the Statement Details
page now groups plans that have the same shape but a different number of
spans in corresponding scans.


87154: roachtest: stop cockroach gracefully when upgrading nodes r=yuzefovich a=yuzefovich

This commit makes it so that we stop cockroach nodes gracefully when
upgrading them. Previous abrupt behavior of stopping the nodes during
the upgrade could lead to test flakes because the nodes were not
being properly drained.

Here is one scenario for how one of the flakes (`pq: version mismatch in
flow request: 65; this node accepts 69 through 69`, which means that
a gateway running an older version asks another node running a newer
version to do DistSQL computation, but the versions are not DistSQL
compatible) can occur:
- we are in a state when node 1 is running a newer version when node
2 is running an older version. Importantly, node 1 was upgraded
"abruptly" meaning that it wasn't properly drained; in particular, it
didn't send DistSQL draining notification through gossip.
- newer node has already been started but its DistSQL server hasn't been
started yet (although it already can accept incoming RPCs - see comments
on `distsql.ServerImpl.Start` for more details). This means that newer
node has **not** sent through gossip an update about its DistSQL version.
- node 2 acts as the gateway for a query that reads some data that node
1 is the leaseholder for. During the physical planning, older node
2 checks whether newer node 1 is "healthy and compatible", and node 1 is
deemed both healthy (because it can accept incoming RPCs) and is
compatible (because node 2 hasn't received updated DistSQL version of
node 1 since it hasn't been sent yet). As a result, node 2 plans a read
on node 1.
- when node 1 receives that request, it errors out with "version
mismatch" error.

This whole problem is solved if we stop nodes gracefully when upgrading
them. In particular, this will mean that node 1 would first dissipate its
draining notification across the cluster, so during the physical
planning it will only be considered IFF node 1 has already communicated
its updated DistSQL version, and then it would be deemed
DistSQL-incompatible.

I verified that this scenario is possible (with manual adjustments of the
version upgrade test and cockroach binary to insert a delay) and that
it's fixed by this commit. I believe it is likely that other flake types
have the same root cause, but I haven't verified it.

Fixes: #87104.

Release justification: test-only change.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>