roachtest: alterpk-tpcc-250 failed #48428

Closed
cockroach-teamcity opened this issue May 5, 2020 · 27 comments · Fixed by #49574
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity
Member

(roachtest).alterpk-tpcc-250 failed on master@425eaa8fb05fc32b2c42827b85338daa52f4177c:

		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_082315.738_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1920699-1588664105-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200505 08:23:17.413428 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 46.06278ms
		  | I200505 08:23:21.015011 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 3.601500282s
		  | I200505 08:23:22.115033 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.099956167s
		  | I200505 08:23:47.275584 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 25.160505004s
		  | I200505 08:24:38.433919 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 51.158198158s
		  | I200505 09:31:38.320034 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 1h6m59.885964635s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1920699-1588664105-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 4170
		3: 4164
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) 1: dead
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Error types: (1) errors.Unclassified (2) *errors.fundamental

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 5, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone May 5, 2020
@yuzefovich yuzefovich assigned yuzefovich and unassigned andreimatei May 5, 2020
@yuzefovich
Member

Hm, I expected to find a fatal OOM, but it's not the case. For some reason node 2 and node 3 lost connections to node 1 which is struggling tremendously.

Node 2:

W200505 08:31:15.648273 444 kv/kvserver/raft_transport.go:637  [n2] while processing outgoing Raft queue to node 1: rpc error: code = Unavailable desc = transport is closing:

Node 3:

W200505 08:31:17.430855 156 kv/kvserver/raft_transport.go:637  [n3] while processing outgoing Raft queue to node 1: rpc error: code = Unavailable desc = transport is closing:

And here are the logs around 08:31:15 from node 1 (note that I copy-pasted only a chunk, so the logs contain far fewer entries than usual):

I200505 08:30:43.824580 331 server/status/runtime.go:498  [n1] runtime stats: 12 GiB RSS, 327 goroutines, 6.3 GiB/37 MiB/6.8 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 993.0 CGO/sec, 97.8/3.8 %(u/s)time, 0.0 %gc (0x), 248 KiB/717 KiB (r/w)net
W200505 08:30:54.851526 306 kv/kvserver/store_raft.go:508  [n1,s1,r99/1:/Table/55/1/1{69/5/-…-72/3/-…}] handle raft ready: 0.6s [applied=2, batches=2, state_assertions=0]
W200505 08:32:10.021797 142 storage/rocksdb.go:2095  batch [0/14/0] commit took 1.021889972s (>= warning threshold 500ms)
W200505 08:33:04.927289 324 kv/kvserver/store_rebalancer.go:223  [n1,s1,store-rebalancer] StorePool missing descriptor for local store
W200505 08:33:19.277498 333 kv/kvserver/node_liveness.go:571  [n1,liveness-hb] slow heartbeat took 3.3s

I'm very confused why there are no runtime stats messages after 08:30:43; those should be logged every 10 seconds. The next one appears only at 08:46:53 and then at 08:54:19. It appears to me that the cockroach process is somehow getting suspended(?).

Or maybe it does hit an OOM condition, but the process is not killed and thrashes instead (not sure whether that is possible). That seems somewhat unlikely because the machine has 14GiB of RAM, and the last logged stats show 12GiB RSS.

I don't see anything suspicious in the logs from node 1 before this badness started happening. I'm thinking that it could have been an infra flake and want to see another occurrence of the problem before digging in further. cc @jordanlewis @asubiotto

@tbg
Member

tbg commented May 6, 2020

W200505 08:37:49.438045 1069704 server/node_engine_health.go:72 [n1] disk stall detected: unable to write to =/mnt/data1/cockroach within 10s

Possible that this is just due to the overload, but it could indicate that the disks had issues.

The runtime stats disappearing for 8 minutes is suspicious, especially since other logs do make it through in the meantime (if the disks were stalled, you'd get a bunch of log messages all at once at some point).

The overload/issue must have blocked something in the loop that periodically prints the runtime stats. We had such problems before, though we addressed them:

f53c14a#diff-09856fe9becddf0199651f451409356aR1957

There might be something else that is blocking. It will be hard to find out now.

@tbg
Member

tbg commented May 6, 2020

The rocksdb log also shows a giant gap (usually there's something every minute):

I200505 08:38:42.669696 17 storage/rocksdb.go:102  EVENT_LOG_v1 {"time_micros": 1588667911524841, "cf_name": "default", "job": 18, "event": "table_file_creation", "file_number": 41, "file_size": 4209801, "table_properties": {"data_size": 4202237, "index_size": 9785, "index_partitions": 0, "top_level_index_size": 0, "index_key_is_user_key": 0, "index_value_is_delta_encoded": 0, "filter_size": 0, "raw_key_size": 11847588, "raw_average_key_size": 19, "raw_value_size": 595983, "raw_average_value_size": 1, "num_data_blocks": 357, "num_entries": 595983, "num_deletions": 0, "num_merge_operands": 0, "num_range_deletions": 0, "format_version": 0, "fixed_key_len": 0, "filter_policy": "rocksdb.BuiltinBloomFilter", "column_family_name": "default", "column_family_id": 0, "comparator": "cockroach_comparator", "merge_operator": "cockroach_merge_operator", "prefix_extractor_name": "cockroach_prefix_extractor", "property_collectors": "[TimeBoundTblPropCollectorFactory,DeleteRangeTblPropCollectorFactory]", "compression": "Snappy", "compression_options": "window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ", "creation_time": 1588667449, "oldest_key_time": 0, "file_creation_time": 1588667452}}
I200505 08:54:18.767824 17 storage/rocksdb.go:102  [db/compaction_job.cc:1334] [default] [JOB 18] Generated table #42: 595980 keys, 4212023 bytes

I agree that this might be an infra flake.

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@683f0d561bf9b73902edb27d681bca5333bdad60:

		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_082525.873_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1922964-1588750145-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200506 08:25:27.414642 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 34.966612ms
		  | I200506 08:25:30.939928 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 3.525239882s
		  | I200506 08:25:31.991322 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.051339882s
		  | I200506 08:26:21.220363 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 49.22899943s
		  | I200506 08:27:12.185241 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 50.964733393s
		  | I200506 08:31:12.051596 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 3m59.866049796s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1922964-1588750145-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4487
		2: 4710
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) 1: dead
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Error types: (1) errors.Unclassified (2) *errors.fundamental

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich
Member

@tbg thanks for looking into the failure.

The new failure is a fatal OOM:

I200506 08:30:25.589418 141 server/status/runtime.go:498  [n1] runtime stats: 12 GiB RSS, 328 goroutines, 7.0 GiB/16 MiB/7.5 GiB GO alloc/idle/total, 3.9 GiB/5.0 GiB CGO alloc/total, 1355.1 CGO/sec, 130.4/6.0 %(u/s)time, 0.0 %gc (0x), 375 KiB/309 KiB (r/w)net
fatal error: runtime: out of memory

I'm confused though why it would occur at this point - the machine has 14GiB RAM, so there should be 1-2GiB available.

We turned the vectorized engine on by default, and the crash has this in the stack trace:

github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*DatumAlloc).NewDInt(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/datum_alloc.go:64
github.com/cockroachdb/cockroach/pkg/sql/colexec.PhysicalTypeColElemToDatum(0x4dae520, 0xc02ce41200, 0x1ce, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/vec_elem_to_datum.go:52 +0x2d3 fp=0xc06500af48 sp=0xc06500adb8 pc=0x2819b73
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).next(0xc00b12e300, 0x10, 0xc00c750928, 0x79be8a, 0x8)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:176 +0x1e5 fp=0xc06500b3b8 sp=0xc06500af48 pc=0x224a685
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).nextAdapter(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:148
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).nextAdapter-fm()
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:147 +0x33 fp=0xc06500b3f8 sp=0xc06500b3b8 pc=0x2994b63
github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror.CatchVectorizedRuntimeError(0xc06500b460, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror/error.go:95 +0x5f fp=0xc06500b448 sp=0xc06500b3f8 pc=0x21665ef
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).Next(0xc00b12e300, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:185 +0x4d fp=0xc06500b480 sp=0xc06500b448 pc=0x224a94d
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*hashJoiner).receiveNext(0xc007f6a000, 0x4d25801, 0xc02ce406c0, 0xc004102840, 0x4, 0x4, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/hashjoiner.go:652 +0x194 fp=0xc06500b518 sp=0xc06500b480 pc=0x1fb8384
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*hashJoiner).consumeStoredSide(0xc007f6a000, 0x2, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/hashjoiner.go:366 +0x4a fp=0xc06500b580 sp=0xc06500b518 pc=0x1fb66ea
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*hashJoiner).Next(0xc007f6a000, 0xc00d9388d0, 0x1, 0x1, 0xc00cf6d650)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/hashjoiner.go:235 +0x187 fp=0xc06500b5e8 sp=0xc06500b580 pc=0x1fb5c27
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Columnarizer).Next(0xc005aa3000, 0x4d25800, 0xc02ce406c0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/columnarizer.go:112 +0x120 fp=0xc06500b730 sp=0xc06500b5e8 pc=0x2195110

I'll investigate whether we're missing some memory accounting in the vectorized engine.

@yuzefovich
Member

Oh yes, we're missing memory accounting in the hash aggregator (similar to what we had in rowexec). All this stuff is not accounted for:
[Image: Screen Shot 2020-05-06 at 1 54 21 PM]

@yuzefovich
Member

yuzefovich commented May 6, 2020

Another interesting question here is why the memory usage with vectorize=on is higher than with vectorize=off.

Here is the peak of the usage I observed when running EXPLAIN ANALYZE of the 3.3.2.6 query with vectorize=off:

I200506 20:45:06.037068 194 server/status/runtime.go:498  [n2] runtime stats: 8.4 GiB RSS, 246 goroutines, 2.8 GiB/1.2 GiB/3.1 GiB GO alloc/idle/total, 4.0 GiB/5.0 GiB CGO alloc/total, 329.1 CGO/sec, 285.3/26.0 %(u/s)time, 0.0 %gc (2x), 15 MiB/19 MiB (r/w)net

and with vectorize=on:

I200506 20:49:45.435430 194 server/status/runtime.go:498  [n2] runtime stats: 12 GiB RSS, 239 goroutines, 6.9 GiB/31 MiB/7.4 GiB GO alloc/idle/total, 4.1 GiB/5.0 GiB CGO alloc/total, 301896.0 CGO/sec, 109.0/15.3 %(u/s)time, 0.0 %gc (0x), 15 MiB/23 MiB (r/w)net

I'm guessing that with the rowexec hash aggregator we now eagerly delete buckets from the hash table, which allows for faster GC, and the "true" peak usage with vectorize=off might not have been captured by runtime stats that are logged only every 10 seconds. We don't do the same with the vectorized hash aggregator, but I'll look into whether it would make sense. Never mind, we do delete the entries from the map, so the question remains open: why is the vectorized case using more memory?

@yuzefovich
Member

Hm, here is the heap profile at the peak with vectorize=on:
[Image: Screen Shot 2020-05-06 at 2 34 15 PM]
I guess the increase in memory usage can be explained by two reasons:

  1. vectorized aggregate builtin structs are bigger than in rowexec
  2. the query contains two stages of aggregation with the second one feeding into ExceptAll hash join. Vectorized engine doesn't support it, so we wrap rowexec.hashJoiner which requires us to plan the following chain: materializer -> rowexec.hashJoiner -> columnarizer -> materializer. These conversions probably account for the rest of the discrepancy.
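
To illustrate the second point, here is a toy sketch of row/column conversions in which every hop allocates a fresh copy of the data. The `row` and `batch` types and the `columnarize` and `materialize` functions are hypothetical stand-ins, not the `colexec` or `rowexec` implementations, and the sketch says nothing about how much memory the real operators retain.

```go
package main

import "fmt"

// row is a row-oriented tuple; batch is its column-oriented equivalent.
// Both are hypothetical stand-ins, not the CockroachDB types.
type row []int64

type batch struct {
	cols [][]int64 // cols[c][r] holds column c of row r
}

// columnarize copies rows into freshly allocated column vectors.
func columnarize(rows []row, numCols int) batch {
	b := batch{cols: make([][]int64, numCols)}
	for c := range b.cols {
		b.cols[c] = make([]int64, len(rows))
		for r, rw := range rows {
			b.cols[c][r] = rw[c]
		}
	}
	return b
}

// materialize copies a batch back into freshly allocated rows.
func materialize(b batch) []row {
	if len(b.cols) == 0 {
		return nil
	}
	rows := make([]row, len(b.cols[0]))
	for r := range rows {
		rows[r] = make(row, len(b.cols))
		for c := range b.cols {
			rows[r][c] = b.cols[c][r]
		}
	}
	return rows
}

func main() {
	input := []row{{1, 10}, {2, 20}, {3, 30}}
	// Four conversions in a row; each one holds its own copy of the data
	// until the garbage collector can reclaim the previous representation.
	b1 := columnarize(input, 2)
	r1 := materialize(b1)
	b2 := columnarize(r1, 2)
	r2 := materialize(b2)
	fmt.Println(r2) // prints [[1 10] [2 20] [3 30]]
}
```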

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@f89c93fa20887f8d269149d8bc573ba8e00c99e2:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_080214.591_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1928471-1588923040-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200508 08:02:16.130684 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 31.862168ms
		  | I200508 08:02:19.242242 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 3.111437272s
		  | I200508 08:02:20.389308 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.147009345s
		  | I200508 08:02:40.110423 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 19.721062348s
		  | I200508 08:03:30.013845 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 49.903370399s
		  | I200508 08:13:52.372941 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 10m22.358757484s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1928471-1588923040-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 4322
		3: 4717
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@1d6db7e386e3ab21bca53f38637014f61153dc90:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_075428.147_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1930996-1589009205-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200509 07:54:29.797646 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 78.38665ms
		  | I200509 07:54:36.709111 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 6.911413179s
		  | I200509 07:54:37.740875 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.031704301s
		  | I200509 07:54:57.785635 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 20.044687042s
		  | I200509 07:55:47.136536 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 49.350783924s
		  | I200509 08:01:29.828948 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 5m42.692325489s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1930996-1589009205-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 4248
		3: 4258
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@c1abb272c94a437f1df9175fc30dc6a6729d3338:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_083902.065_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1932413-1589096848-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200510 08:39:03.613267 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 33.749216ms
		  | I200510 08:39:07.201411 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 3.588093107s
		  | I200510 08:39:08.557711 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.356200547s
		  | I200510 08:39:57.476924 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 48.91899501s
		  | I200510 08:40:47.624098 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 50.147127767s
		  | I200510 08:44:42.758833 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 3m55.134524832s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1932413-1589096848-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4243
		2: 4227
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@e6b47088aba9c9978501b966a0e88aeb273d9990:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_083424.536_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1934023-1589183787-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200511 08:34:26.110677 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 55.566978ms
		  | I200511 08:34:30.844502 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 4.733773355s
		  | I200511 08:34:32.016901 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.172336795s
		  | I200511 08:35:26.587460 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 54.570452322s
		  | I200511 08:36:18.725882 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 52.138254552s
		  | I200511 09:52:33.631583 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 1h16m14.905333641s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1934023-1589183787-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4213
		2: 4237
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

craig bot pushed a commit that referenced this issue May 12, 2020
48511: colexec: some memory accounting and performance optimizations r=yuzefovich a=yuzefovich

**colexec: account for aggregate functions structs**

Previously, we were not accounting for the memory used by aggregate
function structs. This behavior is acceptable in the case of the ordered
aggregator (because we will only create a constant number of such structs), but
it's no bueno for the hash aggregator - we will be creating a separate
struct for each group. This is now fixed by performing the memory
accounting upon creation of these aggregate function structs.

This commit also adds accounting for `hashAggFuncs` structs in the hash
aggregator.
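
A minimal sketch of accounting at creation time, not the actual colexec code. The `memAccount` type (a stand-in for a bytes monitor account), the `sumInt64Agg` struct, and the 64-byte limit are all hypothetical names and values chosen for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"unsafe"
)

// memAccount is a hypothetical stand-in for a memory monitor account.
type memAccount struct {
	used, limit int64
}

var errBudgetExceeded = errors.New("memory budget exceeded")

// grow registers n more bytes, or returns an error if that would exceed the limit.
func (a *memAccount) grow(n int64) error {
	if a.used+n > a.limit {
		return errBudgetExceeded
	}
	a.used += n
	return nil
}

// sumInt64Agg is a hypothetical aggregate function struct (one per group
// in a hash aggregator).
type sumInt64Agg struct {
	sum   int64
	count int64
}

// newSumInt64Agg registers the struct's size with the account before
// allocating it, so per-group structs are never left unaccounted for.
func newSumInt64Agg(acc *memAccount) (*sumInt64Agg, error) {
	if err := acc.grow(int64(unsafe.Sizeof(sumInt64Agg{}))); err != nil {
		return nil, err
	}
	return &sumInt64Agg{}, nil
}

func main() {
	acc := &memAccount{limit: 64}
	groups := 0
	for {
		if _, err := newSumInt64Agg(acc); err != nil {
			fmt.Println("stopped after", groups, "groups:", err)
			break
		}
		groups++
	}
}
```

With one struct per group, the account fills up in proportion to the number of groups, so the query can fail with a budget error instead of the process running out of memory.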

However, this commit does not address the missing memory accounting of
Go's `map` that we use for mapping a hash code to all aggregate
builtins (i.e. "groups") that correspond to that hash code. This is left
as a TODO, but we need to address it for 20.2. A note here is that we
might be replacing the usage of this `map` with our vectorized hash
table, so it'll probably make sense to wait for that.

Addresses: #48428.

Release note: None

**colexec: introduce Flush method into aggregateFunc interface**

I was looking at a CPU profile of a memory-intensive hash aggregation, and
it showed that we spend a noticeable amount of time checking whether the
aggregation function is `done`. This can be avoided if we refactor the
interface slightly, which is what this commit does.

Previously, we used the `Compute` method both to compute the aggregation
on a non-empty batch and to flush the result of aggregation of the last
group on an empty batch. These two purposes are now split into two
functions: the former is done by the same `Compute` function and the
latter is done by the newly-introduced `Flush` function, which must be
called to "flush" the result for the last group.
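
A simplified sketch of the split described above. The interface shape, the `sumAgg` type, and the `groupStart` parameter are assumptions made for illustration and do not mirror the real colexec signatures:

```go
package main

import "fmt"

// aggregateFunc is a simplified, hypothetical version of the interface.
type aggregateFunc interface {
	// Compute processes a non-empty batch; groupStart[i] marks the first
	// value of a new group.
	Compute(vals []int64, groupStart []bool)
	// Flush emits the result of the last group once the input is exhausted.
	Flush()
}

// sumAgg is a hypothetical ordered SUM aggregate.
type sumAgg struct {
	cur     int64
	started bool
	out     []int64
}

var _ aggregateFunc = (*sumAgg)(nil)

func (a *sumAgg) Compute(vals []int64, groupStart []bool) {
	for i, v := range vals {
		if groupStart[i] && a.started {
			// A new group begins, so the previous group's result is final.
			a.out = append(a.out, a.cur)
			a.cur = 0
		}
		a.started = true
		a.cur += v
	}
}

func (a *sumAgg) Flush() {
	if a.started {
		a.out = append(a.out, a.cur)
		a.cur, a.started = 0, false
	}
}

func main() {
	a := &sumAgg{}
	a.Compute([]int64{1, 2, 3}, []bool{true, false, true})
	a.Compute([]int64{4}, []bool{false})
	a.Flush() // the caller signals end of input explicitly
	fmt.Println(a.out)
}
```

Because the caller signals the end of input explicitly, `Compute` in this sketch never has to test for an empty batch or a `done` flag on the hot path.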

Release note: None

**colexec: generate different structs for two COUNT variants**

In one profile I saw noticeable time spent on an `if` condition that
determines whether we have a `count_rows` aggregate or not. This commit
refactors the template to generate two different structs, which allows us
to remove that check.
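
As a rough illustration of the idea (hand-written here rather than template-generated, with hypothetical type names and a simplified `nulls` slice in place of the real vector types), two separate structs let each variant run a loop with no per-call variant check:

```go
package main

import "fmt"

// countRowsAgg counts every row, including NULLs (COUNT(*) semantics).
type countRowsAgg struct{ n int64 }

func (c *countRowsAgg) Compute(nulls []bool) {
	c.n += int64(len(nulls))
}

// countAgg counts only non-NULL values (COUNT(col) semantics).
type countAgg struct{ n int64 }

func (c *countAgg) Compute(nulls []bool) {
	for _, isNull := range nulls {
		if !isNull {
			c.n++
		}
	}
}

func main() {
	nulls := []bool{false, true, false}
	rows, vals := &countRowsAgg{}, &countAgg{}
	rows.Compute(nulls)
	vals.Compute(nulls)
	fmt.Println(rows.n, vals.n)
}
```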

Release note: None

**typeconv: change map usage to family switch**

In a CPU profile I noticed that some time was spent in the accesses to
the map when instantiating aggregate funcs. This can be avoided by
changing the usage of the map to a switch on a type family, and the
function should be inlined.
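
A small sketch of the two lookup styles. The `family` and `physType` enums below are hypothetical and much smaller than the real type system; the point is only that a switch on a small integer enum avoids hashing and is a simple enough function for the compiler to consider inlining:

```go
package main

import "fmt"

// family and physType are hypothetical enums used only for this sketch.
type family int

const (
	boolFamily family = iota
	intFamily
	floatFamily
	bytesFamily
)

type physType int

const (
	unhandled physType = iota
	physBool
	physInt64
	physFloat64
	physBytes
)

// Map-based lookup: every call pays for hashing and a map probe.
var familyToPhys = map[family]physType{
	boolFamily:  physBool,
	intFamily:   physInt64,
	floatFamily: physFloat64,
	bytesFamily: physBytes,
}

func fromFamilyMap(f family) physType { return familyToPhys[f] }

// Switch-based lookup: same mapping, no map access.
func fromFamilySwitch(f family) physType {
	switch f {
	case boolFamily:
		return physBool
	case intFamily:
		return physInt64
	case floatFamily:
		return physFloat64
	case bytesFamily:
		return physBytes
	default:
		return unhandled
	}
}

func main() {
	fmt.Println(fromFamilyMap(intFamily) == fromFamilySwitch(intFamily))
}
```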

Release note: None

48548: colexec: clean up materializer a bit r=yuzefovich a=yuzefovich

While working on cfetcher + materializer instead of the table reader,
I noticed something weird - that we get `OutputTypes` in the
materializer, and use those when converting physical vectors to datums.
I thought this was a bug because `OutputTypes` returns the type schema
after the post-processing stage, and the internal type schema of a
processor could be different. It turns out we simply always pass an empty
`PostProcessSpec` into the materializer because we expect the input
operator to handle the post-processing itself. This commit cleans it up
by removing that unused argument and the call to `ProcessRowHelper`.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@752c32580ff452bb0039e5af0c00044b91bdcb12:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_085406.862_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1941974-1589442983-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200514 08:54:08.477070 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 47.024622ms
		  | I200514 08:54:12.579047 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 4.101930635s
		  | I200514 08:54:13.676269 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.097174823s
		  | I200514 08:55:06.654942 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 52.978458705s
		  | I200514 08:55:55.560875 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 48.905860715s
		  | I200514 09:02:41.606455 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 6m46.045296235s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1941974-1589442983-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 4169
		3: 4243
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@f66ff29e25bc546ee239867e316485528e11e3dc:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_083418.739_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1944755-1589530238-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200515 08:34:20.257908 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 32.515749ms
		  | I200515 08:34:23.775396 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 3.517430919s
		  | I200515 08:34:24.832735 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.057276048s
		  | I200515 08:35:15.007720 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 50.174833947s
		  | I200515 08:36:03.960309 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 48.952492114s
		  | I200515 08:39:11.392906 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 3m7.432427224s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1944755-1589530238-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4166
		2: 4147
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@5156843cc23adecb6a70fabf19f51f46de1241ec:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_081135.287_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1947362-1589615123-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200516 08:11:36.933362 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 57.640457ms
		  | I200516 08:11:40.253187 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 3.319769993s
		  | I200516 08:11:41.347222 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.093991682s
		  | I200516 08:12:38.913112 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 57.565745137s
		  | I200516 08:13:14.648499 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 35.735328662s
		  | I200516 08:20:58.544068 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 7m43.895343865s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1947362-1589615123-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 4754
		3: 4551
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@2cbf620cd229a55622fe17fb15d20ada1dbcccd3:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_080228.954_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1948994-1589701075-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200517 08:02:30.490584 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 33.466025ms
		  | I200517 08:02:36.985391 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 6.494665144s
		  | I200517 08:02:38.082509 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.097064151s
		  | I200517 08:03:07.523161 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 29.440487908s
		  | I200517 08:03:41.620476 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 34.09721348s
		  | I200517 08:32:57.861929 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 29m16.241149452s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1948994-1589701075-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 4139
		3: 4245
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@7ab7fb86f0634df1f5a0b04460e1f3a1d6bead1f:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_075423.692_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1950201-1589787198-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200518 07:54:25.251554 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 47.06698ms
		  | I200518 07:54:28.145723 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 2.894124108s
		  | I200518 07:54:29.148314 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.00251345s
		  | I200518 07:55:35.780399 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 1m6.632033094s
		  | I200518 07:56:07.431401 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 31.650945559s
		  | I200518 08:00:01.444361 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 3m54.012794292s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1950201-1589787198-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4134
		2: 4165
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString

Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

craig bot pushed a commit that referenced this issue May 18, 2020
48831: opt: implement ON DELETE SET NULL/DEFAULT r=RaduBerinde a=RaduBerinde

Implementing the ON DELETE SET NULL and SET DEFAULT actions. These actions
trigger an update in the child table whenever a row is deleted from the parent.

Release note: None

49102: colexec: aggregator accounting fix and performance optimizations r=yuzefovich a=yuzefovich

**colexec: fix wrong usage of pointer when getting the size of a struct**

I mistakenly used `unsafe.Sizeof(&aggFuncStruct{})` when calculating the
size of the struct, and as a result, the sizes were all 8 bytes (the size
of a pointer), but they should have been larger. This fixes the
discrepancy I saw between the heap profiles and the usage reported to the
bytes monitor.
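
The mistake is easy to reproduce in isolation. In the sketch below, the fields of `aggFuncStruct` are made up for illustration; only the difference between the two `unsafe.Sizeof` calls matters:

```go
package main

import (
	"fmt"
	"unsafe"
)

// aggFuncStruct has hypothetical fields; the real structs differ.
type aggFuncStruct struct {
	sum    int64
	count  int64
	hasVal bool
}

// sizes returns the size of a pointer to the struct and of the struct itself.
func sizes() (ptrSize, structSize uintptr) {
	// Size of the pointer: independent of the struct's fields.
	ptrSize = unsafe.Sizeof(&aggFuncStruct{})
	// Size of the struct value: what memory accounting should use.
	structSize = unsafe.Sizeof(aggFuncStruct{})
	return ptrSize, structSize
}

func main() {
	p, s := sizes()
	fmt.Printf("pointer: %d bytes, struct: %d bytes\n", p, s)
}
```

On a 64-bit platform the pointer size is 8 bytes no matter how large the struct is, which matches the constant 8-byte sizes described above.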

Addresses: #48428.

Release note: None

**colexec: put aggregate funcs in a separate file**

This is just code movement and some file renaming.

Release note: None

**colexec: introduce aggregate function allocator**

This commit introduces an aggregate functions allocator object that
supports an arbitrary function schema, as well as a single aggregate
function alloc that pools allocations of the same statically-typed
aggregate function. All of these objects support an arbitrary allocation
size, so the ordered aggregator sets it to 1 whereas the hash aggregator
sets it to 64.
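
A minimal sketch of a pooling alloc for one statically-typed aggregate function, assuming hypothetical names (`sumAgg`, `sumAggAlloc`, `newAggFunc`); the `allocs` counter exists only to make the batching visible:

```go
package main

import "fmt"

// sumAgg is a hypothetical statically-typed aggregate function.
type sumAgg struct{ sum int64 }

// sumAggAlloc hands out *sumAgg values carved from a backing slice that is
// allocated allocSize structs at a time.
type sumAggAlloc struct {
	allocSize int
	buf       []sumAgg
	allocs    int // number of backing-slice allocations, for illustration
}

func (a *sumAggAlloc) newAggFunc() *sumAgg {
	if len(a.buf) == 0 {
		a.buf = make([]sumAgg, a.allocSize)
		a.allocs++
	}
	f := &a.buf[0]
	a.buf = a.buf[1:]
	return f
}

func main() {
	hashAlloc := &sumAggAlloc{allocSize: 64}  // many groups: batch allocations
	orderedAlloc := &sumAggAlloc{allocSize: 1} // constant number of structs
	for i := 0; i < 130; i++ {
		hashAlloc.newAggFunc()
	}
	orderedAlloc.newAggFunc()
	fmt.Println(hashAlloc.allocs, orderedAlloc.allocs)
}
```

Batching trades a little over-allocation for far fewer heap allocations when there are many groups, which is why a larger allocation size suits the hash aggregator.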

Release note: None

**colexec: pool allocations of slices of pointers in hash aggregator**

This gives modest (5% or so) improvements on microbenchmarks when group
sizes are small.

Release note: None

Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@yuzefovich
Member

I looked into the logs of the three latest failures, and in all of them the allocation request that tipped the node over came from the storage engine (AFAICT).

Three days ago:

I200516 08:20:12.132800 185 server/status/runtime.go:499  [n1] runtime stats: 13 GiB RSS, 339 goroutines, 7.9 GiB/61 MiB/8.5 GiB GO alloc/idle/total, 4.0 GiB/5.2 GiB CGO alloc/total, 145.9 CGO/sec, 92.9/10.7 %(u/s)time, 0.0 %gc (0x), 387 KiB/160 KiB (r/w)net
fatal error: runtime: out of memory

runtime stack:
...
goroutine 1006715 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc00dec94e8 sp=0xc00dec94e0 pc=0x7cb9d0
runtime.(*mheap).alloc(0x78366a0, 0x1, 0x1003f, 0xc00dec9608)
	/usr/local/go/src/runtime/mheap.go:1092 +0x8a fp=0xc00dec9538 sp=0xc00dec94e8 pc=0x7940ba
runtime.(*mcentral).grow(0x7837a18, 0x0)
	/usr/local/go/src/runtime/mcentral.go:255 +0x7b fp=0xc00dec9578 sp=0xc00dec9538 pc=0x78601b
runtime.(*mcentral).cacheSpan(0x7837a18, 0xc20bff2335)
	/usr/local/go/src/runtime/mcentral.go:106 +0x2fe fp=0xc00dec95d8 sp=0xc00dec9578 pc=0x785b3e
runtime.(*mcache).refill(0x7f8e31c08648, 0x3f)
	/usr/local/go/src/runtime/mcache.go:138 +0x85 fp=0xc00dec95f8 sp=0xc00dec95d8 pc=0x7855e5
runtime.(*mcache).nextFree(0x7f8e31c08648, 0x3f, 0x2, 0xc20bff233a, 0x1)
	/usr/local/go/src/runtime/malloc.go:854 +0x87 fp=0xc00dec9630 sp=0xc00dec95f8 pc=0x779ea7
runtime.mallocgc(0x400, 0x0, 0x193bb00, 0xc020bf8bf1)
	/usr/local/go/src/runtime/malloc.go:1022 +0x793 fp=0xc00dec96d0 sp=0xc00dec9630 pc=0x77a7e3
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/rawalloc.New(...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/rawalloc/rawalloc.go:22
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Batch).init(0xc20bffc2a0, 0x2c)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:736 +0x51 fp=0xc00dec9708 sp=0xc00dec96d0 pc=0x18774c1
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Batch).prepareDeferredKeyValueRecord(0xc20bffc2a0, 0xb, 0x1, 0xc020bf8b01)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:371 +0x233 fp=0xc00dec9730 sp=0xc00dec9708 pc=0x1875de3
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Batch).SetDeferred(...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:475
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Batch).Set(0xc20bffc2a0, 0xc020bf8bf0, 0xb, 0x10, 0xc01f197a58, 0x1, 0x8, 0x0, 0xc0039f8e10, 0xc0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:456 +0x48 fp=0xc00dec9760 sp=0xc00dec9730 pc=0x1876078
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMapBatchWriter).Put(0xc18672d620, 0xc20bff2321, 0xa, 0xf, 0xc01f197a58, 0x1, 0x8, 0x4183c00, 0x1eb3bf2)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/disk_map.go:413 +0xac fp=0xc00dec97c0 sp=0xc00dec9760 pc=0x18d327c
github.com/cockroachdb/cockroach/pkg/sql/rowcontainer.(*hashDiskRowBucketIterator).Mark(0xc0216c0960, 0x4da4da0, 0xc00af650c0, 0x1, 0xc0039f8e10, 0x4)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowcontainer/hash_row_container.go:658 +0x21f fp=0xc00dec9860 sp=0xc00dec97c0 pc=0x1eb3e8f
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*hashJoiner).probeRow(0xc0074b2000, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/hashjoiner.go:556 +0x615 fp=0xc00dec9968 sp=0xc00dec9860 pc=0x1fe39e5
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*hashJoiner).Next(0xc0074b2000, 0xc00d09ba18, 0xeb23f1, 0x770444, 0x7875750)

Two days ago:

I200517 08:10:36.152589 153 server/status/runtime.go:499  [n1] runtime stats: 13 GiB RSS, 338 goroutines, 8.1 GiB/16 MiB/8.6 GiB GO alloc/idle/total, 4.1 GiB/5.1 GiB CGO alloc/total, 172.7 CGO/sec, 86.8/9.9 %(u/s)time, 0.0 %gc (0x), 282 KiB/587 KiB (r/w)net
fatal error: runtime: out of memory

runtime stack:
...
goroutine 1039567 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0dbf82b50 sp=0xc0dbf82b48 pc=0x7cb9d0
runtime.(*mheap).alloc(0x783b700, 0x1, 0xc004010032, 0xc0dbf82bf0)
	/usr/local/go/src/runtime/mheap.go:1092 +0x8a fp=0xc0dbf82ba0 sp=0xc0dbf82b50 pc=0x7940ba
runtime.(*mcentral).grow(0x783c738, 0x0)
	/usr/local/go/src/runtime/mcentral.go:255 +0x7b fp=0xc0dbf82be0 sp=0xc0dbf82ba0 pc=0x78601b
runtime.(*mcentral).cacheSpan(0x783c738, 0x7a0)
	/usr/local/go/src/runtime/mcentral.go:106 +0x2fe fp=0xc0dbf82c40 sp=0xc0dbf82be0 pc=0x785b3e
runtime.(*mcache).refill(0x7f2f3e060dc8, 0x32)
	/usr/local/go/src/runtime/mcache.go:138 +0x85 fp=0xc0dbf82c60 sp=0xc0dbf82c40 pc=0x7855e5
runtime.(*mcache).nextFree(0x7f2f3e060dc8, 0x4634e32, 0xc0dbf82ce8, 0xc0dbf82cd8, 0x770894)
	/usr/local/go/src/runtime/malloc.go:854 +0x87 fp=0xc0dbf82c98 sp=0xc0dbf82c60 pc=0x779ea7
runtime.mallocgc(0x200, 0x3f54880, 0xc0dbf82d01, 0x7ed48b)
	/usr/local/go/src/runtime/malloc.go:1022 +0x793 fp=0xc0dbf82d38 sp=0xc0dbf82c98 pc=0x77a7e3
runtime.makeslice(0x3f54880, 0x0, 0x40, 0xf)
	/usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc0dbf82d68 sp=0xc0dbf82d38 pc=0x7b40ac
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*tableCacheShard).init.func1(0xc000bb2120, 0xb)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:145 +0x51 fp=0xc0dbf82db0 sp=0xc0dbf82d68 pc=0x18c4d91
sync.(*Pool).Get(0xc000bb2120, 0xc000bb20f0, 0x74c)
	/usr/local/go/src/sync/pool.go:148 +0xa6 fp=0xc0dbf82df8 sp=0xc0dbf82db0 pc=0x7ed286
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc000149260, 0xc00b3a6280, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:266 +0xe3 fp=0xc0dbf82e98 sp=0xc0dbf82df8 pc=0x18b6563
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*tableCacheShard).newIters(0xc000149260, 0xc00b3a6280, 0xc20fe84938, 0x0, 0x11, 0x9, 0x11, 0xc0dbf82fc0, 0x18c114b, 0xc01b2dcab0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:163 +0x4d fp=0xc0dbf82f38 sp=0xc0dbf82e98 pc=0x18b5ead
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*tableCache).newIters(0xc000bf0098, 0xc00b3a6280, 0xc20fe84938, 0x0, 0x5, 0xc0dbf82fe8, 0x18a4f51, 0x0, 0x18ea6434ffa, 0x5)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:58 +0x78 fp=0xc0dbf82f98 sp=0xc0dbf82f38 pc=0x18b5708
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*tableCache).newIters-fm(0xc00b3a6280, 0xc20fe84938, 0x0, 0xc0dbf83020, 0x1, 0x18c10c0, 0xc20fe848f0, 0xc20e48e470, 0xa)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/table_cache.go:55 +0x4c fp=0xc0dbf82ff8 sp=0xc0dbf82f98 pc=0x18cbf2c
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*levelIter).loadFile(0xc20fe848f0, 0x1, 0x1, 0xa)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/level_iter.go:304 +0x151 fp=0xc0dbf83058 sp=0xc0dbf82ff8 pc=0x18a3b21
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE(0xc20fe848f0, 0xc20fc5e870, 0xa, 0xa, 0xc20e48e470, 0xa, 0xa, 0xc00b9471b0, 0x7f2cd5dd9382, 0x44, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/level_iter.go:365 +0x9a fp=0xc0dbf830c0 sp=0xc0dbf83058 pc=0x18a3f5a
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*mergingIter).seekGE(0xc20fe84150, 0xc20e48e470, 0xa, 0xa, 0x1)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/merging_iter.go:743 +0x10d fp=0xc0dbf831f8 sp=0xc0dbf830c0 pc=0x18a97fd
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGE(0xc20fe84150, 0xc20fc5e870, 0xa, 0xa, 0xc20e48e470, 0xa, 0xa, 0x0, 0x0, 0xa, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/merging_iter.go:810 +0x9a fp=0xc0dbf83230 sp=0xc0dbf831f8 pc=0x18a9eca
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Iterator).SeekPrefixGE(0xc20fe84000, 0xc20e48e470, 0xa, 0xa, 0x9)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/iterator.go:365 +0x1c0 fp=0xc0dbf832c0 sp=0xc0dbf83230 pc=0x189c580
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekGE(0xc0108e9a20, 0xc064ff6300, 0x9, 0x20, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:183 +0xb3 fp=0xc0dbf83328 sp=0xc0dbf832c0 pc=0x18fae63
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).get(0xc008bba000)

One day ago:

I200518 07:59:14.848877 32 server/status/runtime.go:499  [n1] runtime stats: 12 GiB RSS, 339 goroutines, 6.6 GiB/17 MiB/7.1 GiB GO alloc/idle/total, 4.1 GiB/5.2 GiB CGO alloc/total, 319.3 CGO/sec, 111.0/8.4 %(u/s)time, 0.0 %gc (0x), 361 KiB/296 KiB (r/w)net
fatal error: runtime: out of memory

runtime stack:
...
goroutine 1195549 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc02c19d4e8 sp=0xc02c19d4e0 pc=0x7cb9d0
runtime.(*mheap).alloc(0x784ab40, 0x1, 0x1003f, 0xc02c19d608)
	/usr/local/go/src/runtime/mheap.go:1092 +0x8a fp=0xc02c19d538 sp=0xc02c19d4e8 pc=0x7940ba
runtime.(*mcentral).grow(0x784beb8, 0x0)
	/usr/local/go/src/runtime/mcentral.go:255 +0x7b fp=0xc02c19d578 sp=0xc02c19d538 pc=0x78601b
runtime.(*mcentral).cacheSpan(0x784beb8, 0xc20bfe6755)
	/usr/local/go/src/runtime/mcentral.go:106 +0x2fe fp=0xc02c19d5d8 sp=0xc02c19d578 pc=0x785b3e
runtime.(*mcache).refill(0x7f9645b778b8, 0x3f)
	/usr/local/go/src/runtime/mcache.go:138 +0x85 fp=0xc02c19d5f8 sp=0xc02c19d5d8 pc=0x7855e5
runtime.(*mcache).nextFree(0x7f9645b778b8, 0x3f, 0x2, 0xc20bfe675a, 0x1)
	/usr/local/go/src/runtime/malloc.go:854 +0x87 fp=0xc02c19d630 sp=0xc02c19d5f8 pc=0x779ea7
runtime.mallocgc(0x400, 0x0, 0x193fb00, 0xc034be4741)
	/usr/local/go/src/runtime/malloc.go:1022 +0x793 fp=0xc02c19d6d0 sp=0xc02c19d630 pc=0x77a7e3
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/rawalloc.New(...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/rawalloc/rawalloc.go:22
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Batch).init(0xc20bffe620, 0x2c)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:736 +0x51 fp=0xc02c19d708 sp=0xc02c19d6d0 pc=0x187b461
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Batch).prepareDeferredKeyValueRecord(0xc20bffe620, 0xb, 0x1, 0xc034be4701)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:371 +0x233 fp=0xc02c19d730 sp=0xc02c19d708 pc=0x1879d83
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Batch).SetDeferred(...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:475
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble.(*Batch).Set(0xc20bffe620, 0xc034be4740, 0xb, 0x10, 0xc0057aa278, 0x1, 0x8, 0x0, 0xc0089043c0, 0xc0)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/batch.go:456 +0x48 fp=0xc02c19d760 sp=0xc02c19d730 pc=0x187a018
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMapBatchWriter).Put(0xc0eaa80080, 0xc20bfe6741, 0xa, 0xf, 0xc0057aa278, 0x1, 0x8, 0x418cfe0, 0x1eb7362)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/disk_map.go:413 +0xac fp=0xc02c19d7c0 sp=0xc02c19d760 pc=0x18d721c
github.com/cockroachdb/cockroach/pkg/sql/rowcontainer.(*hashDiskRowBucketIterator).Mark(0xc012774060, 0x4db2ec0, 0xc03b26be80, 0x1, 0xc0089043c0, 0x4)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowcontainer/hash_row_container.go:658 +0x21f fp=0xc02c19d860 sp=0xc02c19d7c0 pc=0x1eb75ff
github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*hashJoiner).probeRow(0xc015160000, 0x0, 0x0, 0x0, 0x0, 0x0)

Note that the machine has 14GB of RAM, so both Go and CGo usage seem to be exceeding their limits. I'm assuming that the former is due to some missing memory accounting in Pebble, as well as missing memory accounting for the Go map used by the hash aggregator, which is a known issue.

What makes these crashes interesting is that in the past I've seen that Go allocations driven by the SQL engine do not increase this much all at once: one day ago there was still about 2GB of RAM available. This makes me think that Pebble's allocations should be investigated. I'll ask Storage friends.

@yuzefovich
Member

Four days ago the allocation request that killed the node came from the vectorized engine:

I200515 08:38:30.314739 169 server/status/runtime.go:499  [n1] runtime stats: 12 GiB RSS, 346 goroutines, 6.9 GiB/49 MiB/7.4 GiB GO alloc/idle/total, 3.9 GiB/4.9 GiB CGO alloc/total, 585.8 CGO/sec, 138.4/23.3 %(u/s)time, 0.0 %gc (0x), 672 KiB/73 MiB (r/w)net
fatal error: runtime: out of memory

runtime stack:
...
goroutine 1157734 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc0049316a8 sp=0xc0049316a0 pc=0x7cb2d0
runtime.mallocgc(0x12100000, 0x4119e00, 0x4dd7301, 0xc004931800)
	/usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc004931748 sp=0xc0049316a8 pc=0x77a1e5
runtime.newarray(0x4119e00, 0x110000, 0x3f58780)
	/usr/local/go/src/runtime/malloc.go:1173 +0x63 fp=0xc004931778 sp=0xc004931748 pc=0x77a623
runtime.makeBucketArray(0x3eeb600, 0xc004931714, 0x0, 0xc0009a8380, 0x7fa100198648)
	/usr/local/go/src/runtime/map.go:362 +0x183 fp=0xc0049317b0 sp=0xc004931778 pc=0x77b513
runtime.hashGrow(0x3eeb600, 0xc00f2b8150)
	/usr/local/go/src/runtime/map.go:1033 +0x89 fp=0xc004931800 sp=0xc0049317b0 pc=0x77d219
runtime.mapassign_fast64(0x3eeb600, 0xc00f2b8150, 0x6a2e270cfc6a418e, 0xc0799dc868)
	/usr/local/go/src/runtime/map_fast64.go:156 +0x12c fp=0xc004931840 sp=0xc004931800 pc=0x77f3ec
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*hashAggregator).onlineAgg(0xc00c86e240, 0x4dbaf40, 0xc01dfc39c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/hash_aggregator.go:361 +0xad7 fp=0xc004931a50 sp=0xc004931840 pc=0x21cf4e7
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*hashAggregator).Next(0xc00c86e240, 0x4d47ea0, 0xc01dfc2c00, 0x100000400, 0x400)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/hash_aggregator.go:240 +0x120 fp=0xc004931b88 sp=0xc004931a50 pc=0x21ce090
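
The raw argument words in the trace above give a rough size for the allocation that failed. This is a minimal sketch of that arithmetic, not something taken from the investigation itself. It assumes that the first argument word of runtime.mallocgc is the requested size in bytes and that the second argument word of runtime.newarray is the element count, which I believe holds for this Go version but have not verified against the exact toolchain. `bucketBytes` is a hypothetical helper.

```go
package main

import "fmt"

// bucketBytes divides the size of one array allocation by its element count.
func bucketBytes(allocBytes, numElems uint64) uint64 {
	return allocBytes / numElems
}

func main() {
	const (
		allocBytes = 0x12100000 // first argument word of runtime.mallocgc in the trace (assumed: size in bytes)
		numElems   = 0x110000   // second argument word of runtime.newarray in the trace (assumed: element count)
	)
	fmt.Printf("allocation: %d bytes (%d MiB)\n", allocBytes, allocBytes>>20)
	fmt.Printf("elements: %d, bytes per element: %d\n", numElems, bucketBytes(allocBytes, numElems))
}
```

Under those assumptions, a single growth step of the hash aggregator's map asked for 289 MiB in one allocation while the node was already at 12 GiB RSS.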

Five days ago the logs abruptly stopped at

I200514 09:00:16.359415 312 server/status/runtime.go:499  [n1] runtime stats: 12 GiB RSS, 328 goroutines, 6.6 GiB/188 MiB/7.1 GiB GO alloc/idle/total, 4.0 GiB/5.0 GiB CGO alloc/total, 1181.2 CGO/sec, 99.3/3.1 %(u/s)time, 0.0 %gc (0x), 624 KiB/294 KiB (r/w)net

Eight days ago the logs also abruptly stopped at

I200511 08:40:18.020033 328 server/status/runtime.go:499  [n1] runtime stats: 13 GiB RSS, 329 goroutines, 7.9 GiB/47 MiB/8.5 GiB GO alloc/idle/total, 3.9 GiB/4.9 GiB CGO alloc/total, 1021.1 CGO/sec, 123.4/7.1 %(u/s)time, 0.0 %gc (0x), 250 KiB/729 KiB (r/w)net
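
To compare these lines against the machine size, it helps to add the Go total and the CGo total from each one. The following is a minimal sketch that parses only the fields visible in the log lines quoted above. The regular expression and the `parseStats` and `toGiB` helpers are hypothetical and written for this comment, not part of CockroachDB.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// statsRE pulls RSS, Go total and CGo total out of a "runtime stats" log line.
var statsRE = regexp.MustCompile(
	`([\d.]+) (\w+) RSS, .*/([\d.]+) (\w+) GO alloc/idle/total, ` +
		`[\d.]+ \w+/([\d.]+) (\w+) CGO alloc/total`)

// toGiB converts a value with a KiB/MiB/GiB suffix to GiB.
func toGiB(num, unit string) (float64, bool) {
	v, err := strconv.ParseFloat(num, 64)
	if err != nil {
		return 0, false
	}
	switch unit {
	case "GiB":
		return v, true
	case "MiB":
		return v / 1024, true
	case "KiB":
		return v / (1024 * 1024), true
	}
	return 0, false
}

// parseStats returns RSS, Go total and CGo total in GiB for one log line.
func parseStats(line string) (rss, goTotal, cgoTotal float64, ok bool) {
	m := statsRE.FindStringSubmatch(line)
	if m == nil {
		return 0, 0, 0, false
	}
	var ok1, ok2, ok3 bool
	rss, ok1 = toGiB(m[1], m[2])
	goTotal, ok2 = toGiB(m[3], m[4])
	cgoTotal, ok3 = toGiB(m[5], m[6])
	return rss, goTotal, cgoTotal, ok1 && ok2 && ok3
}

func main() {
	// The relevant part of the three runtime stats lines quoted above.
	lines := []string{
		"12 GiB RSS, 346 goroutines, 6.9 GiB/49 MiB/7.4 GiB GO alloc/idle/total, 3.9 GiB/4.9 GiB CGO alloc/total",
		"12 GiB RSS, 328 goroutines, 6.6 GiB/188 MiB/7.1 GiB GO alloc/idle/total, 4.0 GiB/5.0 GiB CGO alloc/total",
		"13 GiB RSS, 329 goroutines, 7.9 GiB/47 MiB/8.5 GiB GO alloc/idle/total, 3.9 GiB/4.9 GiB CGO alloc/total",
	}
	for _, l := range lines {
		rss, g, c, ok := parseStats(l)
		if !ok {
			fmt.Println("could not parse:", l)
			continue
		}
		fmt.Printf("RSS %.1f GiB, Go total + CGo total = %.1f GiB\n", rss, g+c)
	}
}
```

For the three lines above the sums come to roughly 12.3, 12.1 and 13.4 GiB, which leaves very little headroom on a machine with about 14GB of RAM.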

@tbg
Member

tbg commented May 19, 2020

Have you been looking at heap profiles? You can run this test with COCKROACH_MEMPROF_INTERVAL=1s to get a heap profile written to the log dir every second. I imagine that would clear up where the memory is used. You can use roachtest --count=10 --cpu-quota=1024 [...] to run ten instances of the test concurrently, which I'd expect to net you a few repros.

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@73631089c58554c289d0fb1b26cad62667388335:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_081746.397_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1953080-1589874985-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200519 08:17:47.971993 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 39.243277ms
		  | I200519 08:17:51.476357 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 3.504327331s
		  | I200519 08:17:52.599438 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.123040635s
		  | I200519 08:18:45.532520 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 52.932951249s
		  | I200519 08:19:17.033483 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 31.500804531s
		  | I200519 09:51:36.005298 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 1h32m18.971623139s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1953080-1589874985-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4144
		2: 4265
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString


Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@8fec3f4c6d136a86f472c975edd36b75e5ab9a8c:

		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_075921.634_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1955606-1589960148-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200520 07:59:23.166183 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 35.981463ms
		  | I200520 07:59:27.614476 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 4.448242423s
		  | I200520 07:59:28.701555 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.087035088s
		  | I200520 08:00:30.596948 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 1m1.895235922s
		  | I200520 08:01:02.664253 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 32.067151158s
		  | I200520 08:09:50.831424 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 8m48.16697985s
		  | Error: check failed: 3.3.2.6: driver: bad connection
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

	cluster.go:1481,context.go:135,cluster.go:1470,test_runner.go:825: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1955606-1589960148-17-n4cpu16 --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		2: 4271
		3: 4126
		1: dead
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1129
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:272
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:766
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:852
		  | github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:800
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1799
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errors.errorString


Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@1520ad2ba7c926f8043de8b6e044ab35c2f67b13:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/alterpk-tpcc-250/run_1
	cluster.go:2012,alterpk.go:164,alterpk.go:184,test_runner.go:753: output in run_081527.349_n4_workload_check_tpcc: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1957823-1590046508-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned: exit status 30
		(1) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2087
		  | main.(*cluster).Run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2010
		  | main.registerAlterPK.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:164
		  | main.registerAlterPK.func4
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:184
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_081527.349_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1957823-1590046508-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200521 08:15:28.869841 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 52.853267ms
		  | I200521 08:15:33.167049 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 4.297131583s
		  | I200521 08:15:34.237912 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.070815521s
		  | I200521 08:16:26.277315 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 52.03922495s
		  | I200521 08:16:54.914924 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 28.637509593s
		  | I200521 08:19:41.407070 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 2m46.492079301s
		  | Error: check failed: 3.3.2.6: left EXCEPT right returned nonzero results
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError


Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich
Member

"left EXCEPT right returned nonzero results" is interesting; I've seen it a couple of times in my testing as well. My current theory is that the gateway node somehow emits some rows before hitting the fatal OOM, but I'm not sure how that is possible.

@cockroach-teamcity
Member Author

(roachtest).alterpk-tpcc-250 failed on master@809829bfe7ff27a610fa78f409fae658b9f2d9d9:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/alterpk-tpcc-250/run_1
	cluster.go:2012,alterpk.go:164,alterpk.go:184,test_runner.go:753: output in run_080021.593_n4_workload_check_tpcc: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1964552-1590305390-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned: exit status 30
		(1) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2087
		  | main.(*cluster).Run
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2010
		  | main.registerAlterPK.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:164
		  | main.registerAlterPK.func4
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/alterpk.go:184
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_080021.593_n4_workload_check_tpcc
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1964552-1590305390-17-n4cpu16:4 -- ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1} returned
		  | stderr:
		  | I200524 08:00:23.147602 1 workload/tpcc/tpcc.go:372  check 3.3.2.1 took 46.886145ms
		  | I200524 08:00:26.621823 1 workload/tpcc/tpcc.go:372  check 3.3.2.2 took 3.474164068s
		  | I200524 08:00:27.717761 1 workload/tpcc/tpcc.go:372  check 3.3.2.3 took 1.095849144s
		  | I200524 08:01:24.950346 1 workload/tpcc/tpcc.go:372  check 3.3.2.4 took 57.232474533s
		  | I200524 08:01:37.854389 1 workload/tpcc/tpcc.go:372  check 3.3.2.5 took 12.903965977s
		  | I200524 08:04:28.201381 1 workload/tpcc/tpcc.go:372  check 3.3.2.6 took 2m50.346760357s
		  | Error: check failed: 3.3.2.6: pq: root: memory budget exceeded: 20480 bytes requested, 3781230592 currently allocated, 3781237760 bytes in budget
		  | Error: DEAD_ROACH_PROBLEM: exit status 1
		  | (1) DEAD_ROACH_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload check tpcc --warehouses 250 --expensive-checks {pgurl:1}
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cockroach (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 30
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError


Artifacts: /alterpk-tpcc-250

See this test on roachdash
powered by pkg/cmd/internal/issues
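
The "memory budget exceeded" error in the report above is plain arithmetic: the request is rejected because what is already allocated plus the new request exceeds the budget. Here is a minimal sketch of that check using the numbers from the error message. The `budget` type and its `grow` method are hypothetical and are not CockroachDB's actual memory monitor.

```go
package main

import "fmt"

// budget is a hypothetical byte budget with a hard limit.
type budget struct {
	limit int64 // bytes in budget
	used  int64 // bytes currently allocated
}

// grow reserves n more bytes, or returns an error if that would exceed the limit.
func (b *budget) grow(n int64) error {
	if b.used+n > b.limit {
		return fmt.Errorf(
			"memory budget exceeded: %d bytes requested, %d currently allocated, %d bytes in budget",
			n, b.used, b.limit)
	}
	b.used += n
	return nil
}

func main() {
	// Numbers taken from the error message in the failure report.
	b := &budget{limit: 3781237760, used: 3781230592}
	fmt.Println("bytes still available:", b.limit-b.used)
	if err := b.grow(20480); err != nil {
		fmt.Println(err)
	}
}
```

Only 7168 bytes were left in the budget when the 20480-byte request arrived, so in this run the query failed on the SQL memory limit instead of crashing the node.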

@yuzefovich
Member

I think that "left EXCEPT right returned nonzero results" might have the same root cause as #49315.

@yuzefovich
Member

Here is the commit message that will close this issue:

The test has been failing for a while now because we're operating very
close to the memory limit. Usually, the 3.3.2.6 expensive check query ends
up crashing the node; sometimes it hits the SQL memory limit. I've been
staring at it for a while, but there is a discrepancy between the memory
usage reported in runtime stats in logs and the usage that shows up in
heap profiles (for example, I could see 8GB in the former and 3GB in the
latter), and I couldn't figure out where that gap goes. I don't think
spending any more time on this issue is worthwhile, so this commit bumps
the machine type from cpu-16 to cpu-32, which doubles available RAM from
14.4GB to 28.8GB.
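
When comparing heap profiles with process-level numbers, it can help to print the Go runtime's own counters side by side. A minimal sketch using runtime.MemStats follows. It does not explain the gap described in the commit message; it only shows which standard counters exist. The `heapView` type and the 64 MiB ballast are hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapView is a small subset of runtime.MemStats, converted to MiB.
type heapView struct {
	LiveMiB     float64 // HeapAlloc: bytes of allocated heap objects
	IdleMiB     float64 // HeapIdle: bytes in idle (unused) spans
	ReleasedMiB float64 // HeapReleased: bytes returned to the OS
	SysMiB      float64 // Sys: total bytes obtained from the OS
}

// readHeapView takes a snapshot of the runtime's memory counters.
func readHeapView() heapView {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	const mib = 1 << 20
	return heapView{
		LiveMiB:     float64(m.HeapAlloc) / mib,
		IdleMiB:     float64(m.HeapIdle) / mib,
		ReleasedMiB: float64(m.HeapReleased) / mib,
		SysMiB:      float64(m.Sys) / mib,
	}
}

func main() {
	ballast := make([]byte, 64<<20) // hypothetical allocation so the numbers are not tiny
	ballast[0] = 1
	v := readHeapView()
	fmt.Printf("live %.1f MiB, idle %.1f MiB, released %.1f MiB, sys %.1f MiB\n",
		v.LiveMiB, v.IdleMiB, v.ReleasedMiB, v.SysMiB)
	runtime.KeepAlive(ballast)
}
```

Counters like these cover only memory managed by the Go runtime, so memory allocated through CGo would have to be tracked separately.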

@craig craig bot closed this as completed in #49574 May 27, 2020
@craig craig bot closed this as completed in 126df91 May 27, 2020