roachtest: gossip/chaos/nodes=9 failed #135154

Closed
cockroach-teamcity opened this issue Nov 14, 2024 · 12 comments
Assignees: tbg
Labels:
  • A-testing (Testing tools and infrastructure)
  • branch-master (Failures and bugs on the master branch)
  • C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)
  • C-test-failure (Broken test (automatically or manually discovered))
  • O-roachtest
  • O-robot (Originated from a bot)
  • P-3 (Issues/test failures with no fix SLA)
  • T-kv (KV Team)
  • X-infra-flake (the automatically generated issue was closed due to an infrastructure problem, not a product issue)

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Nov 14, 2024

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 39e43b85ec3b02bc760df10fce1c19d09419d6f2:

(gossip.go:80).2: gossip did not stabilize (dead node 9) in 20.2s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-44376

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-kv labels Nov 14, 2024
@tbg
Member

tbg commented Nov 15, 2024

This is, sadly, not another instance of the problem we're trying to diagnose with #134527. Instead, it seems that the gossip checks are fairly slow:

2024/11/14 09:25:09 cluster.go:702: test status: stopping nodes :9
teamcity-17753083-1731566467-55-n9cpu4: stopping
2024/11/14 09:25:11 gossip.go:122: test status: waiting for gossip to exclude dead node 9
2024/11/14 09:25:11 gossip.go:125: checking if gossip excludes dead node 9 (0s)
2024/11/14 09:25:11 gossip.go:87: 1: checking gossip
2024/11/14 09:25:11 gossip.go:91: 1: gossip not ok (dead node 9 present) (1s)
2024/11/14 09:25:11 gossip.go:140: sleeping for 1s (1s)
2024/11/14 09:25:12 gossip.go:125: checking if gossip excludes dead node 9 (2s)
2024/11/14 09:25:12 gossip.go:87: 1: checking gossip
2024/11/14 09:25:13 gossip.go:91: 1: gossip not ok (dead node 9 present) (2s)
2024/11/14 09:25:13 gossip.go:140: sleeping for 1s (2s)
2024/11/14 09:25:14 gossip.go:125: checking if gossip excludes dead node 9 (3s)
2024/11/14 09:25:14 gossip.go:87: 1: checking gossip
2024/11/14 09:25:16 gossip.go:98: 1: found 8 live nodes
2024/11/14 09:25:16 gossip.go:87: 2: checking gossip
2024/11/14 09:25:17 gossip.go:87: 3: checking gossip
2024/11/14 09:25:19 gossip.go:87: 4: checking gossip
2024/11/14 09:25:21 gossip.go:87: 5: checking gossip
2024/11/14 09:25:26 gossip.go:87: 6: checking gossip
2024/11/14 09:25:28 gossip.go:87: 7: checking gossip
2024/11/14 09:25:31 test_impl.go:460: test failure #1: full stack retained in failure_1.log: (gossip.go:80).2: gossip did not stabilize (dead node 9) in 20.2s

Note for example the 5s delay between those two lines:

2024/11/14 09:25:21 gossip.go:87: 5: checking gossip
2024/11/14 09:25:26 gossip.go:87: 6: checking gossip

The test wants to assert that gossip converges within 20s, but it runs a number of checks that it assumes take much less than 20s. We're seeing here that the checks themselves can take on the order of 20s in their own right.

I wouldn't call this expected, though. Here's the code that runs in the check:

t.L().Printf("%d: checking gossip\n", i)
liveNodes := gossipNetworkAccordingTo(i)
for _, id := range liveNodes {
	if id == deadNode {
		t.L().Printf("%d: gossip not ok (dead node %d present) (%.0fs)\n",
			i, deadNode, timeutil.Since(start).Seconds())
		return false
	}
}
if len(expLiveNodes) == 0 {
	t.L().Printf("%d: found %d live nodes\n", i, len(liveNodes))
	expLiveNodes = liveNodes
	continue
}
if len(liveNodes) != len(expLiveNodes) {
	t.L().Printf("%d: gossip not ok (mismatched size of network); expected %d, got %d (%.0fs)\n",
		i, len(expLiveNodes), len(liveNodes), timeutil.Since(start).Seconds())
	return false
}
for i := range liveNodes {
	if liveNodes[i] != expLiveNodes[i] {
		t.L().Printf("%d: gossip not ok (mismatched view of live nodes); expected %s, got %s (%.0fs)\n",
			i, expLiveNodes, liveNodes, timeutil.Since(start).Seconds())
		return false
	}
}

Nothing here should take an appreciable amount of time (though consider that the problem in #134527 might well boil down to slow t.L().Printf calls), and neither should the query on which the check is based:

const query = `
SELECT node_id
FROM (SELECT node_id, to_timestamp(split_part(split_part(expiration, ',', 1), '.', 1)::FLOAT8) AS expiration
FROM crdb_internal.gossip_liveness)
WHERE expiration > now();
`

But we just killed a node, so I can understand how opening a SQL connection might take some time (on the order of 6s, the lease interval - we're killing nodes here). However, the query itself also seems to reliably take at least around 1s, and why would a check that isn't the first one suddenly see a lease failover?
Running the query on a three-node roachprod cluster I had lying around (and which is at >90% CPU) shows that in general the query executes ~instantly, as it should, since it only reads node-local data.
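
A minimal standalone sketch for timing the connection plus the liveness query against each node (the connection URLs and the lib/pq driver below are assumptions, not what roachtest actually uses); if the per-node total is routinely over a second, the time is being lost on the server or network side rather than in the roachtest loop:

package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // pgwire driver; CockroachDB speaks the Postgres wire protocol
)

// The liveness query from the test, verbatim.
const query = `
SELECT node_id
  FROM (SELECT node_id,
               to_timestamp(split_part(split_part(expiration, ',', 1), '.', 1)::FLOAT8) AS expiration
          FROM crdb_internal.gossip_liveness)
 WHERE expiration > now();`

func main() {
	// Hypothetical connection URLs, one per node; substitute the real hosts.
	urls := []string{
		"postgresql://root@node1:26257/?sslmode=disable",
		"postgresql://root@node2:26257/?sslmode=disable",
	}
	for _, url := range urls {
		start := time.Now()
		db, err := sql.Open("postgres", url)
		if err != nil {
			fmt.Println(url, "open:", err)
			continue
		}
		rows, err := db.Query(query) // first query triggers the actual connection
		if err != nil {
			fmt.Println(url, "query:", err)
			db.Close()
			continue
		}
		var n int
		for rows.Next() {
			n++
		}
		rows.Close()
		db.Close()
		// The interesting number: connect+query latency per node.
		fmt.Printf("%s: %d live nodes in %s\n", url, n, time.Since(start))
	}
}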

tbg added a commit to tbg/cockroach that referenced this issue Nov 15, 2024
This test is known for strange behavior (cockroachdb#134527).
A recent test failure[^1] shows that we "lose time"
in more unexpected locations.

I'm assuming an execution trace (of the roachtest process) will be
helpful here, as we'll be able to determine whether the time is
spent in roachtest or on the CockroachDB server side.

[^1]: cockroachdb#135154

Epic: none
Release note: None
tbg self-assigned this Nov 15, 2024
tbg removed the release-blocker label Nov 15, 2024
@tbg
Member

tbg commented Nov 15, 2024

Removing release-blocker label, as I consider it unlikely that there's anything new here. This test has been flaky for a long time.

@tbg
Member

tbg commented Nov 15, 2024

When I ran this test manually, I didn't see these pathological timings. 3s for everything isn't nothing, but it's a lot less than 20s.

2024/11/15 11:17:33 cluster.go:701: test status: stopping nodes :5
tobias-1731669232-01-n9cpu4: stopping
2024/11/15 11:17:35 gossip.go:134: test status: waiting for gossip to exclude dead node 5
2024/11/15 11:17:35 gossip.go:137: checking if gossip excludes dead node 5 (0s)
2024/11/15 11:17:35 gossip.go:99: 1: checking gossip
2024/11/15 11:17:35 gossip.go:103: 1: gossip not ok (dead node 5 present) (0s)
2024/11/15 11:17:35 gossip.go:152: sleeping for 1s (0s)
2024/11/15 11:17:36 gossip.go:137: checking if gossip excludes dead node 5 (1s)
2024/11/15 11:17:36 gossip.go:99: 1: checking gossip
2024/11/15 11:17:36 gossip.go:103: 1: gossip not ok (dead node 5 present) (1s)
2024/11/15 11:17:36 gossip.go:152: sleeping for 1s (1s)
2024/11/15 11:17:37 gossip.go:137: checking if gossip excludes dead node 5 (2s)
2024/11/15 11:17:37 gossip.go:99: 1: checking gossip
2024/11/15 11:17:37 gossip.go:110: 1: found 8 live nodes
2024/11/15 11:17:37 gossip.go:99: 2: checking gossip
2024/11/15 11:17:37 gossip.go:99: 3: checking gossip
2024/11/15 11:17:37 gossip.go:99: 4: checking gossip
2024/11/15 11:17:37 gossip.go:99: 6: checking gossip
2024/11/15 11:17:37 gossip.go:99: 7: checking gossip
2024/11/15 11:17:37 gossip.go:99: 8: checking gossip
2024/11/15 11:17:37 gossip.go:99: 9: checking gossip
2024/11/15 11:17:38 gossip.go:129: gossip ok (size: 8) (3s)

craig bot pushed a commit that referenced this issue Nov 18, 2024
135173: kv: add a backoff to the retry loop in db.Txn r=miraradeva a=miraradeva

In rare cases (e.g. #77376), two transactions can get repeatedly deadlocked while trying to write to the same key(s): one aborts the other, but before it can proceed, the other transaction has restarted and acquired a lock on the key again. This can result in the max transaction retries being exceeded without either transaction succeeding.

This commit adds a backoff to the transaction retry loop in `db.Txn`, which will hopefully help one transaction slow down and let the other one commit.

Fixes: #77376

Release note: None
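
Not the actual db.Txn change, just a minimal generic sketch of the retry-with-backoff pattern described above (all names and constants here are illustrative):

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op, sleeping an exponentially growing, jittered
// interval between attempts so that two contending transactions fall out of
// lockstep instead of repeatedly aborting each other.
func retryWithBackoff(maxAttempts int, op func() error) error {
	backoff := 10 * time.Millisecond
	const maxBackoff = time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, backoff).
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	attempts := 0
	err := retryWithBackoff(5, func() error {
		attempts++
		if attempts < 3 {
			return errors.New("transaction aborted, restarting") // simulated contention
		}
		return nil
	})
	fmt.Printf("done after %d attempts, err=%v\n", attempts, err)
}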

135253: roachtest: get exec traces in gossip/chaos/nodes=9 r=tbg a=tbg

This test is known for strange behavior (#134527).
A recent test failure[^1] shows that we "lose time"
in more unexpected locations.

I'm assuming an execution trace (of the roachtest process) will be
helpful here, as we'll be able to determine whether the time is
spent in roachtest or on the CockroachDB server side.

[^1]: #135154

Epic: none
Release note: None


135333: roachtest: fix "context cancelled" errors in db-console tests r=kyle-a-wong a=kyle-a-wong

Changes the context used when writing Cypress artifacts to the test artifact directory. This is needed because the existing context is getting cancelled, presumably when `rtCluster.RunE(ctx...` fails.

Adds `-e NO_COLOR=1` to the docker run command so that the output is more human-readable in log files.

Updates the tests to use `registry.StandardCockroach`. By default, `registry.RandomizedCockroach` is used, and `registry.RuntimeAssertionsCockroach` is built using `cockroach-short`, which does not include db-console in the binary.

Resolves: #134808
Epic: none
Release note: None

Co-authored-by: Mira Radeva <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Kyle Wong <[email protected]>
kvoli added the C-bug, A-testing, and P-3 labels Nov 19, 2024
@kvoli
Collaborator

kvoli commented Nov 19, 2024

Reading from the above, I'm guessing this is similar to the other failure on a different branch (#135059). I'll match the labels rather than mark it as a duplicate, just in case.

@cockroach-teamcity
Member Author

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 44de2d379610067e14a7ebfbc92e64311f13a232:

(gossip.go:92).2: gossip did not stabilize (dead node 3) in 22.8s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 9744e5f1676a752d5b200fe7bce84ca8b44afca0:

(gossip.go:92).2: gossip did not stabilize (dead node 7) in 20.7s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ de3b1220f5c71ac966561505c1b379060fa1407f:

(gossip.go:92).2: gossip did not stabilize (dead node 4) in 38.4s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ d18eb683b2759fd8814dacf0baa913f596074a17:

(gossip.go:92).2: gossip did not stabilize (dead node 8) in 89.9s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 58e75b8c97804fea87f8f793665de98098e84b20:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/gossip.go:63
	            				pkg/cmd/roachtest/tests/gossip.go:101
	            				pkg/cmd/roachtest/tests/gossip.go:140
	            				pkg/cmd/roachtest/tests/gossip.go:159
	            				pkg/cmd/roachtest/tests/gossip.go:191
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	dial tcp 20.51.253.200:26257: connect: connection refused
	Test:       	gossip/chaos/nodes=9
(require.go:1357).NoError: FailNow called
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

This test on roachdash | Improve this report!

@tbg
Member

tbg commented Jan 9, 2025

This last failure is unclear to me. I started a thread on #test-eng.

We're getting

2025/01/09 08:16:51 gossip.go:138: checking if gossip excludes dead node 0 (0s)
2025/01/09 08:16:51 gossip.go:100: 1: checking gossip
2025/01/09 08:16:51 gossip.go:111: 1: found 8 live nodes
2025/01/09 08:16:51 gossip.go:100: 2: checking gossip
2025/01/09 08:16:51 gossip.go:100: 3: checking gossip
2025/01/09 08:16:51 gossip.go:100: 4: checking gossip
2025/01/09 08:16:52 gossip.go:100: 5: checking gossip
2025/01/09 08:16:52 test_impl.go:470: test failure #1: full stack retained in failure_1.log: (assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/gossip.go:63
	            				pkg/cmd/roachtest/tests/gossip.go:101
	            				pkg/cmd/roachtest/tests/gossip.go:140
	            				pkg/cmd/roachtest/tests/gossip.go:159
	            				pkg/cmd/roachtest/tests/gossip.go:191
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	dial tcp 20.51.253.200:26257: connect: connection refused
	Test:       	gossip/chaos/nodes=9

Note the timestamp 8:16:52. 20.51.253.200 is n5 (teamcity-18431727-1736405034-48-n9cpu4-0005, see
cluster file in roachprod_state). Indeed, n5 wasn't running then:

I250109 08:16:55.596566 1 util/log/file_sync_buffer.go:237 ⋮ [config] file created at: 2025/01/09 08:16:55

but the cluster was definitely started (c.Start()) well before this point. However, 5.journalctl.txt confirms the later timestamp (I'm assuming there's no clock offset; after all, the connection-refused error makes it very likely that the process simply wasn't running when the connection attempt was made):

Jan 09 08:16:55 teamcity-18431727-1736405034-48-n9cpu4-0005 sudo[3347]:   ubuntu : PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/systemd-run --unit cockroach-system --same-dir --uid 1000 --gid 1000 --service-type=notify -p NotifyAccess=all -p MemoryMax=95% -p LimitCORE=infinity -p LimitNOFILE=532480 bash ./cockroach.sh run

The only explanation I have is that somehow c.Start returned before it had actually "done" the job. But the SSH into the machine happened well after c.Start returned, so the start must have been asynchronous in the roachtest infra (i.e. systemd-run returning early, before actually spawning the process, couldn't have caused this). It's so strange.

@tbg
Member

tbg commented Jan 9, 2025

@herkolategan kindly pointed this out:

I250109 08:16:55.820484 282 kv/kvclient/kvcoord/range_iter.go:213 ⋮ [T1,Vsystem,n5,ts-poll] 95  range descriptor lookup failed: remote wall time is too far ahead (12.355389759s) to be trustworthy

So the more likely scenario is: the cluster got started, timestamps on n5 are consistently +12s in the future relative to everyone else, n5 dies, and then we try to connect to n5 and fail.

I will say, Azure has not been looking good here. Is there a way we can vet machines before we try to use them? Are we not setting up clock sync properly?
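
One rough way to vet machines up front, sketched below as an assumption-laden example (it presumes a plain pgwire connection per node and is not existing roachtest/roachprod tooling): ask each node for its wall clock over SQL and compare it against the local clock before the test starts.

package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical connection URLs; substitute the real cluster nodes.
	urls := []string{
		"postgresql://root@node1:26257/?sslmode=disable",
		"postgresql://root@node2:26257/?sslmode=disable",
	}
	const maxSkew = 500 * time.Millisecond
	for _, url := range urls {
		db, err := sql.Open("postgres", url)
		if err != nil {
			fmt.Println(url, "open:", err)
			continue
		}
		before := time.Now()
		var remote time.Time
		if err := db.QueryRow("SELECT now()").Scan(&remote); err != nil {
			fmt.Println(url, "query:", err)
			db.Close()
			continue
		}
		rtt := time.Since(before)
		// Crude offset estimate: compare the remote clock against the local
		// midpoint of the round trip. Good enough to catch a +12s skew.
		offset := remote.Sub(before.Add(rtt / 2))
		status := "ok"
		if offset > maxSkew || offset < -maxSkew {
			status = "SKEWED"
		}
		fmt.Printf("%s: clock offset %s (rtt %s) %s\n", url, offset, rtt, status)
		db.Close()
	}
}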

tbg added the X-infra-flake label Jan 9, 2025
@tbg
Member

tbg commented Jan 9, 2025

I'm going to take this opportunity to classify this as an infra flake and close out the issue, since all recent failures were on Azure (and could have been caused by clock-sync failures). We now have #138726 to track improving this.

tbg closed this as completed Jan 9, 2025