roachtest: gossip/chaos/nodes=9 failed #135154

Closed
cockroach-teamcity opened this issue Nov 14, 2024 · 12 comments
Assignees: tbg
Labels:
  • A-testing (Testing tools and infrastructure)
  • branch-master (Failures and bugs on the master branch)
  • C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)
  • C-test-failure (Broken test (automatically or manually discovered))
  • O-roachtest
  • O-robot (Originated from a bot)
  • P-3 (Issues/test failures with no fix SLA)
  • T-kv (KV Team)
  • X-infra-flake (the automatically generated issue was closed due to an infrastructure problem, not a product issue)

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Nov 14, 2024

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 39e43b85ec3b02bc760df10fce1c19d09419d6f2:

(gossip.go:80).2: gossip did not stabilize (dead node 9) in 20.2s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-44376

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-kv labels Nov 14, 2024
@tbg
Member

tbg commented Nov 15, 2024

This is, sadly, not another instance of the problem we're trying to diagnose with #134527. Instead, it seems that the gossip checks are fairly slow:

2024/11/14 09:25:09 cluster.go:702: test status: stopping nodes :9
teamcity-17753083-1731566467-55-n9cpu4: stopping
2024/11/14 09:25:11 gossip.go:122: test status: waiting for gossip to exclude dead node 9
2024/11/14 09:25:11 gossip.go:125: checking if gossip excludes dead node 9 (0s)
2024/11/14 09:25:11 gossip.go:87: 1: checking gossip
2024/11/14 09:25:11 gossip.go:91: 1: gossip not ok (dead node 9 present) (1s)
2024/11/14 09:25:11 gossip.go:140: sleeping for 1s (1s)
2024/11/14 09:25:12 gossip.go:125: checking if gossip excludes dead node 9 (2s)
2024/11/14 09:25:12 gossip.go:87: 1: checking gossip
2024/11/14 09:25:13 gossip.go:91: 1: gossip not ok (dead node 9 present) (2s)
2024/11/14 09:25:13 gossip.go:140: sleeping for 1s (2s)
2024/11/14 09:25:14 gossip.go:125: checking if gossip excludes dead node 9 (3s)
2024/11/14 09:25:14 gossip.go:87: 1: checking gossip
2024/11/14 09:25:16 gossip.go:98: 1: found 8 live nodes
2024/11/14 09:25:16 gossip.go:87: 2: checking gossip
2024/11/14 09:25:17 gossip.go:87: 3: checking gossip
2024/11/14 09:25:19 gossip.go:87: 4: checking gossip
2024/11/14 09:25:21 gossip.go:87: 5: checking gossip
2024/11/14 09:25:26 gossip.go:87: 6: checking gossip
2024/11/14 09:25:28 gossip.go:87: 7: checking gossip
2024/11/14 09:25:31 test_impl.go:460: test failure #1: full stack retained in failure_1.log: (gossip.go:80).2: gossip did not stabilize (dead node 9) in 20.2s

Note for example the 5s delay between those two lines:

2024/11/14 09:25:21 gossip.go:87: 5: checking gossip
2024/11/14 09:25:26 gossip.go:87: 6: checking gossip

The test wants to assert that gossip converges within 20s, but it runs a number of checks that it assumes take much less than 20s. We're seeing here that the checks themselves can take on the order of 20s in their own right.

I wouldn't call this expected, though. Here's the code that runs in the check:

t.L().Printf("%d: checking gossip\n", i)
liveNodes := gossipNetworkAccordingTo(i)
for _, id := range liveNodes {
	if id == deadNode {
		t.L().Printf("%d: gossip not ok (dead node %d present) (%.0fs)\n",
			i, deadNode, timeutil.Since(start).Seconds())
		return false
	}
}
if len(expLiveNodes) == 0 {
	t.L().Printf("%d: found %d live nodes\n", i, len(liveNodes))
	expLiveNodes = liveNodes
	continue
}
if len(liveNodes) != len(expLiveNodes) {
	t.L().Printf("%d: gossip not ok (mismatched size of network); expected %d, got %d (%.0fs)\n",
		i, len(expLiveNodes), len(liveNodes), timeutil.Since(start).Seconds())
	return false
}
for i := range liveNodes {
	if liveNodes[i] != expLiveNodes[i] {
		t.L().Printf("%d: gossip not ok (mismatched view of live nodes); expected %s, got %s (%.0fs)\n",
			i, expLiveNodes, liveNodes, timeutil.Since(start).Seconds())
		return false
	}
}

Nothing here should take an appreciable amount of time (though consider that the problem in #134527 might well boil down to slow t.L().Printf calls), and neither should the query on which the check is based:

const query = `
SELECT node_id
FROM (SELECT node_id, to_timestamp(split_part(split_part(expiration, ',', 1), '.', 1)::FLOAT8) AS expiration
FROM crdb_internal.gossip_liveness)
WHERE expiration > now();
`

But we just killed a node, so I can understand how opening a SQL connection might take some time (on the order of 6s, the lease interval - we're killing nodes here). However, the query itself also seems to reliably take at least around 1s, and why would a check that isn't the first one suddenly see a lease failover?
Running the query on a three-node roachprod cluster I had lying around (and which is at >90% CPU) shows that in general the query executes ~instantly, as it should, since it only reads node-local data.
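
A minimal standalone sketch for timing the connection plus the liveness query against each node (the connection URLs and the lib/pq driver below are assumptions, not what roachtest actually uses); if the per-node total is routinely over a second, the time is being lost on the server or network side rather than in the roachtest loop:

package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // pgwire driver; CockroachDB speaks the Postgres wire protocol
)

// The liveness query from the test, verbatim.
const query = `
SELECT node_id
  FROM (SELECT node_id,
               to_timestamp(split_part(split_part(expiration, ',', 1), '.', 1)::FLOAT8) AS expiration
          FROM crdb_internal.gossip_liveness)
 WHERE expiration > now();`

func main() {
	// Hypothetical connection URLs, one per node; substitute the real hosts.
	urls := []string{
		"postgresql://root@node1:26257/?sslmode=disable",
		"postgresql://root@node2:26257/?sslmode=disable",
	}
	for _, url := range urls {
		start := time.Now()
		db, err := sql.Open("postgres", url)
		if err != nil {
			fmt.Println(url, "open:", err)
			continue
		}
		rows, err := db.Query(query) // first query triggers the actual connection
		if err != nil {
			fmt.Println(url, "query:", err)
			db.Close()
			continue
		}
		var n int
		for rows.Next() {
			n++
		}
		rows.Close()
		db.Close()
		// The interesting number: connect+query latency per node.
		fmt.Printf("%s: %d live nodes in %s\n", url, n, time.Since(start))
	}
}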

tbg added a commit to tbg/cockroach that referenced this issue Nov 15, 2024
This test is known for strange behavior (cockroachdb#134527).
A recent test failure[^1] shows that we "lose time"
in more unexpected locations.

I'm assuming an execution trace (of the roachtest process) will be
helpful here, as we'll be able to determine whether the time is
spent in roachtest or on the CockroachDB server side.

[^1]: cockroachdb#135154

Epic: none
Release note: None
tbg self-assigned this Nov 15, 2024
tbg removed the release-blocker label Nov 15, 2024
@tbg
Member

tbg commented Nov 15, 2024

Removing release-blocker label, as I consider it unlikely that there's anything new here. This test has been flaky for a long time.

@tbg
Member

tbg commented Nov 15, 2024

When I ran this test manually, I didn't see these pathological timings. 3s for everything isn't nothing, but it's a lot less than 20s.

2024/11/15 11:17:33 cluster.go:701: test status: stopping nodes :5
tobias-1731669232-01-n9cpu4: stopping
2024/11/15 11:17:35 gossip.go:134: test status: waiting for gossip to exclude dead node 5
2024/11/15 11:17:35 gossip.go:137: checking if gossip excludes dead node 5 (0s)
2024/11/15 11:17:35 gossip.go:99: 1: checking gossip
2024/11/15 11:17:35 gossip.go:103: 1: gossip not ok (dead node 5 present) (0s)
2024/11/15 11:17:35 gossip.go:152: sleeping for 1s (0s)
2024/11/15 11:17:36 gossip.go:137: checking if gossip excludes dead node 5 (1s)
2024/11/15 11:17:36 gossip.go:99: 1: checking gossip
2024/11/15 11:17:36 gossip.go:103: 1: gossip not ok (dead node 5 present) (1s)
2024/11/15 11:17:36 gossip.go:152: sleeping for 1s (1s)
2024/11/15 11:17:37 gossip.go:137: checking if gossip excludes dead node 5 (2s)
2024/11/15 11:17:37 gossip.go:99: 1: checking gossip
2024/11/15 11:17:37 gossip.go:110: 1: found 8 live nodes
2024/11/15 11:17:37 gossip.go:99: 2: checking gossip
2024/11/15 11:17:37 gossip.go:99: 3: checking gossip
2024/11/15 11:17:37 gossip.go:99: 4: checking gossip
2024/11/15 11:17:37 gossip.go:99: 6: checking gossip
2024/11/15 11:17:37 gossip.go:99: 7: checking gossip
2024/11/15 11:17:37 gossip.go:99: 8: checking gossip
2024/11/15 11:17:37 gossip.go:99: 9: checking gossip
2024/11/15 11:17:38 gossip.go:129: gossip ok (size: 8) (3s)

craig bot pushed a commit that referenced this issue Nov 18, 2024
135173: kv: add a backoff to the retry loop in db.Txn r=miraradeva a=miraradeva

In rare cases (e.g. #77376), two transactions can get repeatedly deadlocked while trying to write to the same key(s): one aborts the other, but before it can proceed, the other transaction has restarted and acquired a lock on the key again. This can result in the max transaction retries being exceeded without either transaction succeeding.

This commit adds a backoff to the transaction retry loop in `db.Txn`, which will hopefully help one transaction slow down and let the other one commit.

Fixes: #77376

Release note: None
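
Not the actual db.Txn change, just a minimal generic sketch of the retry-with-backoff pattern described above (all names and constants here are illustrative):

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op, sleeping an exponentially growing, jittered
// interval between attempts so that two contending transactions fall out of
// lockstep instead of repeatedly aborting each other.
func retryWithBackoff(maxAttempts int, op func() error) error {
	backoff := 10 * time.Millisecond
	const maxBackoff = time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, backoff).
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	attempts := 0
	err := retryWithBackoff(5, func() error {
		attempts++
		if attempts < 3 {
			return errors.New("transaction aborted, restarting") // simulated contention
		}
		return nil
	})
	fmt.Printf("done after %d attempts, err=%v\n", attempts, err)
}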

135253: roachtest: get exec traces in gossip/chaos/nodes=9 r=tbg a=tbg

This test is known for strange behavior (#134527).
A recent test failure[^1] shows that we "lose time"
in more unexpected locations.

I'm assuming an execution trace (of the roachtest process) will be
helpful here, as we'll be able to determine whether the time is
spent in roachtest or on the CockroachDB server side.

[^1]: #135154

Epic: none
Release note: None


135333: roachtest: fix "context cancelled" errors in db-console tests r=kyle-a-wong a=kyle-a-wong

Changes the context used when writing Cypress artifacts to the test artifact directory. This is needed because the existing context is getting cancelled, presumably when `rtCluster.RunE(ctx...` fails.

Adds `-e NO_COLOR=1` to the docker run command so that the output is more human-readable in log files.

Updates the tests to use `registry.StandardCockroach`. By default, `registry.RandomizedCockroach` is used, and `registry.RuntimeAssertionsCockroach` is built using `cockroach-short`, which does not include db-console in the binary.

Resolves: #134808
Epic: none
Release note: None

Co-authored-by: Mira Radeva <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Kyle Wong <[email protected]>
kvoli added the C-bug, A-testing, and P-3 labels Nov 19, 2024
@kvoli
Collaborator

kvoli commented Nov 19, 2024

Reading from the above, I'm guessing this is similar to the other failure on a different branch (#135059). I'll match the labels rather than mark it as a duplicate, just in case.

@cockroach-teamcity
Member Author

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 44de2d379610067e14a7ebfbc92e64311f13a232:

(gossip.go:92).2: gossip did not stabilize (dead node 3) in 22.8s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 9744e5f1676a752d5b200fe7bce84ca8b44afca0:

(gossip.go:92).2: gossip did not stabilize (dead node 7) in 20.7s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ de3b1220f5c71ac966561505c1b379060fa1407f:

(gossip.go:92).2: gossip did not stabilize (dead node 4) in 38.4s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ d18eb683b2759fd8814dacf0baa913f596074a17:

(gossip.go:92).2: gossip did not stabilize (dead node 8) in 89.9s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=true
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 58e75b8c97804fea87f8f793665de98098e84b20:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/gossip.go:63
	            				pkg/cmd/roachtest/tests/gossip.go:101
	            				pkg/cmd/roachtest/tests/gossip.go:140
	            				pkg/cmd/roachtest/tests/gossip.go:159
	            				pkg/cmd/roachtest/tests/gossip.go:191
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	dial tcp 20.51.253.200:26257: connect: connection refused
	Test:       	gossip/chaos/nodes=9
(require.go:1357).NoError: FailNow called
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=azure
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

This test on roachdash | Improve this report!

@tbg
Member

tbg commented Jan 9, 2025

This last failure is unclear to me. I started a thread on #test-eng.

We're getting

2025/01/09 08:16:51 gossip.go:138: checking if gossip excludes dead node 0 (0s)
2025/01/09 08:16:51 gossip.go:100: 1: checking gossip
2025/01/09 08:16:51 gossip.go:111: 1: found 8 live nodes
2025/01/09 08:16:51 gossip.go:100: 2: checking gossip
2025/01/09 08:16:51 gossip.go:100: 3: checking gossip
2025/01/09 08:16:51 gossip.go:100: 4: checking gossip
2025/01/09 08:16:52 gossip.go:100: 5: checking gossip
2025/01/09 08:16:52 test_impl.go:470: test failure #1: full stack retained in failure_1.log: (assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/gossip.go:63
	            				pkg/cmd/roachtest/tests/gossip.go:101
	            				pkg/cmd/roachtest/tests/gossip.go:140
	            				pkg/cmd/roachtest/tests/gossip.go:159
	            				pkg/cmd/roachtest/tests/gossip.go:191
	            				pkg/cmd/roachtest/test_runner.go:1314
	            				src/runtime/asm_amd64.s:1695
	Error:      	Received unexpected error:
	            	dial tcp 20.51.253.200:26257: connect: connection refused
	Test:       	gossip/chaos/nodes=9

Note the timestamp 8:16:52. 20.51.253.200 is n5 (teamcity-18431727-1736405034-48-n9cpu4-0005, see
cluster file in roachprod_state). Indeed, n5 wasn't running then:

I250109 08:16:55.596566 1 util/log/file_sync_buffer.go:237 ⋮ [config] file created at: 2025/01/09 08:16:55

but the cluster was definitely started (c.Start()) well before this point. However, 5.journalctl.txt confirms the later timestamp (I'm assuming there's no clock offset; after all, the connection-refused error makes it very likely that the process simply wasn't running when the connection attempt was made):

Jan 09 08:16:55 teamcity-18431727-1736405034-48-n9cpu4-0005 sudo[3347]:   ubuntu : PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/systemd-run --unit cockroach-system --same-dir --uid 1000 --gid 1000 --service-type=notify -p NotifyAccess=all -p MemoryMax=95% -p LimitCORE=infinity -p LimitNOFILE=532480 bash ./cockroach.sh run

The only explanation I have is that somehow c.Start returned before it had actually "done" the job. But the SSH into the machine happened well after c.Start returned, so the start must have been asynchronous in the roachtest infra (i.e. systemd-run returning early, before actually spawning the process, couldn't have caused this). It's so strange.

@tbg
Member

tbg commented Jan 9, 2025

@herkolategan kindly pointed this out:

I250109 08:16:55.820484 282 kv/kvclient/kvcoord/range_iter.go:213 ⋮ [T1,Vsystem,n5,ts-poll] 95  range descriptor lookup failed: remote wall time is too far ahead (12.355389759s) to be trustworthy

So the more likely scenario is: the cluster got started, timestamps on n5 are consistently +12s in the future relative to everyone else, n5 dies, and then we try to connect to n5 and fail.

I will say, Azure has not been looking good here. Is there a way we can vet machines before we try to use them? Are we not setting up clock sync properly?
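
One rough way to vet machines up front, sketched below as an assumption-laden example (it presumes a plain pgwire connection per node and is not existing roachtest/roachprod tooling): ask each node for its wall clock over SQL and compare it against the local clock before the test starts.

package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Hypothetical connection URLs; substitute the real cluster nodes.
	urls := []string{
		"postgresql://root@node1:26257/?sslmode=disable",
		"postgresql://root@node2:26257/?sslmode=disable",
	}
	const maxSkew = 500 * time.Millisecond
	for _, url := range urls {
		db, err := sql.Open("postgres", url)
		if err != nil {
			fmt.Println(url, "open:", err)
			continue
		}
		before := time.Now()
		var remote time.Time
		if err := db.QueryRow("SELECT now()").Scan(&remote); err != nil {
			fmt.Println(url, "query:", err)
			db.Close()
			continue
		}
		rtt := time.Since(before)
		// Crude offset estimate: compare the remote clock against the local
		// midpoint of the round trip. Good enough to catch a +12s skew.
		offset := remote.Sub(before.Add(rtt / 2))
		status := "ok"
		if offset > maxSkew || offset < -maxSkew {
			status = "SKEWED"
		}
		fmt.Printf("%s: clock offset %s (rtt %s) %s\n", url, offset, rtt, status)
		db.Close()
	}
}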

tbg added the X-infra-flake label Jan 9, 2025
@tbg
Member

tbg commented Jan 9, 2025

I'm going to take this opportunity to classify this as an infra flake and close out the issue, since all recent failures were on Azure (and could have been caused by clock-sync failures). We now have #138726 to track improving this.

tbg closed this as completed Jan 9, 2025