roachtest: gossip/chaos/nodes=9 failed #135154
This is, sadly, not another instance of the problem we're trying to diagnose with #134527. Instead, it seems that the gossip checks are fairly slow:
Note for example the 5s delay between those two lines:
The test wants to assert that gossip converges within 20s, but it runs a lot of checks that it assumes take much less than 20s. We're seeing here that they can take 20s in their own right. I wouldn't call this expected, though. Here's the code that runs in the check: `cockroach/pkg/cmd/roachtest/tests/gossip.go`, lines 87 to 115 (at 39e43b8).
Nothing here should take appreciable amounts of time (but consider that the problem in #134527 might well boil down to slowness in `cockroach/pkg/cmd/roachtest/tests/gossip.go`, lines 50 to 55, at 39e43b8).
But we just restarted a node, so I can understand how logging into a SQL shell might take some time (on the order of 6s, the lease interval; we're killing nodes here). But it also reliably seems to take at least around 1s to run this query, and why would a query that isn't the first one suddenly see a lease failover?
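For orientation, here is a minimal sketch (not the actual gossip.go code referenced above) of the shape of such a per-node check, written so that connection setup and query execution are timed separately; the `crdb_internal.gossip_nodes` probe and the helper names are illustrative assumptions:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

// checkNode times connection setup and a single gossip query separately,
// which is the kind of breakdown an execution trace would also surface.
func checkNode(url string) error {
	start := time.Now()
	db, err := sql.Open("postgres", url)
	if err != nil {
		return err
	}
	defer db.Close()
	if err := db.Ping(); err != nil { // actually dials; sql.Open is lazy
		return err
	}
	connected := time.Now()

	var nodes int
	// Hypothetical convergence probe; the real test inspects gossip
	// connectivity in more detail.
	if err := db.QueryRow(
		`SELECT count(*) FROM crdb_internal.gossip_nodes`,
	).Scan(&nodes); err != nil {
		return err
	}
	fmt.Printf("%s: connect=%s query=%s nodes=%d\n",
		url, connected.Sub(start), time.Since(connected), nodes)
	return nil
}

func main() {
	urls := []string{ /* one pgurl per node */ }
	for _, u := range urls {
		if err := checkNode(u); err != nil {
			log.Printf("%s: %v", u, err)
		}
	}
}
```

With a breakdown like this, a 5s gap between two log lines can at least be attributed to either dialing the restarted node or running the query itself.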
This test is known for strange behavior (cockroachdb#134527). A recent test failure[^1] shows that we "lose time" in more unexpected locations. I'm assuming an execution trace (of the roachtest process) will be helpful here, as we'll be able to determine whether the time is spent in roachtest or on the CockroachDB server side.

[^1]: cockroachdb#135154

Epic: none
Release note: None
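For reference, capturing an execution trace of a Go process like roachtest can be done with the standard `runtime/trace` package; the following is a hedged sketch (the file name and wrapper function are illustrative, not the actual roachtest plumbing):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
)

// withExecTrace runs fn while recording a Go execution trace to path.
// The resulting file can be inspected with `go tool trace` to see where
// wall-clock time is spent inside the roachtest process.
func withExecTrace(path string, fn func() error) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		return err
	}
	defer trace.Stop()
	return fn()
}

func main() {
	err := withExecTrace("gossip-check.trace", func() error {
		// ... run the gossip convergence checks here ...
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```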
Removing
When I ran this test manually, I didn't see these pathological timings. 3s for everything isn't nothing, but it's a lot less than 20s.
135173: kv: add a backoff to the retry loop in db.Txn r=miraradeva a=miraradeva

In rare cases (e.g. #77376), two transactions can get repeatedly deadlocked while trying to write to the same key(s): one aborts the other, but before it can proceed, the other transaction has restarted and acquired a lock on the key again. This can result in the max transaction retries being exceeded without either transaction succeeding. This commit adds a backoff to the transaction retry loop in `db.Txn`, which will hopefully help one transaction slow down and let the other one commit.

Fixes: #77376
Release note: None

135253: roachtest: get exec traces in gossip/chaos/nodes=9 r=tbg a=tbg

This test is known for strange behavior (#134527). A recent test failure[^1] shows that we "lose time" in more unexpected locations. I'm assuming an execution trace (of the roachtest process) will be helpful here, as we'll be able to determine whether the time is spent in roachtest or on the CockroachDB server side.

[^1]: #135154

Epic: none
Release note: None

135333: roachtest: fix "context cancelled" errors in db-console tests r=kyle-a-wong a=kyle-a-wong

Changes the context used when writing cypress artifacts to the test artifact directory. This is needed because the existing context is getting cancelled, presumably when `rtCluster.RunE(ctx...` fails. Adds `-e NO_COLOR=1` to the docker run command so that the output is more human-readable in log files. Updates the tests to use `registry.StandardCockroach`. By default, `registry.RandomizedCockroach` is used, and `registry.RuntimeAssertionsCockroach` is built using `cockroach-short`, which does not include db-console in the binary.

Resolves: #134808
Epic: none
Release note: None

Co-authored-by: Mira Radeva <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Kyle Wong <[email protected]>
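As a rough illustration of the backoff added in the first PR above, a generic retry loop with capped, jittered exponential backoff might look like this sketch (the constants and helper names are assumptions, not the actual `db.Txn` implementation):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

var errRetryable = errors.New("retryable txn error")

// runTxnWithBackoff retries fn on retryable errors, sleeping with capped
// exponential backoff plus jitter between attempts so that two mutually
// aborting transactions stop restarting in lockstep.
func runTxnWithBackoff(ctx context.Context, maxRetries int, fn func(context.Context) error) error {
	backoff := 25 * time.Millisecond
	const maxBackoff = 1 * time.Second
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err := fn(ctx)
		if err == nil || !errors.Is(err, errRetryable) {
			return err
		}
		// Jittered sleep before the next attempt.
		sleep := backoff/2 + time.Duration(rand.Int63n(int64(backoff/2)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return fmt.Errorf("transaction failed after %d retries", maxRetries)
}

func main() {
	attempts := 0
	err := runTxnWithBackoff(context.Background(), 5, func(ctx context.Context) error {
		attempts++
		if attempts < 3 {
			return errRetryable // simulate a couple of mutual aborts
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

The jitter is the part that matters for the deadlock described above: without it, two transactions that abort each other tend to wake up at the same moment and collide again.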
Reading from above, and guessing this is similar to the other failure on a different branch, I'll match the labels rather than mark it as a duplicate, just in case: #135059.
roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 44de2d379610067e14a7ebfbc92e64311f13a232:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Same failure on other branches
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 9744e5f1676a752d5b200fe7bce84ca8b44afca0:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Same failure on other branches
roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ de3b1220f5c71ac966561505c1b379060fa1407f:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Same failure on other branches
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ d18eb683b2759fd8814dacf0baa913f596074a17:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Same failure on other branches
roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 58e75b8c97804fea87f8f793665de98098e84b20:
Parameters:
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
This last failure is unclear to me. I started a thread on #test-eng. We're getting
Note the timestamp
but the cluster was definitely started (
The only explanation I have is that somehow
@herkolategan kindly pointed out this:
so the more likely scenario is: the cluster got started, timestamps on n5 are consistently +12s in the future relative to everyone else, n5 dies, then we try to connect to n5 and fail. I will say, azure has not been looking good here. Is there a way we can vet machines before we try to use them? Are we not setting up clock sync properly?
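One hedged sketch of how machines could be vetted for clock skew before use: connect to each node and compare its reported wall clock to the local one, using half the round-trip time to discount network delay. The threshold, URLs, and helper names here are illustrative; this is not existing roachtest tooling.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

// clockSkew estimates a node's wall-clock offset relative to the local
// machine by asking the node for its current time and halving the
// round-trip to approximate network delay.
func clockSkew(db *sql.DB) (time.Duration, error) {
	start := time.Now()
	var remote time.Time
	if err := db.QueryRow(`SELECT now()`).Scan(&remote); err != nil {
		return 0, err
	}
	rtt := time.Since(start)
	local := start.Add(rtt / 2) // midpoint of the request
	return remote.Sub(local), nil
}

func main() {
	urls := []string{ /* one pgurl per node */ }
	const maxSkew = 500 * time.Millisecond // illustrative threshold
	for _, u := range urls {
		db, err := sql.Open("postgres", u)
		if err != nil {
			log.Fatal(err)
		}
		skew, err := clockSkew(db)
		db.Close()
		if err != nil {
			log.Printf("%s: %v", u, err)
			continue
		}
		fmt.Printf("%s: skew ~ %s\n", u, skew)
		if skew > maxSkew || skew < -maxSkew {
			log.Printf("%s: clock skew above %s; reject machine", u, maxSkew)
		}
	}
}
```

A check like this would have flagged the +12s offset on n5 before the chaos phase started killing nodes.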
I'm going to take this opportunity to classify this as an infra flake and close out the issue, since all recent failures were on azure (and could have been caused by clock sync failures). We now have #138726 to track improving this.
roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 39e43b85ec3b02bc760df10fce1c19d09419d6f2:
Parameters:
arch=arm64
cloud=azure
coverageBuild=false
cpu=4
encrypted=false
fs=ext4
localSSD=true
metamorphicLeases=expiration
runtimeAssertionsBuild=false
ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Same failure on other branches
Jira issue: CRDB-44376