roachtest: restoreTPCCInc/nodes=10 failed #84162
This is a test infra flake. See #84132 (comment) for explanation.
Keeping the master issue open to see if the flake happens again.
roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ 571bfa3afb3858ae84d8a8fcdbb0a38e058402a5:
Parameters: Same failure on other branches
roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ 687171ac6c2cd9992486bb3b8c9d252ac95ca1cd:
Parameters: Same failure on other branches
roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ 88d3253301457ac57820e0f4a4fab8f74bf9f38b:
Parameters: Same failure on other branches
Also failing because of #84396.
roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ e9ee21860458d997a8155734dc608cfcd050ef24:
Parameters: Same failure on other branches
This error message looks scary. Adding a release blocker to this roachtest thread while I investigate.
Heh, this relates to the new pebbleIterator. I'll try to repro the failure with logging to understand what keys it's tripping up on:
@erikgrinaker @jbowens I suspect the
roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ e4cafeb8b1d586d091fb98e3e570650d7eeea294:
Parameters: Same failure on other branches
This PR refactors all call sites of ExternalSSTReader() to support using the new PebbleIterator, which has baked-in range key support. Most notably, this PR replaces the multiIterator used in the restore data processor with the PebbleSSTIterator. This patch is part of a larger effort to teach backup and restore about MVCC bulk operations. Next, the readAsOfIterator will need to learn how to deal with range keys. This PR also addresses a bug introduced in cockroachdb#83984: loop variables in ExternalSSTReader were captured by reference, leading to roachtest failures (cockroachdb#84240, cockroachdb#84162).
Informs cockroachdb#71155.
Fixes: cockroachdb#84240, cockroachdb#84162, cockroachdb#84181
Release note: none
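For context on the loop-capture bug mentioned above, here is a minimal, self-contained Go sketch of the pattern (illustrative names only, not the actual ExternalSSTReader code). Before Go 1.22, a `for ... range` loop reused a single loop variable, so closures created in the loop body all observed its final value unless the variable was rebound per iteration:

```go
package main

import "fmt"

func main() {
	files := []string{"a.sst", "b.sst", "c.sst"}

	// Buggy (pre-Go-1.22 semantics): every closure captures the same loop
	// variable f, so by the time the closures run they all see "c.sst".
	var buggy []func() string
	for _, f := range files {
		buggy = append(buggy, func() string { return f })
	}

	// Fixed: rebind f inside the loop body so each closure captures its own
	// per-iteration copy, the same rebinding idea as the fix described above.
	var fixed []func() string
	for _, f := range files {
		f := f
		fixed = append(fixed, func() string { return f })
	}

	for i := range files {
		fmt.Printf("buggy=%s fixed=%s\n", buggy[i](), fixed[i]())
	}
}
```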
…AL_ERROR Currently, errors like `stream error: stream ID <x>; INTERNAL_ERROR; received from peer` are not being retried. Create a custom retryer to retry these errors as suggested by: googleapis/google-cloud-go#3735 googleapis/google-cloud-go#784 Fixes: cockroachdb#84162 Release note: None
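As a rough illustration of the custom-retryer approach described in that commit, here is a hedged Go sketch (not the actual cockroach cloud/gcp code; it assumes a cloud.google.com/go/storage version that exports `ShouldRetry`, `WithErrorFunc`, `WithPolicy`, and `RetryAlways`, and the bucket name is a placeholder). The idea is to keep the library's default retry predicate and additionally treat HTTP/2 `INTERNAL_ERROR` stream resets as retryable:

```go
package main

import (
	"context"
	"log"
	"strings"

	"cloud.google.com/go/storage"
)

// shouldRetry keeps the client library's default retry behavior and
// additionally treats HTTP/2 stream resets surfaced as INTERNAL_ERROR
// (e.g. "stream error: stream ID 5; INTERNAL_ERROR; received from peer")
// as transient.
func shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	if storage.ShouldRetry(err) {
		return true
	}
	return strings.Contains(err.Error(), "INTERNAL_ERROR")
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Install the custom predicate on the bucket handle. RetryAlways is
	// reasonable here because backup reads of SSTs are idempotent.
	bucket := client.Bucket("my-backup-bucket").Retryer(
		storage.WithErrorFunc(shouldRetry),
		storage.WithPolicy(storage.RetryAlways),
	)
	_ = bucket // use bucket.Object(...).NewReader(ctx) etc. as usual
}
```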
Informs cockroachdb#84635, cockroachdb#84162 Release note: none
…AL_ERROR Currently, errors like `stream error: stream ID <x>; INTERNAL_ERROR; received from peer` are not being retried. Create a custom retryer to retry these errors as suggested by: googleapis/google-cloud-go#3735 googleapis/google-cloud-go#784 Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162 Release note: None
…85329

84975: storage: add `MVCCRangeKeyStack` for range keys r=nicktrav,jbowens a=erikgrinaker

**storage: add `MVCCRangeKeyStack` for range keys** This patch adds `MVCCRangeKeyStack` and `MVCCRangeKeyVersion`, a new range key representation that will be returned by `SimpleMVCCIterator`. It is more compact, for efficiency, and comes with a set of convenience methods to simplify common range key processing. Resolves #83895. Release note: None

**storage: return `MVCCRangeKeyStack` from `SimpleMVCCIterator`** This patch changes `SimpleMVCCIterator.RangeKeys()` to return `MVCCRangeKeyStack` instead of `[]MVCCRangeKeyValue`. Callers have not been migrated to properly make use of this -- instead, they call `AsRangeKeyValues()` and construct and use the old data structure. The MVCC range tombstones tech note is also updated to reflect this. Release note: None

**storage: migrate MVCC code to `MVCCRangeKeyStack`** Release note: None

***: migrate higher-level code to `MVCCRangeKeyStack`** Release note: None

**kvserver/gc: partially migrate to `MVCCRangeKeyStack`** Some parts require invasive changes to MVCC stats helpers. These will shortly be consolidated with other MVCC stats logic elsewhere, so the existing logic is retained for now by using `AsRangeKeyValues()`. Release note: None

**storage: remove `FirstRangeKeyAbove()` and `HasRangeKeyBetween()`** Release note: None

85017: Revert "sql: Add database ID to sampled query log" r=THardy98 a=THardy98

Reverts: #84195 This reverts commit 307817e. Removes the DatabaseID field from the `SampledQuery` telemetry log due to the potential of indefinite blocking in the case of a lease acquisition failure. Protobuf field not reserved as no official build was released with these changes yet. Release note (sql change): Removes the DatabaseID field from the `SampledQuery` telemetry log due to the potential of indefinite blocking in the case of a lease acquisition failure.

85024: cloud/gcp: add custom retryer for gcs storage, retry on stream INTERNAL_ERROR r=rhu713 a=rhu713

Currently, errors like `stream error: stream ID <x>; INTERNAL_ERROR; received from peer` are not being retried. Create a custom retryer to retry these errors as suggested by: googleapis/google-cloud-go#3735 googleapis/google-cloud-go#784 Fixes: #85217, #85216, #85204, #84162 Release note: None

85069: optbuilder: handle unnest returning a tuple r=DrewKimball a=DrewKimball

Currently, the return types of SRFs that return multiple columns are represented as tuples with labels. The tuple labels are used to decide whether or not to create a single output column for the SRF, or multiple. The `unnest` function can return a single column if it has a single argument, and the type of that column can be a tuple with labels. This could cause the old logic to mistakenly create multiple output columns for `unnest`, which could lead to panics down the line and incorrect behavior otherwise. This commit adds a special case for `unnest` in the `optbuilder` to only expand tuple return types if there is more than one argument (implying more than one output column). Other SRFs do not have the same problem because they either always return the same number of columns, cannot return tuples, or both. Fixes #58438 Release note (bug fix): Fixed a bug existing since release 20.1 that could cause a panic in rare cases when the unnest function was used with a tuple return type.

85100: opt: perf improvements for large queries r=DrewKimball a=DrewKimball

**opt: add bench test for slow queries** This commit adds two slow-planning queries pulled from #64793 to be used in benchmarking the optimizer. In addition, the `ReorderJoinsLimit` has been set to the default 8 for benchmarking tests.

**opt: add struct for tracking column equivalence sets** Previously, the `JoinOrderBuilder` would construct a `FuncDepSet` from scratch on each call to `addJoins` in order to eliminate redundant join filters. This led to unnecessary large allocations because `addJoins` is called an exponential number of times in query size. This commit adds a struct `EquivSet` that efficiently stores equivalence relations as `ColSets` in a slice. Rather than being constructed on each call to `addJoins`, a `Reset` method is called that maintains slice memory. In the future, `EquivSet` can be used to handle equivalencies within `FuncDepSet` structs as well. This will avoid a significant number of allocations in cases with many equivalent columns, as outlined in #83963.

**opt: avoid usage of FastIntMap in optimizer hot paths** Previously, `computeHashJoinCost` would use a `FastIntMap` to represent join equality filters to pass to `computeFiltersCost`. In addition, `GenerateMergeJoins` used a `FastIntMap` to look up columns among its join equality columns. This led to unnecessary allocations since column IDs are often large enough to exceed the small field of `FastIntMap`. This commit modifies `computeFiltersCost` to take an anonymous function that is used to decide whether to skip an equality condition, removing the need for a mapping between columns. This commit also refactors `GenerateMergeJoins` to simply perform a linear scan of its equality columns; this avoids the allocation issue, and should be fast in practice because the number of equalities will not generally be large. Release note: None

85146: [backupccl] Use Expr for backup's Detached and Revision History options r=benbardin a=benbardin

This will allow us to set them to null, which will be helpful for ALTER commands. Release note: None

85234: dev: add rewritable paths for norm tests r=mgartner a=mgartner

Tests in `pkg/sql/opt/norm` are similar to tests in `pkg/sql/opt/xform` and `pkg/sql/opt/memo` in that they rely on fixtures in `pkg/sql/opt/testutils/opttester/testfixtures`. This commit adds these fixtures as rewritable paths for norm tests so that `./dev test pkg/sql/opt/xform --rewrite` does not fail with errors like: open pkg/sql/opt/testutils/opttester/testfixtures/tpcc_schema: operation not permitted Release note: None

85325: sql: fix explain gist output to show number of scan span constraints r=cucaroach a=cucaroach

If there were span constraints we would always print 1, need to actually append them to get the count right. Fixes: #85324 Release note: None

85327: sql: fix udf logic test r=chengxiong-ruan a=chengxiong-ruan

Fixes: #85303 Release note: None

85329: colexec: fix recent concat fix r=yuzefovich a=yuzefovich

The recent fix of the Concat operator in the vectorized engine doesn't handle the array concatenation correctly and this is now fixed. Fixes: #85295. Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: DrewKimball <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
…shot Informs cockroachdb#84635 Informs cockroachdb#84162 This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` status to be a form of a retriable replication change error. It then hooks `Replica.executeAdminCommandWithDescriptor` up to consult this status in its retry loop. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe
Release note: None
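To make the retry-loop idea concrete, here is a hedged Go sketch (hypothetical names and error text for illustration, not the actual `Replica.executeAdminCommandWithDescriptor` code): the split is simply re-issued as long as the failure is classified as a retriable replication change error, which after this change includes the lease-transfer rejection above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// isRetriableReplicationChangeError is a hypothetical classifier standing in
// for the status check described above; the change adds the "target may need
// a snapshot" lease-transfer rejection to this retriable set.
func isRetriableReplicationChangeError(err error) bool {
	return strings.Contains(err.Error(), "target may need a snapshot")
}

// adminSplitWithRetry re-issues the split until it succeeds, the error is not
// retriable, the attempts run out, or the context is done.
func adminSplitWithRetry(ctx context.Context, split func(context.Context) error) error {
	var err error
	for attempt := 1; attempt <= 10; attempt++ {
		if err = split(ctx); err == nil || !isRetriableReplicationChangeError(err) {
			return err
		}
		select {
		case <-time.After(time.Duration(attempt) * 100 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

func main() {
	// Simulate a split that is rejected twice (while the range leaves the
	// joint config) and then succeeds.
	calls := 0
	split := func(context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("lease transfer rejected: target may need a snapshot")
		}
		return nil
	}
	fmt.Println(adminSplitWithRetry(context.Background(), split), "after", calls, "attempts")
}
```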
Currently, errors like `stream error: stream ID <x>; INTERNAL_ERROR; received from peer` are not being retried. Retry these errors as suggested by: googleapis/google-cloud-go#3735 googleapis/google-cloud-go#784 Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162 Release note: None
…AL_ERROR Currently, errors like `stream error: stream ID <x>; INTERNAL_ERROR; received from peer` are not being retried. Create a custom retryer to retry these errors as suggested by: googleapis/google-cloud-go#3735 googleapis/google-cloud-go#784 Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162 Release note: None Release justification: add retries for temporary errors that were causing roachtests to fail.
…shot Informs cockroachdb#84635 Informs cockroachdb#84162 Fixes cockroachdb#85449. Fixes cockroachdb#83174. This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` status to be a form of a retriable replication change error. It then hooks `Replica.executeAdminCommandWithDescriptor` up to consult this status in its retry loop. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe
Release note: None Release justification: Low risk.
85405: kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot r=shralex a=nvanbenschoten

Informs #84635. Informs #84162. Fixes #85449. Fixes #83174. This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` status to be a form of a retriable replication change error. It then hooks `Replica.executeAdminCommandWithDescriptor` up to consult this status in its retry loop. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe
Release note: None Release justification: Low risk, resolves flaky test.

87137: storage: default to TableFormatPebblev1 in backups r=itsbilal,dt a=jbowens

If the v22.2 upgrade has not yet been finalized, so we're not permitted to use the new TableFormatPebblev2 sstable format, default to TableFormatPebblev1 which is the format used by v22.1 internally. This change is intended to allow us to remove code for understanding the old RocksDB table format version sooner (eg, v23.1). Release justification: low-risk updates to existing functionality Release note: None

87152: sql: encode either 0 or 1 spans in scan gists r=mgartner a=mgartner

#### dev: add rewritable paths for pkg/sql/opt/exec/explain tests
This commit adds fixtures in `pkg/sql/opt/testutils/opttester/testfixtures` as rewritable paths for tests in `pkg/sql/opt/exec/explain`. This prevents `dev test pkg/sql/opt/exec/explain` from erring when the `--rewrite` flag is used. Release justification: This is a test-only change. Release note: None

#### sql: encode either 0 or 1 spans in scan gists
In plan gists, we no longer encode the exact number of spans for scans so that two queries with the same plan but a different number of spans have the same gist. In addition, plan gists are now decoded with the `OnlyShape` flag which prints any non-zero number of spans as "1+ spans" and removes attributes like "missing stats" from scans. Fixes #87138 Release justification: This is a minor change that makes plan gist instrumentation more scalable. Release note (bug fix): The Explain Tab inside the Statement Details page now groups plans that have the same shape but a different number of spans in corresponding scans.

87154: roachtest: stop cockroach gracefully when upgrading nodes r=yuzefovich a=yuzefovich

This commit makes it so that we stop cockroach nodes gracefully when upgrading them. Previous abrupt behavior of stopping the nodes during the upgrade could lead to test flakes because the nodes were not being properly drained. Here is one scenario for how one of the flakes (`pq: version mismatch in flow request: 65; this node accepts 69 through 69`, which means that a gateway running an older version asks another node running a newer version to do DistSQL computation, but the versions are not DistSQL compatible):
- we are in a state when node 1 is running a newer version while node 2 is running an older version. Importantly, node 1 was upgraded "abruptly" meaning that it wasn't properly drained; in particular, it didn't send DistSQL draining notification through gossip.
- newer node has already been started but its DistSQL server hasn't been started yet (although it already can accept incoming RPCs - see comments on `distsql.ServerImpl.Start` for more details). This means that newer node has **not** sent through gossip an update about its DistSQL version.
- node 2 acts as the gateway for a query that reads some data that node 1 is the leaseholder for. During the physical planning, older node 2 checks whether newer node 1 is "healthy and compatible", and node 1 is deemed both healthy (because it can accept incoming RPCs) and is compatible (because node 2 hasn't received updated DistSQL version of node 1 since it hasn't been sent yet). As a result, node 2 plans a read on node 1.
- when node 1 receives that request, it errors out with "version mismatch" error.
This whole problem is solved if we stop nodes gracefully when upgrading them. In particular, this will mean that node 1 would first dissipate its draining notification across the cluster, so during the physical planning it will only be considered IFF node 1 has already communicated its updated DistSQL version, and then it would be deemed DistSQL-incompatible. I verified that this scenario is possible (with manual adjustments of the version upgrade test and cockroach binary to insert a delay) and that it's fixed by this commit. I believe it is likely that other flake types have the same root cause, but I haven't verified it. Fixes: #87104. Release justification: test-only change. Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
roachtest.restoreTPCCInc/nodes=10 failed with artifacts on master @ 86e007dcb5cbde3501f138c1d768519db3487857:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Same failure on other branches
Jira issue: CRDB-17500