roachtest: schemachange/random-load failed [descriptor not found during DROP VIEW commit] #137487

Closed
cockroach-teamcity opened this issue Dec 14, 2024 · 6 comments · Fixed by #137868
Labels: branch-release-23.2, branch-release-24.1, branch-release-24.2, branch-release-24.3, C-test-failure, O-roachtest, O-robot, P-2, T-sql-foundations

cockroach-teamcity commented Dec 14, 2024

roachtest.schemachange/random-load failed with artifacts on release-23.2 @ 5af1d41f70807b3b0c608ff93c05f47be7676661:

(schemachange_random_load.go:117).runSchemaChangeRandomLoad: full command output in run_095129.088934559_n1_workload-run-schemac.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/schemachange/random-load/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-45605

cockroach-teamcity added the branch-release-23.2, C-test-failure, O-roachtest, O-robot, release-blocker, and T-sql-foundations labels Dec 14, 2024
cockroach-teamcity added this to the 23.2 milestone Dec 14, 2024
@spilchen

This is the error:

{
 "workerId": 14,
 "clientTimestamp": "09:54:30.048584",
 "ops": [
  "BEGIN",
  {
   "sql": "DROP VIEW IF EXISTS public.view4761 RESTRICT"
  },
  "COMMIT"
 ],
 "expectedExecErrors": "",
 "expectedCommitErrors": "",
 "message": "***UNEXPECTED COMMIT ERROR; Received an unexpected commit error: ERROR: transaction committed but schema change aborted with error: (XXUUU): failed to read descriptors [338 625 627 660] for the declarative schema change state: referenced descriptor ID 660: looking up ID 660: descriptor not found (SQLSTATE XXUUU)",

This problem has come up before. See #134928 (comment) and #135883

spilchen changed the title from "roachtest: schemachange/random-load failed" to "roachtest: schemachange/random-load failed [descriptor not found during DROP VIEW commit]" Dec 16, 2024
exalate-issue-sync bot added the P-2 label Dec 17, 2024
@spilchen

It's attempting to read these descriptors: [338 625 627 660]. And it's complaining that it cannot find 660.
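
When triaging similar failures, the descriptor IDs can be pulled out of the error text mechanically. The following is a minimal Python sketch, not part of the roachtest tooling; `parse_descriptor_error` is a hypothetical helper, and the regular expressions assume the message wording matches the error quoted above.

```python
import re
from typing import List, Optional, Tuple


def parse_descriptor_error(message: str) -> Tuple[List[int], Optional[int]]:
    """Return (all descriptor IDs being read, the ID reported missing).

    Hypothetical helper: the patterns assume the wording of the error
    quoted in this issue and may not match other error variants.
    """
    ids_match = re.search(r"failed to read descriptors \[([\d ]+)\]", message)
    missing_match = re.search(r"looking up ID (\d+): descriptor not found", message)
    ids = [int(tok) for tok in ids_match.group(1).split()] if ids_match else []
    missing = int(missing_match.group(1)) if missing_match else None
    return ids, missing


message = (
    "failed to read descriptors [338 625 627 660] for the declarative schema "
    "change state: referenced descriptor ID 660: looking up ID 660: "
    "descriptor not found (SQLSTATE XXUUU)"
)
ids, missing = parse_descriptor_error(message)
print(ids, missing)  # [338, 625, 627, 660] 660
```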

When I searched system.namespaces.txt, I saw descriptors for the first three:

338: table2030
625: table4402
627: table4478

These are the tables that the view references. 660 wasn't in system.namespaces.txt, but it did show up in system.eventlog.txt, and it appears to be the descriptor of the view that we are trying to drop:

2024-12-14 09:54:24.221385  create_view 0   1   "{""Timestamp"":1734170064221385019,""EventType"":""create_view"",""Statement"":""CREATE VIEW schemachange.public.view4737 AS SELECT public.table4478.col4478_4480, public.table4478.col4478_4481, public.table4478.\""c'ol̃4478_4479\"", public.table4402.col4402_4417, public.table4402.co͝l4402_4409, public.table4402.\""col440\t2_4405\"", public.table4402.\""c\u000bol440%p2_4416\"", public.table4402.\""col44\\f0\""\""2_4407\"", public.table4402.col4402_4406, public.table4402.\""c\\gol4402_4411\"", public.table4402.col4402_4404, public.table4402.col4402_4410, public.table4402.col4402_4419, public.table4402.\""col4?402_4421\"", public.table4402.\""c'\\fol4'402_4412\"", public.table4402.\""col4402\t_ 4408\"", public.table4402.col4402_4420, public.table4402.\""co\""\""l4402_4415\"", public.table4402.\"" col4402_4418\"", public.table4402.\""Col44͖02_4414\"", public.table2030.\""col%q2030_2042\"", public.table2030.\""co%pl2030_2048\"", public.table2030.col2030_2043, public.table2030.col2030_2039, public.table2030.\""col/2\\\\u2F98030_2044\"", public.table2030.\""col2*030_2038\"", public.table2030.\""col2%p030_2041\"", public.table2030.col2̈́030_2033, public.table2030.\""c%pol2030}_2051\"", public.table2030.col2030_2050, public.table2030.col2030_2036, public.table2030.\""col203/0\r_🙂2049\"", public.table2030.col2030_2047, public.table2030.\""Col'2030_2045\"", public.table2030.col2030_2040, public.table2030.col2030_2037, public.table2030.col2030_2046 FROM schemachange.public.table4478, schemachange.public.table4402, schemachange.public.table2030"",""Tag"":""CREATE VIEW"",""User"":""roachprod"",""DescriptorID"":660...
2024-12-14 09:54:27.68477   rename_table    0   1   "{""Timestamp"":1734170067684769965,""EventType"":""rename_table"",""Statement"":""ALTER VIEW public.view4737 RENAME TO public.view4761"",""Tag"":""ALTER VIEW"",""User"":""roachprod"",""DescriptorID"":660,""ApplicationName"":""schemachange"",""TableName"":""schemachange.public.view4737"",""NewTableName"":""schemachange.public.view4761""}"    \x5661e16e58ad43ae8fc3fffb368d9f40
2024-12-14 09:54:30.186735  drop_view   0   1   "{""Timestamp"":1734170070186735307,""EventType"":""drop_view"",""Statement"":""DROP VIEW IF EXISTS schemachange.public.view4761 RESTRICT"",""Tag"":""DROP VIEW"",""User"":""roachprod"",""DescriptorID"":660,""ApplicationName"":""schemachange"",""ViewName"":""schemachange.public.view4761""}"  \x3234ab7c6db5458f8f7ad54084e9312a

The view rename succeeded only a few seconds earlier, which is consistent with the drop referring to the view by its new name.
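
The same lookup can be scripted. The sketch below (Python, purely illustrative) filters event-log lines by `DescriptorID`. It assumes each line carries its JSON payload as a CSV-quoted field with doubled inner quotes, as in the excerpt above. The sample lines are abbreviated copies of that excerpt, plus one made-up line for descriptor 661 to show the filtering; `parse_event` and `history_for` are hypothetical names.

```python
import json


def parse_event(line: str) -> dict:
    """Extract the JSON payload from one event-log line.

    Assumes the payload is the field that starts with '"{' and ends with
    '}"', with inner double quotes doubled (CSV-style quoting).
    """
    start = line.index('"{') + 1   # position of the opening brace
    end = line.rindex('}"') + 1    # one past the closing brace
    return json.loads(line[start:end].replace('""', '"'))


def history_for(lines, descriptor_id):
    """Return (event type, statement) pairs for one descriptor, in log order."""
    events = (parse_event(line) for line in lines)
    return [
        (e["EventType"], e["Statement"])
        for e in events
        if e.get("DescriptorID") == descriptor_id
    ]


# Abbreviated sample lines; the 661 entry is invented for illustration.
sample = [
    '2024-12-14 09:54:24.221385 create_view 0 1 "{""EventType"":""create_view"",""Statement"":""CREATE VIEW schemachange.public.view4737 AS ..."",""DescriptorID"":660}"',
    '2024-12-14 09:54:25.000000 create_view 0 1 "{""EventType"":""create_view"",""Statement"":""CREATE VIEW other AS ..."",""DescriptorID"":661}"',
    '2024-12-14 09:54:27.68477 rename_table 0 1 "{""EventType"":""rename_table"",""Statement"":""ALTER VIEW public.view4737 RENAME TO public.view4761"",""DescriptorID"":660}"',
    '2024-12-14 09:54:30.186735 drop_view 0 1 "{""EventType"":""drop_view"",""Statement"":""DROP VIEW IF EXISTS schemachange.public.view4761 RESTRICT"",""DescriptorID"":660}"',
]

history = history_for(sample, 660)
for event_type, statement in history:
    print(event_type, "|", statement)
```

Run against the sample, this lists create_view, rename_table, and drop_view for descriptor 660, in that order.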

@spilchen

I was able to reproduce this issue. It occurs when ALTER VIEW .. RENAME is executed simultaneously with DROP VIEW. The rename operation detects that the descriptor is in a dropped state and proceeds to delete it using a call to DeleteTableDescAndZoneConfig in schema_changer.go. This creates a conflict for the DROP VIEW, which expects the descriptor to still be present.

However, I could not reproduce this issue on master. Since this scenario only arises in version 23.2, I’m confident the problem is isolated to that release.

@spilchen

I spoke too soon—I was able to reproduce this on master as well. It just takes some time to trigger, and I was probably a bit impatient earlier.

@spilchen

It turns out we have tried to close this timing hole in VIEW rename before. A check was added in #128683, but a timing hole remains.

The timeline that I see is:

  1. rename view: batch changes to the descriptor for a new name
  2. rename view: confirm DSC isn't in progress
  3. rename view: queue job to commit the descriptor change
  4. drop view: run statement phase
  5. drop view: run precommit phases (mark the descriptor as dropped)
  6. rename view: begin executing its job
  7. rename view: delete the descriptor because it's marked as dropped
  8. drop view: fail in post commit phase because descriptor not found
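
The timeline above can be replayed against a toy model. This is a minimal Python sketch, not CockroachDB code: `Catalog`, `Descriptor`, and the `dropped` and `dsc_active` fields are made-up stand-ins for the real descriptor state, and the two schema changers are reduced to the steps listed in the timeline.

```python
from dataclasses import dataclass, field


class DescriptorNotFound(Exception):
    """Stands in for the 'descriptor not found' error in the report."""


@dataclass
class Descriptor:
    name: str
    dropped: bool = False      # set by the drop view precommit phase
    dsc_active: bool = False   # True while a declarative (DSC) change owns it


@dataclass
class Catalog:
    descs: dict = field(default_factory=dict)

    def get(self, desc_id: int) -> Descriptor:
        if desc_id not in self.descs:
            raise DescriptorNotFound(f"looking up ID {desc_id}: descriptor not found")
        return self.descs[desc_id]


def run_timeline() -> list:
    catalog = Catalog({660: Descriptor("view4737")})
    trace = []

    # Steps 1-3 (rename view): batch the new name, confirm that no DSC
    # change is in progress, then queue the legacy job.
    view = catalog.get(660)
    view.name = "view4761"
    assert not view.dsc_active  # the check passes: the drop has not started yet
    trace.append("rename: job queued")

    # Steps 4-5 (drop view): statement and precommit phases mark the
    # descriptor as dropped.
    view.dsc_active = True
    view.dropped = True
    trace.append("drop: descriptor marked as dropped")

    # Steps 6-7 (rename view): the queued job starts, sees a dropped
    # descriptor, and deletes it.
    if catalog.get(660).dropped:
        del catalog.descs[660]
        trace.append("rename: descriptor deleted")

    # Step 8 (drop view): the post commit phase can no longer read it.
    try:
        catalog.get(660)
        trace.append("drop: post commit ok")
    except DescriptorNotFound as err:
        trace.append(f"drop: failed, {err}")
    return trace


for entry in run_timeline():
    print(entry)
```

The last trace entry is the lookup failure for ID 660. The check in step 2 passes only because it runs before the drop begins, which is the timing hole.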

exalate-issue-sync bot removed the release-blocker label Dec 20, 2024
spilchen added a commit to spilchen/cockroach that referenced this issue Dec 20, 2024
Running the legacy schema changer and the declarative schema changer
concurrently can cause issues due to their different approaches to
updating descriptors. Although we normally have checks to prevent the
legacy schema changer from running in such scenarios, timing issues
persisted, particularly between `ALTER VIEW .. RENAME` (legacy) and `DROP
VIEW` (DSC). In these cases, the view rename could delete the descriptor
being processed by the drop view operation.

This change addresses the timing issue by adding a check for an active
DSC job at the start of the legacy schema changer job. With this fix,
the issue could no longer be reproduced, whereas it was consistently
reproducible before.

Epic: none
Closes: cockroachdb#137487
Release note (bug fix): Fixed a timing issue between `ALTER VIEW ..
RENAME` and `DROP VIEW` that caused repeated failures in the `DROP VIEW`
job.
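
To illustrate the idea in the commit message, here is a toy Python sketch of a legacy job that checks for an active DSC job before doing anything else. It is not the actual change in schema_changer.go. The dict-based catalog, the `dsc_active` flag, and `ConcurrentDSCJob` are hypothetical. The commit message does not say what the real code does when the check fires, so raising an error here is an assumption.

```python
class ConcurrentDSCJob(Exception):
    """Hypothetical error raised when a declarative job is active."""


def legacy_schema_change_job(catalog: dict, desc_id: int, check_dsc_first: bool) -> str:
    """Toy legacy job; check_dsc_first toggles the new start-of-job check."""
    desc = catalog[desc_id]
    if check_dsc_first and desc["dsc_active"]:
        # Assumed behavior: back off instead of touching the descriptor.
        raise ConcurrentDSCJob(f"descriptor {desc_id} has an active DSC job")
    if desc["state"] == "DROP":
        # Old behavior: a dropped descriptor is deleted outright.
        del catalog[desc_id]
        return "deleted"
    return "applied"


def make_catalog() -> dict:
    # State right after drop view's precommit phase has run.
    return {660: {"name": "view4761", "state": "DROP", "dsc_active": True}}


before = make_catalog()
legacy_schema_change_job(before, 660, check_dsc_first=False)
print(660 in before)  # False: the drop view job would now fail

after = make_catalog()
try:
    legacy_schema_change_job(after, 660, check_dsc_first=True)
except ConcurrentDSCJob as err:
    print("legacy job backed off:", err)
print(660 in after)  # True: the descriptor is still there for drop view
```
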
craig bot pushed a commit that referenced this issue Dec 21, 2024
137868: sql: Check for concurrent DSC job in legacy schema changer r=spilchen a=spilchen

Running the legacy schema changer and the declarative schema changer concurrently can cause issues due to their different approaches to updating descriptors. Although we normally have checks to prevent the legacy schema changer from running in such scenarios, timing issues persisted, particularly between `ALTER VIEW .. RENAME` (legacy) and `DROP VIEW` (DSC). In these cases, the view rename could delete the descriptor being processed by the drop view operation.

This change addresses the timing issue by adding a check for an active DSC job at the start of the legacy schema changer job. With this fix, the issue could no longer be reproduced, whereas it was consistently reproducible before.

Epic: none
Closes: #137487
Closes: #137828
Release note (bug fix): Fixed a timing issue between `ALTER VIEW .. RENAME` and `DROP VIEW` that caused repeated failures in the `DROP VIEW` job.

Co-authored-by: Matt Spilchen <[email protected]>
craig bot closed this as completed in 53ee43c Dec 21, 2024

blathers-crl bot commented Dec 21, 2024

Based on the specified backports for linked PR #137868, I applied the following new label(s) to this issue: branch-release-24.1, branch-release-24.2, branch-release-24.3. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl bot added the branch-release-24.1, branch-release-24.2, and branch-release-24.3 labels Dec 21, 2024
blathers-crl bot pushed a commit that referenced this issue Dec 21, 2024
blathers-crl bot pushed a commit that referenced this issue Dec 21, 2024
blathers-crl bot pushed a commit that referenced this issue Dec 21, 2024
spilchen added a commit that referenced this issue Jan 6, 2025
spilchen added a commit that referenced this issue Jan 6, 2025
spilchen added a commit that referenced this issue Jan 6, 2025