Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/sql/sqlstats/persistedsqlstats/persistedsqlstats_test: TestPersistedSQLStatsReadDisk failed #138224

Closed
cockroach-teamcity opened this issue Jan 3, 2025 · 3 comments · Fixed by #139279
Assignees
Labels
branch-release-24.3.3-rc C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-observability

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jan 3, 2025

pkg/sql/sqlstats/persistedsqlstats/persistedsqlstats_test.TestPersistedSQLStatsReadDisk failed on release-24.3.3-rc @ 7dab2bb943152658ca94aeae83c450742ec361a5:

      pkg/kv/db.go:1061 +0xec
  github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn()
      pkg/kv/db.go:1036 +0x5c
  github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn-fm()
      <autogenerated>:1 +0x1b
  github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn()
      pkg/sql/internal.go:2012 +0x6f5
  github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).Txn()
      pkg/sql/internal.go:1939 +0xd7
  github.com/cockroachdb/cockroach/pkg/sql.(*internalDBWithOverrides).Txn()
      pkg/sql/internal.go:2094 +0x12b
  github.com/cockroachdb/cockroach/pkg/jobs.Updater.update()
      pkg/jobs/update.go:56 +0x30da
  github.com/cockroachdb/cockroach/pkg/jobs.Updater.Update()
      pkg/jobs/update.go:456 +0x1c6
  github.com/cockroachdb/cockroach/pkg/jobs.Updater.started()
      pkg/jobs/jobs.go:254 +0x85
  github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine()
      pkg/jobs/registry.go:1627 +0x10a5
  github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob()
      pkg/jobs/adopt.go:446 +0x811
  github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1()
      pkg/jobs/adopt.go:280 +0x1ab
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      pkg/util/stop/stopper.go:498 +0x338
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.gowrap1()
      pkg/util/stop/stopper.go:499 +0x4f

Goroutine 2184 (running) created at:
  testing.(*T).Run()
      GOROOT/src/testing/testing.go:1742 +0x825
  testing.runTests.func1()
      GOROOT/src/testing/testing.go:2161 +0x85
  testing.tRunner()
      GOROOT/src/testing/testing.go:1689 +0x21e
  testing.runTests()
      GOROOT/src/testing/testing.go:2159 +0x8be
  testing.(*M).Run()
      GOROOT/src/testing/testing.go:2027 +0xf17
  pkg/sql/sqlstats/persistedsqlstats/persistedsqlstats_test_test.TestMain()
      pkg/sql/sqlstats/persistedsqlstats/main_test.go:23 +0x191
  main.main()
      bazel-out/k8-fastbuild/bin/pkg/sql/sqlstats/persistedsqlstats/persistedsqlstats_test_/testmain.go:192 +0x8f3
  runtime.main()
      GOROOT/src/runtime/proc.go:271 +0x29c
  github.com/cockroachdb/cockroach/pkg/util/parquet.box2DDecoder.decode()
      pkg/util/parquet/decoders.go:175 +0x3b
  github.com/cockroachdb/cockroach/pkg/util/parquet.init.0()
      pkg/util/parquet/decoders.go:402 +0x1d0
==================

Parameters:

  • attempt=1
  • race=true
  • run=3
  • shard=6
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/obs-prs

This test on roachdash | Improve this report!

Jira issue: CRDB-45958

@cockroach-teamcity cockroach-teamcity added branch-release-24.3.3-rc C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability labels Jan 3, 2025
@kyle-a-wong
Copy link
Contributor

Full race condition stack trace:

WARNING: DATA RACE
Read at 0x00c01102c808 by goroutine 17265:
  github.com/cockroachdb/cockroach/pkg/util.CombineUnique[go.shape.int32]()
      pkg/util/slices.go:30 +0x144
  github.com/cockroachdb/cockroach/pkg/sql/sqlstats/ssmemstorage.(*Container).RecordStatement()
      pkg/sql/sqlstats/ssmemstorage/ss_mem_writer.go:127 +0x19af
  github.com/cockroachdb/cockroach/pkg/sql/sqlstats/sslocal.(*StatsCollector).RecordStatement()
      pkg/sql/sqlstats/sslocal/sslocal_stats_collector.go:349 +0x1670
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).recordStatementSummary()
      pkg/sql/executor_statement_metrics.go:216 +0x150f
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine()
      pkg/sql/conn_executor_exec.go:2032 +0x315a
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState()
      pkg/sql/conn_executor_exec.go:1192 +0x93b3
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1()
      pkg/sql/conn_executor_exec.go:141 +0x184
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling()
      pkg/sql/conn_executor_exec.go:3456 +0x539
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt()
      pkg/sql/conn_executor_exec.go:140 +0xbdd
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execPortal()
      pkg/sql/conn_executor_exec.go:251 +0x744
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func2()
      pkg/sql/conn_executor.go:2446 +0x1179
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd()
      pkg/sql/conn_executor.go:2448 +0xe3e
  github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run()
      pkg/sql/conn_executor.go:2263 +0x3ea
  github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).runWithEx.func2()
      pkg/sql/internal.go:253 +0x1d9
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      pkg/util/stop/stopper.go:498 +0x338
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.gowrap1()
      pkg/util/stop/stopper.go:499 +0x4f
Previous write at 0x00c01102c808 by goroutine 2184:
  github.com/cockroachdb/cockroach/pkg/util.CombineUnique[go.shape.int32]()
      pkg/util/slices.go:39 +0x353
  github.com/cockroachdb/cockroach/pkg/sql/appstatspb.(*StatementStatistics).Add()
      pkg/sql/appstatspb/app_stats.go:185 +0x14cb
  github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats.(*CombinedStmtStatsIterator).Next()
      pkg/sql/sqlstats/persistedsqlstats/combined_iterator.go:140 +0x5b3
  github.com/cockroachdb/cockroach/pkg/sql/sqlstats/persistedsqlstats.(*PersistedSQLStats).IterateStatementStats()
      pkg/sql/sqlstats/persistedsqlstats/stmt_reader.go:58 +0x484
  pkg/sql/sqlstats/persistedsqlstats/persistedsqlstats_test_test.verifyStoredStmtFingerprints()
      pkg/sql/sqlstats/persistedsqlstats/reader_test.go:329 +0x284
  pkg/sql/sqlstats/persistedsqlstats/persistedsqlstats_test_test.TestPersistedSQLStatsReadDisk()
      pkg/sql/sqlstats/persistedsqlstats/reader_test.go:58 +0x304
  testing.tRunner()
      GOROOT/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      GOROOT/src/testing/testing.go:1742 +0x44

@kyle-a-wong kyle-a-wong added P-1 Issues/test failures with a fix SLA of 1 month and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 15, 2025
@kyle-a-wong
Copy link
Contributor

This doesn't seem to have been caused by any recent change, so removed the release-blocker

@dhartunian
Copy link
Collaborator

@kyle-a-wong I think this is happening because the test is iterating through stats objects that were copied incorrectly from the in-memory containers where stuff is modified concurrently.

I think it boils down to this code:

statementStats.mu.Lock()
data := statementStats.mu.data
distSQLUsed := statementStats.mu.distSQLUsed
vectorized := statementStats.mu.vectorized
fullScan := statementStats.mu.fullScan
database := statementStats.mu.database
querySummary := statementStats.mu.querySummary
statementStats.mu.Unlock()
s.currentValue = &appstatspb.CollectedStatementStatistics{
Key: appstatspb.StatementStatisticsKey{
Query: stmtKey.stmtNoConstants,
QuerySummary: querySummary,
DistSQL: distSQLUsed,
Vec: vectorized,
ImplicitTxn: stmtKey.implicitTxn,
FullScan: fullScan,
App: s.container.appName,
Database: database,
PlanHash: stmtKey.planHash,
TransactionFingerprintID: stmtKey.transactionFingerprintID,
},
ID: statementStats.ID,
Stats: data,
}

On line 80 we copy the data object from the mutex and on line 102 we keep it for later observation and use in the Add() function mentioned in the stacktrace above. The slices within data are not safe for concurrent modification. I think they need to be copied explicitly.

kyle-a-wong added a commit to kyle-a-wong/cockroach that referenced this issue Jan 16, 2025
StmtStatsIterator.Next() copies the next statement stats info into a
new CollectedStatementStatistics object, but the slice fields
of stmtStats.mu.data are not copied explicitly. This makes them
susecptible to race conditions, as seen in cockroachdb#138224.

To fix, all slice fields in the StatementStatistics are
explicitly copied, using a new util.CopySlice function.

Fixes: 138224
Epic: none
Release note: none
@kyle-a-wong kyle-a-wong added P-3 Issues/test failures with no fix SLA and removed P-1 Issues/test failures with a fix SLA of 1 month labels Jan 17, 2025
kyle-a-wong added a commit to kyle-a-wong/cockroach that referenced this issue Jan 28, 2025
StmtStatsIterator.Next() copies the next statement stats info into a
new CollectedStatementStatistics object, but the slice fields
of stmtStats.mu.data are not copied explicitly. This makes them
susecptible to race conditions, as seen in cockroachdb#138224.

To fix, all slice fields in the StatementStatistics are
explicitly copied, using slices.Clone.

Fixes: 138224
Epic: none
Release note: none
kyle-a-wong added a commit to kyle-a-wong/cockroach that referenced this issue Jan 29, 2025
StmtStatsIterator.Next() copies the next statement stats info into a
new CollectedStatementStatistics object, but the slice fields
of stmtStats.mu.data are not copied explicitly. This makes them
susecptible to race conditions, as seen in cockroachdb#138224.

To fix, all slice fields in the StatementStatistics are
explicitly copied, using slices.Clone.

Fixes: 138224
Epic: none
Release note: none
kyle-a-wong added a commit to kyle-a-wong/cockroach that referenced this issue Jan 30, 2025
StmtStatsIterator.Next() copies the next statement stats info into a
new CollectedStatementStatistics object, but the slice fields
of stmtStats.mu.data are not copied explicitly. This makes them
susecptible to race conditions, as seen in cockroachdb#138224.

To fix, all slice fields in the StatementStatistics are
explicitly copied, using slices.Clone.

Fixes: 138224
Epic: none
Release note: none
@craig craig bot closed this as completed in 084c62f Jan 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-24.3.3-rc C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-observability
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants