roachtest: restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed #139310

Closed
cockroach-teamcity opened this issue Jan 17, 2025 · 6 comments
Labels
A-disaster-recovery branch-release-25.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-disaster-recovery

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jan 17, 2025

roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on release-25.1 @ f9c2a2f417a567a1dc6e7b23b41b8759cf62d077:

(sql_runner.go:267).Scan: error scanning '&{<nil> 0xc3cc68aea0}': sql: Scan error on column index 0, name "fraction_completed": converting NULL to float32 is unsupported
(1) sql: Scan error on column index 0, name "fraction_completed"
Wraps: (2) converting NULL to float32 is unsupported
Error types: (1) *fmt.wrapError (2) *errors.errorString
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-46592

cockroach-teamcity added the branch-release-25.1, C-test-failure, O-roachtest, O-robot, release-blocker, and T-disaster-recovery labels Jan 17, 2025
@jeffswenson
Collaborator

jeffswenson commented Jan 17, 2025

@dt it looks like this is caused by #138104. If there is no progress record, crdb_internal.jobs returns NULL instead of 0 for fraction_completed. I'm happy to fix it. Do you want me to adjust the job query so it returns 0 if there is no frontier, or should I make the test tolerant of NULL? I'm leaning towards fixing the test, but could be convinced we want to restore the old crdb_internal.jobs behavior.
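
For readers following along, here is a minimal sketch of what "making the test tolerant of NULL" could look like on the Go side. It is not the actual roachtest change: `fractionOrZero` is a hypothetical helper, and the sketch assumes the caller is happy to treat a NULL `fraction_completed` as 0. It uses only `sql.NullFloat64` from the standard library, whose `Scan` method accepts a nil driver value, where scanning straight into a `float32` fails with the "converting NULL to float32 is unsupported" error quoted above.

```go
package main

import (
	"database/sql"
	"fmt"
)

// fractionOrZero converts a raw driver value for fraction_completed into a
// float32, mapping SQL NULL (a nil driver value) to 0.
// Hypothetical helper for illustration only; not the roachtest code.
func fractionOrZero(raw any) (float32, error) {
	var f sql.NullFloat64
	if err := f.Scan(raw); err != nil {
		return 0, err
	}
	if !f.Valid {
		// NULL: no progress checkpoint recorded yet (assumed to mean 0 here).
		return 0, nil
	}
	return float32(f.Float64), nil
}

func main() {
	// nil stands in for a NULL column value; 0.25 for a recorded fraction.
	for _, raw := range []any{nil, float64(0.25)} {
		got, err := fractionOrZero(raw)
		fmt.Println(got, err)
	}
}
```

Against a real connection, the same idea would be to pass a `*sql.NullFloat64` to `Row.Scan` instead of a `*float32`. The other option raised above, returning 0 from the query itself, could be approximated on the caller's side with `COALESCE(fraction_completed, 0)`.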

jeffswenson self-assigned this Jan 17, 2025
jeffswenson removed the release-blocker label Jan 17, 2025
jeffswenson added a commit to jeffswenson/cockroach that referenced this issue Jan 17, 2025
As of cockroachdb#138104, `fraction_completed` is NULL if there is no progress
checkpoint for the job. Semantically, this is the right value. It's
weird that jobs with a frontier like CDC jobs would start out with
fraction_completed = 0 then switch to NULL once a checkpoint was
recorded.

Fixes: cockroachdb#139308
Part of: cockroachdb#139310
Release note: none
@cockroach-teamcity
Member Author

roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on release-25.1 @ 29c0a3d5161b6818e414efc9c35321c1c0416e19:

(sql_runner.go:267).Scan: error scanning '&{<nil> 0xc522395dd0}': sql: Scan error on column index 0, name "fraction_completed": converting NULL to float32 is unsupported
(1) sql: Scan error on column index 0, name "fraction_completed"
Wraps: (2) converting NULL to float32 is unsupported
Error types: (1) *fmt.wrapError (2) *errors.errorString
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on release-25.1 @ 6123ebaa10c3847899aa81b09ec4cb8d08c0bd6d:

(sql_runner.go:267).Scan: error scanning '&{<nil> 0xc0030a6a20}': sql: Scan error on column index 0, name "fraction_completed": converting NULL to float32 is unsupported
(1) sql: Scan error on column index 0, name "fraction_completed"
Wraps: (2) converting NULL to float32 is unsupported
Error types: (1) *fmt.wrapError (2) *errors.errorString
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on release-25.1 @ 6123ebaa10c3847899aa81b09ec4cb8d08c0bd6d:

(sql_runner.go:267).Scan: error scanning '&{<nil> 0xc0053aacf0}': sql: Scan error on column index 0, name "fraction_completed": converting NULL to float32 is unsupported
(1) sql: Scan error on column index 0, name "fraction_completed"
Wraps: (2) converting NULL to float32 is unsupported
Error types: (1) *fmt.wrapError (2) *errors.errorString
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • runtimeAssertionsBuild=false
  • ssd=0

@cockroach-teamcity
Member Author

roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on release-25.1 @ 6123ebaa10c3847899aa81b09ec4cb8d08c0bd6d:

(sql_runner.go:267).Scan: error scanning '&{<nil> 0xc003d0f9e0}': sql: Scan error on column index 0, name "fraction_completed": converting NULL to float32 is unsupported
(1) sql: Scan error on column index 0, name "fraction_completed"
Wraps: (2) converting NULL to float32 is unsupported
Error types: (1) *fmt.wrapError (2) *errors.errorString
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • runtimeAssertionsBuild=false
  • ssd=0

craig bot pushed a commit that referenced this issue Jan 21, 2025
139359: roachtest: pause restore test should tolerate null fraction_completed r=jeffswenson a=jeffswenson

As of #138104, `fraction_completed` is NULL if there is no progress checkpoint for the job. Semantically, this is the right value. It's weird that jobs with a frontier like CDC jobs would start out with fraction_completed = 0 then switch to NULL once a checkpoint was recorded.

Fixes: #139308
Part of: #139310
Release note: none

139497: catalog/lease: deflake TestLeaseAtLatestVersion r=rafiss a=fqazi

Currently, TestLeaseAtLatestVersion can flake when the initial version
of a descriptor is acquired, which this test is not designed to handle.
To address this, this patch will intentionally use a different database
for the timestamp table, so that publishing a new version or a slow
range feed does not acquire the kv table in the same schema.

Fixes: #139386

Release note: None

139504: sql: avoid hangs in TestBackfillWithProtectedTS r=rafiss a=rafiss

This test has been flaking. I haven't reproduced the failure yet, but this patch will at least make it so the test doesn't hang if an internal step fails, so we should be able to debug flakes more easily.

informs #139281
informs #139493
Release note: None

Co-authored-by: Jeff Swenson
Co-authored-by: Faizan Qazi
Co-authored-by: Rafi Shamim
blathers-crl bot pushed a commit that referenced this issue Jan 21, 2025
As of #138104, `fraction_completed` is NULL if there is no progress
checkpoint for the job. Semantically, this is the right value. It's
weird that jobs with a frontier like CDC jobs would start out with
fraction_completed = 0 then switch to NULL once a checkpoint was
recorded.

Fixes: #139308
Part of: #139310
Release note: none
@cockroach-teamcity
Member Author

roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on release-25.1 @ 0a14af43960ce2fcb79a2624d3b94fe2ce3662e2:

(sql_runner.go:267).Scan: error scanning '&{<nil> 0xc007fc6c60}': sql: Scan error on column index 0, name "fraction_completed": converting NULL to float32 is unsupported
(1) sql: Scan error on column index 0, name "fraction_completed"
Wraps: (2) converting NULL to float32 is unsupported
Error types: (1) *fmt.wrapError (2) *errors.errorString
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • runtimeAssertionsBuild=false
  • ssd=0

2 participants