roachtest: disk-stalled/fuse/log=false,data=true failed #98886

Closed
cockroach-teamcity opened this issue Mar 17, 2023 · 2 comments
Labels
branch-release-22.2: Used to mark GA and release blockers, technical advisories, and bugs for 22.2
C-test-failure: Broken test (automatically or manually discovered).
O-roachtest
O-robot: Originated from a bot.
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-storage: Storage Team
X-infra-flake: the automatically generated issue was closed due to an infrastructure problem not a product issue

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 17, 2023

roachtest.disk-stalled/fuse/log=false,data=true failed with artifacts on release-22.2 @ 4014ae25a2946464984b00d89411923303a1f634:

test artifacts and logs in: /artifacts/disk-stalled/fuse/log=false_data=true/run_1
(test_runner.go:938).runTest: test timed out (10h0m0s)
(cluster.go:1977).Run: output in run_100639.318181288_n1_charybdefsnemesis-cl: charybdefs-nemesis --clear returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_100639.324312367_n1_charybdefsnemesis-cl.log: exit status 131
(cluster.go:1977).Run: cluster.RunE: context canceled
(cluster.go:1977).Run: cluster.RunE: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

Jira issue: CRDB-25568

@cockroach-teamcity cockroach-teamcity added the branch-release-22.2, C-test-failure, O-roachtest, O-robot, and release-blocker labels Mar 17, 2023
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Mar 17, 2023
@blathers-crl blathers-crl bot added the T-storage label Mar 17, 2023
@jbowens
Collaborator

jbowens commented Mar 20, 2023

10:06:38 disk_stall.go:220: [w5] 489929.67 total transactions committed after stall
10:06:38 disk_stall.go:221: [w5] pre-stall tps: 1212.56, post-stall tps: 1166.50
10:06:38 disk_stall.go:228: test status: counting kv rows
10:06:39 disk_stall.go:231: [w5] Scan found 462504 rows.
19:47:54 test_impl.go:339: test failure #1: full stack retained in failure_1.log: (test_runner.go:938).runTest: test timed out (10h0m0s)

What was the test doing for 10h? Also, I thought this test had a much shorter timeout.

It was stuck removing the delay:

run_100639.318181288_n1_charybdefsnemesis-cl: 10:06:39 cluster.go:290: > charybdefs-nemesis --clear
./recipes: line 7: 37185 Quit                    (core dumped) python2 recipes.py "$@"
run_100639.318181288_n1_charybdefsnemesis-cl: 19:47:55 cluster.go:1999: > result: charybdefs-nemesis --clear returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_100639.324312367_n1_charybdefsnemesis-cl.log: exit status 131
(1) charybdefs-nemesis --clear returned
  | stderr:
  | ./recipes: line 7: 37185 Quit                    (core dumped) python2 recipes.py "$@"
  |
  | stdout:
Wraps: (2) Node 1. Command with error:
  | ```
  | charybdefs-nemesis --clear
  | ```
Wraps: (3) COMMAND_PROBLEM
Wraps: (4) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).errWithDebug
  |     github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:128
  | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).Run.func1
  |     github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:158
  | runtime.goexit
  |     GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (5) ssh verbose log retained in ssh_100639.324312367_n1_charybdefsnemesis-cl.log
Wraps: (6) exit status 131
Error types: (1) *cluster.WithCommandDetails (2) *hintdetail.withDetail (3) errors.Cmd (4) *withstack.withStack (5) *errutil.withPrefix (6) *exec.ExitError

Verbose log:

debug1: Sending command: export ROACHPROD=1 GOTRACEBACK=crash && bash -c "charybdefs-nemesis --clear"
debug2: channel 0: request exec confirm 1
debug3: send packet: type 98
debug2: channel_input_open_confirmation: channel 0: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug2: channel 0: rcvd adjust 2097152
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
debug2: channel 0: read<=0 rfd 5 len 0
debug2: channel 0: read failed
debug2: channel 0: chan_shutdown_read (i0 o0 sock -1 wfd 5 efd 7 [write])
debug2: channel 0: input open -> drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug3: send packet: type 96
debug2: channel 0: input drain -> closed
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82

with this back and forth continuing until the test timed out.
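
For readers unfamiliar with the packet numbers: in the SSH protocol's assigned numbers (RFC 4250), type 80 is SSH_MSG_GLOBAL_REQUEST and type 82 is SSH_MSG_REQUEST_FAILURE, while 96, 98, and 99 are the channel EOF, channel request, and channel success messages. A repeating 80/82 exchange would be consistent with client keepalive probes on an otherwise idle connection, though the log excerpt alone does not confirm that. The sketch below tallies packet types in a verbose log excerpt; the function name `countPackets` and the regular expression are assumptions made for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Message numbers as assigned in RFC 4250. Only the types that appear in
// the verbose log above are listed.
var sshMsgNames = map[int]string{
	80: "SSH_MSG_GLOBAL_REQUEST",
	82: "SSH_MSG_REQUEST_FAILURE",
	96: "SSH_MSG_CHANNEL_EOF",
	98: "SSH_MSG_CHANNEL_REQUEST",
	99: "SSH_MSG_CHANNEL_SUCCESS",
}

// packetRE matches the "debug3: send packet: type N" lines in the excerpt.
var packetRE = regexp.MustCompile(`^debug3: (send|receive) packet: type (\d+)$`)

// countPackets returns how often each "<direction> <message name>" pair
// occurs in the given log text. Hypothetical helper for illustration.
func countPackets(log string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := packetRE.FindStringSubmatch(strings.TrimSpace(sc.Text()))
		if m == nil {
			continue // not a packet line
		}
		n, err := strconv.Atoi(m[2])
		if err != nil {
			continue
		}
		name, ok := sshMsgNames[n]
		if !ok {
			name = "type " + m[2]
		}
		counts[m[1]+" "+name]++
	}
	return counts
}

func main() {
	excerpt := `debug3: send packet: type 96
debug2: channel 0: input drain -> closed
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82`
	for key, n := range countPackets(excerpt) {
		fmt.Printf("%s: %d\n", key, n)
	}
}
```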

Not sure what happened during the attempt to unstall the FUSE filesystem. If the FUSE variants prove too flaky, we could consider removing them and relying on the dmsetup and cgroup variants. I'm going to close this out as a flake in the test.

@jbowens jbowens closed this as not planned Mar 20, 2023
jbowens added a commit to jbowens/cockroach that referenced this issue Mar 20, 2023
@nicktrav nicktrav added the X-infra-flake label Mar 20, 2023
@nicktrav
Collaborator

Thanks for digging in! I also added the X-infra-flake label (we track these types of failures now).

jbowens added a commit to jbowens/cockroach that referenced this issue Mar 20, 2023
This commit sets a new 30m timeout for all disk stall roachtests. Previously,
the FUSE filesystem variants had no timeout and inherited the default 10h
timeout. The other variants had a 20m timeout, which has been observed to be
too short due to upreplication latency.

Informs cockroachdb#98904.
Informs cockroachdb#98886.
Epic: None
Release note: None
craig bot pushed a commit that referenced this issue Mar 21, 2023
97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357

All but the last commit are from #92966 and #97677.


    This commit adds the default_text_search_config variable for the tsearch
    package, which allows the user to set a default configuration for the
    text search builtin functions that take configurations, such as
    to_tsvector and to_tsquery. The default for this configuration variable
    is 'english', as it is in Postgres.

    Release note (sql change): add the default_text_search_config variable
    for compatibility with the single-argument variants of the text search
    functions to_tsvector, to_tsquery, phraseto_tsquery, and
    plainto_tsquery, which use the value of default_text_search_config
    instead of expecting one to be included as in the two-argument variants.
    The default value of this setting is 'english'.

99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens

This commit sets a new 30m timeout for all disk stall roachtests. Previously,
the FUSE filesystem variants had no timeout and inherited the default 10h
timeout. The other variants had a 20m timeout, which has been observed to be
too short due to upreplication latency.

Informs #98904.
Informs #98886.
Epic: None
Release note: None


99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373

Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`.

This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns.

Co-authored-by: [email protected]

Fixes: #99000
Epic: None

Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view.

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: craig[bot] <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 21, 2023
blathers-crl bot pushed a commit that referenced this issue Mar 21, 2023
jbowens added a commit to jbowens/cockroach that referenced this issue Mar 21, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024