Fatal panic: state service commit block failed: verified checkpoints must be committed transactionally: CommitFinalized(Closed) #2055

Closed
dconnolly opened this issue Apr 21, 2021 · 10 comments
Labels

  • A-dependencies (Area: Dependency file updates)
  • A-rust (Area: Updates to Rust code)
  • C-bug (Category: This is a bug)
  • I-panic (Zebra panics with an internal error message)

Comments

@dconnolly

dconnolly commented Apr 21, 2021

Caught by Sentry: https://sentry.io/organizations/zfnd/issues/2354800484/?project=5540918&query=is%3Aunresolved&statsPeriod=14d

Error

state service commit block failed: verified checkpoints must be committed transactionally: CommitFinalized(Closed)

Analysis

We've seen this bug a few times now, but we need to be able to reproduce it to diagnose it and fix it.

Looking at recent commits, there are a few potential causes:

More broadly, it could be a regression in:

  • rustc atomic codegen or stdlib
  • rustc async codegen or stdlib
  • tokio
  • rocksdb
  • Zebra

It's probably a regression in PR #1637, which fixed ticket #1576.

Possible Fixes

Potential diagnostic:

Potential fixes:

Metadata

| key | value |
|---|---|
| version | 1.0.0-alpha.6 |
| git commit | afac2c2 |
| Zcash network | Mainnet |
| location | /zebra/zebra-consensus/src/checkpoint.rs:923:18 |

Backtrace

Backtrace:
   0: backtrace::capture::Backtrace::create
   1: backtrace::capture::Backtrace::new

@dconnolly added the C-bug, P-High, and I-panic labels Apr 21, 2021
@teor2345 added the A-dependencies and A-rust labels Apr 22, 2021
@teor2345 added this to the 2021 Sprint 8 milestone Apr 22, 2021
@teor2345

Looking at recent commits, there are a few potential causes:

@teor2345

@dconnolly I think I'll need you to give me access to the Google Cloud logs for the following instance_ids:

  • zebrad-main-5bhl
  • zebrad-main-qbg2

Or if the logs have been deleted already, let's increase the retention interval.

@teor2345

It's also possible that something like #1351 is the cause, if RocksDB starts shutting down before the shutdown flag is set.

@teor2345

teor2345 commented Apr 23, 2021

Unfortunately, there's no backtrace in sentry, so we can't really diagnose this issue any further.

It's also likely that the backtrace wouldn't provide any useful info - what we really need is the logs, so we can see what the rest of the app was doing.

@teor2345

Here's another possible cause…

We could be exiting from inside the state service (via debug_stop_at_height) or from a panic, and racing with the checkpointer shutdown.
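To make the suspected race concrete, here is a minimal sketch using only the standard library. It is not Zebra's code: `CommitRequest`, `CommitError`, `commit_finalized`, and `commit_after_state_exit` are hypothetical names, and a plain `std::sync::mpsc` channel stands in for whatever connects the checkpointer to the state service. The point is only that if the state side goes away first, the next commit attempt fails with a "closed" style error, and calling `expect` on that result would produce a panic like the one reported.

```rust
use std::sync::mpsc;

/// Hypothetical stand-in for a request to commit a verified block.
#[derive(Debug, PartialEq)]
pub struct CommitRequest {
    pub height: u32,
}

/// Hypothetical stand-in for the `CommitFinalized(Closed)` error in the panic message.
#[derive(Debug, PartialEq)]
pub enum CommitError {
    Closed,
}

/// Hand a verified block to the state side.
/// Fails with `Closed` if the receiving end has already been dropped.
pub fn commit_finalized(
    state: &mpsc::Sender<CommitRequest>,
    height: u32,
) -> Result<(), CommitError> {
    state
        .send(CommitRequest { height })
        .map_err(|_| CommitError::Closed)
}

/// Simulate the race: the state side exits (its receiver is dropped)
/// before the checkpointer sends its next commit.
pub fn commit_after_state_exit() -> Result<(), CommitError> {
    let (tx, rx) = mpsc::channel::<CommitRequest>();
    drop(rx); // the state side shuts down first
    commit_finalized(&tx, 1)
}
```

In this sketch, `commit_after_state_exit().expect("verified checkpoints must be committed transactionally")` would panic, because nothing tells the sender that the closed channel is an expected part of shutdown.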

Here are some possible fixes:

  • Make the debug_stop_at_height code call a Zebra exit function that sets the shutdown bool first
  • Replace all instances of exit with the Zebra exit function
  • Set the is_shutting_down flag as early as possible in the panic handler
  • Register an at_exit handler that sets the is_shutting_down flag
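A rough sketch of the first three fixes, using only the standard library. Everything here is an assumption for illustration rather than Zebra's actual API: the `IS_SHUTTING_DOWN` static, `exit_with_flag` (standing in for the "Zebra exit" function), `install_shutdown_panic_hook`, and `classify_commit` are all hypothetical names. The at_exit handler is not covered, since the Rust standard library has no such hook.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical global shutdown flag (the thread calls it `is_shutting_down`).
static IS_SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);

/// Mark the process as shutting down.
pub fn set_shutting_down() {
    IS_SHUTTING_DOWN.store(true, Ordering::SeqCst);
}

/// Check whether shutdown has started.
pub fn is_shutting_down() -> bool {
    IS_SHUTTING_DOWN.load(Ordering::SeqCst)
}

/// Hypothetical "Zebra exit" wrapper: set the flag first, then exit.
/// Code such as the debug_stop_at_height path would call this instead of
/// calling `std::process::exit` directly.
pub fn exit_with_flag(code: i32) -> ! {
    set_shutting_down();
    std::process::exit(code)
}

/// Install a panic hook that sets the flag before running the previous hook,
/// so other tasks can see the shutdown as early as possible.
pub fn install_shutdown_panic_hook() {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        set_shutting_down();
        previous(info);
    }));
}

/// How a checkpointer-like task might react to a commit result.
#[derive(Debug, PartialEq)]
pub enum CommitOutcome {
    Committed,
    IgnoredDuringShutdown,
    Fatal,
}

/// Treat a failed commit as fatal only when no shutdown is in progress.
pub fn classify_commit(result: Result<(), &'static str>) -> CommitOutcome {
    match result {
        Ok(()) => CommitOutcome::Committed,
        Err(_) if is_shutting_down() => CommitOutcome::IgnoredDuringShutdown,
        Err(_) => CommitOutcome::Fatal,
    }
}
```

The ordering is what matters: the flag has to be set before anything starts tearing down the state service, otherwise the checkpointer can still observe a closed channel with the flag unset and hit the fatal path.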

This work could also be done in #1678.

@teor2345

There's not much we can do here until we get better diagnostics, and it seems to have been a transient bug.

@teor2345 removed this from the 2021 Sprint 8 milestone Apr 26, 2021
@mpguerra added P-Medium and removed P-High labels Apr 26, 2021
@mpguerra

Also lowering the priority to Medium for now.

@teor2345

teor2345 commented May 3, 2021

> There's not much we can do here until we get better diagnostics, and it seems to have been a transient bug.

We've seen this bug again, but we need to be able to reproduce it to diagnose it and fix it.

@teor2345 added P-Low and removed P-Medium labels May 24, 2021
@teor2345

We haven't seen this bug for a while, marking it as low priority.

@teor2345

We haven't seen this bug for a while, and we've added a shutdown workaround. So I think we can close this ticket.


3 participants