Reorg handling may be broken in 0.20 #3940

Closed
victorkirov opened this issue Sep 11, 2024 · 18 comments

@victorkirov
Contributor

There was a reorg on Testnet around block 2904360. Ord 0.20 successfully picked it up and started the rollback, but the rollback has been stuck for over 2 hours, whereas it used to be close to instant. This is happening on multiple instances of the indexer, not just one, so it's not an outlier issue.

[2024-09-11T05:51:03Z INFO  ord::index] 5 block deep reorg detected at height 2904360
[2024-09-11T05:51:03Z INFO  ord::index::reorg] rolling back database after reorg of depth 5 at height 2904360

Could something have broken in a new version of redb?

@victorkirov
Contributor Author

victorkirov commented Sep 11, 2024

It also looks like whatever state the process is in is blocking shutdown. After the first SIGTERM, it logs that it's gracefully shutting down and that you can force shutdown by pressing Ctrl-C again. A second SIGTERM usually kills the process at this point, but it's not responding. A SIGKILL also does nothing, though I don't think SIGKILLs are currently handled.

@raphjaph
Collaborator

Hey, thanks for opening this issue! Unfortunately our testnet instance was still running v19 so I can't check this myself. I've just updated the server though. What does it say on the /status page of the server?

SIGKILL should always work. Could you maybe provide ps or top output? You can also send it to me privately at raphjaph AT protonmail.com.

@raphjaph raphjaph added the bug label Sep 13, 2024
@raphjaph raphjaph added this to Tracker Sep 13, 2024
@raphjaph raphjaph moved this to In Progress in Tracker Sep 13, 2024
@raphjaph raphjaph self-assigned this Sep 13, 2024
@raphjaph
Collaborator

I just deliberately put testnet ord into recovery mode and SIGKILL worked for me. Weird that it doesn't work for you. I'll try simulating reorgs on regtest next and see what happens.

@victorkirov
Contributor Author

Unfortunately, I already reverted the instance back to 0.19. I'll start a new 0.20 one and try to simulate a reorg, but it may take some time.

As for SIGKILL, I just tried sending a SIGKILL to a normally running instance and it's ignored completely 👀 Maybe this is due to it running inside a container, but that shouldn't affect the signal, since I'm executing it from a bash terminal inside the container (kill -9 1).

@raphjaph
Collaborator

On regtest it seems to recover without a problem.
I did cargo run env on master, then:

  • bitcoin-cli -datadir=env generatetoaddress 10 <ADDRESS>
  • bitcoin-cli -datadir=env invalidateblock <BLOCK_HASH>
  • bitcoin-cli -datadir=env generatetoaddress 10 <ADDRESS>

[2024-09-17T06:31:58Z INFO  ord::index::updater] Committing at block height 265, 2 outputs traversed, 3 in map, 0 cached
[2024-09-17T06:32:17Z INFO  ord::index] 6 block deep reorg detected at height 265
[2024-09-17T06:32:17Z INFO  ord::index::reorg] rolling back database after reorg of depth 6 at height 265
[2024-09-17T06:32:17Z INFO  ord::index::reorg] successfully rolled back database to height 250

@victorkirov
Contributor Author

Hmm, I guess there may have been something else wrong. It's just really strange that it happened across multiple instances. I'll try upgrading again and see if it happens. I'll close this ticket for now and reopen it if I can reproduce the issue and get more info.

Thanks for looking into it 🙌

@dcorral

dcorral commented Oct 3, 2024

A mainnet reorg is crashing on my side as well. Could we reopen this to see if we can get to the root of the issue? 🙏

@victorkirov victorkirov reopened this Oct 3, 2024
@raphjaph
Collaborator

raphjaph commented Oct 3, 2024

What block was the reorg at and do you have any log outputs?

@dcorral

dcorral commented Oct 4, 2024

I have no logs, sorry, but the block was 863888. I'll report back if anything comes up.

@dcorral

dcorral commented Oct 6, 2024

Got the following while testing on regtest:

  • generate 101 blocks
  • invalidate block 99
  • generate 3 blocks

thread '' panicked at src/index/reorg.rs:66:76:
called Option::unwrap() on a None value
stack backtrace:
0: 0x5969b7ebadb6 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h410d4c66be4e37f9
1: 0x5969b7eeb140 - core::fmt::write::he40921d4802ce2ac
2: 0x5969b7eb697f - std::io::Write::write_fmt::h5de5a4e7037c9b20
3: 0x5969b7ebab94 - std::sys_common::backtrace::print::h11c067a88e3bdb22
4: 0x5969b7ebc417 - std::panicking::default_hook::{{closure}}::h8c832ecb03fde8ea
5: 0x5969b7ebc179 - std::panicking::default_hook::h1633e272b4150cf3
6: 0x5969b7ebc8a8 - std::panicking::rust_panic_with_hook::hb164d19c0c1e71d4
7: 0x5969b7ebc749 - std::panicking::begin_panic_handler::{{closure}}::h0369088c533c20e9
8: 0x5969b7ebb2b6 - std::sys_common::backtrace::__rust_end_short_backtrace::hc11d910daf35ac2e
9: 0x5969b7ebc4d4 - rust_begin_unwind
10: 0x5969b720ebe5 - core::panicking::panic_fmt::ha6effc2775a0749c
11: 0x5969b720eca3 - core::panicking::panic::h44790a89027c670f
12: 0x5969b720eb36 - core::option::unwrap_failed::hcb3a256a9f1ca882
13: 0x5969b73e2dbe - ord::index::reorg::Reorg::handle_reorg::h5ff6b80ae2a95596
14: 0x5969b726160b - ord::index::Index::update::h7a56ebe30e595534
15: 0x5969b7817375 - std::sys_common::backtrace::__rust_begin_short_backtrace::hbf31dcab6cd6a497
16: 0x5969b77686cf - core::ops::function::FnOnce::call_once{{vtable.shim}}::h6189ab0ff9d9cd46
17: 0x5969b7ec1bc5 - std::sys::pal::unix::thread::Thread::new::thread_start::h3631815ad38387d6
18: 0x7bdadcfb439d -
19: 0x7bdadd03949c -
20: 0x0 -

I'm not able to consistently reproduce this error with the steps described above.

ord version: 0.20.0
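
For what it's worth, the panic is an Option::unwrap() on a None value inside the reorg handler, i.e. some lookup during rollback came back empty. Below is a minimal sketch of that failure mode and a non-panicking alternative, assuming (hypothetically) that the rollback target comes from an optional savepoint lookup; this is not ord's actual code.

// Hypothetical sketch of the failure mode, not ord's source: if the rollback
// target comes from an optional lookup (for example the oldest recorded
// savepoint) and nothing is found, `.unwrap()` panics exactly like the
// backtrace above; matching on the Option keeps it an ordinary error.
fn oldest_savepoint_height(savepoints: &[u32]) -> Option<u32> {
    savepoints.iter().copied().min()
}

fn main() {
    // Assume no savepoints were recorded yet, e.g. right after a fresh index.
    let savepoints: Vec<u32> = Vec::new();

    // Panicking version, matching the backtrace above:
    // let target = oldest_savepoint_height(&savepoints).unwrap();

    // Non-panicking version:
    match oldest_savepoint_height(&savepoints) {
        Some(height) => println!("rolling back to savepoint at height {height}"),
        None => eprintln!("reorg detected but no savepoint available; a full reindex may be needed"),
    }
}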

@gus4rs

gus4rs commented Oct 9, 2024

Version 0.20.1 dies after a reorg, doesn't respond to the graceful shutdown signal, and must be killed abruptly, which corrupts the database. This happened to me on testnet3.

@FZelin

FZelin commented Nov 1, 2024

Yep, happened to me on testnet3 also, on 0.21.2. Now it cannot update.

[2024-11-01T18:07:17Z INFO ord::index::reorg] rolling back database after reorg of depth 2 at height 3196602
thread '' panicked at src/index/reorg.rs:66:76:
called Option::unwrap() on a None value
stack backtrace:
0: 0x5fa669aa9c6a - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h304520fd6a30aa07
1: 0x5fa669ad625b - core::fmt::write::hf5713710ce10ff22
2: 0x5fa669aa50b3 - std::io::Write::write_fmt::hda708db57927dacf
3: 0x5fa669aaaf52 - std::panicking::default_hook::{{closure}}::he1ad87607d0c11c5
4: 0x5fa669aaabbe - std::panicking::default_hook::h81c8cd2e7c59ee33
5: 0x5fa669aab7df - std::panicking::rust_panic_with_hook::had2118629c312a4a
6: 0x5fa669aab493 - std::panicking::begin_panic_handler::{{closure}}::h7fa5985d111bafa2
7: 0x5fa669aaa149 - std::sys::backtrace::__rust_end_short_backtrace::h704d151dbefa09c5
8: 0x5fa669aab154 - rust_begin_unwind
9: 0x5fa668dd78b3 - core::panicking::panic_fmt::h3eea515d05f7a35e
10: 0x5fa668dd793c - core::panicking::panic::h102d65dbfa674afe
11: 0x5fa668dd7819 - core::option::unwrap_failed::hfd32652cc6017653
12: 0x5fa668ef1945 - ord::index::reorg::Reorg::handle_reorg::h8da2103b4d327514
13: 0x5fa6691a5c91 - ord::index::Index::update::h5b3331c0e6488b94
14: 0x5fa668de6055 - std::sys::backtrace::__rust_begin_short_backtrace::he5a6a51d16c02dfc
15: 0x5fa668fdaed2 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h0ba0e47ea4e1123b
16: 0x5fa669ab033b - std::sys::pal::unix::thread::Thread::new::thread_start::hcdbd1049068002f4
[2024-11-01T18:07:17Z INFO ord::index::updater] Block receiver disconnected: sending on a closed channel
17: 0x771988294ac3 - start_thread
at ./nptl/pthread_create.c:442:8
18: 0x771988326850 - __GI___clone3
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
19: 0x0 -

@raphjaph
Collaborator

I merged #2365 yesterday. This should fix that issue. I've been running it on https://testnet4.ordinals.com (which has very frequent reorgs) and it's been going smoothly. Please try it out and let me know if it actually fixes this.

@emilcondrea
Contributor

@raphjaph Tested this with 0.22.1 and the reorg logic is still broken, but in a different way.
I am trying to index testnet4, and it indexes all the way to the tip (in my case 63038), then this happens:

  • It commits at the tip and creates the savepoint at the tip.
  • Then a reorg of depth 2 happens. ord tries to roll back to the oldest savepoint, which is still at 63038, and thinks the reorg is recoverable (but it's not, since the savepoint isn't old enough).
  • It reverts to the oldest savepoint, which effectively does nothing.
  • It then repeats this in an infinite loop, and the used disk space grows rapidly, roughly 1 more GB every 2-3 minutes (it ended up filling all the disk space on my Mac overnight).

Opened a PR to fix this: #4169
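
For illustration, here is a minimal sketch of the invariant that scenario violates, assuming only that recoverability is judged from the savepoint height and the reorg depth; this is not the code from #4169.

// Hypothetical sketch, not the actual fix: a reorg of `depth` blocks at
// `tip_height` can only be undone from a savepoint taken at or below the
// fork point. A savepoint created at the tip itself (63038 in the scenario
// above) fails this check, so rolling back to it cannot make progress and
// the indexer loops forever.
fn savepoint_can_undo_reorg(savepoint_height: u32, tip_height: u32, depth: u32) -> bool {
    savepoint_height <= tip_height.saturating_sub(depth)
}

fn main() {
    // A savepoint at the tip cannot undo a 2-block-deep reorg.
    assert!(!savepoint_can_undo_reorg(63038, 63038, 2));
    // A savepoint taken a few blocks earlier can.
    assert!(savepoint_can_undo_reorg(63030, 63038, 2));
    println!("savepoint recoverability checks pass");
}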

@victorkirov
Contributor Author

victorkirov commented Jan 8, 2025

Good catch! That PR will solve a few issues.

The other big one that has caught people out before and required a full reindex is when they use ord in command-line mode rather than as a server, so the index is only updated when they run a command. If they happen to run a command just as a reorg is happening, with the current production logic they will get a savepoint on a reorged block.
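
A rough sketch of the kind of guard that scenario calls for, purely illustrative and under the assumption that a savepoint records the hash of the block it was taken at (this is not code from ord or from any of the PRs mentioned here):

// Illustrative only: if a savepoint records the hash of the block it was
// taken at, it can be validated against the node's current best chain before
// being trusted. A savepoint taken on a block that was later reorged away
// should be discarded instead of rolled back to.
#[derive(Clone, Copy, PartialEq, Eq)]
struct BlockHash([u8; 32]);

struct Savepoint {
    height: u32,
    block_hash: BlockHash,
}

fn savepoint_is_on_best_chain(
    savepoint: &Savepoint,
    best_chain_hash_at_height: impl Fn(u32) -> Option<BlockHash>,
) -> bool {
    best_chain_hash_at_height(savepoint.height) == Some(savepoint.block_hash)
}

fn main() {
    let savepoint = Savepoint { height: 100, block_hash: BlockHash([1; 32]) };
    // Pretend the node now reports a different hash at height 100 after a reorg.
    let usable = savepoint_is_on_best_chain(&savepoint, |_height| Some(BlockHash([2; 32])));
    assert!(!usable); // the savepoint sits on a reorged block and must not be rolled back to
    println!("stale savepoint correctly rejected");
}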

Another issue it would solve is something that happened during the testnet3 attack with its constant reorgs. We had an instance of ord running on testnet3 and there was a reorg. It rolled back the 30 or so blocks and reindexed, but immediately hit another reorg before it could create a savepoint, so it rolled back to the previous savepoint, now about 50 blocks back. Once it caught up, the same thing happened, and it eventually ended up in a state where it would roll back about 1000 blocks with the testnet3 craziness going on. This also ended up filling the volume, and we had to scale the volume to 1500GB and change our blocks-indexed-per-run setting to 1 so that it could create savepoints reliably.

@victorkirov
Contributor Author

It would also be good to find out the cause of the volume leak. I'm not sure why the db size keeps growing, or why the space is never reclaimed even once it has recovered from a reorg.

@emilcondrea
Contributor

On space reclaim: in the scenario above it unfortunately never recovered from the reorg; it was stuck in an infinite loop trying to recover. Maybe it's related to #3856. I tried committing twice in the reorg logic, but that did not seem to help.

@raphjaph
Collaborator

I've merged #4169 and #4191. I'll close this as fixed. If it's still broken, please reopen it.
