Reorg handling may be broken in 0.20 #3940

Closed
victorkirov opened this issue Sep 11, 2024 · 18 comments

@victorkirov
Contributor

There was a reorg on Testnet around block 2904360. Ord 0.20 successfully picked it up and started the rollback, but the rollback has been stuck for over 2 hours, whereas it used to be close to instant. This is happening on multiple instances of the indexer, not just one, so it's not an outlier issue.

[2024-09-11T05:51:03Z INFO  ord::index] 5 block deep reorg detected at height 2904360
[2024-09-11T05:51:03Z INFO  ord::index::reorg] rolling back database after reorg of depth 5 at height 2904360

Could something have broken in a new version of redb?

@victorkirov
Contributor Author

victorkirov commented Sep 11, 2024

It also looks like whatever state the process is in is blocking shutdown. After the first SIGTERM, it logs that it's gracefully shutting down and that you can force shutdown by pressing Ctrl-C again. A second SIGTERM usually kills the process at this point, but it's not responding. A SIGKILL also does nothing, though I don't think SIGKILLs are currently handled.

@raphjaph
Collaborator

Hey, thanks for opening this issue! Unfortunately our testnet instance was still running v19 so I can't check this myself. I've just updated the server though. What does it say on the /status page of the server?

SIGKILL should always work. Could you maybe provide ps or top output? You can also send it to me privately at raphjaph AT protonmail.com.

@raphjaph raphjaph added the bug label Sep 13, 2024
@raphjaph raphjaph added this to Tracker Sep 13, 2024
@raphjaph raphjaph moved this to In Progress in Tracker Sep 13, 2024
@raphjaph raphjaph self-assigned this Sep 13, 2024
@raphjaph
Collaborator

I just deliberately put testnet ord into recovery mode and SIGKILL worked for me. Weird that it doesn't work for you. I'll try simulating reorgs on regtest next and see what happens.

@victorkirov
Contributor Author

Unfortunately, I already reverted the instance back to 0.19. I'll start a new 0.20 one and try to simulate a reorg, but it may take some time.

As for SIGKILL, I just tried sending a SIGKILL to a normally running instance and it's ignored completely 👀 Maybe this is due to it running inside a container, but that shouldn't affect the signal, since I'm executing it from a bash terminal inside the container (kill -9 1).

@raphjaph
Collaborator

On regtest it seems to recover without a problem.
I did cargo run env on master, then:

  • bitcoin-cli -datadir=env generatetoaddress 10 <ADDRESS>
  • bitcoin-cli -datadir=env invalidateblock <BLOCK_HASH>
  • bitcoin-cli -datadir=env generatetoaddress 10 <ADDRESS>

[2024-09-17T06:31:58Z INFO  ord::index::updater] Committing at block height 265, 2 outputs traversed, 3 in map, 0 cached
[2024-09-17T06:32:17Z INFO  ord::index] 6 block deep reorg detected at height 265
[2024-09-17T06:32:17Z INFO  ord::index::reorg] rolling back database after reorg of depth 6 at height 265
[2024-09-17T06:32:17Z INFO  ord::index::reorg] successfully rolled back database to height 250

@victorkirov
Contributor Author

Hmm, I guess there may have been something else wrong. It's just really strange that it happened across multiple instances. I'll try upgrading again and see if it happens. I'll close this ticket for now and reopen it if I can reproduce the issue and get more info.

Thanks for looking into it 🙌

@dcorral

dcorral commented Oct 3, 2024

A mainnet reorg is crashing on my side as well. Could we reopen this to see if we can get to the root of the issue? 🙏

@victorkirov victorkirov reopened this Oct 3, 2024
@raphjaph
Collaborator

raphjaph commented Oct 3, 2024

What block was the reorg at and do you have any log outputs?

@dcorral

dcorral commented Oct 4, 2024

I have no logs, sorry, but the block was 863888. I'll report back if anything comes up.

@dcorral

dcorral commented Oct 6, 2024

Got the following while testing on regtest:

  • generate 101 blocks
  • invalidate block 99
  • generate 3 blocks

thread '' panicked at src/index/reorg.rs:66:76:
called Option::unwrap() on a None value
stack backtrace:
0: 0x5969b7ebadb6 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h410d4c66be4e37f9
1: 0x5969b7eeb140 - core::fmt::write::he40921d4802ce2ac
2: 0x5969b7eb697f - std::io::Write::write_fmt::h5de5a4e7037c9b20
3: 0x5969b7ebab94 - std::sys_common::backtrace::print::h11c067a88e3bdb22
4: 0x5969b7ebc417 - std::panicking::default_hook::{{closure}}::h8c832ecb03fde8ea
5: 0x5969b7ebc179 - std::panicking::default_hook::h1633e272b4150cf3
6: 0x5969b7ebc8a8 - std::panicking::rust_panic_with_hook::hb164d19c0c1e71d4
7: 0x5969b7ebc749 - std::panicking::begin_panic_handler::{{closure}}::h0369088c533c20e9
8: 0x5969b7ebb2b6 - std::sys_common::backtrace::__rust_end_short_backtrace::hc11d910daf35ac2e
9: 0x5969b7ebc4d4 - rust_begin_unwind
10: 0x5969b720ebe5 - core::panicking::panic_fmt::ha6effc2775a0749c
11: 0x5969b720eca3 - core::panicking::panic::h44790a89027c670f
12: 0x5969b720eb36 - core::option::unwrap_failed::hcb3a256a9f1ca882
13: 0x5969b73e2dbe - ord::index::reorg::Reorg::handle_reorg::h5ff6b80ae2a95596
14: 0x5969b726160b - ord::index::Index::update::h7a56ebe30e595534
15: 0x5969b7817375 - std::sys_common::backtrace::__rust_begin_short_backtrace::hbf31dcab6cd6a497
16: 0x5969b77686cf - core::ops::function::FnOnce::call_once{{vtable.shim}}::h6189ab0ff9d9cd46
17: 0x5969b7ec1bc5 - std::sys::pal::unix::thread::Thread::new::thread_start::h3631815ad38387d6
18: 0x7bdadcfb439d -
19: 0x7bdadd03949c -
20: 0x0 -

I'm not able to consistently reproduce this error with the steps described above.

ord version: 0.20.0
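
For what it's worth, the panic is an Option::unwrap() on a None value inside the reorg handler, i.e. some lookup during rollback came back empty. Below is a minimal sketch of that failure mode and a non-panicking alternative, assuming (hypothetically) that the rollback target comes from an optional savepoint lookup; this is not ord's actual code.

// Hypothetical sketch of the failure mode, not ord's source: if the rollback
// target comes from an optional lookup (for example the oldest recorded
// savepoint) and nothing is found, `.unwrap()` panics exactly like the
// backtrace above; matching on the Option keeps it an ordinary error.
fn oldest_savepoint_height(savepoints: &[u32]) -> Option<u32> {
    savepoints.iter().copied().min()
}

fn main() {
    // Assume no savepoints were recorded yet, e.g. right after a fresh index.
    let savepoints: Vec<u32> = Vec::new();

    // Panicking version, matching the backtrace above:
    // let target = oldest_savepoint_height(&savepoints).unwrap();

    // Non-panicking version:
    match oldest_savepoint_height(&savepoints) {
        Some(height) => println!("rolling back to savepoint at height {height}"),
        None => eprintln!("reorg detected but no savepoint available; a full reindex may be needed"),
    }
}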

@gus4rs

gus4rs commented Oct 9, 2024

Version 0.20.1 dies after a reorg, doesn't respond to the graceful shutdown signal, and must be killed abruptly, which corrupts the database. This happened to me on testnet3.

@FZelin

FZelin commented Nov 1, 2024

Yep, happened to me on testnet3 also, on 0.21.2. Now it cannot update.

[2024-11-01T18:07:17Z INFO ord::index::reorg] rolling back database after reorg of depth 2 at height 3196602
thread '' panicked at src/index/reorg.rs:66:76:
called Option::unwrap() on a None value
stack backtrace:
0: 0x5fa669aa9c6a - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h304520fd6a30aa07
1: 0x5fa669ad625b - core::fmt::write::hf5713710ce10ff22
2: 0x5fa669aa50b3 - std::io::Write::write_fmt::hda708db57927dacf
3: 0x5fa669aaaf52 - std::panicking::default_hook::{{closure}}::he1ad87607d0c11c5
4: 0x5fa669aaabbe - std::panicking::default_hook::h81c8cd2e7c59ee33
5: 0x5fa669aab7df - std::panicking::rust_panic_with_hook::had2118629c312a4a
6: 0x5fa669aab493 - std::panicking::begin_panic_handler::{{closure}}::h7fa5985d111bafa2
7: 0x5fa669aaa149 - std::sys::backtrace::__rust_end_short_backtrace::h704d151dbefa09c5
8: 0x5fa669aab154 - rust_begin_unwind
9: 0x5fa668dd78b3 - core::panicking::panic_fmt::h3eea515d05f7a35e
10: 0x5fa668dd793c - core::panicking::panic::h102d65dbfa674afe
11: 0x5fa668dd7819 - core::option::unwrap_failed::hfd32652cc6017653
12: 0x5fa668ef1945 - ord::index::reorg::Reorg::handle_reorg::h8da2103b4d327514
13: 0x5fa6691a5c91 - ord::index::Index::update::h5b3331c0e6488b94
14: 0x5fa668de6055 - std::sys::backtrace::__rust_begin_short_backtrace::he5a6a51d16c02dfc
15: 0x5fa668fdaed2 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h0ba0e47ea4e1123b
16: 0x5fa669ab033b - std::sys::pal::unix::thread::Thread::new::thread_start::hcdbd1049068002f4
[2024-11-01T18:07:17Z INFO ord::index::updater] Block receiver disconnected: sending on a closed channel
17: 0x771988294ac3 - start_thread
at ./nptl/pthread_create.c:442:8
18: 0x771988326850 - __GI___clone3
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
19: 0x0 -

@raphjaph
Collaborator

I merged #2365 yesterday. This should fix that issue. I've been running it on https://testnet4.ordinals.com (which has very frequent reorgs) and it's been going smoothly. Please try it out and let me know if it actually fixes this.

@emilcondrea
Contributor

@raphjaph Tested this with 0.22.1 and the reorg logic is still broken, but in a different way.
I am trying to index testnet4, and it indexes all the way to the tip (in my case 63038), then this happens:

  • It commits at the tip and creates the savepoint at the tip.
  • Then a reorg of depth 2 happens. ord tries to roll back to the oldest savepoint, which is still at 63038, and thinks the reorg is recoverable (but it's not, since the savepoint isn't old enough).
  • It reverts to the oldest savepoint, which effectively does nothing.
  • It then repeats this in an infinite loop, and the used disk space grows rapidly, roughly 1 more GB every 2-3 minutes (it ended up filling all the disk space on my Mac overnight).

Opened a PR to fix this: #4169
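
For illustration, here is a minimal sketch of the invariant that scenario violates, assuming only that recoverability is judged from the savepoint height and the reorg depth; this is not the code from #4169.

// Hypothetical sketch, not the actual fix: a reorg of `depth` blocks at
// `tip_height` can only be undone from a savepoint taken at or below the
// fork point. A savepoint created at the tip itself (63038 in the scenario
// above) fails this check, so rolling back to it cannot make progress and
// the indexer loops forever.
fn savepoint_can_undo_reorg(savepoint_height: u32, tip_height: u32, depth: u32) -> bool {
    savepoint_height <= tip_height.saturating_sub(depth)
}

fn main() {
    // A savepoint at the tip cannot undo a 2-block-deep reorg.
    assert!(!savepoint_can_undo_reorg(63038, 63038, 2));
    // A savepoint taken a few blocks earlier can.
    assert!(savepoint_can_undo_reorg(63030, 63038, 2));
    println!("savepoint recoverability checks pass");
}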

@victorkirov
Contributor Author

victorkirov commented Jan 8, 2025

Good catch! That PR will solve a few issues.

The other big one that has caught people out before and required a full reindex is when they use ord in command-line mode rather than as a server, so the index is only updated when they run a command. If they happen to run a command just as a reorg is happening, with the current production logic they will get a savepoint on a reorged block.
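
A rough sketch of the kind of guard that scenario calls for, purely illustrative and under the assumption that a savepoint records the hash of the block it was taken at (this is not code from ord or from any of the PRs mentioned here):

// Illustrative only: if a savepoint records the hash of the block it was
// taken at, it can be validated against the node's current best chain before
// being trusted. A savepoint taken on a block that was later reorged away
// should be discarded instead of rolled back to.
#[derive(Clone, Copy, PartialEq, Eq)]
struct BlockHash([u8; 32]);

struct Savepoint {
    height: u32,
    block_hash: BlockHash,
}

fn savepoint_is_on_best_chain(
    savepoint: &Savepoint,
    best_chain_hash_at_height: impl Fn(u32) -> Option<BlockHash>,
) -> bool {
    best_chain_hash_at_height(savepoint.height) == Some(savepoint.block_hash)
}

fn main() {
    let savepoint = Savepoint { height: 100, block_hash: BlockHash([1; 32]) };
    // Pretend the node now reports a different hash at height 100 after a reorg.
    let usable = savepoint_is_on_best_chain(&savepoint, |_height| Some(BlockHash([2; 32])));
    assert!(!usable); // the savepoint sits on a reorged block and must not be rolled back to
    println!("stale savepoint correctly rejected");
}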

Another issue it would solve is something that happened during the testnet3 attack with its constant reorgs. We had an instance of ord running on testnet3 and there was a reorg. It rolled back the 30 or so blocks and reindexed, but immediately hit another reorg before it could create a savepoint, so it rolled back to the previous savepoint, now about 50 blocks back. Once it caught up, the same thing happened, and it eventually ended up in a state where it would roll back about 1000 blocks with the testnet3 craziness going on. This also ended up filling the volume, and we had to scale the volume to 1500GB and change our blocks-indexed-per-run setting to 1 so that it could create savepoints reliably.

@victorkirov
Contributor Author

It would also be good to find out the cause of the volume leak. I'm not sure why the db size keeps growing, or why the space is never reclaimed even once it has recovered from a reorg.

@emilcondrea
Contributor

On space reclaim: in the scenario above it unfortunately never recovered from the reorg; it was stuck in an infinite loop trying to recover. Maybe it's related to #3856. I tried committing twice in the reorg logic, but that did not seem to help.

@raphjaph
Collaborator

I've merged #4169 and #4191. I'll close this as fixed. If it's still broken, please reopen it.
