This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Recover Database from WAL on Startup #1536

Closed
wants to merge 3 commits into from


apavlo
Member

@apavlo apavlo commented Apr 9, 2021

Description

This PR does two things:

  1. Attempts to recover the database from a WAL file on startup if one already exists and logging is enabled.
  2. Renames TerrierServer to NoisePageServer.

Unfortunately, WAL recovery currently fails because the bootstrap inserts into the catalog are logged during initialization, and recovery then tries to replay them from the WAL:

noisepage: /home/pavlo/wanshenl-is-a-badass/NoisePage/Github/noisepage/src/storage/recovery/recovery_manager.cpp:487: uint32_t noisepage::storage::RecoveryManager::ProcessSpecialCasePGDatabaseRecord(noisepage::transaction::TransactionContext*, std::vector<std::pair<noisepage::storage::LogRecord*, std::vector<std::byte*> > >*, uint32_t): Assertion `(result) && ("Database recreation should succeed")' failed.

Thread 7 "noisepage" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffef3fc700 (LWP 794391)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff78d2859 in __GI_abort () at abort.c:79
#2  0x00007ffff78d2729 in __assert_fail_base (fmt=0x7ffff7a68588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555556961be8 "(result) && (\"Database recreation should succeed\")", 
    file=0x555556960bf0 "/home/pavlo/wanshenl-is-a-badass/NoisePage/Github/noisepage/src/storage/recovery/recovery_manager.cpp", line=487, function=<optimized out>) at assert.c:92
#3  0x00007ffff78e3f36 in __GI___assert_fail (assertion=0x555556961be8 "(result) && (\"Database recreation should succeed\")", 
    file=0x555556960bf0 "/home/pavlo/wanshenl-is-a-badass/NoisePage/Github/noisepage/src/storage/recovery/recovery_manager.cpp", line=487, 
    function=0x555556961920 "uint32_t noisepage::storage::RecoveryManager::ProcessSpecialCasePGDatabaseRecord(noisepage::transaction::TransactionContext*, std::vector<std::pair<noisepage::storage::LogRecord*, std::vector<std::byt"...) at assert.c:101
#4  0x00005555576477b0 in noisepage::storage::RecoveryManager::ProcessSpecialCasePGDatabaseRecord (this=0x7fffffffc510, txn=0x7fffdc01e880, buffered_changes=0x7fffdc0010a0, start_idx=0)
    at /home/pavlo/wanshenl-is-a-badass/NoisePage/Github/noisepage/src/storage/recovery/recovery_manager.cpp:487
#5  0x0000555557647291 in noisepage::storage::RecoveryManager::ProcessSpecialCaseCatalogRecord (this=0x7fffffffc510, txn=0x7fffdc01e880, buffered_changes=0x7fffdc0010a0, start_idx=0)
    at /home/pavlo/wanshenl-is-a-badass/NoisePage/Github/noisepage/src/storage/recovery/recovery_manager.cpp:422
#6  0x0000555557645463 in noisepage::storage::RecoveryManager::ProcessCommittedTransaction (this=0x7fffffffc510, txn_id=...)
    at /home/pavlo/wanshenl-is-a-badass/NoisePage/Github/noisepage/src/storage/recovery/recovery_manager.cpp:134
#7  0x0000555557645840 in noisepage::storage::RecoveryManager::ProcessDeferredTransactions (this=0x7fffffffc510, upper_bound_ts=...)
    at /home/pavlo/wanshenl-is-a-badass/NoisePage/Github/noisepage/src/storage/recovery/recovery_manager.cpp:173
#8  0x0000555557644e99 in noisepage::storage::RecoveryManager::RecoverFromLogs (this=0x7fffffffc510, log_provider=...) at /home/pavlo/wanshenl-is-a-badass/NoisePage/Github/noisepage/src/storage/recovery/recovery_manager.cpp:80
#9  0x0000555557656c40 in noisepage::storage::RecoveryManager::Recover (this=0x7fffffffc510) at ../src/include/storage/recovery/recovery_manager.h:190
#10 0x0000555557656bca in noisepage::storage::RecoveryManager::RecoveryTask::RunTask (this=0x5555592e4000) at ../src/include/storage/recovery/recovery_manager.h:63
#11 0x0000555557665cde in noisepage::common::DedicatedThreadRegistry::RegisterDedicatedThread<noisepage::storage::RecoveryManager::RecoveryTask, noisepage::storage::RecoveryManager*>(noisepage::common::DedicatedThreadOwner*, noisepage::storage::RecoveryManager*)::{lambda()#1}::operator()() const (this=0x555559217270) at ../src/include/common/dedicated_thread_registry.h:82
#12 0x000055555771ecdc in std::__invoke_impl<void, noisepage::common::DedicatedThreadRegistry::RegisterDedicatedThread<noisepage::storage::RecoveryManager::RecoveryTask, noisepage::storage::RecoveryManager*>(noisepage::common::DedicatedThreadOwner*, noisepage::storage::RecoveryManager*)::{lambda()#1}>(std::__invoke_other, noisepage::common::DedicatedThreadRegistry::RegisterDedicatedThread<noisepage::storage::RecoveryManager::RecoveryTask, noisepage::storage::RecoveryManager*>(noisepage::common::DedicatedThreadOwner*, noisepage::storage::RecoveryManager*)::{lambda()#1}&&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#13 0x000055555771dbc9 in std::__invoke<noisepage::common::DedicatedThreadRegistry::RegisterDedicatedThread<noisepage::storage::RecoveryManager::RecoveryTask, noisepage::storage::RecoveryManager*>(noisepage::common::DedicatedThreadOwner*, noisepage::storage::RecoveryManager*)::{lambda()#1}>(noisepage::common::DedicatedThreadRegistry::RegisterDedicatedThread<noisepage::storage::RecoveryManager::RecoveryTask, noisepage::storage::RecoveryManager*>(noisepage::common::DedicatedThreadOwner*, noisepage::storage::RecoveryManager*)::{lambda()#1}&&) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#14 0x000055555771c958 in std::thread::_Invoker<std::tuple<noisepage::common::DedicatedThreadRegistry::RegisterDedicatedThread<noisepage::storage::RecoveryManager::RecoveryTask, noisepage::storage::RecoveryManager*>(noisepage::common::DedicatedThreadOwner*, noisepage::storage::RecoveryManager*)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x5555592e8f68) at /usr/include/c++/9/thread:244
#15 0x000055555771c060 in std::thread::_Invoker<std::tuple<noisepage::common::DedicatedThreadRegistry::RegisterDedicatedThread<noisepage::storage::RecoveryManager::RecoveryTask, noisepage::storage::RecoveryManager*>(noisepage::common::DedicatedThreadOwner*, noisepage::storage::RecoveryManager*)::{lambda()#1}> >::operator()() (this=0x5555592e8f68) at /usr/include/c++/9/thread:251
#16 0x000055555771be24 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<noisepage::common::DedicatedThreadRegistry::RegisterDedicatedThread<noisepage::storage::RecoveryManager::RecoveryTask, noisepage::storage::RecoveryManager*>(noisepage::common::DedicatedThreadOwner*, noisepage::storage::RecoveryManager*)::{lambda()#1}> > >::_M_run() (this=0x5555592e8f60) at /usr/include/c++/9/thread:195

Remaining Tasks

We need to disable WAL entries for the catalog bootstrap transaction.
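To make the failure mode concrete, here is a minimal toy sketch. Every name in it (`ToyCatalog`, `ToyLogRecord`, `Bootstrap`, `Replay`) is hypothetical and none of it is NoisePage code. It only assumes that bootstrap creates a default database and that replaying a "create database" record fails when that database already exists. Passing `nullptr` as the WAL models disabling WAL entries for the bootstrap transaction.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy record: stands in for a logged "create database" insert.
struct ToyLogRecord {
  std::string database_name;
};

// Hypothetical toy catalog; not the NoisePage catalog.
class ToyCatalog {
 public:
  // Returns false if the database already exists. This mirrors the condition
  // behind the failed "Database recreation should succeed" assertion.
  // If wal is nullptr, the write is not logged.
  bool CreateDatabase(const std::string &name, std::vector<ToyLogRecord> *wal) {
    if (!databases_.insert(name).second) return false;
    if (wal != nullptr) wal->push_back(ToyLogRecord{name});
    return true;
  }

  bool Has(const std::string &name) const { return databases_.count(name) > 0; }

 private:
  std::set<std::string> databases_;
};

// Bootstrap creates the default database. Passing wal == nullptr models
// "disable WAL entries for the catalog bootstrap transaction".
inline void Bootstrap(ToyCatalog *catalog, std::vector<ToyLogRecord> *wal) {
  const bool ok = catalog->CreateDatabase("default", wal);
  assert(ok);
  (void)ok;  // avoid an unused-variable warning when assertions are disabled
}

// Replays every record without logging again and returns how many failed.
inline std::size_t Replay(ToyCatalog *catalog, const std::vector<ToyLogRecord> &wal) {
  std::size_t failures = 0;
  for (const auto &record : wal) {
    if (!catalog->CreateDatabase(record.database_name, nullptr)) ++failures;
  }
  return failures;
}
```

In this toy model, a WAL written by a logged bootstrap produces one failed record when it is replayed on top of a freshly bootstrapped catalog, while a WAL written with bootstrap logging disabled replays cleanly.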

https://youtu.be/ie6QcPnBK1E

@apavlo added the in-progress and blocked labels Apr 9, 2021
@mbutrovich
Contributor

mbutrovich commented Apr 9, 2021

> We need to disable WAL entries for the catalog bootstrap transaction.

Just repeating the discussion from Slack so it doesn't get lost. The catalog should already have the correct logic to recover from the WAL (see recovery_test with the recovery DBMain instance). It's just a matter of adding the command-line argument logic to DBMain's builder. At a minimum, we want to skip the create_default_database step; I'm not sure whether additional logic is needed. Again, recovery_test should be instructive.
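A rough sketch of what that builder decision could look like. This is a toy, not the real DBMain builder: `ToyBuilder`, `ToyDBMain`, and both setters are hypothetical names, and the only rule it encodes is the one from the PR description (recover when a WAL file already exists and logging is enabled, and skip default database creation in that case).

```cpp
#include <cassert>

// Hypothetical result type: records which startup path was taken.
struct ToyDBMain {
  bool default_database_created;
  bool recovered_from_wal;
};

// Hypothetical builder; the real DBMain builder has a different interface.
class ToyBuilder {
 public:
  ToyBuilder &SetLoggingEnabled(bool value) {
    logging_enabled_ = value;
    return *this;
  }

  ToyBuilder &SetWalFileExists(bool value) {
    wal_file_exists_ = value;
    return *this;
  }

  ToyDBMain Build() const {
    // Recover only if logging is enabled and a WAL file is already present.
    const bool recover = logging_enabled_ && wal_file_exists_;
    ToyDBMain db{};
    // When recovering, skip default database creation and let WAL replay
    // recreate the catalog contents instead.
    db.default_database_created = !recover;
    db.recovered_from_wal = recover;
    return db;
  }

 private:
  bool logging_enabled_ = false;
  bool wal_file_exists_ = false;
};
```

For example, `ToyBuilder().SetLoggingEnabled(true).SetWalFileExists(true).Build()` takes the recovery path and skips default database creation, while any other combination falls back to a normal bootstrap.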

@mbutrovich
Contributor

@jrolli would probably remember why we decided to make the catalog startup persistent, so that on restart we can choose not to create the default catalog objects and instead restore them from the WAL. Off the top of my head, I'm not sure whether you could end up in an inconsistent state if you skip WAL playback of the catalog startup and always let it default-initialize before eventually resuming WAL playback of user transactions.

@jrolli
Contributor

jrolli commented Apr 9, 2021

@mbutrovich We split the logic at startup to more easily handle future DDL. Since we're logging catalog writes, you'll need to special-case them one way or another. Gus and I figured it was easier in the long run to special-case writes to catalog OIDs and build the objects while piggybacking on the existing tuple translation logic than to have a second custom exception set that ignores some writes and manages a second tuple slot translation layer (for DDL).
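As a toy illustration of that design choice (not the actual RecoveryManager code), the sketch below sends every replayed record through one slot translation map and additionally builds an object when the record targets a catalog table. `ToyRecovery`, `ToyRedo`, and `kToyCatalogDatabaseTableOid` are hypothetical names and the OID value is made up.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Oid = std::uint32_t;
using Slot = std::uint64_t;

// Hypothetical OID of the catalog table that lists databases (value is made up).
constexpr Oid kToyCatalogDatabaseTableOid = 1;

// Hypothetical redo record: a write to old_slot of table_oid.
struct ToyRedo {
  Oid table_oid;
  Slot old_slot;
  std::string payload;
};

class ToyRecovery {
 public:
  void Apply(const ToyRedo &redo) {
    // Every record, catalog or user, uses the same slot translation map.
    const Slot new_slot = next_slot_++;
    slot_map_[redo.old_slot] = new_slot;
    rows_[new_slot] = redo.payload;

    // Special case: a write to a catalog OID also builds the object it describes.
    if (redo.table_oid == kToyCatalogDatabaseTableOid) {
      created_databases_.push_back(redo.payload);
    }
  }

  // Translates a slot from the log into the slot used after recovery.
  Slot Translate(Slot old_slot) const { return slot_map_.at(old_slot); }

  const std::vector<std::string> &CreatedDatabases() const { return created_databases_; }

 private:
  Slot next_slot_ = 0;
  std::unordered_map<Slot, Slot> slot_map_;
  std::unordered_map<Slot, std::string> rows_;
  std::vector<std::string> created_databases_;
};
```

The alternative described above would need a second map and a filter that drops some catalog writes. Here there is a single translation path, and the catalog case is one extra branch on the table OID.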

apavlo added 2 commits April 16, 2021 20:04
…he WAL. This is currently broken because it tries to recreate the catalogs from the WAL but the catalog obviously already exists...
@apavlo apavlo force-pushed the walrecovery-apr2021 branch from 6b0ffb9 to bece91f Compare April 17, 2021 00:04
… hang and still load catalog... not sure what is going on yet...
@apavlo
Member Author

apavlo commented May 25, 2021

Dead


4 participants