Replication deadlock when replica times out #3649

andydunstall · 2024-09-04T15:54:43Z

We're seeing a Dragonfly replication deadlock on our test suite, which seems to happen when a replica times out.

I can reproduce the replication deadlock locally (or at least what I'm assuming is the same issue).

Running two Dragonfly processes, but intentionally setting the master process replication_timeout to a very small value (100ms) to force the replica to timeout:

./dragonfly --alsologtostderr --dbfilename= --port 7000 --replication_timeout 100
./dragonfly --alsologtostderr --dbfilename= --port 7001

Then start populating the master with 5GB:

redis-cli -p 7000 debug populate 5000000 test 1000 rand

Then while its populating configure the replica:

redis-cli -p 7001 replicaof localhost 7000

The replica will timeout as expected, but then the master is partially deadlocked.

INFO hangs even after you shutdown the replica, and attempting to add another replica hangs as the master is unresponsive.

Running v1.22.0 on AWS t4g.medium. I added some logs and I think its a deadlock attempting to lock DflyCmd::mu_ but thats as far as I got

Edit: poking around a bit out of curiosity, DflyCmd::BreakStalledFlowsInShard never releases mu_ as it blocks on replica_ptr->Cancel()

The text was updated successfully, but these errors were encountered:

romange · 2024-09-04T16:52:33Z

Thanks @andydunstall , I confirm I can easily reproduce it locally.

romange · 2024-09-04T18:25:24Z

Seems that the callback below does not proceed. Checking whether it's join or the cleanup callback.

FlowInfo* flow = &flows[shard->shard_id()];
    if (flow->cleanup) {
      flow->cleanup();
    }

    flow->full_sync_fb.JoinIfNeeded();
    flow->conn = nullptr;

romange · 2024-09-04T18:54:14Z

so flow->cleanup(); deadlock, which corresponds to RdbSaver::Cancel call.

Seems the following deadlock happens:

0x55feeae59b5c  std::lock_guard<>::lock_guard()
0x55feeae579d5  dfly::journal::JournalSlice::UnregisterOnChange()
0x55feeae4d01c  dfly::journal::Journal::UnregisterOnChange()
0x55feeaad4c06  dfly::SliceSnapshot::Cancel()
0x55feeaa7e08f  dfly::RdbSaver::Impl::Cancel()

and then:

0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feea705c46  util::fb2::EventCount::wait()
0x55feeaadfe22  util::fb2::SimpleChannel<>::Push<>()
0x55feeaade204  dfly::SizeTrackingChannel<>::Push()
0x55feeaad6771  dfly::SliceSnapshot::Serialize()
0x55feeaad69ee  dfly::SliceSnapshot::PushSerializedToChannel()
0x55feeaad6e74  dfly::SliceSnapshot::OnJournalEntry()

which is stuck on RecordChannel not being emptied. but then again dfly::RdbSaver::Cancel() is stuck on SliceSnapshot::Cancel() so it can not empty the channel (the code comes after that).

I repeat again - our inter component dependencies are very complicated and using mutexes leads to unexpected outcomes.
In this case I suspect (not certain though) that the deadlock is around JournalSlice::cb_mu_ mutex: FYI @chakaz @adiholden .

romange · 2024-09-04T18:57:12Z

The full stacktrace is here:


I20240904 21:26:07.579999 1815618 scheduler.cc:461] ------------ Fiber DflyConnection_repl_ctr (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feeb137674  util::fb2::Mutex::lock()
0x55feea77d3f4  util::fb2::LockGuard<>::LockGuard()
0x55feea7daa82  dfly::DflyCmd::GetReplicaInfo()
0x55feea7da3f9  dfly::DflyCmd::StopReplication()
0x55feea7da3ba  dfly::DflyCmd::OnClose()
0x55feeab199c2  dfly::ServerFamily::OnClose()
0x55feea9d6137  dfly::Service::OnConnectionClose()
0x55feeb018829  facade::Connection::ConnectionFlow()
I20240904 21:26:07.589468 1815618 scheduler.cc:461] ------------ Fiber DflyConnection_repl_flo (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feeb137674  util::fb2::Mutex::lock()
0x55feea77d3f4  util::fb2::LockGuard<>::LockGuard()
0x55feea7daa82  dfly::DflyCmd::GetReplicaInfo()
0x55feea7da3f9  dfly::DflyCmd::StopReplication()
0x55feea7da3ba  dfly::DflyCmd::OnClose()
0x55feeab199c2  dfly::ServerFamily::OnClose()
0x55feea9d6137  dfly::Service::OnConnectionClose()
0x55feeb018829  facade::Connection::ConnectionFlow()

I20240904 21:26:07.613564 1815617 scheduler.cc:461] ------------ Fiber shard_queue_0/0 (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feea705c46  util::fb2::EventCount::wait()
0x55feeaadfe22  util::fb2::SimpleChannel<>::Push<>()
0x55feeaade204  dfly::SizeTrackingChannel<>::Push()
0x55feeaad6771  dfly::SliceSnapshot::Serialize()
0x55feeaad69ee  dfly::SliceSnapshot::PushSerializedToChannel()
0x55feeaad6e74  dfly::SliceSnapshot::OnJournalEntry()
0x55feeaae5e5d  std::__invoke_impl<>()
0x55feeaae5507  std::__invoke<>()
I20240904 21:26:07.629513 1815617 scheduler.cc:461] ------------ Fiber l2_queue_0/0 (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feea705c46  util::fb2::EventCount::wait()
0x55feeb1b660b  util::fb2::EventCount::await<>()
0x55feeb1b5c0b  util::fb2::FiberQueue::Run()
0x55feeaf2bae1  dfly::TaskQueue::Start()::{lambda()#1}::operator()()
0x55feeaf2caca  std::__invoke_impl<>()
0x55feeaf2c809  std::__invoke<>()
0x55feeaf2c641  std::__apply_impl<>()
0x55feeaf2c67b  std::apply<>()
I20240904 21:26:07.629549 1815617 scheduler.cc:461] ------------ Fiber l2_queue_0/1 (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feea705c46  util::fb2::EventCount::wait()
0x55feeb1b660b  util::fb2::EventCount::await<>()
0x55feeb1b5c0b  util::fb2::FiberQueue::Run()
0x55feeaf2bae1  dfly::TaskQueue::Start()::{lambda()#1}::operator()()
0x55feeaf2caca  std::__invoke_impl<>()
0x55feeaf2c809  std::__invoke<>()
0x55feeaf2c641  std::__apply_impl<>()
0x55feeaf2c67b  std::apply<>()
I20240904 21:26:07.629580 1815617 scheduler.cc:461] ------------ Fiber shard_periodic0 (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feea705c46  util::fb2::EventCount::wait()
0x55feea788ca2  util::fb2::EventCount::await<>()
0x55feea787e8f  util::fb2::EmbeddedBlockingCounter::Wait()
0x55feea7dd8cb  dfly::EngineShardSet::RunBlockingInParallel<>()
0x55feea7dca6b  dfly::EngineShardSet::RunBlockingInParallel<>()
0x55feea7d4efd  dfly::DflyCmd::ReplicaInfo::Cancel()
0x55feea7da87b  dfly::DflyCmd::BreakStalledFlowsInShard()
0x55feea9c81a5  dfly::Service::Init()::{lambda()#4}::operator()()
I20240904 21:26:07.640745 1815617 scheduler.cc:461] ------------ Fiber DflyConnection (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feeb159958  util::fb2::detail::FiberInterface::Join()
0x55feeb129cd3  util::fb2::Fiber::Join()
0x55feea79b0fd  dfly::DebugCmd::Populate()
0x55feea798e95  dfly::DebugCmd::Run()
0x55feeab1d65d  dfly::ServerFamily::Debug()
0x55feeab0cae4  dfly::(anonymous namespace)::HandlerFunc()::{lambda()#1}::operator()()
0x55feeab4bc37  fu2::abi_400::detail::invocation::invoke<>()
0x55feeab442ee  fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::internal_invoker<>::invoke()
I20240904 21:26:07.653774 1815617 scheduler.cc:461] ------------ Fiber populate_range (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feea705c46  util::fb2::EventCount::wait()
0x55feea7a26b0  util::fb2::FiberQueue::Add<>()
0x55feea7a02a5  dfly::TaskQueue::Add<>()
0x55feea79b32c  dfly::EngineShardSet::Add<>()
0x55feea79b9a8  dfly::DebugCmd::PopulateRangeFiber()
0x55feea79ad0a  dfly::DebugCmd::Populate()::{lambda()#1}::operator()()
0x55feea7b4913  std::__invoke_impl<>()
0x55feea7b2711  std::__invoke<>()
I20240904 21:26:07.653806 1815617 scheduler.cc:461] ------------ Fiber DflyConnection_repl_flo (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feeb137674  util::fb2::Mutex::lock()
0x55feea77d3f4  util::fb2::LockGuard<>::LockGuard()
0x55feea7daa82  dfly::DflyCmd::GetReplicaInfo()
0x55feea7da3f9  dfly::DflyCmd::StopReplication()
0x55feea7da3ba  dfly::DflyCmd::OnClose()
0x55feeab199c2  dfly::ServerFamily::OnClose()
0x55feea9d6137  dfly::Service::OnConnectionClose()
0x55feeb018829  facade::Connection::ConnectionFlow()
I20240904 21:26:07.653836 1815617 scheduler.cc:461] ------------ Fiber Dispatched (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feea705c46  util::fb2::EventCount::wait()
0x55feea7f509c  util::fb2::EventCount::await<>()
0x55feea7f3845  util::fb2::SharedMutex::lock()
0x55feeae59b5c  std::lock_guard<>::lock_guard()
0x55feeae579d5  dfly::journal::JournalSlice::UnregisterOnChange()
0x55feeae4d01c  dfly::journal::Journal::UnregisterOnChange()
0x55feeaad4c06  dfly::SliceSnapshot::Cancel()
0x55feeaa7e08f  dfly::RdbSaver::Impl::Cancel()
I20240904 21:26:07.664995 1815617 scheduler.cc:461] ------------ Fiber stop_replication (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feeb137674  util::fb2::Mutex::lock()
0x55feea77d3f4  util::fb2::LockGuard<>::LockGuard()
0x55feea7daa82  dfly::DflyCmd::GetReplicaInfo()
0x55feea7da3f9  dfly::DflyCmd::StopReplication()
0x55feea806581  std::__invoke_impl<>()
0x55feea804b85  std::__invoke<>()
0x55feea801f1d  std::__apply_impl<>()
0x55feea801f62  std::apply<>()
I20240904 21:26:07.669771 1815617 scheduler.cc:461] ------------ Fiber DflyConnection (suspended) ------------
0x55feeb15b7f3  std::function<>::operator()()
0x55feeb15a7dc  util::fb2::detail::FiberInterface::ExecuteOnFiberStack()::{lambda()#1}::operator()()
0x55feeb15b47b  boost::context::detail::fiber_ontop<>()
0x55feeb15ae38  boost::context::fiber::resume_with<>()
0x55feeb15a188  util::fb2::detail::FiberInterface::SwitchTo()
0x55feeb14fa0a  util::fb2::detail::Scheduler::Preempt()
0x55feea704a04  util::fb2::detail::FiberInterface::Suspend()
0x55feeb137674  util::fb2::Mutex::lock()
0x55feea77d3f4  util::fb2::LockGuard<>::LockGuard()
0x55feea7d9efe  dfly::DflyCmd::CreateSyncSession()
0x55feeab275ef  dfly::ServerFamily::ReplConf()
0x55feeab0cae4  dfly::(anonymous namespace)::HandlerFunc()::{lambda()#1}::operator()()
0x55feeab4bc37  fu2::abi_400::detail::invocation::invoke<>()
0x55feeab442ee  fu2::abi_400::detail::type_erasure::invocation_table::function_trait<>::internal_invoker<>::invoke()
0x55feeae21437  fu2::abi_400::detail::type_erasure::tables::vtable<>::invoke<>()

romange · 2024-09-04T19:05:41Z

#3171 is also related why we added a mutex.

romange · 2024-09-05T09:43:11Z

possibly reversing socket shutdown and cancel might solve this issue:

    flow->saver->Cancel();
    flow->TryShutdownSocket();
    flow->full_sync_fb.JoinIfNeeded();
    flow->saver.reset();

andydunstall added the bug Something isn't working label Sep 4, 2024

romange added the important higher priority than the usual ongoing development tasks label Sep 4, 2024

romange added the urgent Important issue that needs to be fixed asap label Sep 4, 2024

kostasrim mentioned this issue Sep 7, 2024

chore: add SimpleChannel::NotifyProducer function romange/helio#314

Open

This was referenced Sep 9, 2024

fix(server): fix replication master deadlock on cancelation flow #3686

Merged

test: add test for replication deadlock on replication timeout #3691

Merged

adiholden closed this as completed in #3686 Sep 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replication deadlock when replica times out #3649

Replication deadlock when replica times out #3649

andydunstall commented Sep 4, 2024 •

edited

Loading

romange commented Sep 4, 2024

romange commented Sep 4, 2024

romange commented Sep 4, 2024 •

edited

Loading

romange commented Sep 4, 2024

romange commented Sep 4, 2024

romange commented Sep 5, 2024

Replication deadlock when replica times out #3649

Replication deadlock when replica times out #3649

Comments

andydunstall commented Sep 4, 2024 • edited Loading

romange commented Sep 4, 2024

romange commented Sep 4, 2024

romange commented Sep 4, 2024 • edited Loading

romange commented Sep 4, 2024

romange commented Sep 4, 2024

romange commented Sep 5, 2024

andydunstall commented Sep 4, 2024 •

edited

Loading

romange commented Sep 4, 2024 •

edited

Loading