
[Performance] (Version: 1.10) m_mutex lock contention seen during the last consensus phase, which is contributing to the consensus latency #4481

Open
sophiax851 opened this issue Mar 30, 2023 · 0 comments

Issue Description

When rippled is under a high payment-transaction load, we've noticed high latency at the end of the consensus process, after all transactions have been applied and a new ledger is being built. The stack traces below show the calls suspected of competing for the same last-published-ledger lock (m_mutex), causing contention that slows the final phase of consensus.

The following call can come concurrently from multiple threads, all competing for the same lock:
Thread 13 (Thread 0x7f0ba8ff9700 (LWP 3330235)):
#0 __lll_lock_wait (futex=futex@entry=0x558f2e22dfc8, private=0) at lowlevellock.c:52
#1 0x00007f0bbac86131 in __GI___pthread_mutex_lock (mutex=0x558f2e22dfc8) at ../nptl/pthread_mutex_lock.c:115
#2 0x0000558f2be1bfe7 in __gthread_mutex_lock (__mutex=) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3 __gthread_recursive_mutex_lock (__mutex=) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811
#4 std::recursive_mutex::lock (this=) at /usr/include/c++/9/mutex:106
#5 0x0000558f2c2754ac in std::lock_guard<std::recursive_mutex>::lock_guard (__m=..., this=) at /usr/include/c++/9/bits/std_mutex.h:158
#6 ripple::LedgerMaster::checkAccept (this=0x558f2e22dfb0, hash=..., seq=seq@entry=75271254) at /space/tools/newAMM/rippled/src/ripple/app/ledger/impl/LedgerMaster.cpp:994
#7 0x0000558f2c2adaf6 in ripple::handleNewValidation (app=..., val=..., source=...) at /space/tools/newAMM/rippled/src/ripple/app/consensus/RCLValidations.cpp:180
#8 0x0000558f2c10ba3e in ripple::NetworkOPsImp::recvValidation (this=this@entry=0x558f2e22f6f0, val=..., source=...) at /space/tools/newAMM/rippled/src/ripple/app/misc/NetworkOPs.cpp:2265
#9 0x0000558f2be72021 in ripple::PeerImp::checkValidation (this=0x7f0bb4094b80, val=..., key=..., packet=...) at /usr/include/c++/9/bits/basic_string.h:6555
#10 0x0000558f2be72202 in ripple::PeerImp::<lambda()>::operator() (__closure=) at /space/tools/newAMM/rippled/src/ripple/overlay/impl/PeerImp.cpp:2615
#11 ripple::ClosureCounter::Substitute<ripple::PeerImp::onMessage(const std::shared_ptr<protocol::TMValidation>&)::<lambda()> >::operator() (this=) at /space/tools/newAMM/rippled/src/ripple/core/ClosureCounter.h:140
#12 std::_Function_handler<void(), ripple::ClosureCounter::Substitute<ripple::PeerImp::onMessage(const std::shared_ptr<protocol::TMValidation>&)::<lambda()> > >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#13 0x0000558f2bfde71a in std::function<void ()>::operator()() const (this=0x7f0ba8ff8710) at /usr/include/c++/9/bits/std_function.h:683
#14 ripple::Job::doJob (this=this@entry=0x7f0ba8ff8700) at /space/tools/newAMM/rippled/src/ripple/core/impl/Job.cpp:68
#15 0x0000558f2bfc7399 in ripple::JobQueue::processTask (this=0x558f2e1ef8d0, instance=8) at /space/tools/newAMM/rippled/src/ripple/core/impl/JobQueue.cpp:374
#16 0x0000558f2bfc6517 in ripple::Workers::Worker::run (this=0x558f2e1f0dd0) at /space/tools/newAMM/rippled/src/ripple/core/impl/Workers.cpp:241
#17 0x0000558f2ce19654 in execute_native_thread_routine ()
#18 0x00007f0bbac83609 in start_thread (arg=) at pthread_create.c:477
#19 0x00007f0bbaba6133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The above can compete with the following calls:

Thread 6 (Thread 0x7f0bb897e700 (LWP 3330228)):
#0 __lll_lock_wait (futex=futex@entry=0x558f2e22dfc8, private=0) at lowlevellock.c:52
#1 0x00007f0bbac86131 in __GI___pthread_mutex_lock (mutex=0x558f2e22dfc8) at ../nptl/pthread_mutex_lock.c:115
#2 0x0000558f2be1bfe7 in __gthread_mutex_lock (__mutex=) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3 __gthread_recursive_mutex_lock (__mutex=) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811
#4 std::recursive_mutex::lock (this=) at /usr/include/c++/9/mutex:106
#5 0x0000558f2c27cdef in std::lock_guard<std::recursive_mutex>::lock_guard (__m=..., this=) at /usr/include/c++/9/bits/std_mutex.h:158
#6 ripple::LedgerMaster::setFullLedger (isSynchronous=true, isCurrent=true, ledger=..., this=0x558f2e22dfb0) at /space/tools/newAMM/rippled/src/ripple/app/ledger/impl/LedgerMaster.cpp:942
#7 ripple::LedgerMaster::doAdvance (this=0x558f2e22dfb0, sl=...) at /space/tools/newAMM/rippled/src/ripple/app/ledger/impl/LedgerMaster.cpp:2115
#8 0x0000558f2c27dba3 in ripple::LedgerMaster::<lambda()>::operator()(void) const (__closure=) at /space/tools/newAMM/rippled/src/ripple/app/ledger/impl/LedgerMaster.cpp:1473
#9 0x0000558f2bfde71a in std::function<void ()>::operator()() const (this=0x7f0bb897d710) at /usr/include/c++/9/bits/std_function.h:683
#10 ripple::Job::doJob (this=this@entry=0x7f0bb897d700) at /space/tools/newAMM/rippled/src/ripple/core/impl/Job.cpp:68
#11 0x0000558f2bfc7399 in ripple::JobQueue::processTask (this=0x558f2e1ef8d0, instance=1) at /space/tools/newAMM/rippled/src/ripple/core/impl/JobQueue.cpp:374
#12 0x0000558f2bfc6517 in ripple::Workers::Worker::run (this=0x558f2e1eff60) at /space/tools/newAMM/rippled/src/ripple/core/impl/Workers.cpp:241
#13 0x0000558f2ce19654 in execute_native_thread_routine ()
#14 0x00007f0bbac83609 in start_thread (arg=) at pthread_create.c:477
#15 0x00007f0bbaba6133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

And

Thread 9 (Thread 0x7fc6417fa700 (LWP 579343)):
#0 __lll_lock_wait (futex=futex@entry=0x55d1874caf68, private=0) at lowlevellock.c:52
#1 0x00007fc649461131 in __GI___pthread_mutex_lock (mutex=0x55d1874caf68) at ../nptl/pthread_mutex_lock.c:115
#2 0x000055d18473e187 in __gthread_mutex_lock (__mutex=) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3 __gthread_recursive_mutex_lock (__mutex=) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811
#4 std::recursive_mutex::lock (this=) at /usr/include/c++/9/mutex:106
#5 0x000055d184b92c1c in std::lock_guard<std::recursive_mutex>::lock_guard (__m=..., this=) at /usr/include/c++/9/bits/std_mutex.h:158
#6 ripple::LedgerMaster::checkAccept (this=0x55d1874caf50, ledger=...) at /space/tools/newAMM/rippledDev/src/ripple/app/ledger/impl/LedgerMaster.cpp:1049
#7 0x000055d184bb65a0 in ripple::InboundLedger::<lambda()>::operator() (__closure=0x7fc6463e6518) at /usr/include/c++/9/ext/atomicity.h:96
#8 ripple::ClosureCounter::Substitute<ripple::InboundLedger::done()::<lambda()> >::operator() (this=0x7fc6463e6510) at /space/tools/newAMM/rippledDev/src/ripple/core/ClosureCounter.h:140
#9 std::_Function_handler<void(), ripple::ClosureCounter::Substitute<ripple::InboundLedger::done()::<lambda()> > >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#10 0x000055d18490067a in std::function<void ()>::operator()() const (this=0x7fc6417f9710) at /usr/include/c++/9/bits/std_function.h:683
#11 ripple::Job::doJob (this=this@entry=0x7fc6417f9700) at /space/tools/newAMM/rippledDev/src/ripple/core/impl/Job.cpp:68
#12 0x000055d1848e92f9 in ripple::JobQueue::processTask (this=0x55d18748c900, instance=4) at /space/tools/newAMM/rippledDev/src/ripple/core/impl/JobQueue.cpp:374
#13 0x000055d1848e8477 in ripple::Workers::Worker::run (this=0x55d18748d5c0) at /space/tools/newAMM/rippledDev/src/ripple/core/impl/Workers.cpp:241
#14 0x000055d185737654 in execute_native_thread_routine ()
#15 0x00007fc64945e609 in start_thread (arg=) at pthread_create.c:477
#16 0x00007fc649381133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

And even server_info calls that were periodically monitoring the server's health:

Thread 10 (Thread 0x7fc640ff9700 (LWP 579344)):
#0 __lll_lock_wait (futex=futex@entry=0x55d1874caf68, private=0) at lowlevellock.c:52
#1 0x00007fc649461131 in __GI___pthread_mutex_lock (mutex=0x55d1874caf68) at ../nptl/pthread_mutex_lock.c:115
#2 0x000055d184b8021c in __gthread_mutex_lock (__mutex=0x55d1874caf68) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3 __gthread_recursive_mutex_lock (__mutex=0x55d1874caf68) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811
#4 std::recursive_mutex::lock (this=0x55d1874caf68) at /usr/include/c++/9/mutex:106
#5 std::lock_guard<std::recursive_mutex>::lock_guard (__m=..., this=) at /usr/include/c++/9/bits/std_mutex.h:159
#6 ripple::LedgerMaster::getPublishedLedger (this=0x55d1874caf50) at /space/tools/newAMM/rippledDev/src/ripple/app/ledger/impl/LedgerMaster.cpp:1692
#7 0x000055d184a365e4 in ripple::NetworkOPsImp::getServerInfo (this=0x55d1874cc690, human=, admin=, counters=) at /space/tools/newAMM/rippledDev/src/ripple/app/misc/NetworkOPs.cpp:2626
#8 0x000055d184711f55 in ripple::doServerInfo (context=...) at /space/tools/newAMM/rippledDev/src/ripple/rpc/handlers/ServerInfo.cpp:37
#9 0x000055d184576a72 in ripple::RPC::(anonymous namespace)::<lambda(ripple::RPC::JsonContext&, Json::Value&)>::operator() (this=, result=..., context=...) at /space/tools/newAMM/rippledDev/src/ripple/rpc/impl/Handler.cpp:34
#10 std::_Function_handler<ripple::RPC::Status(ripple::RPC::JsonContext&, Json::Value&), ripple::RPC::(anonymous namespace)::byRef(const Function&) [with Function = Json::Value (*)(ripple::RPC::JsonContext&); ripple::RPC::Handler::Method<Json::Value> = std::function<ripple::RPC::Status(ripple::RPC::JsonContext&, Json::Value&)>]::<lambda(ripple::RPC::JsonContext&, Json::Value&)> >::_M_invoke(const std::_Any_data &, ripple::RPC::JsonContext &, Json::Value &) (__functor=..., __args#0=..., __args#1=...) at /usr/include/c++/9/bits/std_function.h:286
#11 0x000055d18458765e in std::function<ripple::RPC::Status (ripple::RPC::JsonContext&, Json::Value&)>::operator()(ripple::RPC::JsonContext&, Json::Value&) const (__args#1=..., __args#0=..., this=0x7fbec4b661d0) at /usr/include/c++/9/bits/std_function.h:683
#12 ripple::RPC::(anonymous namespace)::callMethod<Json::Value, std::function<ripple::RPC::Status(ripple::RPC::JsonContext&, Json::Value&)> >(ripple::RPC::JsonContext &, std::function<ripple::RPC::Status(ripple::RPC::JsonContext&, Json::Value&)>, const std::string &, Json::Value &) (context=..., method=..., name=..., result=...) at /space/tools/newAMM/rippledDev/src/ripple/rpc/impl/RPCHandler.cpp:200
#13 0x000055d18458b786 in ripple::RPC::doCommand (context=..., result=...) at /usr/include/c++/9/ext/new_allocator.h:80
#14 0x000055d18458d9db in ripple::ServerHandlerImp::processRequest(ripple::Port const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, beast::IP::Endpoint const&, std::function<void (boost::basic_string_view<char, std::char_traits<char> > const&)>&&, std::shared_ptr<ripple::JobQueue::Coro>, boost::basic_string_view<char, std::char_traits<char> >, boost::basic_string_view<char, std::char_traits<char> >) (this=0x55d187440be0, port=..., request=..., remoteIPAddress=..., output=..., coro=..., forwardedFor=..., user=...) at /space/tools/newAMM/rippledDev/src/ripple/rpc/impl/ServerHandlerImp.cpp:881
#15 0x000055d18458ec06 in ripple::ServerHandlerImp::processSession (this=0x55d187440be0, session=..., coro=...) at /usr/include/c++/9/bits/shared_ptr_base.h:1020
#16 0x000055d184590d5a in ripple::ServerHandlerImp::<lambda(std::shared_ptr<ripple::JobQueue::Coro>)>::operator() (coro=..., __closure=0x7fbec4b66a90) at /usr/include/c++/9/ext/atomicity.h:96
#17 ripple::JobQueue::Coro::<lambda(boost::coroutines::asymmetric_coroutine::push_type&)>::operator() (do_yield=..., this=) at /space/tools/newAMM/rippledDev/src/ripple/core/Coro.ipp:44
#18 boost::coroutines::detail::pull_coroutine_object<boost::coroutines::push_coroutine<void>, void, ripple::JobQueue::Coro::Coro(ripple::Coro_create_t, ripple::JobQueue&, ripple::JobType, const string&, F&&) [with F = ripple::ServerHandlerImp::onRequest(ripple::Session&)::<lambda(std::shared_ptr<ripple::JobQueue::Coro>)>; std::string = std::__cxx11::basic_string<char>]::<lambda(boost::coroutines::asymmetric_coroutine<void>::push_type&)>, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> >::run (this=0x7fbec4b66a08) at /space/tools/newAMM/boost_1_75_0/boost/coroutine/detail/pull_coroutine_object.hpp:290
#19 boost::coroutines::detail::trampoline_pull<boost::coroutines::detail::pull_coroutine_object<boost::coroutines::push_coroutine<void>, void, ripple::JobQueue::Coro::Coro(ripple::Coro_create_t, ripple::JobQueue&, ripple::JobType, const string&, F&&) [with F = ripple::ServerHandlerImp::onRequest(ripple::Session&)::<lambda(std::shared_ptr<ripple::JobQueue::Coro>)>]::<lambda(boost::coroutines::asymmetric_coroutine<void>::push_type&)>, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> > >(boost::context::detail::transfer_t) (t=...) at /space/tools/newAMM/boost_1_75_0/boost/coroutine/detail/trampoline_pull.hpp:41
#20 0x000055d18545ed9f in make_fcontext ()
#21 0xcf8627e9fdfb13e6 in ?? ()

One piece of evidence appears in the rippled log below, where a server_info call was delayed after being caught in the middle of the lock contention:
2023-Mar-07 19:56:13.419690149 UTC LedgerConsensus:DBG Consensus ledger fully validated
2023-Mar-07 19:56:13.419694523 UTC LedgerConsensus:DBG Test applying disputed transaction that did not get in 9A1C64A5597001F93CDBFC46CDFB58301B424EF1B0EE4937F3F61082E3A4B2D3
2023-Mar-07 19:56:13.419698767 UTC LedgerMaster:DBG tryAdvance publishing seq 75974816
2023-Mar-07 19:56:13.419709691 UTC LedgerMaster:DBG Ledger 75974816 accepted :C4B76637ECD0C93A0028D0CEB898AFDE1CF35B05B42E745AD1D622442E899712
2023-Mar-07 19:56:13.419716472 UTC LedgerConsensus:DBG Test applying disputed transaction that did not get in 95C5693B260E758B386C90439D1530AEE944BD237F4CE6E63C2585C8E7A528AD
The last disputed transaction follows; note the roughly 5-second gap between it and the next log entry:
2023-Mar-07 19:56:13.425296440 UTC LedgerConsensus:DBG Test applying disputed transaction that did not get in A771C676B68ECB01876C3572FC1B05F61C140D9FB4A08901E3A21B16DB8261B0
2023-Mar-07 19:56:18.410071941 UTC LedgerConsensus:NFO We closed at 731534163
2023-Mar-07 19:56:18.414978656 UTC Server:WRN RPC request processing duration = 4623019 microseconds. request =

Steps to Reproduce

Run heavy payment transactions load testing

Expected Result

Actual Result

Environment

Performance lab environment, 5 validators and 9 total nodes in the network

Supporting Files
