[DocDB] Increased follower lag observed in production with stuck OutboundCall #18744
Comments
…uck OutboundCall

Summary:
In production, we saw a few cases where follower lag was continuously increasing for a few tablets. We also noticed that the lagging replica was not removed from the quorum by ConsensusQueue. After capturing a dump of the node hosting the leader, we determined that the performing_update_mutex had been held for a long time for the affected Peer. A Peer acquires the performing_update_mutex while building a request to send to its peer and keeps the mutex locked until it receives a response. In the captured dump, the OutboundCall was in the SENT state, but we could not confirm whether the connection it was sent on was still active -- an analysis of the OutboundCall references suggests the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer (a simplified sketch of this check follows this comment):
* Whenever we try to send more data or a heartbeat to the peer (in `SignalRequest`), we check whether we can acquire the `performing_update_mutex`. If the mutex is already held, we check whether more than `FLAGS_stuck_peer_call_threshold_ms` plus the request timeout has elapsed since the call start time.
* If the lock has been held for longer than the request timeout plus `FLAGS_stuck_peer_call_threshold_ms`, we log additional details that can help identify the root cause of the issue (see below for examples).
* When `FLAGS_force_recover_from_stuck_peer_call` is set to true, we also try to mark the stuck call as failed. If the call object is not present, we cannot recover from this situation.

Another change: whenever a Connection encounters a write failure, instead of destroying the connection immediately, the operation queues a reactor task so that all queued operations on the socket are executed in order. However, since the socket is closed, every queued operation hits a write failure and schedules its own DestroyConnection task. After the first DestroyConnection task executes, the connection can no longer be found among the reactor-tracked connections, which leads to a CHECK failure. To prevent multiple DestroyConnection tasks for a single connection, we now track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation:
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```
1. OutboundCall timing details - the time when the call was created, when it was sent on the network, and when the callback was received.
2. Connection - we dump the state of the connection on which the call was sent, including whether the connection is alive. If alive, we log the active-call count and whether the current call is present in the active calls. If not alive, we log the time the connection was shut down, the number of active calls present during shutdown, and the queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For example, in the sample logs above, the connection was closed after the data had already been sent on the socket.

Example trace for stuck call detection in Peer:
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```

Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei
Reviewed By: mbautin
Subscribers: amitanand, yql, yyan, rthallam, ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D27859
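To make the try-lock detection described in the bullets above concrete, here is a minimal C++ sketch. It is not the actual YugabyteDB implementation: the types (`PeerSketch`, `OutboundCallSketch`) and the constants standing in for the request timeout, `FLAGS_stuck_peer_call_threshold_ms`, and `FLAGS_force_recover_from_stuck_peer_call` are hypothetical simplifications; only the general pattern follows the commit message.

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>

using Clock = std::chrono::steady_clock;

// Stand-in for the in-flight RPC; the real OutboundCall tracks much more state.
struct OutboundCallSketch {
  Clock::time_point start_time = Clock::now();
  bool finished = false;
  void MarkFailed(const char* reason) {
    finished = true;
    std::cerr << "call marked failed: " << reason << "\n";
  }
};

class PeerSketch {
 public:
  // Rough analogue of Peer::SignalRequest(): if the update mutex is free, send
  // the next request; if it is already held, check whether the in-flight call
  // has exceeded timeout + stuck threshold and, optionally, force-fail it.
  void SignalRequest() {
    std::unique_lock<std::mutex> lock(performing_update_mutex_, std::try_to_lock);
    if (lock.owns_lock()) {
      // ... build and send the next request while holding the lock ...
      return;
    }
    // NOTE: real code must synchronize access to the call object here; doing
    // this naively is exactly the data race fixed in the follow-up commit.
    if (!call_ || call_->finished) {
      return;
    }
    const auto elapsed = Clock::now() - call_->start_time;
    if (elapsed > kRequestTimeout + kStuckThreshold) {
      std::cerr << "Found an RPC call in stuck state, elapsed ms: "
                << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
                << "\n";
      if (kForceRecover) {
        call_->MarkFailed("stuck call detected by peer");
      }
    }
  }

 private:
  // Hypothetical stand-ins for the request timeout and the two flags above.
  static constexpr auto kRequestTimeout = std::chrono::milliseconds(3000);
  static constexpr auto kStuckThreshold = std::chrono::milliseconds(1000);
  static constexpr bool kForceRecover = false;

  std::mutex performing_update_mutex_;
  std::shared_ptr<OutboundCallSketch> call_ = std::make_shared<OutboundCallSketch>();
};
```

The connection-side change (queue only one DestroyConnection task per connection) amounts to a test-and-set on a flag. A sketch under the same caveat, with hypothetical names (`ConnectionSketch`, `QueueDestroyTask`):

```cpp
#include <atomic>

class ConnectionSketch {
 public:
  // Called from every operation that hits a write failure on the closed socket.
  void OnWriteFailure() {
    // Only the first caller schedules the destroy task; later failures see the
    // flag already set and return, so the reactor never tries to look up a
    // connection it has already destroyed (the CHECK failure described above).
    if (!destroy_task_queued_.exchange(true)) {
      QueueDestroyTask();
    }
  }

 private:
  void QueueDestroyTask() { /* schedule a reactor task that destroys *this */ }
  std::atomic<bool> destroy_task_queued_{false};
};
```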
… lag caused by stuck OutboundCall

Summary: Original commit: 30157f5df21928215b9d4b227bd2c59c6d0d565c / D28057. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28076
Found some TSAN issues with the above commit in Flavor 1
Flavor 2:
…ulted in tsan race in connection

Summary:
This change fixes TSAN race issues caused by change 98586ef. In the initial change, we were incorrectly setting the RPC start time to the current time when processing a response, so the detection logic did not work correctly and triggered the TSAN races below.

TSAN race details:

- Connection getting reset while we are trying to get DebugString. This is solved by not calling DebugString() until we determine that the call is actually stuck.
```
[ts-2] Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2] #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2] #1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2] #2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2] #3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2] #4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)
[ts-2] Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2] #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2] #1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2] #2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2] #3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2] #4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
```
- Concurrent access to RpcController `call_`. The fix avoids the call to finished().
```
[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2] Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2] #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2] #1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2] #2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2] #3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2] #4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)
[m-2] Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2] #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2] #1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2] #2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2] #3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2] #4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
```

Jira: DB-7637

Test Plan:
./yb_build.sh tsan -n 10 --cxx-test integration-tests_master_failover-itest --gtest_filter MasterFailoverTestIndexCreation/MasterFailoverTestIndexCreation.TestPauseAfterCreateIndexIssued/0
./yb_build.sh tsan -n 10 --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.MultiThreadedInsertWithFailovers
Jenkins

Reviewers: mbautin
Reviewed By: mbautin
Subscribers: mbautin, ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D28161
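For context, here is a minimal C++ sketch of the shape of race TSAN reported above (one thread resets a shared_ptr member while another reads it without synchronization), together with one generic mitigation: copying the pointer under a mutex before use. The names (`CallHolder`, `GetForLogging`) are hypothetical, and this is not the actual fix, which instead avoids touching the call/connection state until the call is known to be stuck.

```cpp
#include <memory>
#include <mutex>
#include <string>

struct Call {
  std::string DebugString() const { return "call"; }
};

class CallHolder {
 public:
  // Racy version: one thread calls ResetUnsafe() while another evaluates
  // FinishedUnsafe(). Both touch the same shared_ptr object without
  // synchronization, which is exactly the kind of access TSAN flags.
  void ResetUnsafe() { call_.reset(); }
  bool FinishedUnsafe() const { return call_ == nullptr; }

  // Guarded version: copy the shared_ptr under a mutex, then use the copy.
  // The copy also keeps the Call alive even if another thread resets call_.
  void Reset() {
    std::lock_guard<std::mutex> l(mutex_);
    call_.reset();
  }
  std::string GetForLogging() const {
    std::shared_ptr<Call> snapshot;
    {
      std::lock_guard<std::mutex> l(mutex_);
      snapshot = call_;
    }
    return snapshot ? snapshot->DebugString() : "call not set";
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<Call> call_ = std::make_shared<Call>();
};
```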
…_ reset which resulted in tsan race in connection

Summary: Original commit: 0dc13e6 / D28161. The summary, race details, and test plan are identical to those in the TSAN-fix commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28166
…ag caused by stuck OutboundCall

Summary: Original commit: 98586ef / D27859 and 0dc13e6 / D28161. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28057
…ag caused by stuck OutboundCall

Summary: Original commit: 98586ef / D27859 and 0dc13e6 / D28161 / D28057. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28695
…ag caused by stuck OutboundCall

Summary: Original commit: 98586ef / D27859 and 0dc13e6 / D28161. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28048
Need to backport the change to the 2.14.10 branch.
…r lag caused by stuck OutboundCall

Summary: Original commit: 98586ef / D27859 and 0dc13e6 / D28161 / D28695. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Jenkins: urgent
Differential Revision: https://phorge.dev.yugabyte.com/D28915
Jira Link: DB-7637
Description
In production, we saw quite a few cases where a Peer is waiting for the response of an Update RPC call. Since the Peer holds the performing_update_mutex for the duration of the RPC, we get into a situation where the leader never sends any data to this peer.
After analyzing a dump, we found that the OutboundCall for the RPC had been in the SENT state for a long time. There was no clear way to confirm whether the connection (on which this call was sent) was still active.
To avoid this follower lag issue, add a mechanism that detects the situation, logs additional details to help root-cause it, and automatically recovers from it by cancelling the pending OutboundCall.
Issue #18685 was created to track all stuck OutboundCalls and possibly recover from them.