[DocDB] Increased follower lag observed in production with stuck OutboundCall #18744
Comments
…uck OutboundCall

Summary:
In production, we saw a few cases where follower lag was continuously increasing for a few tablets. We also noticed that the lagging replica was not removed from the quorum by ConsensusQueue. After capturing a dump of the node hosting the leader, we determined that the performing_update_mutex had been held for a long time for the affected Peer. A Peer acquires the performing_update_mutex while building a request to send to its peer and keeps the mutex locked until it receives a response. In the captured dump, the OutboundCall was in the SENT state, but we could not confirm whether the connection it was sent on was still active -- an analysis of the OutboundCall references suggests the connection had been shut down, but we were not sure.

This change tries to detect the stuck OutboundCall in Peer (a simplified sketch of this check follows this comment):
* Whenever we try to send more data or a heartbeat to the peer (in `SignalRequest`), we check whether we can acquire the `performing_update_mutex`. If the mutex is already held, we check whether more than `FLAGS_stuck_peer_call_threshold_ms` plus the request timeout has elapsed since the call start time.
* If the lock has been held for longer than the request timeout plus `FLAGS_stuck_peer_call_threshold_ms`, we log additional details that can help identify the root cause of the issue (see below for examples).
* When `FLAGS_force_recover_from_stuck_peer_call` is set to true, we also try to mark the stuck call as failed. If the call object is not present, we cannot recover from this situation.

Another change: whenever a Connection encounters a write failure, instead of destroying the connection immediately, the operation queues a reactor task so that all queued operations on the socket are executed in order. However, since the socket is closed, every queued operation hits a write failure and schedules its own DestroyConnection task. After the first DestroyConnection task executes, the connection can no longer be found among the reactor-tracked connections, which leads to a CHECK failure. To prevent multiple DestroyConnection tasks for a single connection, we now track whether the connection has already queued the task for its destruction.

Information logged when we detect this situation:
```
I0822 13:36:35.338526 4161232384 rpc_stub-test.cc:1153] OutboundCall (0x0000000107186018 -> RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT.) tostring: RPC call yb.rpc_test.CalculatorService.Concat -> { remote: 127.0.0.1:58098 idx: 1 protocol: 0x0000000103fae8f0 -> tcp } , state=SENT., times (start, sent, callback, now): (3713549.841s, 3713549.842s, 0.000s, 3713549.947s), connection: 0x000000010708a658 -> Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098
I0822 13:36:35.338694 1845915648 connection.cc:409] Connection (0x000000010708a658) client 127.0.0.1:58100 => 127.0.0.1:58098: LastActivityTime: 3713549.839s, ActiveCalls stats: { during shutdown: 0, current size: 0, earliest expiry: 0.000s, }, OutboundCall: { ptr: 0x0000000107186018, call id: 2, is active: 0 }, Shutdown status: Service unavailable (yb/rpc/reactor.cc:106): Shutdown connection (system error 58), Shutdown time: 3713549.940s, Queue attempts after shutdown: { calls: 0, responses: 0 }
```
1. OutboundCall timing details - the time when the call was created, when it was sent on the network, and when the callback was received.
2. Connection - we dump the state of the connection on which the call was sent, including whether the connection is alive. If alive, we log the active-call count and whether the current call is present in the active calls. If not alive, we log the time the connection was shut down, the number of active calls present during shutdown, and the queue attempts after shutdown.

Using the above timing information and connection state, we can determine the order of events. For example, in the sample logs above, the connection was closed after the data had already been sent on the socket.

Example trace for stuck call detection in Peer:
```
I0817 17:42:07.975589 4161232384 consensus_peers.cc:179] T test-peers-tablet P peer-0 -> Peer peer-1 ([], []): Found a RPC call in stuck state - is_finished: 0, timeout: 0.005s, last_update_lock_release_time_: 3296281.832s, stuck threshold: 0.001s, force recover: 0, call state: call not set
```

Jira: DB-7637

Test Plan:
./yb_build.sh --sj
./build/latest/tests-rpc/rpc_stub-test --gtest_filter=*StuckOutboundCall*

Reviewers: mbautin, sergei
Reviewed By: mbautin
Subscribers: amitanand, yql, yyan, rthallam, ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D27859
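To make the try-lock detection described in the bullets above concrete, here is a minimal C++ sketch. It is not the actual YugabyteDB implementation: the types (`PeerSketch`, `OutboundCallSketch`) and the constants standing in for the request timeout, `FLAGS_stuck_peer_call_threshold_ms`, and `FLAGS_force_recover_from_stuck_peer_call` are hypothetical simplifications; only the general pattern follows the commit message.

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>

using Clock = std::chrono::steady_clock;

// Stand-in for the in-flight RPC; the real OutboundCall tracks much more state.
struct OutboundCallSketch {
  Clock::time_point start_time = Clock::now();
  bool finished = false;
  void MarkFailed(const char* reason) {
    finished = true;
    std::cerr << "call marked failed: " << reason << "\n";
  }
};

class PeerSketch {
 public:
  // Rough analogue of Peer::SignalRequest(): if the update mutex is free, send
  // the next request; if it is already held, check whether the in-flight call
  // has exceeded timeout + stuck threshold and, optionally, force-fail it.
  void SignalRequest() {
    std::unique_lock<std::mutex> lock(performing_update_mutex_, std::try_to_lock);
    if (lock.owns_lock()) {
      // ... build and send the next request while holding the lock ...
      return;
    }
    // NOTE: real code must synchronize access to the call object here; doing
    // this naively is exactly the data race fixed in the follow-up commit.
    if (!call_ || call_->finished) {
      return;
    }
    const auto elapsed = Clock::now() - call_->start_time;
    if (elapsed > kRequestTimeout + kStuckThreshold) {
      std::cerr << "Found an RPC call in stuck state, elapsed ms: "
                << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()
                << "\n";
      if (kForceRecover) {
        call_->MarkFailed("stuck call detected by peer");
      }
    }
  }

 private:
  // Hypothetical stand-ins for the request timeout and the two flags above.
  static constexpr auto kRequestTimeout = std::chrono::milliseconds(3000);
  static constexpr auto kStuckThreshold = std::chrono::milliseconds(1000);
  static constexpr bool kForceRecover = false;

  std::mutex performing_update_mutex_;
  std::shared_ptr<OutboundCallSketch> call_ = std::make_shared<OutboundCallSketch>();
};
```

The connection-side change (queue only one DestroyConnection task per connection) amounts to a test-and-set on a flag. A sketch under the same caveat, with hypothetical names (`ConnectionSketch`, `QueueDestroyTask`):

```cpp
#include <atomic>

class ConnectionSketch {
 public:
  // Called from every operation that hits a write failure on the closed socket.
  void OnWriteFailure() {
    // Only the first caller schedules the destroy task; later failures see the
    // flag already set and return, so the reactor never tries to look up a
    // connection it has already destroyed (the CHECK failure described above).
    if (!destroy_task_queued_.exchange(true)) {
      QueueDestroyTask();
    }
  }

 private:
  void QueueDestroyTask() { /* schedule a reactor task that destroys *this */ }
  std::atomic<bool> destroy_task_queued_{false};
};
```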
… lag caused by stuck OutboundCall

Summary: Original commit: 30157f5df21928215b9d4b227bd2c59c6d0d565c / D28057. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28076
Found some TSAN issues with the above commit in Flavor 1
Flavor 2:
…ulted in tsan race in connection

Summary:
This change fixes TSAN race issues caused by change 98586ef. In the initial change, we were incorrectly setting the RPC start time to the current time when processing a response, so the detection logic did not work correctly and triggered the TSAN races below.

TSAN race details:

- Connection getting reset while we are trying to get DebugString. This is solved by not calling DebugString() until we determine that the call is actually stuck.
```
[ts-2] Read of size 8 at 0x7b54000c0198 by thread T21 (mutexes: write M0):
[ts-2] #0 std::shared_ptr<yb::rpc::Connection>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x11383f)
[ts-2] #1 std::shared_ptr<yb::rpc::Connection>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x11383f)
[ts-2] #2 yb::rpc::OutboundCall::DebugString() const ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:613:7 (libyrpc.so+0x11383f)
[ts-2] #3 yb::rpc::RpcController::CallStateDebugString() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:163:19 (libyrpc.so+0x156976)
[ts-2] #4 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:188:77 (libconsensus.so+0x94c2e)
[ts-2] Previous write of size 8 at 0x7b54000c0198 by thread T25:
[ts-2] #0 std::enable_if<is_move_constructible<yb::rpc::Connection*>::value && is_move_assignable<yb::rpc::Connection*>::value, void>::type std::swap[abi:v160006]<yb::rpc::Connection*>(yb::rpc::Connection*&, yb::rpc::Connection*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x11166a)
[ts-2] #1 std::shared_ptr<yb::rpc::Connection>::swap[abi:v160006](std::shared_ptr<yb::rpc::Connection>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x11166a)
[ts-2] #2 std::shared_ptr<yb::rpc::Connection>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x11166a)
[ts-2] #3 yb::rpc::OutboundCall::InvokeCallback() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:372:15 (libyrpc.so+0x11166a)
[ts-2] #4 yb::rpc::OutboundCall::SetTimedOut() ${BUILD_ROOT}/../../src/yb/rpc/outbound_call.cc:538:5 (libyrpc.so+0x112c64)
```
- Concurrent access to RpcController `call_`. The fix avoids the call to finished().
```
[m-2] WARNING: ThreadSanitizer: data race (pid=27931)
[m-2] Write of size 8 at 0x7b6000040708 by thread T29 (mutexes: write M0):
[m-2] #0 std::enable_if<is_move_constructible<yb::rpc::OutboundCall*>::value && is_move_assignable<yb::rpc::OutboundCall*>::value, void>::type std::swap[abi:v160006]<yb::rpc::OutboundCall*>(yb::rpc::OutboundCall*&, yb::rpc::OutboundCall*&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__utility/swap.h:41:7 (libyrpc.so+0x1560a4)
[m-2] #1 std::shared_ptr<yb::rpc::OutboundCall>::swap[abi:v160006](std::shared_ptr<yb::rpc::OutboundCall>&) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:805:9 (libyrpc.so+0x1560a4)
[m-2] #2 std::shared_ptr<yb::rpc::OutboundCall>::reset[abi:v160006]() /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:812:22 (libyrpc.so+0x1560a4)
[m-2] #3 yb::rpc::RpcController::Reset() ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:83:9 (libyrpc.so+0x1560a4)
[m-2] #4 yb::consensus::Peer::ProcessResponse() ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:496:15 (libconsensus.so+0x97ffb)
[m-2] Previous read of size 8 at 0x7b6000040708 by thread T7:
[m-2] #0 std::shared_ptr<yb::rpc::OutboundCall>::get[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:843:16 (libyrpc.so+0x1561ea)
[m-2] #1 std::shared_ptr<yb::rpc::OutboundCall>::operator bool[abi:v160006]() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20230803170255-057e0a1188-centos7-x86_64-clang16/installed/tsan/libcxx/include/c++/v1/__memory/shared_ptr.h:875:16 (libyrpc.so+0x1561ea)
[m-2] #2 yb::rpc::RpcController::finished() const ${BUILD_ROOT}/../../src/yb/rpc/rpc_controller.cc:87:7 (libyrpc.so+0x1561ea)
[m-2] #3 yb::consensus::Peer::SignalRequest(yb::consensus::RequestTriggerMode) ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:183:81 (libconsensus.so+0x94934)
[m-2] #4 yb::consensus::Peer::Init()::$_0::operator()() const ${BUILD_ROOT}/../../src/yb/consensus/consensus_peers.cc:156:25 (libconsensus.so+0x9bc67)
```

Jira: DB-7637

Test Plan:
./yb_build.sh tsan -n 10 --cxx-test integration-tests_master_failover-itest --gtest_filter MasterFailoverTestIndexCreation/MasterFailoverTestIndexCreation.TestPauseAfterCreateIndexIssued/0
./yb_build.sh tsan -n 10 --cxx-test integration-tests_raft_consensus-itest --gtest_filter RaftConsensusITest.MultiThreadedInsertWithFailovers
Jenkins

Reviewers: mbautin
Reviewed By: mbautin
Subscribers: mbautin, ybase, bogdan
Differential Revision: https://phorge.dev.yugabyte.com/D28161
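For context, here is a minimal C++ sketch of the shape of race TSAN reported above (one thread resets a shared_ptr member while another reads it without synchronization), together with one generic mitigation: copying the pointer under a mutex before use. The names (`CallHolder`, `GetForLogging`) are hypothetical, and this is not the actual fix, which instead avoids touching the call/connection state until the call is known to be stuck.

```cpp
#include <memory>
#include <mutex>
#include <string>

struct Call {
  std::string DebugString() const { return "call"; }
};

class CallHolder {
 public:
  // Racy version: one thread calls ResetUnsafe() while another evaluates
  // FinishedUnsafe(). Both touch the same shared_ptr object without
  // synchronization, which is exactly the kind of access TSAN flags.
  void ResetUnsafe() { call_.reset(); }
  bool FinishedUnsafe() const { return call_ == nullptr; }

  // Guarded version: copy the shared_ptr under a mutex, then use the copy.
  // The copy also keeps the Call alive even if another thread resets call_.
  void Reset() {
    std::lock_guard<std::mutex> l(mutex_);
    call_.reset();
  }
  std::string GetForLogging() const {
    std::shared_ptr<Call> snapshot;
    {
      std::lock_guard<std::mutex> l(mutex_);
      snapshot = call_;
    }
    return snapshot ? snapshot->DebugString() : "call not set";
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<Call> call_ = std::make_shared<Call>();
};
```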
…_ reset which resulted in tsan race in connection

Summary: Original commit: 0dc13e6 / D28161. The summary, race details, and test plan are identical to those in the TSAN-fix commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28166
…ag caused by stuck OutboundCall

Summary: Original commit: 98586ef / D27859 and 0dc13e6 / D28161. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28057
…ag caused by stuck OutboundCall

Summary: Original commit: 98586ef / D27859 and 0dc13e6 / D28161 / D28057. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28695
…ag caused by stuck OutboundCall

Summary: Original commit: 98586ef / D27859 and 0dc13e6 / D28161. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Differential Revision: https://phorge.dev.yugabyte.com/D28048
Need to backport the change to the 2.14.10 branch.
…r lag caused by stuck OutboundCall

Summary: Original commit: 98586ef / D27859 and 0dc13e6 / D28161 / D28695. The summary, log examples, and test plan are identical to those in the first commit message above.

Jira: DB-7637
Jenkins: urgent
Differential Revision: https://phorge.dev.yugabyte.com/D28915
Jira Link: DB-7637
Description
In production, we saw quite a few cases where a Peer is waiting for the response of an Update RPC call. Since the Peer holds the performing_update_mutex for the duration of the RPC, we get into a situation where the leader never sends any data to this peer.
After analyzing a dump, we found that the OutboundCall for the RPC had been in the SENT state for a long time. There was no clear way to confirm whether the connection (on which this call was sent) was still active.
To avoid this follower lag issue, add a mechanism that detects the situation, logs additional details to help root-cause it, and automatically recovers from it by cancelling the pending OutboundCall.
Issue #18685 was created to track all stuck OutboundCalls and possibly recover from them.