Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Proactively track stuck OutboundCall objects #18685

Closed
1 task done
mbautin opened this issue Aug 15, 2023 · 0 comments
Closed
1 task done

[DocDB] Proactively track stuck OutboundCall objects #18685

mbautin opened this issue Aug 15, 2023 · 0 comments
Assignees
Labels
2.14 Backport Required 2.16 Backport Required 2.18 Backport Required area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/high High Priority

Comments

@mbautin
Copy link
Contributor

mbautin commented Aug 15, 2023

Jira Link: DB-7597

Description

There is a rare bug where OutboundCall's callback is never called. This could be the underlying cause of #16354. Automatically detect this situation and log as much relevant details as possible.

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@mbautin mbautin added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Aug 15, 2023
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Aug 15, 2023
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Aug 15, 2023
mbautin added a commit to mbautin/yugabyte-db that referenced this issue Aug 16, 2023
Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from that structure as soon as we learn the callback has been called. Report stuck calls with detailed debug information, including the connection objects associated with those calls. This tracking is disabled by default and can be enabled using the stuck_outbound_call_detection_interval_sec flag, set to the number of seconds after which a call will be reported as stuck unless its callback is called.

We also enhance ToString functionality for OutboundCall and Connection.

We now keep track of whether an OutboundCall had attempts of invalid state transitions which would previously be silently rejected, and display that information in the call's ToString representation. Another piece of debug information for OutboundCall is active_call_state_, which indicates whether the call was added to active_calls_ in the corresponding connection, whether it was removed from it, and for what reason.

Also add a way to simulate stuck OutboundRpc calls for testing. TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Test Plan:
Jenkins

Manual test:

bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,stuck_outbound_call_detection_interval_sec=60" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000

Look for output like this in the logs:

```
W0815 00:07:00.437063 184510 outbound_call.cc:388] Skipping OutboundCall callback as a test: Remote RPC call 0x000034f4b4af4f20: yb.tserver.TabletServerService.Read -> { remote: 127.0.0.3:9100 idx: 4 protocol: 0x00007f2e6bc30a80 -> tcpc }, state: FINISHED_SUCCESS, transfer_state: FINISHED, active_call_state: kErasedOnResponse, connection: 0x000034f4bd35e498
W0815 00:07:00.580775 184288 reactor.cc:604] Stuck OutboundCall: Remote RPC call 0x000034f4bb1496a0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 6 protocol: 0x00007f2e6bc30a80 -> tcpc }, state: FINISHED_SUCCESS, transfer_state: FINISHED, active_call_state: kErasedOnResponse, connection: 0x000034f4bd57b0f8
W0815 00:07:00.580840 184288 reactor.cc:620] Connection with stuck OutboundCalls: Connection (0x000034f4bd57b0f8) client 127.0.0.1:50995 => 127.0.0.2:9100 (no active calls)
```

Reviewers: sergei, kpopali

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27735

Address review comments

Address review comments

Set stuck outbound call detection interval to 10 seconds initially

Fix connection_weak_ thread safety

Fix handling of null call in response callback

WIP

Fix the bug with incorrect handling of invalid handle

Cleanup

Fix lint. Jenkins: skip

Fix double callback invocation on thread pool shutdown

Improvements

Cleanup. Jenkins: skip

Fix lint. Jenkins: skip

Various cleanup

More cleanup
mbautin added a commit to mbautin/yugabyte-db that referenced this issue Sep 3, 2023
…actor level

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called. Report stuck calls with detailed debug information, including the connection objects associated with those calls, and all active calls for those connections. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement. Reactor-based timeout enforcement kicks in after reactor_based_outbound_call_expiration_delay_ms have passed after the call's normal timeout, in case Connection-based timeout did not work. Set the flag to 0 to disable it.

We are enhancing ToString functionality for OutboundCall and Connection with various useful fields.

OutboundCall now keeps track of its expiration deadline.

We now keep track of whether an OutboundCall has had attempts of invalid state transitions, which would previously be silently rejected, and display that information in the call's ToString representation. Another piece of debug information for OutboundCall is active_call_state_, which indicates whether the call was added to active_calls_ in the corresponding connection, whether it was removed from it, and for what reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

Modify RpcCall::Transferred to use a compare-and-swap loop to make thread safety semantics of the state field clearer. Rename RpcCall::state_ to transfer_state_ to avoid confusion with OutboundCall::state_.

In Connection::Shutdown, we now explicitly ensure the shutdown status is not OK. Calling Transferred on an outbound call with an OK status can in theory result in stuck calls.

We introduce multiple ways to simulate stuck outbound calls for testing.
- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in `Connection::QueueOutboundCall`, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in `stream_->Send()` called from `Connection::DoQueueOutboundData`. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause `Connection::QueueOutboundCall` to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

Add convenient functions for converting CoarseTimePoint to string in relation to the current moment in time.

Test Plan:
Jenkins

Manual testing details below.

First test mode
===============

bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000

Look for output like this in the logs:

```
W0816 21:22:05.062952 821156 connection.cc:284] Simulating a call stuck in SENT state: Remote RPC call 0x000026bf38da62a0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007f4ff6965540 -> tcpc }, id: 2535, start_time: 11529.994s, state: SENT, transfer_state: PENDING, active_call_state: kNotAdded, callback_invoked: 0, connection: 0x000026bf3adc47f8, expires_at: N/A
...
W0816 21:22:09.128285 821156 reactor.cc:693] Stuck OutboundCall: Remote RPC call 0x000026bf38da62a0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007f4ff6965540 -> tcpc }, id: 2535, start_time: 11529.994s (4.065s ago), state: SENT, transfer_state: PENDING, active_call_state: kNotAdded, callback_invoked: 0, connection: 0x000026bf3adc47f8 (alive), expires_at: 11532.994s (1.065s ago) (forced expiration)
W0816 21:22:09.128386 821413 consensus_peers.cc:561] T d180d8d18d3349d3affb827a5bbb1306 P 6fb888c69eaf4b3b928bea594da3dd8c -> Peer 3998f4271cb940cd869e3cb19b4e17e4 ([host: "127.0.0.3" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:560): UpdateConsensus RPC (request call id 2535) to 127.0.0.3:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Second test mode
================

bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W0816 21:31:51.590170 843207 consensus_peers.cc:561] T aea41ebd2b2c470b8b552586e0541b11 P 60ccba818c62466f988b1a330d64a01e -> Peer daa05e0fc5524a7db6ac2dc9b4bc9f46 ([host: "127.0.0.1" port: 9100], []): Couldn't send request.  Status: Network error (yb/rpc/connection.cc:353): Simulated failure to send outbound data for 0x000025f33f2c9ba0 -> Remote RPC call 0x000025f33f2c9ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.1:9100 idx: 0 protocol: 0x00007efeb4eea540 -> tcpc }, id: 1651, start_time: 12116.521s, state: READY, transfer_state: PENDING, active_call_state: kNotAdded, callback_invoked: 0, connection: 0x000025f33f1798d8, expires_at: N/A. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

There should be no stuck outbound calls in the log.

Third test mode
===============

bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W0816 21:41:52.361452 845765 outbound_call.cc:403] Skipping OutboundCall callback as a test: Remote RPC call 0x00000455787af420: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007fc61c630540 -> tcpc }, id: 1573, start_time: 12717.292s, state: FINISHED_SUCCESS, transfer_state: FINISHED, active_call_state: kErasedOnResponse, callback_invoked: 0, connection: 0x000004557f1c2018, expires_at: 12720.292s
...
W0816 21:41:56.444655 845540 reactor.cc:693] Stuck OutboundCall: Remote RPC call 0x00000455787af420: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007fc61c630540 -> tcpc }, id: 1573, start_time: 12717.292s (4.084s ago), state: FINISHED_SUCCESS, transfer_state: FINISHED, active_call_state: kErasedOnResponse, callback_invoked: 0, connection: 0x000004557f1c2018 (alive), expires_at: 12720.292s (1.084s ago)
W0816 21:41:56.444737 845540 reactor.cc:706] Connection with stuck OutboundCalls: Connection (0x000004557f1c2018) client 127.0.0.2:41591 => 127.0.0.3:9100; active calls (1):
    1. { id: 40259 call: 0x00000455724342a0 expires_at: 12779.178s (57.801s from now) handle: 1005
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Reviewers: sergei, kpopali

Reviewed By: sergei

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27735
mbautin added a commit to mbautin/yugabyte-db that referenced this issue Sep 5, 2023
…actor level

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called. Report stuck calls with detailed debug information, including the connection objects associated with those calls, and all active calls for those connections. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement. Reactor-based timeout enforcement kicks in after reactor_based_outbound_call_expiration_delay_ms have passed after the call's normal timeout, in case Connection-based timeout did not work. Set the flag to 0 to disable it.

We are enhancing ToString functionality for OutboundCall and Connection with various useful fields.

OutboundCall now keeps track of its expiration deadline.

We now keep track of whether an OutboundCall has had attempts of invalid state transitions, which would previously be silently rejected, and display that information in the call's ToString representation. Another piece of debug information for OutboundCall is active_call_state_, which indicates whether the call was added to active_calls_ in the corresponding connection, whether it was removed from it, and for what reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

Modify RpcCall::Transferred to use a compare-and-swap loop to make thread safety semantics of the state field clearer. Rename RpcCall::state_ to transfer_state_ to avoid confusion with OutboundCall::state_.

In Connection::Shutdown, we now explicitly ensure the shutdown status is not OK. Calling Transferred on an outbound call with an OK status can in theory result in stuck calls.

We introduce multiple ways to simulate stuck outbound calls for testing.
- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in `Connection::QueueOutboundCall`, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in `stream_->Send()` called from `Connection::DoQueueOutboundData`. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause `Connection::QueueOutboundCall` to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

Add convenient functions for converting CoarseTimePoint to string in relation to the current moment in time.

Test Plan:
Jenkins

Manual testing details below.

First test mode
===============

bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000

Look for output like this in the logs:

```
W0816 21:22:05.062952 821156 connection.cc:284] Simulating a call stuck in SENT state: Remote RPC call 0x000026bf38da62a0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007f4ff6965540 -> tcpc }, id: 2535, start_time: 11529.994s, state: SENT, transfer_state: PENDING, active_call_state: kNotAdded, callback_invoked: 0, connection: 0x000026bf3adc47f8, expires_at: N/A
...
W0816 21:22:09.128285 821156 reactor.cc:693] Stuck OutboundCall: Remote RPC call 0x000026bf38da62a0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007f4ff6965540 -> tcpc }, id: 2535, start_time: 11529.994s (4.065s ago), state: SENT, transfer_state: PENDING, active_call_state: kNotAdded, callback_invoked: 0, connection: 0x000026bf3adc47f8 (alive), expires_at: 11532.994s (1.065s ago) (forced expiration)
W0816 21:22:09.128386 821413 consensus_peers.cc:561] T d180d8d18d3349d3affb827a5bbb1306 P 6fb888c69eaf4b3b928bea594da3dd8c -> Peer 3998f4271cb940cd869e3cb19b4e17e4 ([host: "127.0.0.3" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:560): UpdateConsensus RPC (request call id 2535) to 127.0.0.3:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Second test mode
================

bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W0816 21:31:51.590170 843207 consensus_peers.cc:561] T aea41ebd2b2c470b8b552586e0541b11 P 60ccba818c62466f988b1a330d64a01e -> Peer daa05e0fc5524a7db6ac2dc9b4bc9f46 ([host: "127.0.0.1" port: 9100], []): Couldn't send request.  Status: Network error (yb/rpc/connection.cc:353): Simulated failure to send outbound data for 0x000025f33f2c9ba0 -> Remote RPC call 0x000025f33f2c9ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.1:9100 idx: 0 protocol: 0x00007efeb4eea540 -> tcpc }, id: 1651, start_time: 12116.521s, state: READY, transfer_state: PENDING, active_call_state: kNotAdded, callback_invoked: 0, connection: 0x000025f33f1798d8, expires_at: N/A. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

There should be no stuck outbound calls in the log.

Third test mode
===============

bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W0816 21:41:52.361452 845765 outbound_call.cc:403] Skipping OutboundCall callback as a test: Remote RPC call 0x00000455787af420: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007fc61c630540 -> tcpc }, id: 1573, start_time: 12717.292s, state: FINISHED_SUCCESS, transfer_state: FINISHED, active_call_state: kErasedOnResponse, callback_invoked: 0, connection: 0x000004557f1c2018, expires_at: 12720.292s
...
W0816 21:41:56.444655 845540 reactor.cc:693] Stuck OutboundCall: Remote RPC call 0x00000455787af420: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007fc61c630540 -> tcpc }, id: 1573, start_time: 12717.292s (4.084s ago), state: FINISHED_SUCCESS, transfer_state: FINISHED, active_call_state: kErasedOnResponse, callback_invoked: 0, connection: 0x000004557f1c2018 (alive), expires_at: 12720.292s (1.084s ago)
W0816 21:41:56.444737 845540 reactor.cc:706] Connection with stuck OutboundCalls: Connection (0x000004557f1c2018) client 127.0.0.2:41591 => 127.0.0.3:9100; active calls (1):
    1. { id: 40259 call: 0x00000455724342a0 expires_at: 12779.178s (57.801s from now) handle: 1005
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Reviewers: sergei, kpopali

Reviewed By: sergei

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27735
basavaraj29 pushed a commit to basavaraj29/yugabyte-db that referenced this issue Sep 5, 2023
…actor level

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called. Report stuck calls with detailed debug information, including the connection objects associated with those calls, and all active calls for those connections. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition.

We introduce multiple ways to simulate stuck outbound calls for testing.
- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in `Connection::QueueOutboundCall`, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in `stream_->Send()` called from `Connection::DoQueueOutboundData`. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause `Connection::QueueOutboundCall` to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

Test Plan:
Jenkins

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W0901 22:36:51.984501 785126 connection.cc:298] Connection (0x000016d5b9d71058) client 127.0.0.1:58055 => 127.0.0.3:9100: Simulating a call stuck in SENT state: RPC call 0x000016d5b717a020: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 7 protocol: 0x00007f83a0057580 -> tcpc }, id: 25764, state: SENT, transfer_state: PENDING, start_time: 13349.912s (now), sent_time: -inf, callback_time: -inf, expiration_time: +inf, now: 13349.912s, connection: Connection (0x000016d5b9d71058) client 127.0.0.1:58055 => 127.0.0.3:9100, active_call_state: kNotAdded (forced expiration)
...
W0901 22:36:55.988431 785126 reactor.cc:683] Stuck OutboundCall: RPC call 0x000016d5b717a020: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 7 protocol: 0x00007f83a0057580 -> tcpc }, id: 25764, state: SENT, transfer_state: PENDING, start_time: 13349.912s (4.004s ago), sent_time: -inf, callback_time: -inf, expiration_time: 13352.913s (1.003s ago), now: 13353.916s, connection: Connection (0x000016d5b9d71058) client 127.0.0.1:58055 => 127.0.0.3:9100, active_call_state: kNotAdded (forced expiration)
W0901 22:36:55.988533 785126 reactor.cc:697] Connection with stuck OutboundCalls: Connection (0x000016d5b9d71058) client 127.0.0.1:58055 => 127.0.0.3:9100; active calls (1):
    1. { id: 87290 call: 0x000016d5bb51cb20 expires_at: 13413.915s (59.998s from now) handle: 2021 }
W0901 22:36:55.988540 785487 consensus_peers.cc:603] T 91cc3213c4904646b69466abbcee7f24 P f020fb79c26147b4b0036576e5a198c4 -> Peer 5ea01f7c0db84b89aa9b9a787137670e ([host: "127.0.0.3" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:597): UpdateConsensus RPC (request call id 25764) to 127.0.0.3:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
I0901 22:44:29.150039 794663 connection.cc:345] Connection::DoQueueOutboundData data: 0x000025532fef58e0 -> RPC call 0x000025532fef58e0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 1 protocol: 0x00007f9bd4dfb580 -> tcpc }, id: 481779, state: READY, transfer_state: PENDING shutdown_status_: Network error (yb/rpc/connection.cc:364): Simulated failure to send outbound data for 0x0000255333de6020 -> RPC call 0x0000255333de6020: yb.tserver.TabletServerService.Read -> { remote: 127.0.0.3:9100 idx: 1 protocol: 0x00007f9bd4dfb580 -> tcpc }, id: 205477, state: READY, transfer_state: PENDING
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W0901 22:53:34.242197 801309 outbound_call.cc:409] { OutboundCall@0x000016b1fe74b360 -> RPC call 0x000016b1fe74b360: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 4 protocol: 0x00007fe4243b4580 -> tcpc }, id: 17526, state: FINISHED_SUCCESS, transfer_state: FINISHED } Skipping OutboundCall callback as a test: RPC call 0x000016b1fe74b360: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 4 protocol: 0x00007fe4243b4580 -> tcpc }, id: 17526, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 14352.171s (now), sent_time: 14352.171s (now), callback_time: -inf, expiration_time: 14355.171s (3.000s from now), now: 14352.171s, connection: Connection (0x000016b1ff66f2d8) client 127.0.0.1:50569 => 127.0.0.3:9100, active_call_state: kErasedOnResponse (forced expiration)
...
W0901 22:53:38.263206 800968 reactor.cc:683] Stuck OutboundCall: RPC call 0x000016b1fe74b360: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.3:9100 idx: 4 protocol: 0x00007fe4243b4580 -> tcpc }, id: 17526, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 14352.171s (4.021s ago), sent_time: 14352.171s (4.021s ago), callback_time: -inf, expiration_time: 14355.171s (1.021s ago), now: 14356.192s, connection: Connection (0x000016b1ff66f2d8) client 127.0.0.1:50569 => 127.0.0.3:9100, active_call_state: kErasedOnResponse
W0901 22:53:38.263283 800968 reactor.cc:697] Connection with stuck OutboundCalls: Connection (0x000016b1ff66f2d8) client 127.0.0.1:50569 => 127.0.0.3:9100 (no active calls)
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: sergei, kpopali

Reviewed By: sergei

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27735
mbautin added a commit that referenced this issue Sep 7, 2023
Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called. Report stuck calls with detailed debug information. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition. Logically, there should never be a race between e.g. processing a call response and a call timeout, because both events happen on the reactor thread.

Fixing error handling in DoQueueOutboundData. If there is an error sending the call, we destroy the connection.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

We introduce multiple ways to simulate stuck outbound calls for testing.
- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in `Connection::QueueOutboundCall`, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in `stream_->Send()` called from `Connection::DoQueueOutboundData`. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause `Connection::QueueOutboundCall` to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Also replacing a DFATAL with an WARNING log in InboundCall::QueueResponse to avoid crashing tests during shutdown. Failures to queue a response on the server side should just result in timeouts on the client side in the worst case.

Test Plan:
Jenkins

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W0907 06:27:34.604096 54296 connection.cc:302] Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100: Simulating a call stuck in SENT state: RPC call 0x0000147cba7f4de0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f276aefeb60 -> tcpc }, id: 7840, state: SENT, transfer_state: PENDING, start_time: 3491.468s (0.001s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: +inf, now: 3491.469s, connection: Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100, active_call_state: kNotAdded
...
W0907 06:27:38.667785 54296 reactor.cc:676] TabletServer_R010: Stuck OutboundCall: RPC call 0x0000147cba7f4de0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f276aefeb60 -> tcpc }, id: 7840, state: SENT, transfer_state: PENDING, start_time: 3491.468s (4.064s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: 3494.469s (1.063s ago), now: 3495.532s, connection: Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100, active_call_state: kNotAdded (forcing a timeout)
W0907 06:27:38.667878 54602 consensus_peers.cc:603] T 49b1706cfba3483b8da3a286fd2115b1 P 02949377cfb94b308697b79fd5581fcf -> Peer fb5b7429056348329f3af97271a630ac ([host: "127.0.0.2" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:603): UpdateConsensus RPC (request call id 7840) to 127.0.0.2:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out. If they were not, they would be logged every minute.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W0907 06:35:47.248174 58709 connection.cc:391] Simulated network failure: Network error (yb/rpc/connection.cc:390): Simulated failure to send outbound data for 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747)
...
W0907 06:35:47.248219 58709 inbound_call.cc:95] 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747): Connection torn down before Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747) could send its response: Network error (yb/rpc/connection.cc:390): Simulated failure to send outbound data for 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747)
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W0907 06:38:05.760689 60982 outbound_call.cc:422] OutboundCall@0x00000774ff2a3ba0: Skipping OutboundCall callback as a test: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (now), sent_time: 4122.626s (now), trigger_callback_time: 4122.626s (now), invoke_callback_time: -inf, expiration_time: 4125.626s (3.000s from now), now: 4122.626s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
...
W0907 06:38:09.768546 60605 reactor.cc:676] TabletServer_R012: Stuck OutboundCall: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (4.008s ago), sent_time: 4122.626s (4.008s ago), trigger_callback_time: 4122.626s (4.008s ago), invoke_callback_time: -inf, expiration_time: 4125.626s (1.008s ago), now: 4126.634s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
...
W0907 06:39:09.867923 60605 reactor.cc:676] TabletServer_R012: Stuck OutboundCall: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (64.106s ago), sent_time: 4122.626s (64.106s ago), trigger_callback_time: 4122.626s (64.106s ago), invoke_callback_time: -inf, expiration_time: 4125.626s (61.106s ago), now: 4186.733s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: sergei, kpopali

Reviewed By: sergei

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27735
mbautin added a commit to mbautin/yugabyte-db that referenced this issue Oct 10, 2023
…actor level

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called. Report stuck calls with detailed debug information. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition. Logically, there should never be a race between e.g. processing a call response and a call timeout, because both events happen on the reactor thread.

Fixing error handling in DoQueueOutboundData. If there is an error sending the call, we destroy the connection.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

We introduce multiple ways to simulate stuck outbound calls for testing.
- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in `Connection::QueueOutboundCall`, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in `stream_->Send()` called from `Connection::DoQueueOutboundData`. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause `Connection::QueueOutboundCall` to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Also replacing a DFATAL with an WARNING log in InboundCall::QueueResponse to avoid crashing tests during shutdown. Failures to queue a response on the server side should just result in timeouts on the client side in the worst case.

Test Plan:
Jenkins

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W0907 06:27:34.604096 54296 connection.cc:302] Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100: Simulating a call stuck in SENT state: RPC call 0x0000147cba7f4de0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f276aefeb60 -> tcpc }, id: 7840, state: SENT, transfer_state: PENDING, start_time: 3491.468s (0.001s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: +inf, now: 3491.469s, connection: Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100, active_call_state: kNotAdded
...
W0907 06:27:38.667785 54296 reactor.cc:676] TabletServer_R010: Stuck OutboundCall: RPC call 0x0000147cba7f4de0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f276aefeb60 -> tcpc }, id: 7840, state: SENT, transfer_state: PENDING, start_time: 3491.468s (4.064s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: 3494.469s (1.063s ago), now: 3495.532s, connection: Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100, active_call_state: kNotAdded (forcing a timeout)
W0907 06:27:38.667878 54602 consensus_peers.cc:603] T 49b1706cfba3483b8da3a286fd2115b1 P 02949377cfb94b308697b79fd5581fcf -> Peer fb5b7429056348329f3af97271a630ac ([host: "127.0.0.2" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:603): UpdateConsensus RPC (request call id 7840) to 127.0.0.2:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out. If they were not, they would be logged every minute.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W0907 06:35:47.248174 58709 connection.cc:391] Simulated network failure: Network error (yb/rpc/connection.cc:390): Simulated failure to send outbound data for 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747)
...
W0907 06:35:47.248219 58709 inbound_call.cc:95] 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747): Connection torn down before Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747) could send its response: Network error (yb/rpc/connection.cc:390): Simulated failure to send outbound data for 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747)
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W0907 06:38:05.760689 60982 outbound_call.cc:422] OutboundCall@0x00000774ff2a3ba0: Skipping OutboundCall callback as a test: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (now), sent_time: 4122.626s (now), trigger_callback_time: 4122.626s (now), invoke_callback_time: -inf, expiration_time: 4125.626s (3.000s from now), now: 4122.626s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
...
W0907 06:38:09.768546 60605 reactor.cc:676] TabletServer_R012: Stuck OutboundCall: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (4.008s ago), sent_time: 4122.626s (4.008s ago), trigger_callback_time: 4122.626s (4.008s ago), invoke_callback_time: -inf, expiration_time: 4125.626s (1.008s ago), now: 4126.634s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
...
W0907 06:39:09.867923 60605 reactor.cc:676] TabletServer_R012: Stuck OutboundCall: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (64.106s ago), sent_time: 4122.626s (64.106s ago), trigger_callback_time: 4122.626s (64.106s ago), invoke_callback_time: -inf, expiration_time: 4125.626s (61.106s ago), now: 4186.733s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: sergei, kpopali

Reviewed By: sergei

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27735

[yugabyte#19090] Disable reactor-based outbound call expiration by default

Summary: Disable reactor-based outbound call expiration by default because it causes sysbench perf test failures.

Test Plan: Jenkins

Reviewers: bogdan, mlillibridge

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D28515

[yugabyte#19090] Fix a memory leak in tracked outbound calls in Reactor

Summary:
Fix a memory leak in tracked_outbound_calls_ in Reactor by cleaning up that data structure more aggressively:
- When traversing tracked calls in the order of increasing next check time, even if it already greater than the current time, stay in the loop and continue removing entries where the callback has been called or the weak pointer has expired.
- When a callback is invoked, notify the reactor through a multi-producer single-consumer queue that is accessed from the OutboundCall via a weak pointer.

Also remove the FinalizeTrackedCall function that was likely useless because it was called concurrently with the callback being handled by the thread pool.

In addition, we are switching OutboundCall allocation away from using make_shared to prevent long-lived weak OutboundCall pointers from consuming memory.

Test Plan:
Jenkins

Local sysbench testing:

```
./yb_build.sh release --no-tests --skip-java
bin/yb-ctl wipe_restart --tserver_flag="memory_limit_hard_bytes=$(( 4 * 1024 * 1024 * 1024 ))"
bin/yb-ctl add_node
bin/yb-ctl add_node
```

Verify we have 3 nodes:

```
bin/yb-ctl status
```

Run sysbench with a table size of 1000000 using a modified version of the run_sysbench.sh script ( https://gist.githubusercontent.com/mbautin/6f0d441b28479232bd2af9b434bed242/raw )

```
./run_sysbench.sh --tablesize 1000000
```

Wait until the run completes. Check the logs for errors (there should be none):

```
find ~/yugabyte-data -name "yb-tserver.INFO" -exec grep -E "Stuck|large number" {} \;
```

Reviewers: sergei

Reviewed By: sergei

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D28531
mbautin added a commit to mbautin/yugabyte-db that referenced this issue Oct 10, 2023
…actor level

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called. Report stuck calls with detailed debug information. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition. Logically, there should never be a race between e.g. processing a call response and a call timeout, because both events happen on the reactor thread.

Fixing error handling in DoQueueOutboundData. If there is an error sending the call, we destroy the connection.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

We introduce multiple ways to simulate stuck outbound calls for testing.
- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in `Connection::QueueOutboundCall`, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in `stream_->Send()` called from `Connection::DoQueueOutboundData`. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause `Connection::QueueOutboundCall` to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Also replacing a DFATAL with an WARNING log in InboundCall::QueueResponse to avoid crashing tests during shutdown. Failures to queue a response on the server side should just result in timeouts on the client side in the worst case.

Test Plan:
Jenkins

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W0907 06:27:34.604096 54296 connection.cc:302] Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100: Simulating a call stuck in SENT state: RPC call 0x0000147cba7f4de0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f276aefeb60 -> tcpc }, id: 7840, state: SENT, transfer_state: PENDING, start_time: 3491.468s (0.001s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: +inf, now: 3491.469s, connection: Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100, active_call_state: kNotAdded
...
W0907 06:27:38.667785 54296 reactor.cc:676] TabletServer_R010: Stuck OutboundCall: RPC call 0x0000147cba7f4de0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f276aefeb60 -> tcpc }, id: 7840, state: SENT, transfer_state: PENDING, start_time: 3491.468s (4.064s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: 3494.469s (1.063s ago), now: 3495.532s, connection: Connection (0x0000147cbdb4de18) client 127.0.0.1:51949 => 127.0.0.2:9100, active_call_state: kNotAdded (forcing a timeout)
W0907 06:27:38.667878 54602 consensus_peers.cc:603] T 49b1706cfba3483b8da3a286fd2115b1 P 02949377cfb94b308697b79fd5581fcf -> Peer fb5b7429056348329f3af97271a630ac ([host: "127.0.0.2" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:603): UpdateConsensus RPC (request call id 7840) to 127.0.0.2:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out. If they were not, they would be logged every minute.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W0907 06:35:47.248174 58709 connection.cc:391] Simulated network failure: Network error (yb/rpc/connection.cc:390): Simulated failure to send outbound data for 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747)
...
W0907 06:35:47.248219 58709 inbound_call.cc:95] 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747): Connection torn down before Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747) could send its response: Network error (yb/rpc/connection.cc:390): Simulated failure to send outbound data for 0x0000068b3935f500 -> Call yb.tserver.TabletServerService.Write 127.0.0.2:47425 => 127.0.0.1:9100 (request call id 6511747)
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W0907 06:38:05.760689 60982 outbound_call.cc:422] OutboundCall@0x00000774ff2a3ba0: Skipping OutboundCall callback as a test: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (now), sent_time: 4122.626s (now), trigger_callback_time: 4122.626s (now), invoke_callback_time: -inf, expiration_time: 4125.626s (3.000s from now), now: 4122.626s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
...
W0907 06:38:09.768546 60605 reactor.cc:676] TabletServer_R012: Stuck OutboundCall: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (4.008s ago), sent_time: 4122.626s (4.008s ago), trigger_callback_time: 4122.626s (4.008s ago), invoke_callback_time: -inf, expiration_time: 4125.626s (1.008s ago), now: 4126.634s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
...
W0907 06:39:09.867923 60605 reactor.cc:676] TabletServer_R012: Stuck OutboundCall: RPC call 0x00000774ff2a3ba0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 7 protocol: 0x00007f776d81eb60 -> tcpc }, id: 703, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 4122.626s (64.106s ago), sent_time: 4122.626s (64.106s ago), trigger_callback_time: 4122.626s (64.106s ago), invoke_callback_time: -inf, expiration_time: 4125.626s (61.106s ago), now: 4186.733s, connection: Connection (0x00000774fd76a158) client 127.0.0.1:46355 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: sergei, kpopali

Reviewed By: sergei

Subscribers: bogdan, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D27735

[yugabyte#19090] Disable reactor-based outbound call expiration by default

Summary: Disable reactor-based outbound call expiration by default because it causes sysbench perf test failures.

Test Plan: Jenkins

Reviewers: bogdan, mlillibridge

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D28515

[yugabyte#19090] Fix a memory leak in tracked outbound calls in Reactor

Summary:
Fix a memory leak in tracked_outbound_calls_ in Reactor by cleaning up that data structure more aggressively:
- When traversing tracked calls in the order of increasing next check time, even if it already greater than the current time, stay in the loop and continue removing entries where the callback has been called or the weak pointer has expired.
- When a callback is invoked, notify the reactor through a multi-producer single-consumer queue that is accessed from the OutboundCall via a weak pointer.

Also remove the FinalizeTrackedCall function that was likely useless because it was called concurrently with the callback being handled by the thread pool.

In addition, we are switching OutboundCall allocation away from using make_shared to prevent long-lived weak OutboundCall pointers from consuming memory.

Test Plan:
Jenkins

Local sysbench testing:

```
./yb_build.sh release --no-tests --skip-java
bin/yb-ctl wipe_restart --tserver_flag="memory_limit_hard_bytes=$(( 4 * 1024 * 1024 * 1024 ))"
bin/yb-ctl add_node
bin/yb-ctl add_node
```

Verify we have 3 nodes:

```
bin/yb-ctl status
```

Run sysbench with a table size of 1000000 using a modified version of the run_sysbench.sh script ( https://gist.githubusercontent.com/mbautin/6f0d441b28479232bd2af9b434bed242/raw )

```
./run_sysbench.sh --tablesize 1000000
```

Wait until the run completes. Check the logs for errors (there should be none):

```
find ~/yugabyte-data -name "yb-tserver.INFO" -exec grep -E "Stuck|large number" {} \;
```

Reviewers: sergei

Reviewed By: sergei

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D28531
mbautin added a commit that referenced this issue Oct 18, 2023
…alls

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called, using a multiple-producer single-consumer (MPSC) queue where the consumer is the reactor thread. Report stuck calls with detailed debug information. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition. Logically, there should never be a race between e.g. processing a call response and a call timeout, because both events happen on the reactor thread.

Fixing error handling in DoQueueOutboundData. If there is an error sending the call, we destroy the connection.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

We introduce multiple ways to simulate stuck outbound calls for testing.

TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in Connection::QueueOutboundCall, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in stream_->Send() called from Connection::DoQueueOutboundData. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause Connection::QueueOutboundCall to attempt to schedule a timer in a connection that has already been shut down.
TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.
Also replacing a DFATAL with an WARNING log in InboundCall::QueueResponse to avoid crashing tests during shutdown. Failures to queue a response on the server side should just result in timeouts on the client side in the worst case.

We are combining multiple original revisions from the master branch to avoid introducing and then fixing a memory leak bug, and to keep the resulting patch cleaner.

Original revisions:

- D27735 ( 1fbff49 ) [#18685] Track stuck OutboundCalls and enforce timeouts at Reactor level
- D27919 ( ddb817b ) [#18808] A write-once lock-free weak pointer class: it is frequently necessary to have a weak pointer equivalent that can only be written to once, but can be read without locking for performance.
- D28085 ( a5ba08f ) [#18855] Add functions for formatting a CoarseTimePoint relative to the current time. Useful for logging deadlines.
- D28138 ( d0a44db ) [#18877] Log invalid OutboundCall state transitions and other OutboundCall improvements. OutboundCall will track invalid state transitions and report them in the DebugString and in the destructor, in addition to reporting them as they happen. Do not allow OutboundCall callback to be invoked more than once and log occurrences of that. Use a compare-and-swap loop in RpcCall::Transferred and log invalid transfer_state transitions. invoke_callback_time_ and sent_time_ in OutboundCall were being accessed without synchronization, so making them atomic. OutboundCall::DebugString should not queue a task in the reactor to print the connection details in relation to the call anymore. That is now done by a separate method, QueueDumpConnectionState, called explicitly.
- D28531 ( 0648341 ) [#19090] Fix a memory leak in tracked outbound calls in Reactor. When an OutboundCall callback is invoked, notify the reactor through a multi-producer single-consumer queue that is accessed from the OutboundCall via a weak pointer. Also switching OutboundCall allocation away from using make_shared to prevent long-lived weak OutboundCall pointers from consuming memory.

Test Plan:
Jenkins

---

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W1017 21:23:13.309160 2187266 connection.cc:300] Connection (0x0000556728cd2158) client 127.0.0.1:38331 => 127.0.0.2:9100: Simulating a call stuck in SENT state: RPC call 0x000055672aea9080: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 4 protocol: 0x00007f03547e75d0 -> tcpc }, id: 70811, state: SENT, transfer_state: PENDING, start_time: 257388.919s (0.004s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: +inf, now: 257388.923s, connection: Connection (0x0000556728cd2158) client 127.0.0.1:38331 => 127.0.0.2:9100, active_call_state: kNotAdded
...
W1017 21:23:17.377820 2187266 reactor.cc:682] TabletServer_R000: Stuck OutboundCall: RPC call 0x000055672aea9080: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 4 protocol: 0x00007f03547e75d0 -> tcpc }, id: 70811, state: SENT, transfer_state: PENDING, start_time: 257388.919s (4.073s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: 257391.923s (1.069s ago), now: 257392.992s, connection: Connection (0x0000556728cd2158) client 127.0.0.1:38331 => 127.0.0.2:9100, active_call_state: kNotAdded (forcing a timeout)
W1017 21:23:17.378927 2187522 consensus_peers.cc:605] T 775284184b9241a89206fc0cdaa59302 P a5cddad5f52643e183e281d890cea703 -> Peer 1b86c67a8a50457794972712d3f0a860 ([host: "127.0.0.2" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:614): UpdateConsensus RPC (request call id 70811) to 127.0.0.2:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out. If they were not, they would be logged every minute.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W1017 21:37:57.068120 2190494 connection.cc:389] Simulated network failure: Network error (yb/rpc/connection.cc:388): Simulated failure to send outbound data for 0x00005623a3e23b80 -> Call yb.tserver.TabletServerService.Write 127.0.0.3:36207 => 127.0.0.1:9100 (request call id 907743)
...
W1017 21:37:57.068182 2190494 inbound_call.cc:99] 0x00005623a3e23b80 -> Call yb.tserver.TabletServerService.Write 127.0.0.3:36207 => 127.0.0.1:9100 (request call id 907743): Connection torn down before Call yb.tserver.TabletServerService.Write 127.0.0.3:36207 => 127.0.0.1:9100 (request call id 907743) could send its response: Network error (yb/rpc/connection.cc:388): Simulated failure to send outbound data for 0x00005623a3e23b80 -> Call yb.tserver.TabletServerService.Write 127.0.0.3:36207 => 127.0.0.1:9100 (request call id 907743)
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W1017 21:40:48.326967 2193494 outbound_call.cc:421] OutboundCall@0x00005578719ec2c0: Skipping OutboundCall callback as a test: RPC call 0x00005578719ec2c0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 2 protocol: 0x00007f11979b25d0 -> tcpc }, id: 3497, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 258443.926s (0.016s ago), sent_time: 258443.926s (0.016s ago), trigger_callback_time: 258443.941s (0.001s ago), invoke_callback_time: -inf, expiration_time: 258446.926s (2.983s from now), now: 258443.942s, connection: Connection (0x000055786dd06158) client 127.0.0.1:52911 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1017 21:40:52.362062 2193263 reactor.cc:682] TabletServer_R002: Stuck OutboundCall: RPC call 0x00005578719ec2c0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 2 protocol: 0x00007f11979b25d0 -> tcpc }, id: 3497, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 258443.926s (4.050s ago), sent_time: 258443.926s (4.050s ago), trigger_callback_time: 258443.941s (4.035s ago), invoke_callback_time: -inf, expiration_time: 258446.926s (1.050s ago), now: 258447.976s, connection: Connection (0x000055786dd06158) client 127.0.0.1:52911 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1017 21:41:22.461496 2193263 reactor.cc:682] TabletServer_R002: Stuck OutboundCall: RPC call 0x00005578719ec2c0: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 2 protocol: 0x00007f11979b25d0 -> tcpc }, id: 3497, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 258443.926s (34.149s ago), sent_time: 258443.926s (34.149s ago), trigger_callback_time: 258443.941s (34.134s ago), invoke_callback_time: -inf, expiration_time: 258446.926s (31.149s ago), now: 258478.075s, connection: Connection (0x000055786dd06158) client 127.0.0.1:52911 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: bogdan, rthallam, sergei

Reviewed By: sergei

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D29215
mbautin added a commit to mbautin/yugabyte-db that referenced this issue Oct 24, 2023
…alls

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called, using a multiple-producer single-consumer (MPSC) queue where the consumer is the reactor thread. Report stuck calls with detailed debug information. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition. Logically, there should never be a race between e.g. processing a call response and a call timeout, because both events happen on the reactor thread.

Fixing error handling in DoQueueOutboundData. If there is an error sending the call, we destroy the connection.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

We introduce multiple ways to simulate stuck outbound calls for testing.

- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in -Connection::QueueOutboundCall, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in stream_->Send() called from Connection::DoQueueOutboundData. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause Connection::QueueOutboundCall to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Also replacing a DFATAL with an WARNING log in InboundCall::QueueResponse to avoid crashing tests during shutdown. Failures to queue a response on the server side should just result in timeouts on the client side in the worst case.

We are combining multiple original revisions from the master branch to avoid introducing and then fixing a memory leak bug, and to keep the resulting patch cleaner.

This revision was backported from the 2.18 backport D29215 ( d028fe2 ).

Original revisions in master:

- D27735 ( 1fbff49 ) [yugabyte#18685] Track stuck OutboundCalls and enforce timeouts at Reactor level
- D27919 ( ddb817b ) [yugabyte#18808] A write-once lock-free weak pointer class: it is frequently necessary to have a weak pointer equivalent that can only be written to once, but can be read without locking for performance.
- D28085 ( a5ba08f ) [yugabyte#18855] Add functions for formatting a CoarseTimePoint relative to the current time. Useful for logging deadlines.
- D28138 ( d0a44db ) [yugabyte#18877] Log invalid OutboundCall state transitions and other OutboundCall improvements. OutboundCall will track invalid state transitions and report them in the DebugString and in the destructor, in addition to reporting them as they happen. Do not allow OutboundCall callback to be invoked more than once and log occurrences of that. Use a compare-and-swap loop in RpcCall::Transferred and log invalid transfer_state transitions. invoke_callback_time_ and sent_time_ in OutboundCall were being accessed without synchronization, so making them atomic. OutboundCall::DebugString should not queue a task in the reactor to print the connection details in relation to the call anymore. That is now done by a separate method, QueueDumpConnectionState, called explicitly.
- D28531 ( 0648341 ) [yugabyte#19090] Fix a memory leak in tracked outbound calls in Reactor. When an OutboundCall callback is invoked, notify the reactor through a multi-producer single-consumer queue that is accessed from the OutboundCall via a weak pointer. Also switching OutboundCall allocation away from using make_shared to prevent long-lived weak OutboundCall pointers from consuming memory.

Also, parts of D23468 ( 4879cfb ) "[yugabyte#16380] Use thread safety analysis annotations to enforce the reactor thread requirement" are included. That commit is present in 2.18 but not in 2.16.

Test Plan:
Jenkins

---

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W1024 21:29:58.941679 1731793 connection.cc:307] Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100: Simulating a call stuck in SENT state: RPC call 0x0000564dbbf3e580: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 6 protocol: 0x00007fd157f9cd70 -> tcpc }, id: 85493, state: SENT, transfer_state: PENDING, start_time: 455775.550s (0.001s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: +inf, now: 455775.551s, connection: Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100, active_call_state: kNotAdded
...
W1024 21:30:03.017571 1731793 reactor.cc:1142] TabletServer_R002: Stuck OutboundCall: RPC call 0x0000564dbbf3e580: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 6 protocol: 0x00007fd157f9cd70 -> tcpc }, id: 85493, state: SENT, transfer_state: PENDING, start_time: 455775.550s (4.077s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: 455778.551s (1.076s ago), now: 455779.628s, connection: Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100, active_call_state: kNotAdded (forcing a timeout)
W1024 21:30:03.017947 1731995 consensus_peers.cc:611] T 9735d87712974ea9b848cc606db8d664 P 9dcaaf7955ea4ca78667b1fa92c89192 -> Peer b6dbb1d832e14b0080d2143ac43fbaae ([host: "127.0.0.2" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:612): UpdateConsensus RPC (request call id 85493) to 127.0.0.2:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out. If they were not, they would be logged every minute.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W1024 22:33:22.772235 1739663 connection.cc:401] Simulated network failure: Network error (yb/rpc/connection.cc:400): Simulated failure to send outbound data for 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792
...
W1024 22:33:22.772346 1739663 inbound_call.cc:104] 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792: Connection torn down before CQL Call from 127.0.0.1:50690, stream id: 9792 could send its response: Network error (yb/rpc/connection.cc:400): Simulated failure to send outbound data for 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W1024 22:45:47.389113 1743081 outbound_call.cc:419] OutboundCall@0x0000562012975600: Skipping OutboundCall callback as a test: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (0.001s ago), sent_time: 460323.999s (0.001s ago), trigger_callback_time: 460323.999s (0.001s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (2.998s from now), now: 460324.000s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1024 22:45:51.463742 1742860 reactor.cc:1142] TabletServer_R001: Stuck OutboundCall: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (4.075s ago), sent_time: 460323.999s (4.075s ago), trigger_callback_time: 460323.999s (4.075s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (1.075s ago), now: 460328.074s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1024 22:46:21.564029 1742860 reactor.cc:1142] TabletServer_R001: Stuck OutboundCall: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (34.175s ago), sent_time: 460323.999s (34.175s ago), trigger_callback_time: 460323.999s (34.175s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (31.175s ago), now: 460358.174s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: sergei, bogdan

Reviewed By: sergei

Subscribers: bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D29539
mbautin added a commit that referenced this issue Oct 25, 2023
…alls

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called, using a multiple-producer single-consumer (MPSC) queue where the consumer is the reactor thread. Report stuck calls with detailed debug information. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition. Logically, there should never be a race between e.g. processing a call response and a call timeout, because both events happen on the reactor thread.

Fixing error handling in DoQueueOutboundData. If there is an error sending the call, we destroy the connection.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

We introduce multiple ways to simulate stuck outbound calls for testing.

- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in -Connection::QueueOutboundCall, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in stream_->Send() called from Connection::DoQueueOutboundData. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause Connection::QueueOutboundCall to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Also replacing a DFATAL with an WARNING log in InboundCall::QueueResponse to avoid crashing tests during shutdown. Failures to queue a response on the server side should just result in timeouts on the client side in the worst case.

We are combining multiple original revisions from the master branch to avoid introducing and then fixing a memory leak bug, and to keep the resulting patch cleaner.

This revision was backported from the 2.18 backport D29215 ( d028fe2 ).

Original revisions in master:

- D27735 ( 1fbff49 ) [#18685] Track stuck OutboundCalls and enforce timeouts at Reactor level
- D27919 ( ddb817b ) [#18808] A write-once lock-free weak pointer class: it is frequently necessary to have a weak pointer equivalent that can only be written to once, but can be read without locking for performance.
- D28085 ( a5ba08f ) [#18855] Add functions for formatting a CoarseTimePoint relative to the current time. Useful for logging deadlines.
- D28138 ( d0a44db ) [#18877] Log invalid OutboundCall state transitions and other OutboundCall improvements. OutboundCall will track invalid state transitions and report them in the DebugString and in the destructor, in addition to reporting them as they happen. Do not allow OutboundCall callback to be invoked more than once and log occurrences of that. Use a compare-and-swap loop in RpcCall::Transferred and log invalid transfer_state transitions. invoke_callback_time_ and sent_time_ in OutboundCall were being accessed without synchronization, so making them atomic. OutboundCall::DebugString should not queue a task in the reactor to print the connection details in relation to the call anymore. That is now done by a separate method, QueueDumpConnectionState, called explicitly.
- D28531 ( 0648341 ) [#19090] Fix a memory leak in tracked outbound calls in Reactor. When an OutboundCall callback is invoked, notify the reactor through a multi-producer single-consumer queue that is accessed from the OutboundCall via a weak pointer. Also switching OutboundCall allocation away from using make_shared to prevent long-lived weak OutboundCall pointers from consuming memory.

Also, parts of D23468 ( 4879cfb ) "[#16380] Use thread safety analysis annotations to enforce the reactor thread requirement" are included. That commit is present in 2.18 but not in 2.16.

Test Plan:
Jenkins

---

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W1024 21:29:58.941679 1731793 connection.cc:307] Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100: Simulating a call stuck in SENT state: RPC call 0x0000564dbbf3e580: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 6 protocol: 0x00007fd157f9cd70 -> tcpc }, id: 85493, state: SENT, transfer_state: PENDING, start_time: 455775.550s (0.001s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: +inf, now: 455775.551s, connection: Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100, active_call_state: kNotAdded
...
W1024 21:30:03.017571 1731793 reactor.cc:1142] TabletServer_R002: Stuck OutboundCall: RPC call 0x0000564dbbf3e580: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 6 protocol: 0x00007fd157f9cd70 -> tcpc }, id: 85493, state: SENT, transfer_state: PENDING, start_time: 455775.550s (4.077s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: 455778.551s (1.076s ago), now: 455779.628s, connection: Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100, active_call_state: kNotAdded (forcing a timeout)
W1024 21:30:03.017947 1731995 consensus_peers.cc:611] T 9735d87712974ea9b848cc606db8d664 P 9dcaaf7955ea4ca78667b1fa92c89192 -> Peer b6dbb1d832e14b0080d2143ac43fbaae ([host: "127.0.0.2" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:612): UpdateConsensus RPC (request call id 85493) to 127.0.0.2:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out. If they were not, they would be logged every minute.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W1024 22:33:22.772235 1739663 connection.cc:401] Simulated network failure: Network error (yb/rpc/connection.cc:400): Simulated failure to send outbound data for 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792
...
W1024 22:33:22.772346 1739663 inbound_call.cc:104] 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792: Connection torn down before CQL Call from 127.0.0.1:50690, stream id: 9792 could send its response: Network error (yb/rpc/connection.cc:400): Simulated failure to send outbound data for 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W1024 22:45:47.389113 1743081 outbound_call.cc:419] OutboundCall@0x0000562012975600: Skipping OutboundCall callback as a test: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (0.001s ago), sent_time: 460323.999s (0.001s ago), trigger_callback_time: 460323.999s (0.001s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (2.998s from now), now: 460324.000s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1024 22:45:51.463742 1742860 reactor.cc:1142] TabletServer_R001: Stuck OutboundCall: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (4.075s ago), sent_time: 460323.999s (4.075s ago), trigger_callback_time: 460323.999s (4.075s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (1.075s ago), now: 460328.074s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1024 22:46:21.564029 1742860 reactor.cc:1142] TabletServer_R001: Stuck OutboundCall: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (34.175s ago), sent_time: 460323.999s (34.175s ago), trigger_callback_time: 460323.999s (34.175s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (31.175s ago), now: 460358.174s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: sergei, bogdan

Reviewed By: sergei

Subscribers: bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D29539
mbautin added a commit to mbautin/yugabyte-db that referenced this issue Oct 25, 2023
…alls

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called, using a multiple-producer single-consumer (MPSC) queue where the consumer is the reactor thread. Report stuck calls with detailed debug information. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition. Logically, there should never be a race between e.g. processing a call response and a call timeout, because both events happen on the reactor thread.

Fixing error handling in DoQueueOutboundData. If there is an error sending the call, we destroy the connection.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

We introduce multiple ways to simulate stuck outbound calls for testing.

- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in -Connection::QueueOutboundCall, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in stream_->Send() called from Connection::DoQueueOutboundData. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause Connection::QueueOutboundCall to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Also replacing a DFATAL with an WARNING log in InboundCall::QueueResponse to avoid crashing tests during shutdown. Failures to queue a response on the server side should just result in timeouts on the client side in the worst case.

We are combining multiple original revisions from the master branch to avoid introducing and then fixing a memory leak bug, and to keep the resulting patch cleaner.

This revision was backported from the 2.18 backport D29215 ( d028fe2 ).

Original revisions in master:

- D27735 ( 1fbff49 ) [yugabyte#18685] Track stuck OutboundCalls and enforce timeouts at Reactor level
- D27919 ( ddb817b ) [yugabyte#18808] A write-once lock-free weak pointer class: it is frequently necessary to have a weak pointer equivalent that can only be written to once, but can be read without locking for performance.
- D28085 ( a5ba08f ) [yugabyte#18855] Add functions for formatting a CoarseTimePoint relative to the current time. Useful for logging deadlines.
- D28138 ( d0a44db ) [yugabyte#18877] Log invalid OutboundCall state transitions and other OutboundCall improvements. OutboundCall will track invalid state transitions and report them in the DebugString and in the destructor, in addition to reporting them as they happen. Do not allow OutboundCall callback to be invoked more than once and log occurrences of that. Use a compare-and-swap loop in RpcCall::Transferred and log invalid transfer_state transitions. invoke_callback_time_ and sent_time_ in OutboundCall were being accessed without synchronization, so making them atomic. OutboundCall::DebugString should not queue a task in the reactor to print the connection details in relation to the call anymore. That is now done by a separate method, QueueDumpConnectionState, called explicitly.
- D28531 ( 0648341 ) [yugabyte#19090] Fix a memory leak in tracked outbound calls in Reactor. When an OutboundCall callback is invoked, notify the reactor through a multi-producer single-consumer queue that is accessed from the OutboundCall via a weak pointer. Also switching OutboundCall allocation away from using make_shared to prevent long-lived weak OutboundCall pointers from consuming memory.

Also, parts of D23468 ( 4879cfb ) "[yugabyte#16380] Use thread safety analysis annotations to enforce the reactor thread requirement" are included. That commit is present in 2.18 but not in 2.16.

Test Plan:
Jenkins

---

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W1024 21:29:58.941679 1731793 connection.cc:307] Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100: Simulating a call stuck in SENT state: RPC call 0x0000564dbbf3e580: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 6 protocol: 0x00007fd157f9cd70 -> tcpc }, id: 85493, state: SENT, transfer_state: PENDING, start_time: 455775.550s (0.001s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: +inf, now: 455775.551s, connection: Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100, active_call_state: kNotAdded
...
W1024 21:30:03.017571 1731793 reactor.cc:1142] TabletServer_R002: Stuck OutboundCall: RPC call 0x0000564dbbf3e580: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 6 protocol: 0x00007fd157f9cd70 -> tcpc }, id: 85493, state: SENT, transfer_state: PENDING, start_time: 455775.550s (4.077s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: 455778.551s (1.076s ago), now: 455779.628s, connection: Connection (0x0000564dbd812298) client 127.0.0.1:39689 => 127.0.0.2:9100, active_call_state: kNotAdded (forcing a timeout)
W1024 21:30:03.017947 1731995 consensus_peers.cc:611] T 9735d87712974ea9b848cc606db8d664 P 9dcaaf7955ea4ca78667b1fa92c89192 -> Peer b6dbb1d832e14b0080d2143ac43fbaae ([host: "127.0.0.2" port: 9100], []): Couldn't send request.  Status: Timed out (yb/rpc/outbound_call.cc:612): UpdateConsensus RPC (request call id 85493) to 127.0.0.2:9100 timed out after 3.000s. Retrying in the next heartbeat period. Already tried 1 times. State: 2
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out. If they were not, they would be logged every minute.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W1024 22:33:22.772235 1739663 connection.cc:401] Simulated network failure: Network error (yb/rpc/connection.cc:400): Simulated failure to send outbound data for 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792
...
W1024 22:33:22.772346 1739663 inbound_call.cc:104] 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792: Connection torn down before CQL Call from 127.0.0.1:50690, stream id: 9792 could send its response: Network error (yb/rpc/connection.cc:400): Simulated failure to send outbound data for 0x000055999706f040 -> CQL Call from 127.0.0.1:50690, stream id: 9792
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W1024 22:45:47.389113 1743081 outbound_call.cc:419] OutboundCall@0x0000562012975600: Skipping OutboundCall callback as a test: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (0.001s ago), sent_time: 460323.999s (0.001s ago), trigger_callback_time: 460323.999s (0.001s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (2.998s from now), now: 460324.000s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1024 22:45:51.463742 1742860 reactor.cc:1142] TabletServer_R001: Stuck OutboundCall: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (4.075s ago), sent_time: 460323.999s (4.075s ago), trigger_callback_time: 460323.999s (4.075s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (1.075s ago), now: 460328.074s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1024 22:46:21.564029 1742860 reactor.cc:1142] TabletServer_R001: Stuck OutboundCall: RPC call 0x0000562012975600: yb.consensus.ConsensusService.UpdateConsensus -> { remote: 127.0.0.2:9100 idx: 1 protocol: 0x00007fbff47fed70 -> tcpc }, id: 3815, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 460323.999s (34.175s ago), sent_time: 460323.999s (34.175s ago), trigger_callback_time: 460323.999s (34.175s ago), invoke_callback_time: -inf, expiration_time: 460326.999s (31.175s ago), now: 460358.174s, connection: Connection (0x00005620102b0798) client 127.0.0.1:55971 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: sergei, bogdan

Subscribers: bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29679
mbautin added a commit that referenced this issue Oct 26, 2023
…alls

Summary:
Track all outbound calls submitted to a Reactor in a multi-index container in that reactor. Remove them from the container as soon as we learn the callback has been called, using a multiple-producer single-consumer (MPSC) queue where the consumer is the reactor thread. Report stuck calls with detailed debug information. A call is considered stuck if it is a certain time past its expiration (specified by reactor_based_outbound_call_expiration_delay_ms -- set it to 0 to disable this feature). If the call does not have a timeout, the default value stuck_outbound_call_default_timeout_sec is used for reporting only, but not for timeout enforcement.

OutboundCall now keeps track of its expiration deadline, and an additional field active_call_state_, which indicates whether the call was added to the connection's active_calls_ in the corresponding connection, or removed from it, and the removal reason.

If in the OutboundCall destructor the state of the call is not final, or the callback has not been called, we will log that even in release build.

In OutboundCall::SetState, if the current state is already finished, treat this the same as any other invalid state transition. Logically, there should never be a race between e.g. processing a call response and a call timeout, because both events happen on the reactor thread.

Fixing error handling in DoQueueOutboundData. If there is an error sending the call, we destroy the connection.

Introduce a typedef CallHandle and a special value kUnknownCallHandle instead of just using the max value of size_t in case the call handle is unknown or not being used. Change the error handling logic in Connection::DoQueueOutboundData to avoid returning kUnknownCallHandle in case of errors, and make sure the callback is called on the call in case of those errors. The connection was already being shut down.

Update the YB_STRUCT_TO_STRING macro implementation to always add parentheses around an explicitly specified field value.

We introduce multiple ways to simulate stuck outbound calls for testing.

- TEST_simulated_sent_stuck_call_probability specifies the probability of pretending a call is sent to the remote server in -Connection::QueueOutboundCall, but instead just transitioning it to a SENT state. This is similar to the situation that we have observed.
- TEST_simulated_failure_to_send_call_probability specifies the probability of a network error in stream_->Send() called from Connection::DoQueueOutboundData. This will cause the connection to be closed. Prior to this revision, this kind of an error would cause Connection::QueueOutboundCall to attempt to schedule a timer in a connection that has already been shut down.
- TEST_outbound_call_skip_callback_probability specifies the probability of skipping calling the callback on a particular remote RPC call. We don't do this for local calls.

Also replacing a DFATAL with an WARNING log in InboundCall::QueueResponse to avoid crashing tests during shutdown. Failures to queue a response on the server side should just result in timeouts on the client side in the worst case.

We are combining multiple original revisions from the master branch to avoid introducing and then fixing a memory leak bug, and to keep the resulting patch cleaner.

This revision was backported from the 2.18 backport D29215 ( d028fe2 ).

Original revisions in master:

- D27735 ( 1fbff49 ) [#18685] Track stuck OutboundCalls and enforce timeouts at Reactor level
- D27919 ( ddb817b ) [#18808] A write-once lock-free weak pointer class: it is frequently necessary to have a weak pointer equivalent that can only be written to once, but can be read without locking for performance.
- D28085 ( a5ba08f ) [#18855] Add functions for formatting a CoarseTimePoint relative to the current time. Useful for logging deadlines.
- D28138 ( d0a44db ) [#18877] Log invalid OutboundCall state transitions and other OutboundCall improvements. OutboundCall will track invalid state transitions and report them in the DebugString and in the destructor, in addition to reporting them as they happen. Do not allow OutboundCall callback to be invoked more than once and log occurrences of that. Use a compare-and-swap loop in RpcCall::Transferred and log invalid transfer_state transitions. invoke_callback_time_ and sent_time_ in OutboundCall were being accessed without synchronization, so making them atomic. OutboundCall::DebugString should not queue a task in the reactor to print the connection details in relation to the call anymore. That is now done by a separate method, QueueDumpConnectionState, called explicitly.
- D28531 ( 0648341 ) [#19090] Fix a memory leak in tracked outbound calls in Reactor. When an OutboundCall callback is invoked, notify the reactor through a multi-producer single-consumer queue that is accessed from the OutboundCall via a weak pointer. Also switching OutboundCall allocation away from using make_shared to prevent long-lived weak OutboundCall pointers from consuming memory.

Also, parts of D23468 ( 4879cfb ) "[#16380] Use thread safety analysis annotations to enforce the reactor thread requirement" are included. That commit is present in 2.18 but not in 2.16.

Test Plan:
Jenkins

---

Manual testing details below.

First test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_sent_stuck_call_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3

java -jar yb-sample-apps.jar  --workload CassandraBatchKeyValue --nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_threads_read 16 --num_threads_write 16 --num_reads 1000000000000 --num_writes 1000000000000
```

Look for output like this in the logs:

```
W1026 21:24:56.539093 414631 connection.cc:309] Connection (0x0000000002eb4798) client 127.0.0.1:39823 => 127.0.0.3:9100: Simulating a call stuck in SENT state: RPC call 0x00000000027de000: yb.tserver.TabletServerService.Read -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007f881e43d288 -> tcpc }, id: 40542, state: SENT, transfer_state: PENDING, start_time: 11115.055s (now), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: +inf, now: 11115.055s, connection: Connection (0x0000000002eb4798) client 127.0.0.1:39823 => 127.0.0.3:9100, active_call_state: kNotAdded
...
W1026 21:25:57.560870 414631 reactor.cc:1160] TabletServer_R000: Stuck OutboundCall: RPC call 0x00000000027de000: yb.tserver.TabletServerService.Read -> { remote: 127.0.0.3:9100 idx: 6 protocol: 0x00007f881e43d288 -> tcpc }, id: 40542, state: SENT, transfer_state: PENDING, start_time: 11115.055s (61.021s ago), sent_time: -inf, trigger_callback_time: -inf, invoke_callback_time: -inf, expiration_time: 11175.055s (1.021s ago), now: 11176.077s, connection: Connection (0x0000000002eb4798) client 127.0.0.1:39823 => 127.0.0.3:9100, active_call_state: kNotAdded (forcing a timeout)
W1026 21:25:57.560998 414715 tablet_rpc.cc:473] Timed out (yb/rpc/outbound_call.cc:607): Failed Read(tablet: 1a3b09b848f1430f81a9351fbc467969, num_ops: 1, num_attempts: 1, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 1a3b09b848f1430f81a9351fbc467969 on tablet server { uuid: 082e0b47233a4c3aa675aa325a33f2b9 private: [host: "127.0.0.3" port: 9100] cloud_info: placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" after 1 attempt(s): Read RPC (request call id 40542) to 127.0.0.3:9100 timed out after 60.000s
```

Ensure that all stuck outbound calls are eventually detected. After stopping the workload, the output of the command below should gradually decrease to zero after a couple of minutes:

```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Simulating a call stuck in SENT state" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

Also, these stuck calls should stop appearing in the log because they would be forcibly timed out. If they were not, they would be logged every minute.

Second test mode
================

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_simulated_failure_to_send_call_probability=0.00001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Simulated failure to send outbound data" in the tablet server log:

```
W1026 21:44:12.015067 424964 connection.cc:404] Simulated network failure: Network error (yb/rpc/connection.cc:403): Simulated failure to send outbound data for 0x0000000006874800 -> Call yb.consensus.ConsensusService.UpdateConsensus 127.0.0.2:50031 => 127.0.0.1:9100 (request call id 6989)
. . .
W1026 21:44:12.015129 424964 inbound_call.cc:104] 0x0000000006874800 -> Call yb.consensus.ConsensusService.UpdateConsensus 127.0.0.2:50031 => 127.0.0.1:9100 (request call id 6989): Connection torn down before Call yb.consensus.ConsensusService.UpdateConsensus 127.0.0.2:50031 => 127.0.0.1:9100 (request call id 6989) could send its response: Network error (yb/rpc/connection.cc:403): Simulated failure to send outbound data for 0x0000000006874800 -> Call yb.consensus.ConsensusService.UpdateConsensus 127.0.0.2:50031 => 127.0.0.1:9100 (request call id 6989)
```

There should be no stuck outbound calls in the log, because these simulated failures will just result in the connection being closed. Depending on the value TEST_simulated_failure_to_send_call_probability, the workload may succeed or fail. With the value 0.00001, it manages to make progress in my experience.

Third test mode
===============

```
bin/yb-ctl wipe_restart  --tserver_flags="TEST_outbound_call_skip_callback_probability=0.0001,reactor_based_outbound_call_expiration_delay_ms=1000" --rf=3
```

Same workload as above.

Look for "Skipping OutboundCall callback as a test" and "Stuck OutboundCall" in the log.

Output:
```
W1026 21:49:18.330013 427274 outbound_call.cc:414] OutboundCall@0x0000000007f9cdc0: Skipping OutboundCall callback as a test: RPC call 0x0000000007f9cdc0: yb.tserver.TabletServerService.Write -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f4e5cb17288 -> tcpc }, id: 58289, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 12576.835s (0.011s ago), sent_time: 12576.835s (0.011s ago), trigger_callback_time: 12576.847s (now), invoke_callback_time: -inf, expiration_time: 12636.835s (59.988s from now), now: 12576.847s, connection: Connection (0x0000000004a51058) client 127.0.0.1:55591 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1026 21:50:19.366196 426374 reactor.cc:1160] TabletServer_R001: Stuck OutboundCall: RPC call 0x0000000007f9cdc0: yb.tserver.TabletServerService.Write -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f4e5cb17288 -> tcpc }, id: 58289, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 12576.835s (61.047s ago), sent_time: 12576.835s (61.047s ago), trigger_callback_time: 12576.847s (61.035s ago), invoke_callback_time: -inf, expiration_time: 12636.835s (1.047s ago), now: 12637.883s, connection: Connection (0x0000000004a51058) client 127.0.0.1:55591 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
. . .
W1026 21:50:49.467150 426374 reactor.cc:1160] TabletServer_R001: Stuck OutboundCall: RPC call 0x0000000007f9cdc0: yb.tserver.TabletServerService.Write -> { remote: 127.0.0.2:9100 idx: 5 protocol: 0x00007f4e5cb17288 -> tcpc }, id: 58289, state: FINISHED_SUCCESS, transfer_state: FINISHED, start_time: 12576.835s (91.148s ago), sent_time: 12576.835s (91.148s ago), trigger_callback_time: 12576.847s (91.136s ago), invoke_callback_time: -inf, expiration_time: 12636.835s (31.148s ago), now: 12667.984s, connection: Connection (0x0000000004a51058) client 127.0.0.1:55591 => 127.0.0.2:9100, active_call_state: kErasedOnResponse
```

Similarly to the first test, wait for all stuck outbound calls to get reported:
```
log_path="$HOME/yugabyte-data/node-1/disk-1/yb-data/tserver/logs/yb-tserver.INFO"; grep "Skipping OutboundCall callback as a test" "$log_path" | egrep -Eo 'id: [0-9]+' | awk '{print $NF}' | sort -n | while read c; do if ! grep "id: $c," "$log_path" | grep -q "Stuck OutboundCall"; then echo "Undetected stuck OutboundCall: id=$c"; fi; done | wc -l
```

In this mode, because we are skipping callbacks but allowing state transitions to the finished state, forced expiration at reactor level does not work.

Reviewers: sergei, bogdan

Reviewed By: sergei

Subscribers: bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D29679
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.14 Backport Required 2.16 Backport Required 2.18 Backport Required area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature priority/high High Priority
Projects
None yet
Development

No branches or pull requests

3 participants