[YSQL] Longer recovery window for YSQL apps after a network partition #11799

Closed
amitanandaiyer opened this issue Mar 18, 2022 · 5 comments
Assignees: amitanandaiyer
Labels: area/docdb (YugabyteDB core features)

Comments

@amitanandaiyer
Contributor

Description

Related to #11306

However, it turns out that YSQL takes much longer to recover than YCQL.

The core issue in #11306 causes an outage of a few seconds when the remote server disappears (i.e. the machine is lost/unreachable and its network stack is inactive).

This is compounded by two factors:

  1. The MetaCache updates the status of a RemoteReplica per tablet. So if we have a lot of tablets, each tablet may need to suffer a failing query before its entry gets updated.
  2. YSQL backends are independent processes and thus have separate metacaches. As a result, each YSQL connection has to fail once for each tablet before updating its cache and moving to the new leader.

We should be able to recover much more quickly by updating the state of all tablets whenever a remote server is unreachable.
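In other words, when a single RPC fails with a connection-level error, the client should invalidate its cached replica information for every tablet hosted on that server at once. A minimal sketch of that idea, using simplified stand-in names rather than YugabyteDB's actual client classes:

```cpp
// Illustrative only: simplified stand-ins for a client-side metacache.
// On a connect/host-unreachable failure against one tablet server, mark the
// server failed for *all* cached tablets so other lookups skip it immediately
// instead of each timing out against the same dead server.
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

struct RemoteTabletServer {
  std::string uuid;
};

class RemoteTablet {
 public:
  // Record that this replica is unusable; the next lookup on this tablet
  // refreshes leader information from the master instead of retrying it.
  void MarkReplicaFailed(const RemoteTabletServer& ts) {
    failed_replica_uuids_.push_back(ts.uuid);
  }

 private:
  std::vector<std::string> failed_replica_uuids_;
};

class MetaCache {
 public:
  // Called once per network-level failure against 'ts'. Every cached tablet
  // learns about the failure, not just the tablet whose RPC failed.
  void MarkTSFailed(const RemoteTabletServer& ts) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto& entry : tablets_by_id_) {
      entry.second->MarkReplicaFailed(ts);
    }
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<RemoteTablet>> tablets_by_id_;
};
```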

@amitanandaiyer amitanandaiyer added the area/docdb YugabyteDB core features label Mar 18, 2022
@amitanandaiyer amitanandaiyer self-assigned this Mar 18, 2022
@bmatican
Contributor

Linking #11306 here as well. IMO we can close both once the fix here lands.

@amitanandaiyer
Contributor Author

amitanandaiyer commented Mar 23, 2022

With the proposed changes, preliminary graphs show that the recovery happens in 15-30 seconds and does not increase with the number of tablets.

[Image: SQL-graphs]

@amitanandaiyer
Contributor Author

amitanandaiyer commented Mar 23, 2022

Instead of marking the node as unreachable on any NetworkError (which may be prone to false positives if we hit some other kind of network error), if we only mark it as unreachable on a Connect error/HostUnreachable (fewer false positives), we see that SQL workloads recover in ~30s.
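The narrower gating described above can be sketched as follows; this is illustrative only, with stand-in error categories rather than YugabyteDB's actual Status/RPC error API:

```cpp
// Only connect errors and host-unreachable errors count as evidence that the
// whole tablet server is down; other network errors (e.g. a reset or timeout
// on one connection) continue to affect only the single tablet involved.
enum class RpcErrorKind {
  kConnectError,       // could not establish a connection at all
  kHostUnreachable,    // routing failure, host is gone
  kTimedOut,           // might just be a slow or overloaded server
  kOtherNetworkError,  // e.g. a reset on one connection
};

bool ShouldFailWholeTabletServer(RpcErrorKind kind) {
  return kind == RpcErrorKind::kConnectError ||
         kind == RpcErrorKind::kHostUnreachable;
}
```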

[Screenshot: Screen Shot 2022-03-23 at 2 02 47 PM]

[Screenshot: Screen Shot 2022-03-23 at 2 07 24 PM]

@amitanandaiyer
Contributor Author

amitanandaiyer commented Mar 23, 2022

For CQL, this change is not as critical, although it does help in cases where the number of readers and writers is low.
For cases where the number of readers/writers is high (i.e. comparable to the number of tablets downed), the recovery time is unaffected.

CQL workloads seem to recover in ~45 seconds. (CQL seems to take about 15s more than SQL with both diffs.)

Marking only Connect error as failed:

48 tablets per node.
[Screenshot: Screen Shot 2022-03-23 at 2 09 16 PM]

24 tablets per node, varying readers/writers (16/16 and 1/1). The difference is negligible for 16/16 readers/writers; recovery is significantly improved for the 1/1 case.
[Screenshot: Screen Shot 2022-03-23 at 2 51 30 PM]

amitanandaiyer added a commit that referenced this issue Mar 24, 2022
…chable

Summary:
For network errors, the YBClient/MetaCache should not only update the specific tablet but should
also call MarkTSFailed() so that the failure is shared with all other tablets on that server.

This can improve the recovery time, especially for cases with a lot of tablets.

Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`), which can be used to disable this feature.

Test Plan:
Jenkins + repro manually

1) Create a dev-cluster with a lot of tablets
`bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24`

2) Run yb-sample apps with 16 readers and 16 writers
```
java -jar yb-sample-apps.jar \
                        --workload SqlSecondaryIndex  \
                        --nodes $HOSTS \
                        --verbose true  --drop_table_name postgresqlkeyvalue --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
                        --num_reads 15000000 --num_writes 75000000
```
3) Cause a network partition using `iptables drop` to isolate 127.0.0.3 and compare recovery times with and without the feature.

Without this change, recovery takes over 5 minutes.
With the change, the operations recover in about 30-40 seconds.

Reviewers: timur, bogdan, sergei

Reviewed By: sergei

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16073
amitanandaiyer added a commit that referenced this issue Mar 25, 2022
… server is unreachable

Summary:
For network errors, the YBClient/MetaCache should not only update the specific tablet but should
also call MarkTSFailed() so that the failure is shared with all other tablets on that server.

This can improve the recovery time, especially for cases with a lot of tablets.

Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`), which can be used to disable this feature.

Original Revision/Commit: https://phabricator.dev.yugabyte.com/D16073 1b8f992

Test Plan:
Jenkins: rebase 2.12

Jenkins + repro manually

1) Create a dev-cluster with a lot of tablets
`bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24`

2) Run yb-sample apps with 16 readers and 16 writers
```
java -jar yb-sample-apps.jar \
                        --workload SqlSecondaryIndex  \
                        --nodes $HOSTS \
                        --verbose true  --drop_table_name postgresqlkeyvalue --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
                        --num_reads 15000000 --num_writes 75000000
```
3) Cause a network partition using `iptables drop` to isolate 127.0.0.3 and compare recovery times with and without the feature.

Without this change, recovery takes over 5 minutes.
With the change, the operations recover in about 30-40 seconds.

Reviewers: timur, sergei, bogdan

Reviewed By: bogdan

Subscribers: ybase, kannan

Differential Revision: https://phabricator.dev.yugabyte.com/D16183
amitanandaiyer added a commit that referenced this issue Mar 25, 2022
…server is unreachable

Summary:
For network errors, the YBClient/MetaCache should not only update the specific tablet but should
also call MarkTSFailed() so that the failure is shared with all other tablets on that server.

This can improve the recovery time, especially for cases with a lot of tablets.

Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`), which can be used to disable this feature.

Original Revision/Commit: https://phabricator.dev.yugabyte.com/D16073 1b8f992

Test Plan:
Jenkins: rebase 2.8

Jenkins + repro manually

1) Create a dev-cluster with a lot of tablets
`bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24`

2) Run yb-sample apps with 16 readers and 16 writers
```
java -jar yb-sample-apps.jar \
                        --workload SqlSecondaryIndex  \
                        --nodes $HOSTS \
                        --verbose true  --drop_table_name postgresqlkeyvalue --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
                        --num_reads 15000000 --num_writes 75000000
```
3) Cause a network partition using `iptables drop` to isolate 127.0.0.3 and compare recovery times with and without the feature.

Without this change, recovery takes over 5 minutes.
With the change, the operations recover in about 30-40 seconds.

Reviewers: timur, sergei, bogdan

Reviewed By: bogdan

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16185
amitanandaiyer added a commit that referenced this issue Apr 1, 2022
…server is unreachable

Summary:
For network errors, the YBClient/MetaCache should not only update the specific tablet but should
also call MarkTSFailed() so that the failure is shared with all other tablets on that server.

This can improve the recovery time, especially for cases with a lot of tablets.

Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`), which can be used to disable this feature.

Original Revision/Commit: https://phabricator.dev.yugabyte.com/D16073 1b8f992

Test Plan:
Jenkins: rebase 2.6

Jenkins + repro manually

1) Create a dev-cluster with a lot of tablets
`bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24`

2) Run yb-sample apps with 16 readers and 16 writers
```
java -jar yb-sample-apps.jar \
                        --workload SqlSecondaryIndex  \
                        --nodes $HOSTS \
                        --verbose true  --drop_table_name postgresqlkeyvalue --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
                        --num_reads 15000000 --num_writes 75000000
```
3) Cause a network partition using `iptables drop` to isolate 127.0.0.3 and compare recovery times with and without the feature.

Without this change, recovery takes over 5 minutes.
With the change, the operations recover in about 30-40 seconds.

Reviewers: timur, sergei, bogdan

Reviewed By: bogdan

Subscribers: ybase, kannan

Differential Revision: https://phabricator.dev.yugabyte.com/D16186
@amitanandaiyer
Contributor Author

Note that after this fix, the recovery window for YSQL should be down to ~30s for releases 2.13.1 and beyond.

For older releases (2.12 and older), the recovery window will be O(number of nodes downed), roughly 15s * (num_nodes_downed + 1).
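For example, with two nodes downed on a 2.12 cluster, that formula works out to roughly 15s * (2 + 1) ≈ 45s before all connections have moved off the downed nodes.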

This is because the yb-client is shared across all the connections in 2.13.1 and beyond, so all the downed nodes can be discovered simultaneously. For older releases, each client connection needs to discover the downed nodes one by one.

The reduction here is from O(num_tablets) to O(num_nodes_downed).

With #9936 (https://phabricator.dev.yugabyte.com/rYBDBc5f512583fc7ecbc054584d59b19c584a94bd0ee), the recovery goes down to O(1) * connection_timeout.
