[YSQL] Longer recovery window for YSQL apps after a network partition #11799

Closed
amitanandaiyer opened this issue Mar 18, 2022 · 5 comments
Assignees: amitanandaiyer
Labels: area/docdb (YugabyteDB core features)

Comments

@amitanandaiyer
Contributor

Description

Related to #11306

However, it turns out that YSQL takes much longer to recover than YCQL.

The core issue in #11306 causes an outage of a few seconds when the remote server disappears (i.e. the machine is lost/unreachable and its network stack is inactive).

This is compounded by two factors:

  1. The MetaCache updates the status of a RemoteReplica per tablet. So if we have a lot of tablets, each tablet may need to suffer a failing query before its entry gets updated.
  2. YSQL backends are independent processes and thus have separate metacaches. As a result, each YSQL connection has to fail once for each tablet before updating its cache and moving to the new leader.

We should be able to recover much more quickly by updating the state of all tablets whenever a remote server is unreachable.
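In other words, when a single RPC fails with a connection-level error, the client should invalidate its cached replica information for every tablet hosted on that server at once. A minimal sketch of that idea, using simplified stand-in names rather than YugabyteDB's actual client classes:

```cpp
// Illustrative only: simplified stand-ins for a client-side metacache.
// On a connect/host-unreachable failure against one tablet server, mark the
// server failed for *all* cached tablets so other lookups skip it immediately
// instead of each timing out against the same dead server.
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

struct RemoteTabletServer {
  std::string uuid;
};

class RemoteTablet {
 public:
  // Record that this replica is unusable; the next lookup on this tablet
  // refreshes leader information from the master instead of retrying it.
  void MarkReplicaFailed(const RemoteTabletServer& ts) {
    failed_replica_uuids_.push_back(ts.uuid);
  }

 private:
  std::vector<std::string> failed_replica_uuids_;
};

class MetaCache {
 public:
  // Called once per network-level failure against 'ts'. Every cached tablet
  // learns about the failure, not just the tablet whose RPC failed.
  void MarkTSFailed(const RemoteTabletServer& ts) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto& entry : tablets_by_id_) {
      entry.second->MarkReplicaFailed(ts);
    }
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<RemoteTablet>> tablets_by_id_;
};
```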

@amitanandaiyer amitanandaiyer added the area/docdb YugabyteDB core features label Mar 18, 2022
@amitanandaiyer amitanandaiyer self-assigned this Mar 18, 2022
@bmatican
Contributor

Linking #11306 here as well. IMO we can close both once the fix here lands.

@amitanandaiyer
Contributor Author

amitanandaiyer commented Mar 23, 2022

With the proposed changes, preliminary graphs show that the recovery happens in 15-30 seconds and does not increase with the number of tablets.

[Image: SQL-graphs]

@amitanandaiyer
Contributor Author

amitanandaiyer commented Mar 23, 2022

Instead of marking the node as unreachable on any NetworkError (which may be prone to false positives if we hit some other kind of network error), if we only mark it as unreachable on a Connect error/HostUnreachable (fewer false positives), we see that SQL workloads recover in ~30s.
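The narrower gating described above can be sketched as follows; this is illustrative only, with stand-in error categories rather than YugabyteDB's actual Status/RPC error API:

```cpp
// Only connect errors and host-unreachable errors count as evidence that the
// whole tablet server is down; other network errors (e.g. a reset or timeout
// on one connection) continue to affect only the single tablet involved.
enum class RpcErrorKind {
  kConnectError,       // could not establish a connection at all
  kHostUnreachable,    // routing failure, host is gone
  kTimedOut,           // might just be a slow or overloaded server
  kOtherNetworkError,  // e.g. a reset on one connection
};

bool ShouldFailWholeTabletServer(RpcErrorKind kind) {
  return kind == RpcErrorKind::kConnectError ||
         kind == RpcErrorKind::kHostUnreachable;
}
```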

[Screenshot: Screen Shot 2022-03-23 at 2 02 47 PM]

[Screenshot: Screen Shot 2022-03-23 at 2 07 24 PM]

@amitanandaiyer
Contributor Author

amitanandaiyer commented Mar 23, 2022

For CQL, this change is not as critical, although it does help in cases where the number of readers and writers is low.
For cases where the number of readers/writers is high (i.e. comparable to the number of tablets downed), the recovery time is unaffected.

CQL workloads seem to recover in ~45 seconds. (CQL seems to take about 15s more than SQL with both diffs.)

Marking only Connect error as failed:

48 tablets per node.
[Screenshot: Screen Shot 2022-03-23 at 2 09 16 PM]

24 tablets per node, varying readers/writers (16/16 and 1/1). The difference is negligible for 16/16 readers/writers; recovery is significantly improved for the 1/1 case.
[Screenshot: Screen Shot 2022-03-23 at 2 51 30 PM]

amitanandaiyer added a commit that referenced this issue Mar 24, 2022
…chable

Summary:
For network errors, the YBClient/MetaCache should not only update the specific tablet but should
also call MarkTSFailed() so that the failure is shared with all other tablets on that server.

This can improve the recovery time, especially for cases with a lot of tablets.

Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`), which can be used to disable this feature.

Test Plan:
Jenkins + repro manually

1) Create a dev-cluster with a lot of tablets
`bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24`

2) Run yb-sample apps with 16 readers and 16 writers
```
java -jar yb-sample-apps.jar \
                        --workload SqlSecondaryIndex  \
                        --nodes $HOSTS \
                        --verbose true  --drop_table_name postgresqlkeyvalue --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
                        --num_reads 15000000 --num_writes 75000000
```
3) Cause a network partition using `iptables drop` to isolate 127.0.0.3 and compare recovery times with and without the feature.

Without this change, recovery takes over 5 minutes.
With the change, the operations recover in about 30-40 seconds.

Reviewers: timur, bogdan, sergei

Reviewed By: sergei

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16073
amitanandaiyer added a commit that referenced this issue Mar 25, 2022
… server is unreachable

Summary:
For network errors, the YBClient/MetaCache should not only update the specific tablet but should
also call MarkTSFailed() so that the failure is shared with all other tablets on that server.

This can improve the recovery time, especially for cases with a lot of tablets.

Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`), which can be used to disable this feature.

Original Revision/Commit: https://phabricator.dev.yugabyte.com/D16073 1b8f992

Test Plan:
Jenkins: rebase 2.12

Jenkins + repro manually

1) Create a dev-cluster with a lot of tablets
`bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24`

2) Run yb-sample apps with 16 readers and 16 writers
```
java -jar yb-sample-apps.jar \
                        --workload SqlSecondaryIndex  \
                        --nodes $HOSTS \
                        --verbose true  --drop_table_name postgresqlkeyvalue --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
                        --num_reads 15000000 --num_writes 75000000
```
3) Cause a network partition using `iptables drop` to isolate 127.0.0.3 and compare recovery times with and without the feature.

Without this change, recovery takes over 5 minutes.
With the change, the operations recover in about 30-40 seconds.

Reviewers: timur, sergei, bogdan

Reviewed By: bogdan

Subscribers: ybase, kannan

Differential Revision: https://phabricator.dev.yugabyte.com/D16183
amitanandaiyer added a commit that referenced this issue Mar 25, 2022
…server is unreachable

Summary:
For network errors, the YBClient/MetaCache should not only update the specific tablet but should
also call MarkTSFailed() so that the failure is shared with all other tablets on that server.

This can improve the recovery time, especially for cases with a lot of tablets.

Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`), which can be used to disable this feature.

Original Revision/Commit: https://phabricator.dev.yugabyte.com/D16073 1b8f992

Test Plan:
Jenkins: rebase 2.8

Jenkins + repro manually

1) Create a dev-cluster with a lot of tablets
`bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24`

2) Run yb-sample apps with 16 readers and 16 writers
```
java -jar yb-sample-apps.jar \
                        --workload SqlSecondaryIndex  \
                        --nodes $HOSTS \
                        --verbose true  --drop_table_name postgresqlkeyvalue --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
                        --num_reads 15000000 --num_writes 75000000
```
3) Cause a network partition using `iptables drop` to isolate 127.0.0.3 and compare recovery times with and without the feature.

Without this change, recovery takes over 5 minutes.
With the change, the operations recover in about 30-40 seconds.

Reviewers: timur, sergei, bogdan

Reviewed By: bogdan

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D16185
amitanandaiyer added a commit that referenced this issue Apr 1, 2022
…server is unreachable

Summary:
For network errors, the YBClient/MetaCache should not only update the specific tablet but should
also call MarkTSFailed() so that the failure is shared with all other tablets on that server.

This can improve the recovery time, especially for cases with a lot of tablets.

Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`), which can be used to disable this feature.

Original Revision/Commit: https://phabricator.dev.yugabyte.com/D16073 1b8f992

Test Plan:
Jenkins: rebase 2.6

Jenkins + repro manually

1) Create a dev-cluster with a lot of tablets
`bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24`

2) Run yb-sample apps with 16 readers and 16 writers
```
java -jar yb-sample-apps.jar \
                        --workload SqlSecondaryIndex  \
                        --nodes $HOSTS \
                        --verbose true  --drop_table_name postgresqlkeyvalue --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
                        --num_reads 15000000 --num_writes 75000000
```
3) Cause a network partition using `iptables drop` to isolate 127.0.0.3 and compare recovery times with and without the feature.

Without this change, recovery takes over 5 minutes.
With the change, the operations recover in about 30-40 seconds.

Reviewers: timur, sergei, bogdan

Reviewed By: bogdan

Subscribers: ybase, kannan

Differential Revision: https://phabricator.dev.yugabyte.com/D16186
@amitanandaiyer
Contributor Author

Note that after this fix, the recovery window for YSQL should be down to ~30s for releases 2.13.1 and beyond.

For older releases (2.12 and older), the recovery window will be O(number of nodes downed), roughly 15s * (num_nodes_downed + 1).
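For example, with two nodes downed on a 2.12 cluster, that formula works out to roughly 15s * (2 + 1) ≈ 45s before all connections have moved off the downed nodes.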

This is because the yb-client is shared across all the connections in 2.13.1 and beyond, so all the downed nodes can be discovered simultaneously. For older releases, each client connection needs to discover the downed nodes one by one.

The reduction here is from O(num_tablets) to O(num_nodes_downed).

With #9936 (https://phabricator.dev.yugabyte.com/rYBDBc5f512583fc7ecbc054584d59b19c584a94bd0ee), the recovery goes down to O(1) * connection_timeout.
