[YSQL] Longer recovery window for YSQL apps after a network partition #11799
Comments
Linking #11306 in here as well. IMO we can close both once the fix here lands.
…chable

Summary: For network errors, YBClient/MetaCache should not only update the specific tablet but should also MarkTSFailed() to help share the knowledge with other tablets. This can improve the recovery time, especially for cases with a lot of tablets. Also introduces a new GFlag `update_all_tablets_upon_network_failure` (defaults to `true`) which can be used to disable this feature.

Test Plan: Jenkins + manual repro

1) Create a dev cluster with a lot of tablets:

```
bin/yb-ctl restart --tserver_flags 'fail_whole_ts_upon_network_failure=true,txn_slow_op_threshold_ms=3000,enable_tracing=true,tracing_level=2,rpc_connection_timeout_ms=15000' --replication_factor 3 --ysql_num_shards_per_tserver 24
```

2) Run yb-sample-apps with 16 readers and 16 writers:

```
java -jar yb-sample-apps.jar \
  --workload SqlSecondaryIndex \
  --nodes $HOSTS \
  --verbose true --drop_table_name postgresqlkeyvalue \
  --num_threads_read $NUM_READERS --num_threads_write $NUM_WRITERS \
  --num_reads 15000000 --num_writes 75000000
```

3) Cause a network partition using `iptables` drop rules to isolate 127.0.0.3 (see the sketch below), and compare recovery times with and without the feature. Without this change, recovery takes over 5 minutes; with the change, operations recover in about 30-40 seconds.

Reviewers: timur, bogdan, sergei
Reviewed By: sergei
Subscribers: kannan, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D16073
The same change was backported (original revision: https://phabricator.dev.yugabyte.com/D16073, commit 1b8f992), with an identical summary and test plan:

- rebase 2.12: https://phabricator.dev.yugabyte.com/D16183 (Reviewers: timur, sergei, bogdan; Reviewed By: bogdan)
- rebase 2.8: https://phabricator.dev.yugabyte.com/D16185 (Reviewers: timur, sergei, bogdan; Reviewed By: bogdan)
- rebase 2.6: https://phabricator.dev.yugabyte.com/D16186 (Reviewers: timur, sergei, bogdan; Reviewed By: bogdan)
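For step 3 of the test plan, a minimal sketch of the `iptables` drop rules (the exact chains and rule placement are assumptions, not part of the original test plan):

```
# Isolate the tserver at 127.0.0.3: silently drop all traffic to and from it.
sudo iptables -A INPUT  -s 127.0.0.3 -j DROP
sudo iptables -A OUTPUT -d 127.0.0.3 -j DROP

# ... run the workload and measure how long operations take to recover ...

# Heal the partition by removing the same rules.
sudo iptables -D INPUT  -s 127.0.0.3 -j DROP
sudo iptables -D OUTPUT -d 127.0.0.3 -j DROP
```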
Note that after this fix, the recovery window for YSQL should be down to ~30s for releases 2.13.1 and beyond. For older releases (2.12 and older), the recovery window will be O(number of nodes downed), roughly 15s * (num_nodes_downed + 1). This is because the yb-client is shared across all the connections in 2.13.1 and beyond, so all the downed nodes can be discovered simultaneously, whereas in older releases each client connection needs to discover the downed nodes one by one. The reduction here is going from O(num_tablets) to O(num_nodes_downed). With #9936 (https://phabricator.dev.yugabyte.com/rYBDBc5f512583fc7ecbc054584d59b19c584a94bd0ee) the recovery goes down to O(1) * connection_timeout.
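As a worked instance of the estimate above (the 15 s factor presumably being the `rpc_connection_timeout_ms=15000` used in the test plan):

$$T_{\text{recovery}} \approx 15\,\mathrm{s} \times (n_{\text{down}} + 1)$$

So with 3 nodes downed, an app on 2.12 or older would take roughly 60 s to recover, versus ~30 s on 2.13.1+, where the shared yb-client discovers all downed nodes in parallel.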
Description
Related to #11306
However, it turns out that YSQL takes much longer to recover than YCQL.
The core issue in #11306 causes an outage of a few seconds when the remote server disappears (i.e. the machine is lost/unreachable and the network stack is inactive).
This is compounded by two factors:
We should be able to do a quicker recovery by updating the state for all tablets whenever a remote server is unreachable.
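To compare recovery behavior with and without this fix, the feature can be toggled via the new GFlag. A sketch based on the yb-ctl invocation in the test plan; passing the flag through `--tserver_flags` is an assumption:

```
# Feature enabled (the default): all tablets on a failed tserver are marked
# at once, so recovery should take roughly 30-40 s.
bin/yb-ctl restart --tserver_flags 'update_all_tablets_upon_network_failure=true' \
  --replication_factor 3 --ysql_num_shards_per_tserver 24

# Feature disabled: each tablet discovers the failure independently,
# so recovery can take over 5 minutes with many tablets.
bin/yb-ctl restart --tserver_flags 'update_all_tablets_upon_network_failure=false' \
  --replication_factor 3 --ysql_num_shards_per_tserver 24
```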