Support for FailMasterPromotionOnLagMinutes #1115

shlomi-noach · 2020-04-06T08:31:37Z

Fixes #83

Scenario: an M -> R topology ran for a few hours with replica R broken, without anyone noticing. Then master M fails.

Current behavior: orchestrator promotes R. But by doing so we lose either:

- potentially hours' worth of relay logs obtained by R but not executed (e.g. because of some SQL error), or
- the ability to recover hours' worth of binary logs from the master M.

New config variable FailMasterPromotionOnLagMinutes tells orchestrator to fail a promotion if, at the time a candidate replica is chosen, its replication lag is found to be at or above the threshold (lag >= FailMasterPromotionOnLagMinutes).

cc @sougou @mcrauwel

mcrauwel · 2020-04-06T08:46:42Z

thanks for the mention @shlomi-noach! I will be testing it out!


shlomi-noach · 2020-05-02T07:57:23Z

Behavior of this change is now validated via the (new) system tests framework:

Support for FailMasterPromotionOnLagMinutes

aa25d73

shlomi-noach mentioned this pull request Apr 6, 2020

Slaves lagging by couple of hours are elected as master by orchestrator #83

Closed

shlomi-noach and others added 21 commits April 7, 2020 11:06

nonzero FailMasterPromotionOnLagMinutes requires ReplicationLagQuery to be set

e76a873

Merge branch 'master' into fail-promotion-lag-seconds

0f86ae7

Merge branch 'master' into fail-promotion-lag-seconds

3c9e733

Merge branch 'master' into fail-promotion-lag-seconds

c3415eb

Merge branch 'master' into fail-promotion-lag-seconds

20127fb

Merge branch 'master' into fail-promotion-lag-seconds

f568850

Merge branch 'master' into fail-promotion-lag-seconds

4a9a85d

Merge branch 'master' into fail-promotion-lag-seconds

2e8038c

clear SlaveLagQuery once ReplicationLagQuery is populated

a6b6635

expect failure

b4dc767

two tests: success and failure

20e7b33

validate one replica has been halfway promoted

ed82427

cleanup

8bbe951

ackk-all-recoveries

c42cea9

reload-configuration on teardown and on deploy-replication

b1ac72d

more sleep

7b7b5f8

retries, instead of hard coded sleep

ab5dbd8

path

ae8d2a2

force discover

76cdf3c

more debug info

6d767b2

more silent

b79ff3c

shlomi-noach merged commit 4f6635e into master May 2, 2020

shlomi-noach deleted the fail-promotion-lag-seconds branch May 2, 2020 08:07
