This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Support for FailMasterPromotionOnLagMinutes #1115

Merged 22 commits on May 2, 2020


shlomi-noach
Collaborator

Fixes #83

Scenario: an M -> R topology has been running with replica R broken for a few hours without anyone noticing. Then, master M fails.

Current behavior: orchestrator promotes R. But by doing so we lose:

  • potentially hours' worth of relay logs obtained by R but not executed (e.g. because of some SQL error), or
  • the ability to recover hours' worth of binary logs from the master M.

New config variable FailMasterPromotionOnLagMinutes tells orchestrator to fail a promotion if, at the time a candidate replica is chosen, it is determined to be lagging too much (>= FailMasterPromotionOnLagMinutes).
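
To illustrate the check described above, here is a minimal sketch in Go. It is not the code from this PR: the `Config` and `Candidate` types and the `checkPromotionLag` function are hypothetical names made up for the example. It also assumes that a value of 0 disables the check and that the candidate's lag is already known in seconds.

```go
package main

import "fmt"

// Config is a hypothetical, trimmed-down stand-in for orchestrator's configuration.
type Config struct {
	// Assumption for this sketch: 0 means the check is disabled.
	FailMasterPromotionOnLagMinutes uint
}

// Candidate is a hypothetical view of the replica chosen for promotion.
type Candidate struct {
	Host       string
	LagSeconds uint // replication lag measured when the candidate is chosen
}

// checkPromotionLag returns an error when the candidate lags by at least
// FailMasterPromotionOnLagMinutes, which signals that promotion should fail.
func checkPromotionLag(cfg Config, c Candidate) error {
	if cfg.FailMasterPromotionOnLagMinutes == 0 {
		return nil // check disabled
	}
	thresholdSeconds := cfg.FailMasterPromotionOnLagMinutes * 60
	if c.LagSeconds >= thresholdSeconds {
		return fmt.Errorf("refusing to promote %s: lag of %ds >= %d minutes",
			c.Host, c.LagSeconds, cfg.FailMasterPromotionOnLagMinutes)
	}
	return nil
}

func main() {
	cfg := Config{FailMasterPromotionOnLagMinutes: 10}
	candidates := []Candidate{
		{Host: "replica-fresh", LagSeconds: 5},
		{Host: "replica-broken", LagSeconds: 3 * 60 * 60}, // broken for hours
	}
	for _, c := range candidates {
		if err := checkPromotionLag(cfg, c); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println("ok to promote", c.Host)
	}
}
```

The comparison is inclusive (`>=`), matching the description above: a candidate lagging by exactly the configured number of minutes is rejected.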

cc @sougou @mcrauwel

@mcrauwel
Contributor

mcrauwel commented Apr 6, 2020

thanks for the mention @shlomi-noach! I will be testing it out!

@shlomi-noach
Collaborator Author

shlomi-noach merged commit 4f6635e into master on May 2, 2020
shlomi-noach deleted the fail-promotion-lag-seconds branch on May 2, 2020 08:07
Development

Successfully merging this pull request may close these issues.

Slaves lagging by couple of hours are elected as master by orchestrator