No way to handle the only semi-sync replica failing #1137
Comments
If I understand correctly, in the post-failover hook you try to set up a new replica as semi-sync, and, since that could fail, you'd like to have some hook so you can try another replica? I'm thinking of a different solution: post failover, you can check with orchestrator which replicas are OK, and then choose one to enable semi-sync. Does that make sense? Regardless, am I correct to understand that you never ever want to have a write on the master unless it is backed by a replica? How about, for the duration of the failover, you set …
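A minimal sketch of that suggestion, assuming orchestrator's HTTP API is reachable: query the cluster, pick a healthy replica (preferring the remote DC), and enable semi-sync on it. The API path and JSON field names (`IsLastCheckValid`, `DataCenter`, `Key`, `MasterKey`) are assumptions from memory and may differ between orchestrator versions; hosts and credentials are placeholders.

```python
# Hedged sketch: ask orchestrator which replicas are healthy, prefer one in
# the remote DC, and enable semi-sync on it. Field names are assumptions.
import requests
import pymysql

ORC_API = "http://orchestrator.example.com:3000/api"  # placeholder address
CLUSTER = "mycluster"                                 # placeholder cluster alias
REMOTE_DC = "dc2"                                     # placeholder remote DC name

def pick_semi_sync_candidate():
    instances = requests.get(f"{ORC_API}/cluster/{CLUSTER}", timeout=5).json()
    # Keep replicas (instances that have a master) that orchestrator could
    # reach on its last check.
    healthy = [i for i in instances
               if i.get("IsLastCheckValid") and i.get("MasterKey", {}).get("Hostname")]
    # Prefer a replica in the remote DC; fall back to any healthy replica.
    healthy.sort(key=lambda i: i.get("DataCenter") != REMOTE_DC)
    return healthy[0]["Key"] if healthy else None

def enable_semi_sync(host, port):
    conn = pymysql.connect(host=host, port=port, user="admin", password="...")
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = 1")
        # The flag only takes effect once the replication IO thread restarts.
        cur.execute("STOP SLAVE IO_THREAD")
        cur.execute("START SLAVE IO_THREAD")
    conn.close()

key = pick_semi_sync_candidate()
if key:
    enable_semi_sync(key["Hostname"], key["Port"])
```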
Should the PostFailover hook run not only on master failover, but also on failure of replicas? Or perhaps some other hook that is executed in that case? I'm already using the orchestrator API to make sure that semi-sync is configured on master failover, and it would be pretty straightforward to do the same when a replica failure is detected, but I'm not sure how to detect that failure in the first place.
There is no failover for replicas, and no hooks will run on such a failure. This is beyond what orchestrator handles. Try …
Let me rephrase that - I don't see a way to handle a failure of the semi-sync replica with orchestrator. When the replica failure occurs, a master with semi-sync replication enabled will still be waiting on commits for the semi-sync replica to ACK them. Without any hooks being executed, I can't make any changes to remediate that.
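For context on the blocking behavior described here: MySQL only blocks indefinitely if `rpl_semi_sync_master_timeout` was raised well above its 10000 ms default; at the default, the master silently falls back to async replication instead. A minimal escape-hatch sketch, assuming direct access to the master, with placeholder credentials:

```python
# Drop the semi-sync requirement so a blocked master resumes accepting
# writes, trading away the durability guarantee until a new semi-sync
# replica is set up. Host and credentials are placeholders.
import pymysql

def release_blocked_master(host):
    conn = pymysql.connect(host=host, user="admin", password="...")
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 0")
    conn.close()
```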
I'm not sure we're on the same page. How does my suggestion not solve the problem?
Hmm.. We probably aren't :). In my original post I've pasted two screenshots: the first one with a single semi-sync replica in another DC. In that scenario, failure of that replica blocks the master. My other screenshot is an attempt to work around that - I turn the semi-sync replica into an intermediate master.
Right. I guess the confusion was on my side, sorry. I'm curious, though: why do you only set up a single semi-sync replica?
I want to use semi-sync replication to achieve two connected goals in case of master failover to another DC: no write should be acknowledged on the master unless it is backed by a replica in the other DC, and failover to that DC should be possible without losing any acknowledged writes.
Enabling semi-sync replication on more than one replica in the other datacenter would solve the problem of a single replica failing, but not of the entire DC going down - that would still block my master from accepting writes. On the other hand, if I put another semi-sync replica in the same DC as the master, that would break the goals I've listed.
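For the record, the multi-replica variant mentioned above could look roughly like this on MySQL 5.7+, where `rpl_semi_sync_master_wait_for_slave_count` makes the master wait for a single ACK from either remote replica. A sketch with placeholder hosts and credentials; as noted, it addresses one replica failing, not the whole remote DC going down:

```python
# Enable semi-sync on two remote-DC replicas; the master needs one ACK.
import pymysql

REMOTE_REPLICAS = ["replica1.dc2.example.com", "replica2.dc2.example.com"]

for host in REMOTE_REPLICAS:
    conn = pymysql.connect(host=host, user="admin", password="...")
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = 1")
        # Restart the IO thread so the semi-sync flag takes effect.
        cur.execute("STOP SLAVE IO_THREAD")
        cur.execute("START SLAVE IO_THREAD")
    conn.close()

conn = pymysql.connect(host="master.dc1.example.com", user="admin", password="...")
with conn.cursor() as cur:
    # Require one ACK (the default), now satisfiable by either remote replica.
    cur.execute("SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1")
    cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
conn.close()
```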
Understood. So the single semi-sync replica is a single point of failure. Let me think about this further. There's no support for this in orchestrator right now.
Some foundational work at #1171 is able to analyze a situation where semi-sync is enabled on …
OK, good news and bad news. The good news is that I believe it should be achievable for orchestrator to detect this scenario. The bad news is that the action to take (orchestrator could be the one to take it, actually), which is to set up a new semi-sync replica, is risky. In a scenario where the original semi-sync replica went down because of a network issue (thus, the server is not dead), we would enable semi-sync on a different replica, and then the original replica suddenly reappears -- and we have two semi-sync replicas. The topology is now susceptible to split brain. I'm not sure how split brain is 100% solvable via a single-semi-sync-replica setup. At least, not without being able to shoot the other node in the head. Thoughts welcome.
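One hedged way to narrow (but not close) that split-brain window without real fencing, sketched below: before enabling semi-sync on the replacement replica, make a best-effort attempt to disable it on the original one, so that if it was merely partitioned it stops ACKing when it comes back. Hosts and credentials are placeholders.

```python
# Best-effort semi-sync handover between replicas. Without STONITH-style
# fencing the race remains if the old replica reappears mid-handover.
import pymysql

def set_semi_sync_replica(host, enabled):
    conn = pymysql.connect(host=host, user="admin", password="...", connect_timeout=3)
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = %s", (1 if enabled else 0,))
        # Restart the IO thread so the change takes effect.
        cur.execute("STOP SLAVE IO_THREAD")
        cur.execute("START SLAVE IO_THREAD")
    conn.close()

def replace_semi_sync_replica(old_host, new_host):
    try:
        # Best effort: the old replica may be dead or unreachable.
        set_semi_sync_replica(old_host, False)
    except pymysql.MySQLError:
        pass  # If it reappears later it can still ACK -- the race remains.
    set_semi_sync_replica(new_host, True)
```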
On further thinking, I guess the "bad news" part is reasonably solvable with careful monitoring/orchestration. Possibly … Either way, the challenge I'm seeing right now is exactly when to draw the analysis (and run hooks). It might be reasonable for some stalls to happen from time to time, since that's reality, so I probably don't want to fire a hook on every transient stall. With …
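One possible shape for the "when to fire" question is simple debouncing: poll the master's `Rpl_semi_sync_master_clients` status counter and act only after it has been zero for several consecutive checks, so transient stalls are tolerated. A sketch with placeholder thresholds, credentials, and a stub hook:

```python
# Debounced detection of a lost semi-sync replica, polled from the master.
import time
import pymysql

def semi_sync_clients(conn):
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients'")
        row = cur.fetchone()
        return int(row[1]) if row else 0

def fire_semi_sync_lost_hook():
    # Placeholder: run whatever remediation fits (e.g. the handover sketch above).
    print("semi-sync replica lost; remediate")

def watch_master(host, checks=6, interval=5.0):
    conn = pymysql.connect(host=host, user="admin", password="...")
    misses = 0
    while True:
        # Count consecutive polls with no connected semi-sync replica.
        misses = misses + 1 if semi_sync_clients(conn) == 0 else 0
        if misses >= checks:
            fire_semi_sync_lost_hook()
            misses = 0
        time.sleep(interval)
```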
(updated topology screenshot to match the question)
I'm testing the following topology:
A single master with semi-sync replication enabled, and a single semi-sync replica running in another DC. The idea here is to gracefully fail over to another DC in case the master's DC goes offline.
I have a PostFailover hook that makes sure that, in case of master failover, semi-sync replication is enabled on either a replica in another DC (preferably) or a replica in the same DC as the master, and disabled on all other replicas.
This seems to be working fine; however, I can't figure out how to handle failure of the semi-sync replica - there is no hook executed when a replica is lost, and so the master stops accepting writes as it waits for an ACK from its remaining replicas (none of which have semi-sync replication enabled).
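For anyone reproducing this, the stall is visible in the master's semi-sync status counters: `Rpl_semi_sync_master_clients` drops to 0 while semi-sync stays enabled, and blocked commits accumulate in `Rpl_semi_sync_master_wait_sessions`. A quick check, with placeholder connection details:

```python
# Dump the master's semi-sync status counters to confirm the stall.
import pymysql

conn = pymysql.connect(host="master.example.com", user="admin", password="...")
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_%'")
    for name, value in cur.fetchall():
        print(name, value)
conn.close()
```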
Is this somehow related to Semi-sync enforcement? I've already raised a similar question on the mailing list, but it's not obvious from the documentation and code whether this feature is the correct approach here.
One idea I had was to turn that topology into something like this:
In this topology the semi-sync replica is also an intermediate master, whose failure does trigger hooks and lets me reconfigure another replica for semi-sync replication.