-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add current cluster state version to zen pings and use them in master election #20384
Conversation
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in elastic#20235 This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)). On top of that we have some new, evil and nasty tests.
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING) | ||
|
||
During a networking partition, cluster states updates (like mapping changes or shard assignments) | ||
are committed if a majority of the masters node received the update correctly. This means that the current master has access |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
masters -> master-eligible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @clintongormley said, masters
-> master-eligible
but then node -> nodes
so it reads majority of the master-eligible nodes...
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check
Would you mind merging master in after you integrate #20348? |
56d3f6e
to
0ef5ec6
Compare
} | ||
|
||
/** | ||
* compares two candidate to indicate who's the a better master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: candidate
-> candidates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: who's the a better
-> which is the better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed. You know that to me the nodes are human...
private volatile int minimumMasterNodes; | ||
|
||
public static class Candidate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this class have Javadocs please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the class should be called something like MasterCandidate
or CandidateMaster
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a inner class of ElectMaster, but sure. Can do MasterCandidate
return sortedCandidates.get(0); | ||
} | ||
|
||
/** selects the best active master to join, where multiple are discovered (oh noes) */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the "oh noes"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
party pooper. removed.
out.writeLong(id); | ||
} | ||
|
||
@Override | ||
public String toString() { | ||
return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], hasJoinedOnce [" + hasJoinedOnce + "], cluster_name[" + clusterName.value() + "]}"; | ||
return "ping_response{node [" + node + "], id[" + id + "], master [" + master + "], cs version [" + clusterStateVersion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: cs version
-> cluster_state_version
, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sooo long. replaced.
@@ -64,6 +64,22 @@ framework. As the Jepsen tests evolve, we will continue porting new scenarios th | |||
all new scenarios and will report issues that we find on this page and in our GitHub repository. | |||
|
|||
[float] | |||
=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING) | |||
|
|||
During a networking partition, cluster states updates (like mapping changes or shard assignments) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
states
-> state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
public void testIsolateAll() { | ||
Set<String> nodes = generateRandomStringSet(1, 10); | ||
NetworkDisruption.DisruptedLinks topology = new NetworkDisruption.IsolateAllNodes(nodes); | ||
for (int i = 0; i < 10; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not test all possible pairs, it's only 10 choose 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah , balancing act between speed and chances the test fails if you get something wrong. I also just hate the resulting double loop for a "check all combinations"
ElectMasterService service = electMasterService(); | ||
int min_master_nodes = randomIntBetween(0, nodes.size()); | ||
int min_master_nodes = randomIntBetween(0, candidates.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are here, can we give this a proper Java variable name (minMasterNodes
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
} else if (min_master_nodes > 0 && master_nodes < min_master_nodes) { | ||
assertNull(master); | ||
} else { | ||
Candidate master = service.electMaster(candidates); | ||
assertNotNull(master); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indentation is off here and the rest of the way through this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
assertTrue(master.getId().compareTo(node.getId()) <= 0); | ||
for (Candidate candidate : candidates) { | ||
if (candidate.getNode().equals(master.getNode())) { | ||
// meh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a more descriptive comment? 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a longer but just as meaningless text :)
assertThat("candidate " + candidate + " has a lower or equal id than master " + master, candidate.getNode().getId(), | ||
greaterThan(master.getNode().getId())); | ||
} else { | ||
assertThat("candidate " + master + " has a higher id than candidate " + candidate, master.getClusterStateVersion(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should say higher cluster state version
instead of higher id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops
@@ -1189,6 +1192,61 @@ public void testIndicesDeleted() throws Exception { | |||
assertFalse(client().admin().indices().prepareExists(idxName).get().isExists()); | |||
} | |||
|
|||
public void testElectMasterWithLatestVersion() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a beautiful test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bleskes, I left some feedback. In general, it looks sound.
final AtomicBoolean counted = new AtomicBoolean(); | ||
try { | ||
zenPing.ping(pings -> { | ||
response.addPings(pings); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the add pings only be done inside the guard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tja - doesn't really matter. I figured every extra bit of information, if we manage to get it in, counts
thx @jasontedor , @s1monw and @clintongormley . I addressed all the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… election (#20384) During a networking partition, cluster states updates (like mapping changes or shard assignments) are committed if a majority of the masters node received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected which did not yet catch up. If that happens, cluster state updates can be lost. This commit fixed 95% of this rare problem by adding the current cluster state version to `PingResponse` and use them when deciding which master to join (and thus casting the node's vote). Note: this doesn't fully mitigate the problem as a cluster state update which is issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two phased commit of cluster state updates) from reaching any single node in the majority side *and* the partition does allow for the master to acknowledge the change. We are working on a more comprehensive fix but that requires considerate work and is targeted at 6.0.
… election (#20384) During a networking partition, cluster states updates (like mapping changes or shard assignments) are committed if a majority of the masters node received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected which did not yet catch up. If that happens, cluster state updates can be lost. This commit fixed 95% of this rare problem by adding the current cluster state version to `PingResponse` and use them when deciding which master to join (and thus casting the node's vote). Note: this doesn't fully mitigate the problem as a cluster state update which is issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two phased commit of cluster state updates) from reaching any single node in the majority side *and* the partition does allow for the master to acknowledge the change. We are working on a more comprehensive fix but that requires considerate work and is targeted at 6.0.
this one plus logical time plus this PR: #13062 is really makeing a raft. |
@makeyang there are a lot of similarities between ZenDiscovery and Raft if you look at it the right way (although ZenDiscovery was built before Raft was there). I'm not sure 100% what you mean, but a pacificA like log replication model requires an external concensus oracle. For pacifica it indeed doesn't matter which algorithm you use,. |
@bleskes that's all I mean: as long as leader election is solid concensus, then your make log replication solid. |
@bleskes just another question which is more serious than last one: based on the current condition, ES won't pass jepsen-like test, right? |
@makeyang sadly there is no "just implement it" in distributed systems. It's a long process consisting of small steps. This and and other PRs you follow are part of that journey. |
That's broad question. ES 5.0 is light years ahead of 1.x but there are still known issues. You can read about them in our documentation here. |
@bleskes agree what u said. but please make these critical small setps, which is impact data safety, faster and faster before ES ruin reputation like MongoDB does before due to MongoDB's careless to data loss. |
@makeyang we're making as a fast as we can responsibly make them.
I think this very conversation shows otherwise. If you are speaking from experience, please do share your problem so we can see if it has already been solved or we need to fix something and add it to the working queue. Abstract claims are dangerous and hard to address. |
@bleskes what I mentioned is mongdb, just google "mongodb loses data" and u'll see that. I'm not saying ES. |
@bleskes I have a question related pacificA: |
During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one and the old master is put in the minority side, it may be that a new master is elected which did not yet catch up. If that happens, cluster state updates can be lost.
This commit fixed 95% of this rare problem by adding the current cluster state version to
PingResponse
and use them when deciding which master to join (and thus casting the node's vote).Note: this doesn't fully mitigate the problem as a cluster state update which is issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two phased commit of cluster state updates) from reaching any single node in the majority side and the partition does allow for the master to acknowledge the change. We are working on a more comprehensive fix but that requires considerate work and is targeted at 6.0.
PS this PR contains and depends on #20348 , which was required for long testing. That part doesn't need to be reviewed.