
Introduce promoting index shard state #28004

Closed
wants to merge 6 commits

Conversation

dnhatn
Member

@dnhatn dnhatn commented Dec 27, 2017

This commit adds a new index shard state - promoting. This state
indicates that a replica is promoting to primary and primary-replica
resync is in progress.

Relates #24841

@dnhatn dnhatn added :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement review v6.2.0 v7.0.0 labels Dec 27, 2017
Contributor

@ywelsch ywelsch left a comment

There are more places where shard states are used. We have to be careful in getting all of those. A grep for IndexShardState.STARTED yields other places where we need to account for the newly introduced shard state, for example in IndexMemoryController, IndicesClusterStateService, IndicesStore, MockFSIndexStore.
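For orientation, the change under review could be sketched as an extra constant on the shard state enum (a hypothetical sketch: the real IndexShardState also carries byte ids, and the helper method below is invented to illustrate exactly the open question — which STARTED call sites should also accept PROMOTING):

```java
// Hypothetical sketch of IndexShardState with the proposed PROMOTING
// constant; the surrounding constants mirror the existing enum.
enum IndexShardState {
    CREATED,
    RECOVERING,
    POST_RECOVERY,
    STARTED,
    PROMOTING, // replica is being promoted to primary; resync in progress
    RELOCATED,
    CLOSED;

    // Every call site that today checks "state == STARTED" (e.g. in
    // IndicesClusterStateService, IndicesStore) must now decide whether
    // PROMOTING should be treated the same way.
    boolean writeableAsPrimary() {
        return this == STARTED || this == PROMOTING;
    }
}
```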

@@ -192,7 +192,7 @@
     private final GlobalCheckpointTracker globalCheckpointTracker;

     protected volatile ShardRouting shardRouting;
-    protected volatile IndexShardState state;
+    protected final AtomicReference<IndexShardState> state;
Contributor

why change this to an AtomicReference? Just because of stylistic reasons or is there more to it? Looking through the PR, I could not find a reason for this change. Every write access to it is guarded by a mutex, and read access is ok with volatile. Let's keep it a volatile variable.
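The pattern the reviewer is describing — all writes serialized by a mutex, lock-free reads through a volatile field — can be sketched like this (a simplified illustration with invented names, not the actual IndexShard code):

```java
// Simplified illustration of "writes under a mutex, reads via volatile":
// the mutex serializes state transitions, while the volatile keyword
// guarantees that lock-free readers observe the latest written value.
class ShardStateHolder {
    private final Object mutex = new Object();
    private volatile String state = "STARTED";

    String state() {              // lock-free read, safe thanks to volatile
        return state;
    }

    String changeState(String newState) {
        synchronized (mutex) {    // every writer takes the mutex
            String previous = state;
            state = newState;
            return previous;
        }
    }
}
```

Under this scheme an AtomicReference buys nothing extra: compare-and-set is unnecessary when every writer already holds the mutex.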

Member Author

Yes, I will revert this.

@@ -609,6 +607,11 @@ public void relocated(
     }

     private void verifyRelocatingState() {
+        final IndexShardState state = state();
+        if (state == IndexShardState.PROMOTING) {
Contributor

this is already covered by the check if (state != IndexShardState.STARTED) { below?

@dnhatn
Member Author

dnhatn commented Dec 28, 2017

Thanks @ywelsch, I will look at all other places that use IndexShardState.STARTED.

@dnhatn
Member Author

dnhatn commented Dec 28, 2017

@ywelsch, I've addressed your comments. Would you please take a look? Thank you.

    boolean resyncStarted = primaryReplicaResyncInProgress.compareAndSet(false, true);
    if (resyncStarted == false) {
        throw new IllegalStateException("cannot start resync while it's already in progress");
    }
    final IndexShardState prevState = changeState(IndexShardState.PROMOTING, "Promoting to primary");
Contributor

instead of changing the state first and then checking whether the previous state was the right one, let's only change the state if the current state matches (note that we're under the mutex here already, so it's safe to do this).
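The suggested check-then-change pattern might look like this (a hypothetical sketch; the state field and its values stand in for the real IndexShard machinery, and the safety argument is the one from the comment — the caller already holds the mutex):

```java
// Hypothetical sketch of "check first, then change": verify the current
// state before transitioning, instead of transitioning first and
// validating the previous state afterwards. Only safe because the caller
// already holds the shard's mutex.
class PromotionSketch {
    private String state = "STARTED"; // guarded by the caller's mutex

    void moveToPromoting() {
        if (!state.equals("STARTED")) {       // check first...
            throw new IllegalStateException(
                "cannot promote a shard in state [" + state + "]");
        }
        state = "PROMOTING";                  // ...then change
    }

    String state() {
        return state;
    }
}
```

The advantage is that an illegal transition fails without ever leaving the shard in the wrong state.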

    boolean resyncCompleted = primaryReplicaResyncInProgress.compareAndSet(true, false);
    assert resyncCompleted : "primary-replica resync finished but was not started";
    synchronized (mutex) {
        final IndexShardState prevState = changeState(IndexShardState.STARTED, "Resync is completed");
Contributor

same comment as above, let's check state first and then change it.

    }

    @Override
    public void onFailure(Exception e) {
        boolean resyncCompleted = primaryReplicaResyncInProgress.compareAndSet(true, false);
Contributor

as we have no corresponding state transition here now, does it mean that the shard can be stuck forever in the PROMOTING state, allowing no relocations?

@dnhatn
Member Author

dnhatn commented Dec 29, 2017

@ywelsch Could you please give it another go? Thank you.

@ywelsch
Contributor

ywelsch commented Jan 2, 2018

I've talked to @jasontedor and @dnhatn and suggested taking another approach. I don't like adding new shard states that somewhat replicate information which is already available in the GlobalCheckpointTracker (to be renamed ReplicationTracker). We could instead get rid of the RELOCATED state, whose purpose is to say that the primary is no longer in charge of assigning sequence numbers, something already covered by GlobalCheckpointTracker.primaryMode, which we can use instead.

To ensure that we only complete primary relocation after resync is done, we can do the following. When calling GlobalCheckpointTracker.startRelocationHandoff() we could check that the local checkpoints for all active shards match the local checkpoint of the primary shard. This would be a sufficient condition to relocate the shard (i.e. all in-sync copies have sufficiently caught up with the primary, so that the new primary won't need to trigger a resync of its own). Note that this requires the resync to be less lenient than it is today, namely to fail replicas if it does not successfully complete on those replicas, which will be addressed by a separate upcoming PR.
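The precondition proposed for startRelocationHandoff() could be sketched as follows (hypothetical names and a plain Map stand in for the real tracker's bookkeeping; this is not the GlobalCheckpointTracker API):

```java
import java.util.Map;

// Hypothetical sketch of the proposed handoff precondition: relocation may
// only start once every in-sync copy's local checkpoint equals the
// primary's, so the new primary would have no resync work left to do.
class RelocationHandoffSketch {
    static boolean canStartHandoff(long primaryLocalCheckpoint,
                                   Map<String, Long> inSyncLocalCheckpoints) {
        return inSyncLocalCheckpoints.values().stream()
                .allMatch(cp -> cp == primaryLocalCheckpoint);
    }
}
```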

@dnhatn dnhatn closed this Jan 21, 2018
@dnhatn dnhatn deleted the promoting_state branch January 21, 2018 15:08
@dnhatn
Member Author

dnhatn commented Jan 21, 2018

Thanks @ywelsch for your suggestion. I am closing this.

@bleskes
Contributor

bleskes commented Mar 22, 2018

Some clarification for future readers - the reason why relying on the GlobalCheckpointTracker's knowledge of the local checkpoints to detect that the resync has finished works is the following:

  1. Once a replica is promoted, it doesn't know the local checkpoints of the other replicas.
  2. Fetching those checkpoints will cause the local checkpoint of the replicas to go up.
  3. When startRelocationHandoff is called, all operation permits on the primary have been acquired, meaning that all ongoing operations have completed. That in turn means that on the (new) primary, the local checkpoint is equal to the max seq#.
  4. If the resync has previously completed, we expect the local checkpoint of the replicas to be equal to the local checkpoint of the primary, which is equal to the max seq#.
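Tying the points together in a small sketch (illustrative values and invented names only, mimicking the checkpoint arithmetic rather than any Elasticsearch API): with all operation permits held, the primary's local checkpoint equals its max seq#, so replica checkpoints matching the primary's imply the resync has caught everyone up.

```java
// Illustrative sketch of the handoff invariant: the primary's local
// checkpoint must equal its max seq# (point 3), and every replica's
// local checkpoint must have caught up to that value (point 4).
class HandoffInvariantSketch {
    static boolean resyncLooksComplete(long primaryLocalCheckpoint,
                                       long primaryMaxSeqNo,
                                       long[] replicaLocalCheckpoints) {
        if (primaryLocalCheckpoint != primaryMaxSeqNo) {
            return false; // in-flight operations are still draining
        }
        for (long cp : replicaLocalCheckpoints) {
            if (cp != primaryLocalCheckpoint) {
                return false; // some replica has not caught up yet
            }
        }
        return true;
    }
}
```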

@ywelsch
Contributor

ywelsch commented Mar 26, 2018

To ensure that we only complete primary relocation after resync is done, we can do the following. When calling GlobalCheckpointTracker.startRelocationHandoff() we could check that the local checkpoints for all active shards match the local checkpoint of the primary shard. This would be a sufficient condition to relocate the shard ...

While the presented logic is correct for what the primary-replica resync is doing today, it's incorrect if the future primary-replica sync has the additional job of trimming / rolling back portions of the translog.
Assume that primary P1 fails, but has two in-flight operations to replica R2 and R3. Assume replica R2 receives none of the two ops, and R3 receives only the operation with the higher sequence number, creating a gap on R3. If R2 is now promoted to primary, its local checkpoint will match the local checkpoint on R3 under the new term, but the max sequence number on R3 will not match the max sequence number on R2. If R2 then relocates to a different place before the primary-replica resync gets to send a trim command to R3, that trim command might never make it to R3.
Let's wait on making a decision for this until we have the primary-replica resync with rollback implemented.
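The P1/R2/R3 scenario can be made concrete with illustrative numbers (a toy sketch of the checkpoint arithmetic only, not Elasticsearch internals): the local checkpoint is the highest seq# up to which all operations have been processed, so a gap caps it even when a higher op has already arrived.

```java
// Toy model of a shard copy's sequence-number bookkeeping:
// processed[seq] == true means the op with that seq# arrived.
class ResyncGapSketch {
    // Local checkpoint = highest seq# with no gaps below it.
    static long localCheckpoint(boolean[] processed) {
        long checkpoint = -1;
        for (int seq = 0; seq < processed.length; seq++) {
            if (!processed[seq]) {
                break; // the first gap caps the checkpoint
            }
            checkpoint = seq;
        }
        return checkpoint;
    }

    // Max seq# = highest seq# seen at all, gaps or not.
    static long maxSeqNo(boolean[] processed) {
        for (int seq = processed.length - 1; seq >= 0; seq--) {
            if (processed[seq]) {
                return seq;
            }
        }
        return -1;
    }
}
```

With ops 0..9 applied everywhere and the two in-flight ops at seq# 10 and 11, R2 (received neither) has local checkpoint 9 and max seq# 9, while R3 (received only seq# 11) also has local checkpoint 9 but max seq# 11: the local checkpoints match even though op 11 on R3 still needs to be trimmed.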
