
Introduce promoting index shard state #28004

Closed
wants to merge 6 commits

Conversation

dnhatn
Member

@dnhatn dnhatn commented Dec 27, 2017

This commit adds a new index shard state - promoting. This state
indicates that a replica is promoting to primary and primary-replica
resync is in progress.

Relates #24841

@dnhatn dnhatn added :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement review v6.2.0 v7.0.0 labels Dec 27, 2017
Contributor

@ywelsch ywelsch left a comment

There are more places where shard states are used. We have to be careful in getting all of those. A grep for IndexShardState.STARTED yields other places where we need to account for the newly introduced shard state, for example in IndexMemoryController, IndicesClusterStateService, IndicesStore, MockFSIndexStore.
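For orientation, the change under review could be sketched as an extra constant on the shard state enum (a hypothetical sketch: the real IndexShardState also carries byte ids, and the helper method below is invented to illustrate exactly the open question — which STARTED call sites should also accept PROMOTING):

```java
// Hypothetical sketch of IndexShardState with the proposed PROMOTING
// constant; the surrounding constants mirror the existing enum.
enum IndexShardState {
    CREATED,
    RECOVERING,
    POST_RECOVERY,
    STARTED,
    PROMOTING, // replica is being promoted to primary; resync in progress
    RELOCATED,
    CLOSED;

    // Every call site that today checks "state == STARTED" (e.g. in
    // IndicesClusterStateService, IndicesStore) must now decide whether
    // PROMOTING should be treated the same way.
    boolean writeableAsPrimary() {
        return this == STARTED || this == PROMOTING;
    }
}
```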

@@ -192,7 +192,7 @@
     private final GlobalCheckpointTracker globalCheckpointTracker;

     protected volatile ShardRouting shardRouting;
-    protected volatile IndexShardState state;
+    protected final AtomicReference<IndexShardState> state;
Contributor

why change this to an AtomicReference? Just because of stylistic reasons or is there more to it? Looking through the PR, I could not find a reason for this change. Every write access to it is guarded by a mutex, and read access is ok with volatile. Let's keep it a volatile variable.
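The pattern the reviewer is describing — all writes serialized by a mutex, lock-free reads through a volatile field — can be sketched like this (a simplified illustration with invented names, not the actual IndexShard code):

```java
// Simplified illustration of "writes under a mutex, reads via volatile":
// the mutex serializes state transitions, while the volatile keyword
// guarantees that lock-free readers observe the latest written value.
class ShardStateHolder {
    private final Object mutex = new Object();
    private volatile String state = "STARTED";

    String state() {              // lock-free read, safe thanks to volatile
        return state;
    }

    String changeState(String newState) {
        synchronized (mutex) {    // every writer takes the mutex
            String previous = state;
            state = newState;
            return previous;
        }
    }
}
```

Under this scheme an AtomicReference buys nothing extra: compare-and-set is unnecessary when every writer already holds the mutex.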

Member Author

Yes, I will revert this.

@@ -609,6 +607,11 @@ public void relocated(
     }

     private void verifyRelocatingState() {
+        final IndexShardState state = state();
+        if (state == IndexShardState.PROMOTING) {
Contributor

this is already covered by the check if (state != IndexShardState.STARTED) { below?

@dnhatn
Member Author

dnhatn commented Dec 28, 2017

Thanks @ywelsch, I will look at all other places that use IndexShardState.STARTED.

@dnhatn
Member Author

dnhatn commented Dec 28, 2017

@ywelsch, I've addressed your comments. Would you please take a look? Thank you.

    boolean resyncStarted = primaryReplicaResyncInProgress.compareAndSet(false, true);
    if (resyncStarted == false) {
        throw new IllegalStateException("cannot start resync while it's already in progress");
    }
    final IndexShardState prevState = changeState(IndexShardState.PROMOTING, "Promoting to primary");
Contributor

instead of changing the state first and then checking whether the previous state was the right one, let's only change the state if the current state matches (note that we're under the mutex here already, so it's safe to do this).
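The suggested check-then-change pattern might look like this (a hypothetical sketch; the state field and its values stand in for the real IndexShard machinery, and the safety argument is the one from the comment — the caller already holds the mutex):

```java
// Hypothetical sketch of "check first, then change": verify the current
// state before transitioning, instead of transitioning first and
// validating the previous state afterwards. Only safe because the caller
// already holds the shard's mutex.
class PromotionSketch {
    private String state = "STARTED"; // guarded by the caller's mutex

    void moveToPromoting() {
        if (!state.equals("STARTED")) {       // check first...
            throw new IllegalStateException(
                "cannot promote a shard in state [" + state + "]");
        }
        state = "PROMOTING";                  // ...then change
    }

    String state() {
        return state;
    }
}
```

The advantage is that an illegal transition fails without ever leaving the shard in the wrong state.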

    boolean resyncCompleted = primaryReplicaResyncInProgress.compareAndSet(true, false);
    assert resyncCompleted : "primary-replica resync finished but was not started";
    synchronized (mutex) {
        final IndexShardState prevState = changeState(IndexShardState.STARTED, "Resync is completed");
Contributor

same comment as above, let's check state first and then change it.

    }

    @Override
    public void onFailure(Exception e) {
        boolean resyncCompleted = primaryReplicaResyncInProgress.compareAndSet(true, false);
Contributor

as we have no corresponding state transition here now, does it mean that the shard can be stuck forever in the PROMOTING state, allowing no relocations?

@dnhatn
Member Author

dnhatn commented Dec 29, 2017

@ywelsch Could you please give it another go? Thank you.

@ywelsch
Contributor

ywelsch commented Jan 2, 2018

I've talked to @jasontedor and @dnhatn and suggested taking another approach. I don't like adding new shard states that somewhat replicate information which is already available in the GlobalCheckpointTracker (to be renamed ReplicationTracker). We could instead get rid of the RELOCATED state, whose purpose is to say that the primary is no longer in charge of assigning sequence numbers, something already covered by GlobalCheckpointTracker.primaryMode, which we can use instead.

To ensure that we only complete primary relocation after resync is done, we can do the following. When calling GlobalCheckpointTracker.startRelocationHandoff() we could check that the local checkpoints for all active shards match the local checkpoint of the primary shard. This would be a sufficient condition to relocate the shard (i.e. all in-sync copies have sufficiently caught up with the primary, so that the new primary won't need to trigger a resync of its own). Note that this requires the resync to be less lenient than it is today, namely to fail replicas if it does not successfully complete on those replicas, which will be addressed by a separate upcoming PR.
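The precondition proposed for startRelocationHandoff() could be sketched as follows (hypothetical names and a plain Map stand in for the real tracker's bookkeeping; this is not the GlobalCheckpointTracker API):

```java
import java.util.Map;

// Hypothetical sketch of the proposed handoff precondition: relocation may
// only start once every in-sync copy's local checkpoint equals the
// primary's, so the new primary would have no resync work left to do.
class RelocationHandoffSketch {
    static boolean canStartHandoff(long primaryLocalCheckpoint,
                                   Map<String, Long> inSyncLocalCheckpoints) {
        return inSyncLocalCheckpoints.values().stream()
                .allMatch(cp -> cp == primaryLocalCheckpoint);
    }
}
```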

@dnhatn dnhatn closed this Jan 21, 2018
@dnhatn dnhatn deleted the promoting_state branch January 21, 2018 15:08
@dnhatn
Member Author

dnhatn commented Jan 21, 2018

Thanks @ywelsch for your suggestion. I am closing this.

@bleskes
Contributor

bleskes commented Mar 22, 2018

Some clarification for future readers - the reason why relying on the GlobalCheckpointTracker's knowledge of the local checkpoints to detect that the resync has finished works is the following:

  1. Once a replica is promoted, it doesn't know the local checkpoints of the other replicas.
  2. Fetching those checkpoints will cause the local checkpoint of the replicas to go up.
  3. When startRelocationHandoff is called, all operation permits on the primary have been acquired, meaning that all ongoing operations have completed. That in turn means that on the (new) primary, the local checkpoint is equal to the max seq#.
  4. If the resync has previously completed, we expect the local checkpoint of the replicas to be equal to the local checkpoint of the primary, which is equal to the max seq#.
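Tying the points together in a small sketch (illustrative values and invented names only, mimicking the checkpoint arithmetic rather than any Elasticsearch API): with all operation permits held, the primary's local checkpoint equals its max seq#, so replica checkpoints matching the primary's imply the resync has caught everyone up.

```java
// Illustrative sketch of the handoff invariant: the primary's local
// checkpoint must equal its max seq# (point 3), and every replica's
// local checkpoint must have caught up to that value (point 4).
class HandoffInvariantSketch {
    static boolean resyncLooksComplete(long primaryLocalCheckpoint,
                                       long primaryMaxSeqNo,
                                       long[] replicaLocalCheckpoints) {
        if (primaryLocalCheckpoint != primaryMaxSeqNo) {
            return false; // in-flight operations are still draining
        }
        for (long cp : replicaLocalCheckpoints) {
            if (cp != primaryLocalCheckpoint) {
                return false; // some replica has not caught up yet
            }
        }
        return true;
    }
}
```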

@ywelsch
Contributor

ywelsch commented Mar 26, 2018

To ensure that we only complete primary relocation after resync is done, we can do the following. When calling GlobalCheckpointTracker.startRelocationHandoff() we could check that the local checkpoints for all active shards match the local checkpoint of the primary shard. This would be a sufficient condition to relocate the shard ...

While the presented logic is correct for what the primary-replica resync is doing today, it's incorrect if the future primary-replica sync has the additional job of trimming / rolling back portions of the translog.
Assume that primary P1 fails, but has two in-flight operations to replica R2 and R3. Assume replica R2 receives none of the two ops, and R3 receives only the operation with the higher sequence number, creating a gap on R3. If R2 is now promoted to primary, its local checkpoint will match the local checkpoint on R3 under the new term, but the max sequence number on R3 will not match the max sequence number on R2. If R2 then relocates to a different place before the primary-replica resync gets to send a trim command to R3, that trim command might never make it to R3.
Let's wait on making a decision for this until we have the primary-replica resync with rollback implemented.
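The P1/R2/R3 scenario can be made concrete with illustrative numbers (a toy sketch of the checkpoint arithmetic only, not Elasticsearch internals): the local checkpoint is the highest seq# up to which all operations have been processed, so a gap caps it even when a higher op has already arrived.

```java
// Toy model of a shard copy's sequence-number bookkeeping:
// processed[seq] == true means the op with that seq# arrived.
class ResyncGapSketch {
    // Local checkpoint = highest seq# with no gaps below it.
    static long localCheckpoint(boolean[] processed) {
        long checkpoint = -1;
        for (int seq = 0; seq < processed.length; seq++) {
            if (!processed[seq]) {
                break; // the first gap caps the checkpoint
            }
            checkpoint = seq;
        }
        return checkpoint;
    }

    // Max seq# = highest seq# seen at all, gaps or not.
    static long maxSeqNo(boolean[] processed) {
        for (int seq = processed.length - 1; seq >= 0; seq--) {
            if (processed[seq]) {
                return seq;
            }
        }
        return -1;
    }
}
```

With ops 0..9 applied everywhere and the two in-flight ops at seq# 10 and 11, R2 (received neither) has local checkpoint 9 and max seq# 9, while R3 (received only seq# 11) also has local checkpoint 9 but max seq# 11: the local checkpoints match even though op 11 on R3 still needs to be trimmed.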
