
Waiting for all shards to be active after a cluster restart may never be possible for a shrink step #35321

Closed
dakrone opened this issue Nov 6, 2018 · 1 comment
Assignees: dakrone
Labels: >bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management)

Comments

dakrone (Member) commented Nov 6, 2018

Consider the following scenario:

An index with at least one replica is just about to start its shrink step, so ILM does the following (steps 1 and 2 are sketched as settings after the list):

  1. sets the index to read-only
  2. sets the index to be allocated only on node_id:123XYZ
  3. waits for a copy of each shard on node_id:123XYZ
  4. performs the shrink step
  5. etc
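
Steps 1 and 2 come down to two index-level settings updates. As a minimal sketch, assuming the standard write-block and allocation-filter setting keys (illustrative, not the actual ILM source):

    import org.elasticsearch.common.settings.Settings;

    // Sketch of the index settings applied in preparation for the shrink.
    Settings shrinkPrep = Settings.builder()
        .put("index.blocks.write", true)                        // step 1: make the index read-only
        .put("index.routing.allocation.require._id", "123XYZ")  // step 2: allow shards only on node 123XYZ
        .build();

With require._id set, every copy of every shard is permitted only on that single node, which is what sets up the failure described next.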

If the user restarts the cluster after step 2 completes but before step 3 finishes, then when the cluster comes back up the replicas for the index will not be allowed to allocate: the _id filter from step 2 confines every copy to node 123XYZ, and a node may not hold more than one copy of the same shard, so only the primaries can start. This means the check in step 3 can never pass, due to the check at:

    if (ActiveShardCount.ALL.enoughShardsActive(clusterState, index.getName()) == false) {
        logger.debug("[{}] shrink action for [{}] cannot make progress because not all shards are active",
            getKey().getAction(), index.getName());
        return new Result(false, new CheckShrinkReadyStep.Info("", expectedShardCount, -1));
    }
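
For reference, a paraphrase of what the ALL count demands; this is a sketch of the documented semantics (every primary and every replica started), not the actual ActiveShardCount source:

    import org.elasticsearch.cluster.ClusterState;
    import org.elasticsearch.cluster.routing.IndexRoutingTable;
    import org.elasticsearch.cluster.routing.IndexShardRoutingTable;

    // Paraphrase of ActiveShardCount.ALL.enoughShardsActive(state, indexName):
    // true only if every copy (primary and all replicas) of every shard is active.
    static boolean allShardsActive(ClusterState state, String indexName) {
        IndexRoutingTable indexRoutingTable = state.routingTable().index(indexName);
        if (indexRoutingTable == null) {
            return false;
        }
        for (IndexShardRoutingTable shardTable : indexRoutingTable) {
            if (shardTable.activeShards().size() < shardTable.size()) {
                return false; // some copy, e.g. a filtered-out replica, is not active
            }
        }
        return true;
    }

Since the filtered-out replicas can never become active, this condition stays false indefinitely.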

The index is then stuck on this step forever, with the ILM explain API reporting:

    "test-000039" : {
      "step" : "check-shrink-allocation",
      "step_time" : "2018-11-06T22:54:39.805Z",
      "step_time_millis" : 1541544879805,
      "step_info" : {
        "message" : "Waiting for all shards to become active",
        "node_id" : "",
        "shards_left_to_allocate" : -1,
        "expected_shards" : 2
      }
    },

Since shrink does not require all copies of every shard to be active (it only needs one copy of each shard on the designated node), we should remove this check.
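
For illustration, a sketch of a relaxed readiness check under that reasoning: require one started copy of each shard on the node chosen in step 2, and ignore unassigned replicas. The helper below is hypothetical, not the actual CheckShrinkReadyStep code:

    import org.elasticsearch.cluster.ClusterState;
    import org.elasticsearch.cluster.routing.ShardRouting;
    import org.elasticsearch.index.Index;

    // Hypothetical helper: shrink can proceed once every shard has a started
    // copy on the target node, even if replicas remain unassigned.
    static boolean readyToShrink(ClusterState clusterState, Index index, String nodeId) {
        int copiesOnTargetNode = 0;
        for (ShardRouting shard : clusterState.getRoutingTable().allShards(index.getName())) {
            if (shard.started() && nodeId.equals(shard.currentNodeId())) {
                copiesOnTargetNode++;
            }
        }
        // A node holds at most one copy of a given shard, so this count equals
        // the number of distinct shards present on the target node.
        int expectedShards = clusterState.metaData().index(index).getNumberOfShards();
        return copiesOnTargetNode >= expectedShards;
    }

Counting started copies on the single target node directly matches the allocation requirement set in step 2, so a restart no longer leaves the step waiting on replicas that can never allocate.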

dakrone added the >bug and :Data Management/ILM+SLM (Index and Snapshot lifecycle management) labels Nov 6, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-infra

dakrone self-assigned this Nov 6, 2018
dakrone added a commit to dakrone/elasticsearch that referenced this issue Nov 7, 2018
Since it's still possible to shrink an index when replicas are unassigned, we
should not check that all copies are available when performing the shrink, since
we set the allocation requirement for a single node.

Resolves elastic#35321
dakrone added a commit that referenced this issue Nov 7, 2018
Since it's still possible to shrink an index when replicas are unassigned, we
should not check that all copies are available when performing the shrink, since
we set the allocation requirement for a single node.

Resolves #35321
pgomulka pushed a commit to pgomulka/elasticsearch that referenced this issue Nov 13, 2018
Since it's still possible to shrink an index when replicas are unassigned, we
should not check that all copies are available when performing the shrink, since
we set the allocation requirement for a single node.

Resolves elastic#35321