
ILM shrink action runs when shards aren't allocated on the same node #34938

Closed

dakrone opened this issue Oct 26, 2018 · 1 comment · Fixed by #35161
Labels: blocker, >bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management)


dakrone commented Oct 26, 2018

For ILM, we have a step that allocates all of an index's shards on a single node so that we can then call the shrink/resize action. In some cases, however, the shrink runs after the index has been allocated to a single node but still errors out because the shards are not all on the same node:

    "test-000019" : {
      "step" : "ERROR",
      "step_time" : 1540588519429,
      "step_info" : {
        "type" : "illegal_state_exception",
        "reason" : "index test-000019 must have all shards allocated on the same node to shrink index"
      }
    },
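For reference, what ILM does here is roughly the manual shrink procedure: pin a copy of every shard to one node and block writes, then call the resize API. A sketch only (the node name hot2 and the target index name are illustrative, not what ILM generates internally):

PUT /test-000019/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "hot2",
    "index.blocks.write": true
  }
}

POST /test-000019/_shrink/shrink-test-000019
{
  "settings": {
    "index.number_of_shards": 1
  }
}

The shrink call is the part that requires at least one copy of every shard to be on the same node, which is the condition the error above reports as unmet.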

I was able to reproduce this with the following configuration (the node type attribute setup is sketched just after this list):

  • 2 nodes with the "hot" type
  • 3 nodes with the "cold" type
  • 1 node with the "other" type
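The "hot"/"cold"/"other" types above are a custom node attribute; presumably each node was started with something like the following in its elasticsearch.yml (an assumption about the setup, not copied from the report, but it is what the index.routing.allocation.include.type filtering below keys off of):

# on the two "hot" nodes; the cold/other nodes would use "cold" / "other" respectively
node.attr.type: hot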

Using a 1-second poll interval:

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.xpack.core.indexlifecycle": "TRACE",
    "logger.org.elasticsearch.xpack.indexlifecycle": "TRACE",
    "indices.lifecycle.poll_interval": "1s"
  }
}

The following policy:

PUT _ilm/my_lifecycle3
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "5s"
          }
        }
      },
      "warm": {
        "minimum_age": "30s",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          },
          "shrink": {
            "number_of_shards": 1
          },
          "allocate": {
            "include": {
              "type": ""
            },
            "exclude": {},
            "require": {}
          }
        }
      },
      "cold": {
        "minimum_age": "1m",
        "actions": {
          "allocate": {
            "number_of_replicas": 2,
            "include": {
              "type": "cold"
            },
            "exclude": {},
            "require": {}
          }
        }
      },
      "delete": {
        "minimum_age": "2m",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Index template:

PUT _template/my_template
{
  "index_patterns": ["test-*"],
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "index.lifecycle.name": "my_lifecycle3",
    "index.lifecycle.rollover_alias": "test-alias",
    "index.routing.allocation.include.type": "hot"
  }
}

Then, I created an index:

PUT test-000001
{
  "aliases": {
    "test-alias":{
      "is_write_index": true
    }
  }
}

And then continually ran:

GET /*/_ilm/explain?filter_path=indices.*.step*

until I saw a failure similar to the one above (it took anywhere from 1 to 30 minutes to reproduce).
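For anyone trying to reproduce this, a simple loop against a local node is enough; this is just an example and assumes a node listening on localhost:9200:

# poll the explain API once a second and watch for the ERROR step
while true; do
  curl -s 'localhost:9200/*/_ilm/explain?filter_path=indices.*.step*&pretty'
  sleep 1
done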

I've added some extra debug logging to see what's going on with the allocation check:

[2018-10-26T15:15:19,399][TRACE][o.e.x.i.ExecuteStepsUpdateTask] [hot1] [test-000019] waiting for cluster state step condition (AllocationRoutedStep) [{"phase":"warm","action":"shrink","name":"check-allocation"}], next: [{"phase":"warm","action":"shrink","name":"shrink"}]
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> SHRINK checking whether [test-000019] has enough shards allocated
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> shard [test-000019][1], node[Mi73iCROTT2dM4We9oQIgA], [P], s[STARTED], a[id=IXX6Ix8EQdmsvhNT-7BQug] cannot remain on Mi73iCROTT2dM4We9oQIgA, allocPendingThisShard: 1
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> SHRINK shardCopiesThisShard(2) - allocationPendingThisShard(1) == 0 ? 1 
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> shard [test-000019][0], node[RiSQ1bfhSkS_G90VZH-BLA], [R], s[STARTED], a[id=iCGSUFcYRXWl8yvDtcuhHg] cannot remain on RiSQ1bfhSkS_G90VZH-BLA, allocPendingThisShard: 1
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> SHRINK shardCopiesThisShard(2) - allocationPendingThisShard(1) == 0 ? 1 
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] SHRINK [shrink] lifecycle action for index [[test-000019/pIKgUp5bTpCxZhJMOAWRxg]] complete
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> test-000019 SUCCESS allocationPendingAllShards: 0
[2018-10-26T15:15:19,399][TRACE][o.e.x.i.ExecuteStepsUpdateTask] [hot1] [test-000019] cluster state step condition met successfully (AllocationRoutedStep) [{"phase":"warm","action":"shrink","name":"check-allocation"}], moving to next step {"phase":"warm","action":"shrink","name":"shrink"}

And then, a bit further down in the log:

[2018-10-26T15:15:19,428][ERROR][o.e.x.i.IndexLifecycleRunner] [hot1] policy [my_lifecycle3] for index [test-000019] failed on step [{"phase":"warm","action":"shrink","name":"shrink"}]. Moving to ERROR step
java.lang.IllegalStateException: index test-000019 must have all shards allocated on the same node to shrink index
	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateShrinkIndex(MetaDataCreateIndexService.java:679) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.prepareResizeIndexSettings(MetaDataCreateIndexService.java:740) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:406) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

It looks like the allocation check succeeds and considers the shards to be in the right place, but the shrink then fails nonetheless.

It's worth noting that I could only reproduce this with a 1-second poll interval, so it may be a timing issue. Also, the /_cat/shards output does show the shards correctly allocated (hot2 is the node that ILM set as the _name allocation filtering target):

test-000019        1     r      STARTED    0  261b 127.0.0.1 hot2
test-000019        1     p      STARTED    0  261b 127.0.0.1 hot1
test-000019        0     p      STARTED    0  261b 127.0.0.1 hot2
test-000019        0     r      STARTED    0  261b 127.0.0.1 other
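For reference, the allocation filtering that ILM wrote onto the index can be checked with the standard filtered get-settings call (shown here only as an illustration); in this run the _name filter pointed at hot2:

GET /test-000019/_settings/index.routing.allocation.*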
dakrone added the >bug and :Data Management/ILM+SLM (Index and Snapshot lifecycle management) labels on Oct 26, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-infra

dakrone added a commit to dakrone/elasticsearch that referenced this issue Nov 1, 2018
This adds a new step for checking whether an index is allocated correctly, based on the rules added prior to running the shrink step. It also fixes a bug: for shrink, the shards are not allowed to be relocating when the shrink step runs.

Resolves elastic#34938
dakrone added a commit that referenced this issue Nov 5, 2018
This adds a new step for checking whether an index is allocated correctly, based on the rules added prior to running the shrink step. It also fixes a bug: for shrink, the shards are not allowed to be relocating when the shrink step runs.

This also allows us to simplify AllocationRoutedStep and provide better
feedback in the step info for why either the allocation or the shrink checks
have failed.

Resolves #34938