For ILM, we have a step that allocates an index onto a single machine so that we can then call the shrink/resize action. However, in some cases the shrink runs after the index has been allocated to a single node, but still errors out because the shards are not all on the same node:
"test-000019" : {
"step" : "ERROR",
"step_time" : 1540588519429,
"step_info" : {
"type" : "illegal_state_exception",
"reason" : "index test-000019 must have all shards allocated on the same node to shrink index"
}
},
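For context, the "allocate to a single node" preparation works through _name-based allocation filtering (hot2 ended up being the target node in my reproduction, as noted with the /_cat/shards output further down). Conceptually it corresponds to an index settings update along these lines; this is only a sketch for illustration, not necessarily the exact request ILM issues internally:

```
# sketch: pin the index to one node via _name allocation filtering
# (the exact setting/filter ILM applies internally may differ)
PUT test-000019/_settings
{
  "index.routing.allocation.require._name": "hot2"
}
```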
I was able to reproduce this with the following configuration: a 1 second poll interval, a lifecycle policy that rolls over in the hot phase and shrinks in the warm phase, and an index template applying that policy. Then I created an index and continually indexed documents and checked the ILM state until I saw a failure similar to the one above (took 1-30 minutes to reproduce). A rough sketch of this setup is shown below.
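Since the exact requests aren't reproduced here, the following is a hypothetical sketch of that setup. The template name, the rollover/shrink thresholds, and the test rollover alias are illustrative guesses (inferred from the test-0000NN index names and the two-shard, one-replica layout in the /_cat/shards output below), not the originals:

```
# 1 second ILM poll interval
PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": "1s"
  }
}

# Lifecycle policy: rollover in hot, shrink in warm (thresholds are illustrative)
PUT _ilm/policy/my_lifecycle3
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_docs": 1 }
        }
      },
      "warm": {
        "actions": {
          "shrink": { "number_of_shards": 1 }
        }
      }
    }
  }
}

# Index template binding new indices to the policy (template name and alias are guesses)
PUT _template/test_template
{
  "index_patterns": ["test-*"],
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 1,
    "index.lifecycle.name": "my_lifecycle3",
    "index.lifecycle.rollover_alias": "test"
  }
}

# Bootstrap the first index behind the write alias
PUT test-000001
{
  "aliases": {
    "test": { "is_write_index": true }
  }
}

# Then, in a loop: index a document to trigger rollovers and watch ILM progress
POST test/_doc
{ "field": "value" }

GET test-*/_ilm/explain
```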
I've added additional logging to see what's going on with the node:
[2018-10-26T15:15:19,399][TRACE][o.e.x.i.ExecuteStepsUpdateTask] [hot1] [test-000019] waiting for cluster state step condition (AllocationRoutedStep) [{"phase":"warm","action":"shrink","name":"check-allocation"}], next: [{"phase":"warm","action":"shrink","name":"shrink"}]
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> SHRINK checking whether [test-000019] has enough shards allocated
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> shard [test-000019][1], node[Mi73iCROTT2dM4We9oQIgA], [P], s[STARTED], a[id=IXX6Ix8EQdmsvhNT-7BQug] cannot remain on Mi73iCROTT2dM4We9oQIgA, allocPendingThisShard: 1
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> SHRINK shardCopiesThisShard(2) - allocationPendingThisShard(1) == 0 ? 1
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> shard [test-000019][0], node[RiSQ1bfhSkS_G90VZH-BLA], [R], s[STARTED], a[id=iCGSUFcYRXWl8yvDtcuhHg] cannot remain on RiSQ1bfhSkS_G90VZH-BLA, allocPendingThisShard: 1
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> SHRINK shardCopiesThisShard(2) - allocationPendingThisShard(1) == 0 ? 1
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] SHRINK [shrink] lifecycle action for index [[test-000019/pIKgUp5bTpCxZhJMOAWRxg]] complete
[2018-10-26T15:15:19,399][DEBUG][o.e.x.c.i.AllocationRoutedStep] [hot1] --> test-000019 SUCCESS allocationPendingAllShards: 0
[2018-10-26T15:15:19,399][TRACE][o.e.x.i.ExecuteStepsUpdateTask] [hot1] [test-000019] cluster state step condition met successfully (AllocationRoutedStep) [{"phase":"warm","action":"shrink","name":"check-allocation"}], moving to next step {"phase":"warm","action":"shrink","name":"shrink"}
And then a bit further down:
[2018-10-26T15:15:19,428][ERROR][o.e.x.i.IndexLifecycleRunner] [hot1] policy [my_lifecycle3] for index [test-000019] failed on step [{"phase":"warm","action":"shrink","name":"shrink"}]. Moving to ERROR step
java.lang.IllegalStateException: index test-000019 must have all shards allocated on the same node to shrink index
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateShrinkIndex(MetaDataCreateIndexService.java:679) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.prepareResizeIndexSettings(MetaDataCreateIndexService.java:740) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:406) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
It looks like the check succeeds and reports that the shards are in the right place, but the shrink fails nonetheless.
It's worth noting I could only reproduce this with a 1 second poll interval, so it may be a timing issue. Also, the shards do appear to be correctly allocated according to the /_cat/shards output (hot2 is the node that ILM set as the _name allocation filtering target):
test-000019 1 r STARTED 0 261b 127.0.0.1 hot2
test-000019 1 p STARTED 0 261b 127.0.0.1 hot1
test-000019 0 p STARTED 0 261b 127.0.0.1 hot2
test-000019 0 r STARTED 0 261b 127.0.0.1 other
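For reference, output in that shape comes from the cat shards API, and the allocation filtering currently applied to the index can be checked from the index settings; something like the following works (the column list and filter_path are just one convenient way to trim the output):

```
# which node holds each shard copy
GET _cat/shards/test-000019?v&h=index,shard,prirep,state,docs,store,ip,node

# the allocation filtering currently applied to the index
GET test-000019/_settings?filter_path=*.settings.index.routing
```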
This adds a new step for checking whether an index is allocated correctly, based on the rules added prior to running the shrink step. It also fixes a bug where the allocation check could pass while shards were still relocating, which is not allowed for the shrink step.
Resolves elastic#34938
This adds a new step for checking whether an index is allocated correctly, based on the rules added prior to running the shrink step. It also fixes a bug where the allocation check could pass while shards were still relocating, which is not allowed for the shrink step.
This also allows us to simplify AllocationRoutedStep and provide better feedback in the step info about why either the allocation or the shrink checks have failed.
Resolves #34938