
unassigned replica shards after cluster restart #9602

Closed
darsh221 opened this issue Feb 6, 2015 · 11 comments

@darsh221

darsh221 commented Feb 6, 2015

We are using ES version 1.4.1. We have an hourly index with 53 shards per index and a replication factor of 1. We restarted our cluster after some maintenance on the system side; no, we don't have any disk failures. After the restart, I see that all primary shards are assigned but all replicas are unassigned.

These are the steps we followed for the restart:

  1. Flush all indices
  2. Stop all nodes using the shutdown command
  3. Start all master nodes only
  4. Set "cluster.routing.allocation.enable" : "none"
  5. Start all data nodes
  6. After all nodes joined the cluster, set "cluster.routing.allocation.enable" : "all" (see the sketch after this list)
  7. I see all primary shards are assigned but replica shards are unassigned.
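
A minimal sketch of the allocation toggle in steps 4 and 6, assuming the default HTTP port 9200 and the transient cluster settings API:

# step 4: disable shard allocation before starting the data nodes
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient" : { "cluster.routing.allocation.enable" : "none" }
}'

# step 6: re-enable allocation once all nodes have joined
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient" : { "cluster.routing.allocation.enable" : "all" }
}'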
@darsh221
Author

darsh221 commented Feb 6, 2015

I see the cluster started initializing replica shards after 2 hours. For those 2 hours there was nothing in pending tasks other than the
reroute_after_cluster_update_settings task. I didn't see anything useful in the master server logs. Why is this so slow?

@bleskes
Contributor

bleskes commented Feb 9, 2015

@darsh221 something is holding the replicas back from being re-assigned. It may be the DiskThresholdDecider protecting against running out of disk space. You can get an explanation of the current decisions using: curl -XPOST "http://localhost:9200/_cluster/reroute?explain"

@darsh221
Author

darsh221 commented Feb 9, 2015

We are not using any disk threshold watermark settings, so it would be whatever the defaults are. The disks in our cluster are only 5% full. The cluster is currently in green status, so I can no longer see the reroute info.

I tried running "http://localhost:9200/_cluster/reroute?explain" but I got

{
  "error": "ElasticsearchIllegalArgumentException[No feature for name [reroute]]",
  "status": 400
}

More info about our cluster:
40 physical servers, each with 200 GB of RAM and 32 CPUs.
120 data nodes: 3 nodes on each physical server, with 30 GB of RAM each.
5 master nodes.
We are using one 3 * 2 and two 2 * 2 RAIDs on each server.
We are indexing 10-12 TB of data every day: 2 different indices per hour with 53 shards each. Currently each shard is 3-5 GB in size.

Configs we are using

discovery.zen.ping.multicast.enabled: false
node.master : false
node.data : true
transport.tcp.port: 9311
index.number_of_shards: 53
index.number_of_replicas: 1
index.refresh_interval: 30s
action.disable_delete_all_indices: true
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping_timeout : 1m
discovery.zen.join_timeout : 10m
script.disable_dynamic : false
gateway.recover_after_data_nodes: 110
gateway.expected_nodes: 117
gateway.local.auto_import_dangled: yes
discovery.zen.ping.unicast.hosts: [<the 5 master nodes>]
cluster.routing.allocation.same_shard.host: true
cluster.routing.allocation.cluster_concurrent_rebalance: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 20
cluster.routing.allocation.node_concurrent_recoveries: 10
indices.recovery.concurrent_streams: 8
indices.recovery.max_bytes_per_sec: 100mb
threadpool.bulk.type: fixed
threadpool.bulk.size: 75
threadpool.bulk.queue_size: -1
threadpool.index.type: fixed
threadpool.index.size: 75
threadpool.index.queue_size: -1
indices.store.throttle.max_bytes_per_sec : 200mb
index.merge.scheduler.max_thread_count: 8
index.store.type: mmapfs

@bleskes
Contributor

bleskes commented Feb 10, 2015

> I tried running "http://localhost:9200/_cluster/reroute?explain" but I got

This should be a POST, not a GET. Note the curl command I sent.

I initially misread the ticket: I thought your shards were not initializing at all for 2 hours. Now I read that they started initializing after 2 hours. In those two hours, did all data nodes successfully join the cluster? When you ran pending_tasks, did you see any task with the executing flag set to true?
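
For reference, a sketch of both calls, assuming the default HTTP port 9200:

# the reroute explanation must be a POST, not a GET
curl -XPOST "http://localhost:9200/_cluster/reroute?explain&pretty"

# list the pending cluster tasks and look for entries with "executing" : true
curl -XGET "http://localhost:9200/_cluster/pending_tasks?pretty"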

A couple of comments about your settings

discovery.zen.ping_timeout : 1m

This is quite high. Any reason for that?

threadpool.bulk.type: fixed
threadpool.bulk.size: 75
threadpool.bulk.queue_size: -1
threadpool.index.type: fixed
threadpool.index.size: 75

We have smart defaults in ES based on the number of cores; I think you can remove these. The unbounded queue in particular is a recipe for memory issues.

index.store.type: mmapfs

Why did you change this from the default? ES now uses a smarter default directory implementation (mmapfs for smaller files, NIO for bigger ones).

@btecu

btecu commented Mar 5, 2015

I'm having the same issue. I restarted my master and the other node (the slave) became the master, but the slave now doesn't get any shards. I have n shards assigned on the master and n unassigned shards (the slave doesn't have anything).

Ideas?

@clintongormley
Contributor

No more info from original poster, so closing.

@btecu, you'd need to provide more info than you have so far for us to have any chance of diagnosing this. Please feel free to open a new ticket if you're still seeing this issue.

@allthedrones

I've experienced this same thing a couple of times now in 1.4.5. I'll try to provide as much detail as I can...

1 - Disable shard allocation.
2 - Reboot data nodes one at a time, waiting for the cluster to return to yellow in between (primaries online, replicas unassigned; see the health-check sketch below).
3 - After the last node and all primaries are online, re-enable allocation to bring the replicas back online.
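
A minimal sketch of the wait in step 2, assuming the default HTTP port 9200 (the 10m timeout is arbitrary):

# block until the cluster reports at least yellow status (all primaries assigned)
curl -XGET "http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=10m&pretty"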

But the replicas don't start assigning for a long time (update: in the most recent case, the reroute_after_cluster_update_settings task was executing for ~30 minutes, after which replica assignment began). At the top of the pending tasks queue is this:

{
    "insert_order" : 4581,
    "priority" : "URGENT",
    "source" : "reroute_after_cluster_update_settings",
    "executing" : true,
    "time_in_queue_millis" : 736479,
    "time_in_queue" : "12.2m"
}

Trying to execute the curl command as noted here results in an error:
curl -XPOST "http://localhost:9200/_cluster/reroute?explain"

{"error":"RemoteTransportException[[acme-stage-em3][inet[/10.0.0.10:9300]][cluster:admin/reroute]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_reroute (api)) within 30s]; ","status":503}

Other relevant details: [heavy] indexing is still going on during this period (for only one index out of about 100). The cluster has 3 master-only, 2 client-only, and 4 data-only nodes.

The only possibly suspicious log entry is a timeout complaint in the elected master's log file, a few minutes after re-enabling allocation:

[2015-08-03 16:21:11,378][INFO ][cluster.routing.allocation.decider] [acme-stage-em3] updating [cluster.routing.allocation.enable] from [NONE] to [ALL]
[2015-08-03 16:25:08,625][WARN ][gateway.local            ] [acme-stage-em3] [c1-15.7.16][1]: failed to list shard stores on node [uqW-DKN2QtWtB3go3hufiQ]
org.elasticsearch.action.FailedNodeException: Failed node [uqW-DKN2QtWtB3go3hufiQ]
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [acme-stage-es3][inet[/10.0.0.6:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [7443495] timed out after [30006ms]
        ... 4 more
[2015-08-03 16:25:09,625][WARN ][transport                ] [acme-stage-em3] Received response for a request that has timed out, sent [31006ms] ago, timed out [1000ms] ago, action [internal:cluster/nodes/indices/shard/store[n]], node [[acme-stage-es3][uqW-DKN2QtWtB3go3hufiQ][acme-stage-es3][inet[/10.0.0.6:9300]]{master=false}], id [7443495]

@clintongormley
Contributor

@allthedrones I suggest upgrading - recent versions have greatly improved this situation.

@allthedrones

Thanks @clintongormley! We're rolling out 1.6 in our pre-production environments now. Are there specific issue numbers that document which improvements address this particular circumstance? Thanks again!

@s1monw
Contributor

s1monw commented Aug 5, 2015

@allthedrones I'd recommend moving to the latest 1.7.1 instead!

@cywjackson

The purpose of this comment is to bring some more awareness to this problem and to share what I did to recover. I think unassigned shards can happen in many different forms, so my solution may not be your solution.

We had faced this problem in a previous upgrade (could have been 0.90.x to 0.90.y, or 0.90.y to 1.0.x, or 1.0.x to 1.3.x; I don't remember). We tried the reroute before; someone even posted a nice script to find all the unassigned shards and allocate them: http://www.unknownerror.org/opensource/elastic/elasticsearch/q/stackoverflow/19967472/elasticsearch-unassigned-shards-how-to-fix (at the very bottom). Notice how it uses "allow_primary": true, so be careful using it. I don't recall how we fixed this last time (not sure if the script did it for us).

We ARE facing this issue again today, while upgrading from 1.3.x to 1.7.y. 😢

I first tried that script without the "allow_primary": true, since I don't want any data loss. I have 195 unassigned shards:

  "number_of_data_nodes" : 24,
  "active_primary_shards" : 2249,
  "active_shards" : 6022,
  ...
  "unassigned_shards" : 195,

It returns a lot of results for each execution, but it didn't work; I still have 195 shards unassigned. So I added the explain parameter and executed it on only 1 shard. My result looks similar to http://stackoverflow.com/questions/32685188/elasticsearch-shard-relocation-not-working, which has:

> {
>   "decider" : "awareness",
>   "decision" : "NO",
>   "explanation" : "too many shards on nodes for attribute: [dc]"
> }

Mine is:

          "decider": "awareness",
          "decision": "NO",
          "explanation": "too many shards on nodes for attribute: [aws_availability_zone]"

Btw, @s1monw, correct me if I am wrong, but I don't think the explain parameter works on a simple GET call with an empty body; otherwise I end up getting the same 400 as others (I haven't gone through everything in #5027 to confirm yet). I got the above result via the POST method with the allocate command in the body. (The body this method requires is quite cumbersome.)

Anyway, I was curious why there are "too many shards on nodes for attribute", so I followed some tips here: https://www.elastic.co/guide/en/elasticsearch/guide/current/_cluster_health.html. After waiting for a while, my cluster was in yellow state instead of red, which means none of the unassigned shards are primaries; that's good. Then I followed the guide at https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html and found two started copies of my shard on usw2c nodes, and one unassigned:

curl -s "http://localhost:9200/_cat/shards/feed_v1" | grep " 16 "
feed_v1 16 p STARTED    167399 53.6mb IP1  usw2c-.... 
feed_v1 16 r STARTED    167399 53.6mb IP2  usw2c-... 
feed_v1 16 r UNASSIGNED

Then I started looking across all the nodes' data directories (chef + bash), and I found that one of my nodes in usw2a has the shard 16 folder. The interesting thing about this node is that it was restarted first, but it seems it failed to initialize the shards on startup. (Key moment here for later.)

In fact, this node has been doing A LOT of relocation since the restart, almost as if it were a brand new node. Here are some more details about our upgrade/restart:

We were running hybrid nodes, and we wanted to convert to master-only/data-only nodes. 3 master-eligible nodes had already been created and joined the cluster, but none of them was the master yet; they all have discovery.zen.minimum_master_nodes=3. Because #10793 is not yet available, we had to restart our hybrid nodes with new settings, and we chose to do that as part of our upgrade. We use chef to create/bootstrap the ES nodes, and those hybrid nodes were previously started with a discovery.zen.minimum_master_nodes value calculated from nodes.length / 2 + 1. The problem with this is that the bootstrapping node itself is not included in nodes.length, and we have been expanding/decommissioning nodes; future expansion or removal of nodes changes this value, but ES may not have been restarted to pick it up. Basically, each node can hold a value that is no longer the most up-to-date one from when the node was started. TL;DR: before that 2a node was shut down, it was a master-eligible node, and another node (let's call it 2b) was the master with a discovery.zen.minimum_master_nodes of X. Once 2a was shut down, the total number of master-eligible nodes in the cluster was X-1, which meant the master (2b) was no longer a master! (This whole saga demonstrates the importance of the request in #10793.)

I immediately shut down 2b (while 2a was starting) and made the necessary changes (set node.master=false and discovery.zen.minimum_master_nodes=3, among other changes as part of the upgrade). This is when I noticed that there were unassigned shards that had failed to initialize, after both 2a and 2b finished their init state.

After letting the relocation run for hours, studying various posts including this one, and doing the analysis (I did more than just the above ^), I decided to stop the allocation ("transient" : {"cluster.routing.allocation.enable" : "none"}), wait and confirm no allocation was running (/_cat/shards?v | egrep -v "START|UNASS"), and restart the same 2a again (refer to the key moment earlier, when it DIDN'T init those shards). Now all the shards were re-initialized and started upon the restart.
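
A rough sketch of that recovery sequence, assuming the default HTTP port 9200 (the re-enable step at the end is an assumed follow-up, not something spelled out above):

# 1. stop shard allocation
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient" : { "cluster.routing.allocation.enable" : "none" }
}'

# 2. confirm nothing is initializing or relocating (only the header line should remain)
curl -s "http://localhost:9200/_cat/shards?v" | egrep -v "START|UNASS"

# 3. restart the node that failed to pick up its local shard copies (2a in this case)

# 4. re-enable allocation once the node is back (assumed follow-up)
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient" : { "cluster.routing.allocation.enable" : "all" }
}'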
