unassigned replica shards after cluster restart #9602
Comments
I see the cluster started initializing replica shards after 2 hours. For those 2 hours there was nothing in pending tasks other than:
@darsh221 something is holding the replicas from being re-assigned. It may be the DiskThresholdAllocator protecting against running out of disk space. You can get an explanation of the current decisions using: curl -XPOST "http://localhost:9200/_cluster/reroute?explain"
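For reference, a sketch of that call with an explicit allocate command so the deciders report their reasoning; the host/port, index name, shard number, and node name are placeholders, and dry_run is included on the assumption your version supports it so nothing is actually moved:

# Ask the master to explain why a specific shard is (or is not) allocatable.
# All identifiers below are placeholders for your cluster.
curl -XPOST "http://localhost:9200/_cluster/reroute?explain&dry_run&pretty" -d '{
  "commands": [
    { "allocate": { "index": "my-index", "shard": 0, "node": "node-1" } }
  ]
}'
# The "explanations" section of the response lists each allocation decider
# (including the disk threshold decider) with its YES/NO decision and reason.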
We are not using any disk threshold watermark settings, so whatever the default is applies. Disks in our cluster are only about 5% full. The cluster is currently in green status, so I will not be able to see the reroute info. I tried running "http://localhost:9200/_cluster/reroute?explain" but I got:
{

More info about our cluster: among the configs we are using is discovery.zen.ping.multicast.enabled: false
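For context, the defaults the disk threshold decider applies in this era of ES are roughly 85% (low) and 90% (high) disk used; a hedged sketch of how to inspect the current cluster settings or relax the watermarks transiently, assuming default host/port:

# Show any non-default cluster settings (watermarks would appear here if set).
curl -XGET "http://localhost:9200/_cluster/settings?pretty"

# Illustrative only: relax the watermarks transiently (values are placeholders).
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}'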
This should be a POST, not a GET. Note the curl command I sent. I initially misread the ticket: I thought your shards were not initializing at all for 2 hours. Now I read that they started initializing after 2 hours. In those two hours, did all data nodes successfully join the cluster? When you ran pending_tasks, did you see any task with the
A couple of comments about your settings:
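A quick way to watch the pending-tasks queue while the cluster is recovering (a minimal sketch, default host/port assumed):

# List cluster-state update tasks queued on the elected master.
curl -XGET "http://localhost:9200/_cluster/pending_tasks?pretty"
# Tasks with a source such as reroute_after_cluster_update_settings and a
# steadily growing time_in_queue suggest the master is backed up.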
This is quite high. Any reason for that?
We have smart defaults in ES based on the number of cores. I think you can remove this. In particular, the unbounded queue is a recipe for memory issues.
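The quoted setting is not visible above, but for illustration, an oversized fixed pool with an unbounded queue in elasticsearch.yml would look something like the hypothetical lines below (names and values are illustrative, not the poster's actual config); deleting such lines falls back to the core-count-based defaults:

# Hypothetical elasticsearch.yml snippet -- not the poster's actual settings.
# queue_size: -1 means unbounded and can exhaust heap under load.
threadpool.search.size: 96
threadpool.search.queue_size: -1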
Why did you change from the default? ES now uses a smarter compound dir format (mmapfs for smaller files, NIO for bigger).
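If you want to check whether an explicit store type is set on an index, the index settings API shows it (a minimal sketch; the index name is a placeholder):

# If index.store.type appears in the output (e.g. niofs or mmapfs), it was set
# explicitly; with no entry, 1.x falls back to its hybrid default store.
curl -XGET "http://localhost:9200/my-index/_settings?pretty"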
I'm having the same issue. I restarted my master and the other node (slave) became the master, but my slave now doesn't get any shards. I have n shards assigned on the master and n unassigned shards (the slave doesn't have anything). Ideas?
No more info from original poster, so closing. @btecu, you'd need to provide more info than you have for us to have any chance of diagnosing. Please feel free to open a new ticket if you're still seeing this issue.
I've experienced this same thing a couple of times now on 1.4.5. I'll try to provide as much detail as I can...
1 - Disable shard allocation,
But the replicas don't start assigning, not for a long time (update: in the most recent case, the pending task was the following):
{
"insert_order" : 4581,
"priority" : "URGENT",
"source" : "reroute_after_cluster_update_settings",
"executing" : true,
"time_in_queue_millis" : 736479,
"time_in_queue" : "12.2m"
}
Trying to execute the curl command as noted here results in an error:
{"error":"RemoteTransportException[[acme-stage-em3][inet[/10.0.0.10:9300]][cluster:admin/reroute]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_reroute (api)) within 30s]; ","status":503}
Other relevant details: [heavy] indexing is still going on during this period (for one index only, out of about 100). The cluster has 3 master-only, 2 client-only, and 4 data-only nodes. The only possibly suspicious log entry is a timeout complaint in the elected master's log file, a few minutes after re-enabling allocation:
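For anyone following the same procedure, the disable/re-enable steps mentioned above are typically done with transient cluster settings like these (a sketch assuming the 1.x setting name cluster.routing.allocation.enable and default host/port):

# Before restarting nodes: stop all shard allocation.
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# After the nodes have rejoined: allow allocation again so replicas can assign.
curl -XPUT "http://localhost:9200/_cluster/settings" -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'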
@allthedrones I suggest upgrading - recent versions have greatly improved this situation
Thanks @clintongormley! We're rolling out 1.6 in our pre-production environments now. Are there specific issue numbers that document which improvements address this particular circumstance? Thanks again!
@allthedrones I'd recommend moving to the latest 1.7.1 instead!
The purpose of this comment is to bring some more awareness to this problem and to share what I have done to recover. I feel these unassigned shards can happen in many different forms, so my solution in the end may not be your solution. We had faced this problem in a previous upgrade (could have been 0.90.x to 0.90.y, or 0.90.y to 1.0.x, or 1.0.x to 1.3.x, I don't remember). I had tried the reroute before; someone even posted a nice script to find all the unassigned shards and allocate them: http://www.unknownerror.org/opensource/elastic/elasticsearch/q/stackoverflow/19967472/elasticsearch-unassigned-shards-how-to-fix , at the very bottom, and notice how he has
We ARE facing this issue today again, when upgrading from 1.3.x to 1.7.y. 😢 I first tried that script without the
It returns a lot of results for each execution, but it didn't work; I still have 195 shards unassigned. So I added the explain parameter and executed it on only one shard. My result looks similar to http://stackoverflow.com/questions/32685188/elasticsearch-shard-relocation-not-working , which has:
Mine is:
Btw, @s1monw, correct me if I am wrong, but I don't think the
Anyway, I was curious why there are
And I started looking across all the nodes' data directories (chef + bash); I found that one of my nodes in usw2a has the shard
In fact, this node has been doing A LOT of relocation since the restart, almost as if it were starting as a new node. Here are some more details about our upgrade/restart: we were running hybrid nodes, and we want to convert to using master-only/data-only nodes. 3 master-eligible nodes were already created and had joined the cluster, but none of them was the master yet. They all have
I immediately shut down 2b (while 2a was starting), made the necessary change (set
After letting the relocation run for hours, studying various posts including this one, and doing the analysis (there is more I have done than the above), I decided to stop the allocation
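For what it's worth, the kind of script referenced above (list every UNASSIGNED shard from _cat/shards and feed it to the reroute allocate command) looks roughly like the sketch below; HOST and NODE are placeholders for your cluster, and note that adding allow_primary to the command can lose data, so it is deliberately left out here:

#!/bin/bash
# Rough sketch of the "allocate every unassigned shard" approach discussed above.
HOST="http://localhost:9200"
NODE="data-node-1"   # placeholder: a data node that may receive the shards

curl -s "$HOST/_cat/shards" | grep UNASSIGNED | while read index shard prirep state rest; do
  curl -XPOST "$HOST/_cluster/reroute" -d "{
    \"commands\": [ { \"allocate\": {
      \"index\": \"$index\", \"shard\": $shard, \"node\": \"$NODE\"
    } } ]
  }"
done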
We are using ES version 1.4.1. We have hourly indices with 53 shards per index and replication set to 1. We restarted our cluster after some maintenance on the system side. No, we don't have any disk failures. After the restart, I see all primary shards are assigned but all replicas are unassigned. Why?
These are the steps we followed to do the restart:
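For completeness, a quick way to watch the unassigned counts per index after a restart like this (a minimal sketch, default host/port assumed):

# Health broken down per index; unassigned_shards should drain to 0 once
# replica allocation resumes.
curl -XGET "http://localhost:9200/_cluster/health?level=indices&pretty"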