Dangling indices living in non-data nodes are detected and auto-imported #27073

Closed
tsouza opened this issue Oct 21, 2017 · 16 comments
Labels: >bug, :Distributed Indexing/Distributed, good first issue, help wanted

Comments

@tsouza

tsouza commented Oct 21, 2017

Elasticsearch version (bin/elasticsearch --version):

Version: 5.5.3, Build: 9305a5e/2017-09-07T15:56:59.599Z, JVM: 1.8.0_151

Plugins installed: []

JVM version (java -version):

java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version (uname -a if on a Unix-like system): Darwin Thiagos-MacBook-Pro.local 17.0.0 Darwin Kernel Version 17.0.0: Thu Aug 24 21:48:19 PDT 2017; root:xnu-4570.1.46~2/RELEASE_X86_64 x86_64

Description of the problem including expected versus actual behavior:

If a non-data node that contains dangling indices in its data path joins a cluster, these dangling indices will be detected and auto-imported.

IMO, a non-data node that contains index data in its data path is probably accidental and unintended. In this case, those dangling indices should not be detected; better yet, the node should not even start (maybe a bootstrap check that fails if a non-data node contains index data in its data path).

Steps to reproduce:

This can be done on a single machine:

  1. Start node-1 with bin/elasticsearch -E path.data=/Users/thiago/data-1 -E node.name=node-1
  2. Start node-2 with bin/elasticsearch -E path.data=/Users/thiago/data-2 -E node.name=node-2
  3. Create an index test configured with 1S/0R with curl -XPUT localhost:9200/test -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
  4. Create a document curl -XPOST localhost:9200/test -d '{ "test": 1 }' -H "Content-Type: application/json"
  5. Stop both nodes
  6. Check which data directory (data-1 or data-2) holds the shard for index test, and delete the other, empty data directory (this effectively creates a dangling index).
  7. Assuming data-2 was the one deleted, start node-2 again with bin/elasticsearch -E path.data=/Users/thiago/data-2 -E node.name=node-2
  8. Start node-1 (which contains dangling indices) as a non-data node with bin/elasticsearch -E path.data=/Users/thiago/data-1 -E node.name=node-1 -E node.data=false
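
Step 6 can be done by hand with ls, but a small helper makes it repeatable. The sketch below assumes the on-disk layout <path.data>/nodes/0/indices/<index-uuid>/<shard-number> that appears in the directory listings later in this thread; the function name dirs_with_shards is made up for illustration and is not part of Elasticsearch.

```shell
#!/bin/sh
# Hypothetical helper: print every data directory that contains at least one
# shard directory, assuming <data>/nodes/0/indices/<uuid>/<shard-number>.
dirs_with_shards() {
  for data_dir in "$@"; do
    for shard in "$data_dir"/nodes/0/indices/*/[0-9]*; do
      # An unmatched glob stays literal, so confirm the path is a real directory.
      if [ -d "$shard" ]; then
        echo "$data_dir"
        break   # one shard is enough; move on to the next data directory
      fi
    done
  done
}

# Example: dirs_with_shards /Users/thiago/data-1 /Users/thiago/data-2
```

Any directory that is not printed holds no shard and is the one to delete in step 6.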

Provide logs (if relevant):

After non-data node node-1 starts, node-2 will detect and auto-import dangling indices even though node-1 is a non-data node:

[2017-10-21T18:02:14,158][INFO ][o.e.g.LocalAllocateDangledIndices] [node-2] auto importing dangled indices [[test/R2Nh9sERThmkJ-0IZ0ppwA]/OPEN] from [{node-1}{RqWMW2AeSXWOpkUm4cT1TA}{lEqpWLIhRqqU_n1DSFuv2Q}{127.0.0.1}{127.0.0.1:9301}]
@dnhatn dnhatn changed the title Dandling indices living in non-data nodes are detected and auto-imported Dangling indices living in non-data nodes are detected and auto-imported Oct 27, 2017
@ywelsch
Contributor

ywelsch commented Oct 27, 2017

We discussed this on Fixit Friday and agreed to add a check that will fail:

  • starting up a non-data node that has shard data (e.g. dedicated master node or coordinating-only node)
  • starting up a coordinating-only node that has index metadata.

This means that some user action (explicitly deleting shard data) is going to be required if a data node is switched to a master-only/coordinating node.
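
The two rules above can be approximated outside Elasticsearch as a pre-start guard. This is only a sketch under stated assumptions, not the check that was eventually implemented: it assumes shard data lives in numerically named directories and index metadata in _state directories under <path.data>/nodes/0/indices/<index-uuid>/ (the layout visible in listings later in this thread), and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names and the directory layout are assumptions.

# Succeeds if any index under the data path has a shard directory (numeric name).
has_shard_data() {
  for d in "$1"/nodes/0/indices/*/[0-9]*; do
    [ -d "$d" ] && return 0
  done
  return 1
}

# Succeeds if any index under the data path has a _state metadata directory.
has_index_metadata() {
  for d in "$1"/nodes/0/indices/*/_state; do
    [ -d "$d" ] && return 0
  done
  return 1
}

# check_node_start <data_path> <is_data: true|false> <is_master: true|false>
# Returns 0 if the node may start, 1 otherwise.
check_node_start() {
  data_path=$1 is_data=$2 is_master=$3
  # Rule 1: any non-data node must not hold shard data.
  if [ "$is_data" = false ] && has_shard_data "$data_path"; then
    echo "refusing to start: non-data node has shard data in $data_path" >&2
    return 1
  fi
  # Rule 2: a coordinating-only node must not hold index metadata either.
  if [ "$is_data" = false ] && [ "$is_master" = false ] && has_index_metadata "$data_path"; then
    echo "refusing to start: coordinating-only node has index metadata in $data_path" >&2
    return 1
  fi
  return 0
}
```

A wrapper script could call check_node_start data-1 false true before launching bin/elasticsearch and abort on a non-zero status. A dedicated master node passes the second rule by design, since only coordinating-only nodes are rejected for holding index metadata.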

@ywelsch ywelsch added help wanted adoptme good first issue low hanging fruit and removed discuss labels Oct 27, 2017
@swethapavan

Is this taken or can I pick it?

@ywelsch
Contributor

ywelsch commented Nov 22, 2017

@swethapavan sure, go ahead.

@swethapavan

Thank you

@jasontedor
Member

jasontedor commented Nov 23, 2017

I think we can fail earlier than the bootstrap checks so I'm not sure if this should be a bootstrap check, isn't it enough to be a check in node environment (we've done this in the past with the default path data issue)?

@ywelsch
Contributor

ywelsch commented Nov 23, 2017

> I'm not sure if this should be a bootstrap check

Yes, I used "bootstrap check" in the broader sense here; I meant "a boot/start time check". It does not require the bootstrap checks code infrastructure.

@swethapavan

I have made the changes, but I get errors when I run some tests because the node fails due to the existence of dangling indices.

@swethapavan

Specifically, these are the tests that fail:

  • org.elasticsearch.indices.flush.FlushIT.testSyncedFlushWithConcurrentIndexing
  • org.elasticsearch.indices.flush.FlushIT.testWaitIfOngoing
  • org.elasticsearch.indices.flush.FlushIT.testSyncedFlush
  • org.elasticsearch.search.geo.GeoShapeIntegrationIT.testOrientationPersistence
  • org.elasticsearch.search.geo.GeoShapeIntegrationIT.testIgnoreMalformed
  • org.elasticsearch.gateway.GatewayIndexStateIT.testJustMasterNode
  • org.elasticsearch.index.store.CorruptedFileIT.testReplicaCorruption

@s1monw
Contributor

s1monw commented Dec 13, 2017

> I think we can fail earlier than the bootstrap checks so I'm not sure if this should be a bootstrap check, isn't it enough to be a check in node environment (we've done this in the past with the default path data issue)?

I wonder if adding it as a bootstrap check is actually a feature (ie. testing for it later). Like I can totally see starting up a node with data=false for testing in my dev env with local host disco etc. and I don't want them to fail in that case? Just putting out my way of thinking here.

@s1monw
Contributor

s1monw commented Dec 13, 2017

@swethapavan please open a pull request or share your code; otherwise we won't be able to help you.

swethapavan pushed a commit to swethapavan/elasticsearch that referenced this issue Dec 14, 2017
… and auto-imported. Some test cases are failing. Need to check further.
@swethapavan

@s1monw I have created a pull request. Kindly have a look.

@ywelsch
Contributor

ywelsch commented Jan 9, 2018

> I wonder if adding it as a bootstrap check is actually a feature (ie. testing for it later). Like I can totally see starting up a node with data=false for testing in my dev env with local host disco etc. and I don't want them to fail in that case?

My preference would be not to have this as a bootstrap check. Bootstrap checks are requirements for going to production, and we should keep them at a strict minimum so that the difference between prod and dev stays low. For this particular check, I don't see a good reason why we would not want to enforce it for development mode as well. If you want to start up a node with data=false for testing, and you happen to do that on a data folder which previously had a node with data, you can just as easily define a different path.data.

@lcawl lcawl added :Search/Search Search-related issues that do not fall into other categories and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Search/Search Search-related issues that do not fall into other categories labels Feb 13, 2018
@diwasjoshi
Contributor

Is this issue still open? There seems to have been no update on it for a long time. I would like to work on this.

@scathatheworm

Is this fixed on 6.x? Ran into this issue yesterday on 5.6.10

@vladimirdolzhenko vladimirdolzhenko self-assigned this Nov 6, 2018
@henningandersen
Contributor

The proposal is to detect if a data=false node has any data and fail startup if that is the case. However, even indices without any data can be resurrected, and I wonder if we need to also handle that? I have created a slightly modified reproduction case to explain this:

  1. Clear out any previous experiments:

rm -r data-1 data-2

  2. Start two nodes:

bin/elasticsearch -E path.data=data-1 -E node.name=node-1
bin/elasticsearch -E path.data=data-2 -E node.name=node-2

  3. Create two indexes and data for them:

curl -XPUT localhost:9200/test?pretty -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
curl -XPOST localhost:9200/test/_doc?pretty -d '{ "test": 1 }' -H "Content-Type: application/json"

curl -XPUT localhost:9200/test2?pretty -d '{ "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0 } } }' -H "Content-Type: application/json"
curl -XPOST localhost:9200/test2/_doc?pretty -d '{ "test": 1 }' -H "Content-Type: application/json"

  4. Verify that data for the two indexes is on different nodes:

ls -d data-*/nodes/0/indices/*/0

should give something like the following (notice: different data folders):

data-1/nodes/0/indices/bF19AZJvREOs33p8udeD-A/0  data-2/nodes/0/indices/xpuL1YkcR1SttdAYF6zGEg/0

  5. Shut down both nodes. Remove the data folder for node-1:

rm -r data-1

  6. Start node-1 and then node-2 with node.data=false:

bin/elasticsearch -E path.data=data-1 -E node.name=node-1
bin/elasticsearch -E path.data=data-2 -E node.name=node-2 -E node.data=false

Expected log for node-2:

[2019-01-10T11:54:46,133][INFO ][o.e.g.DanglingIndicesState] [node-2] [[test2/bF19AZJvREOs33p8udeD-A]] dangling index exists on local file system, but not in cluster metadata, auto import to cluster state
[2019-01-10T11:54:46,133][INFO ][o.e.g.DanglingIndicesState] [node-2] [[test/xpuL1YkcR1SttdAYF6zGEg]] dangling index exists on local file system, but not in cluster metadata, auto import to cluster state

and for node-1:

[2019-01-10T11:54:46,308][INFO ][o.e.g.LocalAllocateDangledIndices] [node-1] auto importing dangled indices [[test2/bF19AZJvREOs33p8udeD-A]/OPEN][[test/xpuL1YkcR1SttdAYF6zGEg]/OPEN] from [{node-2}{wwM9q--3TmW0VCAHerzmNg}{OYshEsG6Rv6CvNmANivlnQ}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=33465024512, ml.max_open_jobs=20, xpack.installed=true}]

Looking at the file system, both indices now exist on node-1 too without any data:

ls -d data-1/nodes/0/indices/*/*
data-1/nodes/0/indices/bF19AZJvREOs33p8udeD-A/_state  data-1/nodes/0/indices/xpuL1YkcR1SttdAYF6zGEg/_state

and both are red status:

curl localhost:9200/_cat/indices?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
red    open   test  xpuL1YkcR1SttdAYF6zGEg   1   0                                                  
red    open   test2 bF19AZJvREOs33p8udeD-A   1   0

This makes me wonder whether the proposed change is enough since there is still a risk of resurrecting old indexes that did not have any shards allocated on the node?
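
To spot the resurrected-but-empty state shown above without reading the ls output by eye, the following sketch lists indices that have a _state directory but no shard directory. It assumes the same nodes/0/indices/<index-uuid>/ layout as the listings in this comment; metadata_only_indices is a hypothetical name, not an Elasticsearch tool.

```shell
#!/bin/sh
# Hypothetical helper: print the UUID directory name of each index that has
# a _state directory but no shard directory under the given data path.
metadata_only_indices() {
  for index_dir in "$1"/nodes/0/indices/*; do
    # Skip entries without index metadata (also covers an unmatched glob).
    [ -d "$index_dir/_state" ] || continue
    has_shard=false
    for shard in "$index_dir"/[0-9]*; do
      if [ -d "$shard" ]; then
        has_shard=true
        break
      fi
    done
    if [ "$has_shard" = false ]; then
      basename "$index_dir"
    fi
  done
}

# Example: metadata_only_indices data-1
```

Run against data-1 after the final step, this would be expected to print both index UUIDs, since only their _state directories exist there.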

@henningandersen
Contributor

Had a conversation with @ywelsch about this on another channel. We came to the conclusion that the original proposal should be implemented to avoid resurrecting the indices in clearly bad cases and also to avoid having old data lying around that is invalid for the type of node.

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jan 11, 2019
Check that nodes started with node.data=false cannot start if they have
shard data to avoid (old) indexes being resurrected into the cluster in red status.

Issue elastic#27073
henningandersen added a commit that referenced this issue Jan 22, 2019
* Fail start of non-data node if node has data

Check that nodes started with node.data=false cannot start if they have
shard data to avoid (old) indexes being resurrected into the cluster in red status.

Issue #27073
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jan 23, 2019
Node started with node.data=false and node.master=false can no longer
start if they have index metadata. This avoids resurrecting old indexes
into the cluster and ensures metadata is cleaned out before
re-purposing a node that was previously master or data node.

Closes elastic#27073
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jan 23, 2019
Added breaking changes documentation for node start up obsolete indices
detection.

Issue elastic#27073
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jan 24, 2019
For a non-data, non-master node we now warn about dangling indices and
will otherwise ignore them. This avoids import of old indices with a
following inevitable red cluster status.

Issue elastic#27073
henningandersen added a commit that referenced this issue Jan 25, 2019
Node started with node.data=false and node.master=false can no longer
start if they have index metadata. This avoids resurrecting old indexes
into the cluster and ensures metadata is cleaned out before
re-purposing a node that was previously master or data node.

Issue #27073
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Jan 30, 2019
Improved documentation on when nodes will refuse to start up.

Issue elastic#27073
henningandersen added a commit that referenced this issue Jan 31, 2019
Added breaking changes documentation for node start up obsolete indices
detection.

Issue #27073
henningandersen added a commit that referenced this issue Feb 2, 2019
Now warn about both left-behind data and metadata for non-data or
non-data and non-master nodes. Disable dangling indices check completely
for coordinating only nodes (non-data and non-master).

Issue #27073
6.x backport of #37347 and #37748 (without failing start up).