
Node repurpose tool #39403

Merged

Conversation

Contributor

@henningandersen henningandersen commented Feb 26, 2019

When a node is repurposed to master/no-data or no-master/no-data, v7.x
will not start (see #37748 and #37347). The elasticsearch-node repurpose
tool can fix this by cleaning up the problematic data.

Unsure if this should be backported to 7.0 too?

Documentation to be done in a follow-up PR.
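For context, `elasticsearch-node` is run offline against a stopped node. A minimal illustrative invocation of the new subcommand (paths and prompt text are paraphrased here, not verbatim tool output) might look like:

```shell
# Stop the node first; the tool is meant to run against a stopped node only.
# It reads the node's roles and data paths from elasticsearch.yml, reports
# what it would delete, and asks for confirmation before cleaning up.
bin/elasticsearch-node repurpose
```

The tool then removes whichever index metadata and shard data the newly configured roles no longer allow the node to hold, after which the node can start with its reduced roles.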

@henningandersen henningandersen added >feature :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.2.0 labels Feb 26, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@andrershov andrershov left a comment


@henningandersen thanks for your PR! I've left several comments; overall it already looks good.

import static org.mockito.Matchers.contains;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0)
public class NodeRepurposeCommandIT extends ESIntegTestCase {
Contributor

Since it's an integration test, I think we need to be able to verify that the document we have indexed is lost. For example, you could start one master-only node and one data-only node and then repurpose the data-only node to be master-only or no-master/no-data. In this case, when the repurposed node joins the cluster, the index will be re-created but the document is gone.
Also, due to randomization, executeRepurposeCommand is a bit hard to follow. I think we can stop a particular node using nameFilter and explicitly run the repurpose tool successfully on this node and unsuccessfully on another node.

Contributor Author

Regarding checking for lost data, I added a check that we lose the index completely in f5396b1; I think that should suffice, do you agree?

Commit 85cf0e6 fixes test to stop one node, run the tool successfully on that node and unsuccessfully on the other node.

Contributor

I'm not sure how the test works, but after you restart the two nodes they should not form a cluster, because there is no longer a master node. It'd be nice to figure out why ensureGreen returns successfully in this case.
But in general, I think having one dedicated master-eligible node and one data node, and repurposing the data node (as mentioned in the comment), would be easier to follow.

Contributor Author

@andrershov this works because InternalTestCluster maintains the voting configuration in InternalTestCluster.excludeMasters, which takes the stopping node out of the voting configuration. This is used by several other tests (it is the default for InternalTestCluster); for instance, I think GatewayIndexStateIT.testIndexDeletionWhenNodeRejoins depends on the same.

Contributor Author

Changed to one master only and one data only node in 7a0712c.

Contributor Author

Sorry, the test needed more adjustments; fixed in 4cbf288.

I somehow ran the wrong test 100 times ;-)

assertEquals("Must not touch files", before, after);
}

private long digestPaths() {
Contributor

I wonder if we already have a similar function implemented somewhere (it takes 40 lines of code here), or perhaps it makes sense to add it to some util class. Probably @ywelsch can suggest something.

Contributor Author

I found none, will await feedback from @ywelsch on this.

Contributor

I'm not aware of any such existing functionality. There is BufferedChecksumIndexInput in Lucene, but that will not shorten the code here.
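For readers following the discussion: the `digestPaths()` in question is not shown in this excerpt, but a stand-alone sketch of the idea (hypothetical code, not the PR's actual implementation) is to fold every file's relative path and contents into a single CRC32 checksum, so two invocations over an untouched directory tree yield the same value:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.CRC32;

// Illustrative "digest a directory tree" helper: hashes each regular file's
// relative path and bytes into one CRC32, in sorted order for determinism.
public final class PathDigest {

    public static long digestPaths(Path root) throws IOException {
        CRC32 crc = new CRC32();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted() // deterministic order regardless of traversal order
                 .filter(Files::isRegularFile)
                 .forEach(p -> {
                     try {
                         crc.update(root.relativize(p).toString()
                                        .getBytes(StandardCharsets.UTF_8));
                         crc.update(Files.readAllBytes(p));
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("digest-demo");
        Files.writeString(dir.resolve("a.txt"), "hello");
        long before = digestPaths(dir);
        long untouched = digestPaths(dir);
        Files.writeString(dir.resolve("a.txt"), "changed");
        long after = digestPaths(dir);
        System.out.println((before == untouched) + " " + (before != after));
    }
}
```

Comparing such a digest before and after running the tool is how the test asserts "must not touch files" above.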

Moved "how to get verbose info" message to after summary message.
Test case now randomly creates up to 10 shards.
Renamed ElasticSearchNodeCommandIT to avoid confusion with repurpose
tool that has a separate IT.
Changed NodeRepurposeCommandIT to be more specific about which node it
runs the tool on rather than just try all.
Fixed NodeToolCli messaging.
Changed NodeRepurposeCommandIT to validate that indexes still exist after
repurpose and do not exist after repurpose on all nodes.
@henningandersen
Contributor Author

Thanks for a thorough review, @andrershov. I believe I have addressed all comments, so this is ready for another round.

Contributor

@andrershov andrershov left a comment


Thanks for the changes @henningandersen. I've left a comment about the test and would like to review it again after it's addressed.

Changed NodeRepurposeCommandIT to be less random according to review
comments.
@henningandersen
Contributor Author

> Thanks for the changes @henningandersen. I've left a comment about the test and would like to review it again after it's addressed.

Thanks @andrershov , I modified the test, please have another look.

Fixed NodeRepurposeCommandIT.
@henningandersen
Contributor Author

@elasticmachine run elasticsearch-ci/bwc
@elasticmachine run elasticsearch-ci/default-distro

@henningandersen
Contributor Author

@elasticmachine run elasticsearch-ci/packaging-sample

Contributor

@andrershov andrershov left a comment


@henningandersen I've left several comments on the test, but no need for another review pass.
Since there is no feedback from @yannick on the "digest" function, let's keep this custom implementation for now.
As for backporting to 7.0: on the one hand, it would be nice to have it in 7.0, but I'm afraid we don't have enough time for documentation. So I think a backport to 7.1 is enough.
@yannick what do you think?

ensureGreen();

// docs gone
assertFalse(client().prepareGet(indexName, "type1", "1").get().isExists());
assertTrue(indexExists(indexName));
Contributor

Let's also check here that "prepareGet" will throw NoShardAvailableActionException, because there is no data node to host the shard.

assertTrue(indexExists(indexName));

logger.info("--> Restarting and repurposing other node");

internalCluster().stopRandomNode(s -> true);
internalCluster().stopRandomNode(s -> true);

executeRepurposeCommandForOrdinal(noMasterNoDataSettings, indexUUID, 0);
executeRepurposeCommandForOrdinal(noMasterNoDataSettings, indexUUID, 0, 0);

internalCluster().startMasterOnlyNode();
Contributor

This is not easy to follow. We've repurposed both nodes as non-master and non-data, and now we're starting them as "master" and "data" respectively.
I think we at least need to add a comment that describes what we're doing here.
Something like: "now both nodes are coordinating-only; let's start one of them as a master node and the second one as a data node. This does not require a repurpose tool invocation; however, since they were previously converted to coordinating nodes, all data and metadata should be lost, and we can verify that."


ensureGreen();

// indexes gone.
Contributor

index is gone

Improvements to NodeRepurposeCommandIT.
@ywelsch
Contributor

ywelsch commented Mar 19, 2019

Given that #37748 and #37347 are in 7.0, we should try to get this in there as well. I'm sure we will figure out the docs before release.

@henningandersen henningandersen merged commit bd04b4f into elastic:master Mar 19, 2019
henningandersen added a commit that referenced this pull request Mar 19, 2019
When a node is repurposed to master/no-data or no-master/no-data, v7.x
will not start (see #37748 and #37347). The `elasticsearch-node repurpose`
tool can fix this by cleaning up the problematic data.
henningandersen added a commit that referenced this pull request Mar 19, 2019
When a node is repurposed to master/no-data or no-master/no-data, v7.x
will not start (see #37748 and #37347). The `elasticsearch-node repurpose`
tool can fix this by cleaning up the problematic data.
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >feature v7.0.0-rc1 v7.2.0 v8.0.0-alpha1