-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node repurpose tool #39403
Node repurpose tool #39403
Conversation
When a node is repurposed to master/no-data or no-master/no-data, v7.x will not start (see elastic#37748 and elastic#37347). The `elasticsearch repurpose` tool can fix this by cleaning up the problematic data.
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@henningandersen thanks for your PR! I've left several comments, overall looks good already.
server/src/main/java/org/elasticsearch/env/NodeRepurposeCommand.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/env/NodeRepurposeCommandTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/env/NodeRepurposeCommandTests.java
Outdated
Show resolved
Hide resolved
import static org.mockito.Matchers.contains; | ||
|
||
@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0) | ||
public class NodeRepurposeCommandIT extends ESIntegTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's an integration test, I think we need to be able to verify that the document that we have indexed is lost. For example, you may start one master-only node and one data-only node and then repurpose data-only to be master-only or no-master no-data. In this case, when the repurposed node joins the cluster, the index will be re-created but the document is gone.
Also due to randomization, executeRepurposeCommand
is a bit hard to follow. I think here we can stop a particular node using nameFilter
and explicitly run repurpose tool successfully on this node and unsuccessfully on another node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how the test works, but after you restart two nodes - they should not form the cluster, because there is no longer the master node. It'd nice to figure out why ensureGreen returns successfully in this case.
But in general, I think, having one dedicated master-eligible node and one data node and repurposing of data node (mentioned in the comment) would be easier to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrershov this works because InternalTestCluster
maintains the voting configuration in InternalTestCluster.excludeMasters
, which takes the stopping node of the voting configuration. This is used by several other tests (is the default for InternalTestCluster
), for instance I think GatewayIndexStateIT.testIndexDeletionWhenNodeRejoins
depends on the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to one master only and one data only node in 7a0712c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the test needed more adjustments, fixed in 4cbf288
I somehow ran the wrong test 100 times ;-)
assertEquals("Must not touch files", before, after); | ||
} | ||
|
||
private long digestPaths() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we already have a similar function implemented somewhere (it takes 40 lines of code here) or probably it makes sense to add it to some util class. Probably @ywelsch can suggest something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found none, will await feedback from @ywelsch on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of any such existing functionality. There is BufferedChecksumIndexInput
in Lucene, but that will not shorten the code here.
server/src/main/java/org/elasticsearch/cluster/coordination/NodeToolCli.java
Show resolved
Hide resolved
Moved "how to get verbose info" message to after summary message. Test case now randomly creates up to 10 shards.
Renamed ElasticSearchNodeCommandIT to avoid confusion with repurpose tool that has a separate IT.
Changed NodeRepurposeCommandIT to be more specific about which node it runs the tool on rather than just try all. Fixed NodeToolCli messaging.
Changed NodeRepurposeCommandIT to validate indexes still exists after repurpose and do not exist after repurpose on all nodes.
Thanks for a thorough review, @andrershov , I believe I addressed all comments, so is ready for another round. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes @henningandersen. I've left a comment about the test and would like to review it again after it's addressed.
Changed NodeRepurposeCommandIT to be less random according to review comments.
Thanks @andrershov , I modified the test, please have another look. |
Fixed NodeRepurposeCommandIT.
@elasticmachine run elasticsearch-ci/bwc |
@elasticmachine run elasticsearch-ci/packaging-sample |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@henningandersen I've left several comments to the test, but no need for the next pass.
Since there is no feedback from @yannick on "digest" function, let's have this custom implementation for now.
As for backporting to 7.0, on the one hand, it would be nice to have it in 7.0, but I'm affraid we don't have enough time for documentation. So I think backport to 7.1 is enough.
@yannick what do you think?
ensureGreen(); | ||
|
||
// docs gone | ||
assertFalse(client().prepareGet(indexName, "type1", "1").get().isExists()); | ||
assertTrue(indexExists(indexName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also check here that "prepareGet" will throw NoShardAvailableActionException, because there is no data node to host the shard.
assertTrue(indexExists(indexName)); | ||
|
||
logger.info("--> Restarting and repurposing other node"); | ||
|
||
internalCluster().stopRandomNode(s -> true); | ||
internalCluster().stopRandomNode(s -> true); | ||
|
||
executeRepurposeCommandForOrdinal(noMasterNoDataSettings, indexUUID, 0); | ||
executeRepurposeCommandForOrdinal(noMasterNoDataSettings, indexUUID, 0, 0); | ||
|
||
internalCluster().startMasterOnlyNode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not easy to follow. We've repurposed both nodes as non-master and non-data. And now we're starting them as "master" and "data" correspondingly.
I think, that we at least need to add a comment that describes what we're doing here.
Like "now both nodes are coordinating, let's start one of them as a master node and the second one as the data node, this does not require repurpose tool invocation. However, since they were converted to coordinating nodes previously, all data and metadata should be lost and we can verify it".
|
||
ensureGreen(); | ||
|
||
// indexes gone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index is gone
Improvements to NodeRepurposeCommandIT.
When a node is repurposed to master/no-data or no-master/no-data, v7.x
will not start (see #37748 and #37347). The
elasticsearch-node repurpose
tool can fix this by cleaning up the problematic data.
Unsure if this should be backported to 7.0 too?
Documentation to be done in a follow-up PR.