Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tool elasticsearch-node unsafe-bootstrap #37696

Merged
merged 16 commits into from
Jan 24, 2019

Conversation

andrershov
Copy link
Contributor

@andrershov andrershov commented Jan 22, 2019

This tool could be used if the majority of master-eligible nodes is lost.
Docs can be found in #37812

@andrershov andrershov added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Jan 22, 2019
@andrershov andrershov requested a review from ywelsch January 22, 2019 10:57
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@andrershov
Copy link
Contributor Author

run elasticsearch-ci/2

@ywelsch ywelsch requested a review from DaveCTurner January 22, 2019 12:25
@ywelsch ywelsch mentioned this pull request Jan 22, 2019
61 tasks
@andrershov
Copy link
Contributor Author

run elasticsearch-ci/2

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left mostly minor suggestions, but we should discuss how to deal with node ordinals >0 as I don't think we should just silently ignore them.

I will look in more depth at the docs later.

@@ -318,4 +318,22 @@ public void test100RepairIndexCliPackaging() {
}
}

public void test110RepairIndexCliPackaging() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RepairIndex? I see that this was copied from the test above, which also shouldn't say Repair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I've changed the names of both tests in this commit e2abf3f

"\n" +
" WARNING: Elasticsearch MUST be stopped before running this tool.\n" +
"\n" +
"You should run this tool only if you've lost the majority of master eligible nodes.\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest:

You should run this tool only if you have permanently lost half or more of the master-eligible nodes, and you cannot restore the cluster from a snapshot. This tool can result in arbitrary data loss and should be the last resort.

I'm suggesting to avoid saying "lost a majority" because you also need this tool if you have an even number of nodes and you've lost half of them, which isn't strictly a majority.

Also could you explain how it can render the cluster "completely non-functional"? I mean I see how it can break indices, but it should allow a cluster to form and any broken indices can be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0894fe4
I could not give you an example, I just wanted for this warning to sound as scary as possible.

ClusterService.USER_DEFINED_META_DATA.getConcreteSetting("cluster.metadata.unsafe-bootstrap");

UnsafeBootstrapMasterCommand() {
super("Unsafely bootstraps the master node if the majority of master eligible nodes is lost");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, not strictly a majority. Maybe we should replace "bootstraps" with "forces the successful election of"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you find this message b6ada54?

try {
terminal.println(Terminal.Verbosity.VERBOSE, "Writing new global metadata to disk");
long newGeneration = MetaData.FORMAT.write(newMetaData, dataPaths);
Manifest newManifest = new Manifest(manifest.getCurrentTerm(), manifest.getClusterStateVersion(), newGeneration,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should increment the term, and log that we've done so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented increment in this commit 0e50913. For now, I'm logging new currentTerm to the terminal, will get back to it - once I figure out how our command line tools work with elasticsearch logs.

@Override
protected void execute(Terminal terminal, OptionSet options, Environment env) throws Exception {
// We don't use classical "private static final Logger", because log4j2 initialization happens
// after UnsafeBootstrapMasterCommand instance creation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we make a private static final Logger logger in RemoveCorruptedShardDataCommand, does that not work?

Also, does this logger have the same configuration as loggers in Elasticsearch proper, i.e., it writes to the Elasticsearch log by default? If so, I think we should log more information about this tool having been run in that log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I have figured out how the logging works:

  1. Most of the commands, for example, LoggingAwareCommand are passing CommandLoggingConfigurator::configureLoggingWithoutConfig callback to Command constructor as beforeMain argument. This is the function that will be executed when Command.main is called.
  2. From CommandLoggingConfigurator::configureLoggingWithoutConfig javadoc Configure logging without reading a log4j2.properties file, effectively configuring the status logger and all loggers to the console. Basically it means that logs won't be appended to elasticsearch log files.
  3. If you run ShardToolCli indeed you get ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration. The reason for this is because logger initialization happens when log4j2 classes are loaded. Java Runtime loads classes when it first encounters this class, in case of ShardToolCli it will load RemoveCorruptedShardDataCommand and transitively Logger class, because there is static reference to Logger class in RemoveCorruptedShardDataCommand. Since pre-main is not yet executed, logging is not configured.
  4. It turned out that NodeToolCli still suffers from the same warning, because there is ClusterService (which in turn staticly references Logger) static reference in UnsafeBootstrapMasterCommand. The reason why I've not noticied it earlier, because I've added this setting last moment, after I've checked how binaries are working.
  5. For now I've made the following fix e23f59a
  6. I suggest to revisit how logging works for commands in general in the follow-up PR. It seems that beforeMain could be completely removed and replaced by logging initialization in the command constructor. But this change touches a lot of file and I prefer separate PR.

internalCluster().startNode(
Settings.builder()
.put(ElectMasterService.DISCOVERY_ZEN_MINIMUM_MASTER_NODES_SETTING.getKey(), Integer.MAX_VALUE)
.put(DiscoverySettings.INITIAL_STATE_TIMEOUT_SETTING.getKey(), "2s") //to ensure quick node startup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Settings lastNodeSettings = allNodesSettings.get(allNodesSettings.size() - 1);
List<Settings> newSettings = new ArrayList<>();
newSettings.addAll(otherNodesSettings);
newSettings.add(Settings.builder().put(lastNodeSettings)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we bootstrap a random node rather than it always being the last of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrershov
Copy link
Contributor Author

@DaveCTurner thanks for your review! I've addressed all the comments except nodeOrdinal, I think it's ready for the 2nd review pass.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, looks good apart from the removal of the docs that we discussed.

@@ -0,0 +1,119 @@
[[node-tool]]
== elasticsearch-node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to move the docs changes to a separate PR - #37812, so this should go.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andrershov
Copy link
Contributor Author

run elasticsearch-ci/1

@andrershov andrershov removed the request for review from ywelsch January 24, 2019 15:11
@andrershov
Copy link
Contributor Author

run elasticsearch-ci/1

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM again

@andrershov andrershov merged commit 4974684 into elastic:master Jan 24, 2019
andrershov added a commit that referenced this pull request Mar 12, 2019
This commit, mostly authored by @DaveCTurner, 
adds documentation for elasticsearch-node tool #37696.
andrershov pushed a commit that referenced this pull request Mar 12, 2019
This commit, mostly authored by @DaveCTurner,
adds documentation for elasticsearch-node tool #37696.

(cherry picked from commit 09425d5)
andrershov pushed a commit that referenced this pull request Mar 12, 2019
This commit, mostly authored by @DaveCTurner,
adds documentation for elasticsearch-node tool #37696.

(cherry picked from commit 09425d5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants