-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tool elasticsearch-node unsafe-bootstrap #37696
Conversation
Pinging @elastic/es-distributed |
run elasticsearch-ci/2 |
run elasticsearch-ci/2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left mostly minor suggestions, but we should discuss how to deal with node ordinals >0 as I don't think we should just silently ignore them.
I will look in more depth at the docs later.
@@ -318,4 +318,22 @@ public void test100RepairIndexCliPackaging() { | |||
} | |||
} | |||
|
|||
public void test110RepairIndexCliPackaging() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RepairIndex
? I see that this was copied from the test above, which also shouldn't say Repair
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I've changed the names of both tests in this commit e2abf3f
server/src/main/java/org/elasticsearch/cluster/coordination/NodeToolCli.java
Outdated
Show resolved
Hide resolved
"\n" + | ||
" WARNING: Elasticsearch MUST be stopped before running this tool.\n" + | ||
"\n" + | ||
"You should run this tool only if you've lost the majority of master eligible nodes.\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest:
You should run this tool only if you have permanently lost half or more of the master-eligible nodes, and you cannot restore the cluster from a snapshot. This tool can result in arbitrary data loss and should be the last resort.
I'm suggesting to avoid saying "lost a majority" because you also need this tool if you have an even number of nodes and you've lost half of them, which isn't strictly a majority.
Also could you explain how it can render the cluster "completely non-functional"? I mean I see how it can break indices, but it should allow a cluster to form and any broken indices can be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0894fe4
I could not give you an example, I just wanted for this warning to sound as scary as possible.
ClusterService.USER_DEFINED_META_DATA.getConcreteSetting("cluster.metadata.unsafe-bootstrap"); | ||
|
||
UnsafeBootstrapMasterCommand() { | ||
super("Unsafely bootstraps the master node if the majority of master eligible nodes is lost"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, not strictly a majority. Maybe we should replace "bootstraps" with "forces the successful election of"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you find this message b6ada54?
server/src/main/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterCommand.java
Show resolved
Hide resolved
try { | ||
terminal.println(Terminal.Verbosity.VERBOSE, "Writing new global metadata to disk"); | ||
long newGeneration = MetaData.FORMAT.write(newMetaData, dataPaths); | ||
Manifest newManifest = new Manifest(manifest.getCurrentTerm(), manifest.getClusterStateVersion(), newGeneration, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should increment the term, and log that we've done so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented increment in this commit 0e50913. For now, I'm logging new currentTerm
to the terminal, will get back to it - once I figure out how our command line tools work with elasticsearch logs.
@Override | ||
protected void execute(Terminal terminal, OptionSet options, Environment env) throws Exception { | ||
// We don't use classical "private static final Logger", because log4j2 initialization happens | ||
// after UnsafeBootstrapMasterCommand instance creation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we make a private static final Logger logger
in RemoveCorruptedShardDataCommand
, does that not work?
Also, does this logger have the same configuration as loggers in Elasticsearch proper, i.e., it writes to the Elasticsearch log by default? If so, I think we should log more information about this tool having been run in that log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I have figured out how the logging works:
- Most of the commands, for example,
LoggingAwareCommand
are passingCommandLoggingConfigurator::configureLoggingWithoutConfig
callback toCommand
constructor asbeforeMain
argument. This is the function that will be executed whenCommand.main
is called. - From CommandLoggingConfigurator::configureLoggingWithoutConfig javadoc
Configure logging without reading a log4j2.properties file, effectively configuring the status logger and all loggers to the console.
Basically it means that logs won't be appended to elasticsearch log files. - If you run
ShardToolCli
indeed you getERROR StatusLogger No Log4j 2 configuration file found. Using default configuration
. The reason for this is because logger initialization happens when log4j2 classes are loaded. Java Runtime loads classes when it first encounters this class, in case ofShardToolCli
it will loadRemoveCorruptedShardDataCommand
and transitivelyLogger
class, because there is static reference toLogger
class inRemoveCorruptedShardDataCommand
. Since pre-main is not yet executed, logging is not configured. - It turned out that
NodeToolCli
still suffers from the same warning, because there isClusterService
(which in turn staticly referencesLogger
) static reference inUnsafeBootstrapMasterCommand
. The reason why I've not noticied it earlier, because I've added this setting last moment, after I've checked how binaries are working. - For now I've made the following fix e23f59a
- I suggest to revisit how logging works for commands in general in the follow-up PR. It seems that
beforeMain
could be completely removed and replaced by logging initialization in the command constructor. But this change touches a lot of file and I prefer separate PR.
internalCluster().startNode( | ||
Settings.builder() | ||
.put(ElectMasterService.DISCOVERY_ZEN_MINIMUM_MASTER_NODES_SETTING.getKey(), Integer.MAX_VALUE) | ||
.put(DiscoverySettings.INITIAL_STATE_TIMEOUT_SETTING.getKey(), "2s") //to ensure quick node startup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java
Show resolved
Hide resolved
Settings lastNodeSettings = allNodesSettings.get(allNodesSettings.size() - 1); | ||
List<Settings> newSettings = new ArrayList<>(); | ||
newSettings.addAll(otherNodesSettings); | ||
newSettings.add(Settings.builder().put(lastNodeSettings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we bootstrap a random node rather than it always being the last of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-Authored-By: andrershov <[email protected]>
@DaveCTurner thanks for your review! I've addressed all the comments except |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, looks good apart from the removal of the docs that we discussed.
server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterCommand.java
Show resolved
Hide resolved
@@ -0,0 +1,119 @@ | |||
[[node-tool]] | |||
== elasticsearch-node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to move the docs changes to a separate PR - #37812, so this should go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
run elasticsearch-ci/1 |
run elasticsearch-ci/1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM again
This commit, mostly authored by @DaveCTurner, adds documentation for elasticsearch-node tool #37696.
This commit, mostly authored by @DaveCTurner, adds documentation for elasticsearch-node tool #37696. (cherry picked from commit 09425d5)
This commit, mostly authored by @DaveCTurner, adds documentation for elasticsearch-node tool #37696. (cherry picked from commit 09425d5)
This tool could be used if the majority of master-eligible nodes is lost.
Docs can be found in #37812