-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce transport API for cluster bootstrapping #34961
Introduce transport API for cluster bootstrapping #34961
Conversation
- Introduces a transport API for bootstrapping a Zen2 cluster - Introduces a transport API for requesting the set of nodes that a master-eligible node has discovered and for waiting until this comprises the expected number of nodes. - Alters ESIntegTestCase to use these APIs when forming a cluster, rather than injecting the initial configuration directly.
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments to help with the review.
|
||
public class GetDiscoveredNodesAction extends Action<GetDiscoveredNodesResponse> { | ||
public static final GetDiscoveredNodesAction INSTANCE = new GetDiscoveredNodesAction(); | ||
public static final String NAME = "cluster:monitor/discovered_nodes"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an existing category for security, I am not sure either if the API falls under monitor
but it sounds ok to me. talk with the security team about this?
|
||
public class BootstrapClusterAction extends Action<AcknowledgedResponse> { | ||
public static final BootstrapClusterAction INSTANCE = new BootstrapClusterAction(); | ||
public static final String NAME = "cluster:coordination/bootstrap_cluster"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cluster
should be good but coordination
would introduce a new subcategory for security I think, which should be discussed probably. See https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/security/authz/privilege/ClusterPrivilege.java
public void setInitialConfiguration(final VotingConfiguration votingConfiguration) { | ||
synchronized (mutex) { | ||
final ClusterState currentState = getStateForMasterService(); | ||
|
||
if (currentState.getLastAcceptedConfiguration().isEmpty() == false) { | ||
if (isInitialConfigurationSet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is safe to extract because getStateForMasterService
depends only on things protected by mutex
.
threadPool.schedule(TimeValue.timeValueMillis(delayMillis), Names.GENERIC, runnable); | ||
try { | ||
threadPool.schedule(TimeValue.timeValueMillis(delayMillis), Names.GENERIC, runnable); | ||
} catch (EsRejectedExecutionException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is just for EsRejectedExecutionException
-handling, which arose in test failures. No test to verify this specifically is added here.
return "PeerFinder handling wakeup"; | ||
} | ||
}); | ||
} catch (EsRejectedExecutionException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to this file are just for EsRejectedExecutionException
-handling, which arose in test failures. No test to verify this specifically is added here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what kind of test failures you were seeing? I would like to avoid this wrapping here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, got it now (after running into failure myself). As this pattern is repeated here many times, let's add a method to ThreadPool
that schedules while ignoring shutdowns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok see 1b62782
@@ -881,6 +895,18 @@ public String toString() { | |||
}); | |||
} | |||
|
|||
public Releasable withDiscoveryListener(Consumer<Iterable<DiscoveryNode>> listener) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes to Coordinator
are tested by testDiscoveryOfPeersTriggersNotification
.
@@ -520,16 +518,4 @@ private boolean isLocalNodeElectedMaster() { | |||
private boolean clusterStateHasNode(DiscoveryNode node) { | |||
return node.equals(MasterServiceTests.discoveryState(masterService).nodes().get(node.getId())); | |||
} | |||
|
|||
private static class NoOpClusterApplier implements ClusterApplier { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class was extracted to the top level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look at the API part and it looks good, I left a couple of thoughts in there. It would be nice to add some javadocs to request and response classes. The test coverage for request, response and transport actions looks great. I know nothing about what these API do so I cannot comment on the rest ;)
|
||
public class BootstrapClusterAction extends Action<AcknowledgedResponse> { | ||
public static final BootstrapClusterAction INSTANCE = new BootstrapClusterAction(); | ||
public static final String NAME = "cluster:coordination/bootstrap_cluster"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cluster
should be good but coordination
would introduce a new subcategory for security I think, which should be discussed probably. See https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/security/authz/privilege/ClusterPrivilege.java
public class BootstrapClusterRequest extends ActionRequest { | ||
private BootstrapConfiguration bootstrapConfiguration; | ||
|
||
public BootstrapClusterRequest bootstrapConfiguration(BootstrapConfiguration bootstrapConfiguration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we use proper setters for new requests? this was probably discussed a million times before and I may not remember what the last conclusion was, but I seem to recall that we said we would use getters and setters for new stuff. Totally unimportant though :)
|
||
public class GetDiscoveredNodesAction extends Action<GetDiscoveredNodesResponse> { | ||
public static final GetDiscoveredNodesAction INSTANCE = new GetDiscoveredNodesAction(); | ||
public static final String NAME = "cluster:monitor/discovered_nodes"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an existing category for security, I am not sure either if the API falls under monitor
but it sounds ok to me. talk with the security team about this?
Thanks @javanna for the comments, particularly pointing out the implications of action names for security. @andrershov would you look at the Zen2-specific parts of this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some initial thoughts.
.../src/main/java/org/elasticsearch/action/admin/cluster/bootstrap/BootstrapClusterRequest.java
Outdated
Show resolved
Hide resolved
...r/src/main/java/org/elasticsearch/action/admin/cluster/bootstrap/BootstrapConfiguration.java
Outdated
Show resolved
Hide resolved
...r/src/main/java/org/elasticsearch/action/admin/cluster/bootstrap/BootstrapConfiguration.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/action/admin/cluster/bootstrap/GetDiscoveredNodesAction.java
Show resolved
Hide resolved
...rc/main/java/org/elasticsearch/action/admin/cluster/bootstrap/GetDiscoveredNodesRequest.java
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/action/admin/cluster/bootstrap/GetDiscoveredNodesResponse.java
Outdated
Show resolved
Hide resolved
...n/java/org/elasticsearch/action/admin/cluster/bootstrap/TransportBootstrapClusterAction.java
Outdated
Show resolved
Hide resolved
import java.util.ArrayList; | ||
import java.util.List; | ||
|
||
public class TransportBootstrapClusterAction extends TransportAction<BootstrapClusterRequest, AcknowledgedResponse> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we not need to register a request handler on the transport service? Do we only want to execute this action locally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today we only execute it locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...java/org/elasticsearch/action/admin/cluster/bootstrap/TransportGetDiscoveredNodesAction.java
Outdated
Show resolved
Hide resolved
return "PeerFinder handling wakeup"; | ||
} | ||
}); | ||
} catch (EsRejectedExecutionException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what kind of test failures you were seeing? I would like to avoid this wrapping here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've finished the review and left no major comments. It would be nice to get @ywelsch opinion as well.
.../test/java/org/elasticsearch/action/admin/cluster/bootstrap/BootstrapConfigurationTests.java
Outdated
Show resolved
Hide resolved
|
||
if (coordinator.isInitialConfigurationSet()) { | ||
logger.info("initial configuration already set, ignoring bootstrapping request"); | ||
listener.onResponse(new AcknowledgedResponse(false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, it means that if the first response to BootstrapClusterRequest is lost and the client sends the request for the second time, it's safe for the client, but "ack" will be set to false. I'm not sure we can call BootstrapClientRequest idempotent in this case. Probably check if initial configuration equals to requested configuration and reply with ack'ed true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we do call it idempotent anywhere, but nonetheless idempotency is a property of the system state and not the response, so it seems ok to yield a different response (but not change the system state) on subsequent calls.
We do not remember the initial configuration anywhere so we can't check that it was the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DaveCTurner I'm trying to understand how to properly write a client that makes bootstrap call. Assume the first call timeouts, it means we need to send the request once again. Now we send the request and get "ack": false in the response. Shall we retry? How do we know that cluster has been already bootstrapped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You just retry until it returns a response. If you get a response to a request then the cluster has been bootstrapped (perhaps not by that specific request, but it doesn't matter).
However I have moved over to using a specific response type, BootstrapClusterResponse
in place of AcknowledgedResponse
in 759ad07.
final DiscoveryNode discoveryNode = new DiscoveryNode("local", buildNewFakeTransportAddress(), Version.CURRENT); | ||
final TransportService transportService = transport.createTransportService(Settings.EMPTY, threadPool, | ||
TransportService.NOOP_TRANSPORT_INTERCEPTOR, boundTransportAddress -> discoveryNode, null, emptySet()); | ||
final Discovery discovery = new Discovery() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using Mockito verifyZeroInteractions
instead for mocks like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks.
|
||
final VotingConfiguration votingConfiguration = request.getBootstrapConfiguration().resolve(selfAndDiscoveredPeers); | ||
logger.info("setting initial configuration to {}", votingConfiguration); | ||
coordinator.setInitialConfiguration(votingConfiguration); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if there is a race of two requests with the same initial configuration, one of them will throw ElasticSearch exception. Idempotency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, idempotency is a property of the system state and not the response, so it seems ok to yield a different response (but not change the system state) on subsequent calls.
} | ||
|
||
final AtomicBoolean responseSent = new AtomicBoolean(); | ||
final CountDownLatch countDownLatch = new CountDownLatch(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the implementation of the algorithm! Consider using CompletableFuture instead of AtomicBoolean+CountDownLatch though because it's only one concurrent primitive instead of 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, thanks, I'll try that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used ListenableFuture
in d4818d6 which solves the problems we were having with lost exceptions.
|
||
final TransportBootstrapClusterAction transportBootstrapClusterAction | ||
= new TransportBootstrapClusterAction(Settings.EMPTY, mock(ActionFilters.class), transportService, coordinator); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a lot of code duplication in the initialization part of the tests, which is not important for the test itself, can we refactor a method for intialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did so in fc2f125. I'm not sure I wholly like it, because of things like the fact that we have to change the local node and rely on the transport service not picking it up until started, but I don't have a strong opinion about this.
|
||
final TransportGetDiscoveredNodesAction transportGetDiscoveredNodesAction | ||
= new TransportGetDiscoveredNodesAction(Settings.EMPTY, mock(ActionFilters.class), transportService, coordinator); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here can we refactor initialization part in a separate method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did so in fc2f125. I'm not sure I wholly like it, because of things like the fact that we have to change the local node and rely on the transport service not picking it up until started, and also we have to create a transport that responds to handshakes in all cases, but I don't have a strong opinion about this.
@andrershov @ywelsch I think I've addressed all the review points, this is ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a full pass and left mostly smaller comments. Looking quite good already.
...r/src/main/java/org/elasticsearch/action/admin/cluster/bootstrap/BootstrapClusterAction.java
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/action/admin/cluster/bootstrap/GetDiscoveredNodesAction.java
Show resolved
Hide resolved
...rc/main/java/org/elasticsearch/action/admin/cluster/bootstrap/GetDiscoveredNodesRequest.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/action/admin/cluster/bootstrap/GetDiscoveredNodesResponse.java
Outdated
Show resolved
Hide resolved
...n/java/org/elasticsearch/action/admin/cluster/bootstrap/TransportBootstrapClusterAction.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/action/admin/cluster/bootstrap/TransportBootstrapClusterActionTests.java
Show resolved
Hide resolved
...org/elasticsearch/action/admin/cluster/bootstrap/TransportGetDiscoveredNodesActionTests.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
Show resolved
Hide resolved
test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java
Outdated
Show resolved
Hide resolved
test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java
Outdated
Show resolved
Hide resolved
Thanks @ywelsch, I think that's all the comments addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
master-eligible node has discovered and for waiting until this comprises the
expected number of nodes.
injecting the initial configuration directly.