[Zen2] Introduce ElectionScheduler #32709

Closed

DaveCTurner wants to merge 41 commits into elastic:zen2 from DaveCTurner:2018-08-08-election-scheduler

Contributor

DaveCTurner commented Aug 8, 2018

The ElectionScheduler runs while there is no known elected master and is
responsible for scheduling elections randomly, backing off on failure, to
balance the desire to elect a master quickly with the desire to avoid more than
one node starting an election at once.
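
As an illustration of the behaviour described above, here is a minimal Java sketch of random election delays with linear backoff. It is not the code from this PR: the class name `ElectionBackoffSketch`, its fields, and the exact way the upper bound grows are assumptions made for the example. Only the idea of drawing a random delay between a minimum and a growing upper bound comes from the discussion.

```java
import java.util.Random;

/** Hypothetical sketch: random delays whose upper bound grows linearly per attempt. */
final class ElectionBackoffSketch {
    private final long minDelayMillis;     // lower bound on every delay
    private final long maxDelayMillis;     // cap on the growing upper bound
    private final long backoffMillis;      // growth of the upper bound per attempt
    private final Random random;
    private long currentUpperBoundMillis;  // upper bound used for the latest delay

    ElectionBackoffSketch(long minDelayMillis, long maxDelayMillis, long backoffMillis, Random random) {
        if (minDelayMillis < 0 || backoffMillis <= 0 || maxDelayMillis <= minDelayMillis) {
            throw new IllegalArgumentException("need 0 <= min < max and backoff > 0");
        }
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
        this.backoffMillis = backoffMillis;
        this.random = random;
        this.currentUpperBoundMillis = minDelayMillis;
    }

    /** Widens the window by one backoff step (up to the cap), then picks a delay uniformly inside it. */
    long nextDelayMillis() {
        currentUpperBoundMillis = Math.min(maxDelayMillis, currentUpperBoundMillis + backoffMillis);
        long span = currentUpperBoundMillis - minDelayMillis + 1;
        // nextDouble() is in [0, 1), so the result lies in [min, currentUpperBound]
        return minDelayMillis + (long) (random.nextDouble() * span);
    }

    /** Forgets the accumulated backoff, e.g. once a master has been found. */
    void reset() {
        currentUpperBoundMillis = minDelayMillis;
    }
}
```

A caller would ask for `nextDelayMillis()` before each attempt and call `reset()` once a master is known, so that the next search for a master starts again from short delays.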


          Introduce ElectionScheduler

82fcbb1

DaveCTurner added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination labels

DaveCTurner requested a review from ywelsch

August 8, 2018 11:41

Collaborator

elasticmachine commented Aug 8, 2018

Pinging @elastic/es-distributed

ywelsch mentioned this pull request

A new cluster coordination layer #32006

Closed

61 tasks


          Imports

cee906d

DaveCTurner added the WIP label

DaveCTurner removed the request for review from ywelsch

August 8, 2018 13:05

Contributor Author

DaveCTurner commented Aug 8, 2018

Marking this as WIP because I realised I want it to be restartable, and currently it's not.

DaveCTurner added 4 commits

August 8, 2018 16:18


          Make it restartable and improve the tests

4734d7a


          Lifecycle stuff duplicates the presence/absence of currentScheduler, …

b0d9a8d

…so can be removed


          Merge branch 'zen2' into 2018-08-08-election-scheduler

15ac140


          On reflection, these settings do not need to be dynamic, which simpli…

6cbca69

…fies things a bit

ywelsch suggested changes

Contributor

ywelsch left a comment

I've done an initial pass. I wonder how this will be integrated and whether the class that's going to use this will also need the notion of a current ElectionContext, so that we might be duplicating the notion of current across two classes.

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                      reasonableTimeParser(ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY),
+                      new ElectionMaxRetryIntervalSettingValidator(), Property.NodeScope, Property.Dynamic);
+                  private static Function<String, TimeValue> reasonableTimeParser(final String settingKey) {

Contributor

ywelsch Aug 8, 2018

let's add a parseTimeValue method to Setting that allows specifying both a min and a max.

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                  }
+                  static void validateSettings(final TimeValue electionMinRetryInterval, final TimeValue electionMaxRetryInterval) {
+                      if (electionMaxRetryInterval.millis() < electionMinRetryInterval.millis() + 100) {

Contributor

ywelsch Aug 8, 2018

I wonder if we should instead limit electionMaxRetryInterval to a minimum of 100ms and check that min <= max.

Contributor Author

DaveCTurner Aug 9, 2018

I think it's important that there's some margin between min and max. If they're equal then there's a chance that two nodes end up in lockstep, repeatedly interfering with each other's elections.
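
The check under discussion can be sketched in isolation as follows. This assumes `java.time.Duration` in place of Elasticsearch's `TimeValue`, and the class and constant names are hypothetical. The 100ms margin mirrors the excerpt above.

```java
import java.time.Duration;

/** Hypothetical sketch of a min/max validation that insists on a margin between the two. */
final class ElectionSettingsCheckSketch {
    // Margin taken from the diff excerpt: max must exceed min by at least 100ms.
    static final Duration REQUIRED_MARGIN = Duration.ofMillis(100);

    /** Throws if max is less than min plus the required margin. */
    static void validate(Duration minRetryInterval, Duration maxRetryInterval) {
        if (maxRetryInterval.compareTo(minRetryInterval.plus(REQUIRED_MARGIN)) < 0) {
            throw new IllegalArgumentException(
                "max retry interval [" + maxRetryInterval.toMillis() + "ms] must be at least ["
                    + REQUIRED_MARGIN.toMillis() + "ms] above min retry interval ["
                    + minRetryInterval.toMillis() + "ms]");
        }
    }
}
```

With this shape, equal values such as 100ms and 100ms are rejected, which is the lockstep case the margin is meant to rule out.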

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                                      public boolean isForceExecution() {
+                                          // There are very few of these scheduled, and they back off, but it's important that they're not rejected as
+                                          // this could prevent a cluster from ever forming.
+                                          return true;

Contributor

ywelsch Aug 8, 2018

The generic threadpool never rejects anyone. I wonder, though, whether the ScheduledThreadPoolExecutor can reject us. ThreadPool.schedule wraps the executable in a ThreadedRunnable, which is not an AbstractRunnable, so this might cause problems. ScheduledThreadPoolExecutor looks to have an unbounded queue though, so this needs more investigation...
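
On the plain JDK side of this question, a small sketch like the following shows how a bare `ScheduledThreadPoolExecutor` behaves: its queue is unbounded, so scheduling many tasks is not rejected while it is running, but scheduling after `shutdown()` throws `RejectedExecutionException` under the default handler. This says nothing about how Elasticsearch's `ThreadPool.schedule` wraps tasks, which is the part that still needs investigation. The class and method names are hypothetical.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch probing when a plain JDK scheduled executor rejects tasks. */
final class ScheduledRejectionSketch {

    /** Returns true if scheduling on an already shut-down executor is rejected. */
    static boolean rejectsAfterShutdown() {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        executor.shutdown();
        try {
            executor.schedule(() -> {}, 10, TimeUnit.MILLISECONDS);
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    /** Returns true if a running executor accepts taskCount delayed tasks without rejecting any. */
    static boolean acceptsWhileRunning(int taskCount) {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        try {
            for (int i = 0; i < taskCount; i++) {
                // Long delay: the tasks only sit in the queue and never run in this sketch.
                executor.schedule(() -> {}, 1, TimeUnit.HOURS);
            }
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        } finally {
            executor.shutdownNow(); // discard the queued tasks and stop the worker thread
        }
    }
}
```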

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                      }
+                  }
+                  public void start() {

Contributor

ywelsch Aug 8, 2018

maybe use the activate / deactivate terminology so that this is not confused with the lifecycle of AbstractLifecycleComponent
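
To make the suggestion concrete, here is a rough sketch of a restartable component using activate / deactivate, where each activation creates a fresh token and stale work checks whether its token is still the current one. This is an assumption about the general shape only, loosely inspired by the `this == currentScheduler` check that appears in the diff. `ActivatableSchedulerSketch` and `Round` are hypothetical names, not classes from this PR.

```java
/** Hypothetical sketch of a restartable component with activate / deactivate. */
final class ActivatableSchedulerSketch {
    private Round currentRound; // non-null exactly while the component is active

    /** One activation; work scheduled for an old round can detect that it is stale. */
    final class Round {
        boolean isCurrent() {
            synchronized (ActivatableSchedulerSketch.this) {
                return this == currentRound;
            }
        }
    }

    /** Starts a new round; fails if one is already running. */
    synchronized Round activate() {
        if (currentRound != null) {
            throw new IllegalStateException("already active");
        }
        currentRound = new Round();
        return currentRound;
    }

    /** Ends the running round, so that its pending work sees isCurrent() == false. */
    synchronized void deactivate() {
        if (currentRound == null) {
            throw new IllegalStateException("not active");
        }
        currentRound = null;
    }

    synchronized boolean isActive() {
        return currentRound != null;
    }
}
```

Because activity is represented by the presence or absence of the current round, no separate lifecycle state is needed, and the component can be activated again after it has been deactivated.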

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                  private static final String ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY = "discovery.election.max_retry_interval";
+                  public static final Setting<TimeValue> ELECTION_MIN_RETRY_INTERVAL_SETTING
+                      = new Setting<>(ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY, "300ms",

Contributor

ywelsch Aug 8, 2018

300ms is quite long. I wonder if we should go with something a bit more optimistic, say 100ms

Contributor Author

DaveCTurner Aug 9, 2018

I reduced to 100ms. There's opportunity to debate how we calculate the actual delays between election attempts. Currently they're never less than 100ms (previously 300ms) apart but in fact I can't think of a good reason for this lower bound.

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                                      delay = randomLongBetween(electionMinRetryInterval.getMillis(), currentDelayMillis + 1);
+                                  }
+                                  logger.trace("scheduling election after delay of [{}ms]", delay);

Contributor

ywelsch Aug 9, 2018

maybe "with delay" instead of "after delay" ("after" sounds as if the delay has already passed)

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                                                  return;
+                                              }
+                                          }
+                                          startElection();

Contributor

ywelsch Aug 9, 2018

add trace logging here that election has started

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                  // bounds on the time between election attempts
+                  private static final String ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY = "discovery.election.min_retry_interval";
+                  private static final String ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY = "discovery.election.max_retry_interval";

Contributor

ywelsch Aug 9, 2018

I think we should go with naming that's more established in the literature, and use something like min_election_timeout and max_election_timeout

Contributor Author

DaveCTurner Aug 9, 2018

Ok. I don't particularly like the "timeout" nomenclature - "timeout" suggests how long some process has before it's considered to have failed, but here we're counting down until something starts. But I can live with it.

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                                      @Override
+                                      public String toString() {
+                                          return "ElectionScheduler: do election and schedule retry";

Contributor

ywelsch Aug 9, 2018

maybe we could put the delay in here as well

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                                      delay = randomLongBetween(electionMinRetryInterval.getMillis(), currentDelayMillis + 1);
+                                  }
+                                  logger.trace("scheduling election after delay of [{}ms]", delay);

Contributor

ywelsch Aug 9, 2018

let's also log the currentDelayMillis, min and max here

DaveCTurner added 14 commits

August 9, 2018 08:41


          Final fields need no mutex

a42c4f3


          Rename settings and add separate backoff parameter

b1b2dcc


          Just validate in ctor

3e59fe4


          Less indenting

da19702


          Log other parameters when scheduling election

1d1d2b9


          ... under mutex

b3f429e


          All election activity is off the happy path so promoting it to debug

82ae493


          Missing param

f664744


          Move mutex

56a089c


          Rename mutex

7cbd40c


          Introduce PreVoteRequest

5637da3


          Introduce PreVoteResponse

5cccf2c


          ElectionScheduler now receives a TransportService

665e7a9


          WIP add pre-voting

94d261b

DaveCTurner added 2 commits

August 9, 2018 15:17


          Fix up ElectionSchedulerTests

0c96aff


          No need for braces

439daa0

DaveCTurner commented

Contributor Author

DaveCTurner left a comment

This is still very much a WIP but I spent some time sketching out how pre-voting might fit into this class too, and addressed most of the comments.

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                                  return this == currentScheduler;
+                              }
+                              private void scheduleNextElection() {

Contributor Author

DaveCTurner Aug 9, 2018

Yes, that's nicer.

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java

+                   * It's provably impossible to guarantee that any leader election algorithm ever elects a leader, but they generally work (with
+                   * probability that approaches 1 over time) as long as elections occur sufficiently infrequently, compared to the time it takes to send
+                   * a message to another node and receive a response back. We do not know the round-trip latency here, but we can approximate it by
+                   * attempting elections randomly at reasonably high frequency and backing off (linearly) until one of them succeeds. We also place an

Contributor Author

DaveCTurner Aug 9, 2018

Yes, I've added such a setting.

server/src/main/java/org/elasticsearch/cluster/coordination/ElectionScheduler.java Outdated

+                   */
+                  // bounds on the time between election attempts
+                  private static final String ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY = "discovery.election.min_retry_interval";

Contributor Author

DaveCTurner Aug 9, 2018

Ok, fixed.

DaveCTurner added 7 commits

August 13, 2018 08:10


          Move currentDelayMillis inside current scheduler and remove mutexes

a77cf41


          Start work on better test suite

2e6af7b


          Improve, and test, scheduling

eb6c0b8


          Add tests of prevoting

d252d29


          Imports

b1285e3


          Imports

fe17921


          Logger usage

896d456

DaveCTurner requested a review from ywelsch

August 13, 2018 11:16


          Introduce initial grace period before first election attempt

61fd397

DaveCTurner removed the WIP label

DaveCTurner added 11 commits

August 14, 2018 08:17


          Separate PreVoteCollector out as a top-level class


          Simplify ElectionSchedulerTests

1036ece


          Simplify PreVoteCollectorTests

2f936d9


          Pass broadcast nodes into PVC.start()

7eaa959


          Just schedule a runnable

c5fd779


          Simplify start/stop state

78c3079


          Pass in a ClusterState rather than overriding isElectionQuorum

a6080b3


          Pass in max-term consumer rather than overriding method

6f05620


          PreVoteCollector now manages its rounds

40bcc97


          Add PreVoteRequest handler

9aa9586


          Add tests of request handling

85573b0

Contributor Author

DaveCTurner commented Aug 14, 2018

I split this into two PRs, #32846 and #32847, so this can be closed.

DaveCTurner closed this

DaveCTurner deleted the 2018-08-08-election-scheduler branch

August 14, 2018 15:02

DaveCTurner mentioned this pull request

[Zen2] Introduce ElectionScheduler #32846

Merged

jimczi added v7.0.0-beta1 and removed v7.0.0 labels
