
[Zen2] Introduce ElectionScheduler #32709

Closed

Conversation

DaveCTurner
Contributor

The ElectionScheduler runs while there is no known elected master and is responsible for scheduling elections at random intervals, backing off on failure, to balance the desire to elect a master quickly against the desire to avoid more than one node starting an election at once.
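
The scheduling approach can be summarised with a small stand-alone sketch. Everything below (the class and field names, the single-threaded executor, the linear back-off step) is illustrative rather than taken from this PR:

```java
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; not the ElectionScheduler introduced by this PR.
class RandomizedElectionScheduler {
    private final Random random = new Random();
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean active = new AtomicBoolean();

    private final long minDelayMillis = 100;    // shortest gap between attempts
    private final long maxDelayMillis = 10_000; // back-off never exceeds this
    private long currentMaxDelayMillis = minDelayMillis;

    void start(Runnable startElection) {
        // begin scheduling attempts while there is no known elected master
        if (active.compareAndSet(false, true)) {
            scheduleNextElection(startElection);
        }
    }

    void stop() {
        // a master has been elected, so stop scheduling further attempts
        active.set(false);
    }

    private void scheduleNextElection(Runnable startElection) {
        // pick a random delay in [min, currentMax] so two nodes rarely start elections together
        final long delay = minDelayMillis
            + (long) (random.nextDouble() * (currentMaxDelayMillis - minDelayMillis + 1));
        // back off linearly after each attempt, up to the configured maximum
        currentMaxDelayMillis = Math.min(currentMaxDelayMillis + minDelayMillis, maxDelayMillis);
        executor.schedule(() -> {
            if (active.get()) {
                startElection.run();                  // attempt an election
                scheduleNextElection(startElection);  // and retry unless stop() is called first
            }
        }, delay, TimeUnit.MILLISECONDS);
    }
}
```

Drawing the delay uniformly from a window that widens after each failed attempt keeps early attempts fast while making it increasingly unlikely that two nodes keep colliding.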

@DaveCTurner DaveCTurner added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Aug 8, 2018
@DaveCTurner DaveCTurner requested a review from ywelsch August 8, 2018 11:41
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch mentioned this pull request Aug 8, 2018
@DaveCTurner DaveCTurner added the WIP label Aug 8, 2018
@DaveCTurner DaveCTurner removed the request for review from ywelsch August 8, 2018 13:05
@DaveCTurner
Contributor Author

Marking this as WIP because I realised I want it to be restartable, and currently it's not.

Contributor

@ywelsch ywelsch left a comment


I've done an initial pass. I wonder how this will be integrated, and whether the class that's going to use this will also need the notion of a current ElectionContext, in which case we might end up duplicating the notion of "current" across two classes.

reasonableTimeParser(ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY),
new ElectionMaxRetryIntervalSettingValidator(), Property.NodeScope, Property.Dynamic);

private static Function<String, TimeValue> reasonableTimeParser(final String settingKey) {
Contributor


let's add a parseTimeValue method to Setting that allows specifying both a min and a max.
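
For illustration, a bounded parser along these lines might look as follows; the method name, the wrapping class, and the exact error messages are assumptions for this sketch, not the API that was ultimately added to Setting:

```java
import org.elasticsearch.common.unit.TimeValue;

// Hypothetical helper, not the method that was actually added to Setting.
final class BoundedTimeParsing {
    static TimeValue parseBoundedTimeValue(final String value, final TimeValue minValue,
                                           final TimeValue maxValue, final String settingKey) {
        final TimeValue parsed = TimeValue.parseTimeValue(value, settingKey);
        if (parsed.millis() < minValue.millis()) {
            throw new IllegalArgumentException("failed to parse value [" + value + "] for setting ["
                + settingKey + "], must be >= [" + minValue + "]");
        }
        if (parsed.millis() > maxValue.millis()) {
            throw new IllegalArgumentException("failed to parse value [" + value + "] for setting ["
                + settingKey + "], must be <= [" + maxValue + "]");
        }
        return parsed;
    }
}
```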

}

static void validateSettings(final TimeValue electionMinRetryInterval, final TimeValue electionMaxRetryInterval) {
if (electionMaxRetryInterval.millis() < electionMinRetryInterval.millis() + 100) {
Contributor


I wonder if we should instead limit electionMaxRetryInterval to minimum 100ms and check that min <= max.

Contributor Author


I think it's important that there's some margin between min and max. If they're equal then there's a chance that two nodes end up in lockstep, repeatedly interfering with each other's elections.
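
To make the margin requirement concrete, here is a sketch of the validation being discussed; the condition is the one shown in the diff above, while the exception message is an assumption:

```java
import org.elasticsearch.common.unit.TimeValue;

// Sketch: require at least 100ms of headroom between the min and max retry intervals
// so that two nodes cannot end up retrying in lockstep.
static void validateSettings(final TimeValue electionMinRetryInterval, final TimeValue electionMaxRetryInterval) {
    if (electionMaxRetryInterval.millis() < electionMinRetryInterval.millis() + 100) {
        throw new IllegalArgumentException("[" + ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY + "] ["
            + electionMaxRetryInterval + "] must be at least 100ms longer than ["
            + ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY + "] [" + electionMinRetryInterval + "]");
    }
}
```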

public boolean isForceExecution() {
// There are very few of these scheduled, and they back off, but it's important that they're not rejected as
// this could prevent a cluster from ever forming.
return true;
Contributor


The generic threadpool never rejects anything. I wonder, though, whether the ScheduledThreadPoolExecutor can reject us. As ThreadPool.schedule wraps the executable in a ThreadedRunnable, which is not an AbstractRunnable, this might cause problems. ScheduledThreadPoolExecutor looks to have an unbounded queue, though, so this needs more investigation...

}
}

public void start() {
Contributor


maybe use the activate / deactivate terminology so that this is not confused with the lifecycle of AbstractLifecycleComponent

private static final String ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY = "discovery.election.max_retry_interval";

public static final Setting<TimeValue> ELECTION_MIN_RETRY_INTERVAL_SETTING
= new Setting<>(ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY, "300ms",
Contributor


300ms is quite long. I wonder if we should go with something a bit more optimistic, say 100ms

Contributor Author


I reduced it to 100ms. There's an opportunity to debate how we calculate the actual delays between election attempts. Currently they're never less than 100ms (previously 300ms) apart, but I can't think of a good reason for this lower bound.


delay = randomLongBetween(electionMinRetryInterval.getMillis(), currentDelayMillis + 1);
}
logger.trace("scheduling election after delay of [{}ms]", delay);
Contributor


maybe "with delay" instead of "after delay" ("after" sounds as if the delay has already passed)

return;
}
}
startElection();
Contributor


add trace logging here that election has started


// bounds on the time between election attempts
private static final String ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY = "discovery.election.min_retry_interval";
private static final String ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY = "discovery.election.max_retry_interval";
Contributor


I think we should go with naming that's more established in the literature, and use something like min_election_timeout and max_election_timeout

Contributor Author


Ok. I don't particularly like the "timeout" nomenclature: "timeout" suggests how long some process has before it's considered to have failed, whereas here we're counting down until something starts. But I can live with it.


@Override
public String toString() {
return "ElectionScheduler: do election and schedule retry";
Contributor


maybe we could put the delay in here as well


delay = randomLongBetween(electionMinRetryInterval.getMillis(), currentDelayMillis + 1);
}
logger.trace("scheduling election after delay of [{}ms]", delay);
Contributor


let's also log the currentDelayMillis, min and max here
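
A sketch of what the enriched trace message could look like, with hypothetical class, method, and parameter names, and the delay drawn the same way as in the snippet above:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: choose the delay and log it together with the current back-off
// bound and the configured min/max, so a trace log explains how the delay was chosen.
final class ElectionDelayLogging {
    private static final Logger logger = LogManager.getLogger(ElectionDelayLogging.class);

    static long chooseDelayMillis(final long minRetryIntervalMillis, final long maxRetryIntervalMillis,
                                  final long currentDelayMillis) {
        // nextLong's upper bound is exclusive, so "+ 1" makes currentDelayMillis itself reachable
        final long delay = ThreadLocalRandom.current().nextLong(minRetryIntervalMillis, currentDelayMillis + 1);
        logger.trace("scheduling election with delay of [{}ms] (min=[{}ms], max=[{}ms], currentDelayMillis=[{}ms])",
            delay, minRetryIntervalMillis, maxRetryIntervalMillis, currentDelayMillis);
        return delay;
    }
}
```

The message wording also follows the earlier suggestion to say "with delay" rather than "after delay".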

Contributor Author

@DaveCTurner DaveCTurner left a comment


This is still very much a WIP but I spent some time sketching out how pre-voting might fit into this class too, and addressed most of the comments.



return this == currentScheduler;
}

private void scheduleNextElection() {
Contributor Author


Yes, that's nicer.

* It's provably impossible to guarantee that any leader election algorithm ever elects a leader, but they generally work (with
* probability that approaches 1 over time) as long as elections occur sufficiently infrequently, compared to the time it takes to send
* a message to another node and receive a response back. We do not know the round-trip latency here, but we can approximate it by
* attempting elections randomly at reasonably high frequency and backing off (linearly) until one of them succeeds. We also place an
Contributor Author


Yes, I've added such a setting.

*/

// bounds on the time between election attempts
private static final String ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY = "discovery.election.min_retry_interval";
Contributor Author


Ok, fixed.

@DaveCTurner DaveCTurner requested a review from ywelsch August 13, 2018 11:16
@DaveCTurner DaveCTurner removed the WIP label Aug 13, 2018
@DaveCTurner
Contributor Author

I split this into two PRs, #32846 and #32847, so this can be closed.

@DaveCTurner DaveCTurner deleted the 2018-08-08-election-scheduler branch August 14, 2018 15:02
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1