Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Zen2] Introduce ElectionScheduler #32846

Merged
merged 23 commits into from
Aug 15, 2018

Conversation

DaveCTurner
Copy link
Contributor

The ElectionScheduler runs while there is no known elected master and is
responsible for scheduling elections randomly, backing off on failure, to
balance the desire to elect a master quickly with the desire to avoid more than
one node starting an election at once.

The ElectionScheduler runs while there is no known elected master and is
responsible for scheduling elections randomly, backing off on failure, to
balance the desire to elect a master quickly with the desire to avoid more than
one node starting an election at once.
@DaveCTurner DaveCTurner added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Aug 14, 2018
@DaveCTurner DaveCTurner requested a review from ywelsch August 14, 2018 14:17
@DaveCTurner DaveCTurner changed the title Introduce ElectionScheduler [Zen2] Introduce ElectionScheduler Aug 14, 2018
@ywelsch ywelsch mentioned this pull request Aug 14, 2018
61 tasks
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some smaller comments. Main change looks good though.

@Nullable
private volatile Object currentScheduler; // only care about its identity; null if stopped

ElectionScheduler(Settings settings, Random random, ThreadPool threadPool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this public please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok


public class ElectionScheduler extends AbstractComponent {

/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this as Javadoc to the class level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

/**
* @param upperBound exclusive upper bound
*/
private long randomPositiveLongLessThan(long upperBound) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a static method (random instance as parameter) so that it can be unittested in isolation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. And I added tests.

return nonNegative(random.nextLong()) % (upperBound - 1) + 1;
}

private long backOffCurrentMaxDelay(long currentMaxDelayMillis) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call this boundDelayByMaxTimeout? In any case, please add a comment what this method does

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, more docs.

logger.debug("{} not starting election", ElectionScheduler.this);
return;
}
logger.debug("{} starting pre-voting", ElectionScheduler.this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a more generic message here? Something like executing scheduled election. The Prevoting class can output it's own log line that prevoting has started.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.


private class ElectionScheduler implements Releasable {
private AtomicLong currentMaxDelayMillis = new AtomicLong(minTimeout.millis());
private AtomicBoolean isRunning = new AtomicBoolean(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, change it to isClosed (or just closed), so that we don't have to write == false everywhere :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

private AtomicBoolean isRunning = new AtomicBoolean(true);

void scheduleNextElection(final TimeValue gracePeriod, final Runnable scheduledRunnable) {
final long delay;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while declare it here? Just declare where it's defined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hangover from an earlier implementation, now fixed.


private DeterministicTaskQueue deterministicTaskQueue;
private ElectionSchedulerFactory electionSchedulerFactory;
private boolean electionStarted = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to live at the class level? can we move this to the test method? Same for electionSchedulerFactory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, only used by one test now, so can be local to that (with suitable parameters)

assertThat(electionDelay, greaterThanOrEqualTo(initialGracePeriod.millis()));

// Check upper bound
assertThat(electionDelay, lessThanOrEqualTo(ELECTION_MIN_TIMEOUT_SETTING.get(Settings.EMPTY).millis()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we inject random settings with min timeout/ max timeout/backoff to test this on different settings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. This was a little painful because putting a TimeValue into a Settings.Builder converts it (lossily) to a string and then back again, and even rejects things with a fraction like 1.1m. Putting a specific number of milliseconds in as a string is the answer.


// bounds on the time between election attempts
private static final String ELECTION_MIN_TIMEOUT_SETTING_KEY = "cluster.election.min_timeout";
private static final String ELECTION_BACK_OFF_TIME_SETTING_KEY = "cluster.election.back_off_time";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document how the backoff works, i.e. the initial interval, then the subsequent intervals we consider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@DaveCTurner DaveCTurner requested a review from ywelsch August 15, 2018 11:29
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some more comments, mostly very minor stuff, and left a few suggestions which I would like to get your opinion on.

* The first election is scheduled to occur a random number of milliseconds after the scheduler is started, where the random number of
* milliseconds is chosen uniformly from
*
* (0, min(ELECTION_INITIAL_TIMEOUT_SETTING, ELECTION_MAX_TIMEOUT_SETTING_KEY)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove _KEY

* For `n > 1`, the `n`th election is scheduled to occur a random number of milliseconds after the `n - 1`th election, where the random
* number of milliseconds is chosen uniformly from
*
* (0, min(ELECTION_INITIAL_TIMEOUT_SETTING + (n-1) * ELECTION_BACK_OFF_TIME_SETTING, ELECTION_MAX_TIMEOUT_SETTING_KEY)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove _KEY

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops.

* @param scheduledRunnable The action to run each time an election should be attempted.
*/
public Releasable startElectionScheduler(TimeValue gracePeriod, Runnable scheduledRunnable) {
final ElectionScheduler currentScheduler = new ElectionScheduler();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just call this scheduler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok


/**
* @param randomSupplier supplier of randomly-chosen longs
* @param upperBound exclusive upper bound
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also document @return here, to denote what's returned? This will go nicely with your explanation above about the uniformly chosen value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😐 it returns a random positive long that's less than upperBound.

(actually neater to make it an inclusive upper bound now - this is on it way)

* The current maximum timeout: the next election is scheduled randomly no later than this number of milliseconds in the future. On
* each election attempt this value is increased by `backoffTime`, up to the `maxTimeout`, to adapt to higher-than-expected latency.
*/
private final AtomicLong currentMaxTimeoutMillis = new AtomicLong(Math.min(initialTimeout.millis(), maxTimeout.millis()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not enforce the condition on maxTimeout to be >= initialTimeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If maxTimeout < initialTimeout then nothing particularly bad happens. I added this call to Math.min because the spec is more uniform with it there than not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, nothing bad happens, but it might still be a user configuration error (user configured initalTimeout, but forgot to set maxTimeout). I think it's nicer to just throw an exception and report it.

public boolean isForceExecution() {
// There are very few of these scheduled, and they back off, but it's important that they're not rejected as
// this could prevent a cluster from ever forming.
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not sure this is needed. Have you checked this? If not, let's add a TODO here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ThreadPool#schedule() wraps this up in a ThreadedRunnable, but I think that's what we want. When the ThreadedRunnable finally runs, it runs this AbstractRunnable directly using the executor, so rejections seem to be handled right.

As far as I can see there's nothing special about whether the generic threadpool does rejections, but perhaps I'm missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generic threadpool has an unbounded queue, so rejections should not happen unless the executor is shutdown. In that case, there's no need for this code to still execute.


@Override
public void close() {
boolean isClosedChanged = isClosed.compareAndSet(false, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this previouslyNotClosed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

* For `n > 1`, the `n`th election is scheduled to occur a random number of milliseconds after the `n - 1`th election, where the random
* number of milliseconds is chosen uniformly from
*
* (0, min(ELECTION_INITIAL_TIMEOUT_SETTING + (n-1) * ELECTION_BACK_OFF_TIME_SETTING, ELECTION_MAX_TIMEOUT_SETTING_KEY)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this formula. I wonder if we should try to implement ElectionScheduler exactly this way. We would "just" need to keep track of the round n (which might be useful anyhow for debugging purposes), and there would be no need for currentMaxTimeoutMillis, which feels more awkward. The only arithmetic calculation we would need to do then would be just a straight-forward implementation of this formula, which would preferably be done in a single static method at the top-level here. This would also allow us to get rid of the methods randomPositiveLongLessThan and backOffCurrentMaxTimeout, both of which could be folded into this static method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I did this, but there's now potential for integer overflow, so perhaps we do need that upper bound on the settings?

Also I'm unconvinced at the need for a static method to do this calculation. We test it via the actual timeouts chosen, so why extract it?

Also randomPositiveLongAtMost is still needed, and you wanted a test for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection, there was already potential for integer overflow without an upper bound on the settings and this change makes it no worse.

Copy link
Contributor Author

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were two remaining discussion points in #32709 - I brought them over here.

public static final Setting<TimeValue> ELECTION_BACK_OFF_TIME_SETTING = Setting.timeSetting(ELECTION_BACK_OFF_TIME_SETTING_KEY,
TimeValue.timeValueMillis(100), TimeValue.timeValueMillis(1), Property.NodeScope);

public static final Setting<TimeValue> ELECTION_MAX_TIMEOUT_SETTING = Setting.timeSetting(ELECTION_MAX_TIMEOUT_SETTING_KEY,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywelsch you asked for a Setting.timeSetting() that allowed us to specify an upper bound for these settings as well as a lower bound. I don't think we need this - the consequences of setting ELECTION_MAX_TIMEOUT_SETTING too high do not seem terribly surprising, and documenting this seems sufficient.

}

@Override
public boolean isForceExecution() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywelsch I didn't totally understand what action you wanted here (see
#32709 (comment)).

public boolean isForceExecution() {
// There are very few of these scheduled, and they back off, but it's important that they're not rejected as
// this could prevent a cluster from ever forming.
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generic threadpool has an unbounded queue, so rejections should not happen unless the executor is shutdown. In that case, there's no need for this code to still execute.

}

final long thisAttempt = attempt.getAndIncrement();
final long maxDelayMillis = Math.min(maxTimeout.millis(), initialTimeout.millis() + thisAttempt * backoffTime.millis());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned the overflow that could happen here. Can you please add a safeguard? Otherwise I might not be able to sleep anymore 😨

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It's a bit daft because now that there are upper limits on the settings it would take solidly more than a million years to overflow, but there you go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you put it like that, you can undo c08cd8b

* The current maximum timeout: the next election is scheduled randomly no later than this number of milliseconds in the future. On
* each election attempt this value is increased by `backoffTime`, up to the `maxTimeout`, to adapt to higher-than-expected latency.
*/
private final AtomicLong currentMaxTimeoutMillis = new AtomicLong(Math.min(initialTimeout.millis(), maxTimeout.millis()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, nothing bad happens, but it might still be a user configuration error (user configured initalTimeout, but forgot to set maxTimeout). I think it's nicer to just throw an exception and report it.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (two nits)

@DaveCTurner DaveCTurner merged commit 6d9e7c5 into elastic:zen2 Aug 15, 2018
@DaveCTurner DaveCTurner deleted the 2018-08-14-election-scheduler branch August 15, 2018 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants