-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Zen2] Introduce ElectionScheduler #32709
[Zen2] Introduce ElectionScheduler #32709
Conversation
The ElectionScheduler runs while there is no known elected master and is responsible for scheduling elections randomly, backing off on failure, to balance the desire to elect a master quickly with the desire to avoid more than one node starting an election at once.
Pinging @elastic/es-distributed |
Marking this as WIP because I realised I want it to be restartable, and currently it's not. |
…so can be removed
…fies things a bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done an initial pass. I wonder how this will be integrated and whether the class that's going to use this will also need the notion of a current
ElectionContext, so that we might be duplicating the notion of current
across two classes.
reasonableTimeParser(ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY), | ||
new ElectionMaxRetryIntervalSettingValidator(), Property.NodeScope, Property.Dynamic); | ||
|
||
private static Function<String, TimeValue> reasonableTimeParser(final String settingKey) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a parseTimeValue
method to Setting
that allows to specify both a min and a max.
} | ||
|
||
static void validateSettings(final TimeValue electionMinRetryInterval, final TimeValue electionMaxRetryInterval) { | ||
if (electionMaxRetryInterval.millis() < electionMinRetryInterval.millis() + 100) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should instead limit electionMaxRetryInterval to minimum 100ms and check that min <= max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important that there's some margin between min and max. If they're equal then there's a chance that two nodes end up in lockstep, repeatedly interfering with each other's elections.
public boolean isForceExecution() { | ||
// There are very few of these scheduled, and they back off, but it's important that they're not rejected as | ||
// this could prevent a cluster from ever forming. | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generic threadpool never rejects anyone. I wonder though if the ScheduledThreadPoolExecutor can reject us. As ThreadPool.schedule wraps the executable in an ThreadedRunnable, which is not an AbstractRunnable, this might cause problems? ScheduledThreadPoolExecutor looks to have an unbounded queue though, so needs more investigation...
} | ||
} | ||
|
||
public void start() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use the activate
/ deactivate
terminology so that this is not confused with the lifecycle of AbstractLifeCycleComponent
private static final String ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY = "discovery.election.max_retry_interval"; | ||
|
||
public static final Setting<TimeValue> ELECTION_MIN_RETRY_INTERVAL_SETTING | ||
= new Setting<>(ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY, "300ms", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
300ms is quite long. I wonder if we should go with something a bit more optimistic, say 100ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reduced to 100ms. There's opportunity to debate how we calculate the actual delays between election attempts. Currently they're never less than 100ms (previously 300ms) apart but in fact I can't think of a good reason for this lower bound.
|
||
delay = randomLongBetween(electionMinRetryInterval.getMillis(), currentDelayMillis + 1); | ||
} | ||
logger.trace("scheduling election after delay of [{}ms]", delay); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe with delay
instead of after delay
(after sounds as if the delay has already passed)
return; | ||
} | ||
} | ||
startElection(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add trace logging here that election has started
|
||
// bounds on the time between election attempts | ||
private static final String ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY = "discovery.election.min_retry_interval"; | ||
private static final String ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY = "discovery.election.max_retry_interval"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should go with naming that's more established in the literature, and use something like min_election_timeout
and max_election_timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I don't particularly like the "timeout" nomenclature - "timeout" suggests how long some process has before it's considered to have failed, but here we're counting down until something starts. But I can live with it.
|
||
@Override | ||
public String toString() { | ||
return "ElectionScheduler: do election and schedule retry"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could put the delay in here as well
|
||
delay = randomLongBetween(electionMinRetryInterval.getMillis(), currentDelayMillis + 1); | ||
} | ||
logger.trace("scheduling election after delay of [{}ms]", delay); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also log the currentDelayMillis, min and max here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still very much a WIP but I spent some time sketching out how pre-voting might fit into this class too, and addressed most of the comments.
|
||
// bounds on the time between election attempts | ||
private static final String ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY = "discovery.election.min_retry_interval"; | ||
private static final String ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY = "discovery.election.max_retry_interval"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I don't particularly like the "timeout" nomenclature - "timeout" suggests how long some process has before it's considered to have failed, but here we're counting down until something starts. But I can live with it.
private static final String ELECTION_MAX_RETRY_INTERVAL_SETTING_KEY = "discovery.election.max_retry_interval"; | ||
|
||
public static final Setting<TimeValue> ELECTION_MIN_RETRY_INTERVAL_SETTING | ||
= new Setting<>(ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY, "300ms", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reduced to 100ms. There's opportunity to debate how we calculate the actual delays between election attempts. Currently they're never less than 100ms (previously 300ms) apart but in fact I can't think of a good reason for this lower bound.
} | ||
|
||
static void validateSettings(final TimeValue electionMinRetryInterval, final TimeValue electionMaxRetryInterval) { | ||
if (electionMaxRetryInterval.millis() < electionMinRetryInterval.millis() + 100) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important that there's some margin between min and max. If they're equal then there's a chance that two nodes end up in lockstep, repeatedly interfering with each other's elections.
return this == currentScheduler; | ||
} | ||
|
||
private void scheduleNextElection() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's nicer.
* It's provably impossible to guarantee that any leader election algorithm ever elects a leader, but they generally work (with | ||
* probability that approaches 1 over time) as long as elections occur sufficiently infrequently, compared to the time it takes to send | ||
* a message to another node and receive a response back. We do not know the round-trip latency here, but we can approximate it by | ||
* attempting elections randomly at reasonably high frequency and backing off (linearly) until one of them succeeds. We also place an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've added such a setting.
*/ | ||
|
||
// bounds on the time between election attempts | ||
private static final String ELECTION_MIN_RETRY_INTERVAL_SETTING_KEY = "discovery.election.min_retry_interval"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fixed.
The ElectionScheduler runs while there is no known elected master and is
responsible for scheduling elections randomly, backing off on failure, to
balance the desire to elect a master quickly with the desire to avoid more than
one node starting an election at once.