-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle Nomad leadership flapping (attempt 2) #6977
Conversation
Fixes a deadlock in leadership handling if leadership flapped. Raft propagates leadership transition to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader` until the channel is consumed[1] and does not move on to executing follower related logic (in `raft.runFollower`). While Raft `runLeader` defer function blocks, raft cannot process any other raft operations. For example, `run{Leader|Follower}` methods consume `raft.applyCh`, and while runLeader defer is blocked, all raft log applications or config lookup will block indefinitely. Sadly, `leaderLoop` and `establishLeader` makes few Raft calls! `establishLeader` attempts to auto-create autopilot/scheduler config [3]; and `leaderLoop` attempts to check raft configuration [4]. All of these calls occur without a timeout. Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is invoked and hit any of these Raft calls, Raft handler _deadlock_ forever. Depending on how many times it flapped and where exactly we get stuck, I suspect it's possible to get in the following case: * Agent metrics/stats http and RPC calls hang as they check raft.Configurations * raft.State remains in Leader state, and server attempts to handle RPC calls (e.g. node/alloc updates) and these hang as well As we create goroutines per RPC call, the number of goroutines grow over time and may trigger a out of memory errors in addition to missed updates. [1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193 [2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436 [3] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L198-L202 [4] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L877
// this select before the implementation goroutine dequeues last value. | ||
// | ||
// However, never got it to fail in test - so leaving it now to see if it ever fails; | ||
// and if/when test fails, we can learn of how much of an issue it is and adjust |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can fail it pretty trivially if we push the values into the channel concurrently to the test thread, but I'm not sure that tells us anything other than we didn't get a chance to consume everything on the channel. If we pull all the values off, we're fine:
package main
import (
"fmt"
"testing"
"time"
"github.com/stretchr/testify/assert"
)
func TestDropButLastChannel(t *testing.T) {
testFunc := func(t *testing.T) {
t.Parallel()
shutdownCh := make(chan struct{})
src := make(chan bool)
dst := dropButLastChannel(src, shutdownCh)
timeoutDuration := 1 * time.Millisecond
go func() {
src <- false
src <- false
src <- false
src <- false
src <- false
src <- false
src <- true
src <- false
src <- true
}()
var v bool
BREAK:
for {
select {
case v = <-dst:
fmt.Println("ok")
case <-time.After(timeoutDuration):
break BREAK
}
}
assert.True(t, v)
close(shutdownCh)
}
for i := 0; i < 1000; i++ {
t.Run(fmt.Sprintf("test-%d", i), testFunc)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - if it's running on different goroutine, we have no guarantees of delivery. This feels like another test to add.
Here, I wanted to test that intermediate messages get sent but get dropped when no receive is happening on the channel - so I made the sends happen in the same goroutine. Though, in current form, we still cannot 100% guarantee that the first message we receive is the last sent message, but this hasn't happened in practice yet, hence my comment.
Your test is good to have in that we should check that ultimately, we always send the last message last.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW - seeing your test, I realized I didn't support close
source channel (raft doesn't attempt to close notify channel); I updated function to handle close signal by always attempting to deliver last known value, just in case someone adopt function for something else in future.
nomad/leader.go
Outdated
continue | ||
} | ||
leaderStep := func(isLeader bool) { | ||
switch { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that this has been pulled out of the select
, could this switch be a if isLeader
with an early return?
nomad/leader.go
Outdated
// Raft transitions that haven't been applied to the FSM | ||
// yet. | ||
// Ensure that that FSM caught up and eval queues are refreshed | ||
s.logger.Error("cluster leadership flapped, lost and gained leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this Warn
since there's no directly actionable error.
nomad/leader.go
Outdated
} else { | ||
// Server gained but lost leadership immediately | ||
// before it reacted; nothing to do, move on | ||
s.logger.Error("cluster leadership flapped, gained and lost leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be more accurate? I'm not sure gaining-and-losing-before-establishing leadership indicates a clusterwide problem.
s.logger.Error("cluster leadership flapped, gained and lost leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).") | |
s.logger.Warn("cluster leadership gained and lost. Could indicate network issues, memory paging, or high CPU load.") |
AFAICT we only use the defaults for leadership election (1s timeout causes an election, elections have 1s timeout) and don't expose them for customizing. It seems like clusters without demanding scheduler throughput or latency may prefer higher timeouts to reduce elections (and therefore flapping).
If we allowed Raft to be configurable we could at least point users toward those docs when these cases are encountered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: configuration - agreed - we should consider porting consul's raft_multipler [1] and study their defaults. In some Consul testing, they increased their default leadership timeout to 5 seconds to account for low powered instances (e.g. t2.medium
).
[1] https://www.consul.io/docs/install/performance.html#minimum-server-requirements
// wait for first message | ||
select { | ||
case lv = <-sourceCh: | ||
goto ENQUEUE_DST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary.
goto ENQUEUE_DST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unnecessary indeed. I'd like to keep though just because I find it easier to see all state machine transitions in goto statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Always deliver last value then send close signal.
03900e2
to
97f20bd
Compare
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Another attempt at #6951 .
Fixes a deadlock in leadership handling if leadership flapped.
Context
Raft propagates leadership transition to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in
raft.runLeader
until the channel is consumed[1] and does not move on to executing follower related logic (inraft.runFollower
).While Raft
runLeader
defer function blocks, raft cannot process any other raft operations. For example,run{Leader|Follower}
methods consumeraft.applyCh
, and while runLeader defer is blocked, all raft log applications or config lookup will block indefinitely.Sadly,
leaderLoop
andestablishLeader
makes few Raft calls!establishLeader
attempts to auto-create autopilot/scheduler config [3]; andleaderLoop
attempts to check raft configuration [4]. All of these calls occur without a timeout.Thus, if leadership flapped quickly while
leaderLoop/establishLeadership
is invoked and hit any of these Raft calls, Raft handler deadlock forever.The above explains some of the observations in #4749 ; though I'm not fully sure I can explain the some of the yamux goroutines - and will follow up there.
Approach
This PR introduces a a helper channel that effectively proxies NotifyCh but drops all but the last value in leadership notification. If raft leadership flaps multiple times, while we attempt to gain or drop leadership, we would only make the step once. One edge case is highlighted when we step down but gain leadership again.
Follow up PRs:
Consul had a similar issue in hashicorp/consul#6852 . I chose a slightly different fix from simply increasing the buffered channel.
We should follow up with few changes related to leadership that Consul introduced:
Additionally, ideally
establishLeadership
needs to be fast so Nomad could be fast and we shouldn't attempt to make a Raft apply log transactions there. autopilot/scheduler might make more sense as a watcher for Serf membership changes rather than on leadership changes, e.g. we should create a scheduler config when all servers can honor the log which may happen without a leadership transition (e.g. when last follower upgrades).[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193
[2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436
[3]
nomad/nomad/leader.go
Lines 198 to 202 in 2a89e47
[4]
nomad/nomad/leader.go
Line 877 in 2a89e47