Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Nomad leadership flapping (attempt 2) #6977

Merged
merged 6 commits into from
Jan 28, 2020
Merged

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented Jan 22, 2020

Another attempt at #6951 .

Fixes a deadlock in leadership handling if leadership flapped.

Context

Raft propagates leadership transition to Nomad through a NotifyCh channel. Raft blocks when writing to this channel, so channel must be buffered or aggressively consumed[1]. Otherwise, Raft blocks indefinitely in raft.runLeader until the channel is consumed[1] and does not move on to executing follower related logic (in raft.runFollower).

While Raft runLeader defer function blocks, raft cannot process any other raft operations. For example, run{Leader|Follower} methods consume raft.applyCh, and while runLeader defer is blocked, all raft log applications or config lookup will block indefinitely.

Sadly, leaderLoop and establishLeader makes few Raft calls! establishLeader attempts to auto-create autopilot/scheduler config [3]; and leaderLoop attempts to check raft configuration [4]. All of these calls occur without a timeout.

Thus, if leadership flapped quickly while leaderLoop/establishLeadership is invoked and hit any of these Raft calls, Raft handler deadlock forever.

The above explains some of the observations in #4749 ; though I'm not fully sure I can explain the some of the yamux goroutines - and will follow up there.

Approach

This PR introduces a a helper channel that effectively proxies NotifyCh but drops all but the last value in leadership notification. If raft leadership flaps multiple times, while we attempt to gain or drop leadership, we would only make the step once. One edge case is highlighted when we step down but gain leadership again.

Follow up PRs:

Consul had a similar issue in hashicorp/consul#6852 . I chose a slightly different fix from simply increasing the buffered channel.

We should follow up with few changes related to leadership that Consul introduced:

Additionally, ideally establishLeadership needs to be fast so Nomad could be fast and we shouldn't attempt to make a Raft apply log transactions there. autopilot/scheduler might make more sense as a watcher for Serf membership changes rather than on leadership changes, e.g. we should create a scheduler config when all servers can honor the log which may happen without a leadership transition (e.g. when last follower upgrades).

[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193
[2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436
[3]

nomad/nomad/leader.go

Lines 198 to 202 in 2a89e47

s.getOrCreateAutopilotConfig()
s.autopilot.Start()
// Initialize scheduler configuration
s.getOrCreateSchedulerConfig()

[4]
configFuture := s.raft.GetConfiguration()

Mahmood Ali added 2 commits January 22, 2020 10:55
Fixes a deadlock in leadership handling if leadership flapped.

Raft propagates leadership transition to Nomad through a NotifyCh channel.
Raft blocks when writing to this channel, so channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[1] and does not move on to executing follower
related logic (in `raft.runFollower`).

While Raft `runLeader` defer function blocks, raft cannot process any other
raft operations.  For example, `run{Leader|Follower}` methods consume
`raft.applyCh`, and while runLeader defer is blocked, all raft log applications
or config lookup will block indefinitely.

Sadly, `leaderLoop` and `establishLeader` makes few Raft calls!
`establishLeader` attempts to auto-create autopilot/scheduler config [3]; and
`leaderLoop` attempts to check raft configuration [4].  All of these calls occur
without a timeout.

Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is
invoked and hit any of these Raft calls, Raft handler _deadlock_ forever.

Depending on how many times it flapped and where exactly we get stuck, I suspect
it's possible to get in the following case:

* Agent metrics/stats http and RPC calls hang as they check raft.Configurations
* raft.State remains in Leader state, and server attempts to handle RPC calls
  (e.g. node/alloc updates) and these hang as well

As we create goroutines per RPC call, the number of goroutines grow over time
and may trigger a out of memory errors in addition to missed updates.

[1] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/config.go#L190-L193
[2] https://github.com/hashicorp/raft/blob/d90d6d6bdacf1b35d66940b07be515b074d89e88/raft.go#L425-L436
[3] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L198-L202
[4] https://github.com/hashicorp/nomad/blob/2a89e477465adbe6a88987f0dcb9fe80145d7b2f/nomad/leader.go#L877
@notnoop notnoop added this to the 0.10.3 milestone Jan 22, 2020
@notnoop notnoop requested review from schmichael and tgross January 22, 2020 18:27
@notnoop notnoop self-assigned this Jan 22, 2020
// this select before the implementation goroutine dequeues last value.
//
// However, never got it to fail in test - so leaving it now to see if it ever fails;
// and if/when test fails, we can learn of how much of an issue it is and adjust
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can fail it pretty trivially if we push the values into the channel concurrently to the test thread, but I'm not sure that tells us anything other than we didn't get a chance to consume everything on the channel. If we pull all the values off, we're fine:

package main

import (
	"fmt"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
)

func TestDropButLastChannel(t *testing.T) {

	testFunc := func(t *testing.T) {
		t.Parallel()
		shutdownCh := make(chan struct{})

		src := make(chan bool)
		dst := dropButLastChannel(src, shutdownCh)

		timeoutDuration := 1 * time.Millisecond

		go func() {
			src <- false
			src <- false
			src <- false
			src <- false
			src <- false
			src <- false
			src <- true
			src <- false
			src <- true
		}()

		var v bool
	BREAK:
		for {
			select {
			case v = <-dst:
				fmt.Println("ok")
			case <-time.After(timeoutDuration):
				break BREAK
			}
		}

		assert.True(t, v)
		close(shutdownCh)
	}

	for i := 0; i < 1000; i++ {
		t.Run(fmt.Sprintf("test-%d", i), testFunc)
	}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - if it's running on different goroutine, we have no guarantees of delivery. This feels like another test to add.

Here, I wanted to test that intermediate messages get sent but get dropped when no receive is happening on the channel - so I made the sends happen in the same goroutine. Though, in current form, we still cannot 100% guarantee that the first message we receive is the last sent message, but this hasn't happened in practice yet, hence my comment.

Your test is good to have in that we should check that ultimately, we always send the last message last.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW - seeing your test, I realized I didn't support close source channel (raft doesn't attempt to close notify channel); I updated function to handle close signal by always attempting to deliver last known value, just in case someone adopt function for something else in future.

nomad/leader.go Outdated
continue
}
leaderStep := func(isLeader bool) {
switch {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this has been pulled out of the select, could this switch be a if isLeader with an early return?

nomad/leader.go Outdated
// Raft transitions that haven't been applied to the FSM
// yet.
// Ensure that that FSM caught up and eval queues are refreshed
s.logger.Error("cluster leadership flapped, lost and gained leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this Warn since there's no directly actionable error.

nomad/leader.go Outdated
} else {
// Server gained but lost leadership immediately
// before it reacted; nothing to do, move on
s.logger.Error("cluster leadership flapped, gained and lost leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be more accurate? I'm not sure gaining-and-losing-before-establishing leadership indicates a clusterwide problem.

Suggested change
s.logger.Error("cluster leadership flapped, gained and lost leadership immediately. Leadership flaps indicate a cluster wide problems (e.g. networking).")
s.logger.Warn("cluster leadership gained and lost. Could indicate network issues, memory paging, or high CPU load.")

AFAICT we only use the defaults for leadership election (1s timeout causes an election, elections have 1s timeout) and don't expose them for customizing. It seems like clusters without demanding scheduler throughput or latency may prefer higher timeouts to reduce elections (and therefore flapping).

If we allowed Raft to be configurable we could at least point users toward those docs when these cases are encountered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: configuration - agreed - we should consider porting consul's raft_multipler [1] and study their defaults. In some Consul testing, they increased their default leadership timeout to 5 seconds to account for low powered instances (e.g. t2.medium).

[1] https://www.consul.io/docs/install/performance.html#minimum-server-requirements

// wait for first message
select {
case lv = <-sourceCh:
goto ENQUEUE_DST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary.

Suggested change
goto ENQUEUE_DST

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unnecessary indeed. I'd like to keep though just because I find it easier to see all state machine transitions in goto statements.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Mahmood Ali added 2 commits January 28, 2020 09:06
Always deliver last value then send close signal.
@notnoop notnoop force-pushed the b-leadership-flapping-2 branch from 03900e2 to 97f20bd Compare January 28, 2020 14:44
@notnoop notnoop merged commit 771c8ff into master Jan 28, 2020
@notnoop notnoop deleted the b-leadership-flapping-2 branch January 28, 2020 16:40
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants