Eval Broker: Prevent redundant enqueues when a node is not a leader #5699
Conversation
This is primarily a cleanup commit; however, there is currently a potential race condition (that I'm not sure we've ever actually hit) during a flapping SetEnabled/Disabled state, where we may never correctly restart the eval broker if it is being called from multiple routines.
Currently, when an eval broker is disabled it still receives delayed enqueues via log application in the FSM. This causes an ever-growing heap of evaluations that will never be drained and can cause memory issues in larger clusters, or when a cluster runs for an extended period without a leader election. This commit prevents the enqueuing of evaluations while the broker is disabled and relies on the leader's restoreEvals routine to reconcile state during a leadership transition. Existing dequeues during an Enabled->Disabled broker state transition are handled by the enqueueLocked function dropping evals.
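As a rough illustration of the approach described above (not Nomad's actual implementation), a broker can guard its enqueue path on the enabled flag under the lock and drop work while disabled; all type, field, and method names below (Broker, Evaluation, ready) are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// Evaluation is a hypothetical stand-in for structs.Evaluation.
type Evaluation struct{ ID string }

// Broker is a minimal stand-in for the eval broker; names are illustrative only.
type Broker struct {
	l       sync.RWMutex
	enabled bool
	ready   []*Evaluation
}

// SetEnabled flips the broker state. Disabling drops any queued work so that a
// later leader restore (restoreEvals in Nomad) can rebuild it from raft state.
func (b *Broker) SetEnabled(enabled bool) {
	b.l.Lock()
	defer b.l.Unlock()
	b.enabled = enabled
	if !enabled {
		b.ready = nil // a non-leader should not hold evaluations
	}
}

// Enqueue drops evaluations while the broker is disabled, mirroring the
// "no enqueues on non-leaders" behaviour the PR describes.
func (b *Broker) Enqueue(eval *Evaluation) {
	b.l.Lock()
	defer b.l.Unlock()
	if !b.enabled {
		return // silently drop; the leadership transition reconciles state
	}
	b.ready = append(b.ready, eval)
}

func main() {
	b := &Broker{}
	b.Enqueue(&Evaluation{ID: "ignored-while-disabled"})
	b.SetEnabled(true)
	b.Enqueue(&Evaluation{ID: "accepted"})
	fmt.Println(len(b.ready)) // 1: only the post-enable evaluation is kept
}
```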
	return nil, time.Time{}
}
nextEval := b.delayHeap.Peek()
b.l.RUnlock()
The lock was originally unlocked here, rather than using defer, to avoid holding it after peeking into the heap. I'm not certain this was the root cause of the non-leader enqueues. Was this more of a clarity fix?
Entirely a clarity fix - I can revert it; I thought pulling out the eval would be fast enough that it's not a big deal.
It's fine to leave it as is; the new lines of execution included in the lock's scope are not that expensive.
@@ -778,13 +785,13 @@ func (b *EvalBroker) runDelayedEvalsWatcher(ctx context.Context, updateCh <-chan
// This peeks at the heap to return the top. If the heap is empty, this returns nil and zero time.
func (b *EvalBroker) nextDelayedEval() (*structs.Evaluation, time.Time) {
	b.l.RLock()
	defer b.l.RUnlock()
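For context on the lock-scope discussion above, here is a minimal, self-contained sketch of the defer-based pattern the diff switches to; the delayedBroker and item types below are hypothetical stand-ins, not Nomad's EvalBroker or delayHeap.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// item is a hypothetical stand-in for a delayed-evaluation heap entry.
type item struct {
	id     string
	waitAt time.Time
}

type delayedBroker struct {
	l    sync.RWMutex
	heap []item // a real implementation would use container/heap
}

// nextDelayed mirrors the defer-based locking in the diff above: the read
// lock is held for the whole peek, which is cheap, and defer makes the
// unlock obvious on every return path.
func (b *delayedBroker) nextDelayed() (*item, time.Time) {
	b.l.RLock()
	defer b.l.RUnlock()

	if len(b.heap) == 0 {
		return nil, time.Time{}
	}
	next := b.heap[0] // peek only; do not pop under a read lock
	return &next, next.waitAt
}

func main() {
	b := &delayedBroker{heap: []item{{id: "eval-1", waitAt: time.Now().Add(time.Minute)}}}
	next, at := b.nextDelayed()
	fmt.Println(next.id, at.After(time.Now())) // eval-1 true
}
```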
Could we add a unit test in eval_broker_test.go? Suggestion: create two eval brokers with one of them enabled, then switch to disabling it while enabling the other one in another goroutine. The test should verify that the flush method drained everything on the previously enabled eval broker.
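A rough sketch of the kind of test being suggested, written against the hypothetical Broker/Evaluation types from the earlier sketch rather than Nomad's real eval_broker_test.go helpers; the important assertions are that disabling drains the queue and that enqueues on a disabled broker are dropped.

```go
package main

import "testing"

// TestBroker_DisableDrops sketches the suggested test: enqueue while enabled,
// disable the broker (as happens on leadership loss), and verify that queued
// work is drained and new enqueues are ignored. It builds on the hypothetical
// Broker/Evaluation types from the earlier sketch.
func TestBroker_DisableDrops(t *testing.T) {
	b := &Broker{}
	b.SetEnabled(true)
	b.Enqueue(&Evaluation{ID: "eval-1"})
	b.Enqueue(&Evaluation{ID: "eval-2"})

	// Disable from another goroutine to loosely mimic the flapping
	// SetEnabled state described in the PR; receiving on the channel
	// gives a happens-before edge, so the checks below are race-free.
	done := make(chan struct{})
	go func() {
		b.SetEnabled(false)
		close(done)
	}()
	<-done

	if n := len(b.ready); n != 0 {
		t.Fatalf("expected disabled broker to be drained, got %d evals", n)
	}

	// Enqueues while disabled must also be dropped.
	b.Enqueue(&Evaluation{ID: "dropped"})
	if n := len(b.ready); n != 0 {
		t.Fatalf("expected enqueue on a disabled broker to be dropped, got %d", n)
	}
}
```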
It would be pretty hard to actually validate this in CI because we set GOMAXPROCS to one.
We'd only need a single broker in a test, though, because they don't interact with each other?
LGTM
Code looks good, but I have a question: how well tested is the path with restoreEvals and the leader transition? Do we need to do any follow-up manual testing?
@notnoop I did a fair amount of manual testing -
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
fixes #4670