-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reconciler: support disconnected clients #12058
reconciler: support disconnected clients #12058
Conversation
9dde429
to
c7bbf74
Compare
4ea42c6
to
6cd52b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @DerekStrickland! This is looking great for a first pass at some gnarly scheduler code. A bunch of my comments are about docstrings, readability, testing, etc.
I've got a couple of correctness confusions around:
- How we're using
MaxNormScore()
incomputeStopByReconnecting
- Confusing conditionals in
filterByTainted
when tainted nodes are nil that might be obscuring the right behavior - Batch creation in
createTimeoutLaterEvals
Co-authored-by: Tim Gross <[email protected]>
// DisconnectTimeout uses the MaxClientDisconnect to compute when the allocation | ||
// should transition to lost. | ||
func (a *Allocation) DisconnectTimeout(now time.Time) time.Time { | ||
tg := a.Job.LookupTaskGroup(a.TaskGroup) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a.Job
ever be nil? I know we set it that way in some circumstances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in next push.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I've left a few minor remaining questions or suggestions but once those are resolved, let's get this merged.
Co-authored-by: Tim Gross <[email protected]>
* Add merge helper for string maps * structs: add statuses, MaxClientDisconnect, and helper funcs * taintedNodes: Include disconnected nodes * upsertAllocsImpl: don't use existing ClientStatus when upserting unknown * allocSet: update filterByTainted and add delayByMaxClientDisconnect * allocReconciler: support disconnecting and reconnecting allocs * GenericScheduler: upsert unknown and queue reconnecting Co-authored-by: Tim Gross <[email protected]>
* Add merge helper for string maps * structs: add statuses, MaxClientDisconnect, and helper funcs * taintedNodes: Include disconnected nodes * upsertAllocsImpl: don't use existing ClientStatus when upserting unknown * allocSet: update filterByTainted and add delayByMaxClientDisconnect * allocReconciler: support disconnecting and reconnecting allocs * GenericScheduler: upsert unknown and queue reconnecting Co-authored-by: Tim Gross <[email protected]>
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Description
This PR is a work in progress tracking against the long-running feature branch
f-disconnected-client-handling
.It includes:
NodeStatusDisconnected
), new client status (AllocClientStatusUnknown
), and a new eval trigger (EvalTriggerMaxDisconnectTimeout
)taintedNodes
function to include disconnected nodes as part of the tainted setMaxClientDisconnect
field on theAllocation
struct for tracking how long an alloc is allowed to be disconnectedallocReconciler
changes to support allocs that can disconnect and reconnect without restartingGenericScheduler
changes to add disconnects to theNodeAllocation
set for upsert with new statusAllocClientStatusUnknown
GenericScheduler
changes to add reconnects toqueuedAllocations
.StateStore
alloc upsert code so that allocs with the unknown status don't have theirClientStatus
overwritten by the existing allocsClientStatus