Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reconciler: support disconnected clients #12058

Conversation

DerekStrickland
Copy link
Contributor

@DerekStrickland DerekStrickland commented Feb 11, 2022

Description

This PR is a work in progress tracking against the long-running feature branch f-disconnected-client-handling.

It includes:

  • New node status (NodeStatusDisconnected), new client status (AllocClientStatusUnknown), and a new eval trigger (EvalTriggerMaxDisconnectTimeout)
  • Updates to the taintedNodes function to include disconnected nodes as part of the tainted set
  • A new MaxClientDisconnect field on the Allocation struct for tracking how long an alloc is allowed to be disconnected
  • allocReconciler changes to support allocs that can disconnect and reconnect without restarting
  • GenericScheduler changes to add disconnects to the NodeAllocation set for upsert with new status AllocClientStatusUnknown
  • GenericScheduler changes to add reconnects to queuedAllocations.
  • Changes to the StateStore alloc upsert code so that allocs with the unknown status don't have their ClientStatus overwritten by the existing allocs ClientStatus

scheduler/util.go Outdated Show resolved Hide resolved
client/client.go Outdated Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @DerekStrickland! This is looking great for a first pass at some gnarly scheduler code. A bunch of my comments are about docstrings, readability, testing, etc.
I've got a couple of correctness confusions around:

  • How we're using MaxNormScore() in computeStopByReconnecting
  • Confusing conditionals in filterByTainted when tainted nodes are nil that might be obscuring the right behavior
  • Batch creation in createTimeoutLaterEvals

helper/funcs.go Outdated Show resolved Hide resolved
nomad/structs/structs.go Outdated Show resolved Hide resolved
nomad/structs/structs.go Outdated Show resolved Hide resolved
scheduler/util.go Outdated Show resolved Hide resolved
scheduler/reconcile_util.go Show resolved Hide resolved
scheduler/reconcile.go Outdated Show resolved Hide resolved
scheduler/reconcile_test.go Outdated Show resolved Hide resolved
scheduler/generic_sched_test.go Outdated Show resolved Hide resolved
scheduler/generic_sched_test.go Outdated Show resolved Hide resolved
scheduler/generic_sched_test.go Outdated Show resolved Hide resolved
// DisconnectTimeout uses the MaxClientDisconnect to compute when the allocation
// should transition to lost.
func (a *Allocation) DisconnectTimeout(now time.Time) time.Time {
tg := a.Job.LookupTaskGroup(a.TaskGroup)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a.Job ever be nil? I know we set it that way in some circumstances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in next push.

scheduler/reconcile.go Outdated Show resolved Hide resolved
scheduler/reconcile.go Outdated Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I've left a few minor remaining questions or suggestions but once those are resolved, let's get this merged.

@vercel vercel bot temporarily deployed to Preview – nomad February 16, 2022 18:21 Inactive
@DerekStrickland DerekStrickland merged commit 195b89b into f-disconnected-client-allocation-handling Feb 16, 2022
@DerekStrickland DerekStrickland deleted the f-reconciler-disconnected-clients branch February 16, 2022 18:50
@DerekStrickland DerekStrickland added this to the 1.3.0 milestone Feb 23, 2022
DerekStrickland added a commit that referenced this pull request Apr 5, 2022
* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting

Co-authored-by: Tim Gross <[email protected]>
DerekStrickland added a commit that referenced this pull request Apr 6, 2022
* Add merge helper for string maps
* structs: add statuses, MaxClientDisconnect, and helper funcs
* taintedNodes: Include disconnected nodes
* upsertAllocsImpl: don't use existing ClientStatus when upserting unknown
* allocSet: update filterByTainted and add delayByMaxClientDisconnect
* allocReconciler: support disconnecting and reconnecting allocs
* GenericScheduler: upsert unknown and queue reconnecting

Co-authored-by: Tim Gross <[email protected]>
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants