
# KEP-3063: dynamic resource allocation updates for 1.26 #3502

Merged

Changes from 2 commits
98 changes: 71 additions & 27 deletions keps/sig-node/3063-dynamic-resource-allocation/README.md
@@ -998,8 +998,8 @@ selector is static and typically will use labels that determine which nodes may
have resources available.

To gather information about the current state of resource availability and to
trigger allocation of a claim, the scheduler creates one PodScheduling object
for each pod that uses claims. That object is owned by the pod and
will either get deleted by the scheduler when it is done with pod scheduling or
through the garbage collector. In the PodScheduling object, the scheduler posts
the list of all potential nodes that it was left with after considering all
@@ -1042,7 +1042,7 @@ else changes in the system, like for example deleting objects.
* if *delayed allocation and resource not allocated yet*:
* if *at least one node fits pod*:
* **scheduler** creates or updates a `PodScheduling` object with `podScheduling.spec.potentialNodes=<nodes that fit the pod>`
* if *exactly one claim is pending (see below)* or *all drivers have provided information*:
* **scheduler** picks one node, sets `podScheduling.spec.selectedNode=<the chosen node>`
* if *resource is available for this selected node*:
* **resource driver** adds finalizer to claim to prevent deletion -> allocation in progress
@@ -1075,6 +1075,14 @@ else changes in the system, like for example deleting objects.
* **resource driver** clears finalizer and `claim.status.allocation`
* **API server** removes ResourceClaim

When exactly one claim is pending, it is safe to trigger the allocation: if the
node is suitable, the allocation will succeed and the pod can get scheduled
without further delays. If the node is not suitable, allocation fails and the
next attempt can do better because it has more information. The same should not
be done when there are multiple claims because allocation might succeed for
some, but not all of them, which would force the scheduler to recover by asking
for deallocation. It's better to wait for information in this case.
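
This decision can be summarized in a small sketch (illustrative only, not code from the scheduler plugin; the function and parameter names are made up):

```go
// canTriggerAllocation decides whether the scheduler should set
// podScheduling.spec.selectedNode and thereby trigger allocation.
//
// pendingClaims is the number of claims of the pod that still need
// allocation; driversInformed reports whether every driver has posted
// UnsuitableNodes information for its claim.
func canTriggerAllocation(pendingClaims int, driversInformed bool) bool {
	if pendingClaims == 1 {
		// Safe: allocation either succeeds (the pod can be scheduled) or
		// fails (the next attempt has more information).
		return true
	}
	// With multiple pending claims, wait for UnsuitableNodes information to
	// minimize the risk of allocating only some of the claims.
	return driversInformed
}
```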

The flow is similar for a ResourceClaim that gets created as a stand-alone
object by the user. In that case, the Pod references that ResourceClaim by
name. The ResourceClaim does not get deleted at the end and can be reused by
@@ -1373,14 +1381,15 @@ type PodSchedulingSpec {
// adding nodes here that the driver then would need to
// reject through UnsuitableNodes.
//
// The size of this field is limited to 256 (=
// [PodSchedulingNodeListMaxSize]). This is large enough for many
// clusters. Larger clusters may need more attempts to find a node that
// suits all pending resources.
PotentialNodes []string
}

// PodSchedulingStatus is where resource drivers provide
// information about where they could allocate a resource
// and whether allocation failed.
type PodSchedulingStatus struct {
// Each resource driver is responsible for providing information about
@@ -1408,23 +1417,36 @@ type ResourceClaimSchedulingStatus struct {
// PodResourceClaimName matches the PodResourceClaim.Name field.
PodResourceClaimName string

// UnsuitableNodes lists nodes that the claim cannot be allocated for.
// Nodes listed here will be ignored by the scheduler when selecting a
// node for a Pod. All other nodes are potential candidates, either
// because no information is available yet or because allocation might
// succeed.
//
// A change of the PodSchedulingSpec.PotentialNodes field and/or a failed
// allocation attempt trigger an update of this field: the driver
// then checks all nodes listed in PotentialNodes and UnsuitableNodes
// and updates UnsuitableNodes.
//
// It must include the prior UnsuitableNodes in this check because the
// scheduler will not list those again in PotentialNodes but they might
// still be unsuitable.
//
// This can change, so the driver also must refresh this information
// periodically and/or after changing resource allocation for some
// other ResourceClaim until a node gets selected by the scheduler.
//
// The size of this field is limited to 256 (=
// [PodSchedulingNodeListMaxSize]), the same as for
// PodSchedulingSpec.PotentialNodes.
UnsuitableNodes []string
}
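
To make the refresh rule above concrete, the following driver-side sketch re-checks the combined node set (illustrative only, not part of the proposed API; the helper name and the suitability callback are made up):

```go
// computeUnsuitableNodes re-checks both the scheduler's current candidates
// and the previously reported unsuitable nodes, because the scheduler will
// not list the latter again in PotentialNodes.
func computeUnsuitableNodes(potentialNodes, priorUnsuitableNodes []string,
	suitable func(node string) bool) []string {
	seen := make(map[string]bool)
	var unsuitable []string
	check := func(nodes []string) {
		for _, node := range nodes {
			if seen[node] {
				continue
			}
			seen[node] = true
			if !suitable(node) {
				unsuitable = append(unsuitable, node)
			}
		}
	}
	check(potentialNodes)
	check(priorUnsuitableNodes)
	if len(unsuitable) > PodSchedulingNodeListMaxSize {
		// The result must respect the same size limit as PotentialNodes.
		unsuitable = unsuitable[:PodSchedulingNodeListMaxSize]
	}
	return unsuitable
}
```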

// PodSchedulingNodeListMaxSize defines the maximum number of entries in the
// node lists that are stored in PodScheduling objects. This limit is part
// of the API.
const PodSchedulingNodeListMaxSize = 256

type PodSpec {
...
// ResourceClaims defines which ResourceClaims must be allocated
@@ -1657,32 +1679,54 @@ might attempt to improve this.
#### Pre-score

This is passed a list of nodes that have passed filtering by the resource
plugin and the other plugins. That list is stored by the plugin and will
be copied to PodSchedulingSpec.PotentialNodes when the plugin creates or updates
the object in Reserve.

Pre-score is not called when there is only a single potential node. In that
case Reserve will store the selected node in PodSchedulingSpec.PotentialNodes.
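
A minimal sketch of this bookkeeping, assuming a hypothetical per-pod map in the plugin (the real plugin may keep this in its scheduling cycle state instead):

```go
import "sync"

// claimPlugin is a simplified stand-in for the scheduler plugin's state.
type claimPlugin struct {
	mutex          sync.Mutex
	potentialNodes map[string][]string // pod UID -> nodes that passed filtering
}

// preScore remembers which nodes passed filtering for this pod. Nothing is
// written to the apiserver yet; Reserve later copies the list into
// PodSchedulingSpec.PotentialNodes.
func (pl *claimPlugin) preScore(podUID string, filteredNodes []string) {
	if len(filteredNodes) > PodSchedulingNodeListMaxSize {
		// Keep within the API limit for PotentialNodes.
		filteredNodes = filteredNodes[:PodSchedulingNodeListMaxSize]
	}
	pl.mutex.Lock()
	defer pl.mutex.Unlock()
	pl.potentialNodes[podUID] = filteredNodes
}
```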

#### Reserve

A node has been chosen for the Pod.

If using delayed allocation and one or more claims have not been allocated yet,
the plugin now needs to decide whether it wants to trigger allocation by
setting the PodSchedulingSpec.SelectedNode field. For a single unallocated
claim that is safe even if no information about unsuitable nodes is available
because the allocation will either succeed or fail. For multiple such claims
allocation only gets triggered when that information is available, to minimize
the risk of getting only some but not all claims allocated. In both cases the
PodScheduling object gets created or updated as needed. This is also where the
PodSchedulingSpec.PotentialNodes field gets set.

> **Review comment (Member):** I thought only scheduler sets SelectedNode, but this says Plugin?
>
> **Contributor Author:** The scheduler plugin. Elsewhere I have only said "the scheduler", but that meant the same thing because the core scheduler doesn't know anything about claims - everything related to those happens inside the scheduler plugin.
>
> I'll make this more consistent and stick with "the scheduler".
>
> **Contributor Author:** After looking at some other sections I found that "the plugin does xyz" was already used, so I changed my mind: in this chapter, let's use "the scheduler" for the core scheduler logic and "the claim plugin" for anything related to claims.

If all resources have been allocated already,
the scheduler ensures that the Pod is listed in the `claim.status.reservedFor` field
of its ResourceClaims. The driver can and should already have added
the Pod when specifically allocating the claim for it, so it may
be possible to skip this update.
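
A sketch of that idempotent update, with reservedFor simplified here to a list of pod UIDs (the real field holds object references; the helper is hypothetical):

```go
// ensureReservedFor adds the pod to the claim's reservedFor list unless the
// driver already added it during allocation. It reports whether a status
// update still needs to be written to the apiserver.
func ensureReservedFor(reservedFor []string, podUID string) ([]string, bool) {
	for _, uid := range reservedFor {
		if uid == podUID {
			return reservedFor, false // already reserved, no update needed
		}
	}
	return append(reservedFor, podUID), true
}
```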

If some resources are not allocated yet or reserving an allocated resource
fails, the scheduling attempt needs to be aborted and retried at a later time
or when the statuses change.

#### Unreserve

The scheduler removes the Pod from the ResourceClaimStatus.ReservedFor field
because it cannot be scheduled after all.
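
The corresponding cleanup, with the same simplification of reservedFor as above (hypothetical helper, not the actual implementation):

```go
// unreserve removes the pod from the claim's reservedFor list so that a
// competing pod gets a chance to reserve the claim instead.
func unreserve(reservedFor []string, podUID string) []string {
	kept := make([]string, 0, len(reservedFor))
	for _, uid := range reservedFor {
		if uid != podUID {
			kept = append(kept, uid)
		}
	}
	return kept
}
```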

This is necessary to prevent a deadlock: suppose there are two stand-alone
claims that only can be used by one pod at a time and two pods which both
reference them. Both pods will get scheduled independently, perhaps even by
different schedulers. When each pod manages to allocate and reserve one claim,
then neither of them can get scheduled because they cannot reserve the other
claim.

Giving up the reservations in Unreserve means that the next pod scheduling
attempts have a chance to succeed. It's non-deterministic which pod will win,
but eventually one of them will. Not giving up the reservations would lead to a
permanent deadlock that somehow would have to be detected and resolved to make
progress.

> **Review comment (Member):** what if this resource can be shared by two pods (but not three)? What is validating that only two are reserving it?
>
> **Contributor Author:** We no longer have a usage count in the API (it was there in an earlier draft), now a claim is either shared (= unlimited number of pods) or single-use (= one pod). But it doesn't matter, in all of these cases the apiserver checks this when validating a ResourceClaimStatus update.
>
> **Contributor Author (@pohly, Oct 3, 2022):** Perhaps it's worth calling out a key design aspect of this KEP: all relevant information that is needed to ensure that a claim doesn't enter some invalid state is stored in the claim object itself. That is why the apiserver can validate it and why state updates are atomic.
>
> Additional objects like PodScheduling can trigger operations like allocating a claim, but even then it is the claim object that records "allocation in progress" (through the driver's finalizer) and not that other object or some internal state of the driver.
>
> "deallocation pending" is another such state that needs to be visible in the claim object. The obvious one is "claim has DeletionTimestamp", but for claims that need to go from "allocated" to "not allocated" instead of "removed" we need something else.
>
> We cannot just reduce it to one case (= delete claim) because the scheduler does not control the lifecycle of the claim.

### Cluster Autoscaler
