From 6dccd994555ee4161dd231a46a9b27d433826812 Mon Sep 17 00:00:00 2001 From: Aldo Culquicondor Date: Thu, 16 Nov 2023 15:45:09 -0500 Subject: [PATCH 1/7] KEP: Simpler algorithm for admitting groups of Pods Change-Id: I894785325d44ff3cc3287f9717e96add499a2b48 --- keps/976-plain-pods/README.md | 51 +++++++++++++++++------------------ 1 file changed, 24 insertions(+), 27 deletions(-) diff --git a/keps/976-plain-pods/README.md b/keps/976-plain-pods/README.md index a1cdb7ad6c..08864c13e2 100644 --- a/keps/976-plain-pods/README.md +++ b/keps/976-plain-pods/README.md @@ -372,12 +372,11 @@ matchExpressions: Once the webhook has marked Pods subject to queuing with the `kueue.x-k8s.io/managed: true` label, the Pod reconciler can create the corresponding Workload object to feed the Kueue admission logic. -Note that the Workload cannot be owned by the Pod. Otherwise any cascade deletion of the Pod would -delete the Workload object, even before the Pod terminates (if it has a grace period). -This means that the controller needs to manually delete the Workload object once it has the -Finished condition (after we have determined that all pods have finished). +The Workload will be owned by all Pods. Once all the Pods that own the workload +are removed, the Workload will continue to exist. -A possible extension here is to add a TTL, which we can consider based on user feedback. +If individual Pods in the group fail and a replacement Pod comes in, +the replacement Pod will be added as an owner of the Workload as well. #### Single Pods @@ -538,39 +537,35 @@ spec: Pods need to have finalizers so that we can reliably track how many of them run to completion and be able to determine when the Workload is Finished. -A Pod-group reconciler would keep track of the groups of Pods and their respective Workload object, -based on the `jobframework.Reconciler`. 
-After a Workload is admitted, as the Pod-group reconciler ungates pods, it would keep an in-memory -cache of expected ungated pods: the number of ungated pods that are not reflected in the informers -yet, per Pod-group. This number decreases as the event handler observes Pods transition from being -gated to ungated. +The Pod reconciler will run in a "composable" mode: a mode where a Workload is +composed of multiple objects. +The `jobframework.Reconciler` will be reworked to accomodate this. -In the Pod event handler, we decrement the counter when we see a transition from having -the scheduling gate `kueue.x-k8s.io/admission` to not having it. +After a Workload is admitted, each Pod that owns the workload enters the +reconciliation loop. +The reconciliation loop collects all the Pods that are not Failed and constructs +an in-memory Workload. If there is an existing Workload in the cache and it +has smaller Pod counters than the in-memory Workload, then it is considered +unmatching and the Workload is evicted. In the Pod-group reconciler: 1. If the Pod is not terminated, create a Workload for the pod group if one does not exist. -2. If the Pod is terminated, - - If the Workloald doesn't exist or the workload is finished, remove the finalizer. -3. ungated_pods_in_client: the number of non-terminated pods in the client that are admitted. - We only look at non-terminated pods to allow for terminated pods to be replaced. -4. ungated_pods = ungated_pods_in_client + expected_ungated_pods. Note that this might temporarily - lead to double counting. -2. For gated pods: - - If ungated_pods < admission.count, remove the gate, set nodeSelector, an increase - expected_ungated_pods - - Else, - - If ungated_pods_in_informer < admission.count, we can't admit this Pod now to prevent - overbooking, but requeue this Pod for retry. - - Else, remove finalizer and delete the Pod, as it's beyond the allowed admission. +2. 
If the Pod is + - Failed, or + - Succeeded and (the Workload doesn't exist or the workload is finished) + Then remove the finalizer. +3. Build the in-memory Workload. If its podset counters are greater than the + stored Workload, then evict the Workload. +4. For gated pods: + - remove the gate, set nodeSelector 5. If the number of terminated pods with a finalizer is greater than or equal to the admission count, and there are no non-terminated Pods, mark the Workload as Finished and remove the finalizers from the Pods. Note that we are only removing Pod finalizers once the Workload is finished. This is a simple way of managing finalizers, but it might lead to too many Pods lingering in etcd for a long time after -terminated. In a future version, we can consider a better scheme similar to [Pod tracking in Jobs](https://kubernetes.io/blog/2022/12/29/scalable-job-tracking-ga/). +terminated. ### Metrics @@ -653,6 +648,8 @@ Major milestones might include: - when the KEP was retired or superseded --> +- Sep 29th: Implemented single Pod support (story 1) [#1103](https://github.com/kubernetes-sigs/kueue/pulls/1103). + ## Drawbacks The proposed labels and annotations for groups of pods can be complex to build manually. From 634e43dc2f2ed96257639956207f651c1fd41a81 Mon Sep 17 00:00:00 2001 From: Aldo Culquicondor Date: Thu, 16 Nov 2023 16:01:20 -0500 Subject: [PATCH 2/7] Clarify that workload is automatically cleaned up Change-Id: I5b07d1ad71c99a23bdb612f1662df7dc8b991bec --- keps/976-plain-pods/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/keps/976-plain-pods/README.md b/keps/976-plain-pods/README.md index 08864c13e2..3bd3e36779 100644 --- a/keps/976-plain-pods/README.md +++ b/keps/976-plain-pods/README.md @@ -373,7 +373,8 @@ Once the webhook has marked Pods subject to queuing with the `kueue.x-k8s.io/man the Pod reconciler can create the corresponding Workload object to feed the Kueue admission logic. 
The Workload will be owned by all Pods. Once all the Pods that own the workload -are removed, the Workload will continue to exist. +are deleted (and their finalizers are removed), the Workload will be +automatically cleaned up. If individual Pods in the group fail and a replacement Pod comes in, the replacement Pod will be added as an owner of the Workload as well. From f1b5f78bfad7ca4b43aff03f059e52ed1cc907cb Mon Sep 17 00:00:00 2001 From: Aldo Culquicondor Date: Thu, 16 Nov 2023 16:11:23 -0500 Subject: [PATCH 3/7] Last attempt annotation and reclaimable quota Change-Id: Ie0281d6ebd6a022abc85ae245a22b3282dd23881 --- keps/976-plain-pods/README.md | 47 +++++++++++++++++++++++++++-------- 1 file changed, 37 insertions(+), 10 deletions(-) diff --git a/keps/976-plain-pods/README.md b/keps/976-plain-pods/README.md index 3bd3e36779..26bae909e7 100644 --- a/keps/976-plain-pods/README.md +++ b/keps/976-plain-pods/README.md @@ -24,6 +24,8 @@ - [Groups of Pods created beforehand](#groups-of-pods-created-beforehand) - [Groups of pods where driver generates workers](#groups-of-pods-where-driver-generates-workers) - [Tracking admitted and finished Pods](#tracking-admitted-and-finished-pods) + - [Dynamically reclaiming Quota](#dynamically-reclaiming-quota) + - [Retrying Failed Pods](#retrying-failed-pods) - [Metrics](#metrics) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) @@ -66,14 +68,11 @@ use of this API to implement queuing semantics for Pods. - Support queueing of individual Pods. - Support queueing of groups of Pods of fixed size, identified by a common label. - Opt-in or opt-out Pods from specific namespaces from queueing. +- Support for [dynamic reclaiming of quota](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim) + for succeeded Pods. 
### Non-Goals -- Support for [dynamic reclaiming quota](https://github.com/kubernetes-sigs/kueue/issues/78) - - This feature is incompatible with supporting Pod replacements without knowing the behavior of a - parent controller for the Pods. - - Support for [partial-admission](https://github.com/kubernetes-sigs/kueue/issues/420). Since all pods are already created, an implementation of partial admission would imply the @@ -552,10 +551,8 @@ unmatching and the Workload is evicted. In the Pod-group reconciler: 1. If the Pod is not terminated, create a Workload for the pod group if one does not exist. -2. If the Pod is - - Failed, or - - Succeeded and (the Workload doesn't exist or the workload is finished) - Then remove the finalizer. +2. If the Pod is terminated and (the Workload doesn't exist or the workload is finished), + then remove the finalizer. 3. Build the in-memory Workload. If its podset counters are greater than the stored Workload, then evict the Workload. 4. For gated pods: @@ -568,6 +565,35 @@ Note that we are only removing Pod finalizers once the Workload is finished. Thi of managing finalizers, but it might lead to too many Pods lingering in etcd for a long time after terminated. + +### Dynamically reclaiming Quota + +Succeeded Pods will note be considered replaceable. In other words, the quota +from Succeeded Pods will be released by filling [reclaimablePods](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim) +in the Workload status. + +### Retrying Failed Pods + +The Pod group will generally only be considered finished if all the Pods finish +with a Succeeded phase. +This allows the user to send replacement Pods when a Pod in the group fails. +The Pods need to have a different name, because the Failed ones will stay +in the apiserver with a finalizer. + +However, this implies that a group can only succeed. In other words, there is +no way for the user to declare the group failed and stop retrying. 
+ +This can be solved by adding an annotation to any Pod in the group: +`kueue.x-k8s.io/last-in-group: true`. +The annotation can be added to a Pod that was already terminated or it +can be added when the user creates the Pod. +When added on creation, this Pod will run, but it will be considered the last +attempt. + +As a result, Kueue can consider a group finished if there are no running Pods, +and at least one terminated Pod (Failed or Succeeded) has the `last-in-group` +annotation. + ### Metrics In addition to the existing metrics for workloads, it could be beneficial to track gated and @@ -617,7 +643,8 @@ The integration tests should cover the following scenarios: - Multiple Pods with one shape: - queued and admitted - failed pods recreated can use the same quota - - Workload finished when all pods finish (failed or succeeded) + - Workload finished when all pods finish Successfully + - Workload finished when a Pod with `last-in-group` annotation finishes. - Driver Pod creates workers: - queued and admitted. 
- worker pods beyond the count are rejected (deleted) From 414f0717769e10d7c86e947948f0716030ab39e1 Mon Sep 17 00:00:00 2001 From: Aldo Culquicondor Date: Mon, 20 Nov 2023 12:33:04 -0500 Subject: [PATCH 4/7] Simplify design Change-Id: Ia36c32fdf6862881fe0002d3b565c11da5978a54 --- keps/976-plain-pods/README.md | 97 ++++++++++++++++------------------- 1 file changed, 43 insertions(+), 54 deletions(-) diff --git a/keps/976-plain-pods/README.md b/keps/976-plain-pods/README.md index 26bae909e7..6836333914 100644 --- a/keps/976-plain-pods/README.md +++ b/keps/976-plain-pods/README.md @@ -24,8 +24,8 @@ - [Groups of Pods created beforehand](#groups-of-pods-created-beforehand) - [Groups of pods where driver generates workers](#groups-of-pods-where-driver-generates-workers) - [Tracking admitted and finished Pods](#tracking-admitted-and-finished-pods) - - [Dynamically reclaiming Quota](#dynamically-reclaiming-quota) - [Retrying Failed Pods](#retrying-failed-pods) + - [Dynamically reclaiming Quota](#dynamically-reclaiming-quota) - [Metrics](#metrics) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) @@ -371,12 +371,11 @@ matchExpressions: Once the webhook has marked Pods subject to queuing with the `kueue.x-k8s.io/managed: true` label, the Pod reconciler can create the corresponding Workload object to feed the Kueue admission logic. -The Workload will be owned by all Pods. Once all the Pods that own the workload -are deleted (and their finalizers are removed), the Workload will be -automatically cleaned up. +The Workload will be owned by all Pods. Once all the Pods that own the workload are deleted (and +their finalizers are removed), the Workload will be automatically cleaned up. -If individual Pods in the group fail and a replacement Pod comes in, -the replacement Pod will be added as an owner of the Workload as well. 
If individual Pods in the group fail and a replacement Pod comes in, the replacement Pod will be
+added as an owner of the Workload as well.
 
 #### Single Pods
 
@@ -537,62 +536,51 @@ spec:
 Pods need to have finalizers so that we can reliably track how many of them run to completion and
 be able to determine when the Workload is Finished.
 
-The Pod reconciler will run in a "composable" mode: a mode where a Workload is
-composed of multiple objects.
-The `jobframework.Reconciler` will be reworked to accomodate this.
+The Pod reconciler will run in a "composable" mode: a mode where a Workload is composed of multiple
+objects. The `jobframework.Reconciler` will be reworked to accommodate this.
 
-After a Workload is admitted, each Pod that owns the workload enters the
-reconciliation loop.
-The reconciliation loop collects all the Pods that are not Failed and constructs
-an in-memory Workload. If there is an existing Workload in the cache and it
-has smaller Pod counters than the in-memory Workload, then it is considered
-unmatching and the Workload is evicted.
+After a Workload is admitted, each Pod that owns the workload enters the reconciliation loop.
+The reconciliation loop collects all the Pods that are not Failed and constructs an in-memory
+Workload. If there is an existing Workload in the cache and it has smaller Pod counters than the
+in-memory Workload, then it is considered stale and the Workload is evicted.
 
 In the Pod-group reconciler:
 
-1. If the Pod is not terminated,
-   create a Workload for the pod group if one does not exist.
-2. If the Pod is terminated and (the Workload doesn't exist or the workload is finished),
-   then remove the finalizer.
-3. Build the in-memory Workload. If its podset counters are greater than the
-   stored Workload, then evict the Workload.
+1. If the Pod is not terminated and doesn't have a deletionTimestamp,
+   create a Workload for the pod group if one does not exist.
+2. 
Remove Pod finalizers if the Pod is terminated and the Workload doesn't exist, has a deletion
+   timestamp or is finished.
+3. Build the in-memory Workload. If its podset counters are greater than the stored Workload,
+   then evict the Workload.
 4. For gated pods:
-   - remove the gate, set nodeSelector
-5. If the number of terminated pods with a finalizer is greater than or equal to the admission
-   count, and there are no non-terminated Pods, mark the Workload as Finished and remove the
-   finalizers from the Pods.
-
-Note that we are only removing Pod finalizers once the Workload is finished. This is a simple way
-of managing finalizers, but it might lead to too many Pods lingering in etcd for a long time after
-terminated.
-
-
-### Dynamically reclaiming Quota
+   - remove the gate, set nodeSelector
+5. If the number of succeeded pods is equal to the admission count, mark the Workload as Finished
+   and remove the finalizers from the Pods.
 
-Succeeded Pods will note be considered replaceable. In other words, the quota
-from Succeeded Pods will be released by filling [reclaimablePods](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
-in the Workload status.
+Note that we are only removing Pod finalizers once the Workload is finished or if the Pods are
+Failed. This is a simple way of managing finalizers, but it might lead to too many Pods lingering
+in etcd for a long time after terminated.
 
 ### Retrying Failed Pods
 
-The Pod group will generally only be considered finished if all the Pods finish
-with a Succeeded phase.
-This allows the user to send replacement Pods when a Pod in the group fails.
-The Pods need to have a different name, because the Failed ones will stay
-in the apiserver with a finalizer.
+The Pod group will generally only be considered finished if all the Pods finish with a Succeeded
+phase.
+This allows the user to send replacement Pods when a Pod in the group fails or if the group is
+preempted. 
The replacement Pods can have any name, but they must point to the same pod group. + +To declare that a group is failed, a user can execute one of the following actions: +1. Issue a Delete for the Workload object. The controller would terminate all running Pods and + clean up Pod finalizers. +2. Add an annotation to any Pod in the group `kueue.x-k8s.io/retriable-in-group: false`. + The annotation can be added to an existing Pod or added on creation. -However, this implies that a group can only succeed. In other words, there is -no way for the user to declare the group failed and stop retrying. + Kueue will consider a group finished if there are no running or pending Pods, and at + least one terminated Pod (Failed or Succeeded) has the `retriable-in-group: false` annotation. -This can be solved by adding an annotation to any Pod in the group: -`kueue.x-k8s.io/last-in-group: true`. -The annotation can be added to a Pod that was already terminated or it -can be added when the user creates the Pod. -When added on creation, this Pod will run, but it will be considered the last -attempt. +### Dynamically reclaiming Quota -As a result, Kueue can consider a group finished if there are no running Pods, -and at least one terminated Pod (Failed or Succeeded) has the `last-in-group` -annotation. +Succeeded Pods will not be considered replaceable. In other words, the quota +from Succeeded Pods will be released by filling [reclaimablePods](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim) +in the Workload status. ### Metrics @@ -640,11 +628,12 @@ The integration tests should cover the following scenarios: - Basic webhook test - Single Pod queued, admitted and finished. -- Multiple Pods with one shape: +- Multiple Pods created beforehand: - queued and admitted - failed pods recreated can use the same quota - - Workload finished when all pods finish Successfully - - Workload finished when a Pod with `last-in-group` annotation finishes. 
+ - Group finished when all pods finish Successfully + - Group finished when a Pod with `retriable-in-group: false` annotation finishes. + - Group preempted and resumed. - Driver Pod creates workers: - queued and admitted. - worker pods beyond the count are rejected (deleted) From e0a3af3d54101bc8c23a51f654499a6400ef5816 Mon Sep 17 00:00:00 2001 From: Aldo Culquicondor Date: Mon, 20 Nov 2023 13:06:19 -0500 Subject: [PATCH 5/7] fix note about failed pods Change-Id: If5cf5dc60dfbf1f7afb45beee974e92bffa979b8 --- keps/976-plain-pods/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/keps/976-plain-pods/README.md b/keps/976-plain-pods/README.md index 6836333914..4ec43ebd18 100644 --- a/keps/976-plain-pods/README.md +++ b/keps/976-plain-pods/README.md @@ -556,9 +556,9 @@ In the Pod-group reconciler: 5. If the number of succeeded pods is equal to the admission count, mark the Workload as Finished and remove the finalizers from the Pods. -Note that we are only removing Pod finalizers once the Workload is finished or if the Pods are -Failed. This is a simple way of managing finalizers, but it might lead to too many Pods lingering -in etcd for a long time after terminated. +Note that we are only removing Pod finalizers once the Workload is finished. This is a simple way of +managing finalizers, but it might lead to too many Pods lingering in etcd for a long time after +terminated. 
### Retrying Failed Pods From 1cc9f412fdaa14fd5200866cbed6e808ed0726ad Mon Sep 17 00:00:00 2001 From: Aldo Culquicondor Date: Tue, 21 Nov 2023 14:38:19 -0500 Subject: [PATCH 6/7] Clarify that some fields will be excluded Change-Id: I34fbfdac3e056f559167795938c3c2cab1f41977 --- keps/976-plain-pods/README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/keps/976-plain-pods/README.md b/keps/976-plain-pods/README.md index 4ec43ebd18..d00da4ca8d 100644 --- a/keps/976-plain-pods/README.md +++ b/keps/976-plain-pods/README.md @@ -431,9 +431,12 @@ scheduling. The list of fields to keep are: - `preemptionPolicy` - `topologySpreadConstraints` - `overhead` - - `volumes` - `resourceClaims` +Note that fields like `env` and `command` can sometimes change among all the pods of a group and +they don't influence scheduling, so they are safe to skip. `volumes` can influence scheduling, but +they can be parameterized, like in StatefulSets, so we will ignore them for now. + A sha256 of the reamining Pod spec will be used as a name for a Workload podSet. The count for the podSet will be the number of Pods that match the same sha256. The hash will be calculated by the webhook and stored as an annotation: `kueue.x-k8s.io/role-hash`. From 054a8438f8171dc84549c8a25a3eae586c0b6f0f Mon Sep 17 00:00:00 2001 From: Aldo Culquicondor Date: Mon, 27 Nov 2023 21:55:43 +0000 Subject: [PATCH 7/7] Add dynamic reclaim for non retriable groups Change-Id: I971e0e87fb09ad9fba8a3805a61d2cc267b29bda --- keps/976-plain-pods/README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/keps/976-plain-pods/README.md b/keps/976-plain-pods/README.md index d00da4ca8d..cd6e7dedcd 100644 --- a/keps/976-plain-pods/README.md +++ b/keps/976-plain-pods/README.md @@ -585,6 +585,9 @@ Succeeded Pods will not be considered replaceable. 
In other words, the quota
 from Succeeded Pods will be released by filling [reclaimablePods](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
 in the Workload status.
 
+The quota from Failed Pods will also be released when there is a Pod with
+the annotation `kueue.x-k8s.io/retriable-in-group` set to false.
+
 ### Metrics
 
 In addition to the existing metrics for workloads, it could be beneficial to track gated and
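
For illustration, the group-termination rule proposed in this series can be sketched as follows. This is only a sketch, not Kueue's implementation: `pod` and `groupFinished` are local stand-in names, and the real controller operates on `corev1.Pod` objects and Workload podset counts rather than the simplified types used here.

```go
package main

import "fmt"

// Annotation proposed in patch 4 of this series.
const retriableAnnotation = "kueue.x-k8s.io/retriable-in-group"

// pod is a simplified stand-in for corev1.Pod: just the phase and the
// annotations that the termination rule looks at.
type pod struct {
	phase       string // "Pending", "Running", "Succeeded" or "Failed"
	annotations map[string]string
}

// groupFinished sketches the rule described in patch 4: a group is finished
// when the number of Succeeded Pods reaches the admitted count, or when no
// Pods are running or pending and at least one terminated Pod carries
// retriable-in-group: false.
func groupFinished(pods []pod, admittedCount int) bool {
	succeeded, active := 0, 0
	lastAttempt := false
	for _, p := range pods {
		switch p.phase {
		case "Succeeded":
			succeeded++
		case "Running", "Pending":
			active++
		}
		terminated := p.phase == "Succeeded" || p.phase == "Failed"
		if terminated && p.annotations[retriableAnnotation] == "false" {
			lastAttempt = true
		}
	}
	if succeeded >= admittedCount {
		return true // every admitted slot ran to completion
	}
	return active == 0 && lastAttempt
}

func main() {
	// A Failed Pod marked non-retriable ends the group even though
	// not all admitted slots succeeded.
	group := []pod{
		{phase: "Succeeded"},
		{phase: "Failed", annotations: map[string]string{retriableAnnotation: "false"}},
	}
	fmt.Println(groupFinished(group, 2)) // true

	// A plain Failed Pod leaves the group open for a replacement.
	fmt.Println(groupFinished([]pod{{phase: "Failed"}}, 1)) // false
}
```

The first call returns true because a terminated Pod carries `retriable-in-group: false`; the second leaves the group open so a replacement Pod can be admitted against the same quota.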