
KEP-3370: add native PodGroup api #3371

Closed
wants to merge 17 commits

Conversation

denkensk
Member

@denkensk denkensk commented Jun 9, 2022

Signed-off-by: Alex Wang [email protected]

  • One-line PR description: Add Native PodGroup api
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 9, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jun 9, 2022
@alculquicondor
Member

/cc


### Implementation
The implementation details are intentionally left for different schedulers.
In terms of Kubernetes default scheduler, we will reveal the details in ano

I'm worried that the proposed API may turn out to be incompatible with autoscaling as I described in kubernetes/kubernetes#105802 (comment). I don't see any obvious way to support PodGroup without delaying decision in Filter phase and making it in some later phase of scheduling/binding cycle, which is completely incompatible with the greedy algorithm used by autoscaler.

I don't think we should introduce APIs without having reasonable belief that we will be able to actually support them. I'm not saying we need a detailed design here, but I think we should at least have a general idea that we all relevant sigs agree is feasible.

Member

@MaciekPytel the API needs to support integration with other components like CA. Here is the big picture:

  • once a Pod fails, the scheduler would give a failure message (in pod.status.conditions[*]) that CA can read
  • by combining the failure reasons and obtaining the PodGroup object's spec and status info, CA can decide whether it's possible to resolve the failure by provisioning new machines (correct me if I'm wrong)

Member

More importantly, I don't think we can accept this KEP without an implementation plan for kube-scheduler.

I can think of two high-level implementation approaches:

  1. The one that the existing co-scheduling approach is using: Still take a decision for each pod, and hold them in the Wait phase. This needs clarification on what conditions kube-scheduler would set on pods and podgroup status and how cluster-autoscaler would use them.
  2. Completely ignore the pods that reference a pod group. The scheduler and autoscaler take a decision on the pod group, not the pods. Only once the podgroup is satisfied and the resources are reserved, the scheduler starts actually scheduling the individual pods. This needs more details around minPods. Also, this implementation is closer to what kueue is doing, so we could potentially use the same API for both. And it's closer to bit.ly/k8s-reservations by @ahg-g. In a high throughput batch cluster, customers are concerned about every object creation. So reducing the number of objects is desirable. And this is my main concern with the k8s-reservations proposal as it is now too.

@MaciekPytel, which high level approach do you think is more appropriate for cluster-autoscaler?

Member

Just echoing that we need a detailed implementation plan; this KEP can't move forward without it. I think we need to join forces and try to refine the reservations idea, because I feel it will not only address the co-scheduling case but other cases as well, and it is likely to be more compatible with CA.

Member

We do need to fill in the implementation section, as I commented in #3371 (comment).

For the time being, approach 1 is more feasible and will meet the bar for an alpha feature. But I don't see a big difference between 1 and 2 in terms of the scheduler implementation. In approach 2, we still need to schedule pods one by one; we cannot take the "reservation" signal as guaranteed, right?

Regarding reservation, I understand this is a de facto feature of batch scheduling, and also the prerequisite for implementing backfilling. However, I don't interpret it as a blocker for coscheduling. I envision coscheduling as a standalone feature, just like other scheduler plugins, and if someday an external controller (like kueue) or core k8s provides the semantics of reservation, it will improve the perf of coscheduling seamlessly/automatically.

Member

@ahg-g ahg-g Jun 22, 2022

I think an approach that decouples co-scheduling multi-pod check from scheduling of individual pods is much more likely to work with autoscaler.

I think this is compatible with the idea I discussed above and is based on bit.ly/k8s-reservations; basically a group of pods is scheduled as all or nothing by incrementally reserving space on the nodes one pod at a time. For every capacity reserved for a pod in the group, there is an object representing that reserved capacity that all of the scheduler, kubelet and CA are aware of. The scheduler will need to be aware of the group to avoid partially scheduled groups leading to deadlocks, but scheduling is still done one pod/reservation at a time. An external controller like the pod group controller described here will be responsible for declaring a group scheduled when a specific limit is reached, and could also include logic to address potential deadlocks across groups through eviction.

The above approach is eventually consistent; it might be slower to schedule a group, but I think co-scheduling is inherently slow since we are trying to coordinate resources across a group of nodes. Coupling this with queueing, where workloads are started only when their quota is available, will make this less of an issue.

Some sort of approach where we "schedule" PodGroup as a whole and not do a check for each pod that is a member of the group (which is how I understand option 2 in #3371 (comment)) seems like another feasible approach.

If by this you are suggesting that we would pass to Filter (or a new SuperFilter extension) the group of pods and list of nodes, and expect it to return a subset of nodes that fit the group; then I think it will be difficult to do and a major change to kube-scheduler architecture.

Member

assuming the minPods check is not part of "scheduling" individual reservations, but something we check separately

This won't bother CA: if the PodGroup failed due to not meeting quorum, there will be an explicit error message given by the scheduler, so CA can just bypass those pods. (It's pretty much like how CA handles nominated Pods today.)

Regarding the idea of "BulkFilter", I agree with @ahg-g that it may not be the direction of the scheduler framework. However, it's fine to implement a "BulkFilter" function (either in the scheduler or in CA); CA then calls it by providing a bulk of unschedulable Pods along with "new" nodes, and ends up with a binary result. @MaciekPytel do you think it's possible?

@MaciekPytel MaciekPytel Jun 28, 2022

@ahg-g I agree with you that bit.ly/k8s-reservations seems like an idea that will work.


@Huang-Wei
I think this is a different case from nominated pods. In the pod preemption scenario, the scheduler is effectively saying the pod will be able to schedule shortly and no action is required from CA (a scale-up may be needed later for the pods that will be preempted, but those pods will trigger the scale-up normally once they go pending).
With co-scheduling the situation is different - consider an example with 4 pods that request 1 cpu each and use co-scheduling, while there are only 3 nodes with 1 cpu each in the cluster. The pods will never be able to schedule without CA intervention, and the expected result should clearly be CA adding 1 more node to make the pods schedulable.

In order to trigger a correct scale-up, it's not enough for CA to ignore the pods or to have vague information that they're all pending because a co-scheduling constraint is not being met. CA must be able to run a simulation that shows that adding just 1 node will make all those pending pods schedulable.


Re: "BulkFilter" - I'm worried about the scalability of this approach in CA. The whole point of the CA simulation is to figure out the minimum number of nodes needed to schedule all the pods. We currently do it by adding nodes one-by-one until there are enough to schedule all the pending pods.
With "BulkFilter" we don't know how many "new nodes" we need to add to schedule all the pods. I'm guessing CA would need to call it in a loop with an increasing number of "new nodes" until all the pods are scheduled? The problem is that this means we need N separate binpacking runs to find that we need N new nodes. We can try to improve it by doing a binary search or similar, but that is still much more computationally expensive than the current algorithm.
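To make the cost concrete, here is a minimal sketch (not CA code; `bulkFilter` is a hypothetical callback that runs one bin-packing simulation and reports whether the whole group fits on the existing nodes plus `extra` template nodes) of the loop and binary-search variants described above:

```go
package sketch

// minNewNodes does a linear scan: N bin-packing runs to learn that N new
// nodes are needed, which is the scalability concern raised above.
func minNewNodes(maxNodes int, bulkFilter func(extra int) bool) (int, bool) {
	for extra := 0; extra <= maxNodes; extra++ {
		if bulkFilter(extra) {
			return extra, true
		}
	}
	return 0, false
}

// minNewNodesBinary cuts that to O(log N) runs, assuming "fits with k extra
// nodes" is monotone in k, but each run is still a full bin-packing pass.
func minNewNodesBinary(maxNodes int, bulkFilter func(extra int) bool) (int, bool) {
	if maxNodes < 0 || !bulkFilter(maxNodes) {
		return 0, false
	}
	lo, hi := 0, maxNodes
	for lo < hi {
		mid := (lo + hi) / 2
		if bulkFilter(mid) {
			hi = mid
		} else {
			lo = mid + 1
		}
	}
	return lo, true
}
```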

Member

@MaciekPytel Re: the similarity to nominated pods, I meant the case where we have enough resources and it's just the quorum not being met (like 3 pods out there, but we require 4), so CA should just ignore the quorum-not-met pods, and that info is specified by the scheduler. Sorry if I didn't make it clear.

Re: the example you listed is exactly what real CA support for co-scheduling looks like, and we do need to let CA be aware that only 1 additional node is needed to schedule all 4 pods.

Re: "BulkFilter", actually that's the cost of supporting co-scheduling, which IMO is inevitable. Even with reservation, every reservation needs to be simulated and then secured. After reading the pseudo-code in kubernetes/kubernetes#105802 (comment), I think the gap is how to pass the "(temporary) internal assumed/reserved" (or equivalent) info from the scheduler to CA. Say in the above example, 3 out of 4 pods don't need extra resources, so if CA knows it only needs to simulate the 4th pod, then it would be easier - all other logic can stay intact.

Member

A reservation can be scheduled incrementally until it is satisfied. In a non-autoscaled cluster, it may be a case of waiting for existing pods to finish. With CA, that may mean multiple scale ups, potentially in different node groups.

@alculquicondor
Member

Can you add sig-autoscaling as a participating SIG and have an approver from it? At a high level, we should have an implementation path that is compatible with the cluster autoscaler.

@denkensk denkensk changed the title Add Native PodGroup api KEP-3370: add native PodGroup api Jun 10, 2022

In this design, a PodGroupStatus is needed to record the latest status. This
enables users to quickly know what’s going on with this PodGroup; also some
integrators (like ClusterAutoscaler) can rely on it to make pending PodGroup
Member

You probably need to clarify when the status will be updated with respect to the Pod unschedulable condition (is this going to be updated if the PodGroup is not satisfied yet)?

Member

also what happens if one of the pods gets deleted after the group was scheduled? There are likely cases where a pod (or more) getting deleted can be considered ok, and other cases where the whole group should be preempted.

Member Author

You probably need to clarify when the status will be updated with respect to the Pod unschedulable condition (is this going to be updated if the PodGroup is not satisfied yet)?

Added in the scheduling part of implementation.

also what happens if one of the pods gets deleted after the group was scheduled? There are likely cases where a pod (or more) getting deleted can be considered ok, and other cases where the whole group should be preempted.

According to the current use cases, if one (or more) pods is deleted, it is up to the workload controller to decide whether to continue, rebuild, or mark a failure.


How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->
Member

To mention a few:

  1. deadlocks
  2. incompatibility with CA

Member Author

I added the part about incompatibility with CA.
But what do the deadlocks here mean?

Member

depending on the implementation, we could end up in a situation where two groups are partially assigned resources, but neither of them is fully scheduled.

Member Author

Added.

const (
// PodGroupScheduled represents status of the scheduling
// process for this pod group.
PodGroupScheduled PodGroupConditionType = "PodGroupScheduled"
Member

is this constantly reconciled? what if one of the pods gets deleted after initially scheduled?

Member Author

It is not constantly reconciled. After the PodGroup is scheduled for the first time, the condition "PodGroupScheduled" is added. Then we just need to update the status message through reconciliation.
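A rough sketch of that behaviour, assuming the condition carries the usual Status/Message fields alongside the Type shown in the API hunk (illustrative only, not the proposed controller code):

```go
package sketch

// PodGroupCondition is a stub; only Type appears in the API hunk above, the
// other fields are assumed to follow the standard Kubernetes condition shape.
type PodGroupCondition struct {
	Type    string
	Status  string
	Message string
}

// ensureScheduledCondition adds PodGroupScheduled once, after the group's
// first scheduling attempt, and on later reconciles only refreshes the
// status/message rather than adding a new condition.
func ensureScheduledCondition(conds []PodGroupCondition, status, message string) []PodGroupCondition {
	for i := range conds {
		if conds[i].Type == "PodGroupScheduled" {
			conds[i].Status = status
			conds[i].Message = message
			return conds
		}
	}
	return append(conds, PodGroupCondition{Type: "PodGroupScheduled", Status: status, Message: message})
}
```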

Member

we need to document those semantics: pods getting deleted, and in general the number of scheduled pods falling below MinMember after the group is scheduled; and whether or not we will support evicting the whole group when that happens.

Member Author

Added.

and whether or not we will support evicting the whole group when that happens.

I think it all depends on the workload operator. If the total running count is less than MinMember, the workload operator can determine whether it needs to evict the whole group or just create a new pod.



# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.25"
Member

this is not reasonable given that the KEP is not complete and the deadline is close.

```
### Cluster Autoscaler

Cluster Autoscaler checks the failure reasons in pod.status.conditions to decide whether it's possible to resolve the failure by provisioning new machines. If the reason is `CoschedulingNotMeetMinMember`, CA doesn't need to provision new machines.
Member

This feels hand wavy and far from concrete; can we collaborate with a CA expert to detail the possible implementation in CA?

/cc @MaciekPytel @x13n

Member Author

It may look like this in the autoscaler: https://github.com/kubernetes/autoscaler/blob/6558beddd3d50c4653f45bad7958bf41e050f005/cluster-autoscaler/core/static_autoscaler.go#L405
We can filter out the pods with CoschedulingNotMeetMinMember in the autoscaler.
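A minimal sketch (not the code in the file linked above) of what that filtering could look like, assuming the scheduler sets the proposed `CoschedulingNotMeetMinMember` reason on the `PodScheduled` condition:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// reasonNotMeetMinMember is the condition reason proposed in this KEP.
const reasonNotMeetMinMember = "CoschedulingNotMeetMinMember"

// filterOutQuorumBlockedPods drops pods whose only problem is that their
// PodGroup has not reached MinMember yet, so the autoscaler does not try to
// provision machines for them.
func filterOutQuorumBlockedPods(pending []*corev1.Pod) []*corev1.Pod {
	kept := make([]*corev1.Pod, 0, len(pending))
	for _, p := range pending {
		blockedByQuorum := false
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse &&
				c.Reason == reasonNotMeetMinMember {
				blockedByQuorum = true
				break
			}
		}
		if !blockedByQuorum {
			kept = append(kept, p)
		}
	}
	return kept
}
```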

Member

Wouldn't filtering these pods cause them to wait for scale up indefinitely? CA should provision nodes to schedule the entire PodGroup if possible, but it is quite unclear to me whether it should provision partial resources: in some scenarios that might be desirable, e.g. when different Subsets require scaling up different NodeGroups (in which case multiple scale ups are required).

Member

A naive question: does CA support scaling up specific NodePools now, say if I have particular resource requirements? @x13n

Member

NodePool is not a k8s concept. That being said, CA operates on NodeGroups, since most cloud providers implement such a concept (e.g. node pools in GKE). It is quite possible for nodes in one NodeGroup to have a certain label specified, so let's say:

  • pod1 has nodeSelector label1=value1
  • pod2 has nodeSelector label2=value2
  • nodes in nodeGroup1 have label1=value1
  • nodes in nodeGroup2 have label2=value2

In each iteration, CA picks the best node group to scale up. In the scenario above, neither nodeGroup1 nor nodeGroup2 can satisfy both pod1 and pod2, so 2 subsequent scale ups are required. If these pods belong to the same PodGroup, the resources for them need to be provisioned incrementally. Subsequent scheduling will follow all-or-nothing semantics, so it will wait for all scale ups to take place.


- The second part is that, in order for pods belonging to the same PodGroup to be scheduled as a whole without being broken up, we will leverage the feature added in https://github.com/kubernetes/kubernetes/pull/103383 to move pods of the same PodGroup back to the internal scheduling activeQ instantly.

#### UnReserve
Member

what about WaitForFirstConsumer type volumes?

/cc @msau42

Member Author

WaitForFirstConsumer happens in the pre-bind phase, after the Permit phase, so it won't be affected by gang scheduling here.

Member

One limitation of the delayed volume binding is that it currently doesn't have the notion that a group of volumes would want to be provisioned together. So what could happen is you have a volume provisioned successfully for one pod, but then fails for the second. Or even if you have multiple volumes in a single pod.

It would be interesting to explore if this could help address that use case too.


// ScheduledAvailable is the number of pods in the subset that can be scheduled
// successfully, but the scheduling fails because minMember cannot be met.
ScheduledAvailable int32
Member

I assume this and the field below will be updated by the scheduler? how often and when will it do that?

Member Author

@denkensk denkensk Jun 17, 2022

No, not in kube-scheduler. It will be updated by the pod-group-controller in kube-controller-manager through the reconcile mechanism.
It will use pod.condition.reason = CoschedulingNotMeetMinMember to determine which class the pod belongs to.

Member

oh, so for each pod waiting on permit we will still send a pod update?

Member Author

If the pod group times out, it needs to release the resources it is occupying. For the pods in the group that could be scheduled successfully but whose scheduling fails because minMember cannot be met, we need to update pod.condition to CoschedulingNotMeetMinMember in the func handleSchedulingFailure.

If they can meet the minMember, we don't need to update it.
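A sketch of what that condition update might produce (field layout assumed to follow the standard pod condition shape; the actual wiring into the scheduler's failure handler is not shown):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newNotMeetMinMemberCondition builds the PodScheduled=False condition this
// KEP proposes to record when a pod fit on a node but its PodGroup timed out
// before MinMember was reached. Patching it onto pod.status would happen in
// the scheduler's existing failure-handling path.
func newNotMeetMinMemberCondition(podGroup string) corev1.PodCondition {
	return corev1.PodCondition{
		Type:               corev1.PodScheduled,
		Status:             corev1.ConditionFalse,
		Reason:             "CoschedulingNotMeetMinMember",
		Message:            "pod group " + podGroup + " did not reach MinMember before the timeout",
		LastTransitionTime: metav1.Now(),
	}
}
```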

Member

I am not quite clear on this, "that can be scheduled successfully" implies that the pods can be scheduled but not yet, which I understood as this number of pods waiting on permit. Perhaps an example can clarify what this field intends to track and then we can discuss naming.


const (
// PodGroupScheduled represents status of the scheduling
// process for this pod group.
PodGroupScheduled PodGroupConditionType = "PodGroupScheduled"
Member

Where are the other conditions, like Unschedulable or Pending, something like that? And is Scheduled enough rather than PodGroupScheduled, since it's the status of PodGroupCondition?

Member Author

PodGroupScheduled represents that the pod group has been scheduled, not only that it was scheduled successfully.

And is Scheduled enough rather than PodGroupScheduled, since it's the status of PodGroupCondition?

Scheduled and PodGroupScheduled are both OK to me.

@Priyankasaggu11929 Priyankasaggu11929 mentioned this pull request Jun 17, 2022

- If they are both pgPods:
- Compare their pod group's `CreationTimestamp`: the Pod in pod group which has earlier timestamp is positioned ahead of the other.
- If their `CreationTimestamp` is identical, order by their UID of PodGroup: a Pod with lexicographically greater UID is scheduled ahead of the other Pod. (The purpose is to tease different PodGroups with the same `CreationTimestamp` apart, while also keeping Pods belonging to the same PodGroup back-to-back)
Member

A little confusing here: won't we reuse PodGroup like PriorityClass? If yes, it makes no sense to compare the CreationTimestamp of the PodGroup.

Member Author

Sorry, I don't get what you mean by "no sense". Can you give more details?

Member

Sorry, just forget my question. I was thinking about whether we can reuse PodGroup like PriorityClass, but I just realized PodGroup is a stateful workload.
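For reference, a minimal sketch of the ordering described in the hunk above, assuming each pgPod can be resolved to its PodGroup's CreationTimestamp and UID (the non-pgPod branches of the comparison are omitted):

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// groupKey is the PodGroup information assumed to be resolvable for a pgPod.
type groupKey struct {
	CreationTimestamp time.Time
	UID               types.UID
}

// lessPGPods orders two pods that both reference a PodGroup: the pod whose
// group was created earlier goes first; on a timestamp tie, the group with
// the lexicographically greater UID goes first, which keeps pods of the same
// PodGroup back-to-back in the queue.
func lessPGPods(a, b groupKey) bool {
	if !a.CreationTimestamp.Equal(b.CreationTimestamp) {
		return a.CreationTimestamp.Before(b.CreationTimestamp)
	}
	return a.UID > b.UID
}
```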


- The first part is that we put the pods that don't meet MinMember into the WaitingMap and reserve resources until MinMember is met or the timeout is triggered.

- Get the number of Running pods that belong to the same PodGroup
Member

Get by labelSelector? And where is the labelSelector located, at the PodGroup level or the SubSet level?

Member

I ask this question because I find that in SubsetStatus, we will get the pods by label selector too.

Member Author

We don't use a label selector; we add a field called podGroup in pod.spec.

Comment on lines 493 to 495
In PreFilter phase, we need to check if pod.spec.podGroup is nil first. If so, return success. If not, check whether the pod group exists in the scheduler's internal cache and whether pod.spec.podGroup.subset is matched in the pod group; if not, reject the pod in PreFilter.
Member

In PreFilter, it does some preliminary checks in the following sequences:

  • If the pod doesn't carry .spec.podGroup, returns Success immediately. Optionally, plumb a key-value pair in CycleState so as to be leveraged in Permit phase later.
  • If the pod carries .spec.podGroup, verify if the PodGroup exists or not:
    • If not, return UnschedulableAndUnresolvable.
    • If yes, verify if the quorum (MinMember) has been met. If met, return Success; otherwise return UnschedulableAndUnresolvable.
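A minimal sketch of that sequence with stubbed-in status codes (the real plugin would return scheduler-framework statuses and read the PodGroup from its cache; the lookup callbacks here are placeholders):

```go
package sketch

// status is a simplified stand-in for the scheduler framework status codes.
type status int

const (
	success status = iota
	unschedulableAndUnresolvable
)

// preFilter mirrors the sequence above. podGroup is the value of the proposed
// pod.spec.podGroup field ("" when the pod does not reference any group);
// groupExists and quorumMet stand in for lookups against the scheduler cache.
func preFilter(podGroup string, groupExists, quorumMet func(string) bool) status {
	if podGroup == "" {
		// Not part of any PodGroup. The real plugin would also note this in
		// CycleState so Permit can return quickly later.
		return success
	}
	if !groupExists(podGroup) {
		return unschedulableAndUnresolvable
	}
	if !quorumMet(podGroup) {
		return unschedulableAndUnresolvable
	}
	return success
}
```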

Member Author

Done



#### Permit
In the Permit phase, it is divided into two parts.
Member

In Permit, it first leverages the pre-plumbed CycleState variable to quickly return if it's a Pod not associated with any PodGroup.

Next, it counts the number of "cohorted" Pods that belong to the same PodGroup. Under the hood, it reads the scheduler framework's NodeInfos to include internally assumed/reserved Pods. The evaluation formula is:

# of Running + Assumed Pods + 1 >= MinMember (the 1 means the pod itself)
  • If the evaluation result is false, return Wait - which will hold the pod in the internal waitingPodsMap, timing out based on the PodGroup's timeout setting. Meanwhile, proactively move the "cohorted" Pods back to the head of activeQ, so they can be retried immediately. This is a critical optimization to avoid potential deadlocks among PodGroups.
  • If the evaluation result is true, iterate over its "cohorted" Pods to release them (as mentioned above, they were held in the internal waitingPodsMap), and return Success.
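A minimal sketch of that evaluation (the Wait/Allow handling of waitingPodsMap and activeQ is described above and not reproduced here):

```go
package sketch

// permitDecision is a simplified stand-in for the framework's Permit result.
type permitDecision int

const (
	allow permitDecision = iota
	waitForGroup
)

// permit applies the formula above: running and assumed counts come from the
// scheduler's NodeInfo snapshot, and the +1 accounts for the pod currently
// being permitted. On waitForGroup the caller parks the pod in waitingPodsMap
// with the PodGroup's timeout and moves its "cohorted" pods to the head of
// activeQ; on allow it releases the group's waiting pods.
func permit(running, assumed, minMember int32) permitDecision {
	if running+assumed+1 >= minMember {
		return allow
	}
	return waitForGroup
}
```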

Member Author

Done

-->

- Restrict the PodGroup implementation to a specific scheduler

Member

Let's add a bullet stating that PodGroup based preemption would be a non-goal for the alpha implementation.

Member Author

Ok. Added it.

Member

But maybe we could add this to the risks: pods with high priority in a PodGroup might preempt successfully but finally be released for some reason.

#### Story 2
As a service workload user, I want some related deployments to run in the same zone
(with other scheduling directives like PodAffinity) at the same time (using PodGroup);
otherwise, don’t start any of them.
Member

can you give an example workload where this is necessary?

gang-scheduling. But CA will provision machines for all pods. For this part of pods in pod
group, we will add different failure message in condition.reason. Cluster Autoscaler check the
failure reasons in pod.status.condition to decide whether it's possible to resolve the failure
by provisioning new machines. Details are described in the following.
Member

Suggested change
by provisioning new machines. Details are described in the following.
by provisioning new machines. Details are described below.

// an in-progress PodGroup-level scheduling attempt.
// Duration Time is calculated from the time the first Pod in this
// PodGroup gets a resource. If the timeout is reached, the resources
// occupied by the PodGroup will be released to avoid long-term resource waste.
Member

How should users determine this value?

My concern here is that this partially leaks an implementation detail into the API. This timeout will impact how long the scheduler could reserve resources for a partially allocated group. Isn't this dangerous, because users can set it to an arbitrarily high value and thus exacerbate the potential for deadlocks?

Member

Do we provide a default value to prevent resources from being occupied indefinitely?

```

Pods in the same PodGroup with different priorities might lead to unintended behavior,
so we need to ensure Pods in the same PodGroup have the same priority.
Member

probably better to have this in the risks section, also please describe what unintended behavior could be faced


// Total number of non-terminated pods targeted by this subset. (
// their labels match the selector)
Total int32
Member

Explain the difference compared to running, basically this includes running and unscheduled pods, correct?


// ScheduledUnavailable is the number of pods in the subset that can't be
// scheduled successfully.
ScheduledUnavailable int32
Member

The name is confusing to me, why prefix it with scheduled if it can't be scheduled successfully?

Also, can't this be deduced from the other two numbers (total and ScheduledAvailable)?


https://storage.googleapis.com/k8s-triage/index.html
-->

- These cases will be added to the existing integration tests:
Member

we need scale testing as well



After the pod group is scheduled for the first time, the condition `PodGroupScheduled` is
added by the `pod-group-controller` in kube-controller-manager. Then `pod-group-controller`
will reconcile the status of pod group. The workload operator can watch the pod group status,
Member

What about when we complete the pod group scheduling? Will we update the condition, or just delete the podGroup via the controller?

Member

From the text below, "The user waits for the PodGroup to be complete and then deletes it.", I guess the controller will not delete the podGroup automatically. Should we add another condition type to indicate that we finished the pod group scheduling, like Complete as Job does?

Member

Also, the condition for failure.

// PodGroupCondition describes the state of a pod group
// at a certain point.
type PodGroupCondition struct {
// Type of deployment condition.
Member

s/deployment/podGroup/ ?



@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: denkensk
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign ahg-g for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vsoch

vsoch commented Jan 9, 2024

hey @denkensk is there any reason you stopped working on this? We would find this highly useful for one of our needs.

@AxeZhan
Member

AxeZhan commented Feb 19, 2024

hey @denkensk is there any reason you stopped working on this? We would find this highly useful for one of our needs.

+1 for having this. Would like to know the status of this KEP.

@tenzen-y
Member

hey @denkensk is there any reason you stopped working on this? We would find this highly useful for one of our needs.

+1 for having this. Would like to know the status of this KEP.

I guess that we need to move the cluster-autoscaler ProvisioningRequest forward further before we work on this.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2024
@tenzen-y
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2024
@wojtek-t wojtek-t removed their assignment Jun 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 5, 2024
@Okabe-Junya
Member

xref here: #3370 (comment)

I think the closure of this issue should be coupled with #3371. How about either re-opening this issue or closing #3371 as well? (It seems that 3371 has also gone stale....)

@alculquicondor
Member

/close
There is a newer proposal in #4671

@k8s-ci-robot
Contributor

@alculquicondor: Closed this PR.

In response to this:

/close
There is a newer proposal in #4671

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.