
KEP-3370: add native PodGroup api #3371

Closed
wants to merge 17 commits

Conversation

denkensk
Member

@denkensk denkensk commented Jun 9, 2022

Signed-off-by: Alex Wang [email protected]

  • One-line PR description: Add Native PodGroup api
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 9, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jun 9, 2022
@alculquicondor
Member

/cc


### Implementation
The implementation details are intentionally left for different schedulers.
In terms of Kubernetes default scheduler, we will reveal the details in ano

I'm worried that the proposed API may turn out to be incompatible with autoscaling as I described in kubernetes/kubernetes#105802 (comment). I don't see any obvious way to support PodGroup without delaying decision in Filter phase and making it in some later phase of scheduling/binding cycle, which is completely incompatible with the greedy algorithm used by autoscaler.

I don't think we should introduce APIs without having reasonable belief that we will be able to actually support them. I'm not saying we need a detailed design here, but I think we should at least have a general idea that we all relevant sigs agree is feasible.

Member

@MaciekPytel the API needs to support integration with other components like CA. Here is the big picture:

  • once a Pod fails, the scheduler would give a failure message (in pod.status.conditions[*]) that CA can read
  • by combining the failure reasons and obtaining the PodGroup object's spec and status info, CA can decide whether it's possible to resolve the failure by provisioning new machines (correct me if I'm wrong)

Member

More importantly, I don't think we can accept this KEP without an implementation plan for kube-scheduler.

I can think of two high-level implementation approaches:

  1. The one that the existing co-scheduling approach is using: Still take a decision for each pod, and hold them in the Wait phase. This needs clarification on what conditions kube-scheduler would set on pods and podgroup status and how cluster-autoscaler would use them.
  2. Completely ignore the pods that reference a pod group. The scheduler and autoscaler take a decision on the pod group, not the pods. Only once the podgroup is satisfied and the resources are reserved, the scheduler starts actually scheduling the individual pods. This needs more details around minPods. Also, this implementation is closer to what kueue is doing, so we could potentially use the same API for both. And it's closer to bit.ly/k8s-reservations by @ahg-g. In a high throughput batch cluster, customers are concerned about every object creation. So reducing the number of objects is desirable. And this is my main concern with the k8s-reservations proposal as it is now too.

@MaciekPytel, which high level approach do you think is more appropriate for cluster-autoscaler?

Member

Just echoing that we need a detailed implementation plan; this KEP can't move forward without it. I think we need to join forces and try to refine the reservations idea, because I feel it will not only address the co-scheduling case but other cases as well, and it is likely to be more compatible with CA.

Member

We do need to fill in the implementation section, as I commented in #3371 (comment).

For the time being, approach 1 is more feasible and will meet the bar for an alpha feature. But I don't see a big difference between 1 and 2 in terms of the scheduler implementation. In approach 2, we still need to schedule pods one by one; we cannot take the "reservation" signal as guaranteed, right?

Regarding reservation, I understand this is a de facto feature of batch scheduling, and also the prerequisite for implementing backfilling. However, I don't interpret it as a blocker for coscheduling. I envision coscheduling as a standalone feature, just like other scheduler plugins, and if someday an external controller (like kueue) or core k8s provides the semantics of reservation, it will improve the perf of coscheduling seamlessly/automatically.

Member

@ahg-g ahg-g Jun 22, 2022

I think an approach that decouples co-scheduling multi-pod check from scheduling of individual pods is much more likely to work with autoscaler.

I think this is compatible with the idea I discussed above and is based on bit.ly/k8s-reservations; basically a group of pods is scheduled as all or nothing by incrementally reserving space on the nodes one pod at a time. For every capacity reserved for a pod in the group, there is an object representing that reserved capacity that all of the scheduler, kubelet and CA are aware of. The scheduler will need to be aware of the group to avoid partially scheduled groups leading to deadlocks, but scheduling is still done one pod/reservation at a time. An external controller like the pod group controller described here will be responsible for declaring a group scheduled when a specific limit is reached, and could also include logic to address potential deadlocks across groups through eviction.

The above approach is eventually consistent; it might be slower to schedule a group, but I think co-scheduling is inherently slow since we are trying to coordinate resources across a group of nodes. Coupling this with queueing, where workloads are started only when their quota is available, will make this less of an issue.

Some sort of approach where we "schedule" PodGroup as a whole and not do a check for each pod that is a member of the group (which is how I understand option 2 in #3371 (comment)) seems like another feasible approach.

If by this you are suggesting that we would pass to Filter (or a new SuperFilter extension) the group of pods and list of nodes, and expect it to return a subset of nodes that fit the group; then I think it will be difficult to do and a major change to kube-scheduler architecture.

Member

assuming the minPods check is not part of "scheduling" individual reservations, but something we check separately

This won't bother CA: if the PodGroup failed due to not meeting quorum, there will be an explicit error message given by the scheduler, so CA can just bypass those pods. (It's pretty much like how CA handles nominated Pods today.)

Regarding the idea of "BulkFilter", I agree with @ahg-g that it may not be the direction of the scheduler framework. However, it's fine to implement a "BulkFilter" function (either in the scheduler or in CA); CA then calls it by providing a bulk of unschedulable Pods along with "new" nodes, and ends up with a binary result. @MaciekPytel do you think it's possible?

@MaciekPytel MaciekPytel Jun 28, 2022

@ahg-g I agree with you that bit.ly/k8s-reservations seems like an idea that will work.


@Huang-Wei
I think this is a different case from nominated pods. In the pod preemption scenario, the scheduler is effectively saying the pod will be able to schedule shortly and no action is required from CA (a scale-up may be needed later for the pods that will be preempted, but those pods will trigger the scale-up normally once they go pending).
With co-scheduling the situation is different - consider an example with 4 pods that request 1 cpu each and use co-scheduling, while there are only 3 nodes with 1 cpu each in the cluster. The pods will never be able to schedule without CA intervention, and the expected result should clearly be CA adding 1 more node to make the pods schedulable.

In order to trigger a correct scale-up, it's not enough for CA to ignore the pods or to have vague information that they're all pending because a co-scheduling constraint is not being met. CA must be able to run a simulation that shows that adding just 1 node will make all those pending pods schedulable.


Re: "BulkFilter" - I'm worried about the scalability of this approach in CA. The whole point of the CA simulation is to figure out the minimum number of nodes needed to schedule all the pods. We currently do it by adding nodes one-by-one until there are enough to schedule all the pending pods.
With "BulkFilter" we don't know how many "new nodes" we need to add to schedule all the pods. I'm guessing CA would need to call it in a loop with an increasing number of "new nodes" until all the pods are scheduled? The problem is that this means we need N separate binpacking runs to find that we need N new nodes. We can try to improve it by doing a binary search or similar, but that is still much more computationally expensive than the current algorithm.
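To make the cost concrete, here is a minimal sketch (not CA code; `bulkFilter` is a hypothetical callback that runs one bin-packing simulation and reports whether the whole group fits on the existing nodes plus `extra` template nodes) of the loop and binary-search variants described above:

```go
package sketch

// minNewNodes does a linear scan: N bin-packing runs to learn that N new
// nodes are needed, which is the scalability concern raised above.
func minNewNodes(maxNodes int, bulkFilter func(extra int) bool) (int, bool) {
	for extra := 0; extra <= maxNodes; extra++ {
		if bulkFilter(extra) {
			return extra, true
		}
	}
	return 0, false
}

// minNewNodesBinary cuts that to O(log N) runs, assuming "fits with k extra
// nodes" is monotone in k, but each run is still a full bin-packing pass.
func minNewNodesBinary(maxNodes int, bulkFilter func(extra int) bool) (int, bool) {
	if maxNodes < 0 || !bulkFilter(maxNodes) {
		return 0, false
	}
	lo, hi := 0, maxNodes
	for lo < hi {
		mid := (lo + hi) / 2
		if bulkFilter(mid) {
			hi = mid
		} else {
			lo = mid + 1
		}
	}
	return lo, true
}
```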

Member

@MaciekPytel Re: the similarity to nominated pods, I meant the case where we have enough resources and it's just the quorum not being met (like 3 pods out there, but we require 4), so CA should just ignore the quorum-not-met pods, and that info is specified by the scheduler. Sorry if I didn't make it clear.

Re: the example you listed is exactly what real CA support for co-scheduling looks like, and we do need to let CA be aware that only 1 additional node is needed to schedule all 4 pods.

Re: "BulkFilter", actually that's the cost of supporting co-scheduling, which IMO is inevitable. Even with reservation, every reservation needs to be simulated and then secured. After reading the pseudo-code in kubernetes/kubernetes#105802 (comment), I think the gap is how to pass the "(temporary) internal assumed/reserved" (or equivalent) info from the scheduler to CA. Say in the above example, 3 out of 4 pods don't need extra resources, so if CA knows it only needs to simulate the 4th pod, then it would be easier - all other logic can stay intact.

Member

A reservation can be scheduled incrementally until it is satisfied. In a non-autoscaled cluster, it may be a case of waiting for existing pods to finish. With CA, that may mean multiple scale ups, potentially in different node groups.

@alculquicondor
Member

Can you add sig-autoscaling as a participating SIG and have an approver from it? At a high level, we should have an implementation path that is compatible with the cluster autoscaler.

@denkensk denkensk changed the title Add Native PodGroup api KEP-3370: add native PodGroup api Jun 10, 2022

In this design, a PodGroupStatus is needed to record the latest status. This
enables users to quickly know what’s going on with this PodGroup; also some
integrators (like ClusterAutoscaler) can rely on it to make pending PodGroup
Member

You probably need to clarify when the status will be updated with respect to the Pod unschedulable condition (is this going to be updated if the PodGroup is not satisfied yet)?

Member

also what happens if one of the pods gets deleted after the group was scheduled? There are likely cases where a pod (or more) getting deleted can be considered ok, and other cases where the whole group should be preempted.

Member Author

You probably need to clarify when the status will be updated with respect to the Pod unschedulable condition (is this going to be updated if the PodGroup is not satisfied yet)?

Added in the scheduling part of implementation.

also what happens if one of the pods gets deleted after the group was scheduled? There are likely cases where a pod (or more) getting deleted can be considered ok, and other cases where the whole group should be preempted.

According to the current use cases, if one (or more) pods is deleted, it is up to the workload controller to decide whether to continue, rebuild, or mark a failure.


How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->
Member

To mention a few:

  1. deadlocks
  2. incompatibility with CA

Member Author

I added the part about incompatibility with CA.
But what do the deadlocks here mean?

Member

depending on the implementation, we could end up in a situation where two groups are partially assigned resources, but neither of them is fully scheduled.

Member Author

Added.

const (
// PodGroupScheduled represents status of the scheduling
// process for this pod group.
PodGroupScheduled PodGroupConditionType = "PodGroupScheduled"
Member

is this constantly reconciled? what if one of the pods gets deleted after initially scheduled?

Member Author

It is not constantly reconciled. After the PodGroup is scheduled for the first time, the condition "PodGroupScheduled" is added. Then we just need to update the status message through reconciliation.
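A rough sketch of that behaviour, assuming the condition carries the usual Status/Message fields alongside the Type shown in the API hunk (illustrative only, not the proposed controller code):

```go
package sketch

// PodGroupCondition is a stub; only Type appears in the API hunk above, the
// other fields are assumed to follow the standard Kubernetes condition shape.
type PodGroupCondition struct {
	Type    string
	Status  string
	Message string
}

// ensureScheduledCondition adds PodGroupScheduled once, after the group's
// first scheduling attempt, and on later reconciles only refreshes the
// status/message rather than adding a new condition.
func ensureScheduledCondition(conds []PodGroupCondition, status, message string) []PodGroupCondition {
	for i := range conds {
		if conds[i].Type == "PodGroupScheduled" {
			conds[i].Status = status
			conds[i].Message = message
			return conds
		}
	}
	return append(conds, PodGroupCondition{Type: "PodGroupScheduled", Status: status, Message: message})
}
```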

Member

we need to document those semantics: pods getting deleted, and in general the number of scheduled pods falling below MinMember after the group is scheduled; and whether or not we will support evicting the whole group when that happens.

Member Author

Added.

and whether or not we will support evicting the whole group when that happens.

I think it all depends on the workload operator. If the total running count is less than MinMember, the workload operator can determine whether it needs to evict the whole group or just create a new pod.



# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.25"
Member

this is not reasonable given that the KEP is not complete and the deadline is close.

```
### Cluster Autoscaler

Cluster Autoscaler checks the failure reasons in pod.status.conditions to decide whether it's possible to resolve the failure by provisioning new machines. If the reason is `CoschedulingNotMeetMinMember`, CA doesn't need to provision new machines.
Member

This feels hand wavy and far from concrete; can we collaborate with a CA expert to detail the possible implementation in CA?

/cc @MaciekPytel @x13n

Member Author

It may look like this in the autoscaler: https://github.com/kubernetes/autoscaler/blob/6558beddd3d50c4653f45bad7958bf41e050f005/cluster-autoscaler/core/static_autoscaler.go#L405
We can filter out the pods with CoschedulingNotMeetMinMember in the autoscaler.
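A minimal sketch (not the code in the file linked above) of what that filtering could look like, assuming the scheduler sets the proposed `CoschedulingNotMeetMinMember` reason on the `PodScheduled` condition:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// reasonNotMeetMinMember is the condition reason proposed in this KEP.
const reasonNotMeetMinMember = "CoschedulingNotMeetMinMember"

// filterOutQuorumBlockedPods drops pods whose only problem is that their
// PodGroup has not reached MinMember yet, so the autoscaler does not try to
// provision machines for them.
func filterOutQuorumBlockedPods(pending []*corev1.Pod) []*corev1.Pod {
	kept := make([]*corev1.Pod, 0, len(pending))
	for _, p := range pending {
		blockedByQuorum := false
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse &&
				c.Reason == reasonNotMeetMinMember {
				blockedByQuorum = true
				break
			}
		}
		if !blockedByQuorum {
			kept = append(kept, p)
		}
	}
	return kept
}
```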

Member

Wouldn't filtering these pods cause them to wait for scale up indefinitely? CA should provision nodes to schedule the entire PodGroup if possible, but it is quite unclear to me whether it should provision partial resources: in some scenarios that might be desirable, e.g. when different Subsets require scaling up different NodeGroups (in which case multiple scale ups are required).

Member

A naive question: does CA support scaling up specific NodePools now, say if I have particular resource requirements? @x13n

Member

NodePool is not a k8s concept. That being said, CA operates on NodeGroups, since most cloud providers implement such a concept (e.g. node pools in GKE). It is quite possible for nodes in one NodeGroup to have a certain label specified, so let's say:

  • pod1 has nodeSelector label1=value1
  • pod2 has nodeSelector label2=value2
  • nodes in nodeGroup1 have label1=value1
  • nodes in nodeGroup2 have label2=value2

In each iteration, CA picks the best node group to scale up. In the scenario above, neither nodeGroup1 nor nodeGroup2 can satisfy both pod1 and pod2, so 2 subsequent scale ups are required. If these pods belong to the same PodGroup, the resources for them need to be provisioned incrementally. Subsequent scheduling will follow all-or-nothing semantics, so it will wait for all scale ups to take place.


- The second part is that, in order for pods belonging to the same PodGroup to be scheduled as a whole without being broken up, we will leverage the feature added in https://github.com/kubernetes/kubernetes/pull/103383 to move pods of the same PodGroup back to the internal scheduling activeQ instantly.

#### UnReserve
Member

what about WaitForFirstConsumer type volumes?

/cc @msau42

Member Author

WaitForFirstConsumer happens in the pre-bind phase, after the Permit phase, so it won't be affected by gang scheduling here.

Member

One limitation of the delayed volume binding is that it currently doesn't have the notion that a group of volumes would want to be provisioned together. So what could happen is you have a volume provisioned successfully for one pod, but then fails for the second. Or even if you have multiple volumes in a single pod.

It would be interesting to explore if this could help address that use case too.


// ScheduledAvailable is the number of pods in the subset that can be scheduled
// successfully, but the scheduling fails because minMember cannot be met.
ScheduledAvailable int32
Member

I assume this and the field below will be updated by the scheduler? how often and when will it do that?

Member Author

@denkensk denkensk Jun 17, 2022

No, not in kube-scheduler. It will be updated by the pod-group-controller in kube-controller-manager through the reconcile mechanism.
It will use pod.condition.reason = CoschedulingNotMeetMinMember to determine which class the pod belongs to.

Member

oh, so for each pod waiting on permit we will still send a pod update?

Member Author

If the pod group times out, it needs to release the resources it is occupying. For the pods in the group that could be scheduled successfully but whose scheduling fails because minMember cannot be met, we need to update pod.condition to CoschedulingNotMeetMinMember in the func handleSchedulingFailure.

If they can meet the minMember, we don't need to update it.
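A sketch of what that condition update might produce (field layout assumed to follow the standard pod condition shape; the actual wiring into the scheduler's failure handler is not shown):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newNotMeetMinMemberCondition builds the PodScheduled=False condition this
// KEP proposes to record when a pod fit on a node but its PodGroup timed out
// before MinMember was reached. Patching it onto pod.status would happen in
// the scheduler's existing failure-handling path.
func newNotMeetMinMemberCondition(podGroup string) corev1.PodCondition {
	return corev1.PodCondition{
		Type:               corev1.PodScheduled,
		Status:             corev1.ConditionFalse,
		Reason:             "CoschedulingNotMeetMinMember",
		Message:            "pod group " + podGroup + " did not reach MinMember before the timeout",
		LastTransitionTime: metav1.Now(),
	}
}
```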

Member

I am not quite clear on this, "that can be scheduled successfully" implies that the pods can be scheduled but not yet, which I understood as this number of pods waiting on permit. Perhaps an example can clarify what this field intends to track and then we can discuss naming.


const (
// PodGroupScheduled represents status of the scheduling
// process for this pod group.
PodGroupScheduled PodGroupConditionType = "PodGroupScheduled"
Member

Where are the other conditions, like Unschedulable or Pending, something like that? And is Scheduled enough rather than PodGroupScheduled, since it's the status of PodGroupCondition?

Member Author

PodGroupScheduled represents that the pod group has been scheduled, not only that it was scheduled successfully.

And is Scheduled enough rather than PodGroupScheduled, since it's the status of PodGroupCondition?

Scheduled and PodGroupScheduled are both OK to me.

@Priyankasaggu11929 Priyankasaggu11929 mentioned this pull request Jun 17, 2022

- If they are both pgPods:
- Compare their pod group's `CreationTimestamp`: the Pod in pod group which has earlier timestamp is positioned ahead of the other.
- If their `CreationTimestamp` is identical, order by their UID of PodGroup: a Pod with lexicographically greater UID is scheduled ahead of the other Pod. (The purpose is to tease different PodGroups with the same `CreationTimestamp` apart, while also keeping Pods belonging to the same PodGroup back-to-back)
Member

A little confusing here: won't we reuse PodGroup like PriorityClass? If yes, it makes no sense to compare the CreationTimestamp of the PodGroup.

Member Author

Sorry, I don't get what you mean by "no sense". Can you give more details?

Member

Sorry, just forget my question. I was thinking about whether we can reuse PodGroup like PriorityClass, but I just realized PodGroup is a stateful workload.
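For reference, a minimal sketch of the ordering described in the hunk above, assuming each pgPod can be resolved to its PodGroup's CreationTimestamp and UID (the non-pgPod branches of the comparison are omitted):

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/types"
)

// groupKey is the PodGroup information assumed to be resolvable for a pgPod.
type groupKey struct {
	CreationTimestamp time.Time
	UID               types.UID
}

// lessPGPods orders two pods that both reference a PodGroup: the pod whose
// group was created earlier goes first; on a timestamp tie, the group with
// the lexicographically greater UID goes first, which keeps pods of the same
// PodGroup back-to-back in the queue.
func lessPGPods(a, b groupKey) bool {
	if !a.CreationTimestamp.Equal(b.CreationTimestamp) {
		return a.CreationTimestamp.Before(b.CreationTimestamp)
	}
	return a.UID > b.UID
}
```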


- The first part is that we put the pods that don't meet MinMember into the WaitingMap and reserve resources until MinMember is met or the timeout is triggered.

- Get the number of Running pods that belong to the same PodGroup
Member

Get by labelSelector? And where is the labelSelector located, at the PodGroup level or the SubSet level?

Member

I ask this question because I find that in SubsetStatus, we will get the pods by label selector too.

Member Author

We don't use a label selector; we add a field called podGroup in pod.spec.

Comment on lines 493 to 495
In PreFilter phase, we need to check if pod.spec.podGroup is nil first. If so, return success. If not, check whether the pod group exists in the scheduler's internal cache and whether pod.spec.podGroup.subset is matched in the pod group; if not, reject the pod in PreFilter.
Member

In PreFilter, it does some preliminary checks in the following sequences:

  • If the pod doesn't carry .spec.podGroup, returns Success immediately. Optionally, plumb a key-value pair in CycleState so as to be leveraged in Permit phase later.
  • If the pod carries .spec.podGroup, verify if the PodGroup exists or not:
    • If not, return UnschedulableAndUnresolvable.
    • If yes, verify if the quorum (MinMember) has been met. If met, return Success; otherwise return UnschedulableAndUnresolvable.
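A minimal sketch of that sequence with stubbed-in status codes (the real plugin would return scheduler-framework statuses and read the PodGroup from its cache; the lookup callbacks here are placeholders):

```go
package sketch

// status is a simplified stand-in for the scheduler framework status codes.
type status int

const (
	success status = iota
	unschedulableAndUnresolvable
)

// preFilter mirrors the sequence above. podGroup is the value of the proposed
// pod.spec.podGroup field ("" when the pod does not reference any group);
// groupExists and quorumMet stand in for lookups against the scheduler cache.
func preFilter(podGroup string, groupExists, quorumMet func(string) bool) status {
	if podGroup == "" {
		// Not part of any PodGroup. The real plugin would also note this in
		// CycleState so Permit can return quickly later.
		return success
	}
	if !groupExists(podGroup) {
		return unschedulableAndUnresolvable
	}
	if !quorumMet(podGroup) {
		return unschedulableAndUnresolvable
	}
	return success
}
```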

Member Author

Done



#### Permit
In the Permit phase, it is divided into two parts.
Member

In Permit, it first leverages the pre-plumbed CycleState variable to quickly return if it's a Pod not associated with any PodGroup.

Next, it counts the number of "cohorted" Pods that belong to the same PodGroup. Under the hood, it reads the scheduler framework's NodeInfos to include internally assumed/reserved Pods. The evaluation formula is:

# of Running + Assumed Pods + 1 >= MinMember (the 1 means the pod itself)
  • If the evaluation result is false, return Wait - which will hold the pod in the internal waitingPodsMap, timing out based on the PodGroup's timeout setting. Meanwhile, proactively move the "cohorted" Pods back to the head of activeQ, so they can be retried immediately. This is a critical optimization to avoid potential deadlocks among PodGroups.
  • If the evaluation result is true, iterate over its "cohorted" Pods to release them (as mentioned above, they were held in the internal waitingPodsMap), and return Success.
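A minimal sketch of that evaluation (the Wait/Allow handling of waitingPodsMap and activeQ is described above and not reproduced here):

```go
package sketch

// permitDecision is a simplified stand-in for the framework's Permit result.
type permitDecision int

const (
	allow permitDecision = iota
	waitForGroup
)

// permit applies the formula above: running and assumed counts come from the
// scheduler's NodeInfo snapshot, and the +1 accounts for the pod currently
// being permitted. On waitForGroup the caller parks the pod in waitingPodsMap
// with the PodGroup's timeout and moves its "cohorted" pods to the head of
// activeQ; on allow it releases the group's waiting pods.
func permit(running, assumed, minMember int32) permitDecision {
	if running+assumed+1 >= minMember {
		return allow
	}
	return waitForGroup
}
```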

Member Author

Done

-->

- Restrict the PodGroup implementation to a specific scheduler

Member

Let's add a bullet stating that PodGroup based preemption would be a non-goal for the alpha implementation.

Member Author

Ok. Added it.

Member

But maybe we could add this to the risks: pods with high priority in a PodGroup might preempt successfully but finally be released for some reason.

#### Story 2
As a service workload user, I want some related deployments to run in the same zone
(with other scheduling directives like PodAffinity) at the same time (using PodGroup);
otherwise, don’t start any of them.
Member

can you give an example workload where this is necessary?

gang-scheduling. But CA will provision machines for all pods. For this part of pods in pod
group, we will add different failure message in condition.reason. Cluster Autoscaler check the
failure reasons in pod.status.condition to decide whether it's possible to resolve the failure
by provisioning new machines. Details are described in the following.
Member

Suggested change
by provisioning new machines. Details are described in the following.
by provisioning new machines. Details are described below.

// an in-progress PodGroup-level scheduling attempt.
// Duration Time is calculated from the time the first Pod in this
// PodGroup gets a resource. If the timeout is reached, the resources
// occupied by the PodGroup will be released to avoid long-term resource waste.
Member

How should users determine this value?

My concern here is that this partially leaks an implementation detail into the API. This timeout will impact how long the scheduler could reserve resources for a partially allocated group. Isn't this dangerous, because users can set it to an arbitrarily high value and thus exacerbate the potential for deadlocks?

Member

Do we provide a default value to prevent resources from being occupied indefinitely?

```

Pods in the same PodGroup with different priorities might lead to unintended behavior,
so we need to ensure Pods in the same PodGroup have the same priority.
Member

probably better to have this in the risks section, also please describe what unintended behavior could be faced


// Total number of non-terminated pods targeted by this subset. (
// their labels match the selector)
Total int32
Member

Explain the difference compared to running, basically this includes running and unscheduled pods, correct?


// ScheduledUnavailable is the number of pods in the subset that can't be
// scheduled successfully.
ScheduledUnavailable int32
Member

The name is confusing to me, why prefix it with scheduled if it can't be scheduled successfully?

Also, can't this be deduced from the other two numbers (total and ScheduledAvailable)?


https://storage.googleapis.com/k8s-triage/index.html
-->

- These cases will be added to the existing integration tests:
Member

we need scale testing as well



After the pod group is scheduled for the first time, the condition `PodGroupScheduled` is
added by the `pod-group-controller` in kube-controller-manager. Then `pod-group-controller`
will reconcile the status of pod group. The workload operator can watch the pod group status,
Member

What about when we complete the pod group scheduling? Will we update the condition, or just delete the podGroup via the controller?

Member

From the text below, "The user waits for the PodGroup to be complete and then deletes it.", I guess the controller will not delete the podGroup automatically. Should we add another condition type to indicate that we finished the pod group scheduling, like Complete as Job does?

Member

Also, the condition for failure.

// PodGroupCondition describes the state of a pod group
// at a certain point.
type PodGroupCondition struct {
// Type of deployment condition.
Member

s/deployment/podGroup/ ?



@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: denkensk
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign ahg-g for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vsoch

vsoch commented Jan 9, 2024

hey @denkensk is there any reason you stopped working on this? We would find this highly useful for one of our needs.

@AxeZhan
Member

AxeZhan commented Feb 19, 2024

hey @denkensk is there any reason you stopped working on this? We would find this highly useful for one of our needs.

+1 for having this. Would like to know the status of this KEP.

@tenzen-y
Member

hey @denkensk is there any reason you stopped working on this? We would find this highly useful for one of our needs.

+1 for having this. Would like to know the status of this KEP.

I guess that we need to move the cluster-autoscaler ProvisioningRequest forward further before we work on this.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2024
@tenzen-y
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2024
@wojtek-t wojtek-t removed their assignment Jun 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 5, 2024
@Okabe-Junya
Member

xref here: #3370 (comment)

I think the closure of this issue should be coupled with #3371. How about either re-opening this issue or closing #3371 as well? (It seems that 3371 has also gone stale....)

@alculquicondor
Member

/close
There is a newer proposal in #4671

@k8s-ci-robot
Contributor

@alculquicondor: Closed this PR.

In response to this:

/close
There is a newer proposal in #4671

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.