KEP-3370: add native PodGroup api #3371
Conversation
/cc
### Implementation
The implementation details are intentionally left for different schedulers.
In terms of Kubernetes default scheduler, we will reveal the details in ano…
I'm worried that the proposed API may turn out to be incompatible with autoscaling, as I described in kubernetes/kubernetes#105802 (comment). I don't see any obvious way to support PodGroup without delaying the decision in the Filter phase and making it in some later phase of the scheduling/binding cycle, which is completely incompatible with the greedy algorithm used by the autoscaler.
I don't think we should introduce APIs without having a reasonable belief that we will be able to actually support them. I'm not saying we need a detailed design here, but I think we should at least have a general idea that all relevant sigs agree is feasible.
@MaciekPytel the API needs to support integration with other components like CA. Here is the big picture:
- once a Pod fails to schedule, the scheduler gives a failure message (in pod.status.conditions[*]) that CA can read
- by combining the failure reasons with the PodGroup object's spec and status info, CA can decide whether it's possible to resolve the failure by provisioning new machines (correct me if I'm wrong; a rough sketch of such a condition is below)
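To make the integration point concrete, here is a minimal, purely illustrative sketch, assuming the `CoschedulingNotMeetMinMember` reason proposed later in this thread; the helper name, message text, and field choices are not part of the KEP.

```go
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// coschedulingFailureCondition builds the kind of condition the scheduler
// could attach to a pending Pod so that CA can tell the Pod is blocked only
// by its PodGroup quorum. Everything here is illustrative, not the KEP API.
func coschedulingFailureCondition(schedulable, minMember int32) corev1.PodCondition {
	return corev1.PodCondition{
		Type:               corev1.PodScheduled,
		Status:             corev1.ConditionFalse,
		Reason:             "CoschedulingNotMeetMinMember", // reason proposed in this KEP
		Message:            fmt.Sprintf("pod group quorum not met: %d/%d members schedulable", schedulable, minMember),
		LastTransitionTime: metav1.Now(),
	}
}
```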
More importantly, I don't think we can accept this KEP without an implementation plan for kube-scheduler.
I can think of two high-level implementation approaches:
- The one that the existing co-scheduling approach is using: Still take a decision for each pod, and hold them in the Wait phase. This needs clarification on what conditions kube-scheduler would set on pods and podgroup status and how cluster-autoscaler would use them.
- Completely ignore the pods that reference a pod group. The scheduler and autoscaler take a decision on the pod group, not the pods. Only once the podgroup is satisfied and the resources are reserved, the scheduler starts actually scheduling the individual pods. This needs more details around minPods. Also, this implementation is closer to what kueue is doing, so we could potentially use the same API for both. And it's closer to bit.ly/k8s-reservations by @ahg-g. In a high throughput batch cluster, customers are concerned about every object creation. So reducing the number of objects is desirable. And this is my main concern with the k8s-reservations proposal as it is now too.
@MaciekPytel, which high level approach do you think is more appropriate for cluster-autoscaler?
Just echoing that we need a detailed implementation plan, this KEP can't move forward without it. I think we need to join forces and try to refine the reservations idea because I feel it will not only address the co-scheduling case, but other cases as well and likely to be more compatible with CA.
We do need to fill in the implementation section, as I commented in #3371 (comment).
For the time being, approach 1 is more feasible and would meet the bar for an alpha feature. But I don't see a big difference between 1 and 2 in terms of the scheduler implementation. In approach 2, we still need to schedule pods one by one; we cannot take the "reservation" signal as guaranteed, right?
Regarding reservation, I understand this is a de facto feature of batch scheduling, and also the prerequisite of implementing backfilling. However, I don't interpret it as a blocker for coscheduling. I envision coscheduling as a standalone feature, just like other scheduler plugins, and if someday an external controller (like kueue) or core k8s provides the semantics of reservation, it will improve the performance of coscheduling seamlessly/automatically.
> I think an approach that decouples co-scheduling multi-pod check from scheduling of individual pods is much more likely to work with autoscaler.

I think this is compatible with the idea I discussed above and is based on bit.ly/k8s-reservations; basically a group of pods is scheduled as all or nothing by incrementally reserving space on the nodes one pod at a time. For every capacity reserved for a pod in the group, there is an object representing that reserved capacity that the scheduler, kubelet and CA are all aware of. The scheduler will need to be aware of the group to avoid partially scheduled groups leading to deadlocks, but the scheduling decision is still made one pod/reservation at a time. An external controller like the pod group controller described here will be responsible for declaring a group scheduled when a specific limit is reached, and could also include logic to address potential deadlocks across groups through eviction.
The above approach is eventually consistent; it might be slower to schedule a group, but I think co-scheduling is inherently slow since we are trying to coordinate resources across a group of nodes. Coupling this with queueing, where workloads are started only when their quota is available, will make this less of an issue.

> Some sort of approach where we "schedule" PodGroup as a whole and not do a check for each pod that is a member of the group (which is how I understand option 2 in #3371 (comment)) seems like another feasible approach.

If by this you are suggesting that we would pass to Filter (or a new SuperFilter extension) the group of pods and a list of nodes, and expect it to return a subset of nodes that fit the group, then I think it will be difficult to do and a major change to the kube-scheduler architecture.
> assuming minPods check is not part of "scheduling" individual reservation, but something we check separately

This won't bother CA: if the PodGroup fails due to not meeting quorum, the scheduler will give an explicit error message, and CA can simply bypass those pods. (It's much like how CA handles nominated Pods today.)
Regarding the idea of a "BulkFilter", I agree with @ahg-g that it may not be the direction of the scheduler framework. However, it's fine to implement a "BulkFilter" function (either in the scheduler or in CA); CA would call it with a batch of unschedulable Pods along with the "new" nodes, and end up with a binary result. @MaciekPytel do you think it's possible?
@ahg-g I agree with you: bit.ly/k8s-reservations seems like an idea that will work.
@Huang-Wei
I think this is a different case from nominated pods. In the pod preemption scenario the scheduler is effectively saying the pod will be able to schedule shortly and no action is required from CA (a scale-up may be needed later for pods that will be preempted, but those pods will trigger the scale-up normally once they go pending).
With co-scheduling the situation is different. Consider an example with 4 pods that request 1 CPU each and use co-scheduling, while there are only 3 nodes with 1 CPU each in the cluster. The pods will never be able to schedule without CA intervention, and the expected result should clearly be CA adding 1 more node so the pods become schedulable.
In order to trigger a correct scale-up, it's not enough for CA to ignore the pods or to have only the vague information that they're all pending because a co-scheduling constraint is not met. CA must be able to run a simulation that shows that adding just 1 node will make all those pending pods schedulable.
Re: "BulkFilter" - I'm worried about the scalability of this approach in CA. The whole point of the CA simulation is to figure out the minimum number of nodes needed to schedule all the pods. We currently do it by adding nodes one by one until there are enough to schedule all the pending pods.
With "BulkFilter" we don't know how many "new nodes" we need to add to schedule all the pods. I'm guessing CA would need to call it in a loop with an increasing number of "new nodes" until all the pods are scheduled? The problem is that this means we need N separate binpacking runs to find that we need N new nodes. We can try to improve it by doing a binary search or similar, but that is still much more computationally expensive than the current algorithm.
@MaciekPytel Re: the similarity to nominated pods, I meant the case where we have enough resources and it's just the quorum that is not met (like 3 pods out there, but we require 4); CA should just ignore the quorum-not-met pods, and that info is specified by the scheduler. Sorry if I didn't make it clear.
Re: the example you listed: that is exactly the real CA support for co-scheduling, and we do need to let CA be aware that only 1 additional node is needed to schedule all 4 pods.
Re: "BulkFilter": actually that's the cost of supporting co-scheduling, which IMO is inevitable. Even with reservations, every reservation needs to be simulated and then secured. After reading the pseudo-code in kubernetes/kubernetes#105802 (comment), I think the gap is how to pass the "(temporary) internal assumed/reserved" (or equivalent) info from the scheduler to CA. Say in the above example, 3 out of 4 don't need extra resources; if CA knows it only needs to simulate the 4th pod, then it would be easier - all other logic can stay intact.
Reservations can be scheduled incrementally until the group is satisfied. In a non-autoscaled cluster, that may be a case of waiting for existing pods to finish. With CA, that may mean multiple scale-ups, potentially in different node groups.
Can you add sig-autoscaling as a participating SIG and have an approver from it? At a high level, we should have an implementation path that is compatible with the cluster autoscaler.
In this design, a PodGroupStatus is needed to record the latest status. This
enables users to quickly know what’s going on with this PodGroup; also some
integrators (like ClusterAutoscaler) can rely on it to make pending PodGroup…
You probably need to clarify when the status will be updated with respect to the Pod unschedulable condition (is this going to be updated if the PodGroup is not satisfied yet)?
also what happens if one of the pods gets deleted after the group was scheduled? There are likely cases where a pod (or more) getting deleted can be considered ok, and other cases where the whole group should be preempted.
> You probably need to clarify when the status will be updated with respect to the Pod unschedulable condition (is this going to be updated if the PodGroup is not satisfied yet)?

Added in the scheduling part of the implementation.

> also what happens if one of the pods gets deleted after the group was scheduled? There are likely cases where a pod (or more) getting deleted can be considered ok, and other cases where the whole group should be preempted.

In the current use case, if one (or more) pods are deleted, the outcome is determined by the workload controller: whether to continue, rebuild, or mark a failure.
How will UX be reviewed, and by whom?

Consider including folks who also work outside the SIG or subproject.
-->
To mention a few:
- deadlocks
- incompatibility with CA
I added the part about incompatibility with CA. But what does "deadlocks" mean here?
depending on the implementation, we could end up in a situation where two groups are partially assigned resources, but neither of them is fully scheduled.
Added.
const (
	// PodGroupScheduled represents status of the scheduling
	// process for this pod group.
	PodGroupScheduled PodGroupConditionType = "PodGroupScheduled"
is this constantly reconciled? what if one of the pods gets deleted after initially scheduled?
It is not constantly reconciled. After the podGroup is scheduled for the first time, the condition "PodGroupScheduled" is added. Then we just need to update the status message via reconciliation.
we need to document those semantics: pods getting deleted, and in general number of scheduled pod falling below MinMember, after the group is scheduled; and whether or not we will support evicting the whole group when that happens.
Added.

> and whether or not we will support evicting the whole group when that happens.

I think it all depends on the workload operator. If the total running count is less than MinMember, the workload operator can determine whether it needs to evict the whole group or just create a new pod.
# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.25"
This is not reasonable given that the KEP is not complete and the deadline is close.
### Cluster Autoscaler

Cluster Autoscaler checks the failure reasons in pod.status.conditions to decide whether it's possible to resolve the failure by provisioning new machines. If the reason is `CoschedulingNotMeetMinMember`, CA doesn't need to provision new machines.
This feels hand wavy and far from concrete; can we collaborate with a CA expert to detail the possible implementation in CA?
/cc @MaciekPytel @x13n
It may look like this in the autoscaler: https://github.com/kubernetes/autoscaler/blob/6558beddd3d50c4653f45bad7958bf41e050f005/cluster-autoscaler/core/static_autoscaler.go#L405
We can filter out the pods with `CoschedulingNotMeetMinMember` in the autoscaler.
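As a purely illustrative sketch (not existing autoscaler code, and the helper name is invented), the filtering mentioned here could look roughly like the following, assuming the scheduler sets the `CoschedulingNotMeetMinMember` reason on the `PodScheduled=False` condition as discussed above; note the next comment questions whether skipping these pods is actually the right behavior.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// dropQuorumNotMetPods removes pods whose only reported problem is that their
// PodGroup quorum is not met yet, so they don't trigger a scale-up on their own.
func dropQuorumNotMetPods(pending []*corev1.Pod) []*corev1.Pod {
	var kept []*corev1.Pod
	for _, p := range pending {
		quorumOnly := false
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodScheduled &&
				c.Status == corev1.ConditionFalse &&
				c.Reason == "CoschedulingNotMeetMinMember" { // reason proposed in this KEP
				quorumOnly = true
				break
			}
		}
		if !quorumOnly {
			kept = append(kept, p)
		}
	}
	return kept
}
```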
Wouldn't filtering these pods cause them to wait for scale-up indefinitely? CA should provision nodes to schedule the entire PodGroup if possible, but it is quite unclear to me whether it should provision partial resources: in some scenarios that might be desirable, e.g. when different Subsets require scaling up different NodeGroups (in which case multiple scale-ups are required).
A naive question: does CA support scaling up specific NodePools today, say when I have particular resource requirements? @x13n
NodePool is not a k8s concept. That being said, CA operates on NodeGroups, since most cloud providers implement such a concept (e.g. node pools in GKE). It is quite possible for nodes in one NodeGroup to have a certain label specified, so let's say:
- `pod1` has nodeSelector `label1=value1`
- `pod2` has nodeSelector `label2=value2`
- nodes in `nodeGroup1` have `label1=value1`
- nodes in `nodeGroup2` have `label2=value2`

In each iteration, CA picks the best node group to scale up. In the scenario above, neither `nodeGroup1` nor `nodeGroup2` can satisfy both `pod1` and `pod2`, so 2 subsequent scale-ups are required. If these pods belong to the same PodGroup, the resources for them need to be provisioned incrementally. Subsequent scheduling will follow all-or-nothing semantics, so it will wait for all scale-ups to take place.
- The second part: in order for pods belonging to the same PodGroup to be scheduled as a whole without being broken up, we will leverage the feature added in https://github.com/kubernetes/kubernetes/pull/103383 to move pods of the same PodGroup back to the internal scheduling activeQ instantly.
#### UnReserve
what about WaitForFirstConsumer type volumes?
/cc @msau42
WaitForFirstConsumer happens in the PreBind phase, after the Permit phase, so it won't be affected by gang scheduling here.
One limitation of the delayed volume binding is that it currently doesn't have the notion that a group of volumes would want to be provisioned together. So what could happen is you have a volume provisioned successfully for one pod, but then fails for the second. Or even if you have multiple volumes in a single pod.
It would be interesting to explore if this could help address that use case too.
// ScheduledAvailable is the number of pods in the subset that can be scheduled
// successfully, but the scheduling fails because minMember cannot be met.
ScheduledAvailable int32
I assume this and the field below will be updated by the scheduler? how often and when will it do that?
No, not in kube-scheduler. It will be updated by the pod-group-controller in kube-controller-manager via the reconcile mechanism. It will use `pod.condition.reason = CoschedulingNotMeetMinMember` to determine which class the pod belongs to.
oh, so for each pod waiting on permit we will still send a pod update?
If the pod group times out, it needs to release the resources it is occupying. For the pods in the group that can be scheduled successfully but fail because minMember cannot be met, we need to update the pod condition to `CoschedulingNotMeetMinMember` in the func `handleSchedulingFailure`.
If they can meet the minMember, we don't need to update it.
I am not quite clear on this, "that can be scheduled successfully" implies that the pods can be scheduled but not yet, which I understood as this number of pods waiting on permit. Perhaps an example can clarify what this field intends to track and then we can discuss naming.
const (
	// PodGroupScheduled represents status of the scheduling
	// process for this pod group.
	PodGroupScheduled PodGroupConditionType = "PodGroupScheduled"
Where are the other conditions, like `Unschedulable` or `Pending`, something like that? And is `Scheduled` enough, rather than `PodGroupScheduled`, since it's already the status of a `PodGroupCondition`?
`PodGroupScheduled` represents that the pod group has gone through scheduling, not only that it was scheduled successfully.

> And whether Scheduled enough rather than PodGroupScheduled since it's the status of PodGroupCondition.

`Scheduled` or `PodGroupScheduled` are both ok to me.
- If they are both pgPods:
  - Compare their pod group's `CreationTimestamp`: the Pod in the pod group which has the earlier timestamp is positioned ahead of the other.
  - If their `CreationTimestamp` is identical, order by their UID of PodGroup: a Pod with a lexicographically greater UID is scheduled ahead of the other Pod. (The purpose is to tease different PodGroups with the same `CreationTimestamp` apart, while also keeping Pods belonging to the same PodGroup back-to-back.)
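The ordering quoted above can be sketched as a comparison function; the `groupKey` type below is a hypothetical, trimmed-down stand-in for the PodGroup fields the comparison needs, not the KEP's API.

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// groupKey carries only what the queue-sort comparison needs.
type groupKey struct {
	CreationTimestamp metav1.Time
	UID               string
}

// less orders two pods that both belong to PodGroups: the earlier group
// CreationTimestamp wins; ties are broken by the lexicographically greater
// group UID, which keeps members of the same group back-to-back.
func less(a, b groupKey) bool {
	if !a.CreationTimestamp.Equal(&b.CreationTimestamp) {
		return a.CreationTimestamp.Before(&b.CreationTimestamp)
	}
	return a.UID > b.UID
}
```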
A little confusing here: won't we reuse `PodGroup` like `PriorityClass`? If yes, it makes no sense to compare the `CreationTimestamp` of the PodGroup.
Sorry, I don't get what you mean by "makes no sense". Can you give more details?
Sorry, just forget my question. I was thinking about whether we can reuse PodGroup like PriorityClass, but I just realized PodGroup is a stateful workload.
- The first part is that we put the pods that don't meet MinMember into the WaitingMap and reserve resources until MinMember is met or the timeout is triggered.
- Get the number of Running pods that belong to the same PodGroup
Get them by labelSelector? And where is the labelSelector located, at the PodGroup level or the Subset level?
I ask this question because I find that in `SubsetStatus`, we will get the pods by label selector too.
We don't use a label selector; we add a field called `podGroup` in `pod.spec`.
In PreFilter phase, we need to check if pod.spec.podGroup is nil first. If so, return success. If not, check if the pod group exists in the scheduler's internal cache and pod.spec.podGroup.subset is matched in the pod group; if not, reject the pod in PreFilter.
In PreFilter, it does some preliminary checks in the following sequence (a rough sketch follows the list):
- If the pod doesn't carry `.spec.podGroup`, return `Success` immediately. Optionally, plumb a key-value pair into CycleState so it can be leveraged in the `Permit` phase later.
- If the pod carries `.spec.podGroup`, verify whether the PodGroup exists:
  - If not, return `UnschedulableAndUnresolvable`.
  - If yes, verify whether the quorum (MinMember) has been met. If met, return `Success`; otherwise return `UnschedulableAndUnresolvable`.
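A minimal sketch of that sequence, using hypothetical stand-in types (`podGroupRef`, `groupState`) rather than the real framework plumbing; exactly how the quorum is counted here (e.g. whether the incoming pod itself counts) is an assumption that echoes the Permit formula further down.

```go
package sketch

// podGroupRef and groupState are hypothetical stand-ins for the proposed
// pod.spec.podGroup reference and the scheduler's in-memory PodGroup cache.
type podGroupRef struct{ Name string }

type groupState struct {
	MinMember int
	Members   int // running + internally assumed members of the group
}

type status string

const (
	success                      status = "Success"
	unschedulableAndUnresolvable status = "UnschedulableAndUnresolvable"
)

// preFilter mirrors the sequence described above.
func preFilter(ref *podGroupRef, cache map[string]groupState) status {
	if ref == nil {
		return success // not part of any PodGroup
	}
	g, ok := cache[ref.Name]
	if !ok {
		return unschedulableAndUnresolvable // dangling PodGroup reference
	}
	if g.Members+1 >= g.MinMember { // quorum check, counting this pod
		return success
	}
	return unschedulableAndUnresolvable
}
```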
Done
#### Permit
In Permit phase, it is divided into two parts.
In Permit, it first leverages the pre-plumbed CycleState variable to return quickly if the Pod is not associated with any PodGroup.
Next, it counts the number of "cohorted" Pods that belong to the same PodGroup. Under the hood, it reads the scheduler framework's NodeInfos to include internally assumed/reserved Pods. The evaluation formula (sketched below) is:

    # of Running + Assumed Pods + 1 >= MinMember   (the 1 is the pod itself)

- If the evaluation result is false, return `Wait`, which will hold the pod in the internal `waitingPodsMap` and time it out based on the PodGroup's timeout setting. Meanwhile, proactively move the "cohorted" Pods back to the head of the activeQ, so they can be retried immediately. This is a critical optimization to avoid potential deadlocks among PodGroups.
- If the evaluation result is true, iterate over its "cohorted" Pods to unhold them (as mentioned above, they were held in the internal waitingPodsMap), and return `Success`.
Done
-->

- Restrict the PodGroup implementation to a specific scheduler
Let's add a bullet stating that PodGroup-based preemption would be a non-goal for the alpha implementation.
Ok. Added it.
But maybe we could add this to risks: pods with high priority in a PodGroup might preempt successfully but finally be released for some reason.
#### Story 2
As a service workload user, I want some related deployments to run in the same zone
(with other scheduling directives like PodAffinity) at the same time (using PodGroup);
otherwise, don’t start any of them.
can you give an example workload where this is necessary?
gang-scheduling. But CA will provision machines for all pods. For this part of pods in pod
group, we will add different failure message in condition.reason. Cluster Autoscaler check the
failure reasons in pod.status.condition to decide whether it's possible to resolve the failure
by provisioning new machines. Details are described in the following.
Suggested change:
- by provisioning new machines. Details are described in the following.
+ by provisioning new machines. Details are described below.
// an in-progress PodGroup-level scheduling attempt.
// Duration Time is calculated from the time the first Pod in this
// PodGroup gets a resource. If the timeout is reached, the resources
// occupied by the PodGroup will be released to avoid long-term resource waste.
How should users determine this value?
My concern here is that this is partially leaking an implementation detail to the API. This timeout will impact how long the scheduler could reserve resources for a partially allocated group. Isn't this dangerous, because users can set this to some arbitrarily high value and so exacerbate the potential for deadlocks?
Do we provide a default value to prevent resources from being occupied indefinitely?
Pods in the same PodGroup with different priorities might lead to unintended behavior,
so we need to ensure Pods in the same PodGroup have the same priority.
probably better to have this in the risks section, also please describe what unintended behavior could be faced
// Total number of non-terminated pods targeted by this subset. (
// their labels match the selector)
Total int32
Explain the difference compared to running, basically this includes running and unscheduled pods, correct?
// ScheduledUnavailable is the number of pods in the subset that can't be
// scheduled successfully.
ScheduledUnavailable int32
The name is confusing to me, why prefix it with scheduled if it can't be scheduled successfully?
Also, can't this be deduced from the other two numbers (total and ScheduledAvailable)?
https://storage.googleapis.com/k8s-triage/index.html
-->

- These cases will be added to the existing integration tests:
we need scale testing as well
After the pod group is scheduled for the first time, the condition `PodGroupScheduled` is
added by the `pod-group-controller` in kube-controller-manager. Then `pod-group-controller`
will reconcile the status of the pod group. The workload operator can watch the pod group status,…
What about when the pod group scheduling completes? Will we update the condition, or will the controller just delete the podGroup?
From the text below, "The user waits for the PodGroup to be complete and then deletes it.", I guess the controller will not delete the podGroup automatically. Should we add another condition type to indicate that the pod group scheduling is finished, like `Complete` as Job does?
Also, the condition for failure.
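The fields of `PodGroupCondition` beyond `Type` are not quoted in this thread, so the shape below is assumed from the usual Kubernetes condition conventions; `setPodGroupScheduled` is a hypothetical helper illustrating how the pod-group-controller might upsert the condition during reconciliation, with a `Reason` that could distinguish success from failure as requested here.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type PodGroupConditionType string

const PodGroupScheduled PodGroupConditionType = "PodGroupScheduled"

// PodGroupCondition is an assumed shape, following standard condition fields.
type PodGroupCondition struct {
	Type               PodGroupConditionType
	Status             corev1.ConditionStatus
	LastTransitionTime metav1.Time
	Reason             string
	Message            string
}

// setPodGroupScheduled upserts the PodGroupScheduled condition, bumping the
// transition time only when the status actually changes.
func setPodGroupScheduled(conds []PodGroupCondition, status corev1.ConditionStatus, reason, msg string) []PodGroupCondition {
	now := metav1.Now()
	for i := range conds {
		if conds[i].Type != PodGroupScheduled {
			continue
		}
		if conds[i].Status != status {
			conds[i].LastTransitionTime = now
		}
		conds[i].Status, conds[i].Reason, conds[i].Message = status, reason, msg
		return conds
	}
	return append(conds, PodGroupCondition{
		Type:               PodGroupScheduled,
		Status:             status,
		LastTransitionTime: now,
		Reason:             reason,
		Message:            msg,
	})
}
```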
// PodGroupCondition describes the state of a pod group
// at a certain point.
type PodGroupCondition struct {
	// Type of deployment condition.
s/deployment/podGroup/ ?
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: denkensk. Needs approval from an approver in each of these files.
hey @denkensk is there any reason you stopped working on this? We would find this highly useful for one of our needs.
+1 for having this. Would like to know the status of this KEP.
I guess we need to push the cluster-autoscaler ProvisioningRequest forward further before we work on this.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.
/lifecycle rotten
xref here: #3370 (comment)
/close
@alculquicondor: Closed this PR.
Signed-off-by: Alex Wang [email protected]