Add pod-deletion-cost as a new option #3680

Open
khacminh opened this issue Sep 20, 2022 · 27 comments
Labels: feature-request, needs-discussion, stale-bot-ignore

Comments

@khacminh

Proposal

Hello Keda team,

I suggest that KEDA add an option to use the pod deletion cost feature. With the option set to true, pods created later would get a lower deletion cost, so the deployment scales up and down in order.

Use-Case

Typically, I want the pods added during scale-out to be scheduled on temporary worker nodes. When scaling down, I would like those pods to be deleted first, so that the temporary worker nodes can be released to save cost.

Anything else?

No response

@khacminh added the feature-request and needs-discussion labels Sep 20, 2022
@tomkerkhove moved this to Proposed in Roadmap - KEDA Core Sep 20, 2022
@tomkerkhove
Member

Are you referring to having this option on the scale target? Because that is fully up to KEDA end users; we do not define that. Or am I misinterpreting your suggestion?

@khacminh
Author

khacminh commented Sep 20, 2022

Hi @tomkerkhove,

Thank you for your reply. Let me describe my suggestion in more detail with this example:

My 3-node cluster has KEDA and the cluster autoscaler (or Karpenter) configured. My scaler is configured to scale my deployment between 5 and 15 pods.

Without pod deletion cost

Scaling up

At high traffic, my deployment is gradually scaled up to 15. Three more worker nodes are added, and the pods end up distributed as:

  • Node 1: pod-1, pod-4
  • Node 2: pod-3, pod-5
  • Node 3: pod-2
  • Node 4: pod-6, pod-7, pod-8
  • Node 5: pod-9, pod-10, pod-11
  • Node 6: pod-12, pod-13, pod-14, pod-15

Node 1, Node 2, and Node 3 are the stable nodes.
Node 4, Node 5, and Node 6 are the new nodes.

Scaling down

Now traffic drops, and the scaler scales my deployment down to 10 replicas. Without pod deletion cost, the pods are terminated randomly. For example, the remaining pods could be distributed as:

  • Node 1: pod-1, pod-4
  • Node 2: pod-3
  • Node 3: pod-2
  • Node 4: pod-6, pod-8
  • Node 5: pod-10, pod-11
  • Node 6: pod-12, pod-15

And all 6 nodes are still in use.

When the deployment is scaled back to 5, the distribution could be:

  • Node 1: pod-4
  • Node 2: pod-3
  • Node 3: pod-2
  • Node 4: pod-6
  • Node 5: pod-10
  • Node 6: (no pods)

And only Node 6 can be terminated (automatically) to save cost.


With Pod Deletion Cost

Scaling up

KEDA can add the controller.kubernetes.io/pod-deletion-cost annotation for them as follows:

| Pod    | pod-deletion-cost | Node   |
|--------|-------------------|--------|
| pod-1  | 1000000           | Node-1 |
| pod-2  | 1000000           | Node-3 |
| pod-3  | 1000000           | Node-2 |
| pod-4  | 1000000           | Node-1 |
| pod-5  | 1000000           | Node-2 |
| pod-6  | 999999            | Node-4 |
| pod-7  | 999998            | Node-4 |
| pod-8  | 999997            | Node-4 |
| pod-9  | 999996            | Node-5 |
| pod-10 | 999995            | Node-5 |
| pod-11 | 999994            | Node-5 |
| pod-12 | 999993            | Node-6 |
| pod-13 | 999992            | Node-6 |
| pod-14 | 999991            | Node-6 |
| pod-15 | 999990            | Node-6 |

Scaling down

  • When the deployment is scaled down to 10, Kubernetes terminates the pods with the lowest deletion cost. The remaining pods are:

    | Pod    | pod-deletion-cost | Node   |
    |--------|-------------------|--------|
    | pod-1  | 1000000           | Node-1 |
    | pod-2  | 1000000           | Node-3 |
    | pod-3  | 1000000           | Node-2 |
    | pod-4  | 1000000           | Node-1 |
    | pod-5  | 1000000           | Node-2 |
    | pod-6  | 999999            | Node-4 |
    | pod-7  | 999998            | Node-4 |
    | pod-8  | 999997            | Node-4 |
    | pod-9  | 999996            | Node-5 |
    | pod-10 | 999995            | Node-5 |

Now Node 6 can be terminated to save cost.

  • When the deployment is scaled down to 5, the remaining pods are:

    | Pod   | pod-deletion-cost | Node   |
    |-------|-------------------|--------|
    | pod-1 | 1000000           | Node-1 |
    | pod-2 | 1000000           | Node-3 |
    | pod-3 | 1000000           | Node-2 |
    | pod-4 | 1000000           | Node-1 |
    | pod-5 | 1000000           | Node-2 |

Node 4 and Node 5 can be terminated to save cost.

I hope this example makes the use case clear.
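
To make the annotation scheme in the table above concrete, here is a minimal, hypothetical sketch of how a controller could stamp controller.kubernetes.io/pod-deletion-cost onto a single pod with client-go. The function name, the clientset argument, and the cost value are placeholders for illustration; this is not existing KEDA code.

```go
package podcost // illustrative package name, not part of KEDA

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// annotateDeletionCost patches one pod with the given pod-deletion-cost.
// Pods with lower values are removed first when the ReplicaSet scales down
// (available since Kubernetes v1.21, beta since v1.22).
func annotateDeletionCost(ctx context.Context, cs kubernetes.Interface, namespace, pod string, cost int) error {
	// Strategic merge patch that only touches the annotation.
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"controller.kubernetes.io/pod-deletion-cost":"%d"}}}`, cost))
	_, err := cs.CoreV1().Pods(namespace).Patch(
		ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```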

@JorTurFer
Member

I think this is useful, but I'm not sure it's something we can do from the KEDA side. We could check the pods' annotations and annotate them from max down to 0, reducing the cost for every new instance, but I'm not totally sure this would solve your problem.
I mean, the only thing KEDA can do is annotate every new pod with a lower cost, but that doesn't guarantee the result is correct, because in your scenario pod-7 could be scheduled on node 3 due to other pod evictions.

IDK if this kind of management is under the KEDA umbrella. WDYT @kedacore/keda-contributors?

@tomkerkhove
Member

I think this scenario is nice, but as far as I can see this is fully managed on the scale target and more related to workload scheduling.

If we were to add this, hypothetically, how would you see this work?

@khacminh
Author

khacminh commented Sep 21, 2022

Hi @JorTurFer and @tomkerkhove

Thank you for your replies.

I think this use case is common and useful for any cluster running in the cloud. The ability to schedule min/max replicas based on predictable time frames during the day is great, and in my opinion, combining that with dynamic scaling is even better.

About the implementation, I think we should assume that pod-deletion-cost works as designed. The KEDA controller should take care of the annotation value assigned to pods:

  • When KEDA scales the deployment to min or idle replicas, these pods should have the maximum cost.
  • On scaling up, KEDA reduces the cost for every new pod.

With @JorTurFer's example, pod-7 could be scheduled on node 3 (a stable node). In that scenario, the more pods are scheduled on stable nodes, the fewer temporary nodes are needed, and they can be freed sooner to save cost.


Just one more question:

It seems time-based min/max configuration is not supported by KEDA. What do you think about this feature? I think it's useful for scenarios with a predictable workload.

@tomkerkhove
Member

I'm afraid this is entering scheduler territory more than autoscaling. KEDA is not aware of which pods are being spun up, so we actually cannot achieve this ATM. What do you think @zroubalik @JorTurFer?

What could be done is to use an admission controller and rebalance pods based on the new number of pods, but again, that's more of a scheduling feature than autoscaling.

I do get why you want to have this though.

It seems time-based min/max configuration is not supported by KEDA. What do you think about this feature? I think it's useful for scenarios with a predictable workload.

See the cron scaler.

@khacminh
Author

khacminh commented Sep 22, 2022

Hi @tomkerkhove,

I know that KEDA doesn't manage the pods, and it may take more effort to manage them. My request is about the priority of pods when scaling down. And of course, it's all about cost saving with minimal manual effort and pod rescheduling. Kubernetes didn't have pod deletion cost until v1.21 (alpha). In my opinion, it's a great feature for better resource management, and it would be great if KEDA supported it.


About time-based configuration: I have cron scalers in my application but haven't tried multiple scalers in a ScaledObject yet. I'll try that to combine the conditions.

@JorTurFer
Member

JorTurFer commented Sep 24, 2022

I have been thinking about this. Let's say we agree to implement it: we could use the operator reconciliation loop to check whether any pod (from the workload) is missing the annotation and, if so, annotate it with a lower weight than the others.

Could this be enough? I mean, KEDA can't guarantee that the "easier to remove" pods are on the most recently created nodes, because there can be events in the cluster that move pods and break the order.
E.g., if the 3 instances with the highest values are evicted from their nodes due to CPU/memory pressure on the node, you could go from:

| Pod | Node | Weight |
|-----|------|--------|
| A   | 1    | 10000  |
| B   | 2    | 10000  |
| C   | 1    | 10000  |
| D   | 3    | 9999   |

to:

| Pod   | Node | Weight |
|-------|------|--------|
| D     | 3    | 9999   |
| E (A) | 3    | 9998   |
| F (B) | 4    | 9997   |
| G (C) | 4    | 9996   |

This is correct, but if the pressure on the node disappears, you could eventually end up in this scenario too:

| Pod   | Node | Weight |
|-------|------|--------|
| D     | 3    | 9999   |
| E (A) | 3    | 9998   |
| F (B) | 1    | 9997   |
| G (C) | 2    | 9996   |

What should KEDA do in this case? I mean, KEDA has annotated each pod with currentValue - 1, but the situation is the opposite of your scenario.

@khacminh
Author

Hi @JorTurFer

What do you think about having the KEDA operator reset the weight of all pods to the maximum value when scaling to the minimum/idle replica(s)?

@JorTurFer
Member

Yes, my idea was to define a constant max value and do something like this during the reconciliation loop:

  1. List all pods.
  2. Get currentValue from them (if all of them are unannotated, currentValue is the max value; otherwise, currentValue is the minimum among them).
  3. Annotate each pod with currentValue - 1 (recalculated for every pod).

In this scenario, scaling to zero will reset the counter.
My fear is that not all cases can scale to zero, and the edge case I mentioned could also happen.
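
For illustration only, here is a rough sketch of the reconciliation steps described above, assuming a controller-runtime client and a label selector that matches the scale target's pods. The constant, function, and parameter names are invented for this example and are not part of KEDA.

```go
package podcost // illustrative sketch, not KEDA code

import (
	"context"
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	costAnnotation = "controller.kubernetes.io/pod-deletion-cost"
	maxCost        = 1000000 // reset point, e.g. when the workload is back at min/idle replicas
)

func reconcileDeletionCosts(ctx context.Context, c client.Client, namespace string, selector client.MatchingLabels) error {
	// 1. List all pods that belong to the scale target.
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(namespace), selector); err != nil {
		return err
	}

	// 2. Find the current minimum cost among already-annotated pods;
	//    if none are annotated yet, start from the maximum.
	current := maxCost
	for _, p := range pods.Items {
		if v, ok := p.Annotations[costAnnotation]; ok {
			if n, err := strconv.Atoi(v); err == nil && n < current {
				current = n
			}
		}
	}

	// 3. Annotate every pod that has no cost yet with currentValue - 1,
	//    decrementing once per newly annotated pod.
	for i := range pods.Items {
		p := &pods.Items[i]
		if _, ok := p.Annotations[costAnnotation]; ok {
			continue
		}
		current--
		patch := client.MergeFrom(p.DeepCopy())
		if p.Annotations == nil {
			p.Annotations = map[string]string{}
		}
		p.Annotations[costAnnotation] = fmt.Sprintf("%d", current)
		if err := c.Patch(ctx, p, patch); err != nil {
			return err
		}
	}
	return nil
}
```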

@khacminh
Author

Hi @JorTurFer,

I think scaling to the minimum or idle replica(s) happens frequently, so resetting the counter on that event will help.

@tomkerkhove
Member

I'm not sure this logic is actually correct, nor whether we should do it, because it's better to do nothing than to do it wrongly.

Given this is tightly related to the nodes and their current workload, I'm not sure KEDA should do this (hello scheduler 👋).

Setting all pods to the same maximum value on scale to zero is just a no-op, since it has no impact. I might be missing the logic above, though, so this statement could be wrong.

@khacminh
Author

Hi @tomkerkhove,

Could you please tell me why you suggest resetting the pod deletion cost on scaling to zero, instead of on scaling to minimum/idle replicas?

@tomkerkhove
Member

I just used that because I was reading through the conversation; we'd have the same problem when scaling to the min/idle replica count, IMO.

@khacminh
Author

Hi @tomkerkhove,

IMO, if a deployment is scaled to zero multiple times per day, that should be the scheduler's job; the operations team should end up with serverless schedulers. However, when the minimum/idle replica count is greater than 0, adding pod-deletion-cost will make the service more stable.

@zroubalik
Member

zroubalik commented Oct 14, 2022

I share the concern that this might not be a job for KEDA (so far KEDA doesn't change the target workload, and I'd like to keep it this way); on the other hand, I can see the benefits.

What if we implement this as an add-on: a separate controller deployed next to the KEDA Operator to manage this? It could be an optional component, and after some time we could evaluate whether to include it directly in the KEDA core Operator.

@JorTurFer
Member

JorTurFer commented Oct 14, 2022

I share the concern that this might not be a job for KEDA (so far KEDA doesn't change the target workload, and I'd like to keep it this way); on the other hand, I can see the benefits.

What if we implement this as an add-on: a separate controller deployed next to the KEDA Operator to manage this? It could be an optional component, and after some time we could evaluate whether to include it directly in the KEDA core Operator.

I like this approach. KEDA will stay simple, and based on how it works and its adoption, we could think about integrating it.

@stale

stale bot commented Dec 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Dec 13, 2022
@stale

stale bot commented Dec 20, 2022

This issue has been automatically closed due to inactivity.

stale bot closed this as completed Dec 20, 2022
Repository owner moved this from Proposed to Ready To Ship in Roadmap - KEDA Core Dec 20, 2022
@khacminh
Author

@JorTurFer
Please reopen this issue because it is not completed

@JorTurFer
Member

Yes, we can reopen it, but it was closed due to inactivity.
Our last comments didn't get any answer, so I thought you weren't interested 😅

@JorTurFer reopened this Dec 21, 2022
Repository owner moved this from Ready To Ship to Proposed in Roadmap - KEDA Core Dec 21, 2022
@khacminh
Author

khacminh commented Dec 22, 2022

Hi @JorTurFer
I totally agree with @zroubalik's approach, so I just left an emoji reaction on his comment. I thought that was enough 😅. I'm sorry for not leaving a comment.

stale bot removed the stale label Dec 22, 2022
@JorTurFer
Member

Okay! xD
Are you willing to contribute this improvement?

@JorTurFer added the stale-bot-ignore label Dec 22, 2022
@khacminh
Author

Hi @JorTurFer

For now, I can't make time for this work, but I'll spend some of my free time on the http-add-on. Do you have an add-on repo template and a contributor guide for it?

@JorTurFer
Member

JorTurFer commented Dec 23, 2022

For now, I can't make time for this work, but I'll spend some of my free time on the http-add-on. Do you have an add-on repo template and a contributor guide for it?

Unfortunately, we don't have a contribution guide, because that add-on was maintained by a single person :(
You are not the first to ask about it, so after the holidays we should consider making the effort, if that helps engage new contributors.

@tomkerkhove
Member

There is a contribution guide for it available at https://github.com/kedacore/http-add-on/tree/main/docs.

For add-ons, though, it's good to check our scaler governance in our governance repo. We should first discuss whether this is a third-party or an official add-on.

I think community-based might be better for now.

@zroubalik
Member

Yeah, I'd suggest generating a new Go controller with the Operator SDK, so it stays in sync with KEDA core:
https://sdk.operatorframework.io/build/

And implement the functionality there.
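
Purely as an illustration of that suggestion, a controller scaffolded with the Operator SDK would boil down to a controller-runtime reconciler roughly like the sketch below. The type name, the watched resource (Deployment), and the wiring are assumptions for this example, not an agreed design for the add-on.

```go
package controllers // illustrative skeleton, not an existing project

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodDeletionCostReconciler would annotate the pods behind a Deployment so
// that newer pods carry a lower controller.kubernetes.io/pod-deletion-cost.
type PodDeletionCostReconciler struct {
	client.Client
}

func (r *PodDeletionCostReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var deploy appsv1.Deployment
	if err := r.Get(ctx, req.NamespacedName, &deploy); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// The annotation bookkeeping sketched earlier in the thread would go here:
	// list the Deployment's pods via its selector and assign descending costs.

	return ctrl.Result{}, nil
}

func (r *PodDeletionCostReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}).
		Complete(r)
}
```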
