
VPA daemonset recommendations per-pod based on node metadata #5928

Open
jcogilvie opened this issue Jul 5, 2023 · 15 comments
Labels
area/vertical-pod-autoscaler, kind/feature

Comments

jcogilvie commented Jul 5, 2023

Which component are you using?:
vertical pod autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
Some DaemonSets consist of pods whose resource needs vary with the node they run on, and by their nature they cannot scale horizontally out of this problem.

Consider a Kubernetes cluster running a cluster autoscaler that provisions all manner of different node types based on the cheapest available capacity (e.g., Karpenter using AWS Spot).

With dramatically variable node sizes, a pod belonging to the Datadog agent DaemonSet will require more resources on an instance hosting many pods than a member of the same DaemonSet running on a tiny instance with only a few pods.

Describe the solution you'd like.:

I would like the VPA to (optionally) provide recommendations along an extra dimension for DaemonSets, such as the host's ENI max pods, and to size DaemonSet pods individually along that dimension. The recommender might then suggest a memory configuration for any given pod based on historical memory_consumed/node_max_pods, instead of a single memory value across the whole DaemonSet.
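
To make the ask concrete, here is a purely hypothetical sketch of what this could look like on the VPA object. The recommendationDimensions block and its fields do not exist in the current VPA API; they are invented here only to illustrate the idea:

```yaml
# Hypothetical sketch only: recommendationDimensions is NOT a real VPA field.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: datadog-agent-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: datadog-agent
  updatePolicy:
    updateMode: "Auto"
  # Invented field: ask the recommender to model usage as a function of a
  # per-node attribute (here the node's pod capacity, which on EKS is derived
  # from ENI limits) and emit per-pod recommendations instead of a single
  # value for the whole DaemonSet.
  recommendationDimensions:
    - name: node-max-pods
      nodeField: status.capacity.pods
```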

Describe any alternative solutions you've considered.:

The alternative is to overprovision the DaemonSet by a large margin (wasting resources on small instances), or to limit the variability of node types in the cluster.

Additional context.:

Running on AWS EKS 1.24 with Karpenter.

jcogilvie added the kind/feature label on Jul 5, 2023
@fbalicchia

Hi, thanks for opening this issue; we have the same need.

As of this writing, I think that before addressing this at the VPA level we need the in-place pod resize capability, because today resizing a pod restarts it and we don't know which node it will be bound to afterwards.

Bypassing the scheduler by setting nodeName in the Pod spec could be one approach, but statically assigning a pod to a node is obviously not optimal: the node can fail or become unavailable, and we would not be verifying the node's available capacity before assigning the pod to it.
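
For reference, this is the kind of static assignment I mean; setting nodeName in the Pod spec bypasses the scheduler entirely (the node name and image below are just placeholders):

```yaml
# Static node assignment: the scheduler is bypassed, so node failures and
# available capacity are not considered. Names below are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: agent-pinned
spec:
  nodeName: ip-10-0-12-34.us-east-1.compute.internal  # placeholder node name
  containers:
    - name: agent
      image: example.com/agent:latest  # placeholder image
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
```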

On the other hand, extending the default scheduler to handle binding pods to nodes could be a possible solution, but I have some concerns about that approach. WDYT?

@jcogilvie (Author)

Thanks for the comment @fbalicchia. I'm not an expert in this space, so I can't really speak to your suggestions.

I don't know when a pod's affinity is determined, but I know that if I inspect the pods of a DaemonSet, each one has an explicit affinity for the node it is intended to run on. If that information is available early enough, it might be usable for this case.
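
For example, a DaemonSet pod today carries a node affinity roughly like the following, added by the DaemonSet controller (the node name below is a placeholder for whichever node the pod was created for):

```yaml
# Node affinity as it appears on a DaemonSet pod; the value is the target node.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchFields:
            - key: metadata.name
              operator: In
              values:
                - ip-10-0-12-34.us-east-1.compute.internal  # placeholder node name
```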

@jbartosik (Collaborator)

This is one of the things I was thinking of supporting once we have in-place update support (#4016).

I'd rather do something that supports multiple similar use cases (one workload whose instances have somewhat different resource requirements), with DaemonSets covered as one of those cases, than build a dedicated feature just for DaemonSets.

@jcogilvie (Author)

I think that's a good goal @jbartosik. I'm having trouble thinking of how to generalize this to all deployments. Are you giving up on the idea of prediction and just deferring the decision until runtime?

One of the valuable elements of this suggestion is that you would know beforehand how big a specific pod is likely to be based on some external, measurable factor.
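
One way to formalize that (my own sketch, not anything the VPA does today) is to fit a simple per-DaemonSet model from historical samples and evaluate it per node:

$$
\widehat{\text{memory}}(n) = \beta_0 + \beta_1 \cdot \text{maxPods}(n)
$$

where maxPods(n) is node n's pod capacity (e.g. ENI-derived max pods on EKS) and the coefficients are fit from observed (max pods, peak memory) pairs for that DaemonSet.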

@jbartosik (Collaborator)

I don't have a specific proposal yet, just some ideas. Like I wrote I think this is something to take a look at after we have support for in-place updates.

We need a way to detect pods that have unusual resource usage for their deployment. Waiting for actual usage data to come in is one way we could detect that; another is using different metrics (similar to how you proposed using node size here).

@bernot-dev

we need the in-place pod resize capability, because today resizing a pod restarts it and we don't know which node it will be bound to afterwards.

Why is this a blocker? If a pod is resized and then rescheduled to a different node, it seems like it just needs to respect any existing affinities.

@jbartosik (Collaborator)

we need the in-place pod resize capability, because today resizing a pod restarts it and we don't know which node it will be bound to afterwards.

Why is this a blocker? If a pod is resized and then rescheduled to a different node, it seems like it just needs to respect any existing affinities.

My guess is that it's something about the node that makes the resource usage of different pods in the same DaemonSet different (node size, number of pods, amount of logging happening, ...).

So if we don't know which node a pod will land on, we don't know how many resources it will need.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jan 29, 2024
@jcogilvie (Author)

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 28, 2024
@jcogilvie (Author)

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Apr 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jul 29, 2024
@jcogilvie (Author)

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Jul 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Oct 27, 2024
@jcogilvie (Author)

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Oct 28, 2024