VPA: prune stale container aggregates, split recommendations over true number of containers #6745
Hi @jkyros. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
// TODO(jkyros): This only removes the container state from the VPA's aggregate states, there
// is still a reference to them in feeder.clusterState.aggregateStateMap, and those get
// garbage collected eventually by the rate limited aggregate garbage collector later.
// Maybe we should clean those up here too since we know which ones are stale?
Is it a lot of extra work to do that? Do you see any risks doing it here?
No, I don't think it's a lot of extra work, it should be reasonably cheap to clean them up here since it's just deletions from the other maps if the keys exist, I just didn't know all the history.
It seemed possible, at least, that we were intentionally waiting to clean up the aggregates so that an unexpected hiccup wouldn't immediately blow away all the aggregate history we worked so hard to get. (Like maybe someone oopses, deletes their deployment, then puts it back? Right now we don't have to start over: the pods come back in, find their container aggregates, and resume. But if I clean them up here, we have to start over...)
Cleaning the map is cheap, so I don't mind handling it here. However, in cases like the one mentioned above (where someone deletes a deployment and recreates it immediately), I think it's better to leave this logic to be handled by gc.
@adrianmoisey, any thoughts on this?
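To make the trade-off in this thread concrete, here is a minimal sketch of the two behaviours. The `aggregateKey` and `aggregate` types and the `pruneStale` helper are hypothetical stand-ins, not the recommender's real types; the only point is that pruning the VPA's own view leaves the cluster-wide entry available for a recreated workload, while pruning both discards the history immediately.

```go
package main

import "fmt"

// aggregateKey and aggregate are hypothetical stand-ins for the
// recommender's aggregate key and container aggregate state.
type aggregateKey struct {
	namespace     string
	containerName string
}

type aggregate struct {
	samples int
}

// pruneStale always removes the stale keys from the VPA-local view.
// Only when alsoGlobal is true does it delete them from the cluster-wide
// map as well, which drops history a recreated workload could have reused.
func pruneStale(vpaView, clusterMap map[aggregateKey]*aggregate, stale []aggregateKey, alsoGlobal bool) {
	for _, k := range stale {
		delete(vpaView, k)
		if alsoGlobal {
			delete(clusterMap, k)
		}
	}
}

func main() {
	key := aggregateKey{namespace: "default", containerName: "hamster2"}
	agg := &aggregate{samples: 120}
	clusterMap := map[aggregateKey]*aggregate{key: agg}
	vpaView := map[aggregateKey]*aggregate{key: agg}

	// Leave the cluster-wide entry for the rate-limited garbage collector.
	pruneStale(vpaView, clusterMap, []aggregateKey{key}, false)
	fmt.Println("vpa view:", len(vpaView), "cluster map:", len(clusterMap))
}
```

With `alsoGlobal` set to false, a pod that comes back with the same container name can still find its old aggregate in the cluster-wide map, which is the "oops, deleted my deployment" case described above.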
// the correct number and not just the number of aggregates that have *ever* been present. (We don't want minimum resources
// to erroneously shrink, either)
func (cluster *ClusterState) setVPAContainersPerPod(pod *PodState) {
	for _, vpa := range cluster.Vpas {
I'm wondering if there is already a place where this logic could go so we don't have to loop over all VPAs for every pod again here.
In large clusters with a VPA to Pod ratio that's closer to 1 this could be a little wasteful.
Hmm, yeah, I struggled with finding a less expensive way without making too much of a mess. Unless I'm missing something (and I might be) we don't seem to have a VPA <--> Pod map -- probably because we didn't need one until now? At the very least I think I should gate this to only run if the number of containers in the pod is > 1.
Like, I think our options are:
- update the VPA as the pods roll through (which requires me to find the VPA for each pod like I did here) or
- count the containers as we load the VPAs (but we load the VPAs before we load the pods, so we'd have to go through the pods again, so that doesn't help us)
- have the VPA actually track the pods it's managing, something like this: jkyros@6ddc208 (could also just be an array of PodIDs and we could look up the state so we could save the memory cost of the PodState pointer, but you know what I mean)
I put it where I did (option 1) because at least LoadPods() was already looping through all the pods so we could freeload off the "outer" pod loop and I figured we didn't want to spend the memory on option 3. If we'd entertain option 3 and are okay with the memory usage, I can totally do that?
I believe option 3 is the best approach. We can implement this in a follow-up PR.
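For reference, a rough sketch of what option 3 could look like, assuming the VPA keeps only a set of pod IDs and looks the pod state up on demand. All names here (`vpaState`, `trackPod`, `refreshContainersPerPod`) are hypothetical, and taking the largest container count among the tracked pods is an assumption made for the example, not something decided in this thread.

```go
package main

import "fmt"

// podID, podState and vpaState are hypothetical, simplified stand-ins.
type podID struct{ namespace, name string }

type podState struct{ containers []string }

type vpaState struct {
	// podIDs is a set of pods matched by this VPA; storing IDs rather
	// than *podState avoids holding an extra pointer per pod.
	podIDs           map[podID]struct{}
	containersPerPod int
}

// trackPod records that the pod is managed by this VPA.
func (v *vpaState) trackPod(id podID) {
	if v.podIDs == nil {
		v.podIDs = make(map[podID]struct{})
	}
	v.podIDs[id] = struct{}{}
}

// refreshContainersPerPod walks only this VPA's pods, forgets pods that
// are no longer in the cluster state, and records the largest container
// count seen (an assumption for this sketch).
func (v *vpaState) refreshContainersPerPod(pods map[podID]*podState) {
	most := 0
	for id := range v.podIDs {
		pod, ok := pods[id]
		if !ok {
			delete(v.podIDs, id) // deleting during range is safe in Go
			continue
		}
		if n := len(pod.containers); n > most {
			most = n
		}
	}
	v.containersPerPod = most
}

func main() {
	oldPod := podID{namespace: "default", name: "hamster-old"}
	newPod := podID{namespace: "default", name: "hamster-new"}
	pods := map[podID]*podState{
		newPod: {containers: []string{"hamster"}},
	}

	v := &vpaState{}
	v.trackPod(oldPod) // already gone from the cluster state
	v.trackPod(newPod)
	v.refreshContainersPerPod(pods)
	fmt.Println("containers per pod:", v.containersPerPod)
}
```

The cost is one map entry per matched pod; the benefit is that no loop over every VPA is needed for every pod.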
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/ok-to-test
I want to see if I can help get this merged
Hi everyone. John has taken a hiatus and has left me with this PR. After catching up, I gather we are still waiting for those conversations to resolve on which way we want to go with those design decisions. The two commits I just put up are an improvement on the existing implementation (assuming we go with that; we don't have to) and some e2e tests to prove this works.
c3a1f0e to c84075b
Generally this seems OK to me. I'd also like other approvers to weigh in here too
// By default, recommendations for non-existent containers are never pruned until its top-most controller is deleted,
// after which the recommendations are subject to the VPA's recommendation garbage collector.
// +optional
PruningGracePeriod *metav1.Duration `json:"pruningGracePeriod,omitempty" protobuf:"bytes,4,opt,name=pruningGracePeriod"`
I'm unsure how others feel about this, but this change adds PruningGracePeriod to verticalpodautoscaler.spec.resourcePolicy.containerPolicies
Where the description of containerPolicies is:
Per-container resource policies.
ContainerResourcePolicy controls how autoscaler computes the recommended
resources for a specific container.
Technically speaking, PruningGracePeriod isn't related to how the VPA generates recommendations. It feels wrong to put it there, but I don't know of a better location to put it.
An idea I have, which I'm not super excited by, is to use an annotation on the VPA object to drive this.
I feel the flags are a bit messy. Some are in the VPA object, some are global in the recommender, and some are in annotations. I think most flags related to how the VPA generates recommendations should be in the VPA object, while others should go in annotations. So yes, I agree.
Sure, makes sense to me. Thanks!
Changed in 70826a0
…d of containerPolicy Signed-off-by: Max Cao <[email protected]>
… instead of containerPolicy Signed-off-by: Max Cao <[email protected]>
}
duration, err := time.ParseDuration(*globalPruningGracePeriodDuration)
if err != nil {
	panic(fmt.Sprintf("Failed to parse --pruning-grace-period-duration: %v", err))
Instead of using panic(), we should return an error to be handled by the caller of this function. Could we modify this to return an error using klog.ErrorS() followed by os.Exit(255)?
What do you think?
Fixed in b2ae65a, does it make sense? I used klog.Fatalf instead of error and exit.
Yes, that's fine, but could you use ErrorS and Exit? It's the correct approach for structured logging.
Hopefully cd63322 is correct? 🤞
Yup! Thanks
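The pattern agreed on in this thread can be sketched roughly as follows. This is not the code from cd63322; `parsePruningGracePeriod` is a hypothetical helper, and the standard library stands in for klog so the example runs on its own. In the recommender the caller would use `klog.ErrorS(err, ...)` followed by `os.Exit(255)` where the sketch writes to stderr.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// parsePruningGracePeriod is a hypothetical helper: it returns the parse
// error to the caller instead of panicking inside the parsing code.
func parsePruningGracePeriod(raw string) (time.Duration, error) {
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("failed to parse --pruning-grace-period-duration %q: %w", raw, err)
	}
	return d, nil
}

func main() {
	d, err := parsePruningGracePeriod("24h")
	if err != nil {
		// With klog this would be:
		//   klog.ErrorS(err, "Invalid pruning grace period")
		//   os.Exit(255)
		fmt.Fprintln(os.Stderr, err)
		os.Exit(255)
	}
	fmt.Println("pruning grace period:", d)
}
```

Returning the error keeps the decision to exit in one place (the caller), and the structured log call carries the error as a field rather than formatting it into the message.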
There's potentially something wrong here. I tested it locally. Everything was fine, I was getting recommendations as expected. I then deleted the second container, and eventually saw this in the logs:
A while later I re-added hamster2, and eventually the VPA did this:
My guess is that it's related to the
What's happening (I think) is that removing the container makes the deployment deploy a new pod, and the VPA doesn't associate both of the old container aggregates with the new pod, so it never marks From what I can tell, this is probably okay because
…e for non-breaking opt-in change Signed-off-by: Max Cao <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jkyros The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
That theory sounds plausible. I worry that it's confusing to the user though, since it sounds like their recommendation gets removed. What I'm mostly worried about is this part:
When this happened, the recommendations were deleted, and the status of my VPA looked like this:
…to be linked to new containers We need to set VPAContainersPerPod for a VPA correctly so we can split resources correctly on its first run through the recommendation loop. So I opted to explicitly set it after updating the pod's containers. This also allows old aggregateContainerStates that were previously created from a removed Pod's container, to be reused by a new Pod's container that shares the same VPA and targetRef. This allows recommendations to be updated correctly when aggregates are pruned or created. Signed-off-by: Max Cao <[email protected]>
Signed-off-by: Max Cao <[email protected]>
…ong time for non-breaking opt-in change Signed-off-by: Max Cao <[email protected]>
I wasn't able to reproduce the bug that removes the recommendations completely, but I noticed that pruning stale aggregates wouldn't update the number of container recommendations to the new number of pod containers, which is probably the same bug. I fixed this in 11bcee3, hopefully. I needed to explicitly link old aggregates that were used by a previous pod/container to the new pod/container. This is because the As long as the container name and namespace are the same, all of these aggregates will contribute to a recommendation for that container (hence the name aggregates I guess 😛). PruningGracePeriod just lets you decide if "old enough" containers can still contribute or not.
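As a rough illustration of the "old enough" decision, here is a minimal sketch of a staleness check driven by a grace period. The `containerAggregate` struct, its `lastSeen` field, and the `day` constant are hypothetical; the real aggregate state tracks this differently. A nil grace period models the documented default, where recommendations for missing containers are never pruned by this mechanism.

```go
package main

import (
	"fmt"
	"time"
)

// day is a convenience constant for the examples.
const day = 24 * time.Hour

// containerAggregate is a hypothetical, reduced aggregate state.
type containerAggregate struct {
	isUnderVPA  bool           // a live container still feeds this aggregate
	lastSeen    time.Time      // last time a matching container was observed
	gracePeriod *time.Duration // nil means "never prune" (the default)
}

// isStale reports whether the aggregate may be pruned at time now.
func (a *containerAggregate) isStale(now time.Time) bool {
	if a.isUnderVPA || a.gracePeriod == nil {
		return false
	}
	return now.Sub(a.lastSeen) > *a.gracePeriod
}

func main() {
	grace := day
	now := time.Now()

	removed := containerAggregate{lastSeen: now.Add(-2 * day), gracePeriod: &grace}
	noGrace := containerAggregate{lastSeen: now.Add(-2 * day)}

	fmt.Println("removed container, 24h grace:", removed.isStale(now))
	fmt.Println("removed container, no grace period:", noGrace.isStale(now))
}
```

An aggregate that a new pod's container links back to would flip `isUnderVPA` to true again and so stop being a pruning candidate.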
var recommendation = make(RecommendedPodResources)
if len(containerNameToAggregateStateMap) == 0 {
	return recommendation
}

- fraction := 1.0 / float64(len(containerNameToAggregateStateMap))
+ fraction := 1.0 / float64(containersPerPod)
+ klog.V(4).Infof("Spreading recommendation across %d containers (fraction %f)", containersPerPod, fraction)
Could you switch to structured logging here... something like:
- klog.V(4).Infof("Spreading recommendation across %d containers (fraction %f)", containersPerPod, fraction)
+ klog.V(4).InfoS("Spreading recommendation across containers", "containerCount", containersPerPod, "fraction", fraction)
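To show why the divisor matters, here is a small worked sketch of the arithmetic. The function name and the pod minimum of 250 are illustrative only, and the guard for a zero container count is an assumption of this sketch rather than the PR's behaviour.

```go
package main

import "fmt"

// perContainerMinimum splits a pod-level minimum evenly across containers.
// Returning 0 for a non-positive count is an assumption of this sketch.
func perContainerMinimum(podMinimum float64, containers int) float64 {
	if containers <= 0 {
		return 0
	}
	fraction := 1.0 / float64(containers)
	return podMinimum * fraction
}

func main() {
	const podMinimum = 250.0 // illustrative pod-level minimum

	aggregatesEverSeen := 2 // "hamster" plus the removed "hamster2"
	containersPerPod := 1   // what the pod actually runs now

	fmt.Println("old divisor:", perContainerMinimum(podMinimum, aggregatesEverSeen))
	fmt.Println("new divisor:", perContainerMinimum(podMinimum, containersPerPod))
}
```

With the old divisor the remaining container would only get half of the pod minimum because the removed container's aggregate is still counted; with the true container count it gets the full amount.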
Yeah, it seems to be fixed now. Something I noticed, was that the old recommendation still exists. When I have 1 containers per Pod:
(notice that the memory is half of the default value of Then when I remove a container from the Deployment, I get this:
Notice that the second (now removed) container persists, and the memory (for both containers) is now correct. I had assumed that as part of this PR, the removed recommendation should be removed from the status field.
I assume for your VPA object, there is no If so, yeah, I guess that's a side effect of the aggregates not getting pruned - that there will be stale recommendations still in the VPA (remember CronJobs pods!). To make it less confusing to the user, maybe there could be an extra field to mark it as visibly stale? Maybe it would also make more sense that the recommendation for
EDIT: Alternatively, I just thought of this, and if we don't want stale recommendations to appear regardless of grace period, but still want cronjob recommendations to appear, then maybe we can do:
Thinking about it, I think this alternative solution would remove the need for the
The issue here would still be covered, since if there are no pods for a deployment we don't remove anything, and the CronJob issue would be solved as well. Are there any bugs I'm missing in this solution?
Oh hold on, I was too impatient and wasn't waiting for the grace period to actually remove recommendation. My bad!
I think this may make sense. Updating the recommendation for a container that's about to be removed seems to show a sign of life; maybe it needs to be ignored instead.
Nope, I think the solution is fine at the moment (besides the point above, which is a mild confusion, and not a big deal), I was just not waiting the grace period that I had set. I'll give this a few more tests though, just to see if anything else crops up
182cead to b76f2b4
Signed-off-by: Max Cao <[email protected]>
b76f2b4 to 811fa14
Latest commit should fix this now.
I have another comment, and I'm unsure on this one. The new behaviour results in this:
This happens after the second container is removed, but before the What I had originally expected to happen was that the But, thinking it over, I guess this is fine? The second container's recommendations are retained (until it comes back, or when
Generally speaking I'm fine with this change. I haven't given it a thorough review, but it works locally in a way that makes sense. But I kinda want to get more thoughts on this (ping @omerap12 @raywainman @voelzmo) just to check if it all makes sense to someone else too
Thanks for pinging, I'll take a look tomorrow
Looks good, I have a couple of notes.
if podExists && len(pod.Containers) > 1 {
	feeder.clusterState.SetVPAContainersPerPod(podState, true)
} else if !podExists {
	panic("This shouldn't happen because AddOrUpdatePod should've placed this pod in the clusterState")
Why do we want to exit here? And if so, can we use klog.ErrorS + os.Exit(255)?
Oh, sorry I forgot about that. That was a temporary check for me. I'll remove that in the next set of patches, thanks.
func (cluster *ClusterState) addPodToItsVpa(pod *PodState) {
// SetVPAContainersPerPod sets the number of containers per pod seen for pods connected to this VPA
// so that later when we're splitting the minimum recommendations over containers, we're splitting them over
// the correct number and not just the number of aggregates that have *ever* been present. (We don't want minimum resources
🚀
}
vpaExists = false
}
if !vpaExists {
	vpa = NewVpa(vpaID, selector, apiObject.CreationTimestamp.Time)
Can we log that we are creating a new VPA?
containerName := resources.ContainerName
containerAggregateState := vpaContainerNameToAggregateStateMap[containerName]
if containerAggregateState != nil && !containerAggregateState.IsUnderVPA && !containerAggregateState.IsAggregateStale(now) {
	klog.V(5).InfoS("Container no longer exists, but is not stale. Keeping container recommendation at previous state.", "vpa", vpa.ID.VpaName, "container", containerName)
I'm not sure I follow. If the container no longer exists and is not under the VPA (!containerAggregateState.IsUnderVPA), and it's not stale (!containerAggregateState.IsAggregateStale(now)), then why are we keeping the recommendation instead of pruning it?
If the container no longer exists, then either:
- It's considered stale by grace period
- Not considered stale by grace period
If it is considered stale, then we are not going to overwrite the new recommendation with the previous, and we just ignore this branch of code which should set a new recommendation that we calculated in https://github.com/kubernetes/autoscaler/pull/6745/files#diff-5c44f22eb0f35931ab799838d2584dd10037916c74f114124e6904af48fcd608R96
If it's not considered stale, then potentially the user still wants this container's recommendation to exist for later, so here I explicitly overwrite the stale container's calculated recommendation with its previous recommendation.
So the logic is sort of backwards, if that makes sense.
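The branch described above can be sketched in isolation like this. The types and the `mergeRecommendations` helper are hypothetical simplifications (recommendations are reduced to a single number per container); only the condition mirrors the snippet under review: an aggregate that exists, is not under the VPA, and is not stale keeps its previous recommendation.

```go
package main

import "fmt"

// aggState is a hypothetical, reduced view of a container aggregate.
type aggState struct {
	isUnderVPA bool // a running container still maps to this aggregate
	isStale    bool // the pruning grace period has elapsed
}

// mergeRecommendations starts from the freshly computed values and then,
// for containers that are gone but not yet stale, restores the previous
// recommendation so it survives until pruning actually happens.
func mergeRecommendations(previous, computed map[string]int64, aggs map[string]aggState) map[string]int64 {
	out := make(map[string]int64, len(computed))
	for name, value := range computed {
		out[name] = value
	}
	for name, prev := range previous {
		agg, ok := aggs[name]
		if ok && !agg.isUnderVPA && !agg.isStale {
			out[name] = prev
		}
	}
	return out
}

func main() {
	previous := map[string]int64{"hamster": 100, "hamster2": 200}
	computed := map[string]int64{"hamster": 150, "hamster2": 50}
	aggs := map[string]aggState{
		"hamster":  {isUnderVPA: true},
		"hamster2": {}, // removed from the pod, grace period not yet over
	}
	fmt.Println(mergeRecommendations(previous, computed, aggs))
}
```

Once the grace period elapses, the aggregate is pruned, nothing is computed for that container, and the condition no longer restores the old value, so the recommendation disappears from the result.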
Understood. Thanks for clarifying this :)
I agree I was unsure on what behaviour should be "correct". On one hand, I think we want to keep the stale recommendation in there if the pruning hasn't happened, but here's a scenario that might be a problem.
I'm not exactly sure how we want to handle that. If we remove the recommendation entirely, that means the second container doesn't get a recommended resource initially at all, even if the pruning grace period wasn't met.
For me it makes sense
/milestone vertical-pod-autoscaler-1.4.0
What type of PR is this?
/kind bug
What this PR does / why we need it:
Previously we weren't cleaning up "stale" aggregates when container names changed (because of renames, removals) and that was resulting in:
This PR is an attempt to clean up those stale aggregates without incurring too much overhead, and make sure that the resources get spread across the correct number of containers during a rollout.
Which issue(s) this PR fixes:
Fixes #6744
Special notes for your reviewer:
There are probably a lot of different ways we can do the pruning of stale aggregates for missing containers:
- PruneAggregates() that runs after LoadPods() that goes through everything and removes them (or do this work as part of LoadPods() but that seems...deceptive?)
- garbageCollectAggregateCollectionStates and run it immediately after LoadPods() every time but that might be expensive.
I'm not super-attached to any particular approach, I'd just like to fix this, so I can retool it if necessary.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: