Understand why resource usage of master components is noticeably different in Kubemark #44701
Comments
One hypothesis I have is that the size of objects is different (in particular the size of Node). Something to verify. |
I checked the sizes of the Node object, and indeed they're quite different. Kubemark's Node object is 3.6kB, while in a real cluster it's 13.6kB. The biggest difference is the list of images present: in the "real" cluster we have the "image puller" running, which downloads pretty much the whole internet on every node, and the information about those images takes A LOT of space (9k to be precise). The other differences are smaller and generally consist of data that the kubelet reads from the cloud provider and publishes as either labels or node-info, but this is peanuts in comparison to images and probably not worth fixing. There are two ways in which we can approach this:
When we do one of those, we should check whether the results get better. |
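As an illustration of how that check could be done, here is a minimal sketch that measures a Node object's serialized size and the share taken by its status.images field. It assumes a recent client-go; the kubeconfig path and node name are placeholders, not values from this thread.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path and node name; adjust for your cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	node, err := client.CoreV1().Nodes().Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Compare the whole object's JSON size with the size of the images list alone.
	whole, _ := json.Marshal(node)
	images, _ := json.Marshal(node.Status.Images)
	fmt.Printf("node JSON size: %d bytes, of which status.images: %d bytes (%d images)\n",
		len(whole), len(images), len(node.Status.Images))
}
```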
The node object contains a lot of container image data too. |
Our nodes in production environments, on anything packed, often approach 50kB or more.
|
I think we should bound the scope of image details to MRU (most recently used). It's a sub-optimal heuristic to steer, but those who care about speed will pre-populate the cluster. |
I'm seeing normal clusters (light, unloaded) at a minimum of 10k JSON, 8k proto.
Could the scheduler remember recent images and bias toward them even if they're no longer on the node? That would preserve the heuristic.
|
Is the scheduler the only consumer of those images? We can definitely add some caching to the scheduler (that shouldn't be difficult). Though the problem is what happens after a scheduler restart (after a restart it would be kind of best-effort). |
Yeah, they optimize placement (and in practice we see massive benefit from this on fairly dense clusters). The heuristic is really just best-effort at this point - we could get a lot better in the scheduler by knowing which images have overlapping layers, but that's not something the node can surface up.
I think being a bit more restrictive on the nodes, and caching a bit more in the scheduler, would improve some dimensions of this. The image list is already capped, and I *thought* it was MR-accessed.
Another thing - small clusters can probably live with bigger lists, while large clusters would prefer smaller lists.
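A hypothetical sketch of the caching idea discussed here (not the actual scheduler code): the scheduler remembers images it has recently seen on a node and keeps biasing toward that node for a while, even after the image drops off the node's capped list. The TTL and names are illustrative; the scheduler-restart caveat mentioned above still applies, since this cache starts empty after a restart.

```go
package main

import (
	"sync"
	"time"
)

// recentImages remembers which images were recently reported on which node,
// so a scoring heuristic can still favor that node after the image falls
// off the node's capped image list.
type recentImages struct {
	mu   sync.Mutex
	ttl  time.Duration
	seen map[string]map[string]time.Time // node -> image -> last seen
}

func newRecentImages(ttl time.Duration) *recentImages {
	return &recentImages{ttl: ttl, seen: map[string]map[string]time.Time{}}
}

// Observe records the images currently reported in a node's status.
func (r *recentImages) Observe(node string, images []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	m := r.seen[node]
	if m == nil {
		m = map[string]time.Time{}
		r.seen[node] = m
	}
	now := time.Now()
	for _, img := range images {
		m[img] = now
	}
}

// Has reports whether the node had the image within the TTL window.
func (r *recentImages) Has(node, image string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	t, ok := r.seen[node][image]
	return ok && time.Since(t) < r.ttl
}
```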
|
That's a bigger thing. This doesn't seem like a high priority to me (though I agree it would be useful at some point).
Is it? I think it wasn't some time ago.
That's true, though the question is how to choose the ones that we want. |
Or the ones with the biggest number of layers? |
There's a whole set of heuristics in place. We could probably cut 10-30% off node size, but I don't know that we would want to go deeper than that.
I will note that the reason nodes are the worst is all the strings we copy - the work to make string decoding more efficient in protobuf would pay off here (doing an arena allocation for those strings vs creating one string per field). We have a few object types that would benefit from that.
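A toy sketch of the arena idea (not the actual protobuf decoder): instead of allocating one Go string per decoded field, copy all field bytes into a single buffer and hand out substrings of it, which share that one backing array.

```go
package main

import "strings"

// decodePerField allocates one string per field - the pattern being criticized.
func decodePerField(fields [][]byte) []string {
	out := make([]string, len(fields))
	for i, f := range fields {
		out[i] = string(f) // one allocation + copy per field
	}
	return out
}

// decodeArena copies all field bytes into one backing allocation and returns
// substrings of it; the substrings share the single backing array.
func decodeArena(fields [][]byte) []string {
	total := 0
	for _, f := range fields {
		total += len(f)
	}
	var b strings.Builder
	b.Grow(total) // single backing allocation for all string data

	offsets := make([]int, 0, len(fields)+1)
	for _, f := range fields {
		offsets = append(offsets, b.Len())
		b.Write(f)
	}
	offsets = append(offsets, b.Len())

	all := b.String() // no extra copy
	out := make([]string, len(fields))
	for i := range fields {
		out[i] = all[offsets[i]:offsets[i+1]]
	}
	return out
}
```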
|
At the macro scale, image-based steering only matters on large pulls, and it's a one-time effect. So perhaps our threshold should be a mix of image/layer size. Data locality has long since been dispelled as a farce; what matters more is network bandwidth utilization, because you will often trade startup latency for fragmentation. |
@shyamjvs - can you please first verify whether changing the size of nodes gives us comparable results? |
Sorry it took me some time to get to this. Here's the result of running the 100-node real-cluster test with image prepulling disabled:
Seems like image data is indeed doing much mischief. |
CPU usage looks a lot better, but it seems that memory didn't drop much. So there's something else going on. |
So the new size of the node object is 5.5kB, which is much better than 13.6kB. However, there is still some difference from the kubemark node size, i.e. 3.6kB. I'll look into why we see this difference, but the most probable cause seems to be the images field:
|
Ok, so I verified that it is indeed the "images" field which is creating the difference. There is no field for it in the kubemark node object, which is not surprising. |
My experience here is minimal, but here are some views I have (could be wrong) about this:
|
Here are the results of running the performance comparison tool we recently built against runs from the last 24 hours of our 100-node kubemark and real-cluster tests. We essentially compare the averages (taken across the runs) of various API calls' latencies across the two tests, and deem those whose ratio of averages is far from 1 as mismatched. Shown here are the ones that mismatched on their 99th percentile:
The above results are just to give a rough idea about the differences. I'll work on writing a more detailed report on them soon. |
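For context, the comparison described above boils down to something like the following sketch (illustrative, not the actual tool): average each API call's latency across runs for both setups and flag calls whose kubemark/real ratio falls far from 1. The threshold and sample numbers are made up.

```go
package main

import "fmt"

// mean averages per-run latencies (in milliseconds) for one API call.
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// flagMismatches returns the API calls whose kubemark/real average-latency
// ratio falls outside [1/threshold, threshold].
func flagMismatches(kubemark, realRuns map[string][]float64, threshold float64) map[string]float64 {
	out := map[string]float64{}
	for call, kSamples := range kubemark {
		rSamples, ok := realRuns[call]
		if !ok || len(kSamples) == 0 || len(rSamples) == 0 {
			continue
		}
		ratio := mean(kSamples) / mean(rSamples)
		if ratio > threshold || ratio < 1/threshold {
			out[call] = ratio
		}
	}
	return out
}

func main() {
	// Made-up numbers just to show the shape of the comparison.
	kubemark := map[string][]float64{"LIST pods": {40, 45}, "GET nodes": {5, 6}}
	realRuns := map[string][]float64{"LIST pods": {10, 12}, "GET nodes": {5, 5}}
	for call, r := range flagMismatches(kubemark, realRuns, 2.0) {
		fmt.Printf("%s: ratio %.2f\n", call, r)
	}
}
```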
It's not the apiserver that populates it - it's the kubelet. See #44701 (comment)
+1 - this is useless in our tests
-1 - let's not do any artificial things. If we believe the images are the main difference (and disabling the image puller is not enough), we should change the hollow-node to make the data the same between the real cluster and kubemark (by caching the images of pods running on a node in the hollow-kubelet).
I don't understand - if this were empty, many things wouldn't work. One potential difference that I can think of now is that we don't have a bunch of system pods, like fluentd, in the kubemark cluster (and we should add them). |
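A hypothetical sketch of the hollow-kubelet change proposed above: track the images of pods "running" on the hollow node and publish them in the node status, so kubemark node objects carry image data comparable to real nodes. The type names and the fake image size are assumptions, not the actual hollow-node code.

```go
package main

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// imageTracker accumulates the images of pods assigned to a hollow node.
type imageTracker struct {
	mu     sync.Mutex
	images map[string]int64 // image name -> fake size in bytes
}

func newImageTracker() *imageTracker {
	return &imageTracker{images: map[string]int64{}}
}

// RecordPod notes every container image used by a pod scheduled to this node.
func (t *imageTracker) RecordPod(pod *v1.Pod) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range pod.Spec.Containers {
		// A real kubelet reports actual image sizes; here we use a fake constant.
		t.images[c.Image] = 750 * 1024 * 1024
	}
}

// NodeImages renders the tracked images in the format used by node.Status.Images.
func (t *imageTracker) NodeImages() []v1.ContainerImage {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make([]v1.ContainerImage, 0, len(t.images))
	for name, size := range t.images {
		out = append(out, v1.ContainerImage{Names: []string{name}, SizeBytes: size})
	}
	return out
}
```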
@shyamjvs ^^ |
@wojtek-t Thanks for the lead. I guess you are right that the default namespace would be empty in both cases. But the perf comparison shows a super high ratio for "list pods" latency in the case of the density test, which is nowhere close to the deviations for any other API call.
I want to understand the reason behind this. Maybe doing a |
Which test are you looking at? If this is for density, it's possible that there won't be any list pods, because pretty much everything in our system uses the reflector/informer framework, so we list only after a watch was broken and we can't restart it from the previous point (which should happen relatively rarely). If this is about load, then the test itself is doing a lot of "LISTs of pods", so if we are missing those metrics, we have a bug somewhere (potentially in the metrics-gathering code) and we should debug and fix it. |
We can start by dropping metrics with count smaller than, say, 10. |
@wojtek-t That sounds reasonable. I was talking about the density test. However, I just looked into the load capacity runs and it turns out that the case is the same even for them: there are "list pods" only in some runs. On checking the count of list pods requests for run#2026, it turns out that there is only 1 such request (and that too from the density test):
This is not expected given that we have a list pods call inside our load test's scale function. Either I'm going wrong somewhere or there is some error in the master metrics being exposed by the apiserver. However, I don't think the bug is in the metrics-gathering code, as all it does is scrape the |
I'll send out a PR enabling that admission plugin in kubemark. Let's see if that helps. |
With the above PR, the object size of a pod should be almost the same in both cases. For the node object, we still see a ~2kB size difference (real node - 5.5kB, hollow node - 3.5kB). Here's a breakdown of the difference:
|
We definitely do NOT want to enable the route controller. We can try artificially setting this condition somewhere.
Let's do that.
No - this will impact e.g. processing selectors. We don't want this.
Actually, I'm now wondering why we don't have images. Are we mocking something that is tracking them? Or do they come from the docker client? Or what?
Can we add them? |
Automatic merge from submit-queue (batch tested with PRs 51765, 53053, 52771, 52860, 53284). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md
Add audit-logging, feature-gates & few admission plugins to kubemark
To make kubemark match real cluster settings. Also includes a few other settings like request-timeout, etcd-quorum, etc.
Fixes #53021
Related #51899 #44701
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek @smarterclayton
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle rotten |
/remove-lifecycle rotten |
It's been a while, but there's an interesting update here. After my recent fix (#59832) for a significant bug with endpoints in kubemark (#59823), the gap b/w kubemark and real cluster has come down from ~6.0 (see - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-perf-tests-kubemark-100-benchmark/3192/build-log.txt). We're now down to resolving just the difference for
|
One more thing: since the endpoints objects were effectively empty in the kubemark cluster, that should've considerably reduced etcd and apiserver mem-usage for kubemark. Let me grab some current values. |
To summarize, this is what the avg mem usages look like:
|
I have a strong feeling that the remaining difference has a considerable component coming from node objects (as we were discussing earlier in this thread). In addition to the object size itself, there is one more factor magnifying the effect - having numerous versions of each node object in etcd (discussed in more detail here: #14733). So I think a logical way to proceed is to check the size of node objects (both the latest RV in the apiserver and all versions in etcd) and, if we find a big enough difference, resurrect some of the ideas discussed with @wojtek-t here. |
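For the "all versions in etcd" part, a rough sketch of one possible check is below. It assumes the usual Kubernetes key layout for nodes (/registry/minions/<name>) and a reachable etcd v3 endpoint; both are placeholders. It sums the stored sizes of all versions of the key that compaction has retained.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	key := "/registry/minions/node-1" // assumed key layout for nodes
	ctx := context.Background()

	resp, err := cli.Get(ctx, key)
	if err != nil || len(resp.Kvs) == 0 {
		panic("node key not found")
	}
	kv := resp.Kvs[0]

	total, versions := 0, 0
	// Walk global revisions; count only those where this key was the one modified.
	// O(number of revisions) - fine as a debugging sketch, not for production use.
	for rev := kv.CreateRevision; rev <= kv.ModRevision; rev++ {
		r, err := cli.Get(ctx, key, clientv3.WithRev(rev))
		if err != nil { // e.g. revision already compacted
			continue
		}
		if len(r.Kvs) > 0 && r.Kvs[0].ModRevision == rev {
			total += len(r.Kvs[0].Value)
			versions++
		}
	}
	fmt.Printf("%d retained versions, %d bytes total\n", versions, total)
}
```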
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale |
/lifecycle frozen |
We see a pretty big discrepancy between what we see in kubemark clusters and real ones. It needs to be understood. The difference is mostly in API server usage.
E.g. the 99th percentile in kubemark (both results from after Density, which is run as the first test):
and in the real cluster:
@wojtek-t @shyamjvs @kubernetes/sig-scalability-misc