Understand why resource usage of master components is noticeably different in Kubemark #44701
Comments
One hypothesis I have is that the size of objects is different (in particular the size of Node). Something to verify. |
I checked the sizes of the Node object, and indeed they're quite different. Kubemark's Node object is 3.6kB, while in a real cluster it's 13.6kB. The biggest difference is the list of images present: in the "real" cluster we have the "image puller" running, which downloads pretty much the whole internet on every node, and the information about those images takes A LOT of space (9k to be precise). The other differences are smaller and generally consist of data that the kubelet reads from the cloud provider and publishes as either labels or node-info, but this is peanuts in comparison to images and probably not worth fixing. There are two ways in which we can approach this:
When we do one of those, we should check whether the results get better. |
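As an illustration of how that check could be done, here is a minimal sketch that measures a Node object's serialized size and the share taken by its status.images field. It assumes a recent client-go; the kubeconfig path and node name are placeholders, not values from this thread.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path and node name; adjust for your cluster.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	node, err := client.CoreV1().Nodes().Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Compare the whole object's JSON size with the size of the images list alone.
	whole, _ := json.Marshal(node)
	images, _ := json.Marshal(node.Status.Images)
	fmt.Printf("node JSON size: %d bytes, of which status.images: %d bytes (%d images)\n",
		len(whole), len(images), len(node.Status.Images))
}
```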
The node object contains a lot of container image data too. |
Our nodes in production environments, on anything packed, often approach 50kB or more.
|
I think we should bound the scope of image details to MRU (most recently used). It's a sub-optimal heuristic to steer, but those who care about speed will pre-populate the cluster. |
I'm seeing normal clusters (light, unloaded) at a minimum of 10k JSON, 8k proto.
Could the scheduler remember recent images and bias toward them even if they're no longer on the node? That would preserve the heuristic.
|
Is the scheduler the only consumer of those images? We can definitely add some caching to the scheduler (that shouldn't be difficult). Though the problem is what happens after a scheduler restart (after a restart it would be kind of best-effort). |
Yeah, they optimize placement (and in practice we see massive benefit from this on fairly dense clusters). The heuristic is really just best-effort at this point - we could get a lot better in the scheduler by knowing which images have overlapping layers, but that's not something the node can surface up.
I think being a bit more restrictive on the nodes, and caching a bit more in the scheduler, would improve some dimensions of this. The image list is already capped, and I *thought* it was MR-accessed.
Another thing - small clusters can probably live with bigger lists, while large clusters would prefer smaller lists.
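A hypothetical sketch of the caching idea discussed here (not the actual scheduler code): the scheduler remembers images it has recently seen on a node and keeps biasing toward that node for a while, even after the image drops off the node's capped list. The TTL and names are illustrative; the scheduler-restart caveat mentioned above still applies, since this cache starts empty after a restart.

```go
package main

import (
	"sync"
	"time"
)

// recentImages remembers which images were recently reported on which node,
// so a scoring heuristic can still favor that node after the image falls
// off the node's capped image list.
type recentImages struct {
	mu   sync.Mutex
	ttl  time.Duration
	seen map[string]map[string]time.Time // node -> image -> last seen
}

func newRecentImages(ttl time.Duration) *recentImages {
	return &recentImages{ttl: ttl, seen: map[string]map[string]time.Time{}}
}

// Observe records the images currently reported in a node's status.
func (r *recentImages) Observe(node string, images []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	m := r.seen[node]
	if m == nil {
		m = map[string]time.Time{}
		r.seen[node] = m
	}
	now := time.Now()
	for _, img := range images {
		m[img] = now
	}
}

// Has reports whether the node had the image within the TTL window.
func (r *recentImages) Has(node, image string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	t, ok := r.seen[node][image]
	return ok && time.Since(t) < r.ttl
}
```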
|
That's a bigger thing. This doesn't seem like a high priority to me (though I agree it would be useful at some point).
Is it? I think it wasn't some time ago.
That's true, though the question is how to choose the ones that we want. |
Or the ones with the biggest number of layers? |
There's a whole set of heuristics in place. We could probably cut 10-30% off node size, but I don't know that we would want to go deeper than that.
I will note that the reason nodes are the worst is all the strings we copy - the work to make string decoding more efficient in protobuf would pay off here (doing an arena allocation for those strings vs creating one string per field). We have a few object types that would benefit from that.
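A toy sketch of the arena idea (not the actual protobuf decoder): instead of allocating one Go string per decoded field, copy all field bytes into a single buffer and hand out substrings of it, which share that one backing array.

```go
package main

import "strings"

// decodePerField allocates one string per field - the pattern being criticized.
func decodePerField(fields [][]byte) []string {
	out := make([]string, len(fields))
	for i, f := range fields {
		out[i] = string(f) // one allocation + copy per field
	}
	return out
}

// decodeArena copies all field bytes into one backing allocation and returns
// substrings of it; the substrings share the single backing array.
func decodeArena(fields [][]byte) []string {
	total := 0
	for _, f := range fields {
		total += len(f)
	}
	var b strings.Builder
	b.Grow(total) // single backing allocation for all string data

	offsets := make([]int, 0, len(fields)+1)
	for _, f := range fields {
		offsets = append(offsets, b.Len())
		b.Write(f)
	}
	offsets = append(offsets, b.Len())

	all := b.String() // no extra copy
	out := make([]string, len(fields))
	for i := range fields {
		out[i] = all[offsets[i]:offsets[i+1]]
	}
	return out
}
```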
|
At the macro scale, image-based steering only matters on large pulls, and it's a one-time effect. So perhaps our threshold should be a mix of image/layer size. Data locality has long since been dispelled as a farce; what matters more is network bandwidth utilization, because you will often trade startup latency for fragmentation. |
@shyamjvs - can you please first verify whether changing the size of nodes gives us comparable results? |
Sorry it took me some time to get to this. Here's the result of running the 100-node real-cluster test with image prepulling disabled:
Seems like image data is indeed doing much mischief. |
CPU usage looks a lot better, but it seems that memory didn't drop much. So there's something else going on. |
So the new size of the node object is 5.5kB, which is much better than 13.6kB. However, there is still some difference from the kubemark node size, i.e. 3.6kB. I'll look into why we see this difference, but the most probable cause seems to be the images field:
|
Ok, so I verified that it is indeed the "images" field which is creating the difference. There is no field for it in the kubemark node object, which is not surprising. |
My experience here is minimal, but here are some views I have (could be wrong) about this:
|
Here are the results of running the performance comparison tool we recently built against runs from the last 24 hours of our 100-node kubemark and real-cluster tests. We essentially compare the averages (taken across the runs) of various API calls' latencies across the two tests, and deem those whose ratio of averages is far from 1 as mismatched. Shown here are the ones that mismatched on their 99th percentile:
The above results are just to give a rough idea about the differences. I'll work on writing a more detailed report on them soon. |
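For context, the comparison described above boils down to something like the following sketch (illustrative, not the actual tool): average each API call's latency across runs for both setups and flag calls whose kubemark/real ratio falls far from 1. The threshold and sample numbers are made up.

```go
package main

import "fmt"

// mean averages per-run latencies (in milliseconds) for one API call.
func mean(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// flagMismatches returns the API calls whose kubemark/real average-latency
// ratio falls outside [1/threshold, threshold].
func flagMismatches(kubemark, realRuns map[string][]float64, threshold float64) map[string]float64 {
	out := map[string]float64{}
	for call, kSamples := range kubemark {
		rSamples, ok := realRuns[call]
		if !ok || len(kSamples) == 0 || len(rSamples) == 0 {
			continue
		}
		ratio := mean(kSamples) / mean(rSamples)
		if ratio > threshold || ratio < 1/threshold {
			out[call] = ratio
		}
	}
	return out
}

func main() {
	// Made-up numbers just to show the shape of the comparison.
	kubemark := map[string][]float64{"LIST pods": {40, 45}, "GET nodes": {5, 6}}
	realRuns := map[string][]float64{"LIST pods": {10, 12}, "GET nodes": {5, 5}}
	for call, r := range flagMismatches(kubemark, realRuns, 2.0) {
		fmt.Printf("%s: ratio %.2f\n", call, r)
	}
}
```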
It's not the apiserver that populates it - it's the kubelet. See #44701 (comment)
+1 - this is useless in our tests
-1 - let's not do any artificial things. If we believe the images are the main difference (and disabling the image puller is not enough), we should change the hollow-node to make the data the same between the real cluster and kubemark (by caching the images of pods running on a node in the hollow-kubelet).
I don't understand - if this were empty, many things wouldn't work. One potential difference that I can think of now is that we don't have a bunch of system pods, like fluentd, in the kubemark cluster (and we should add them). |
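A hypothetical sketch of the hollow-kubelet change proposed above: track the images of pods "running" on the hollow node and publish them in the node status, so kubemark node objects carry image data comparable to real nodes. The type names and the fake image size are assumptions, not the actual hollow-node code.

```go
package main

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// imageTracker accumulates the images of pods assigned to a hollow node.
type imageTracker struct {
	mu     sync.Mutex
	images map[string]int64 // image name -> fake size in bytes
}

func newImageTracker() *imageTracker {
	return &imageTracker{images: map[string]int64{}}
}

// RecordPod notes every container image used by a pod scheduled to this node.
func (t *imageTracker) RecordPod(pod *v1.Pod) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range pod.Spec.Containers {
		// A real kubelet reports actual image sizes; here we use a fake constant.
		t.images[c.Image] = 750 * 1024 * 1024
	}
}

// NodeImages renders the tracked images in the format used by node.Status.Images.
func (t *imageTracker) NodeImages() []v1.ContainerImage {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make([]v1.ContainerImage, 0, len(t.images))
	for name, size := range t.images {
		out = append(out, v1.ContainerImage{Names: []string{name}, SizeBytes: size})
	}
	return out
}
```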
@shyamjvs ^^ |
@wojtek-t Thanks for the lead. I guess you are right that the default namespace would be empty in both cases. But the perf comparison shows a super high ratio for "list pods" latency in the case of the density test, which is nowhere close to the deviations for any other API call.
I want to understand the reason behind this. Maybe doing a |
Which test are you looking at? If this is for density, it's possible that there won't be any list pods, because pretty much everything in our system uses the reflector/informer framework, so we list only after a watch was broken and we can't restart it from the previous point (which should happen relatively rarely). If this is about load, then the test itself is doing a lot of "LISTs of pods", so if we are missing those metrics, we have a bug somewhere (potentially in the metrics-gathering code) and we should debug and fix it. |
We can start by dropping metrics with count smaller than, say, 10. |
@wojtek-t That sounds reasonable. I was talking about the density test. However, I just looked into the load capacity runs and it turns out that the case is the same even for them: there are "list pods" only in some runs. On checking the count of list pods requests for run#2026, it turns out that there is only 1 such request (and that too from the density test):
This is not expected given that we have a list pods call inside our load test's scale function. Either I'm going wrong somewhere or there is some error in the master metrics being exposed by the apiserver. However, I don't think the bug is in the metrics-gathering code, as all it does is scrape the |
I'll send out a PR enabling that admission plugin in kubemark. Let's see if that helps. |
With the above PR, the object size of a pod should be almost the same in both cases. For the node object, we still see a ~2kB size difference (real node - 5.5kB, hollow node - 3.5kB). Here's a breakdown of the difference:
|
We definitely do NOT want to enable the route controller. We can try artificially setting this condition somewhere.
Let's do that.
No - this will impact e.g. processing selectors. We don't want this.
Actually, I'm now wondering why we don't have images. Are we mocking something that is tracking them? Or do they come from the docker client? Or what?
Can we add them? |
Automatic merge from submit-queue (batch tested with PRs 51765, 53053, 52771, 52860, 53284). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md
Add audit-logging, feature-gates & few admission plugins to kubemark
To make kubemark match real cluster settings. Also includes a few other settings like request-timeout, etcd-quorum, etc.
Fixes #53021
Related #51899 #44701
cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek @smarterclayton
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle rotten |
/remove-lifecycle rotten |
It's been a while, but there's an interesting update here. After my recent fix (#59832) for a significant bug with endpoints in kubemark (#59823), the gap b/w kubemark and real cluster has come down from ~6.0 (see - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-perf-tests-kubemark-100-benchmark/3192/build-log.txt). We're now down to resolving just the difference for
|
One more thing: since the endpoints objects were effectively empty in the kubemark cluster, that should've considerably reduced etcd and apiserver mem-usage for kubemark. Let me grab some current values. |
To summarize, this is what the avg mem usages look like:
|
I have a strong feeling that the remaining difference has a considerable component coming from node objects (as we were discussing earlier in this thread). In addition to the object size itself, there is one more factor magnifying the effect - having numerous versions of each node object in etcd (discussed in more detail here: #14733). So I think a logical way to proceed is to check the size of node objects (both the latest RV in the apiserver and all versions in etcd) and, if we find a big enough difference, resurrect some of the ideas discussed with @wojtek-t here. |
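For the "all versions in etcd" part, a rough sketch of one possible check is below. It assumes the usual Kubernetes key layout for nodes (/registry/minions/<name>) and a reachable etcd v3 endpoint; both are placeholders. It sums the stored sizes of all versions of the key that compaction has retained.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	key := "/registry/minions/node-1" // assumed key layout for nodes
	ctx := context.Background()

	resp, err := cli.Get(ctx, key)
	if err != nil || len(resp.Kvs) == 0 {
		panic("node key not found")
	}
	kv := resp.Kvs[0]

	total, versions := 0, 0
	// Walk global revisions; count only those where this key was the one modified.
	// O(number of revisions) - fine as a debugging sketch, not for production use.
	for rev := kv.CreateRevision; rev <= kv.ModRevision; rev++ {
		r, err := cli.Get(ctx, key, clientv3.WithRev(rev))
		if err != nil { // e.g. revision already compacted
			continue
		}
		if len(r.Kvs) > 0 && r.Kvs[0].ModRevision == rev {
			total += len(r.Kvs[0].Value)
			versions++
		}
	}
	fmt.Printf("%d retained versions, %d bytes total\n", versions, total)
}
```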
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale |
/lifecycle frozen |
We see a pretty big discrepancy between what we see in kubemark clusters and real ones. It needs to be understood. The difference is mostly in API server usage.
E.g. the 99th percentile in kubemark (both results from after Density, which is run as the first test):
and in the real cluster:
@wojtek-t @shyamjvs @kubernetes/sig-scalability-misc