Understand why resource usage of master components is noticeably different in Kubemark #44701

Open
gmarek opened this issue Apr 20, 2017 · 44 comments
Assignees: @shyamjvs
Labels: lifecycle/frozen, lifecycle/stale, sig/scalability

Comments

@gmarek (Contributor) commented Apr 20, 2017:

We see a pretty big discrepancy between resource usage in kubemark clusters and real ones, and it needs to be understood. The difference is mostly in apiserver usage.

E.g. the 99th percentile in kubemark (both results taken after Density, which runs as the first test):

container                               cpu(cores) memory(MB)
"apiserver/apiserver"                   0.606      407.89
"controller-manager/controller-manager" 0.120      218.27
"etcd/etcd/data"                        0.120      160.75
"etcd/etcd/data-events"                 0.034      46.22
"scheduler/scheduler"                   0.080      114.16

and real cluster:

    {
      "Name": "etcd-server-e2e-scalability-master/etcd-container",
      "Cpu": 0.167254853,
      "Mem": 371601408
    },
    {
      "Name": "etcd-server-events-e2e-scalability-master/etcd-container",
      "Cpu": 0.14851711,
      "Mem": 159068160
    },
    {
      "Name": "kube-apiserver-e2e-scalability-master/kube-apiserver",
      "Cpu": 0.972463425,
      "Mem": 787206144
    },
    {
      "Name": "kube-controller-manager-e2e-scalability-master/kube-controller-manager",
      "Cpu": 0.180771427,
      "Mem": 235991040
    },
    {
      "Name": "kube-scheduler-e2e-scalability-master/kube-scheduler",
      "Cpu": 0.169906087,
      "Mem": 113385472
    },

@wojtek-t @shyamjvs @kubernetes/sig-scalability-misc

@gmarek gmarek added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 20, 2017
@wojtek-t (Member) commented:

One hypothesis I have is that the size of the objects is different (in particular the size of the Node object). Something to verify.

@wojtek-t wojtek-t added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 20, 2017
@gmarek (Contributor, Author) commented Apr 20, 2017:

I checked the size of the Node object, and indeed they're quite different. Kubemark's Node object is 3.6kB, while in a real cluster it's 13.6kB. The biggest difference is the list of images present on the node: in the "real" cluster we run the "image puller", which downloads pretty much the whole internet on every node, and information about those images takes A LOT of space (9kB, to be precise).

Other differences are smaller and generally consist of data that the kubelet reads from the cloud provider and publishes as either labels or node-info, but this is peanuts in comparison to the images and probably not worth fixing.

There are two ways in which we can approach this:

  • don't run image puller in scale tests
  • "run" it in kubemark (i.e. mock 9k data about images)

Whichever of these we do, we should check whether the results get closer.
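If we go the mocking route, a minimal sketch of what faking the image list in the hollow kubelet could look like (the package, helper names, and image set are made up for illustration; the real hollow-node code lives under pkg/kubemark, and import paths vary by release):

```go
// Hypothetical helper: pad a hollow node's status with a fake image list so
// its serialized size roughly matches a real node that ran the image puller.
package hollownode

import (
	"fmt"

	"k8s.io/api/core/v1"
)

// fakeImages returns n synthetic image entries, mimicking the ~9kB of
// status.images data a prepulled real node reports.
func fakeImages(n int) []v1.ContainerImage {
	images := make([]v1.ContainerImage, 0, n)
	for i := 0; i < n; i++ {
		images = append(images, v1.ContainerImage{
			Names:     []string{fmt.Sprintf("gcr.io/fake-project/fake-image-%d:v1.0", i)},
			SizeBytes: 100 * 1024 * 1024, // arbitrary 100MB per image
		})
	}
	return images
}

// decorateNodeStatus would be called from the hollow kubelet's node status
// update path before the status is posted to the apiserver.
func decorateNodeStatus(node *v1.Node) {
	node.Status.Images = fakeImages(50)
}
```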

@timothysc (Member) commented:

The node object also contains a lot of container image data.

@smarterclayton (Contributor) commented Apr 20, 2017 via email.

@timothysc (Member) commented Apr 20, 2017:

I think we should bound the image details we report to the most-recently-used (MRU) set. It's a sub-optimal heuristic to steer by, and those who care about speed will pre-populate the cluster.

@smarterclayton (Contributor) commented Apr 20, 2017 via email.

@wojtek-t (Member) commented:

> Could the scheduler remember recent images and bias them even if they're
> not on the node anymore? That would preserve the heuristic.

Is the only purpose of having those images for the scheduler?

We can definitely add some caching to the scheduler (that shouldn't be difficult). Though, the problem is what happens after a scheduler restart (after a restart it would be kind of best-effort).
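For the caching idea, here is a rough, purely illustrative sketch (not an existing scheduler type) of remembering recently seen images per node, so an image-locality priority could keep biasing placement after an image drops out of the node's reported list; after a restart the cache starts empty, matching the best-effort caveat above:

```go
// Illustrative sketch of a best-effort "recently seen images" cache for an
// image-locality heuristic in the scheduler.
package imagecache

import (
	"sync"
	"time"
)

type entry struct {
	sizeBytes int64
	lastSeen  time.Time
}

// RecentImages remembers images observed on nodes for a bounded time.
type RecentImages struct {
	mu     sync.Mutex
	ttl    time.Duration
	byNode map[string]map[string]entry // node name -> image name -> entry
}

func New(ttl time.Duration) *RecentImages {
	return &RecentImages{ttl: ttl, byNode: map[string]map[string]entry{}}
}

// Observe records that an image was reported (or used by a pod) on a node.
func (c *RecentImages) Observe(node, image string, sizeBytes int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	m := c.byNode[node]
	if m == nil {
		m = map[string]entry{}
		c.byNode[node] = m
	}
	m[image] = entry{sizeBytes: sizeBytes, lastSeen: time.Now()}
}

// SizeOn returns the remembered size of an image on a node, or 0 if it has
// not been seen recently; callers would fold this into the priority score.
func (c *RecentImages) SizeOn(node, image string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.byNode[node][image]
	if !ok || time.Since(e.lastSeen) > c.ttl {
		return 0
	}
	return e.sizeBytes
}
```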

@smarterclayton (Contributor) commented Apr 20, 2017 via email.

@wojtek-t (Member) commented:

> Yeah, they optimize placement (and in practice we see massive benefit of
> this on fairly dense clusters). The heuristic is really just best effort
> at this point - we could get a lot better in the scheduler by knowing which
> images have overlapping layers, but that's not something the node can
> surface up.

That's a bigger thing. This doesn't seem like a high priority to me (though I agree it would be useful at some point).

> I think being a bit more restrictive on the nodes, and cache a bit more on
> the scheduler, would improve some dimensions of this. Image list is
> already capped, and I thought it was MR-accessed.

Is it? I think it wasn't some time ago.
How do we choose which ones we include in the Node? Randomly?

> Another thing - small clusters probably can live with bigger lists. Large
> clusters would prefer smaller lists.

That's true, though the question is how to choose the ones that we want.
The biggest ones?

@wojtek-t (Member) commented:

Or the ones with the biggest number of layers?

@smarterclayton (Contributor) commented Apr 20, 2017 via email.

@timothysc (Member) commented:

At the macro scale, image-based steering only matters for large pulls, and it's a one-time cost. So perhaps our threshold should be a mix of image and layer size. Data locality has long since been dismissed as a red herring; what matters more is network bandwidth utilization, because you will often trade startup latency for fragmentation.

@wojtek-t (Member) commented:

@shyamjvs - can you please first verify that if we change the size of the node objects we get comparable results?
Probably the easiest way to do that is to run a real cluster without the image puller and compare with the kubemark results.

@shyamjvs (Member) commented:

Sorry it took me some time to get to this. Here's the result of running a 100-node real-cluster test with image prepulling disabled:

    {
      "Name": "etcd-server-e2e-test-shyamjvs-master/etcd-container",
      "Cpu": 0.118644001,
      "Mem": 318623744
    },
    {
      "Name": "etcd-server-events-e2e-test-shyamjvs-master/etcd-container",
      "Cpu": 0.112163863,
      "Mem": 180400128
    },
    {
      "Name": "kube-apiserver-e2e-test-shyamjvs-master/kube-apiserver",
      "Cpu": 0.735774906,
      "Mem": 715776000
    },
    {
      "Name": "kube-controller-manager-e2e-test-shyamjvs-master/kube-controller-manager",
      "Cpu": 0.150536057,
      "Mem": 306913280
    },
    {
      "Name": "kube-scheduler-e2e-test-shyamjvs-master/kube-scheduler",
      "Cpu": 0.148011315,
      "Mem": 116596736
    },

Seems like image data is indeed doing much mischief.

@gmarek (Contributor, Author) commented Apr 25, 2017:

CPU usage looks a lot better, but it seems that memory didn't drop much, so there's something else going on.

@shyamjvs (Member) commented Apr 25, 2017:

So the new size of the node object is 5.5kB, which is much better than 13.6kB. However, there is still some difference from the kubemark node size of 3.6kB. I'll look into why we see this difference, but the part that seems most likely to cause it is the images field:

"images": [
            {
                "names": [
                    "gcr.io/google_containers/node-problem-detector:v0.3.0"
                ],
                "sizeBytes": 290419520
            },
            {
                "names": [
                    "gcr.io/google-containers/fluentd-gcp:2.0.2"
                ],
                "sizeBytes": 170658036
            },
            {
                "names": [
                    "gcr.io/google_containers/kube-proxy:014539e35fe118b35e4b81db8c6201ed"
                ],
                "sizeBytes": 111088940
            },
            {
                "names": [
                    "gcr.io/google_containers/serve_hostname:v1.4"
                ],
                "sizeBytes": 6222101
            },
            {
                "names": [
                    "gcr.io/google_containers/pause-amd64:3.0"
                ],
                "sizeBytes": 746888
            }
        ],

@shyamjvs (Member) commented Apr 25, 2017:

Ok, so I verified that it is indeed the "images" field that is creating the difference. There is no such field in the kubemark node object, which is not surprising.
All other fields are pretty much the same (except some extra heartbeat conditions, but they hardly make much difference in size).
We can either leave this difference be, or tweak the hollow kubelet to fake these images for the hollow node (or maybe not, if the apiserver is the one that populates the images field - I'm not sure about that).
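For anyone wanting to repeat the size check, a small client-go sketch along these lines (assuming a recent client-go; API signatures vary across releases, and flag/error handling is trimmed) fetches a node and prints its serialized size with and without status.images:

```go
// Quick check of how much status.images contributes to a node's JSON size.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use the default kubeconfig (~/.kube/config) for the cluster under test.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{Limit: 1})
	if err != nil || len(nodes.Items) == 0 {
		panic("no nodes found")
	}
	node := nodes.Items[0]

	full, _ := json.Marshal(node)
	node.Status.Images = nil
	stripped, _ := json.Marshal(node)

	fmt.Printf("node %s: %d bytes total, %d bytes without status.images\n",
		node.Name, len(full), len(stripped))
}
```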

@shyamjvs (Member) commented:

My experience here is minimal, but here are some views I have (could be wrong) about this:

  • We should consider disabling the image prepuller for our scalability tests (until we come up with a good long-term solution, like the heuristics/caching). This might hurt us by causing some extra test flakiness, but that is the cost we pay for keeping our tests standardized, i.e. not letting node size vary. Isn't this also the philosophy behind why we chose to use pause pods instead of arbitrary images with varying sizes? At a time when we are trying to make our SLOs more rigorous/meaningful by standardizing our workloads, it might not be the best thing to introduce such image-list-related variance.
  • I'm open to the idea of (artificially) increasing node size in kubemark to match the prepulled real-cluster node size. However, the issue with that is even if we made real cluster and kubemark uniform, we would still end up making the 'get node' call latency vary based on our image list size (which is bad wrt our SLO).
  • 'get nodes' is one instance where latencies differ between kubemark and a real cluster (thanks to this difference in resource consumption for helping us notice it). However, the overall difference is caused in part by many other API calls too, which we need to investigate. One considerable difference, for instance, is in 'list pods'. In kubemark we essentially get an empty response for it:
{
    "apiVersion": "v1",
    "items": [],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

@shyamjvs (Member) commented Apr 25, 2017:

Here are the results of running the performance comparison tool we recently built against runs from the last 24 hours of our 100-node kubemark vs. real cluster tests. We essentially compare the averages (taken across the runs) of the various API calls' latencies across the two tests, and deem the ones whose ratio of averages is far from 1 as mismatched. Showing the ones which mismatched on their 99th percentile here (a simplified sketch of the matching logic follows after the table):

E2E TEST       VERB         RESOURCE                PERCENTILE  MATCHED?  COMMENTS
Load capacity  LIST         resourcequotas          Perc99      false     AvgRatio=1.9651     N1=21  N2=22
Density        DELETE       pods                    Perc99      false     AvgRatio=1.5327     N1=21  N2=22
Load capacity  POST         namespaces              Perc99      false     AvgRatio=0.6539     N1=21  N2=22
Density        POST         bindings                Perc99      false     AvgRatio=1.3155     N1=21  N2=22
Density        GET          nodes                   Perc99      false     AvgRatio=3.3588     N1=21  N2=22 <---
Density        GET          namespaces              Perc99      false     AvgRatio=1.4072     N1=21  N2=22
Load capacity  LIST         limitranges             Perc99      false     AvgRatio=2.0627     N1=21  N2=22
Load capacity  LIST         pods                    Perc99      false     AvgRatio=21.6880    N1=7   N2=11 <---
Density        GET          clusterrolebindings     Perc99      false     AvgRatio=2.5170     N1=21  N2=22
Load capacity  GET          endpoints               Perc99      false     AvgRatio=1.7900     N1=21  N2=22
Density        GET          replicationcontrollers  Perc99      false     AvgRatio=0.6559     N1=21  N2=22
Density        GET          endpoints               Perc99      false     AvgRatio=1.4890     N1=21  N2=22
Load capacity  PUT          pods                    Perc99      false     AvgRatio=1.5165     N1=21  N2=22
Load capacity  GET          replicationcontrollers  Perc99      false     AvgRatio=1.5290     N1=21  N2=22
Density        LIST         services                Perc99      false     AvgRatio=3.1146     N1=21  N2=11 <---
Load capacity  GET          nodes                   Perc99      false     AvgRatio=4.4837     N1=21  N2=22 <---
Density        POST         pods                    Perc99      false     AvgRatio=1.2752     N1=21  N2=22
Density        LIST         pods                    Perc99      false     AvgRatio=80.7757    N1=8   N2=11 <---
Load capacity  DELETE       services                Perc99      false     AvgRatio=1.2509     N1=21  N2=22
Load capacity  GET          namespaces              Perc99      false     AvgRatio=1.4972     N1=21  N2=22
Load capacity  LIST         endpoints               Perc99      false     AvgRatio=0.2584     N1=21  N2=11
Load capacity  POST         secrets                 Perc99      false     AvgRatio=1.2891     N1=21  N2=22
Load capacity  GET          secrets                 Perc99      false     AvgRatio=5.5787     N1=21  N2=22 <---
Density        GET          secrets                 Perc99      false     AvgRatio=4.2956     N1=21  N2=22 <---
Density        PUT          replicationcontrollers  Perc99      false     AvgRatio=1.3231     N1=21  N2=22
Load capacity  POST         pods                    Perc99      false     AvgRatio=1.4626     N1=21  N2=22
Load capacity  PUT          replicationcontrollers  Perc99      false     AvgRatio=1.4140     N1=21  N2=22
Load capacity  PATCH        clusterrolebindings     Perc99      false     AvgRatio=5.9547     N1=21  N2=22 <---
Load capacity  PATCH        nodes                   Perc99      false     AvgRatio=2.2478     N1=21  N2=22
Load capacity  DELETE       endpoints               Perc99      false     AvgRatio=1.2618     N1=21  N2=22
Load capacity  POST         bindings                Perc99      false     AvgRatio=1.4661     N1=21  N2=22
Load capacity  POST         endpoints               Perc99      false     AvgRatio=1.3484     N1=21  N2=22
Density        LIST         resourcequotas          Perc99      false     AvgRatio=1.8626     N1=21  N2=22
Load capacity  POST         replicationcontrollers  Perc99      false     AvgRatio=1.5701     N1=21  N2=22
Load capacity  GET          services                Perc99      false     AvgRatio=1.3521     N1=21  N2=22
Density        GET          pods                    Perc99      false     AvgRatio=1.7464     N1=21  N2=22
Load capacity  GET          clusterrolebindings     Perc99      false     AvgRatio=2.1498     N1=21  N2=22
Load capacity  GET          serviceaccounts         Perc99      false     AvgRatio=4.9304     N1=21  N2=22
Density        PATCH        clusterrolebindings     Perc99      false     AvgRatio=5.3059     N1=21  N2=22 <---
Load capacity  PUT          endpoints               Perc99      false     AvgRatio=2.4925     N1=21  N2=22
Density        PUT          endpoints               Perc99      false     AvgRatio=1.3722     N1=21  N2=22
Density        PUT          pods                    Perc99      false     AvgRatio=1.5144     N1=21  N2=22
Load capacity  LIST         services                Perc99      false     AvgRatio=0.2305     N1=21  N2=11 <---
Density        PATCH        nodes                   Perc99      false     AvgRatio=2.3621     N1=21  N2=22
Load capacity  DELETE       replicationcontrollers  Perc99      false     AvgRatio=1.4247     N1=21  N2=22
Load capacity  GET          pods                    Perc99      false     AvgRatio=1.8237     N1=21  N2=22
Load capacity  DELETE       pods                    Perc99      false     AvgRatio=1.5266     N1=21  N2=22
Density        LIST         endpoints               Perc99      false     AvgRatio=4.2569     N1=21  N2=8  <---
Density        GET          serviceaccounts         Perc99      false     AvgRatio=4.5150     N1=21  N2=22

The above results are just to give a rough idea about the differences. I'll work on writing a more detailed report on them soon.
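For reference, the matching criterion boils down to something like the following simplified sketch (not the actual tool's code; the ratio threshold and minimum-sample cutoff are illustrative):

```go
// Simplified sketch of the mismatch check: average a metric's Perc99 across
// kubemark runs and real-cluster runs, then flag the pair if the ratio of
// the averages strays too far from 1.
package compare

const (
	minSamples = 10  // ignore metrics seen in too few runs
	maxRatio   = 1.5 // ratio outside [1/maxRatio, maxRatio] => mismatch
)

func avg(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// Matched reports whether the kubemark and real-cluster Perc99 samples for a
// single (test, verb, resource) tuple are considered comparable.
func Matched(kubemark, real []float64) (ratio float64, ok bool) {
	if len(kubemark) < minSamples || len(real) < minSamples {
		return 0, true // too little data to call it a mismatch
	}
	a, b := avg(kubemark), avg(real)
	if b == 0 {
		return 0, false
	}
	ratio = a / b
	return ratio, ratio <= maxRatio && ratio >= 1/maxRatio
}
```

The minimum-sample cutoff here corresponds to the idea, raised later in this thread, of dropping metrics that show up in only a handful of runs.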

@wojtek-t (Member) commented:

> We can either let this difference be or try to tweak hollow-kubelet to fake these images for the hollow node (or maybe not if apiserver is the one which populates the images field, I'm not sure about it).

It's not the apiserver that populates it - it's the kubelet. See #44701 (comment)

> We should consider disabling image-prepuller for our scalability tests

+1 - this is useless in our tests

> I'm open to the idea of (artificially) increasing node size in kubemark

-1 - let's not do anything artificial. If we believe the images are the main difference (and disabling the image puller is not enough), we should change the hollow node to make the data the same between real cluster and kubemark (by caching the images of pods running on a node in the hollow kubelet).

> One considerable difference, for instance, is due to 'list pods'. In kubemark we essentially get an empty response for it:

I don't understand. If this were empty, many things wouldn't work.
Or do you mean the "default" namespace? If so, this should be empty in real clusters too.

One potential difference that I can think of now is that we don't have a bunch of system pods, like fluentd, in the kubemark cluster (and we should add them).

@wojtek-t (Member) commented:

@shyamjvs ^^

@shyamjvs (Member) commented Apr 26, 2017:

@wojtek-t Thanks for the lead. I guess you are right that the default namespace would be empty in both cases. But the perf comparison shows a super high ratio for 'list pods' latency in the density test, nowhere close to the deviations for any other API call.

Density        LIST         pods                    Perc99      false     AvgRatio=98.1665    N1=6   N2=6

I want to understand the reason behind this. Maybe doing a kubectl list of pods while the e2e test is running could help? Also, there is one more thing that's unusual: N1 and N2 are both 6, while the total numbers of runs we chose are 15 and 18 respectively. This means 'list pods' shows up only in some runs. Are you aware of why this could be happening?

@wojtek-t (Member) commented:

> But the perf comparison shows a super high ratio of "list pods" latency in the case of density test, which is nowhere close to deviations for any other api call.

Which test are you looking at?

If this is for density, it's possible that there won't be any pod LISTs, because pretty much everything in our system uses the reflector/informer framework, so we only list after a watch was broken and we can't restart it from the previous point (which should happen relatively rarely).

If this is about load, then the test itself is doing a lot of pod LISTs, so if we are missing those metrics we have a bug somewhere (potentially in the metrics-gathering code) and we should debug and fix it.

@gmarek (Contributor, Author) commented Apr 26, 2017:

We can start by dropping metrics with count smaller than, say, 10.

@shyamjvs (Member) commented:

@wojtek-t That sounds reasonable. I was talking about the density test. However, I just looked into the load capacity runs and it turns out the case is the same there too: 'list pods' appears only in some runs. On checking the count of 'list pods' requests for run #2026, it turns out that there is only one such request (and that one is from the density test):

      {
        "metric": {
          "__name__": "apiserver_request_count",
          "client": "e2e.test/v1.7.0 (linux/amd64) kubernetes/eb0bc85",
          "code": "200",
          "contentType": "application/vnd.kubernetes.protobuf",
          "resource": "pods",
          "verb": "LIST"
        },
        "value": [
          0,
          "1"
        ]
      },

This is not expected, given that we have a 'list pods' call inside our load test's scale function. Either I'm going wrong somewhere or there is some error in the master metrics exposed by the apiserver. However, I don't think the bug is in the metrics-gathering code, as all it does is scrape the /metrics endpoint after the test finishes and parse the Prometheus-format metrics.

@shyamjvs shyamjvs self-assigned this May 19, 2017
@shyamjvs (Member) commented:

  • The 99th %ile of 'LIST pods' latency in our 100-node real cluster is ~2 times that of the 100-node kubemark for the density test (for load it is ~1.6 times).
  • A huge majority of the 'list pods' calls are made by e2e.test and list the various pods under the density test namespaces.
  • Looking at the size of a single such pod object, it turns out that it's 4.2kB for kubemark and 4.8kB for the real cluster. This difference is almost entirely due to default tolerations in the latter case (roughly the entries sketched below).

I'll send out a PR enabling that admission plugin in kubemark. Let's see if that helps.
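For context, the per-pod difference comes from tolerations injected by admission (presumably the DefaultTolerationSeconds plugin); roughly, every pod gains entries along these lines (the taint key names have changed across releases, so treat this only as an illustration of the size overhead):

```go
// Roughly what the DefaultTolerationSeconds admission plugin adds to every
// pod that doesn't set these tolerations itself; shown only to explain the
// per-pod size difference observed above.
package example

import "k8s.io/api/core/v1"

func defaultTolerations() []v1.Toleration {
	seconds := int64(300)
	return []v1.Toleration{
		{
			Key:               "node.kubernetes.io/not-ready",
			Operator:          v1.TolerationOpExists,
			Effect:            v1.TaintEffectNoExecute,
			TolerationSeconds: &seconds,
		},
		{
			Key:               "node.kubernetes.io/unreachable",
			Operator:          v1.TolerationOpExists,
			Effect:            v1.TaintEffectNoExecute,
			TolerationSeconds: &seconds,
		},
	}
}
```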

@shyamjvs (Member) commented Jun 1, 2017:

With the above PR, the pod object size should be almost the same in both cases. For the node object, we still see a ~2kB size difference (real node: 5.5kB, hollow node: 3.5kB). Here's a breakdown of the difference:

  • 330B due to KernelDeadlock node condition which kubemark's npd doesn't post. Will send out a PR to change this.
  • 330B due to NetworkUnavailable node condition created by routecontroller, which is disabled in kubemark. Do we really want to try enabling this? It'll put more pressure on our kubemark tests, but one good thing is resource usage of controller-manager would be less different across kubemark and real cluster.
  • 340B due to missing labels (fluentd, instance-type, failure-domain) and annotation (volumes.kubernetes.io/controller-managed-attach-detach). We can make hollow-kubelet add these labels. In fact, we can make it add even more dummy labels just to compensate for all the size difference.
  • 570B due to missing images in kubemark (npd, fluentd, kube-proxy).
  • 400B due to misc. missing data (like nodeInfo (bootID, osImage, systemUUID, etc), node ExternalIP, etc)

@wojtek-t (Member) commented Jun 2, 2017:

> 330B due to NetworkUnavailable node condition created by routecontroller, which is disabled in kubemark. Do we really want to try enabling this? It'll put more pressure on our kubemark tests, but one good thing is resource usage of controller-manager would be less different across kubemark and real cluster.

We definitely do NOT want to enable the route controller. We can try artificially setting this condition somewhere (see the sketch below this comment).

> 340B due to missing labels (fluentd, instance-type, failure-domain) and annotation (volumes.kubernetes.io/controller-managed-attach-detach). We can make hollow-kubelet add these labels.

Let's do it.

> In fact, we can make it add even more dummy labels just to compensate for all the size difference.

No - this will impact e.g. processing of selectors. We don't want this.

> 570B due to missing images in kubemark (npd, fluentd, kube-proxy).

Actually, I'm now wondering why we don't have images. Are we mocking something that tracks them? Or do they come from the docker client? Or what?
Maybe it's possible to just track the images that we are "using" on this node?

> 400B due to misc. missing data (like nodeInfo (bootID, osImage, systemUUID, etc), node ExternalIP, etc)

Can we add them?
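A sketch of the direction discussed above, i.e. having the hollow node fill in the missing labels and artificially set the NetworkUnavailable condition that the route controller would normally manage (the label values and the helper are illustrative, not the actual kubemark code):

```go
// Illustrative: make the hollow node's object look closer to a real
// cloud-provider-backed node by adding the labels and condition it is
// currently missing.
package hollownode

import (
	"time"

	"k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func decorateHollowNode(node *v1.Node) {
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	// Labels a real node would carry (values are placeholders).
	node.Labels["beta.kubernetes.io/instance-type"] = "n1-standard-2"
	node.Labels["failure-domain.beta.kubernetes.io/zone"] = "us-central1-b"
	node.Labels["beta.kubernetes.io/fluentd-ds-ready"] = "true"

	// Fake the condition the route controller would normally set, since we
	// do not want to run the route controller in kubemark.
	node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{
		Type:               v1.NodeNetworkUnavailable,
		Status:             v1.ConditionFalse,
		Reason:             "RouteCreated",
		Message:            "Faked by hollow kubelet",
		LastTransitionTime: metav1.NewTime(time.Now()),
	})
}
```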

k8s-github-robot pushed a commit that referenced this issue Jun 5, 2017
Automatic merge from submit-queue (batch tested with PRs 45871, 46498, 46729, 46144, 46804)

Enable some pod-related admission plugins for kubemark

Ref #44701

This should help reduce discrepancy in "list pods" latency wrt real cluster. Let's see.

/cc @wojtek-t @gmarek
k8s-github-robot pushed a commit that referenced this issue Jun 6, 2017
Automatic merge from submit-queue (batch tested with PRs 46972, 42829, 46799, 46802, 46844)

Add KernelDeadlock condition to hollow NPD

Ref #44701

/cc @wojtek-t @gmarek
k8s-github-robot pushed a commit that referenced this issue Oct 3, 2017
Automatic merge from submit-queue (batch tested with PRs 51765, 53053, 52771, 52860, 53284). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add audit-logging, feature-gates & few admission plugins to kubemark

To make kubemark match real cluster settings. Also includes a few other settings like request-timeout, etcd-quorum, etc.

Fixes #53021
Related #51899 #44701

cc @kubernetes/sig-scalability-misc @wojtek-t @gmarek @smarterclayton
@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2017
@fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 25, 2018
@shyamjvs (Member) commented:

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 16, 2018
@shyamjvs (Member) commented:

It's been a while, but there's an interesting update here. After my recent fix (#59832) for a significant bug with endpoints in kubemark (#59823), the gap between the kubemark and real-cluster PUT endpoints 99th-percentile latencies has fallen drastically:

from ~6.0 (see - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-perf-tests-kubemark-100-benchmark/3192/build-log.txt)
to ~1.3 (see - https://storage.googleapis.com/kubernetes-jenkins/logs/ci-perf-tests-kubemark-100-benchmark/3269/build-log.txt)

We're now down to resolving just the difference in LIST nodes between the two:

E2E TEST  VERB  RESOURCE   SUBRESOURCE  SCOPE      PERCENTILE  COMMENTS
density   LIST  nodes                   cluster    Perc99      AvgL/R=4.55  AvgL(ms)=58.58  AvgR(ms)=12.86  N1=27  N2=7
load      LIST  nodes                   cluster    Perc99      AvgL/R=2.11  AvgL(ms)=75.76  AvgR(ms)=35.98  N1=27  N2=7

(ref: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-perf-tests-kubemark-100-benchmark/3270/build-log.txt)

@shyamjvs (Member) commented:

One more thing: since the endpoints objects were effectively empty in the kubemark cluster, that should have considerably reduced etcd and apiserver memory usage for kubemark. Let me grab some values from the present.

@shyamjvs (Member) commented:

And I was right...

[graphs: apiserver and etcd memory usage over time, kubemark-100]

This is from kubemark-100 density test. You can see that the memory usage for both etcd and apiserver has gone up by ~125 MB (around the time of my change).

@shyamjvs (Member) commented:

And for the 100-node real cluster, this is how the apiserver and etcd memory usage looks:

[graphs: apiserver and etcd memory usage over time, 100-node real cluster]

@shyamjvs (Member) commented:

To summarize, this is how the average memory usage looks:

  • Real cluster:
    • etcd: ~700 MB
    • apiserver: ~1200 MB
  • Kubemark cluster:
    • etcd: changed from ~350 MB to ~475 MB
    • apiserver: changed from ~700 MB to ~825 MB

@shyamjvs (Member) commented:

I have a strong feeling that a considerable component of the remaining difference comes from node objects (as we were discussing earlier in this thread). In addition to the object size itself, there is one more factor magnifying the effect: having numerous versions of each node object in etcd (discussed in more detail in #14733).

So I think a logical way to proceed here is to check the size of node objects (both the latest RV in the apiserver and all versions in etcd) and, if we find a big enough difference, resurrect some of the ideas discussed with @wojtek-t here.

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2018
@wojtek-t (Member) commented Jun 8, 2018:

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jun 8, 2018