kube-state-metrics consuming too much memory #257

Closed
jac-stripe opened this issue Sep 15, 2017 · 70 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jac-stripe

kube-state-metrics is using more than 400 MB of RAM and is very slow to respond when I query /metrics. The Kubernetes cluster has 2700 Job objects, so it seems surprising that metrics aggregation would consume 400 MB of RAM. Below is a pprof top trace. This is running the latest git revision (d316c01).

(pprof) top
Showing nodes accounting for 526.72MB, 86.90% of 606.14MB total
Dropped 148 nodes (cum <= 3.03MB)
Showing top 10 nodes out of 110
      flat  flat%   sum%        cum   cum%
  195.01MB 32.17% 32.17%   202.01MB 33.33%  github.com/prometheus/client_golang/prometheus.makeLabelPairs
  101.26MB 16.71% 48.88%   148.26MB 24.46%  github.com/prometheus/client_golang/prometheus.(*Registry).Gather
   74.28MB 12.26% 61.13%    74.81MB 12.34%  k8s.io/kube-state-metrics/collectors.RegisterJobCollector.func1
      47MB  7.75% 68.89%       47MB  7.75%  github.com/prometheus/client_golang/prometheus.populateMetric
   27.60MB  4.55% 73.44%    30.60MB  5.05%  k8s.io/client-go/pkg/api/v1.codecSelfer1234.decSliceVolume
   23.01MB  3.80% 77.24%    23.01MB  3.80%  runtime.rawstringtmp
   18.97MB  3.13% 80.37%    19.55MB  3.22%  github.com/golang/protobuf/proto.(*Buffer).EncodeStringBytes
   15.50MB  2.56% 82.92%   217.51MB 35.88%  github.com/prometheus/client_golang/prometheus.NewConstMetric
   13.50MB  2.23% 85.15%    14.02MB  2.31%  runtime.mapassign
   10.58MB  1.74% 86.90%    12.71MB  2.10%  compress/flate.NewWriter
@brancz
Member

brancz commented Sep 15, 2017

What does "very" slow mean? Up to 10 seconds response time wouldn't be unusual for a huge request.

For every Job object there are at least 12 metrics being reported, plus 20 metrics for each of the pods created by those Job objects. That is 2700 * 12 = 32400, plus 2700 * 20 = 54000, for a total of 86400. With a minimum of 86400 lines of metrics per HTTP request, those numbers actually don't seem too unreasonable, although we are aware of some inefficiencies in the Prometheus Go implementation that primarily drive these numbers up.

It might be worth checking, though, that you're not running into the same problem as reported here: #112 (comment)
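
For illustration, the back-of-the-envelope estimate above can be written as a small Go sketch. The function name and parameters are hypothetical, and it assumes one pod per Job, as the arithmetic in the comment does; real per-object metric counts vary between versions.

```go
package main

import "fmt"

// estimateMetricLines returns a rough lower bound on the number of lines in a
// /metrics response, using per-object figures like those quoted above.
// The per-object counts are parameters because they differ between versions.
func estimateMetricLines(jobs, metricsPerJob, metricsPerJobPod int) int {
	jobLines := jobs * metricsPerJob
	podLines := jobs * metricsPerJobPod // assumes one pod per Job
	return jobLines + podLines
}

func main() {
	fmt.Println(estimateMetricLines(2700, 12, 20)) // 32400 + 54000
}
```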

@julia-stripe
Contributor

thanks so much @brancz! "very slow" means that kube-state-metrics' /metrics endpoint often doesn't respond to HTTP requests at all (even after 10 minutes or so).

can you say offhand what some of the inefficiencies of the Prometheus Go implementation are? That could help us debug.

@brancz
Member

brancz commented Sep 15, 2017

The inefficiencies are a number of allocations that could be optimized, but that wouldn't explain why the HTTP requests don't respond at all. The scalability tests that Google ran (#124 (comment)) had 1000 nodes and 30000 pods, responded within 9s, and used ~1GB of memory and 0.3 cores. The number of metrics in those tests should be far higher than in this case, so I feel it might actually be something in the Job collector. The memory usage probably just ends up showing in the Prometheus client code because we're still creating metrics there, so memory profiles of alloc_space and inuse_space would be helpful. Could you take those with go tool pprof and share the bundles that it drops in $HOME/pprof? If you analyze them, even better 😉.

@matthiasr

matthiasr commented Sep 15, 2017 via email

@brancz
Member

brancz commented Sep 15, 2017

Yep that's what I linked to earlier, it's my suspicion as well.

@julia-stripe
Contributor

Are there a lot of Completed pods in the cluster?

No, I double-checked and there are only 200 completed pods in the cluster (we configure kube to garbage collect terminated pods to avoid exactly this problem).

@andyxning
Member

andyxning commented Sep 16, 2017

@julia-stripe Can you paste the log lines showing how many Job and Pod objects were scraped, as implemented in #254? We need to confirm the actual number of objects kube-state-metrics scraped.

@brancz
Member

brancz commented Sep 18, 2017

Yes what @andyxning mentioned would indeed be helpful, and then the full memory profiles for further analysis 🙂 .

@jac-stripe
Author

jac-stripe commented Sep 18, 2017

I have attached profiles for alloc_space[1] and inuse_space[2]. Increasing the memory allocated to 2 GB seems to help, giving a metrics response in ~25 seconds. However, this seems like a lot of RAM for our cluster size. Below are some statistics on our cluster:

$ kubectl get pods --all-namespaces -a | wc -l
273
$ kubectl get jobs --all-namespaces -a | wc -l
2028
$ kubectl get cronjobs --all-namespaces -a | wc -l
193

[1]: pprof.com_github_kubernetes_kube_state_metrics.static.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
[2]: pprof.com_github_kubernetes_kube_state_metrics.static.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz

@brancz
Member

brancz commented Sep 19, 2017

That's definitely a lot, could you give us the number of lines in the metric response? It will tell us the number of time-series this is producing, which should be interesting given the response time and memory usage.

@andyxning
Member

@jac-stripe
Thanks for your valuable info.

BTW, could you please paste the kube-state-metrics log lines showing the number of collected objects, in addition to the kubectl output? That logging was added in #254, mainly to debug problems like this one.

@jac-stripe
Author

@brancz the response is 16172 lines, totalling 2226kb.
@andyxning here are some logs on collected objects within the last few seconds:

I0919 16:55:56.928875       1 deployment.go:155] collected 3 deployments
I0919 16:55:56.928954       1 replicaset.go:119] collected 4 replicasets
I0919 16:55:56.933851       1 cronjob.go:125] collected 216 cronjobs
I0919 16:55:57.090494       1 job.go:149] collected 665 jobs
I0919 16:55:57.180204       1 pod.go:238] collected 272 pods
I0919 16:55:57.378936       1 daemonset.go:113] collected 1 daemonsets
I0919 16:55:57.379160       1 deployment.go:155] collected 3 deployments
I0919 16:55:57.379251       1 replicaset.go:119] collected 4 replicasets
I0919 16:55:57.379274       1 limitrange.go:97] collected 0 limitranges
I0919 16:55:57.379655       1 namespace.go:110] collected 4 namespaces
I0919 16:55:57.379686       1 statefulset.go:121] collected 0 statefulsets
I0919 16:55:57.379855       1 replicationcontroller.go:125] collected 0 replicationcontrollers
I0919 16:55:57.380030       1 service.go:101] collected 4 services
I0919 16:55:57.380050       1 persistentvolumeclaim.go:107] collected 0 persistentvolumeclaims
I0919 16:55:57.380063       1 resourcequota.go:95] collected 0 resourcequotas
I0919 16:55:57.382270       1 node.go:181] collected 20 nodes
I0919 16:55:57.480281       1 job.go:149] collected 665 jobs
I0919 16:55:57.581042       1 cronjob.go:125] collected 216 cronjobs
I0919 16:55:57.979816       1 pod.go:238] collected 272 pods

@smarterclayton
Contributor

smarterclayton commented Sep 20, 2017

We've got about 10x that number of resources and are seeing total heap around 2.2GB:

[Image: memory for KSM]

KSM /metrics has 929k series and is 102MB uncompressed.

@smarterclayton
Contributor

A couple of obvious ways to improve:

  1. switch to protobuf in the client (if it isn't already) for resources that support it
  2. switch to a flyweight pattern in the informers - create a shim object to place in the informer cache that contains only the fields necessary (and probably avoids maps, which are very expensive).
  3. Alternatively turn each object into a set of samples and cache those each time the object changes

@brancz
Member

brancz commented Sep 21, 2017

@smarterclayton thanks for sharing your data and information! What is the response time for the call to the /metrics endpoint?

  1. We haven't done any intentional switch to protobuf, but that seems like an easy first step. From what I can tell it's just a change to the content type passed to the rest client. How can I tell which types do not have protobuf support?

  2. I haven't seen the "flyweight" pattern in regard to the informers, would this still allow the use of protobuf if we leave out fields?

  3. If I understand this correctly, you mean that we turn the pattern around and based on the events emitted from the Kubernetes API build exactly the metrics we need instead of creating them on every request? I think that's generally a good idea, just gets more complicated in regard to staleness of objects, but should still be manageable.

For the first two options I can definitely see the returns. The third seems like a nice-to-have, but it is a lot of work at this point with an unknown result.

@brancz
Member

brancz commented Sep 21, 2017

@jac-stripe a big chunk of memory in use and allocations seem to be coming from the addConditionMetrics call from the job collector. I'm not too familiar with the Job object, what are typical entries in the Job.Status.Conditions array? It seems a large chunk is coming from this loop. It's odd because the Pod collector has much less memory in use and allocated overall and the ratio of Pod collector memory and Job collector memory doesn't match the object count ratio you shared.

@smarterclayton
Contributor

How can I tell, which types do not have protobuf support?

You can pass multiple Accept headers and everything should just work for types that don't have protobuf like CRD or custom API extensions. I.e. Accept: application/vnd.kubernetes.protobuf, application/json
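
A minimal sketch of that content negotiation at the plain HTTP level, using only the Go standard library. The helper name and the URL are placeholders. As far as I know, client-go exposes the same knob through the content-type fields of its rest configuration, but that is not shown here.

```go
package main

import (
	"fmt"
	"net/http"
)

// newListRequest builds a GET request that prefers the Kubernetes protobuf
// encoding and falls back to JSON for types without protobuf support.
func newListRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.kubernetes.protobuf, application/json")
	return req, nil
}

func main() {
	// The host and path are placeholders for illustration only.
	req, err := newListRequest("https://apiserver.example/apis/batch/v1/jobs")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Accept"))
}
```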

I haven't seen the "flyweight" pattern in regard to the informers

Basically transform the object coming in on the watch call into something simpler in the cache.ListWatch. So you call upstream API and get a list of pods, then you transform it into something simpler. You can make the objects "fake" API objects in most cases.

https://github.com/openshift/origin/blob/master/pkg/image/controller/trigger/cache.go#L50 takes arbitrary kube objects and uses an adapter to turn them into a *trigger.cacheEntry which is a uniform value that then goes into the store. You can also do the conversion at cache.ListWatch time but you'll have to use api.List or so.
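
The flyweight idea described above can be sketched without any Kubernetes dependencies. All types here (fullJob, jobEntry, store) are hypothetical stand-ins, not the types used by the linked OpenShift code or by kube-state-metrics; the point is only that the adapter runs before the object reaches the store, so the large object can be garbage collected.

```go
package main

import "fmt"

// fullJob is a hypothetical stand-in for a complete API object with many
// fields (labels, annotations, spec, ...) that the metrics never read.
type fullJob struct {
	Namespace, Name string
	Labels          map[string]string
	Annotations     map[string]string
	Spec            []byte // placeholder for a large nested spec
	Succeeded       int32
	Failed          int32
}

// jobEntry is the flyweight: only the fields needed for metrics, no maps.
type jobEntry struct {
	Namespace, Name string
	Succeeded       int32
	Failed          int32
}

// toEntry is the adapter applied before an object is placed in the store.
func toEntry(j *fullJob) jobEntry {
	return jobEntry{Namespace: j.Namespace, Name: j.Name, Succeeded: j.Succeeded, Failed: j.Failed}
}

// store keeps only flyweight entries, keyed by namespace/name.
type store map[string]jobEntry

func (s store) add(j *fullJob) {
	s[j.Namespace+"/"+j.Name] = toEntry(j)
}

func main() {
	s := store{}
	s.add(&fullJob{
		Namespace: "default", Name: "nightly",
		Labels:    map[string]string{"team": "data"},
		Spec:      make([]byte, 4096),
		Succeeded: 1,
	})
	fmt.Printf("%+v\n", s["default/nightly"])
}
```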

If I understand this correctly, you mean that we turn the pattern around and based on the events

Right - I'd call this maintaining an index via the reflector rather than maintaining a cache store. So ListWatch, instead of returning the stripped down objects, actually returns a list of metrics for that object. You can wrap both the list and the watch.

We've talked about doing that transform lower down, so you'd be able to pass a cache.ListWatch a Transformer that takes arbitrary valid objects and turns them into what goes into the store, but we haven't done it yet.
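
The "index of metrics instead of a cache of objects" idea can be sketched as follows. This is a standalone illustration under my own assumptions: metricsIndex and its methods are hypothetical names, the sample lines are made up, and wiring it to real watch events is left out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// metricsIndex caches pre-rendered sample lines per object key, so a scrape
// only concatenates cached strings instead of rebuilding every metric.
type metricsIndex struct {
	mu      sync.RWMutex
	samples map[string][]string
}

func newMetricsIndex() *metricsIndex {
	return &metricsIndex{samples: map[string][]string{}}
}

// upsert is called on add/update events with the lines generated for one object.
func (m *metricsIndex) upsert(key string, lines []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.samples[key] = lines
}

// remove is called on delete events, so deleted objects stop being served.
func (m *metricsIndex) remove(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.samples, key)
}

// render returns all cached samples in a stable key order.
func (m *metricsIndex) render() string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	keys := make([]string, 0, len(m.samples))
	for k := range m.samples {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		for _, line := range m.samples[k] {
			b.WriteString(line)
			b.WriteByte('\n')
		}
	}
	return b.String()
}

func main() {
	idx := newMetricsIndex()
	idx.upsert("default/a", []string{`example_job_succeeded{job="a"} 1`})
	idx.upsert("default/b", []string{`example_job_succeeded{job="b"} 0`})
	idx.remove("default/b")
	fmt.Print(idx.render())
}
```

The staleness concern raised earlier in the thread maps onto remove: an object's samples stay in the index until a delete event (or a relist) drops them.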

@brancz
Member

brancz commented Sep 21, 2017

I created #264 to start using protobuf. That should be a quick win, but eventually all of the above are improvements we probably want to make.

I'm still having difficulty understanding your second point though, probably because I don't know protobuf well enough, or the way it's used in Kubernetes. I'm wondering how parsing a subset of the fields of a protobuf message works. What would the proto definitions look like? And that doesn't influence the data size transferred on the wire, right?

@andyxning
Member

Sorry for not following all the comments above, but a quick question: IIUC, protobuf can only decrease the size of the data sent between the apiserver and kube-state-metrics, so the sync time will decrease. The actual memory used by kube-state-metrics should not decrease.

Correct me if I am wrong.

@brancz
Member

brancz commented Sep 21, 2017

Agreed, nonetheless an improvement we should have.

@andyxning
Member

Yep, let's first make the available improvements.

@andyxning
Member

andyxning commented Sep 21, 2017

And that doesn't influence the data size transferred on wire right?

If protobuf supported something like GraphQL, we could request only the fields we use, and the response size would shrink considerably.

@smarterclayton
Contributor

smarterclayton commented Sep 21, 2017 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2018
@brancz
Member

brancz commented Jan 8, 2018

This still has to be addressed in a better way than we do today.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2018
@stevenmccord

I have one that is taking upwards of 12Gi of memory.

$ kubectl get pods --all-namespaces -a | wc -l
     227
$ kubectl get jobs --all-namespaces -a | wc -l
     381
$ kubectl get cronjobs --all-namespaces -a | wc -l
     116

I am currently on this version FROM quay.io/coreos/kube-state-metrics:v1.2.0

I0404 13:44:41.963104       1 persistentvolumeclaim.go:112] collected 13 persistentvolumeclaims
I0404 13:44:41.963142       1 replicationcontroller.go:130] collected 0 replicationcontrollers
I0404 13:44:41.963193       1 limitrange.go:101] collected 1 limitranges
I0404 13:44:42.062546       1 namespace.go:113] collected 3 namespaces
I0404 13:44:42.062874       1 daemonset.go:136] collected 2 daemonsets
I0404 13:44:42.162730       1 persistentvolume.go:99] collected 13 persistentvolumes
I0404 13:44:42.162850       1 resourcequota.go:99] collected 0 resourcequotas
I0404 13:44:35.766039       1 node.go:186] collected 11 nodes
I0404 13:44:42.262993       1 pod.go:246] collected 199 pods
I0404 13:44:42.462680       1 endpoint.go:120] collected 54 endpoints
I0404 13:44:42.462735       1 service.go:112] collected 51 services
I0404 13:44:42.462956       1 node.go:186] collected 11 nodes
I0404 13:44:42.464009       1 replicaset.go:124] collected 715 replicasets
I0404 13:44:42.464576       1 persistentvolume.go:99] collected 13 persistentvolumes
I0404 13:44:42.663174       1 replicationcontroller.go:130] collected 0 replicationcontrollers
I0404 13:44:42.862684       1 limitrange.go:101] collected 1 limitranges
I0404 13:44:42.862818       1 namespace.go:113] collected 3 namespaces
I0404 13:44:42.862963       1 daemonset.go:136] collected 2 daemonsets
I0404 13:44:43.063073       1 resourcequota.go:99] collected 0 resourcequotas
I0404 13:44:43.163197       1 daemonset.go:136] collected 2 daemonsets
I0404 13:44:43.262674       1 pod.go:246] collected 199 pods
I0404 13:44:43.263091       1 node.go:186] collected 11 nodes
I0404 13:44:43.263327       1 replicaset.go:124] collected 715 replicasets
I0404 13:44:43.264189       1 endpoint.go:120] collected 54 endpoints
I0404 13:44:43.562444       1 statefulset.go:147] collected 5 statefulsets
I0404 13:44:43.562569       1 limitrange.go:101] collected 1 limitranges
I0404 13:44:43.562645       1 namespace.go:113] collected 3 namespaces
I0404 13:44:43.562677       1 deployment.go:160] collected 43 deployments

It tends to spike overnight when I am running a lot of kubernetes jobs ~100. Then it settles down after those jobs are done, but it spikes to a very high memory usage. I am probably doing something incorrectly here, but just wanted to see if there are others having this issue.

@discordianfish
Contributor

@jac-stripe / @julia-stripe: Can you try 1.3.1? I just deployed it and, as suspected, it looks like the Go version upgrade fixed it.

@gades

gades commented May 23, 2018

@discordianfish Version 1.3.1 doesn't solve the context deadline exceeded or OOMKilled issues for me; I still need to adjust the resource parameters.

[Screenshot: 2018-05-23 at 11 08 53]

kubectl get pods --all-namespaces -a | wc -l
996

For me, it starts working with memory set to 2800Mi and CPU set to 2100m.

[Screenshot: 2018-05-23 at 13 46 51]

@discordianfish
Contributor

@gades Just double checked over here and since I've upgraded to 1.3.1 my memory usage is <400MB and the scrape duration <2s, usually <0.5s.

@DewaldV

DewaldV commented Jun 28, 2018

On the topic of Memory consumption, we've been battling with runaway memory consumption of kube-state-metrics on one of our clusters. This particular cluster has around 3730 running pods and 28160 total objects (quick line count of get all --all-namespaces) across 44 nodes.

We've been running a single instance of kube-state-metrics in the kube-system namespace with the following collectors setup:

collectors=cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,jobs,pods,limitranges,namespaces,nodes,persistentvolumeclaims,persistentvolumes,resourcequotas,services,statefulsets

This setup resulted in a kube-state-metrics that could be run stably with 5-6 CPU and 8-10GB of RAM.

One of our teams started an additional 900 pods, after which we were unable to stabilize kube-state-metrics even with 30GB+ of memory; it just kept being OOMKilled.

We broke our kube-state-metrics into an instance per namespace and are now running around 33 instances of kube-state-metrics each watching a single namespace. The resulting config brought the resource usage down to 0.5 CPU and around 1.5GB of RAM for all 33 instances in total monitoring the same cluster.
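
A sketch of how the per-namespace arguments for such a setup could be generated. The --collectors flag is mentioned elsewhere in this thread; the --namespace flag name is my assumption based on the 1.x releases and should be checked against --help for the version in use. The namespace names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// instanceArgs returns the command-line arguments for one kube-state-metrics
// instance restricted to a single namespace and a set of collectors.
// ASSUMPTION: the --namespace flag name is taken from the 1.x releases as
// remembered; verify it with --help for the version in use.
func instanceArgs(namespace string, collectors []string) []string {
	return []string{
		"--namespace=" + namespace,
		"--collectors=" + strings.Join(collectors, ","),
	}
}

func main() {
	collectors := []string{"jobs", "pods", "deployments"}
	for _, ns := range []string{"team-a", "team-b"} { // hypothetical namespaces
		fmt.Println(ns, instanceArgs(ns, collectors))
	}
}
```

Cluster-scoped objects such as nodes are not tied to a namespace, so how those collectors should be assigned across instances would need to be checked separately.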

@andyxning
Member

andyxning commented Jun 29, 2018

The resulting config brought the resource usage down to 0.5 CPU and around 1.5GB of RAM for all 33 instances in total monitoring the same cluster.

This is an interesting result compared with the single kube-state-metrics scenario. It seems that a single kube-state-metrics instance cannot handle a large number of objects, or that there is something like a memory leak.

  • Which version do you use?
  • And, if it is possible, could you please try the latest master branch? We have changed the protocol format between client-go and the apiserver from JSON to protobuf in "prefer protobuf instead of just json format" (#475). This may make some difference. :)

@DewaldV

@DewaldV

DewaldV commented Jun 29, 2018

@andyxning We are running the 1.3.1 image from quay.io

Can give the latest master branch a try. I'll run an additional instance of kube-state metrics from latest without letting prometheus scrape from it (to avoid duplicate metrics) and see how it does. I'll also pull some graphs and numbers to show the memory/cpu usage for the different setups to compare.

@andyxning
Member

@DewaldV That's really cool!

@brancz
Member

brancz commented Jun 29, 2018

Note that scraping will make a difference as producing the /metrics output is significant with those numbers of objects.

@andyxning
Member

@DewaldV You will need another non-prod Prometheus to collect the metrics, or we need to make requests to the /metrics endpoint some other way.

@DewaldV

DewaldV commented Jul 2, 2018

@andyxning Will do, I'll spin up another Prometheus as well. I'll try to get these numbers for you later today.

@ehashman
Member

Just wanted to chime in that I have also encountered the same issue. We are scraping KSM 1.2.0 with Prometheus 2.x on Kubernetes 1.8.7.

We have two clusters: one with ~150 nodes and one with ~200 nodes. On the cluster with ~150 nodes, KSM reports (I'm only including resources with >500 count for brevity):

  • 25k jobs
  • 14.5k pods
  • 650 namespaces
  • 1.9k resourcequotas
  • 3.3k replicasets

Response size is 920k lines and 101M.

I set KSM's memory limit to 4GB but it still frequently exceeds this (and gets OOMKilled). It takes about 10 hours before it hits 4GB of memory usage.

I can see it spikes to 2.5 CPU cores used pretty often as well.

On our cluster with ~200 nodes, KSM frequently will time out on requests (we are scraping it every 30s). It uses even more resources there.

I'd like to upgrade to 1.3.1 but I've been running into certificate validation and authentication/RBAC issues... unclear if that will help with the resource utilization problem. I'd like to look into turning off or dropping any of the timeseries we are not using (e.g. jobs) as well as tuning the cluster's garbage collection, but I feel like that's not solving the underlying problem.

At minimum, can we upgrade the documentation guidelines on resource usage? I was definitely confused when the docs say to allocate 300MB of RAM and 0.150 CPU cores where in reality I need >3GB RAM and 3 cores.

@andyxning
Member

andyxning commented Jul 31, 2018

@ehashman Thanks for the feedback.

At minimum, can we upgrade the documentation guidelines on resource usage? I was definitely confused when the docs say to allocate 300MB of RAM and 0.150 CPU cores where in reality I need >3GB RAM and 3 cores.

The resource guidelines for KSM are based on a benchmark, which may not reflect real resource usage when a cluster has about 150~200 nodes. Guidelines are also hard to give in general because cluster load differs.

The guidelines should be updated.

@brancz
Member

brancz commented Jul 31, 2018

@andyxning I felt like we had a PR pending that adds a note that kube-state-metrics actually scales with the number of objects as opposed to number of nodes, but it gives some indication.

@ehashman you can already turn off collectors using the --collectors flag (or rather whitelist the ones you want to use). kube-state-metrics will offload the lack of resources (cpu/memory) onto the other resource, meaning when there is cpu pressure memory consumption will grow. I recommend trying to run kube-state-metrics without any resource limits or requests and see what it ends up using. We definitely want to run new scalability tests, we will do this along with #498.

@andyxning
Member

andyxning commented Aug 2, 2018

I felt like we had a PR pending that adds a note that kube-state-metrics actually scales with the number of objects as opposed to number of nodes, but it gives some indication.

This has been merged in #490 as part of describing the pod nanny usage.

@mrsiano

mrsiano commented Oct 3, 2018

@brancz @smarterclayton is protobuf already implemented?
Do we have any benchmark results showing how much better it is?

@mrsiano

mrsiano commented Oct 3, 2018

@smarterclayton @brancz another thing: we might have hit this one as well: https://bugzilla.redhat.com/show_bug.cgi?id=1426009

@andyxning
Member

andyxning commented Oct 8, 2018

is protobuf already implemented?
Do we have any benchmark results showing how much better it is?

@mrsiano Yes, pb support has been added in #475. It is available after 1.4.0. Could you please give it a try and run some benchmarks?

@ehashman
Member

As a follow-up to my earlier comment, just wanted to share the results of my KSM upgrade from 1.4.0 to 1.5.0-beta.0 in one of our aforementioned clusters with 200 nodes:

[Image: ksm5]

As you can see, CPU utilization and memory usage have dropped dramatically. Network utilization has increased as I am no longer gzipping responses. With this upgrade, the documented benchmarks for resource utilization appear to be accurate and wouldn't need to be updated 🎉

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 20, 2019
@ehashman
Member

/close

@k8s-ci-robot
Contributor

@ehashman: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
