kube-state-metrics consuming too much memory #257
Comments
What does "very" slow mean? Up to 10 seconds response time wouldn't be unusual for a huge request. For every Job object there are at least 12 metrics being reported, plus 20 metrics for each of the pods created by those Job objects: 2700 * 12 = 32400, plus 2700 * 20 = 54000, for a total of 86400. With a minimum of 86400 lines of metrics per HTTP request, those numbers actually don't seem too unreasonable, although we are aware of some inefficiencies in the Prometheus Go implementation that primarily drive these numbers up. It might be worth checking, though, that you're not running into the same problem as reported here: #112 (comment) |
thanks so much @brancz! "very slow' means that kube-state-metrics' can you say offhand what some of the inefficiencies of the Prometheus Go implementation are? That could help us debug. |
The inefficiencies are that there are a number of allocations that could be optimized, but that wouldn't explain why the HTTP requests don't respond at all. The scalability tests (#124 (comment)) that Google ran had 1000 nodes and 30000 pods and responded within 9s, using ~1GB of memory and 0.3 cores. The number of metrics for those tests should be far higher than in this case, so I feel it might actually be something in the Job collector. The memory usage probably just ends up showing in the Prometheus client code because we're still creating metrics, so memory profiles of alloc_space and inuse_space would be helpful. Could you take those with go tool pprof and share the bundles that it drops in $HOME/pprof? If you analyze it even better 😉. |
Are there a lot of Completed pods in the cluster? Deployments usually don't
leave those around but I think Job does, or did back when I experimented
with it.
kube-state-metrics essentially holds the whole cluster state in memory, so
I can imagine that while these pods don't "do" anything they bring it to
its knees.
…On Fri, Sep 15, 2017, 18:32 Frederic Branczyk ***@***.***> wrote:
The inefficiencies are that there are a number of allocations that could
be optimized, but it wouldn't explain why the HTTP requests don't respond
at all, the scalability tests#124 (comment)
<#124 (comment)>
that Google ran had 1000 nodes and 30000 pods and responded within 9s, used
~1Gb memory and 0.3 cores. The number of metrics for those tests should be
far more than in this case, so I feel it might actually be something in the
Job collector. The memory usage is probably just end up showing in the
Prometheus client code as we're still creating metrics, so memory profiles
of alloc_space and inuse_space would be helpful. Could you take those
with go tool pprof and share the bundles that it drops in $HOME/pprof? If
you analyze it even better 😉.
|
Yep that's what I linked to earlier, it's my suspicion as well. |
No, I double checked and there are only 200 completed pods in the cluster (we configure kube to garbage collect terminated pods to avoid exactly this problem) |
@julia-stripe Can you paste the log showing how many Job and pod objects were scraped, as implemented in #254? We need to confirm the actual number of objects kube-state-metrics scraped. |
Yes what @andyxning mentioned would indeed be helpful, and then the full memory profiles for further analysis 🙂 . |
I have attached profiles for
[1]: pprof.com_github_kubernetes_kube_state_metrics.static.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz |
That's definitely a lot, could you give us the number of lines in the metric response? It will tell us the number of time-series this is producing, which should be interesting given the response time and memory usage. |
@jac-stripe BTW, could you please paste the log showing the number of collected objects in kube-state-metrics, in addition to the kubectl output? That logging was added in #254, mainly to debug problems like this one. |
@brancz the response is 16172 lines, totalling 2226kb.
|
A couple of obvious ways to improve:
|
@smarterclayton thanks for sharing your data and information! What is the response time for the call to the
The first two attempts I can definitely see the returns, the third option seems like a nice to have, but a lot of work at this point with unknown result. |
@jac-stripe a big chunk of memory in use and allocations seem to be coming from the |
You can pass multiple Accept headers and everything should just work for types that don't have protobuf like CRD or custom API extensions. I.e.
Basically transform the object coming in on the watch call into something simpler in the https://github.com/openshift/origin/blob/master/pkg/image/controller/trigger/cache.go#L50 takes arbitrary kube objects and uses an adapter to turn them into a
Right - I'd call this maintaining an index via the reflector rather than maintaining a cache store. So ListWatch, instead of returning the stripped down objects, actually returns a list of metrics for that object. You can wrap both the list and the watch. We've talked about doing that transform lower down - so you'd be able to pass a cache.ListWatch a Transformer that takes arbitrary valid objects and turns them into what goes into the store, but we haven't done it yet. |
I created #264 to start using protobuf. That should be a quick win, but eventually all of the above are improvements we probably want to make. I'm still having difficulty understanding your second point though, probably because I don't know protobuf well enough, or the way it's used in Kubernetes. I'm wondering how parsing a subset of the fields of a protobuf message works; what would the proto definitions look like? And that doesn't influence the size of the data transferred on the wire, right? |
Sorry for not following up on all the comments above, but a quick question: IIUC, protobuf can only decrease the size of the data transferred between the apiserver and kube-state-metrics, so the sync time will decrease. The actual memory used by kube-state-metrics should not decrease. Correct me if I am wrong. |
Agreed, nonetheless an improvement we should have. |
Yep, let's first make the available improvements. |
If protobuf supported functionality like GraphQL, we could request only the fields we use, and the response size would be reduced considerably. |
You're using an informer right? Informers are an in memory cache of the
full object. It's possible to transform what you get from the API server
on your own into a smaller in memory representation that can be queried
later, or simply do the transformation for each event and then sum later.
Kubernetes does not support partial field retrieval like you are describing
(and likely never will from the server). Use versioned apis and protobuf
to fetch the object, then keep only the fields you need. We may in the
future add protobuf "slice" decoding that lets you skip whole tracts of the
object but not in the near term.
…On Thu, Sep 21, 2017 at 11:33 AM, Ning Xie ***@***.***> wrote:
And that doesn't influence the data size transferred on wire right?
If protobuf supports functionality like graphsql
<https://dev-blog.apollodata.com/graphql-vs-rest-5d425123e34b>, we can
request only used fields and definitely the response size should be
decreased largely.
|
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or |
This still has to be addressed in a better way than we do today. /remove-lifecycle stale |
I have one that is taking upwards of 12Gi of memory.
$ kubectl get pods --all-namespaces -a | wc -l
227
$ kubectl get jobs --all-namespaces -a | wc -l
381
$ kubectl get cronjobs --all-namespaces -a | wc -l
116
I am currently on this version
It tends to spike overnight when I am running a lot of kubernetes jobs (~100). Then it settles down after those jobs are done, but it spikes to a very high memory usage. I am probably doing something incorrectly here, but just wanted to see if there are others having this issue. |
@jac-stripe / @julia-stripe: Can you try 1.3.1? I just deployed it and, as suspected, it looks like the Go version upgrade fixed it. |
@discordianfish Version 1.3.1 doesn't solve the context deadline exceeded or OOMKilled issues; the resource parameters still need to be adjusted.
For me, it started working with memory set to 2800Mi and CPU set to 2100m. |
@gades Just double checked over here and since I've upgraded to 1.3.1 my memory usage is <400MB and the scrape duration <2s, usually <0.5s. |
On the topic of Memory consumption, we've been battling with runaway memory consumption of kube-state-metrics on one of our clusters. This particular cluster has around 3730 running pods and 28160 total objects (quick line count of get all --all-namespaces) across 44 nodes. We've been running a single instance of kube-state-metrics in the kube-system namespace with the following collectors setup:
This setup resulted in a kube-state-metrics that could be run stably with 5-6 CPU and 8-10GB of RAM. One of our teams started an additional 900 pods, which resulted in us being unable to stabilize kube-state-metrics even with 30GB+ of memory; it just kept being OOMKilled. We broke our kube-state-metrics into an instance per namespace and are now running around 33 instances of kube-state-metrics, each watching a single namespace. The resulting config brought the resource usage down to 0.5 CPU and around 1.5GB of RAM for all 33 instances in total, monitoring the same cluster. |
This is an interesting result compared with the single kube-state-metrics scenario. It seems that a single kube-state-metrics instance cannot handle a large number of objects, or that there is something like a memory leak.
|
@andyxning We are running the 1.3.1 image from quay.io. I can give the latest master branch a try. I'll run an additional instance of kube-state-metrics from latest without letting Prometheus scrape from it (to avoid duplicate metrics) and see how it does. I'll also pull some graphs and numbers to show the memory/CPU usage for the different setups to compare. |
@DewaldV That's really cool! |
Note that scraping will make a difference as producing the |
@DewaldV Another non-prod Prometheus is needed to collect the metrics or we need to make request to |
@andyxning Will do, I'll spin up another Prometheus as well. I'll try to get these numbers later today for you. |
Just wanted to chime in that I have also encountered the same issue. We are scraping KSM 1.2.0 with Prometheus 2.x on Kubernetes 1.8.7. We have two clusters: one with ~150 nodes and one with ~200 nodes. On the cluster with ~150 nodes, KSM reports (I'm only including resources with >500 count for brevity):
Response size is 920k lines and 101M. I set KSM's memory limit to 4GB but it still frequently exceeds this (and gets OOMKilled). It takes about 10 hours before it hits 4GB of memory usage. I can see it spikes to 2.5 CPU cores used pretty often as well. On our cluster with ~200 nodes, KSM frequently will time out on requests (we are scraping it every 30s). It uses even more resources there. I'd like to upgrade to 1.3.1 but I've been running into certificate validation and authentication/RBAC issues... unclear if that will help with the resource utilization problem. I'd like to look into turning off or dropping any of the timeseries we are not using (e.g. jobs) as well as tuning the cluster's garbage collection, but I feel like that's not solving the underlying problem. At minimum, can we upgrade the documentation guidelines on resource usage? I was definitely confused when the docs say to allocate 300MB of RAM and 0.150 CPU cores where in reality I need >3GB RAM and 3 cores. |
@ehashman Thanks for the feedback.
The guidelines for setting resource usage for KSM are loosely based on a benchmark, which may not reflect real resource usage for a cluster of about 150~200 nodes. Resource usage guidelines are not easy to give, because cluster load varies. The guidelines should be updated. |
@andyxning I felt like we had a PR pending that adds a note that kube-state-metrics actually scales with the number of objects as opposed to number of nodes, but it gives some indication. @ehashman you can already turn off collectors using the |
This has been merged in #490 as part of describing the pod nanny usage. |
@brancz @smarterclayton is protobuf support already implemented? |
@smarterclayton @brancz another thing: we might be facing this one as well: https://bugzilla.redhat.com/show_bug.cgi?id=1426009 |
As a follow-up to my earlier comment, just wanted to share the results of my KSM upgrade from 1.4.0 to 1.5.0-beta.0 in one of our aforementioned clusters with 200 nodes: As you can see, CPU utilization and memory usage have dropped dramatically. Network utilization has increased as I am no longer gzipping responses. With this upgrade, the documented benchmarks for resource utilization appear to be accurate and wouldn't need to be updated 🎉 |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/close |
@ehashman: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
kube-state-metrics is using >400mb of RAM. It is also very slow when I query /metrics. The kubernetes cluster has 2700 job objects. It seems surprising that this would consume 400mb of RAM for metrics aggregation. Below is a pprof top trace. This is running the latest git revision (d316c01).