[Metricbeat] Possible memory leak with autodiscover #33307
Thank you @eedugon, could you please provide some details on the number of pods, cronjobs, etc. running on the cluster, just so we can simulate the same environment? Additionally, it would be really helpful: …
I'm going to have a look.
This would help us mark out the various layers and isolate the problem in Beats' lib. Also, did you try to get any heap profiles with pprof?
I have been running experiments trying to reproduce the issue. Here is what I'm using for reference:

**Script to collect heap profiles**

```bash
#!/bin/bash
sleepTime=$((60*5))
for i in $(seq 1 1 100)
do
  echo "Getting heap for $i time"
  go tool pprof -png http://localhost:5066/debug/pprof/heap > heap${i}.png
  sleep $sleepTime
done
```

**Metricbeat's config**

```yaml
http.enabled: true
http.port: 5066
http.host: 0.0.0.0
http.pprof.enabled: true
metricbeat:
  autodiscover:
    providers:
      - type: kubernetes
        scope: cluster
        node: kind-control-plane
        kube_config: /home/chrismark/.kube/config
        hints.enabled: true
```

The results I see do not provide any significant indication that there is a memory leak.

**Number of Pods on the cluster**

```console
kubectl get pods -A | wc
     11      82    1153
```

**Conclusions so far and next steps**
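As a side note, a possible way to compare two of the collected heap profiles directly, rather than eyeballing the PNG snapshots, would be pprof's diff mode. This is a hedged sketch assuming the same endpoint and port as the config above; the file names and interval are arbitrary:

```bash
# Grab two raw heap profiles some time apart from the pprof endpoint
# enabled in the Metricbeat config above.
curl -s http://localhost:5066/debug/pprof/heap -o heap_before.pb.gz
sleep $((60*30))
curl -s http://localhost:5066/debug/pprof/heap -o heap_after.pb.gz

# Report the difference in allocations between the two snapshots.
go tool pprof -top -diff_base=heap_before.pb.gz heap_after.pb.gz
```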
@gizas, I've updated the description to include some extra details that you asked for. @ChrsMark: I'm running the suggested tests and preparing a report of …
I agree that the number of pods, or the metadata in general, looks related to the issue. About …
Here you have the results of the performed tests. Green line: Metricbeat deployment with autodiscover, without hints and with a simple / fake conditional template. Here you have a report of periodic …
The memory usage of the Metricbeat with autodiscover looks very suspicious. Considering the cluster has …
Thanks for the details here @eedugon, great work! Regarding the results, by watching some random memory footprints I see that Metricbeat indeed consumes a lot of memory in the "Watch" section. See for example: … At the same time the … Taking into consideration that the memory footprints show an increasing memory usage in functions of the library, could it be that we consume the library in Metricbeat/Beats in a way that leads to the memory leak? At the same time the …

Update: For example, at https://github.com/gizas/clientgosimulator/blob/53d64231594344693e4c3724b9efbc03f1c67f0a/main.go#L58-L75 we don't actually manipulate the Object that is returned from …
@ChrsMark, @gizas: I updated the description of my tests, as the beat that is not showing the memory leak was actually a beat with autodiscover but pointing only to a namespace with a fixed set of ~40 pods (without …). That means that the memory leak is not …
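For reference, a namespace-scoped autodiscover provider of the kind described here might look roughly like this. This is a hedged sketch based on the configs shared earlier in the thread; the namespace name is a placeholder and the hints setting is an assumption:

```yaml
metricbeat:
  autodiscover:
    providers:
      - type: kubernetes
        # Watch only a single namespace with a fixed set of pods
        # ("my-namespace" is a placeholder).
        namespace: my-namespace
        hints.enabled: true
```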
The … It could be that the simulator's ones either are newer and more performant, so we could evaluate using them instead, or the …
So we synced with @ChrsMark and the plan is to make … @eedugon, we will re-issue a new image of clientgosimulator and we will have to repeat the tests in your cluster, hoping that this time the clientgosimulator will show the same behaviour as Beats and its memory will increase.
@eedugon what are the history limits of the Jobs/Cronjobs in your case?
I'm also running some tests on my end. I'm using Metricbeat 8.6.0-SNAPSHOT with a minimal autodiscover configuration:

```yaml
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
```

I'm deploying a significant amount of Pods on the cluster, but only some of them actually get into "Running" state, since my cluster is just a one-node kind cluster and the rest of them fail to get scheduled. In any case, this scenario still manages to trigger a memory usage increase in Metricbeat.

To deploy the Pods I'm using the stress-test script from https://github.com/elastic/k8s-integration-infra#put-load-on-the-cluster, running …

After some time, when the memory has stabilized again at a higher usage, I delete all the Pods, running …

As you can see, after the deletion of the Pods the memory comes back to normal levels, which indicates that there is no memory leak related to the increased amount of Pods. This makes me think that what we observe in the reported issue is having Metricbeat monitor an increasing number of Pods over time, which never goes back to normal. This is why I was asking about the history limits of the Cronjobs at #33307 (comment).

So @eedugon, can you clarify again what is the nature of the cluster you use and what is the number of Pods as time goes by? We are interested in seeing the number of Pods over time and not just the number of Pods at a specific point in the timeline. If the number of Pods is increasing, then it makes sense that we are getting more and more data, and this explains the memory usage increase. This is just a theory open for discussion, and at the same time I would be more than happy to have the proper details to replicate the issue on my side too :).

Sidenote: at the same time I'm running https://github.com/ChrsMark/k8sdiscovery on the cluster too, which also shows a (smaller) memory increase that comes back to normal after the mass deletion of the Pods.
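A hedged sketch of how the "number of Pods over time" could be recorded alongside the heap-profile script from earlier in the thread; the interval and output file are arbitrary choices, not something from the original comment:

```bash
#!/bin/bash
# Append a timestamped, cluster-wide pod count to a log file every 5 minutes.
while true
do
  echo "$(date -u +%FT%TZ) $(kubectl get pods -A --no-headers | wc -l)" >> pod-count.log
  sleep $((60*5))
done
```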
An update on this. It seems that having a more aggressive type of workload can replicate the issue. After running the following, as per @eedugon's suggestion, I can see the memory increase as well:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-XXX
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: hello
              image: busybox
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - date; echo Hello from the Kubernetes cluster; sleep 5; exit 0
          restartPolicy: OnFailure
```

and massively deploying those on the cluster using:

```bash
#!/bin/bash
TOTAL=200
for ((i = 1 ; i <= $TOTAL ; i++)); do
  echo "# creating cronjob $i"
  sed "s/XXX/$i/g" cronjob-teamplate.yml | kubectl apply -f -
done
```

After deploying these I can see a stable number of Pods over time, at around 800, but the memory usage of Beats becomes quite aggressive and hits the memory limit. At the same time, the memory stays at much lower levels when the cronjob metadata enrichment is disabled. This can be achieved by adding the following in the provider's config:

```yaml
add_resource_metadata:
  cronjob: false
```

Note that the memory profiles were indicating an increased memory usage in the JSON decoding functions, not necessarily coupled with any of the … Trying to isolate the issue more, I tried to introduce this functionality into the … As we can see, there is an increased memory usage for the … Despite this increase, though, we cannot claim that this is a memory leak, since it gets stable after some time.

So my assumption here is that even in Beats we don't have a memory leak, but we have an increased memory usage because we are executing API calls to k8s from several places for every update of any of the Pods. I would assume that even in Beats the situation would become stable after some point, but that would mean that the memory would stabilize at a high level indeed. So as next steps I would try to verify this observation in the rest of the reported environments to see if …
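To make it explicit where that workaround sits, here is a hedged sketch of a full autodiscover provider config, combining the minimal setup from earlier in this thread with the `add_resource_metadata` option; the node and hints settings are simply carried over from the example above:

```yaml
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      # Workaround discussed above: skip the cronjob metadata enrichment.
      add_resource_metadata:
        cronjob: false
```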
I have been investigating the root cause of this issue and I concluded that the query we do to fetch the Jobs' objects at https://github.com/elastic/elastic-agent-autodiscover/blob/768d34583d1f4e262ef94908ab7d5d726ae5e0e9/kubernetes/metadata/pod.go#L184 is the one that causes the memory increase. I performed some tests/simulations using https://github.com/ChrsMark/k8sdiscovery, and reducing the scope of the query to a namespace with fewer objects makes the memory usage decrease. So the issue appears when we perform the query against a namespace with many objects. Even initializing the clientset once, like … In this regard, I would summarize the next steps as follows: …
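To make the pattern discussed above concrete, here is a hedged, self-contained client-go sketch of looking up a single Job scoped to one namespace. This is only an illustration of the kind of query involved, not the actual elastic-agent-autodiscover code; the kubeconfig path, namespace, and Job name are placeholders:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build the clientset once and reuse it for every lookup.
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Fetch a single Job by name, scoped to one namespace, instead of
	// querying a namespace that contains many objects.
	job, err := clientset.BatchV1().Jobs("default").Get(context.Background(), "cronjob-1-12345678", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("job %s owned by: %+v\n", job.Name, job.OwnerReferences)
}
```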
Update to this issue:
I'm also trying with bigger nodes to determine if this is really a leak or a poor implementation.
I will update this report with the outcome of a test in a 32G node in a few days.
Thanks for the update @eedugon! From my perspective this behavior makes sense. See the code at … This issue can be closed and we can continue at elastic/elastic-agent-autodiscover#31. The specific issue has a known workaround, and we have narrowed down the scope of a generic OOM error to the specific code that brings this to the surface :).
Reviewing the above discussion, we can summarise that the high memory usage is due to the way we handle two specific resources (replicasets and cronjobs). All next actions can be handled here: elastic/elastic-agent-autodiscover#31. Agreed that the new fix is not only to disable those values but also to see if we can handle … properly. Can we close this bug as well, as it seems that we have finalised all investigation?
Setting these values did not resolve the memory leak issue for us.
@kbujold Can you try to set …
@yomduf I was setting these values in filebeat. I also tried completely removing add_kubernetes_metadata and cranking up the resources, and still see the memory gradually going up as time goes by. It went from 10% to 85% in 24 hours.
@kbujold I would suggest opening a separate SDH to track your issue; it is really important to have all the info for your setup. Also, what is your version of Beats? Removing add_kubernetes_metadata won't solve your issue, because the metadata enrichment happens by default after specific versions. You can try … A general example could be:

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          add_resource_metadata:
            deployment: false
            cronjob: false
```
@gizas We found the same issue in ELK 8.6.2 and ELK 8.7.1. Note that we need this metadata enrichment. We had tried the above configuration with no success in filebeat.
I have raised issue #35796 |
I have set up a Metricbeat 8.4.1 Deployment with hints-based autodiscover in scope: cluster mode. The beat is not handling any metric because I don't have any metrics-related annotation in any pod, so it shouldn't be doing much. The memory consumption of the beat grows and grows, and after a few days it reaches the 1.5 GB limit that I have configured; then it's OOMKilled and restarted.
The configuration of the Metricbeat is:
Environment details:
TODOS