[Metricbeat] Possible memory leak with autodiscover #33307

Closed
3 tasks done
eedugon opened this issue Oct 11, 2022 · 22 comments
Labels
Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team

Comments

@eedugon
Contributor

eedugon commented Oct 11, 2022

I have set up a Metricbeat 8.4.1 Deployment with hints-based autodiscover in scope: cluster mode.
The Beat is not collecting any metrics because no pod carries any metrics-related annotations, so it shouldn't be doing much.
Nevertheless, the Beat's memory consumption grows continuously; after a few days it reaches the 1.5 GB limit I have configured, and the pod is then OOMKilled and restarted.

[Image: Metricbeat memory usage growing over several days up to the configured limit]

   State:          Running
      Started:      Mon, 10 Oct 2022 15:01:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 07 Oct 2022 08:48:04 +0200
      Finished:     Mon, 10 Oct 2022 15:01:12 +0200

The Metricbeat configuration is:

    http.enabled: true
    http.port: 5066
    http.host: 0.0.0.0
    metricbeat:
      autodiscover:
        providers:
        - type: kubernetes
          scope: cluster
          node: ${NODE_NAME}
          hints.enabled: true
    processors:
    - add_cloud_metadata: {}
    - add_host_metadata: {}

Environment details:

  • K8s version 1.23.7, deployed by Rancher on VMs in Hetzner Cloud.
  • Cluster with 20 nodes and about 915 pods (most of them created by Jobs & CronJobs).
$ kubectl get pods -A | wc
     917    5704  118110

TODOs

  • It seems that the number of Pods is of importance here. We need to determine the environment details and have an easy way to reproduce this environment. Maybe we can make use of https://github.com/elastic/k8s-integration-infra
  • It was reported in the past that narrowing down the scope of the autodiscover provider solves the issue. @eedugon, we need to verify this here too :).
  • We need to replicate the issue with a smaller code snippet if we conclude that the issue lies in the k8s library. We can use https://github.com/gizas/clientgosimulator
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 11, 2022
@gizas gizas added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Oct 11, 2022
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 11, 2022
@gizas
Contributor

gizas commented Oct 11, 2022

Thank you @eedugon. Could you please provide details on the number of pods, CronJobs, etc. running on the cluster, so we can simulate the same environment?

Additionally, it would be really helpful to know:

  • K8s version
  • Cloud-managed or local setup?
  • Any sample CronJob manifests?

@ChrsMark
Member

ChrsMark commented Oct 12, 2022

I'm going to have a look.
@eedugon, since you already have the environment, could you try the same experiment but without hints? I would suggest trying two things:

  1. Define a simple template instead of hints
  2. Leave just the provider enabled, without hints or templates defined:
    metricbeat:
      autodiscover:
        providers:
        - type: kubernetes
          scope: cluster
          node: ${NODE_NAME}

This would help us rule out the various layers and isolate whether the problem lies in Beats' libraries.

Also, did you try to collect any heap profiles with pprof?

@ChrsMark
Member

ChrsMark commented Oct 12, 2022

I have been running experiments trying to reproduce the issue. Here is what I'm using for reference:

Script to collect heap profiles
#!/bin/bash
# Collect a PNG heap profile from Metricbeat's pprof endpoint every 5 minutes.
sleepTime=$((60*5))
for i in $(seq 1 1 100)
do
   echo "Getting heap for $i time"
   go tool pprof -png http://localhost:5066/debug/pprof/heap > heap${i}.png
   sleep $sleepTime
done
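
A hedged variant of the same idea, for comparing two snapshots directly with pprof's diff mode (the raw profiles are fetched with curl; file names and the sleep interval are illustrative):

# Save two raw heap profiles some time apart and diff them.
curl -s http://localhost:5066/debug/pprof/heap -o heap_early.pb.gz
sleep 3600
curl -s http://localhost:5066/debug/pprof/heap -o heap_late.pb.gz
go tool pprof -top -diff_base=heap_early.pb.gz heap_late.pb.gz
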
Metricbeat's config
http.enabled: true
http.port: 5066
http.host: 0.0.0.0
http.pprof.enabled: true
metricbeat:
  autodiscover:
    providers:
      - type: kubernetes
        scope: cluster
        node: kind-control-plane
        kube_config: /home/chrismark/.kube/config
        hints.enabled: true

The results I see do not provide any significant indication that there is a memory leak:

simpleConfigHints.tar.gz

Number of Pods on the cluster
kubectl get pods -A | wc
     11      82    1153

Conclusions so far and next steps

  1. It seems that the number of Pods is of importance here.
  2. @eedugon I would like to verify this with you, and I would be interested in collecting the heap profiles in your reproduction environment.
  3. It was reported in the past that narrowing down the scope of the autodiscover provider solves the issue. @eedugon, we need to verify this with your env too :). If that's the case, it means that the number of Pods is a significant factor here.
  4. We need to replicate the issue with a smaller code snippet if we conclude that the issue lies in the k8s library (i.e. https://github.com/gizas/clientgosimulator).

@eedugon
Contributor Author

eedugon commented Oct 13, 2022

@gizas, I've updated the description to include the extra details you asked for.

@ChrsMark: I'm running the suggested tests and preparing a report of pprof profiles. Note that there's no way to remove hints and conditional templates completely; we need at least one of them for the provider to run, otherwise the Beat won't start. I'm now running the tests against a Deployment configured without hints and with a simple conditional template:

    metricbeat:
      autodiscover:
        providers:
        - type: kubernetes
          scope: cluster
          node: ${NODE_NAME}
          hints.enabled: false
          templates:
            - condition:
                contains:
                  kubernetes.labels.app: "fake-label"
              config:
                - module: redis
                  hosts: "${data.host}:6379"

I agree that the number of pods, or the metadata in general, looks related to the issue.

Regarding narrowing down the targeted pods as a fix, I will explore further, but that probably just means that if autodiscover watches only a single namespace with a few resources, the problem may not become visible (although the memory leak could still exist there). I need to run more tests on that and will share the results here if I find anything relevant.

@eedugon
Contributor Author

eedugon commented Oct 17, 2022

Here are the results of the tests performed.
The following graph shows the last 4 days of 3 running pods:

[Image: 4-day memory usage comparison of the three pods described below]

Green line: Metricbeat Deployment with autodiscover, without hints, and with a simple/fake conditional template.
Blue line: Metricbeat Deployment with autodiscover but pointing to a namespace with a fixed set of pods (around 40 pods, no CronJobs there); it doesn't show any memory leak. (Updated: this was originally described as a deployment without autodiscover and with a simple static input.)
Red line: the client-go simulator shared by @gizas. Its memory usage looks pretty stable.

Here is a set of periodic PNG pprof reports:

The memory usage of the Metricbeat with autodiscover looks very suspicious. Considering that the cluster has CronJobs that periodically create a huge number of new pods, my suspicion is that the Beat is probably not properly releasing from memory the information about pods/containers that no longer exist, hence the memory usage keeps increasing. That would explain the problem.

@ChrsMark
Member

ChrsMark commented Oct 17, 2022

Thanks for the details here @eedugon, great work!

Regarding the results, looking at some random memory footprints I see that Metricbeat indeed consumes a lot of memory in the "Watch" section. See for example:
[Image: heap profile #24, showing high memory usage under the Watch functions]

At the same time the clientgosimulator execution does not show a similar behavior.

Taking into consideration that the memory footprints show increasing memory usage in functions of the library, could it be that we consume the library in Metricbeat/Beats in a way that leads to the memory leak? At the same time the clientgosimulator does not seem to behave similarly, so @gizas, since you spent some time on the clientgosimulator (https://github.com/gizas/clientgosimulator/blob/main/main.go), do you see this as a possible scenario? For example, in Beats at https://github.com/elastic/elastic-agent-autodiscover/blob/main/kubernetes/watcher.go#L110-L116 we consume some fields/objects like Store; is this something that we don't do in clientgosimulator?

Update:

For example, at https://github.com/gizas/clientgosimulator/blob/53d64231594344693e4c3724b9efbc03f1c67f0a/main.go#L58-L75 we don't actually manipulate the Object that is returned from queue.Get(), while at https://github.com/elastic/elastic-agent-autodiscover/blob/main/kubernetes/watcher.go#L222-L240 we do. In Beats we actually make use of the store (https://github.com/elastic/elastic-agent-autodiscover/blob/main/kubernetes/watcher.go#L222-L240), while in clientgosimulator it seems that we don't. @gizas, any thoughts on this assumption? Shall we try to make clientgosimulator look identical to the implementation of https://github.com/elastic/elastic-agent-autodiscover/blob/main/kubernetes/watcher.go?
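
To make the comparison concrete, here is a minimal, hedged sketch (not the actual Beats or clientgosimulator code; the kubeconfig path is illustrative) of the informer + workqueue pattern where the handler reads the object back from the informer's store, similar to what watcher.go does:

package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Build a client from a kubeconfig file (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Shared informer for Pods plus a workqueue, mirroring the pattern above.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	store := podInformer.GetStore()
	queue := workqueue.New()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	go podInformer.Run(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)

	for {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		// Unlike the simpler simulator loop, read the object back from the
		// informer's store and touch its fields, as watcher.go does.
		if obj, exists, err := store.GetByKey(key.(string)); err == nil && exists {
			pod := obj.(*v1.Pod)
			fmt.Println("pod update:", pod.Namespace+"/"+pod.Name, pod.Status.Phase)
		}
		queue.Done(key)
	}
}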

@eedugon
Contributor Author

eedugon commented Oct 17, 2022

@ChrsMark, @gizas: I updated the description of my tests, since the Beat that is not showing the memory leak was actually a Beat with autodiscover but pointing only to a namespace with a fixed set of ~40 pods (without CronJobs).

That means the memory leak is not caused by autodiscover alone. Depending on the pods received (and probably their nature), we handle them in a way that ends up causing this leak. In my opinion everything points to CronJobs and the nature of the pods they create. My environment has 110 CronJobs that create hundreds of pods every day, and my high-level suspicion is that we might not be releasing them from the internal registry of objects we keep track of.

@ChrsMark
Member

The clientgosimulator uses some different structs/implementations compared to those of Beats: https://github.com/elastic/elastic-agent-autodiscover/blob/main/kubernetes/watcher.go#L87-L89 VS https://github.com/gizas/clientgosimulator/blob/53d64231594344693e4c3724b9efbc03f1c67f0a/main.go#L42-L44

It could be that the simulator's are newer and more performant, in which case we could evaluate using them instead, or that the clientgosimulator does not perform some k8s-library-related actions that Beats does, and hence we don't see the leak there. We need to dive deeper into this and figure out where the problem lies.

@gizas
Contributor

gizas commented Oct 17, 2022

So we synced with @ChrsMark, and the plan is to make clientgosimulator use the same structs as Beats and retest. Thank you @ChrsMark for the comparison and the digging.

@eedugon, we will publish a new image of clientgosimulator and will have to repeat the tests in your cluster, hoping that this time the clientgosimulator will show the same behaviour as Beats and its memory will increase.

@ChrsMark
Member

@eedugon what are the history limits of the Jobs/Cronjobs in your case?
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#jobs-history-limits
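
For reference, these are the fields in question; a minimal sketch with illustrative values (the Kubernetes defaults are 3 retained successful Jobs and 1 failed Job):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: hello
              image: busybox
              command: ["/bin/sh", "-c", "date"]
          restartPolicy: OnFailure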

@ChrsMark
Member

ChrsMark commented Oct 19, 2022

I'm also running some tests on my end. I'm using Metricbeat 8.6.0-SNAPSHOT with a minimal autodiscovery configuration:

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true

I'm deploying a significant number of Pods on the cluster, but only some of them actually get into the "running" state, since my cluster is just a one-node kind cluster and the rest fail to get scheduled. In any case, this scenario still manages to trigger a memory usage increase in Metricbeat. To deploy the Pods I'm using the stress-test script from https://github.com/elastic/k8s-integration-infra#put-load-on-the-cluster, running ./stress_test_k8s --kubeconfig=/home/chrismark/.kube/config --deployments=100 --namespaces=10 --podlabels=4 --podannotations=4.

After some time, once the memory has stabilized again at a higher level, I delete all the Pods by running ./stress_test_k8s --delete.

Here is what I capture:
[Image: memory usage rising under load and returning to normal after the Pods are deleted]

As you can see, after the deletion of the Pods the memory comes back to normal levels, which indicates that there is no memory leak tied simply to an increased number of Pods.

This makes me think that what we observe in the reported issue is Metricbeat monitoring an ever-increasing number of Pods over time, which never goes back to normal. This is why I was asking about the history limits of the CronJobs at #33307 (comment).

So @eedugon, can you clarify again what the nature of the cluster you use is and how the number of Pods evolves over time? We are interested in the number of Pods over time, not just the number of Pods at a specific point in the timeline. If the number of Pods keeps increasing, then it makes sense that we are getting more and more data, and that would explain the memory usage increase.

This is just a theory open for discussion; at the same time, I would be more than happy to get the details needed to replicate the issue on my side too :).

Sidenote: I'm also running https://github.com/ChrsMark/k8sdiscovery on the cluster, which shows a (smaller) memory increase that also comes back to normal after the mass deletion of the Pods.

@ChrsMark
Member

An update on this. It seems that a more aggressive type of workload can replicate the issue. After running the following, as per @eedugon's suggestion, I can see the memory increase as well:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-XXX
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: hello
              image: busybox
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - date; echo Hello from the Kubernetes cluster; sleep 5; exit 0
          restartPolicy: OnFailure

and massively deploy those on the cluster using:

#!/bin/bash
# Create TOTAL copies of the CronJob above, replacing the XXX placeholder
# with the index of each copy.

TOTAL=200

for ((i = 1 ; i <= $TOTAL ; i++)); do
  echo "# creating cronjob $i"
  sed "s/XXX/$i/g" cronjob-template.yml | kubectl apply -f -
done
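
Optional helpers (illustrative, not part of the original test) to watch the effect of the load: total pod count across namespaces and per-container memory of the Metricbeat Pod (the label selector is an assumption; kubectl top requires metrics-server):

watch -n 30 'kubectl get pods -A --no-headers | wc -l'
kubectl top pod -n kube-system -l k8s-app=metricbeat --containers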

After deploying these, I see a stable number of Pods over time at around 800, but the memory usage of Beats grows quite aggressively and hits the memory limit:

[Image: Metricbeat memory usage climbing until it reaches the memory limit (OOM)]

At the same time, the k8sdiscovery Pod is quite stable, as we can notice.
After struggling a bit, looking into the code again and trying to understand what causes the memory usage increase in Beats, I found that if the piece of code at https://github.com/elastic/elastic-agent-autodiscover/blob/main/kubernetes/metadata/pod.go#L180-L197 is disabled, the issue disappears:

[Image: Metricbeat memory usage staying flat with the cronjob metadata enrichment disabled]

This can be achieved by adding the following in the provider's config:

add_resource_metadata:
  cronjob: false
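
In context, and based on the configs already shown in this thread, the provider section would then look roughly like this (a sketch, not a verified full config):

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      scope: cluster
      node: ${NODE_NAME}
      hints.enabled: true
      add_resource_metadata:
        cronjob: false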

Note that the memory profiles were indicating increased memory usage in the JSON decoding functions, not necessarily coupled with any of the watch actions. This explains the finding here.

Trying to isolate the issue further, I introduced this functionality into the k8sdiscovery simulator. Using a new image docker.io/chrismark/k8sdiscovery:v0.0.3-rc2 built after the following commit gives some interesting results:
ChrsMark/k8sdiscovery@fd8a8c9
[Image: increased memory usage for the k8sdiscovery Pod after adding the Job/CronJob API lookups]

As we can see, there is increased memory usage for the k8sdiscovery Pod, which is clearly related to the latest addition that gets the Job's and CronJob's names by executing an API call to the k8s API.

Despite this increase, though, we cannot claim that this is a memory leak either, since it gets stable after some time. So my assumption here is that even in Beats we don't have a memory leak, but rather increased memory usage, because we are executing API calls to k8s from several places for every update of any of the Pods. I would assume that even in Beats the situation would become stable at some point, but that would mean the memory stabilizes at a high level indeed.

So as next steps, I would try to verify this observation in the rest of the reported environments to see if add_resource_metadata.cronjob: false fixes the issue, and then we can invest some time to investigate whether the GetCronjObfJob call can be optimized, maybe through a watcher/informer.

@eedugon @gizas let me know what you think.

@ChrsMark
Member

ChrsMark commented Oct 25, 2022

I have been investigating the root cause of this issue and concluded that the query we do to fetch the Job objects at https://github.com/elastic/elastic-agent-autodiscover/blob/768d34583d1f4e262ef94908ab7d5d726ae5e0e9/kubernetes/metadata/pod.go#L184 is the one that causes the memory increase.

I performed some tests/simulations using https://github.com/ChrsMark/k8sdiscovery, and reducing the scope of the query to a namespace with fewer objects makes the memory usage decrease. So the issue appears when we perform the query against a namespace with many objects.

Even when initializing the clientset once, like client.BatchV1().Jobs("default"), and re-using this object to execute the Get requests, I observe the increased memory again. For reference, default is the namespace where the massive number of Jobs/Pods run.
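
For context, a minimal, hedged sketch of the kind of per-update lookup being described (not the library's actual code; kubeconfig path, namespace and Job name are illustrative):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Initialized once and re-used, as described above; the Get call still
	// deserializes a full Job object on every Pod update.
	jobsClient := client.BatchV1().Jobs("default")

	job, err := jobsClient.Get(context.TODO(), "hello-1234", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Read the owning CronJob, which is what the metadata enrichment needs.
	for _, ref := range job.OwnerReferences {
		if ref.Kind == "CronJob" {
			fmt.Println("owning CronJob:", ref.Name)
		}
	}
}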

In this regard, I would summarize the next steps as follows:

  1. Check if add_resource_metadata.cronjob: false solves the issue in the reported environments. cc: @eedugon
  2. Open a specific issue against https://github.com/elastic/elastic-agent-autodiscover to evaluate the options we have to improve the situation around this issue. (update: opened -> add_resource_metadata.cronjob overloads the memory usage elastic-agent-autodiscover#31)
    a. Consider making add_resource_metadata.cronjob: false the default, since we know it's an "expensive" feature.
    b. Document the nature of this setting and its implications.
    c. Consider improving the way we retrieve the objects, or get the "owner" name by trimming the suffix of the Object's name, i.e. hello (the name of the cronjob) out of hello-1234 (the name of the job); see the sketch below.
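
A hedged sketch of option (c), deriving the owner name by trimming the Job name's suffix; this is an illustration only, not the library's implementation:

package main

import (
	"fmt"
	"strings"
)

// cronJobNameFromJobName trims the trailing "-<suffix>" that the CronJob
// controller appends to the Jobs it creates (e.g. "hello-1234" -> "hello").
func cronJobNameFromJobName(jobName string) string {
	i := strings.LastIndex(jobName, "-")
	if i <= 0 {
		return jobName
	}
	return jobName[:i]
}

func main() {
	fmt.Println(cronJobNameFromJobName("hello-1234")) // prints "hello"
}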

@eedugon
Contributor Author

eedugon commented Oct 31, 2022

Update to this issue:

  • add_resource_metadata.cronjob: false solves the issue completely in my environment, which has only simple CronJobs (CronJobs that create single pods, without parallelization).
  • I've seen a user environment where add_resource_metadata.cronjob: false doesn't solve the issue and add_resource_metadata.deployment: false is needed too. My theory is that the CronJobs of that other environment also create ReplicaSets associated with the Jobs for parallel execution (Jobs can run a single pod or multiple pods in parallel). See the sketch below for both settings combined.
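
For reference, a sketch (based on the configs already shown in this thread) with both settings disabled in the provider:

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      scope: cluster
      node: ${NODE_NAME}
      hints.enabled: true
      add_resource_metadata:
        cronjob: false
        deployment: false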

I'm also trying with bigger nodes to determine if this is really a leak or a poor implementation.

  • In a GKE environment with 16 GB nodes running 100 CronJobs, Metricbeat ends up eating all the available memory of the nodes:

[Screenshot, 2022-10-31: Metricbeat consuming all available node memory]

I will update this report with the outcome of a test in a 32G node in a few days.

@ChrsMark
Member

Thanks for the update @eedugon! From my perspective this behavior makes sense.
The two settings we are focusing on have essentially the same characteristics but apply to different objects: one is for Jobs/CronJobs and the other for ReplicaSets/Deployments.

See the code at
https://github.com/elastic/elastic-agent-autodiscover/blob/768d34583d1f4e262ef94908ab7d5d726ae5e0e9/kubernetes/metadata/pod.go#L157-L197

This issue can be closed and we can continue at elastic/elastic-agent-autodiscover#31. The specific issue has a known workaround, and we have narrowed the scope from a generic OOM error down to the specific code that brings it to the surface :).

@gizas
Contributor

gizas commented Nov 1, 2022

Reviewing the above discussion, we can summarise that the high memory usage is due to the way we handle two specific resources (ReplicaSets and CronJobs).

All next actions can be handled here: elastic/elastic-agent-autodiscover#31

Agreed that the proper fix is not only to disable those settings but also to see whether we can handle those resources properly.

Can we close this bug, as it seems that we have finalised the investigation?

@kbujold

kbujold commented May 9, 2023

Setting these values did not resolve the memory leak issue for us:

processors:
- add_kubernetes_metadata:
    add_resource_metadata:
      cronjob: false
      deployment: false

@yomduf

yomduf commented May 9, 2023

@kbujold Can you try setting add_metadata to false on the DaemonSet and on the Deployment in the kubernetes module?
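
For reference, a hedged sketch of what that suggestion could look like in the Metricbeat kubernetes module configuration (metricsets, period and hosts are illustrative, not taken from the reporter's setup):

- module: kubernetes
  metricsets:
    - pod
    - container
  period: 10s
  hosts: ["https://${NODE_NAME}:10250"]
  add_metadata: false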

@kbujold

kbujold commented May 25, 2023

@yomduf I was setting these values in Filebeat. I also tried completely removing add_kubernetes_metadata and cranking up the resources, and I still see the memory gradually going up as time goes by. It went from 10% to 85% in 24 hours.

@gizas
Contributor

gizas commented May 25, 2023

@kbujold I would suggest opening a separate SDH to track your issue.

It is really important to have all the info about your setup. Also, what version of Beats are you running?

Removing add_kubernetes_metadata won't solve your issue, because the metadata enrichment happens by default as of certain versions.

You can try add_metadata: false to see if your issue is fixed.

A general example could be:

filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          add_resource_metadata:
            deployment: false
            cronjob: false

@kbujold

kbujold commented May 25, 2023

@gizas We found the same issue in ELK 8.6.2 and ELK 8.7.1. Note that we need this metadata enrichment. We had tried the above configuration in Filebeat with no success.

@kbujold

kbujold commented Jun 16, 2023

I have raised issue #35796
