Potential memory leak issue with filebeat and metricbeat #35796
Initially I would advise adding the add_resource_metadata block inside the autodiscover configuration, e.g.:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      host: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: container
        close_renamed: true
        paths:
          - /var/log/containers/*-${data.kubernetes.container.id}.log
      add_resource_metadata:
        deployment: false
        cronjob: false
```

The metadata enrichment is enabled by default in autodiscover. Additionally, you will need to remove the following from the module's config:

```yaml
processors:
  - add_kubernetes_metadata:
      in_cluster: true
      host: ${NODE_NAME}
      ...
```

The add_kubernetes_metadata processor is redundant for the Kubernetes module, since it automatically adds the metadata by default. Note that this processor uses the same "watcher" library under the hood, and hence it could hit the same memory leak that is avoided by disabling add_resource_metadata. In other words, just make sure that add_resource_metadata.cronjob/deployment are disabled both in the Kubernetes module's config and in any add_kubernetes_metadata processor that is actually defined. Also, since you use the dedot configuration, have in mind the dedot settings described in https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover.html
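For reference, a minimal sketch of how the dedot flags can sit next to the cronjob/deployment switches under add_resource_metadata (values are illustrative; check the autodiscover docs linked above for your version):

```yaml
add_resource_metadata:
  deployment: false
  cronjob: false
  labels.dedot: true       # replace dots in label keys with underscores
  annotations.dedot: true  # same for annotation keys
```
|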
As suggested by @gizas, you also need to remove the processor from |
@gizas We still have the same OOM issue with the changes you recommended. Below is the configmap for filebeat
|
@kbujold thank you for the update. One thing I noticed is that you have both inputs and autodiscover configured:

```yaml
filebeat.inputs:
  - close_timeout: 5m
    enabled: true
    exclude_files:
      - ^/var/log/containers/
      - ^/var/log/pods/
    paths:
      - /var/log/*.log
      - /var/log/messages
      - /var/log/syslog
      - /var/log/**/*.log
    type: log
http.port: 5066
```

Can you please test the above and send us the manifest you use (and not the rendered output) to avoid any misalignment? |
@gizas If we remove the filebeat.inputs section, we would lose our host log monitoring. We are running filebeat to capture both container and host logs, and have been doing so for years. We started having issues with ELK 8.6.2. What is the recommended way to monitor both host and container logs if we are to remove filebeat.inputs? This is a hard requirement for us.
|
Thanks for the details @kbujold! (That helped me understand your needs.) I am doing some tests this morning with 8.8.1 and the filestream input instead of the log input. In general we advise using filestream, and I would also advise going with 8.8.1 (for fixes like this). I am testing with this configuration and my memory seems stable (of course I have a limited cluster and probably less traffic than you):

```yaml
filebeat.inputs:
  - type: filestream
    id: my-filebeat-input
    paths:
      - /var/log/*.log
      - /var/log/messages
      - /var/log/syslog
      - /var/log/**/*.log
    prospector.scanner.exclude_files: ['^/var/log/containers/', '^/var/log/pods/']
    fields:
      system:
        name: test
        uid: d1374af9-1234-aaaaa-bbb-974f1b033347
    fields_under_root: true

# To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/log/containers/*${data.kubernetes.container.id}.log
      add_resource_metadata:
        deployment: false
        cronjob: false
```

So let's make another test with the updated configuration and also upgrade to 8.8.1? |
hello @kbujold, is it generated by the autodiscover or by filebeat.inputs? Can you use the suggested config at #35796 (comment), but alternately disable the autodiscover or the filebeat.inputs section, and tell us if there is a possible memory leak in either of those cases? |
Removing filebeat.autodiscover and metricbeat.autodiscover did not result in OOM pod restarts for metricbeat and filebeat. But as mentioned before, we need this config. |
ok, great, so the problem is definitely in the autodiscover. Do you mind trying this config instead:

```yaml
filebeat.inputs:
  - type: filestream
    id: my-filebeat-input
    paths:
      - /var/log/*.log
      - /var/log/messages
      - /var/log/syslog
      - /var/log/**/*.log
    prospector.scanner.exclude_files: ['^/var/log/containers/', '^/var/log/pods/']
    fields:
      system:
        name: test
        uid: d1374af9-1234-aaaaa-bbb-974f1b033347
    fields_under_root: true

# To enable hints based autodiscover, remove `filebeat.inputs` configuration and uncomment this:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: filestream
        id: filestream-kubernetes-pod-${data.kubernetes.container.id}
        prospector.scanner.symlinks: true
        paths:
          - /var/log/containers/*${data.kubernetes.container.id}.log
        parsers:
          - container: ~
      add_resource_metadata:
        deployment: false
        cronjob: false
```

Just make sure that both filebeat and the Elastic stack are at version 8.8.1, otherwise you might encounter other issues, since some of those changes have only been fixed recently. @gizas has tested that the previous configs work with that version and he wasn't able to replicate any OOM issues.
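If it helps, one quick way to double-check the running versions (pod name, namespace, endpoint, and credentials below are placeholders; adjust for your deployment):

```sh
# Print the Filebeat version from inside a running pod
kubectl -n kube-system exec filebeat-xxxxx -- filebeat version

# Print the Elasticsearch version (adjust URL/credentials for your cluster)
curl -sk -u elastic:$ELASTIC_PASSWORD https://localhost:9200 | grep number
```
|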
Thank you @kbujold, at least we have some progress. For filebeat, can you also try the following:

```yaml
add_resource_metadata:
  namespace:
    enabled: false
  node:
    enabled: false
  deployment: false
  cronjob: false
```

With the above you are also disabling namespace and node metadata enrichment. I would suggest trying to disable those one by one and then both together. Have in mind that you might lose some metadata in some cases, but I hope that won't be important in your case, as you are mainly looking for the actual logs. If this also does not work, I think we will need some information about your cluster (size, whether you have restarts, etc.) in order to simulate the situation. Also the output of
Another important note: please test with 8.8.1, as this is the version where the fix for the filestream input has been merged. |
hello @kbujold, it's not clear whether the same changes were replicated for filebeat. Can you please post here the entire filebeat config, not just the sections provided? |
One more thing to be sure about the testing:
|
@gsantoro
|
@kbujold I can still see the processor block inside:

```yaml
- add_kubernetes_metadata:
    annotations.dedot: true
    default_matchers.enabled: false
    labels.dedot: true
```

Can you also remove this? |
It OOMed with this config as well, with add_kubernetes_metadata removed.
|
Are you using the 8.8.1 Elastic stack? |
@gsantoro
|
@kbujold , |
Let me suggest some more things @kbujold, in order to also be able to simulate your setup:
I have been running experiments trying to reproduce the issue. Here is what I'm using for reference.

Script to collect heap profiles:

```sh
#!/bin/bash
sleepTime=$((60*5))
for i in $(seq 1 1 100)
do
    echo "Getting heap for $i time"
    go tool pprof -png http://localhost:5066/debug/pprof/heap > heap${i}.png
    sleep $sleepTime
done
```

Filebeat's config:

```yaml
http.enabled: true
http.port: 5066
http.host: 0.0.0.0
http.pprof.enabled: true
```

It would be great to see what part of memory is growing, and it would be really helpful to see a heap close to the restart.
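If raw profiles are more convenient than PNGs, a sketch of capturing and comparing two heap snapshots with standard pprof options (file names are illustrative):

```sh
# Save raw heap profiles at two points in time
curl -s http://localhost:5066/debug/pprof/heap > heap-3h.bin
curl -s http://localhost:5066/debug/pprof/heap > heap-14h.bin

# Show the biggest in-use allocations in the later profile
go tool pprof -top heap-14h.bin

# Diff the two profiles to see what grew between the captures
go tool pprof -top -base heap-3h.bin heap-14h.bin
```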
|
@gizas
|
Hello @kbujold, trying to understand your comment:
If you remove the add_resource_metadata block, this means that all metadata enrichment will happen, because by default all the flags (cronjob, namespace, node, deployment) are true. So please double confirm that if without the
In any case I would want:
|
We will collect more data. Can the CPU running at 100% for the pod cause issues with the pod's memory management? We did not increase the CPU pod limit from ELK 7.17 to ELK 8.9.0, and with ELK 8.9.0 it is running at close to 100%. Here's a 4-day collection. We have found some promising results with increasing the CPU so the filebeat pods are not running at 100%. |
Thank you @kbujold, indeed you need to keep monitoring both CPU and memory.
CPU throttling will cause service degradation, latency, etc., as requests will be silently dropped. As for the 100% usage, this is a percentage of the limits you have defined in your manifest.
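One simple way to keep an eye on both, assuming metrics-server is installed (the namespace, label selector, and pod name are illustrative):

```sh
# Current CPU/memory usage of the beats pods
kubectl top pod -n kube-system -l k8s-app=filebeat

# Check whether a pod was OOMKilled and what its limits are
kubectl describe pod -n kube-system filebeat-xxxxx | grep -A 5 -E "Last State|Limits"
```
|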
Case 1) We removed add_resource_metadata from the filebeat config. In this case we see the memory being stable. I have collected a profile at 3 hours and also at 14 hours. See the .bin files here
Case 2) We have add_resource_metadata with all options disabled in the filebeat config. I have the 3-hour and 12-hour profile .bin files here.
I would not have expected Case 2) to cause the creeping memory. Maybe it can be a clue to where this memory leak issue is coming from. Thanks, |
Thank you Cristine, that is good news. So you have everything enabled (full metadata enrichment) and no memory increase in 1). I see in your 2) heaps that autodiscovery for pods is there (I would not expect this, as all options are disabled; can you please double confirm that the configuration is aligned with the correct agent). Also, from the 2) heaps (3h and 12h), even the heap states that there is no memory leak in the autodiscovery memory. Question: what are the blue and the green agents that you have in your second photo above? What is their difference? Also mind that:

```yaml
http.pprof.enabled: true
logging.level: warning
```

The above two options should be used only in lab environments. Especially pprof should be disabled in production, as it increases memory consumption. |
There is no good news ;-) Case 1 does not enable the full metadata enrichment. With Case 2) the configuration is 100% correct, and that is why it is puzzling: all options are disabled and the memory is creeping over time, eventually resulting in a pod restart with OOM.
Our system runs the filebeat pod on two nodes. The green is controller-0, which is the one with the memory creep. We set the logging to warning because we were getting large amounts of info logs. Do you not recommend setting logging to warning in production?
Kristine |
For 1) you can always enable only add_resource_metadata.node and add_resource_metadata.namespace, as I don't expect much overhead from those (see the sketch after the paths list below). There is a fix in #36736 in 8.10.3 (which will be released in the next few days) that should improve filebeat memory usage overall. It is also advisable to set logging.level: error. Can I suggest another thing (as long as you are testing in the lab):

```yaml
paths:
  - /var/log/containers/*-${data.kubernetes.container.id}.log
  - /var/log/*.log
  - /var/log/messages
  - /var/log/syslog
```
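As mentioned above, a minimal sketch of keeping only node and namespace enrichment, mirroring the earlier add_resource_metadata examples in this thread:

```yaml
add_resource_metadata:
  namespace:
    enabled: true
  node:
    enabled: true
  deployment: false
  cronjob: false
```
|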
@kbujold I tested the above and I can confirm that it is not accurate. With the below config:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/log/containers/*${data.kubernetes.container.id}.log
```

See above the node and namespace labels being present. |
When we have case 1 set, we see no data for
|
This seems like you don't receive logs at all.
|
I am deploying Elastic 8.9.0 via ECK; I reported this in case 01497623. The ECK yaml that I used to configure filebeat is:
Is the issue I am seeing similar to what you are seeing, @kbujold? Did you converge on a working solution to get around this problem with filebeat? I also see discussion about filebeat/metricbeat memory leaks in:
Are there any solutions posed in those issues which may solve my problem? |
For the current discussion, the workaround that was proposed is to disable specific metadata enrichment by using the add_resource_metadata config:

```yaml
filebeat:
  autodiscover:
    providers:
      - type: kubernetes
        node: ${NODE_NAME}
        hints:
          enabled: true
          default_config:
            type: container
            paths:
              - /var/log/containers/*${data.kubernetes.container.id}.log
        add_resource_metadata:
          cronjob: false
          deployment: false
```

For your configuration I see:

```yaml
requests:
  memory: 600Mi
```

Mind that, depending on your cluster and its size, this might not be enough, so you should consider increasing it and see where the memory stabilises for your setup. If it does not stabilise, then we should see whether we have a memory leak or not. PR 36736 addresses another problem, but it can help to reduce the overall memory consumption.
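For example, a starting point could look like the sketch below; the numbers are purely illustrative and should be tuned to your node count and log volume:

```yaml
resources:
  requests:
    cpu: 500m
    memory: 800Mi
  limits:
    cpu: "1"
    memory: 1Gi
```
|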
Some questions:
that I will achieve the benefit of what is being discussed in this issue? Or is that a wrong path, due to the fact that I don't have
Is there a better way to figure out an optimal setting for this other than trial and error? For my setup, I need to control memory/CPU resources via a combination of Terraform + Kubernetes YAML, so I can't just arbitrarily change this value.
|
|
I upgraded my Elastic cluster to 8.10.3 and did not see any improvement due to 3. I will investigate 1. and 2. as per your recommendations. Thanks for the responses. |
We do, just not the Kubernetes fields. We see no errors from the filebeat pods. |
We fixed our issue by significantly increasing filebeat's CPU limit so it would not run at 100%. With this change we are no longer seeing OOM pod restarts, and the filebeat memory is stable. |
@kbujold mind that the example you posted comes from the filebeat.inputs section, so it is expected that it does not belong to any kubernetes autodiscovery and has no container or namespace metadata. It is expected:

```yaml
filebeat.inputs:
  - close_timeout: 5m
    enabled: true
    ...
    paths:
      - /var/log/*.log    # <---- matched by this
```
|
@gizas I see that you submitted this PR to elastic-agent 8.10.4: elastic/elastic-agent#3591 If I upgrade to filebeat 8.10.4, will this default to disabling metadata enrichment for deployments and cronjobs, |
Ah OK, thanks for that. I will try out 8.10.5 or 8.11.x when they come out. As an aside, do you have any information on how to enable memory profiling in filebeat to do analysis similar to what was done here, with the graphs: https://discuss.elastic.co/t/filebeat-pods-keeps-increasing-memory-usage/325124 I currently have support ticket case 01497623 to cover filebeat memory issues that I am having, and am currently going back and forth with the support agent. I'd like to make faster progress to root-cause and solve this problem. Thanks. |
Add in filebeat:

```yaml
http:
  enabled: true
  pprof.enabled: true
  port: 5066
```

To get the heap dump I'm redirecting the HTTP endpoint to a local port and then running a curl:

```sh
# in one terminal session
kubectl port-forward pod/filebeat-hfpnw 8080:5066

# in another
curl -s -v http://localhost:8080/debug/pprof/heap > FILENAME
```

Then to find information:
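One way to explore the saved profile afterwards is the standard Go pprof tooling (FILENAME is whatever you captured above):

```sh
# Top in-use allocations in the captured heap profile
go tool pprof -top FILENAME

# Or browse it interactively in a local web UI
go tool pprof -http=:8081 FILENAME
```
|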
Closing this for now, as the initial request has been addressed. Also, relevant issues for Windows hosts are being addressed in other cases. |
Since moving from ELK 7.17.1 to ELK 8.6.2 (and also ELK 8.7 and ELK 8.8.0) we have been experiencing OOMKilled on filebeat and metricbeat pods. We had no issues with ELK 7.17.1. Increasing the resource allocations does not resolve the issue and simply delays the pod restarting with OOM, which usually occurs after 9-12 hours. This appears to be a memory leak issue with beats.
Below is the initial post which I raised in the forum with .bin files and configmaps
https://discuss.elastic.co/t/potential-memory-leak-issue-with-filebeat-and-metricbeat/334353
We have applied the configs recommended here but it did not resolve the issue. #33307 (comment)
Thank you,
Kris