[metricbeat] memory leak 10GB RAM usage and crash, urgent solution needed #37142

StefanSa · 2023-11-17T10:08:24Z

Hi there,
We are running the latest version of the Elastic Agent, v8.11.0.
Metricbeat recently crashed after what looks like a memory leak: it grew to 10 GB of RAM usage and then crashed.

Metricbeat at 10 GB RAM usage:

Metricbeat memory usage over a period of 7 days:

Number of Metricbeat process handles rising to over 3 million over 7 days:

Metricbeat crash:

The system (Windows Server 2019) was unusable for a long time. Finding this bug should be priority 1, as it is very risky in a production environment.

I have a diagnostic log file from the Elastic Agent; if you are interested, I can upload it.

#35796

K4S1 · 2023-11-17T10:47:26Z

#35796 (comment)

Seeing exactly the same. :-(

StefanSa · 2023-11-17T10:52:45Z

Yes, we are considering whether to roll back or deactivate the agent completely.
The current behavior is too unsafe.

K4S1 · 2023-11-17T11:09:14Z

I had to remove the agent from 80% of my fleet because of production disruption.
I have kept a couple running to see how to fix this or get further.
How fast the RAM fills up depends on how much the servers are doing.
I found some servers filling their RAM in a matter of hours and others more slowly.
But Metricbeat seems to have issues in 8.11.0.

There is no indication that it has been discovered and fixed in 8.11.1:
https://www.elastic.co/guide/en/beats/libbeat/8.11/release-notes-8.11.1.html

That is why I removed it ASAP from critical infrastructure.

It is also not listed as a known bug in 8.11.0:
https://www.elastic.co/guide/en/beats/libbeat/8.11/release-notes-8.11.0.html

nicpenning · 2023-11-18T00:19:11Z

I have seen this in our test environment and others reporting it as well. This is a blocker for us upgrading.

willemdh · 2023-11-18T11:01:53Z

If this is causing so many issues, it needs more attention. Does this happen only in Metricbeat? Only on Windows? I was planning to update on Monday, but I'm going to postpone the update after reading this.

StephanErb · 2023-11-18T11:28:25Z

What version was stable for you before that?

hendry-lim · 2023-11-18T11:51:03Z

8.10.4 is good for us. We downgraded to 8.10.4, and Metricbeat is using less than 200 MB of memory.

We raised a support ticket, but have had no reply yet.

> If this is causing so much issues, this needs more attention. This happens only in Metricbeat? Only on Windows?

That's what I noticed in one of our customers' environments. It's fine on Linux. Yes, it only happens with Metricbeat.

StefanSa · 2023-11-19T09:46:35Z

@andrewkroh Andrew,
when will this serious bug be fixed?
It is currently too risky to use this Metricbeat version in a production environment.

StefanSa · 2023-11-19T09:54:10Z

And especially in a situation like this, a downgrade option via fleet server would be very helpful.
elastic/elastic-agent#520

sniky44 · 2023-11-19T13:19:39Z

We have been experiencing the same issue since 8.11.1. It maxes out the RAM on the servers and takes production down.

eriroley · 2023-11-19T14:55:26Z

I can confirm this also - this is a major issue

cedricremmicom · 2023-11-19T16:10:12Z

Confirmed! We have narrowed this down to the System Process Metrics: https://discuss.elastic.co/t/metricbeat-8-11-0-system-module-using-excessive-amount-of-memory/347236

leehinman · 2023-11-19T16:12:03Z

fix is here: elastic/elastic-agent-system-metrics#115

requires rebuild & release of beats

cedricremmicom · 2023-11-19T16:16:17Z

@leehinman As long as that PR is pending, I guess the only way around this is to set up our own builds and artifacts repo?

leehinman · 2023-11-19T16:31:04Z

None of the options are great, but here is what I've thought of:

- Downgrade to 8.10
- You can disable the "process" and "process_summary" metrics in the very short term
- Build it yourself today:
  - clone the elastic-agent-system-metrics repo
  - merge the PR into your clone on a new branch
  - git tag the branch, push branch and tag to your clone
  - in beats do `go mod edit -replace github.com/elastic/elastic-agent-system-metrics=github.com//elastic-agent-system-metrics@`
  - in beats do `go mod tidy`
  - build beat & distribute
- Wait for PRs that include the upcoming release of elastic-agent-system-metrics in beats and build those yourself
- Wait for official release
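The build-it-yourself steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions, not an official procedure: it assumes you have already forked elastic-agent-system-metrics on GitHub and have a beats checkout in a sibling directory, and `fork_user`, `tag`, and the branch names `pr-115` and `leak-fix` are hypothetical placeholders. The function only prints the commands, so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Prints (does not execute) the commands for the "build it yourself" option.
# Assumptions: a GitHub fork of elastic-agent-system-metrics already exists
# under fork_user, and a beats checkout lives in ../beats.
print_build_steps() {
  fork_user="$1"   # hypothetical: your GitHub account hosting the fork
  tag="$2"         # hypothetical: a semver-style tag you create, e.g. v0.0.1-leakfix
  mod="github.com/elastic/elastic-agent-system-metrics"
  fork="github.com/${fork_user}/elastic-agent-system-metrics"

  printf '%s\n' \
    "git clone https://${fork}.git" \
    "cd elastic-agent-system-metrics" \
    "git remote add upstream https://${mod}.git" \
    "git fetch upstream pull/115/head:pr-115" \
    "git checkout -b leak-fix" \
    "git merge pr-115" \
    "git tag ${tag}" \
    "git push origin leak-fix ${tag}" \
    "cd ../beats" \
    "go mod edit -replace ${mod}=${fork}@${tag}" \
    "go mod tidy"
}
```

Calling `print_build_steps your-github-user v0.0.1-leakfix` prints the plan; after reviewing it, run the lines by hand, then build and distribute the Beat as you normally would.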

StefanSa · 2023-11-21T08:41:35Z

@andrewkroh , @leehinman
As already mentioned, a downgrade option in Fleet Server is essential.

Equally important from a security and quality point of view would be a warning, shown when rolling out an agent, that the current version contains a system-critical bug and that rolling it out is therefore not recommended.

Please discuss in the team how security and software quality management can be improved here, especially for Fleet Server.
This memory leak clearly shows the current weaknesses when you want to roll out the agents across several hundred clients and servers.

cedricremmicom · 2023-11-21T08:46:36Z

As with any software it is always a good idea to spin up a staging/acceptance environment to verify any possible breaking changes before upgrading to a new version, especially when running software in a distributed architecture where you are impacting lots of hosts.

StefanSa · 2023-11-21T08:54:05Z

> As with any software it is always a good idea to spin up a staging/acceptance environment to verify any possible breaking changes before upgrading to a new version, especially when running software in a distributed architecture where you are impacting lots of hosts.

@cedricremmicom
I agree with that 100%, and that is what we did. We tested for 7 days without any problems.
The bug only became apparent over a longer period of time on Windows systems under higher load.
This is precisely why the improvement I suggested would be a clear help for all sysadmins.

hblankers · 2023-11-21T17:14:07Z

Can someone provide me with an estimated time when this fix will be released in the agent?

cmacknz · 2023-11-22T20:14:02Z

The fix will be in the upcoming 8.11.2 release.

JanKnipp · 2023-11-23T17:54:17Z

I understand that issues like this happen in software development, which is fine and can of course happen. But not communicating an issue to the customer once you know about it is really f***ed up. We deployed the agent about 48h ago on hundreds of machines and ran into serious issues. We did not look into the GitHub issues, so that is on us, but Elastic has known about this for about a week now, and the download still seems to be possible, and upgrading to 8.11.0 within Fleet is also still possible :(

StefanSa · 2023-11-23T18:29:27Z

> I understand that issues like this happen in software development which is fine and can of course happen. But not communicating an issue to the customer is really f***ed up once you know that there is an issue. We deployed the agent about 48h ago on hundreds of machines and ran into serious issues. We did not look into the github issues so that is on us but elastic knows about this for about a week now and download still seems to be possible and upgrade within fleets is also still possible to 8.11.0 :(

@JanKnipp Jan, I had already pointed this out, but without any feedback from the devs.
Maybe Craig @cmacknz can say something about it.

cmacknz · 2023-11-23T20:20:38Z

I don't have anything to comment besides apologizing that it took so long to be communicated widely. It is now a known issue in the release notes https://www.elastic.co/guide/en/fleet/current/release-notes-8.11.1.html. The initial focus was on identifying and fixing the issue and that delayed the communication unnecessarily.

The 8.11.x downloads are still available for now. This is a severe issue if you rely on Metricbeat or Elastic Agent for process metrics on Windows, but there are many other important uses that aren't affected.

RicardoCst · 2023-12-04T15:19:53Z

Shameful

StefanSa · 2023-12-04T15:34:34Z

@cmacknz Craig
Not everything that says enterprise on the label has enterprise inside. Here, the self-imposed goals are clearly not fulfilled.
The reference to the release notes may be correct, but it doesn't really help.
There is still no warning on the Metricbeat download page or in the Fleet manager integration.
This is no way to deal with potential "enterprise customers".
That will be a lesson for us.

RicardoCst · 2023-12-04T15:46:19Z

> @cmacknz Craig Enterprise is not always included where it says enterprise. Here, the self-imposed goals are clearly "not" fulfilled. The reference to the release note may be correct, but it doesn't really help. There is still no warning on the download page of the beats metric or in the fleet manager integration. This is no way to deal with potential "enterprise customers". That will be a lesson for us.

Indeed

jlind23 · 2023-12-04T17:24:21Z

@RicardoCst @StefanSa We are again really sorry for the impact this had on your end. Please be aware that we are doing our best to improve the situation. The fix will land in the next 8.11 patch and should be available in a week or so.
In the meantime, I am happy to hop on a call with you and detail the next actions we have taken on our end to improve.

hilt86 · 2023-12-05T02:40:02Z

This probably deserves an email to cloud customers. This is a critical issue, and I only got to investigate after it had already impacted production, when a notice could have been sent to customers over 2 weeks ago. Please communicate better if you want to retain the trust of enterprise customers.

willemdh · 2023-12-05T08:58:00Z

Why hasn't a new version with a fix for this been released yet? This is really problematic and food for thought. Instead of focusing on new features, please spend some time thinking about the stability of your agents (Beats + Agent) to prevent this kind of high-impact problem.

jvalente-salemstate · 2023-12-10T20:02:59Z

At a minimum, an email to customers while the fix is being worked on would be nice. Or a banner in Kibana 8.11.2, or one in the cloud console now. It's mentioned in the release notes for stack upgrades, but someone who isn't upgrading won't see that. I got lucky: only one out of roughly a hundred Windows servers (oddly, the only Azure one instead of on-prem VMware) seems to have been impacted, albeit in an annoying manner where it would crash before being able to get a policy update until I manually intervened.

A week ago, when people commented on this issue, angry about the lack of communication, and feedback on how to do better was solicited, would have been a good time for it, but today or tomorrow are also good.

Bugs happen, and taking down the affected downloads was good, but communicating an issue like this, especially when agents are behaving in a way that is counter to both observability and security, is crucial imho. Not particularly upset, just some feedback. I am in the Slack and check the GitHub repos regularly for various reasons, but most customers are not, and when paying for those big plans they should not be expected to monitor channels for info that support should proactively communicate.

kalramani · 2024-01-11T04:03:30Z

Same issue in production, and we logged a case with Elastic.

Issue

On Microsoft Windows, all Beats and Elastic Agents running the System integration or metrics collection are affected by a memory leak.
This affects versions 8.11.0 and 8.11.1.
The memory leak can lead to huge pagefile usage, performance issues, and even machine restarts (BSOD).

On December 5th, the downloads of Elastic Agent 8.11.0 and 8.11.1 were removed from the Elastic download page.
This might also result in silently failed upgrades or installs via the Fleet UI (as the Elastic Agent file can no longer be found).

Environment

Occurs only on the Microsoft Windows operating system, with any of the following products:

- Elastic Agent 8.11.0 / 8.11.1 with System integration in the policy
- Elastic Agent 8.11.0 / 8.11.1 with Logs & Metrics collection enabled in the policy
- Metricbeat 8.11.0 / 8.11.1 with System module enabled
- All Beats collecting self-metrics

Workaround for Elastic Agents

To mitigate the problem:

- Disable the System process metrics and System process_summary metrics in the System integration, if you're using it (those metrics will no longer be collected)
- Disable the Logs & Metrics collection in the policy (metrics and logs of Elastic Agent and its components will no longer be collected)

If you cannot upgrade to 8.11.2 or more recent (recommended), you can downgrade to 8.10.4 or older.
To downgrade, you can:

- Uninstall 8.11.0/8.11.1 and install 8.10.4 or older, but the Elastic Agent "state" will be lost
- Run the command `sudo elastic-agent upgrade 8.10.4` (or a previous version). Even though the command says upgrade, the Elastic Agent will go back to the version specified. Note that this "downgrade" method is not fully tested. The only advantage is that the "state" will be kept.

Workaround for Beats

If you cannot upgrade to 8.11.2 or more recent (recommended), you can downgrade to 8.10.4 or older.

If you really need to keep Beats running, you would need to:

- Set `logging.metrics.enabled: false` in the configuration file
- Set `http.enabled: false` in the configuration file
- Set `monitoring.enabled: false`
- Do not enable the system module (if using Metricbeat)
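The three settings listed above can be collected into one small configuration fragment. The following is a minimal sketch, assuming you then merge the generated lines into your real Beat configuration file by hand; the function name and the output file name are hypothetical and not part of any Elastic tooling.

```shell
#!/bin/sh
# Writes the three workaround settings from the list above into a YAML fragment.
# The output path is chosen by the caller; nothing here touches a live Beat config.
write_leak_workaround() {
  out_file="$1"   # hypothetical path, e.g. ./leak-workaround.yml
  printf '%s\n' \
    '# Temporary workaround for the 8.11.0/8.11.1 memory leak on Windows' \
    'logging.metrics.enabled: false' \
    'http.enabled: false' \
    'monitoring.enabled: false' > "$out_file"
}
```

For example, `write_leak_workaround ./leak-workaround.yml` produces a fragment to review and copy into the Beat's configuration. These settings switch off the Beat's own metrics reporting, so remove them again once you are on a fixed version.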

Resolution

Do not install or upgrade to 8.11.0/8.11.1.
Please note that Elastic Agent downloads for Microsoft Windows were removed from our download page on December 5th.

Upgrade to or install 8.11.2, 8.11.3 (when released), or a more recent version.

Public references

- Public issue #37142
- Fix: elastic/elastic-agent-system-metrics#115
- Related release notes:
  - Fleet: 8.11.0 and 8.11.1
  - Beats: 8.11.0 and 8.11.1

- Nov 17, 2023: botelastic added the needs_team label.
- Nov 19, 2023: nicpenning mentioned this issue in "Enable compression by default for Elasticsearch outputs" #36681 (merged).
- Nov 19, 2023: andrewkroh added the Team:Elastic-Agent label.
- Nov 19, 2023: botelastic removed the needs_team label.
- Nov 20, 2023: zez3 mentioned this issue in "Make Fleet spaces aware" elastic/fleet-server#2075 (closed).
- Nov 21, 2023: leehinman mentioned this issue in "upgrade elastic-agent-system-metrics to v0.8.2" #37171 (merged).
- Nov 21, 2023: leehinman closed this as completed in #37171.
- Nov 22, 2023: cmacknz assigned leehinman.
- Nov 22, 2023: cmacknz added the v8.11.0 label.
- Nov 28, 2023: jlind23 mentioned this issue in "Implement Longevity test for Elastic Agent and Beats" elastic/elastic-agent#3833 (closed).
- Jan 25, 2024: fearful-symmetry mentioned this issue in "add extended runtime test" elastic/elastic-agent#4150 (merged).
- Feb 6, 2024: cmacknz mentioned this issue in "Add a test that can detect OS handle leaks" elastic/elastic-agent#4206 (closed).
- Aug 14, 2024: cmacknz mentioned this issue in "[Flaky Test]: TestLongRunningAgentForLeaks/TestHandleLeak – Metricbeat input status reporting makes Windows agent permanently degraded" elastic/elastic-agent#5300 (closed).
