[metricbeat] memory leak 10GB RAM usage and crash, urgent solution needed #37142

Closed
StefanSa opened this issue Nov 17, 2023 · 31 comments · Fixed by #37171
Labels
Team:Elastic-Agent Label for the Agent team v8.11.0

Comments

@StefanSa
Contributor

StefanSa commented Nov 17, 2023

Hi there,
We are using the latest version v8.11.0 of the Elastic Agent here.
We recently had a Metricbeat crash, probably caused by a memory leak: RAM usage reached 10 GB before the crash.

metricbeat 10GB RAM Usage:
Screenshot 2023-11-17 103837

Metricbeat memory usage over a 7-day period:
Screenshot 2023-11-17 112524

The number of Metricbeat process handles rises to over 3 million over 7 days:
Screenshot 2023-11-17 113602

metricbeat crash:
Screenshot 2023-11-17 103949

The system (Windows Server 2019) was unusable for a long time. Finding this bug should be priority 1, as it is very risky in a production environment.

I have a diagnostics log file from the Elastic Agent; if you are interested, I can upload it.

#35796

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 17, 2023
@K4S1

K4S1 commented Nov 17, 2023

#35796 (comment)

Seeing exactly the same. :-(

@StefanSa
Contributor Author

Yes, we are considering whether to roll back or deactivate the agent completely.
The current behavior is too unsafe.

@K4S1

K4S1 commented Nov 17, 2023

I had to remove the agent from 80% of my fleet because of production disruption.
I have kept a couple installed to see how to fix this or get further.
How fast the RAM fills up depends on how much the servers are doing.
Some servers filled their RAM in a matter of hours, others more slowly.
But Metricbeat seems to have issues in 8.11.0.

There is no indication that it has been discovered and fixed in 8.11.1:
https://www.elastic.co/guide/en/beats/libbeat/8.11/release-notes-8.11.1.html

That is why I removed it ASAP from critical infrastructure.

And it is not listed as a known bug in 8.11.0:
https://www.elastic.co/guide/en/beats/libbeat/8.11/release-notes-8.11.0.html

@nicpenning
Contributor

I have seen this in our test environment and others reporting it as well. This is a blocker for us upgrading.

@willemdh

If this is causing so many issues, this needs more attention. This happens only in Metricbeat? Only on Windows? I was planning to update on Monday, but I'm going to postpone the update after reading this.

@StephanErb

What version was stable for you before that?

@hendry-lim
Contributor

hendry-lim commented Nov 18, 2023

8.10.4 is good for us. We downgraded to 8.10.4. Metricbeat is using less than 200 MB of memory.

Raised a support ticket, but no reply yet.

> If this is causing so many issues, this needs more attention. This happens only in Metricbeat? Only on Windows?

That's what I noticed in one of our customers' environments. It's fine on Linux. Yes, it's only happening in Metricbeat.

@StefanSa
Contributor Author

@andrewkroh Andrew,
when will this serious bug be fixed?
Currently it is too risky to use the current Metricbeat version in a production environment.

@StefanSa
Contributor Author

And especially in a situation like this, a downgrade option via fleet server would be very helpful.
elastic/elastic-agent#520

@andrewkroh andrewkroh added the Team:Elastic-Agent Label for the Agent team label Nov 19, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 19, 2023
@sniky44

sniky44 commented Nov 19, 2023

We have been experiencing the same issue since 8.11.1. It maxes out the RAM on the servers and takes production down.

@eriroley

I can confirm this also - this is a major issue

@cedricremmicom

cedricremmicom commented Nov 19, 2023

Confirmed! We have narrowed this down to the System Process Metrics: https://discuss.elastic.co/t/metricbeat-8-11-0-system-module-using-excessive-amount-of-memory/347236

@leehinman
Contributor

fix is here: elastic/elastic-agent-system-metrics#115

requires rebuild & release of beats

@cedricremmicom

@leehinman As long as that PR is pending, the only way to get around this is to set up our own builds + artifacts repo, I guess?

@leehinman
Contributor

None of the options are great, but here is what I've thought of:

  1. Downgrade to 8.10
  2. Disable the "process" and "process_summary" metrics in the very short term
  3. Build it yourself today
    • clone the elastic-agent-system-metrics repo
    • merge the PR into your clone on a new branch
    • git tag the branch, push branch and tag to your clone
    • in beats do `go mod edit -replace github.com/elastic/elastic-agent-system-metrics=github.com//elastic-agent-system-metrics@
    • in beats do `go mod tidy`
    • build beat & distribute
  4. Wait for PRs that include the upcoming release of elastic-agent-system-metrics in beats and build those yourself
  5. Wait for official release
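
Option 2 could look roughly like the following for a standalone Metricbeat. This is only a sketch under assumptions not stated in this thread: that the system module is configured in `modules.d/system.yml`, and that the `period` and the remaining metricsets (`cpu`, `memory`, `network`) match your setup. For Elastic Agent, the equivalent change would be made in the System integration policy rather than in a file.

```yaml
# modules.d/system.yml (assumed location; adjust to your setup)
- module: system
  period: 10s          # illustrative value, not from this thread
  metricsets:
    - cpu
    - memory
    - network
    # Disabled as a short-term workaround for the leak on Windows:
    # - process
    # - process_summary
```

Process-level metrics are not collected while these two metricsets are disabled. Note that the support summary later in this thread goes further and advises not enabling the system module at all on the affected versions.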

@StefanSa
Contributor Author

@andrewkroh , @leehinman
As already mentioned, the possibility of a downgrade option for Fleetserver is essential.

But equally important from a security and quality point of view would be a warning when rolling out an agent that the current version contains a system-critical bug and therefore rolling it out is not recommended.

Please discuss within the team how security and software quality management can be improved here, especially for Fleetserver.
This memory leak clearly shows the current weaknesses when rolling out the agents across several hundred clients and servers.

@cedricremmicom

As with any software it is always a good idea to spin up a staging/acceptance environment to verify any possible breaking changes before upgrading to a new version, especially when running software in a distributed architecture where you are impacting lots of hosts.

@StefanSa
Contributor Author

> As with any software it is always a good idea to spin up a staging/acceptance environment to verify any possible breaking changes before upgrading to a new version, especially when running software in a distributed architecture where you are impacting lots of hosts.

@cedricremmicom
I agree with that 100%, which is what we did. We tested for 7 days without any problems.
The bug only became apparent over a longer period of time on Windows systems with a higher load.
This is precisely why the improvement I suggested would be a clear help for all sysadmins.

@hblankers

Can someone provide me with an estimated time when this fix will be released in the agent?

@cmacknz
Member

cmacknz commented Nov 22, 2023

The fix will be in the upcoming 8.11.2 release.

@JanKnipp

I understand that issues like this happen in software development, which is fine. But not communicating an issue to customers once you know about it is really f***ed up. We deployed the agent about 48h ago on hundreds of machines and ran into serious issues. We did not look into the GitHub issues, so that is on us, but Elastic has known about this for about a week now, and 8.11.0 can still be downloaded and is still offered as an upgrade within Fleet :(

@StefanSa
Contributor Author

> I understand that issues like this happen in software development, which is fine. But not communicating an issue to customers once you know about it is really f***ed up. We deployed the agent about 48h ago on hundreds of machines and ran into serious issues. We did not look into the GitHub issues, so that is on us, but Elastic has known about this for about a week now, and 8.11.0 can still be downloaded and is still offered as an upgrade within Fleet :(

@JanKnipp Jan, I had already pointed this out, but without any feedback from the devs.
Maybe Craig @cmacknz can say something about it.

@cmacknz
Member

cmacknz commented Nov 23, 2023

I don't have anything to add beyond apologizing that it took so long to be communicated widely. It is now a known issue in the release notes: https://www.elastic.co/guide/en/fleet/current/release-notes-8.11.1.html
The initial focus was on identifying and fixing the issue, and that delayed the communication unnecessarily.

The 8.11.x downloads are still available for now. This is a severe issue if you rely on Metricbeat or Elastic Agent for process metrics on Windows, but there are many other important uses that aren't affected.

@RicardoCst

Shameful

@StefanSa
Contributor Author

StefanSa commented Dec 4, 2023

@cmacknz Craig
Not everything labeled "enterprise" is actually enterprise-grade. Here, the self-imposed goals are clearly not fulfilled.
The reference to the release notes may be correct, but it doesn't really help.
There is still no warning on the Metricbeat download page or in the Fleet manager integration.
This is no way to deal with potential "enterprise customers".
That will be a lesson for us.

@RicardoCst

RicardoCst commented Dec 4, 2023

> @cmacknz Craig Not everything labeled "enterprise" is actually enterprise-grade. Here, the self-imposed goals are clearly not fulfilled. The reference to the release notes may be correct, but it doesn't really help. There is still no warning on the Metricbeat download page or in the Fleet manager integration. This is no way to deal with potential "enterprise customers". That will be a lesson for us.

Indeed

@jlind23
Collaborator

jlind23 commented Dec 4, 2023

@RicardoCst @StefanSa We are again really sorry for all the impact this had on your end; be aware that we are doing our best to improve the situation. The fix will land in the next 8.11 patch and should be available in a week or so.
In the meantime I am happy to hop on a call with you and detail all the follow-up actions we took on our end to improve.

@hilt86

hilt86 commented Dec 5, 2023

This probably deserves an email to cloud customers. This is a critical issue, and I only got to investigate after it had already impacted production, when a notice could have been sent out to customers over 2 weeks ago. Please communicate better if you want to retain the trust of enterprise customers.

@willemdh

willemdh commented Dec 5, 2023

Why hasn't a new version with a fix for this been released yet?? This is really problematic and food for thought. Instead of focusing on new features, please spend some time thinking about the stability of your agents (Beats + Agent), to prevent these kinds of high-impact problems.

@jvalente-salemstate

jvalente-salemstate commented Dec 10, 2023

At a minimum, an email to customers while the fix is being worked on would be nice. Or a banner in Kibana 8.11.2, or one in the cloud console now. It's mentioned in the release notes for stack upgrades, but someone who isn't upgrading won't see that. I got lucky: only one out of roughly a hundred Windows servers (oddly, the only Azure one instead of on-prem VMware) seems to have been impacted, albeit in an annoying manner where it would crash before being able to get a policy update until I manually intervened.

A week ago, when people commented on this issue angry about the lack of communication and feedback on how to do better was solicited, would have been a good time for it, but today or tomorrow are also good.

Bugs happen, and taking down the affected downloads was good, but communicating an issue like this, especially when agents are behaving in a way that is counter to both observability and security, is crucial IMHO. Not particularly upset, just some feedback. I am in the Slack and check the GitHub repos regularly for various reasons, but most customers are not and, when paying for those big plans, should not be expected to monitor channels for info that support should proactively communicate.

@kalramani

Same issue in production, and we logged a case with Elastic.

Issue

On Microsoft Windows, all Beats and Elastic Agent running System integration or Metrics collection are affected by a memory leak.
This affects versions 8.11.0 and 8.11.1.
The memory leak can lead to huge pagefile usage, performance issues and even machine restarts (BSOD).

On December 5th, the downloads of Elastic Agent 8.11.0 and 8.11.1 were removed from the Elastic Download page.
This might also result in silently failing upgrades or installs via the Fleet UI (as the Elastic Agent file cannot be found anymore).

Environment

Occurs only on the Microsoft Windows operating system, with any of the following products:

  • Elastic Agent 8.11.0 / 8.11.1 with System integration in the policy
  • Elastic Agent 8.11.0 / 8.11.1 with Logs & Metrics collection enabled in the policy
  • Metricbeat 8.11.0 / 8.11.1 with System module enabled
  • All Beats collecting self-metrics

Workaround for Elastic Agents

To mitigate the problem:

  • Disable the System process metrics and System process_summary metrics in the System integration, if you're using it (those metrics will no longer be collected)
  • Disable the Logs & Metrics collection in the policy (metrics and logs of Elastic Agent and its components will no longer be collected)

If you cannot upgrade to 8.11.2 or more recent (recommended), you can downgrade to 8.10.4 or older.
To downgrade, you can:

  • Uninstall 8.11.0/8.11.1 and install 8.10.4 or older, but the Elastic Agent "state" will be lost
  • Run the command line `sudo elastic-agent upgrade 8.10.4` (or a previous version). Even though the command says upgrade, the Elastic Agent will go back to the version specified. Note that this "downgrade" method is not fully tested. Its only advantage is that the "state" will be kept.

Workaround for Beats

If you cannot upgrade to 8.11.2 or more recent (recommended), you can downgrade to 8.10.4 or older.

If you really need to keep Beats running, you would need to:

  • Set logging.metrics.enabled: false in the configuration file
  • Set http.enabled: false in the configuration file
  • Set monitoring.enabled: false
  • Do not enable the system module (if using Metricbeat)
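
Put together, the three settings above might look like this in a Beat's configuration file. A minimal sketch, assuming a standalone Beat configured through a YAML file such as `metricbeat.yml` (the file name is an assumption; use the file for the Beat you run):

```yaml
# e.g. metricbeat.yml (assumed file name)
# Stop periodic logging of the Beat's internal metrics
logging.metrics.enabled: false
# Disable the local HTTP endpoint that exposes internal metrics
http.enabled: false
# Disable shipping of the Beat's own monitoring data
monitoring.enabled: false
```

With these set, the Beat no longer reports its own metrics through any of the three channels, so self-monitoring data will be missing until the settings are reverted after upgrading.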

Resolution

Do not install or upgrade to 8.11.0/8.11.1.
Please note that Elastic Agent downloads for Microsoft Windows were removed from our download page on December 5th.

Upgrade to or install 8.11.2, 8.11.3 (when released), or a more recent version.

Public references

  • Public issue #37142
  • Fix: elastic/elastic-agent-system-metrics#115
  • Related release notes:
    • Fleet: 8.11.0 and 8.11.1
    • Beats: 8.11.0, 8.11.1
