[metricbeat] memory leak 10GB RAM usage and crash, urgent solution needed #37142
Comments
Seeing exactly the same. :-(
Yes, we are considering whether to roll back or deactivate the agent completely.
I had to remove the agent from 80% of my fleet because of production disruption. There is no indication that it has been discovered and fixed in 8.11.1, which is why I removed it ASAP from critical infrastructure. It is also not listed as a known bug in 8.11.0.
I have seen this in our test environment and others reporting it as well. This is a blocker for us upgrading.
If this is causing so many issues, it needs more attention. Does this happen only in Metricbeat? Only on Windows? I was planning to update on Monday, but I'm going to postpone the update after reading this.
What version was stable for you before that?
Raised a support ticket, but no reply yet.
That's what I noticed in one of our customers' environments. It's fine on Linux. Yes, it's only happening on Metricbeat.
@andrewkroh Andrew,
And especially in a situation like this, a downgrade option via Fleet Server would be very helpful.
We have been experiencing the same issue since 8.11.1. It maxes out the RAM on the servers and takes production down.
I can confirm this as well. This is a major issue.
Confirmed! We have narrowed this down to the System Process Metrics: https://discuss.elastic.co/t/metricbeat-8-11-0-system-module-using-excessive-amount-of-memory/347236
The fix is here: elastic/elastic-agent-system-metrics#115. It requires a rebuild and release of Beats.
@leehinman As long as that PR is pending, the only way around this is to set up our own builds and artifacts repo, I guess?
None of the options are great, but here is what I've thought of:
@andrewkroh, @leehinman But equally important from a security and quality point of view would be a warning, when rolling out an agent, that the current version contains a system-critical bug and that rolling it out is therefore not recommended. Please discuss in the team how security and software quality management can be improved here, especially for Fleet Server.
As with any software, it is always a good idea to spin up a staging/acceptance environment to verify any possible breaking changes before upgrading to a new version, especially when running software in a distributed architecture where you are impacting lots of hosts.
@cedricremmicom
Can someone provide me with an estimated time when this fix will be released in the agent?
The fix will be in the upcoming 8.11.2 release.
I understand that issues like this happen in software development, which is fine. But not communicating an issue to the customer once you know about it is really f***ed up. We deployed the agent about 48h ago on hundreds of machines and ran into serious issues. We did not look into the GitHub issues, so that is on us, but Elastic has known about this for about a week now, and downloading 8.11.0 and upgrading to it within Fleet both still seem to be possible :(
@JanKnipp Jan, I had already pointed this out, but without any feedback from the devs.
I don't have anything to comment besides apologizing that it took so long to be communicated widely. It is now a known issue in the release notes https://www.elastic.co/guide/en/fleet/current/release-notes-8.11.1.html. The initial focus was on identifying and fixing the issue and that delayed the communication unnecessarily. The 8.11.x downloads are still available for now. This is a severe issue if you rely on Metricbeat or Elastic Agent for process metrics on Windows, but there are many other important uses that aren't affected.
Shameful
@cmacknz Craig
Indeed
@RicardoCst @StefanSa We are again really sorry for the impact this has had on your end. Be aware that we are doing our best to improve the situation. The fix will land in the next 8.11 patch and should be available in a week or so.
This probably deserves an email to cloud customers. This is a critical issue, and I had to investigate only after it had already impacted production, when a notice could have been sent out to customers over 2 weeks ago. You need to communicate better if you want to retain the trust of enterprise customers, please.
Why hasn't a new version with a fix for this been released yet? This is really problematic and food for thought. Instead of focusing on new features, please spend some time thinking about the stability of your agents (Beats + Agent) to prevent these kinds of high-impact problems.
At a minimum, an email to customers while the fix is being worked on would be nice. Or a banner in Kibana 8.11.2, or one in the cloud console now. It's mentioned in the release notes for stack upgrades, but someone who isn't upgrading won't see that. I got lucky and only one out of roughly a hundred Windows servers (oddly, the only one on Azure instead of on-prem VMware) seems to have been impacted, albeit in an annoying manner where it would crash before being able to get a policy update until I manually intervened. A week ago, when people commented on this issue, angry about the lack of communication, and feedback on how to do better was solicited, would have been a good time for it, but today or tomorrow are also good. Bugs happen, and taking down the affected downloads was good, but communicating an issue like this, especially when agents are behaving in a way that is counter to both observability and security, is crucial imho. Not particularly upset, just some feedback. I am in the Slack and check the GitHub repos regularly for various reasons, but most customers are not, and when paying for those big plans they should not be expected to monitor channels for info that support should proactively communicate.
Same issue in production, and we logged a case with Elastic. On December 5th, the downloads of Elastic Agent 8.11.0 and 8.11.1 were removed from the Elastic download page.
Environment: Elastic Agent 8.11.0 / 8.11.1 with the System integration in the policy.
Disable the System process metrics and System process_summary metrics in the System integration, if you're using it (those metrics will no longer be collected).
Uninstall 8.11.0/8.11.1 and install 8.10.4 or older, but the Elastic Agent "state" will be lost.
If you really need to keep Beats running, you would need to set logging.metrics.enabled: false in the configuration file.
Upgrade to or install 8.11.2 or 8.11.3 (when released) or a more recent version.
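For anyone triaging a larger fleet, the decision described in this thread could be sketched roughly as below. This is only an illustration in Python: the function names and return strings are made up for this example and are not part of any Elastic tooling, and the version numbers (8.11.0 and 8.11.1 affected, Windows only, fix expected in 8.11.2) are taken from the comments above rather than verified independently.

```python
# Hypothetical triage helper. Names are illustrative, not an Elastic API.
# Affected versions as reported in this thread (assumption, not verified here).
AFFECTED = {(8, 11, 0), (8, 11, 1)}


def parse_version(text):
    """Turn '8.11.1', 'v8.11.1' or '8.11.1-SNAPSHOT' into (8, 11, 1)."""
    core = text.strip().lstrip("vV").split("-", 1)[0]
    parts = core.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected version string: {text!r}")
    return tuple(int(p) for p in parts)


def triage(version, platform):
    """Return a short recommendation for one host."""
    if parse_version(version) not in AFFECTED:
        return "not listed as affected in this thread"
    if platform.strip().lower() != "windows":
        # Commenters report that Linux hosts are fine.
        return "affected version, but the leak was only reported on Windows"
    return ("affected: disable process metrics, roll back to 8.10.4 or older, "
            "or upgrade to 8.11.2+")


if __name__ == "__main__":
    hosts = [("v8.11.0", "Windows"), ("8.11.1", "Linux"), ("8.10.4", "Windows")]
    for version, platform in hosts:
        print(version, platform, "->", triage(version, platform))
```

The host list at the bottom is made up; in practice the version and platform would come from whatever inventory you already have.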
Hi there,
We are using the latest version v8.11.0 of the Elastic Agent here.
Metricbeat recently crashed, probably because of a memory leak: RAM usage climbed to 10GB before the crash.
Metricbeat at 10GB RAM usage:
Memory usage of Metricbeat over a period of 7 days:
The number of process handles held by Metricbeat rises to over 3 million over 7 days:
Metricbeat crash:
The system, Win2k19, was unusable for a long time. Priority 1 should be to find the bug, as this is very risky in a production environment.
I have a diagnostic log file from the Elastic Agent. If you are interested, I can upload it.
#35796