[Azure Logs]: Integration eats up memory and dies #11056
Comments
We are experiencing the same issue while running the Azure integration. Our setup is similar to @dmaasland's, except we are running the agent on AWS ECS rather than Kubernetes. Interestingly, we had been running the Azure integration for months with no issue; we only started seeing this error after we upgraded our agents. [EDIT] Adding a screenshot showing when we upgraded our agents.
Hey @dmaasland and @peterydzynski, thank you for reporting this issue! There are probably two topics at play here. They are related but distinct.
Let's start with the error logs:
The 8.15.1 release added a new tracer that surfaces internal error logs from the event hub SDK. These errors were already occurring before; they were just invisible. In our experience, they usually come from contention between the inputs accessing the event hub (more on this later).

The memory leak probably comes from an update to the event hub libraries we shipped in 8.14.0. We have received multiple reports at this point. The leak is located in the SDK code that establishes a link to the event hub partition.

The error logs and the memory leak are related because both originate from, and are amplified by, contention between multiple inputs using the same event hub. If you set up the Azure Logs integration the way many users do (installing one integration, setting the event hub name, and enabling most or all of the data streams), there is probably contention between inputs behind the scenes.

We are planning a new version of the Azure Logs integration to address this issue. It will probably leverage log routing: run one input for the whole Azure Logs integration and use the reroute processor to dispatch each log event to its target data stream.

You can start doing something very similar with the current version. Please take a look at this step-by-step guide that explains how to route logs using the generic event hub integration. With one input that collects the logs and sends them to a data stream that routes them to the destination, we get the following benefits:
We also have an event hub input v2 that uses the latest SDK, ready to be enabled by default as soon as the new Azure Logs integration is ready.
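To illustrate the routing workaround described above, here is a minimal sketch using the Python Elasticsearch client. It creates a custom ingest pipeline with reroute processors that dispatch events collected by the generic event hub integration to dataset-specific data streams. The pipeline name, the condition field (azure.category), and the target datasets are illustrative assumptions, not the integration's documented layout; the step-by-step guide mentioned above is the authoritative reference.

```python
from elasticsearch import Elasticsearch

# Assumes an Elasticsearch endpoint and API key; adjust for your cluster.
es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

# Hypothetical custom pipeline attached to the generic event hub data stream.
# Each reroute processor sends matching events to a dataset-specific data
# stream; the "azure.category" field used in the conditions is an assumption
# about the parsed event layout.
es.ingest.put_pipeline(
    id="logs-azure-eventhub-routing",
    description="Route Azure logs from the generic event hub data stream",
    processors=[
        {
            "reroute": {
                "if": "ctx.azure?.category == 'SignInLogs'",
                "dataset": "azure.signinlogs",
                "namespace": "default",
            }
        },
        {
            "reroute": {
                "if": "ctx.azure?.category == 'AuditLogs'",
                "dataset": "azure.auditlogs",
                "namespace": "default",
            }
        },
    ],
)
```

With a layout like this, a single azure-eventhub input reads from the event hub and Elasticsearch fans the events out to the per-dataset data streams, avoiding the input contention described above.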
@zmoog thanks for the detailed response and workaround! Do you have any estimate of when the new Azure Logs integration will be ready? Also, if I use the workaround you provided, I assume I will lose the ability to pull only certain log types, since the agent will now indiscriminately pull all types of logs, correct? Obviously I can add a drop processor for the log types I do not want to collect, but that adds a bit more load on the agent and Elasticsearch.

Unrelated, but out of curiosity: what is the advantage of breaking the logs out into their own data streams when the data comes from the same source and is fairly similar in format? I understand that under this configuration you can modify mappings, ILM, etc. independently for each input, but this paradigm has caused us some headaches. The best example is the Zeek integration, which breaks the data out into 43 different data streams, which obviously leads to a significant number of indices/shards. Unless the integration is pulling in huge volumes of data, you have to increase the hardware profile of your cluster simply to handle the memory usage caused by all the indices/shards, leading to a large amount of wasted disk and CPU. The more integrations that do this, the worse the problem gets.
We started working on it this week. I expect to have a working prototype in the next few days, and then I should have a better idea of the release ETA.
If the log events don't match the reroute processor condition, they will land in the data stream of the generic event hub integration.
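If you only want a subset of categories, one option (as you suggested) is a drop processor in the same custom pipeline, so unwanted categories never reach any data stream. A minimal sketch with the same assumptions as the routing example above (hypothetical pipeline name, illustrative condition field):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

# Hypothetical pipeline that discards categories we do not want to index.
# The "azure.category" field and the category value are illustrative.
es.ingest.put_pipeline(
    id="logs-azure-eventhub-drop-unwanted",
    description="Drop Azure log categories we do not want to collect",
    processors=[
        {
            "drop": {
                "if": "ctx.azure?.category == 'NonInteractiveUserSignInLogs'",
            }
        }
    ],
)
```

Dropping in the ingest pipeline saves index and storage load, although the agent still has to read the events from the event hub before they are discarded.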
Can you tell me more about your use case? It seems interesting.
A dedicated data stream allows custom mappings, pipelines, ILM, etc. But even if you don't need customizations, using a data stream like
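As a concrete illustration of that first point: Fleet composes an optional `@custom` component template into each integration data stream's index template, which is where per-data-stream mappings or ILM settings can go without touching the managed templates. A minimal sketch with the Python client; the data stream name, ILM policy, and field below are illustrative, not this integration's actual requirements:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="<api-key>")

# Fleet-managed index templates include an optional "@custom" component
# template for user customizations. The data stream name is illustrative.
es.cluster.put_component_template(
    name="logs-azure.signinlogs@custom",
    template={
        "settings": {
            # Example: point this data stream at a custom ILM policy.
            "index.lifecycle.name": "my-azure-logs-policy",
        },
        "mappings": {
            "properties": {
                # Example: add a custom field mapping.
                "mycompany.review_status": {"type": "keyword"},
            }
        },
    },
)
```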
Wow, I never noticed that Zeek had so many data streams. Right now, the Azure Logs landscape is made of two kinds of integrations:
The general idea is to offer a choice between out-of-the-box functionality and extension points to cover edge cases. A new event hub input package that's more configurable than the generic one is coming soon. The input packages are a thinner wrapper around the azure-eventhub input, allowing more flexibility (for example, the generic event hub integration only works with the
I was directed here by a support request (#01756777) I logged for the Azure integration. We are also sending these logs to a single event hub (M365 Defender logs are sent to a different event hub), and I have matched the number of partitions to the number of agents running in AWS ECS (3). The only difference is that I have set up separate integrations for each log type.
Do you have Elastic Agent diagnostics to share? Could you collect one and send it to me at [email protected]?
Integration Name
Azure Logs [azure]
Dataset Name
No response
Integration Version
1.14.0
Agent Version
8.15.1
Agent Output Type
elasticsearch
Elasticsearch Version
8.15.1
OS Version and Architecture
Ubuntu 22.04
Software/API Version
No response
Error Message
Event Original
No response
What did you do?
We run Elastic Agents on an AKS Kubernetes cluster for customers. They have a PVC for persistent storage of the /usr/share/elastic-agent/state directory.
What did you see?
When enabling the Azure Logs integration, we see memory usage continually rising until the pod gets evicted from a node. It then gets restarted and the cycle begins again.
Meanwhile, the metrics fill up with millions of errors per day:
When checking the storage account we constantly see files being leased and then released.
What did you expect to see?
Consistent memory usage and no evictions.
Anything else?
No response