
[Azure Logs]: Integration eats up memory and dies #11056

Open
dmaasland opened this issue Sep 10, 2024 · 7 comments
Labels: Integration:azure, needs:triage, Team:obs-ds-hosted-services


Integration Name

Azure Logs [azure]

Dataset Name

No response

Integration Version

1.14.0

Agent Version

8.15.1

Agent Output Type

elasticsearch

Elasticsearch Version

8.15.1

OS Version and Architecture

Ubuntu 22.04

Software/API Version

No response

Error Message

{
  "_index": ".ds-logs-elastic_agent.filebeat-cyberdefensegroup-2024.09.04-000002",
  "_id": "6awE25EBDjTH87fVkJGw",
  "_version": 1,
  "_score": 0,
  "_source": {
    "agent": {
      "name": "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v",
      "id": "21bd0329-66af-4832-ba17-c4dbb7e0f746",
      "ephemeral_id": "aed19de4-5fbe-4b08-85b5-ba4ed0804509",
      "type": "filebeat",
      "version": "8.15.1"
    },
    "service.name": "filebeat",
    "log": {
      "file": {
        "inode": "73",
        "path": "/usr/share/elastic-agent/state/data/logs/elastic-agent-20240910-3013.ndjson",
        "device_id": "66327"
      },
      "offset": 3961311,
      "source": "azure-eventhub-default"
    },
    "elastic_agent": {
      "id": "21bd0329-66af-4832-ba17-c4dbb7e0f746",
      "version": "8.15.1",
      "snapshot": false
    },
    "message": "lease was not found",
    "cloud": {
      "instance": {
        "name": "aks-userpool-84077292-vmss_1",
        "id": "ff0bc899-550f-4481-b899-62809c6a8c02"
      },
      "provider": "azure",
      "machine": {
        "type": "Standard_D4s_v3"
      },
      "service": {
        "name": "Virtual Machines"
      },
      "region": "westeurope",
      "account": {
        "id": "addbd166-b2b1-471f-bda3-fb69e1a2ea39"
      }
    },
    "input": {
      "type": "filestream"
    },
    "log.origin": {
      "file.line": 106,
      "function": "github.com/elastic/beats/v7/x-pack/filebeat/input/azureeventhub.logpLogger.Error",
      "file.name": "azureeventhub/tracer.go"
    },
    "component": {
      "binary": "filebeat",
      "id": "azure-eventhub-default",
      "type": "azure-eventhub",
      "dataset": "elastic_agent.filebeat"
    },
    "@timestamp": "2024-09-10T08:19:53.663Z",
    "ecs": {
      "version": "8.0.0"
    },
    "data_stream": {
      "namespace": "cyberdefensegroup",
      "type": "logs",
      "dataset": "elastic_agent.filebeat"
    },
    "host": {
      "hostname": "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v",
      "os": {
        "kernel": "5.15.0-1071-azure",
        "codename": "focal",
        "name": "Ubuntu",
        "family": "debian",
        "type": "linux",
        "version": "20.04.6 LTS (Focal Fossa)",
        "platform": "ubuntu"
      },
      "containerized": false,
      "name": "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v",
      "architecture": "x86_64"
    },
    "log.level": "error",
    "event": {
      "agent_id_status": "verified",
      "ingested": "2024-09-10T08:19:55Z",
      "dataset": "elastic_agent.filebeat"
    }
  },
  "fields": {
    "elastic_agent.version": [
      "8.15.1"
    ],
    "component.binary": [
      "filebeat"
    ],
    "host.os.name.text": [
      "Ubuntu"
    ],
    "host.hostname": [
      "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v"
    ],
    "component.id": [
      "azure-eventhub-default"
    ],
    "host.os.version": [
      "20.04.6 LTS (Focal Fossa)"
    ],
    "host.os.name": [
      "Ubuntu"
    ],
    "log.level": [
      "error"
    ],
    "agent.name": [
      "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v"
    ],
    "host.name": [
      "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v"
    ],
    "event.agent_id_status": [
      "verified"
    ],
    "cloud.region": [
      "westeurope"
    ],
    "host.os.type": [
      "linux"
    ],
    "log.source": [
      "azure-eventhub-default"
    ],
    "input.type": [
      "filestream"
    ],
    "log.offset": [
      3961311
    ],
    "data_stream.type": [
      "logs"
    ],
    "host.architecture": [
      "x86_64"
    ],
    "cloud.machine.type": [
      "Standard_D4s_v3"
    ],
    "cloud.provider": [
      "azure"
    ],
    "log.origin.function": [
      "github.com/elastic/beats/v7/x-pack/filebeat/input/azureeventhub.logpLogger.Error"
    ],
    "agent.id": [
      "21bd0329-66af-4832-ba17-c4dbb7e0f746"
    ],
    "cloud.service.name": [
      "Virtual Machines"
    ],
    "ecs.version": [
      "8.0.0"
    ],
    "host.containerized": [
      false
    ],
    "agent.version": [
      "8.15.1"
    ],
    "host.os.family": [
      "debian"
    ],
    "cloud.instance.id": [
      "ff0bc899-550f-4481-b899-62809c6a8c02"
    ],
    "agent.type": [
      "filebeat"
    ],
    "host.os.kernel": [
      "5.15.0-1071-azure"
    ],
    "component.dataset": [
      "elastic_agent.filebeat"
    ],
    "log.file.device_id": [
      "66327"
    ],
    "elastic_agent.snapshot": [
      false
    ],
    "log.origin.file.line": [
      106
    ],
    "service.name": [
      "filebeat"
    ],
    "elastic_agent.id": [
      "21bd0329-66af-4832-ba17-c4dbb7e0f746"
    ],
    "data_stream.namespace": [
      "cyberdefensegroup"
    ],
    "host.os.codename": [
      "focal"
    ],
    "message": [
      "lease was not found"
    ],
    "component.type": [
      "azure-eventhub"
    ],
    "event.ingested": [
      "2024-09-10T08:19:55.000Z"
    ],
    "@timestamp": [
      "2024-09-10T08:19:53.663Z"
    ],
    "log.origin.file.name": [
      "azureeventhub/tracer.go"
    ],
    "cloud.account.id": [
      "addbd166-b2b1-471f-bda3-fb69e1a2ea39"
    ],
    "host.os.platform": [
      "ubuntu"
    ],
    "log.file.inode": [
      "73"
    ],
    "data_stream.dataset": [
      "elastic_agent.filebeat"
    ],
    "log.file.path": [
      "/usr/share/elastic-agent/state/data/logs/elastic-agent-20240910-3013.ndjson"
    ],
    "agent.ephemeral_id": [
      "aed19de4-5fbe-4b08-85b5-ba4ed0804509"
    ],
    "event.dataset": [
      "elastic_agent.filebeat"
    ],
    "cloud.instance.name": [
      "aks-userpool-84077292-vmss_1"
    ]
  }
}

Event Original

No response

What did you do?

We run Elastic Agents on an AKS Kubernetes cluster for customers. The agents have a PVC for persistent storage of the /usr/share/elastic-agent/state directory.

What did you see?

When enabling the Azure Logs integration, we see memory usage continually rising until the pod gets evicted from a node. It then gets restarted and the cycle begins again.

[screenshots: memory usage rising until the pod is evicted]

Meanwhile, the metrics fill up with millions of errors per day:
[screenshot: error metrics]

When checking the storage account we constantly see files being leased and then released.

What did you expect to see?

Consistent memory usage and no evictions.

Anything else?

No response

andrewkroh added the Integration:azure and Team:obs-ds-hosted-services labels on Sep 10, 2024
@peterydzynski (Contributor) commented Sep 13, 2024

We are experiencing the same issue while running the Azure integration. Our setup is similar to @dmaasland's; however, we are running the agent on AWS ECS instead of Kubernetes.

Interestingly, we have been running the Azure integration for months with no issue. We started seeing this error when we upgraded our agents from 8.15.0 to 8.15.1. Happy to provide more context if needed.

[EDIT] Adding a screenshot showing when we upgraded our agents to 8.15.1 (Sep. 10 at 16:12), corresponding with the error logs blowing up.
[screenshot: error log volume after the 8.15.1 upgrade]

zmoog self-assigned this on Sep 24, 2024
@zmoog (Contributor) commented Sep 24, 2024

Hey @dmaasland and @peterydzynski, thank you for reporting this issue!

There are probably two topics at play here. They are related but distinct.

  • the error logs
  • the memory leak

Let's start with the error logs:

We started seeing this error when we upgraded our agents from 8.15.0 to 8.15.1

The 8.15.1 release added a new tracer that surfaces internal error logs from the event hub SDK. These errors were already occurring before; they were just invisible. In our experience, the errors usually come from contention between the inputs accessing the event hub. More on this later.

The memory leak probably comes from an update to the event hub libraries that we shipped in 8.14.0. We have received multiple reports at this point. The leak is located in the SDK code that establishes a link to the event hub partition.

The error logs and the memory leak are related because they both originate from, and are amplified by, the contention between multiple inputs using the same event hub.

If you set up the Azure Logs integration as many users do (installing one integration, setting the event hub name, and enabling most or all of the data streams):

[screenshot: Azure Logs integration settings with most data streams enabled]

there is probably contention between the inputs happening behind the scenes.

We are planning a new Azure Logs integration to address this issue. It will probably leverage log routing: run one input for the whole Azure Logs integration, and use the reroute processor to dispatch each log event to the target data stream.

You can already do something very similar with the current version.

Please take a look at this step-by-step guide that explains how to route logs using the generic event hub integration:

zmoog/public-notes#92
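For illustration only, here is a rough sketch of the kind of reroute processors the guide sets up in a custom ingest pipeline. The azure.category field, the dataset names, and the @custom pipeline name are placeholders that you would adapt to your own data; the guide above is the authoritative reference:

PUT _ingest/pipeline/logs-azure.eventhub@custom
{
  "processors": [
    {
      "reroute": {
        "tag": "route-signin-logs",
        "if": "ctx.azure?.category == 'SignInLogs'",
        "dataset": "azure.signinlogs",
        "namespace": "default"
      }
    },
    {
      "reroute": {
        "tag": "route-audit-logs",
        "if": "ctx.azure?.category == 'AuditLogs'",
        "dataset": "azure.auditlogs",
        "namespace": "default"
      }
    }
  ]
}

Events that don't match any reroute condition simply stay in the data stream of the generic event hub integration.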

With one input that collects the logs and sends them to a data stream that routes them to the destination, we get the following benefits:

  • No partition contention: only one input per agent and event hub. Multiple inputs running on different agents can still collaborate for availability and performance.
  • Without contention, linking to the event hub partitions happens much less frequently, so the impact of the memory leak is lower.

We also have an event hub input v2 that uses the latest SDK, ready to be enabled by default as soon as the new Azure Logs integration is ready.

@peterydzynski (Contributor) commented Sep 24, 2024

@zmoog thanks for the detailed response and workaround! Do you have any estimate on when the new Azure Logs integration will be ready? Also, if I use the workaround that you provided, I assume I will lose the ability to only pull certain log types, as the agent will now indiscriminately pull all types of logs, correct? Obviously I can add a drop processor for the log types I do not want to collect, but this will be a bit more load on the agent and Elasticsearch.
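To make that concrete, what I had in mind was a conditional drop in a custom pipeline, roughly like the sketch below; azure.category is just a placeholder for whichever field actually carries the log type in our events:

{
  "drop": {
    "if": "ctx.azure?.category == 'GraphActivityLogs'"
  }
}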

Unrelated, but out of curiosity: what is the advantage of breaking the logs out into their own data streams when the data is coming from the same source and is fairly similar in its format? I understand that under this configuration you can modify mappings, ILM, etc. independently for each input, but this paradigm has caused us some headaches.

The best example is the Zeek integration, which breaks the data out into 43 different data streams, which obviously leads to a significant number of indices/shards. Unless the integration is pulling in huge volumes of data, you will have to increase the hardware profile of your cluster simply to handle the memory usage caused by all the indices/shards, leading to a large amount of wasted disk and CPU. The more integrations that do this, the worse this problem gets.

@zmoog (Contributor) commented Sep 25, 2024

@zmoog thanks for the detailed response and workaround! Do you have any estimate on when the new Azure Logs integration will be ready?

We started working on it this week. I expect to have a working prototype in the next few days, and then I can probably give a better idea of the release ETA.

Also, if I use the workaround that you provided, I assume I will lose the ability to only pull certain log types, as the agent will now indiscriminately pull all types of logs, correct?

If the log events don't match the reroute processor condition, they will land in the logs-azure.eventhub-default data stream. Please tell me more about what you have in mind.

Obviously I can add a drop processor for the log types I do not want to collect, but this will be a bit more load on the agent and Elasticsearch.

Can you tell me more about your use case? It seems interesting.

Unrelated, but out of curiosity: what is the advantage of breaking the logs out into their own data streams when the data is coming from the same source and is fairly similar in its format? I understand that under this configuration you can modify mappings, ILM, etc. independently for each input, but this paradigm has caused us some headaches.

A dedicated data stream allows custom mappings, pipelines, ILM, etc. But even if you don't need customizations, using a data stream like logs-<LOG_TYPE>-default helps keep mappings tidy, reducing potential mapping conflicts, and it stays open for customization tomorrow with a dedicated index template (if needed!).
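As a purely hypothetical sketch (the names logs-mylogtype and my_custom_field are made up), a dedicated index template for such a data stream can be as small as:

PUT _index_template/logs-mylogtype
{
  "index_patterns": ["logs-mylogtype-*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "mappings": {
      "properties": {
        "my_custom_field": { "type": "keyword" }
      }
    }
  }
}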

The best example is the Zeek integration, which breaks the data out into 43 different data streams, which obviously leads to a significant number of indices/shards. Unless the integration is pulling in huge volumes of data, you will have to increase the hardware profile of your cluster simply to handle the memory usage caused by all the indices/shards, leading to a large amount of wasted disk and CPU. The more integrations that do this, the worse this problem gets.

Wow, I never noticed that Zeek had so many data streams.

Right now, the Azure Logs landscape is made of two kinds of integrations:

  • many specialized integrations, like sign-in or firewall logs, each with a dedicated data stream (mappings, pipelines, etc.) and a dashboard;
  • one generic integration that users can customize

The general idea is to offer a choice between out-of-the-box functionality and extension points to cover edge cases.

A new event hub input package that's more configurable than the generic one is coming soon. The input packages are a thinner wrapper around the azure-eventhub input, allowing more flexibility (for example, the generic event hub integration only works with the logs-azure.eventhub index template).

@peterydzynski (Contributor)

I was able to implement your workaround and observe the types of logs pulled. I am not very familiar with the Azure Event Hub side, and I think my question stems from a misunderstanding of the setup on my part. For our use case, we only want to collect auditlogs, identity_protection, provisioning, and signinlogs, and not graphactivitylogs, firewall_logs, application_gateway, or springcloudlogs. I was thinking that with your workaround, the agent would collect all logs regardless of type and send them to Elasticsearch, which would be a significant amount of additional load for logs we would ultimately be dropping. While I think this is still true in principle, it appears we have the event hub configured so that only the log types we want are available, making this a non-issue.

Here is a bit of background on our use cases generally and why we are so sensitive to the number of indices/shards an integration requires. We run a development cluster that has very low throughput for each integration and mainly receives test data. Obviously we want to keep this cluster as cheap as possible, but because the integrations we are testing all split into multiple data streams, we have hundreds of very small indices. This has forced us to scale our cluster up just to handle the memory requirements resulting from the number of shards on each node. We are currently using only 13% of the available disk space just to support the number of indices we have.
[screenshot: cluster disk usage at 13%]
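In case it's useful, a quick way to see what I mean is listing the log indices sorted by size, for example:

GET _cat/indices/logs-*?v&h=index,pri,rep,docs.count,store.size&s=store.size:asc

which on our development cluster returns hundreds of very small indices.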

Anyway, thanks for the background on the data streams; that all makes sense. It sounds like the new setup for event hub inputs will be much more flexible and may provide some solutions for the issues I am referring to. Excited for that to come out!

@keiransteele-phocas

I was directed here by a support request (#01756777) that I logged for the Azure signinlogs input, which was failing while the other inputs for auditlogs, identity_protection, and provisioning were not.

We are also sending these logs to a single event hub (M365 Defender logs are sent to a different event hub). I have matched the number of partitions to the number of agents running in AWS ECS (3); the only difference is that I have set up separate integrations for each log type.

@zmoog (Contributor) commented Nov 12, 2024

I was directed here by a support request (#01756777) that I logged for the Azure signinlogs input, which was failing while the other inputs for auditlogs, identity_protection, and provisioning were not.

We are also sending these logs to a single event hub (M365 Defender logs are sent to a different event hub). I have matched the number of partitions to the number of agents running in AWS ECS (3); the only difference is that I have set up separate integrations for each log type.

Do you have Elastic Agent diagnostics to share? Could you collect one and send it to me at [email protected]?
