
[Azure Logs]: Integration eats up memory and dies #11056

Open
dmaasland opened this issue Sep 10, 2024 · 7 comments
Labels: Integration:azure, needs:triage, Team:obs-ds-hosted-services


Integration Name

Azure Logs [azure]

Dataset Name

No response

Integration Version

1.14.0

Agent Version

8.15.1

Agent Output Type

elasticsearch

Elasticsearch Version

8.15.1

OS Version and Architecture

Ubuntu 22.04

Software/API Version

No response

Error Message

{
  "_index": ".ds-logs-elastic_agent.filebeat-cyberdefensegroup-2024.09.04-000002",
  "_id": "6awE25EBDjTH87fVkJGw",
  "_version": 1,
  "_score": 0,
  "_source": {
    "agent": {
      "name": "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v",
      "id": "21bd0329-66af-4832-ba17-c4dbb7e0f746",
      "ephemeral_id": "aed19de4-5fbe-4b08-85b5-ba4ed0804509",
      "type": "filebeat",
      "version": "8.15.1"
    },
    "service.name": "filebeat",
    "log": {
      "file": {
        "inode": "73",
        "path": "/usr/share/elastic-agent/state/data/logs/elastic-agent-20240910-3013.ndjson",
        "device_id": "66327"
      },
      "offset": 3961311,
      "source": "azure-eventhub-default"
    },
    "elastic_agent": {
      "id": "21bd0329-66af-4832-ba17-c4dbb7e0f746",
      "version": "8.15.1",
      "snapshot": false
    },
    "message": "lease was not found",
    "cloud": {
      "instance": {
        "name": "aks-userpool-84077292-vmss_1",
        "id": "ff0bc899-550f-4481-b899-62809c6a8c02"
      },
      "provider": "azure",
      "machine": {
        "type": "Standard_D4s_v3"
      },
      "service": {
        "name": "Virtual Machines"
      },
      "region": "westeurope",
      "account": {
        "id": "addbd166-b2b1-471f-bda3-fb69e1a2ea39"
      }
    },
    "input": {
      "type": "filestream"
    },
    "log.origin": {
      "file.line": 106,
      "function": "github.com/elastic/beats/v7/x-pack/filebeat/input/azureeventhub.logpLogger.Error",
      "file.name": "azureeventhub/tracer.go"
    },
    "component": {
      "binary": "filebeat",
      "id": "azure-eventhub-default",
      "type": "azure-eventhub",
      "dataset": "elastic_agent.filebeat"
    },
    "@timestamp": "2024-09-10T08:19:53.663Z",
    "ecs": {
      "version": "8.0.0"
    },
    "data_stream": {
      "namespace": "cyberdefensegroup",
      "type": "logs",
      "dataset": "elastic_agent.filebeat"
    },
    "host": {
      "hostname": "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v",
      "os": {
        "kernel": "5.15.0-1071-azure",
        "codename": "focal",
        "name": "Ubuntu",
        "family": "debian",
        "type": "linux",
        "version": "20.04.6 LTS (Focal Fossa)",
        "platform": "ubuntu"
      },
      "containerized": false,
      "name": "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v",
      "architecture": "x86_64"
    },
    "log.level": "error",
    "event": {
      "agent_id_status": "verified",
      "ingested": "2024-09-10T08:19:55Z",
      "dataset": "elastic_agent.filebeat"
    }
  },
  "fields": {
    "elastic_agent.version": [
      "8.15.1"
    ],
    "component.binary": [
      "filebeat"
    ],
    "host.os.name.text": [
      "Ubuntu"
    ],
    "host.hostname": [
      "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v"
    ],
    "component.id": [
      "azure-eventhub-default"
    ],
    "host.os.version": [
      "20.04.6 LTS (Focal Fossa)"
    ],
    "host.os.name": [
      "Ubuntu"
    ],
    "log.level": [
      "error"
    ],
    "agent.name": [
      "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v"
    ],
    "host.name": [
      "elastic-agent-cyberdefensegroup-6bcb6b59b8-4j82v"
    ],
    "event.agent_id_status": [
      "verified"
    ],
    "cloud.region": [
      "westeurope"
    ],
    "host.os.type": [
      "linux"
    ],
    "log.source": [
      "azure-eventhub-default"
    ],
    "input.type": [
      "filestream"
    ],
    "log.offset": [
      3961311
    ],
    "data_stream.type": [
      "logs"
    ],
    "host.architecture": [
      "x86_64"
    ],
    "cloud.machine.type": [
      "Standard_D4s_v3"
    ],
    "cloud.provider": [
      "azure"
    ],
    "log.origin.function": [
      "github.com/elastic/beats/v7/x-pack/filebeat/input/azureeventhub.logpLogger.Error"
    ],
    "agent.id": [
      "21bd0329-66af-4832-ba17-c4dbb7e0f746"
    ],
    "cloud.service.name": [
      "Virtual Machines"
    ],
    "ecs.version": [
      "8.0.0"
    ],
    "host.containerized": [
      false
    ],
    "agent.version": [
      "8.15.1"
    ],
    "host.os.family": [
      "debian"
    ],
    "cloud.instance.id": [
      "ff0bc899-550f-4481-b899-62809c6a8c02"
    ],
    "agent.type": [
      "filebeat"
    ],
    "host.os.kernel": [
      "5.15.0-1071-azure"
    ],
    "component.dataset": [
      "elastic_agent.filebeat"
    ],
    "log.file.device_id": [
      "66327"
    ],
    "elastic_agent.snapshot": [
      false
    ],
    "log.origin.file.line": [
      106
    ],
    "service.name": [
      "filebeat"
    ],
    "elastic_agent.id": [
      "21bd0329-66af-4832-ba17-c4dbb7e0f746"
    ],
    "data_stream.namespace": [
      "cyberdefensegroup"
    ],
    "host.os.codename": [
      "focal"
    ],
    "message": [
      "lease was not found"
    ],
    "component.type": [
      "azure-eventhub"
    ],
    "event.ingested": [
      "2024-09-10T08:19:55.000Z"
    ],
    "@timestamp": [
      "2024-09-10T08:19:53.663Z"
    ],
    "log.origin.file.name": [
      "azureeventhub/tracer.go"
    ],
    "cloud.account.id": [
      "addbd166-b2b1-471f-bda3-fb69e1a2ea39"
    ],
    "host.os.platform": [
      "ubuntu"
    ],
    "log.file.inode": [
      "73"
    ],
    "data_stream.dataset": [
      "elastic_agent.filebeat"
    ],
    "log.file.path": [
      "/usr/share/elastic-agent/state/data/logs/elastic-agent-20240910-3013.ndjson"
    ],
    "agent.ephemeral_id": [
      "aed19de4-5fbe-4b08-85b5-ba4ed0804509"
    ],
    "event.dataset": [
      "elastic_agent.filebeat"
    ],
    "cloud.instance.name": [
      "aks-userpool-84077292-vmss_1"
    ]
  }
}

Event Original

No response

What did you do?

We run Elastic Agents on an AKS Kubernetes cluster for customers. The agents have a PVC for persistent storage of the /usr/share/elastic-agent/state directory.

What did you see?

When enabling the Azure Logs integration, we see memory usage continually rising until the pod gets evicted from a node. It then gets restarted and the cycle begins again.

[screenshots: memory usage rising until the pod is evicted]

Meanwhile, the metrics fill up with millions of errors per day:
[screenshot: error metrics]

When checking the storage account we constantly see files being leased and then released.

What did you expect to see?

Consistent memory usage and no evictions.

Anything else?

No response

andrewkroh added the Integration:azure and Team:obs-ds-hosted-services labels on Sep 10, 2024
@peterydzynski (Contributor) commented Sep 13, 2024

We are experiencing the same issue while running the Azure integration. Our setup is similar to @dmaasland's; however, we are running the agent on AWS ECS instead of Kubernetes.

Interestingly, we have been running the Azure integration for months with no issue. We started seeing this error when we upgraded our agents from 8.15.0 to 8.15.1. Happy to provide more context if needed.

[EDIT] Adding a screenshot showing when we upgraded our agents to 8.15.1 (Sep. 10 at 16:12), corresponding with the error logs blowing up.
[screenshot: error log volume after the 8.15.1 upgrade]

zmoog self-assigned this on Sep 24, 2024
@zmoog (Contributor) commented Sep 24, 2024

Hey @dmaasland and @peterydzynski, thank you for reporting this issue!

There are probably two topics at play here. They are related but distinct.

  • the error logs
  • the memory leak

Let's start with the error logs:

We started seeing this error when we upgraded our agents from 8.15.0 to 8.15.1

The 8.15.1 release added a new tracer that surfaces internal error logs from the event hub SDK. These errors were already occurring before; they were just invisible. In our experience, the errors usually come from contention between the inputs accessing the event hub. More on this later.

The memory leak probably comes from an update to the event hub libraries that we shipped in 8.14.0. We have received multiple reports at this point. The leak is located in the SDK code that establishes a link to the event hub partition.

The error logs and the memory leak are related because they both originate from, and are amplified by, the contention between multiple inputs using the same event hub.

If you set up the Azure Logs integration as many users do (installing one integration, setting the event hub name, and enabling most or all of the data streams):

[screenshot: Azure Logs integration settings with most data streams enabled]

there is probably contention between the inputs happening behind the scenes.

We are planning a new Azure Logs integration to address this issue. It will probably leverage log routing: run one input for the whole Azure Logs integration, and use the reroute processor to dispatch each log event to the target data stream.

You can already do something very similar with the current version.

Please take a look at this step-by-step guide that explains how to route logs using the generic event hub integration:

zmoog/public-notes#92
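For illustration only, here is a rough sketch of the kind of reroute processors the guide sets up in a custom ingest pipeline. The azure.category field, the dataset names, and the @custom pipeline name are placeholders that you would adapt to your own data; the guide above is the authoritative reference:

PUT _ingest/pipeline/logs-azure.eventhub@custom
{
  "processors": [
    {
      "reroute": {
        "tag": "route-signin-logs",
        "if": "ctx.azure?.category == 'SignInLogs'",
        "dataset": "azure.signinlogs",
        "namespace": "default"
      }
    },
    {
      "reroute": {
        "tag": "route-audit-logs",
        "if": "ctx.azure?.category == 'AuditLogs'",
        "dataset": "azure.auditlogs",
        "namespace": "default"
      }
    }
  ]
}

Events that don't match any reroute condition simply stay in the data stream of the generic event hub integration.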

With one input that collects the logs and sends them to a data stream that routes them to the destination, we get the following benefits:

  • No partition contention: only one input per agent and event hub. Multiple inputs running on different agents can still collaborate for availability and performance.
  • Without contention, linking to the event hub partitions happens much less frequently, so the impact of the memory leak is lower.

We also have an event hub input v2 that uses the latest SDK, ready to be enabled by default as soon as the new Azure Logs integration is ready.

@peterydzynski (Contributor) commented Sep 24, 2024

@zmoog thanks for the detailed response and workaround! Do you have any estimate on when the new Azure Logs integration will be ready? Also, if I use the workaround that you provided, I assume I will lose the ability to only pull certain log types, as the agent will now indiscriminately pull all types of logs, correct? Obviously I can add a drop processor for the log types I do not want to collect, but this will be a bit more load on the agent and Elasticsearch.
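To make that concrete, what I had in mind was a conditional drop in a custom pipeline, roughly like the sketch below; azure.category is just a placeholder for whichever field actually carries the log type in our events:

{
  "drop": {
    "if": "ctx.azure?.category == 'GraphActivityLogs'"
  }
}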

Unrelated, but out of curiosity: what is the advantage of breaking the logs out into their own data streams when the data is coming from the same source and is fairly similar in its format? I understand that under this configuration you can modify mappings, ILM, etc. independently for each input, but this paradigm has caused us some headaches.

The best example is the Zeek integration, which breaks the data out into 43 different data streams, which obviously leads to a significant number of indices/shards. Unless the integration is pulling in huge volumes of data, you will have to increase the hardware profile of your cluster simply to handle the memory usage caused by all the indices/shards, leading to a large amount of wasted disk and CPU. The more integrations that do this, the worse this problem gets.

@zmoog (Contributor) commented Sep 25, 2024

@zmoog thanks for the detailed response and workaround! Do you have any estimate on when the new Azure Logs integration will be ready?

We started working on it this week. I expect to have a working prototype in the next few days, and then I can probably give a better idea of the release ETA.

Also, if I use the workaround that you provided, I assume I will lose the ability to only pull certain log types, as the agent will now indiscriminately pull all types of logs, correct?

If the log events don't match the reroute processor condition, they will land in the logs-azure.eventhub-default data stream. Please tell me more about what you have in mind.

Obviously I can add a drop processor for the log types I do not want to collect, but this will be a bit more load on the agent and Elasticsearch.

Can you tell me more about your use case? It seems interesting.

Unrelated, but out of curiosity: what is the advantage of breaking the logs out into their own data streams when the data is coming from the same source and is fairly similar in its format? I understand that under this configuration you can modify mappings, ILM, etc. independently for each input, but this paradigm has caused us some headaches.

A dedicated data stream allows custom mappings, pipelines, ILM, etc. But even if you don't need customizations, using a data stream like logs-<LOG_TYPE>-default helps keep mappings tidy, reducing potential mapping conflicts, and it stays open for customization tomorrow with a dedicated index template (if needed!).
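As a purely hypothetical sketch (the names logs-mylogtype and my_custom_field are made up), a dedicated index template for such a data stream can be as small as:

PUT _index_template/logs-mylogtype
{
  "index_patterns": ["logs-mylogtype-*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "mappings": {
      "properties": {
        "my_custom_field": { "type": "keyword" }
      }
    }
  }
}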

The best example is the Zeek integration, which breaks the data out into 43 different data streams, which obviously leads to a significant number of indices/shards. Unless the integration is pulling in huge volumes of data, you will have to increase the hardware profile of your cluster simply to handle the memory usage caused by all the indices/shards, leading to a large amount of wasted disk and CPU. The more integrations that do this, the worse this problem gets.

Wow, I never noticed that Zeek had so many data streams.

Right now, the Azure Logs landscape is made of two kinds of integrations:

  • many specialized integrations, like sign-in or firewall logs, each with a dedicated data stream (mappings, pipelines, etc.) and a dashboard;
  • one generic integration that users can customize

The general idea is to offer a choice between out-of-the-box functionality and extension points to cover edge cases.

A new event hub input package that's more configurable than the generic one is coming soon. The input packages are a thinner wrapper around the azure-eventhub input, allowing more flexibility (for example, the generic event hub integration only works with the logs-azure.eventhub index template).

@peterydzynski (Contributor)

I was able to implement your workaround and observe the types of logs pulled. I am not very familiar with the Azure Event Hub side, and I think my question stems from a misunderstanding of the setup on my part. For our use case, we only want to collect auditlogs, identity_protection, provisioning, and signinlogs, and not graphactivitylogs, firewall_logs, application_gateway, or springcloudlogs. I was thinking that with your workaround, the agent would collect all logs regardless of type and send them to Elasticsearch, which would be a significant amount of additional load for logs we would ultimately be dropping. While I think this is still true in principle, it appears we have the event hub configured so that only the log types we want are available, making this a non-issue.

Here is a bit of background on our use cases generally and why we are so sensitive to the number of indices/shards an integration requires. We run a development cluster that has very low throughput for each integration and mainly receives test data. Obviously we want to keep this cluster as cheap as possible, but because the integrations we are testing all split into multiple data streams, we have hundreds of very small indices. This has forced us to scale our cluster up just to handle the memory requirements resulting from the number of shards on each node. We are currently using only 13% of the available disk space just to support the number of indices we have.
[screenshot: cluster disk usage at 13%]
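In case it's useful, a quick way to see what I mean is listing the log indices sorted by size, for example:

GET _cat/indices/logs-*?v&h=index,pri,rep,docs.count,store.size&s=store.size:asc

which on our development cluster returns hundreds of very small indices.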

Anyway, thanks for the background on the data streams; that all makes sense. It sounds like the new setup for event hub inputs will be much more flexible and may provide some solutions for the issues I am referring to. Excited for that to come out!

@keiransteele-phocas

I was directed here by a support request (#01756777) that I logged for the Azure signinlogs input, which was failing while the other inputs for auditlogs, identity_protection, and provisioning were not.

We are also sending these logs to a single event hub (M365 Defender logs are sent to a different event hub). I have matched the number of partitions to the number of agents running in AWS ECS (3); the only difference is that I have set up separate integrations for each log type.

@zmoog (Contributor) commented Nov 12, 2024

I was directed here by a support request (#01756777) that I logged for the Azure signinlogs input, which was failing while the other inputs for auditlogs, identity_protection, and provisioning were not.

We are also sending these logs to a single event hub (M365 Defender logs are sent to a different event hub). I have matched the number of partitions to the number of agents running in AWS ECS (3); the only difference is that I have set up separate integrations for each log type.

Do you have Elastic Agent diagnostics to share? Could you collect one and send it to me at [email protected]?
