High CPU usage for Alloy on Windows Server #2344

Open
sudden1974 opened this issue Jan 7, 2025 · 2 comments

@sudden1974

sudden1974 commented Jan 7, 2025

Issue

On some of our Windows servers, the Grafana Alloy agent consumes a lot of CPU (20-30%). The issue seems to occur on servers where many users are logged in over RDP, or where the server is a Citrix app server.

Is this a known issue, and are there any configuration options that would reduce it?

There is no advanced configuration; only the standard os, cpu, net, system, memory and textfile collectors are used. We have tried removing some of the collectors, but saw no major change in resource consumption.

System information
Windows Server 2022 Standard

Software version
Grafana Alloy v1.4.3

Config

prometheus.exporter.windows "userpromfiles" {
  enabled_collectors = ["textfile"]
}

prometheus.scrape "userpromfiles" {
  scrape_interval = "60s"
  targets    = prometheus.exporter.windows.userpromfiles.targets
  forward_to = [prometheus.relabel.userfiles.receiver]
}

// This section sets the correct "instance" in lowercase (for windows server, the hostname)
prometheus.relabel "userfiles" {
  forward_to = [prometheus.remote_write.testtenant.receiver]
  rule {
    source_labels = ["instance"]
    regex = "(.*?)($|:.*)"
    action = "replace"
    replacement = "$1"
    target_label  = "instance"
  }
  rule {
    source_labels = (["instance"])
    action = "lowercase"
    target_label = "instance"
  } 
}

prometheus.exporter.windows "winserver" {
  enabled_collectors = ["cpu","cs","logical_disk","net","os","system","memory"]
}

prometheus.scrape "example" {
  scrape_interval = "60s"
  targets    = prometheus.exporter.windows.winserver.targets
  forward_to = [prometheus.relabel.dynamics.receiver]
}

// This section sets the correct "instance" in lowercase (for windows server, the hostname)
prometheus.relabel "dynamics" {
  forward_to = [prometheus.remote_write.testtenant.receiver]
  rule {
    source_labels = ["instance"]
    regex = "(.*?)($|:.*)"
    action = "replace"
    replacement = "$1"
    target_label  = "instance"
  }
  rule {
    source_labels = (["instance"])
    action = "lowercase"
    target_label = "instance"
  } 
}

local.file "env" {
  filename = "C:\\Program Files\\GrafanaLabs\\Alloy\\env.facts"
  poll_frequency = "60m"
}

prometheus.remote_write "testtenant" {
  endpoint {
    url = "https://xxx/api/v1/push"
    headers =  {
      "tenantid" = "testtenant",
    }
    tls_config {
       insecure_skip_verify = true
    }
  }
  external_labels = {
    "company"= to_lower(json_path(local.file.env.content,"Company")[0]),
    "envgroup"= to_lower(json_path(local.file.env.content,"envgroup")[0]),
    "env"= to_lower(json_path(local.file.env.content,"env")[0]),
    "system"= to_lower(json_path(local.file.env.content,"system")[0]),
    "dmz"= to_lower(json_path(local.file.env.content,"dmz")[0]),
    "domain"= to_lower(json_path(local.file.env.content,"Domain")[0]),
  }
}
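
To make the effect of the two relabel rules above concrete, here is a minimal Python sketch of what they appear to do to an `instance` value. It assumes Prometheus-style relabeling, where the regex must match the whole label value; the function name `normalize_instance` and the sample host names are hypothetical and are not taken from the config.

```python
import re


def normalize_instance(value: str) -> str:
    """Mimic the two relabel rules: strip a ':port' suffix, then lowercase."""
    # Rule 1 (action = "replace"): the regex is matched against the whole value,
    # and the label is rewritten to the first capture group ("$1").
    match = re.fullmatch(r"(.*?)($|:.*)", value)
    if match:
        value = match.group(1)
    # Rule 2 (action = "lowercase"): lowercase the resulting label value.
    return value.lower()


if __name__ == "__main__":
    # Hypothetical instance values, for illustration only.
    for raw in ("WINSRV01:12345", "WinSrv01"):
        print(raw, "->", normalize_instance(raw))
```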

BR
Jan

@Nachtfalkeaw

You could check how long the scrape takes per collector, to see whether it is really a collector issue or a general one.

Do you use Alloy for anything else, such as Windows Event logs?
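
One way to compare scrape time per collector is to rank the per-collector duration samples in the exporter's metrics output. The sketch below is a rough illustration, not a definitive tool: it assumes the exporter exposes a `windows_exporter_collector_duration_seconds` metric with a `collector` label (as upstream windows_exporter does), and the sample values are made up.

```python
import re
from typing import List, Tuple

# Assumption: per-collector timing is exposed under this metric name.
METRIC = "windows_exporter_collector_duration_seconds"

# Matches lines such as: metric{collector="cpu"} 0.012
SAMPLE_RE = re.compile(
    r"^" + re.escape(METRIC) + r'\{[^}]*collector="([^"]+)"[^}]*\}\s+(\S+)'
)


def collector_durations(metrics_text: str) -> List[Tuple[str, float]]:
    """Return (collector, seconds) pairs, slowest collector first."""
    durations = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            durations[match.group(1)] = float(match.group(2))
    return sorted(durations.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample in Prometheus text format, for illustration only.
    sample = """\
# TYPE windows_exporter_collector_duration_seconds gauge
windows_exporter_collector_duration_seconds{collector="cpu"} 0.012
windows_exporter_collector_duration_seconds{collector="os"} 1.8
windows_exporter_collector_duration_seconds{collector="net"} 0.05
"""
    for name, seconds in collector_durations(sample):
        print(f"{name:15s} {seconds:.3f}s")
```

If every collector reports a short duration while CPU usage stays high, that would point toward a general problem rather than a single slow collector.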

@sudden1974
Author

We have tested removing collectors one by one, down to a single one, but there is no significant change in CPU usage. We are only using Alloy for the OS metrics (as in the posted config). No logs. We have it installed on about 2000 Windows servers, and only a few have this problem. As far as we can see, the only thing the problematic servers have in common is that many users are logged in, such as Citrix servers with many remote desktop users, or "management" servers that many RDP users use as a client to manage other systems.

CPU load almost immediately goes up to 25-30% and stays there.
