When using clustering, exporters may not work correctly due to instance label #1009

Open
thampiotr opened this issue Jun 10, 2024 · 9 comments
Labels: bug, needs-attention

Comments

@thampiotr
Contributor

thampiotr commented Jun 10, 2024

What's wrong?

Most embedded Prometheus exporters set the instance label to the hostname where Alloy runs.

This subtly but significantly breaks the fundamental clustering assumption that all instances have the same configuration. The exporters implicitly inject the hostname as the instance label, but instances usually have different hostnames. As a result, some targets are not scraped at all, while others are scraped more than once under different instance labels, which is unnecessary.

Steps to reproduce

  1. Run any exporter in clustered mode in a cluster of 2+ instances, each running on a different host. Have scraping set up with clustering and a remote write to a metrics DB.
  2. Observe that some targets are not scraped at all, while others are scraped multiple times with different instance labels.
  3. Observe in the UI that the instance label on the exporters' targets differs between instances, indicating different series.

The issue was discussed in this PR, but we decided to move the conversation here for better tracking and to provide a place to refer to for workarounds.

@thampiotr thampiotr added the bug Something isn't working label Jun 10, 2024
@thampiotr
Contributor Author

There is a workaround for now: set the instance label to a common value for all instances in the cluster, using the discovery.relabel component. For example, this component sets it to "alloy-cluster":

discovery.relabel "replace_instance" {
  targets = discovery.file.targets.targets
  rule {
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}

You'd add the above component between your exporters and the prometheus.scrape.
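
As a rough sketch of that wiring, assuming a blackbox exporter as the source of targets: the component labels (`probes`, `replace_instance`, `default`), the probe target, and the remote write URL below are hypothetical placeholders, not values taken from this issue.

```alloy
// Hypothetical exporter; by default its targets carry the local hostname
// as the instance label, which differs between cluster members.
prometheus.exporter.blackbox "probes" {
  config = "{ modules: { http_2xx: { prober: http, timeout: 5s } } }"

  target {
    name    = "example"
    address = "https://example.com"
    module  = "http_2xx"
  }
}

// Overwrite the instance label so every cluster member sees identical targets.
discovery.relabel "replace_instance" {
  targets = prometheus.exporter.blackbox.probes.targets

  rule {
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}

// Scrape the relabeled targets (not the exporter's raw targets) with clustering on.
prometheus.scrape "probes" {
  targets    = discovery.relabel.replace_instance.output
  forward_to = [prometheus.remote_write.default.receiver]

  clustering {
    enabled = true
  }
}

// Hypothetical remote write endpoint.
prometheus.remote_write "default" {
  endpoint {
    url = "http://mimir.example/api/v1/push"
  }
}
```

With the relabel step in place, every instance should compute the same labels for each target, so the cluster can agree on which instance scrapes it.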

A longer-term fix can also be achieved via #399. Regardless, we should have good documentation to ensure users don't fall into this pit.

Contributor

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@thampiotr thampiotr self-assigned this Aug 1, 2024
@rgarrigue

This affected our 30+ blackbox probes across ~7 Alloy instances deployed via Helm. We were missing random targets, which triggered DatasourceNoData in our alerting. The workaround fixed it.

@st-akorotkov

TBH this proposed rule is not a workaround. Using it breaks multiple dashboards and alerts, since we can't distinguish nodes running node-exporter anymore.
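
For host-local exporters such as node-exporter, one alternative that keeps per-node instance labels is to leave clustering off for that particular scrape, so that each Alloy instance scrapes only its own exporter. The following is a sketch under that assumption, not a fix proposed by the maintainers in this thread; the component labels (`node`, `default`) and the remote write URL are hypothetical.

```alloy
// Host-local exporter: every Alloy instance exposes metrics for its own host.
prometheus.exporter.unix "node" { }

// No clustering block here, so targets are not distributed across the
// cluster: each instance scrapes its own exporter and the hostname-based
// instance label stays meaningful.
prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.remote_write.default.receiver]
}

// Hypothetical remote write endpoint.
prometheus.remote_write "default" {
  endpoint {
    url = "http://mimir.example/api/v1/push"
  }
}
```

Other scrapes in the same configuration, for example those for remote targets, can still enable clustering independently.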

@xdr34m

xdr34m commented Dec 2, 2024

OMG, this nearly drove me insane until I found this issue.
Scraping blackbox targets with a clustered scrape resulted in half the scrape targets going missing in a 4-member cluster. Using discovery.relabel to unify the instance label before scraping fixed it.

@nofoxsteven

We were hitting this too. Though it did result in some targets not being scraped at all. A hint in the docs would be very helpful.

@tonyswu

tonyswu commented Dec 12, 2024

Coming into this discussion a bit late. We are considering using Alloy clustering in some areas, and naturally I find this concerning. However, when I set up a simple test against one remote scrape endpoint, I wasn't able to reproduce this. This is my simple single-endpoint setup (Alloy cluster of two instances):

discovery.dns "test_app" {
    names = ["${test_app_srv_record}"]
    type  = "SRV"
}

prometheus.scrape "test_app" {
    scrape_interval = "10s"
    targets         = discovery.dns.test_app.targets
    forward_to      = [prometheus.relabel.test_app.receiver]

    clustering {
        enabled = true
    }
}

prometheus.relabel "test_app" {
    // Template placeholder from the original post; it must resolve to a list of receivers.
    forward_to = "${mimir_push_url}"

    rule {
        action       = "replace"
        replacement  = "test-app"
        target_label = "app"
    }
}

The metrics do have an instance label, but it is set to the discovery record of the app. I am wondering if there is something I am missing. I see someone above mentioning having issues with blackbox, so I'll try that next.

@xdr34m

xdr34m commented Dec 12, 2024

I think you misunderstood. It's only a problem with exporters that set the instance label based on the Alloy instance running them. You can't scrape those directly and have to go the extra route via a discovery.relabel in between.

@tonyswu

tonyswu commented Dec 12, 2024

Ah, I see. Thanks for the clarification.

6 participants