prometheus.operator.podmonitors with clustering mode scrapes duplicated targets #2348

Open
taidv opened this issue Jan 8, 2025 · 1 comment
Labels: bug (Something isn't working)

Comments


taidv commented Jan 8, 2025

What's wrong?

When using prometheus.operator.podmonitors in clustering mode, the same targets are distributed to multiple Alloy instances, causing the same metrics to be scraped by more than one instance. This results in errors such as err-mimir-sample-out-of-order.

Note: I have deployed Alloy to 3 different clusters, but only 2 of them (with 50-100 nodes each) are experiencing this issue.
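
For context, a toy illustration (my own sketch, not Mimir's actual ingester code) of why two instances scraping the same target trips err-mimir-sample-out-of-order: each scraper stamps samples with its own scrape time, and the receiver rejects any sample older than the newest one already ingested for that series. Everything below (type names, timestamps, series labels) is made up for the example.

// out_of_order.go — toy illustration of the out-of-order rejection.
package main

import "fmt"

// ingester keeps, per series, the newest timestamp seen so far and rejects
// anything older, mirroring the check described in the error message.
type ingester struct {
	newest map[string]int64
}

func (i *ingester) push(series string, ts int64) error {
	if last, ok := i.newest[series]; ok && ts < last {
		return fmt.Errorf("sample-out-of-order: got %d, newest is %d", ts, last)
	}
	i.newest[series] = ts
	return nil
}

func main() {
	ing := &ingester{newest: map[string]int64{}}
	series := `istio_build{pod="istio-proxy-abc"}`

	// Two Alloy instances scrape the same pod on their own schedules, so their
	// samples arrive interleaved and slightly offset from each other.
	pushes := []struct {
		from string
		ts   int64
	}{
		{"A", 1000}, {"B", 1900}, {"B", 2900},
		{"A", 2000}, // older than B's 2900, so it is rejected
		{"A", 3000}, {"B", 3900},
	}
	for _, p := range pushes {
		if err := ing.push(series, p.ts); err != nil {
			fmt.Printf("instance %s ts=%d rejected: %v\n", p.from, p.ts, err)
		} else {
			fmt.Printf("instance %s ts=%d accepted\n", p.from, p.ts)
		}
	}
}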

Steps to reproduce

  • Deploy Alloy in cluster mode.
  • Configure Alloy to scrape metrics from Prometheus Operator CRDs.
  • Create Istio monitor CRDs using the following manifest: Istio Prometheus Operator CRD
  • Observe the logs for the err-mimir-sample-out-of-order error (approximately 5% error rate).
  • Compare the targets between Alloy instances using the Alloy UI. Some targets (about 10%) appear on multiple Alloy instances (see the comparison sketch after this list).
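
A rough sketch for finding targets that appear on more than one instance. It assumes you have exported each instance's target list as a JSON array of label maps (for example, copied out of the Alloy UI); the file names targets-0.json and targets-1.json are placeholders.

// compare_targets.go — diff two exported target lists and print the overlap.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
	"strings"
)

// key builds a stable string from a target's label set so the same target
// produces the same key regardless of which instance exported it.
func key(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, name+"="+labels[name])
	}
	return strings.Join(parts, ",")
}

func load(path string) (map[string]bool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var targets []map[string]string
	if err := json.Unmarshal(data, &targets); err != nil {
		return nil, err
	}
	set := make(map[string]bool, len(targets))
	for _, t := range targets {
		set[key(t)] = true
	}
	return set, nil
}

func main() {
	a, err := load("targets-0.json")
	if err != nil {
		panic(err)
	}
	b, err := load("targets-1.json")
	if err != nil {
		panic(err)
	}
	// Any key present in both files is a target scraped by two instances.
	dupes := 0
	for k := range a {
		if b[k] {
			dupes++
			fmt.Println("duplicated target:", k)
		}
	}
	fmt.Printf("%d targets appear on both instances\n", dupes)
}

Running go run compare_targets.go prints each label set that shows up in both exports, which is roughly how the ~10% overlap above was spotted.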

System information

GKE v1.30

Software version

Alloy v1.5.1

Configuration

alloy:
  clustering:
    enabled: true
    name: workload-xxx-cluster
    portName: tcp
  configMap:
    create: true
    content: |-
      prometheus.operator.podmonitors "default" {
        forward_to = [prometheus.remote_write.default.receiver]
        selector {
          match_expression {
            key = "alloy-collector"
            operator = "In"
            values = ["alloy-metrics"]
          }
        }
        clustering {
          enabled = true
        }
      }

      prometheus.operator.servicemonitors "default" {
        forward_to = [prometheus.remote_write.default.receiver]
        selector {
          match_expression {
            key = "alloy-collector"
            operator = "In"
            values = ["alloy-metrics"]
          }
        }
        clustering {
          enabled = true
        }
      }

      prometheus.remote_write "default" {
        endpoint {
          url = "http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push"
        }
      }

  mounts:
    extra:
    - mountPath: /var/lib/alloy
      name: alloy-data
  storagePath: /var/lib/alloy
controller:
  replicas: 3
  type: statefulset
  volumeClaimTemplates:
  - metadata:
      name: alloy-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
          requests:
            storage: 10Gi
fullnameOverride: alloy-metrics

Logs

ts=2025-01-08T09:20:15.440464533Z level=error msg="non-recoverable error" component_path=/ component_id=prometheus.remote_write.default subcomponent=rw remote_name=8989d8 url=http://mimir-nginx.monitoring.svc.cluster.local/api/v1/push count=2000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=anonymous: the sample has been rejected because another sample with a more recent timestamp has already been ingested and out-of-order samples are not allowed (err-mimir-sample-out-of-order). The affected sample has timestamp 2025-01-08T09:20:15.042Z and is from series {__name__=\"istio_build\", [...]}"
taidv added the bug (Something isn't working) label on Jan 8, 2025

taidv commented Jan 8, 2025

Looking at the code:

peers, err := c.Lookup(shard.StringKey(nonMetaLabelString(t)), 1, shard.OpReadWrite)
if len(peers) == 0 || err != nil {
	// If the cluster found no peers or returned an error, we fall
	// back to owning the target ourselves.
	g2.Targets = append(g2.Targets, t)
} else if peers[0].Self {
	g2.Targets = append(g2.Targets, t)
}

It seems an instance keeps a target if the cluster finds no peers or if an error is returned. When I checked the debug log, I saw that Alloy discovered 4 peers.
This suggests that c.Lookup might have returned an error. Do you know what errors might occur there, or do you have any tips for troubleshooting this issue?
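
To make the failure mode concrete, here is a toy simulation I put together (not Alloy's actual sharder; the hashing below is a stand-in for ckit, and the peer and target names are made up). It shows how a Lookup error on one instance, combined with normal ownership on another, leaves the same target scraped twice:

// duplicate_ownership.go — toy simulation of the fallback in the snippet above.
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

type peerState struct {
	name      string
	lookupErr error // simulate c.Lookup failing on this instance only
}

// owns mimics the quoted logic: keep the target on a lookup error, or when the
// consistent-hash assignment points at ourselves.
func (p peerState) owns(target string, peers []string) bool {
	if p.lookupErr != nil {
		return true // fallback: own the target ourselves
	}
	h := fnv.New32a()
	h.Write([]byte(target))
	assigned := peers[int(h.Sum32())%len(peers)]
	return assigned == p.name
}

func main() {
	peers := []string{"alloy-metrics-0", "alloy-metrics-1", "alloy-metrics-2"}
	states := []peerState{
		{name: "alloy-metrics-0"},
		{name: "alloy-metrics-1", lookupErr: errors.New("not ready")},
		{name: "alloy-metrics-2"},
	}
	target := "istio-proxy/10.64.13.9:15020"
	for _, s := range states {
		fmt.Printf("%s owns %q: %v\n", s.name, target, s.owns(target, peers))
	}
	// If the hash assigns the target to a peer other than alloy-metrics-1, the
	// target ends up owned twice: once by that peer and once by alloy-metrics-1,
	// which errored and fell back to owning it.
}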

I really appreciate your help, and thank you in advance for your guidance!

Debug log:

ts=2025-01-08T09:39:42.166463426Z level=debug msg="Initiating push/pull sync with: alloy-metrics-2 10.64.47.4:12345" service=cluster subsystem=memberlist
ts=2025-01-08T09:39:42.169885971Z level=debug msg="merging remote state" service=cluster remote_time=525384
ts=2025-01-08T09:39:48.744928396Z level=debug msg="Stream connection from=127.0.0.6:51369" service=cluster subsystem=memberlist
ts=2025-01-08T09:39:48.747452658Z level=debug msg="merging remote state" service=cluster remote_time=525385
ts=2025-01-08T09:39:52.908680211Z level=debug msg="Stream connection from=127.0.0.6:46327" service=cluster subsystem=memberlist
ts=2025-01-08T09:39:52.910711457Z level=debug msg="merging remote state" service=cluster remote_time=525386
ts=2025-01-08T09:39:53.115111977Z level=debug msg="alloy-metrics-0 @525389: participant" service=cluster
ts=2025-01-08T09:40:00.578557496Z level=debug msg="Stream connection from=127.0.0.6:51369" service=cluster subsystem=memberlist
ts=2025-01-08T09:40:00.708563378Z level=debug msg="alloy-metrics-0 @525391: participant" service=cluster
ts=2025-01-08T09:40:00.788885839Z level=debug msg="merging remote state" service=cluster remote_time=525392
ts=2025-01-08T09:40:00.903980527Z level=debug msg="alloy-metrics-3 @525394: participant" service=cluster
ts=2025-01-08T09:40:07.351722914Z level=debug msg="received DNS query response" service=cluster addr=alloy-metrics-cluster record_type=A/AAAA records_count=4
ts=2025-01-08T09:40:07.351766541Z level=debug msg="discovered peers" service=cluster peers_count=4 peers=10.64.13.9:12345,10.64.53.3:12345,10.64.47.4:12345,10.64.56.140:12345
ts=2025-01-08T09:40:07.351777744Z level=info msg="rejoining peers" service=cluster peers_count=4 peers=10.64.13.9:12345,10.64.53.3:12345,10.64.47.4:12345,10.64.56.140:12345
ts=2025-01-08T09:40:07.354506931Z level=debug msg="Initiating push/pull sync with:  10.64.13.9:12345" service=cluster subsystem=memberlist
ts=2025-01-08T09:40:07.356473825Z level=debug msg="merging remote state" service=cluster remote_time=525396
ts=2025-01-08T09:40:07.356543073Z level=debug msg="alloy-metrics-3 @525396: participant" service=cluster
ts=2025-01-08T09:40:07.359441333Z level=debug msg="Stream connection from=10.64.53.3:51330" service=cluster subsystem=memberlist
ts=2025-01-08T09:40:07.361155513Z level=debug msg="Initiating push/pull sync with:  10.64.53.3:12345" service=cluster subsystem=memberlist
ts=2025-01-08T09:40:07.362859179Z level=debug msg="merging remote state" service=cluster remote_time=525397
ts=2025-01-08T09:40:07.364173252Z level=debug msg="merging remote state" service=cluster remote_time=525397
ts=2025-01-08T09:40:07.36662823Z level=debug msg="Initiating push/pull sync with:  10.64.47.4:12345" service=cluster subsystem=memberlist
ts=2025-01-08T09:40:07.368392599Z level=debug msg="merging remote state" service=cluster remote_time=525395
ts=2025-01-08T09:40:07.387958569Z level=debug msg="Initiating push/pull sync with:  10.64.56.140:12345" service=cluster subsystem=memberlist
ts=2025-01-08T09:40:07.427784449Z level=debug msg="merging remote state" service=cluster remote_time=525395
ts=2025-01-08T09:40:07.427851432Z level=debug msg="alloy-metrics-1 @525399: participant" service=cluster
ts=2025-01-08T09:40:12.172171721Z level=debug msg="Initiating push/pull sync with: alloy-metrics-2 10.64.47.4:12345" service=cluster subsystem=memberlist
ts=2025-01-08T09:40:12.176376094Z level=debug msg="merging remote state" service=cluster remote_time=525401
ts=2025-01-08T09:40:26.51176214Z level=debug msg="Stream connection from=127.0.0.6:42787" service=cluster subsystem=memberlist
ts=2025-01-08T09:40:26.514079498Z level=debug msg="merging remote state" service=cluster remote_time=525404
ts=2025-01-08T09:40:26.677253395Z level=debug msg="alloy-metrics-2 @525405: participant" service=cluster
ts=2025-01-08T09:40:40.303916901Z level=debug msg="alloy-metrics-2 @525407: participant" service=cluster
