Thanos-Compact halting with error 'err="compaction: group 0@17832940732465865817: overlapping sources detected' #6389

Migueljfs opened this issue May 23, 2023 · 6 comments



Migueljfs commented May 23, 2023

Thanos, Prometheus and Golang version used:
Thanos: 0.31.0
Prometheus: 2.44.0

Object Storage Provider:
Google (GCS)

What happened:
Thanos-compact pod halted shortly after starting with error:

level=error ts=2023-05-23T13:07:54.219139563Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: group 0@17832940732465865817: overlapping sources detected for plan [01GZ0CAKEYN0BVTRW570T69QKM (min time: 1682553600388, max time: 1682560800000) 01H0FY8AR60GEVX3J3Q7Y7GSYG (min time: 1682553600388, max time: 1682596800000) 01GZ0K651WJG1M9W2DMC050NHF (min time: 1682560800388, max time: 1682568000000) 01GZ0R6020JTZ21GETCZCY5Y57 (min time: 1682568000388, max time: 1682575200000) 01GZ0Z1QA4HWJ4SX0RXJFZTM5X (min time: 1682575200388, max time: 1682582400000) 01GZ15XEJ2M6M0QH50XRQC3M65 (min time: 1682582400388, max time: 1682589600000) 01GZ1CS5T1VQ52EWBAWBSX7Z3W (min time: 1682589600388, max time: 1682596800000)]"

What you expected to happen:
I believe Thanos Compact should be able to deduplicate or merge the blocks in this case, but I'm not really sure.

Full logs to relevant components:
I inspected the bucket, and the block IDs in the error message correspond to the following:

|            ULID            |         FROM         |        UNTIL         |     RANGE      |   UNTIL-DOWN    |  #SERIES   |    #SAMPLES    |   #CHUNKS   | COMP-LEVEL | COMP-FAILED |                                                           LABELS                                                            | RESOLUTION |  SOURCE   |
|----------------------------|----------------------|----------------------|----------------|-----------------|------------|----------------|-------------|------------|-------------|-----------------------------------------------------------------------------------------------------------------------------|------------|-----------|
| 01H0FY8AR60GEVX3J3Q7Y7GSYG | 2023-04-27T00:00:00Z | 2023-04-27T12:00:00Z | 11h59m59.612s  | 28h0m0.388s     | 3,219      | 6,655,129      | 59,010      | 3          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-1                                                    | 0s         | compactor |
| 01GZ0CAKEYN0BVTRW570T69QKM | 2023-04-27T00:00:00Z | 2023-04-27T02:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,117      | 1,107,200      | 9,811       | 2          | false       | cluster=operations-staging                                                                                                  | 0s         | compactor |
| 01GZ0K651WJG1M9W2DMC050NHF | 2023-04-27T02:00:00Z | 2023-04-27T04:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,117      | 1,107,198      | 9,812       | 2          | false       | cluster=operations-staging                                                                                                  | 0s         | compactor |
| 01GZ0R6020JTZ21GETCZCY5Y57 | 2023-04-27T04:00:00Z | 2023-04-27T06:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,123      | 1,107,288      | 9,817       | 1          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-0                                                    | 0s         | ruler     |
| 01GZ0Z1QA4HWJ4SX0RXJFZTM5X | 2023-04-27T06:00:00Z | 2023-04-27T08:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,134      | 1,107,689      | 9,827       | 1          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-0                                                    | 0s         | ruler     |
| 01GZ15XEJ2M6M0QH50XRQC3M65 | 2023-04-27T08:00:00Z | 2023-04-27T10:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,134      | 1,108,467      | 9,831       | 1          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-0                                                    | 0s         | ruler     |
| 01GZ1CS5T1VQ52EWBAWBSX7Z3W | 2023-04-27T10:00:00Z | 2023-04-27T12:00:00Z | 1h59m59.612s   | 38h0m0.388s     | 3,203      | 1,117,342      | 9,908       | 1          | false       | cluster=operations-staging,thanos_ruler_replica=thanos-ruler-evaluator-0                                                    | 0s         | ruler     |
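
For reference, this table is the output format of thanos tools bucket inspect; a minimal sketch of such an invocation (bucket.yaml is a placeholder for the actual objstore config):

    # Placeholder objstore config path; point it at the same GCS bucket.
    thanos tools bucket inspect --objstore.config-file=bucket.yaml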
@mhoffm-aiven (Contributor)

How is the compactor configured? It looks like a historical compactor had a different configuration than the new one, because the first block still has the replica label.

@Migueljfs (Author)

I have 5 sharded compactors running with this config:

        - compact
        - --wait
        - --log.level=info
        - --log.format=logfmt
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --data-dir=/var/thanos/compact
        - --debug.accept-malformed-index
        - --retention.resolution-raw=2y
        - --retention.resolution-5m=2y
        - --retention.resolution-1h=2y
        - --delete-delay=48h
        - --compact.concurrency=1
        - --downsample.concurrency=1
        - --deduplication.replica-label=prometheus_replica
        - --deduplication.replica-label=receive_replica
        - --deduplication.replica-label=thanos_ruler_replica
        - --compact.enable-vertical-compaction
        - |-
          --selector.relabel-config=
            - action: hashmod
              source_labels: ["cluster"]
              target_label: shard
              modulus: 5
            - action: keep
              source_labels: ["shard"]
              regex: 0

With the keep regex going from 0 to 4 across the five shards.

By the way, this is the exact same config I deploy on my other clusters (different cluster environments with different buckets), and only this one is giving these errors.

@mhoffm-aiven (Contributor)

My guess is that it previously ran without the "thanos_ruler_replica" dedup label, since the 01H0FY8AR60GEVX3J3Q7Y7GSYG block still has that label even though it is already compacted and appears in the compaction plan. You could probably mark it as no-compact?
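
For reference, a no-compact mark can be added with the bucket mark tool; a sketch, using the block ID from the plan above and a placeholder bucket.yaml objstore config:

    # Marks the block so the compaction planner skips it; bucket.yaml is a placeholder path.
    thanos tools bucket mark \
      --objstore.config-file=bucket.yaml \
      --marker=no-compaction-mark.json \
      --id=01H0FY8AR60GEVX3J3Q7Y7GSYG \
      --details="block kept its thanos_ruler_replica label; exclude from compaction"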

@Migueljfs (Author)

It's possible; it's been a while, so to be honest I don't remember.

Either way, what I did since then was remove the chunks directly from my bucket (this is a staging environment, so I don't care much about the data itself; I just wanted to understand how to solve this in case it comes up in prod).

However, thanos-compact eventually halts again on a new set of chunks. Then I delete those, thanos-compact runs again until it halts, and so on.

It's been like this for the past two weeks, and I have deleted a bunch of chunks. At first I thought there were some corrupted chunks or something like that, but I'm starting to think it will stay like this forever, and I can't understand why.
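
For reference, rather than deleting objects straight from the bucket, blocks can also be marked for deletion with the same bucket mark tool, so the compactor cleans them up after the configured delete delay; a sketch with placeholder values:

    # <ULID> and bucket.yaml are placeholders for the offending block and the objstore config.
    thanos tools bucket mark \
      --objstore.config-file=bucket.yaml \
      --marker=deletion-mark.json \
      --id=<ULID> \
      --details="manually removing block involved in overlapping-sources halt"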

@jaspreet-yb

I'm facing the same issue, which is causing compaction to halt:
ts=2023-11-28T08:24:05.350841732Z caller=compact.go:491 level=error msg="critical error detected; halting" err="compaction: group 300000@5350783008949816695: failed to run pre compaction callback for plan: [01HG7EDJNBXVZ2J0PQS2HT87Q8 (min time: 1699488000002, max time: 1700697600000) 01HGAG0TKWJ09CS7SBAXGXTBME (min time: 1700402400000, max time: 1700697600000)]: overlapping sources detected for plan [01HG7EDJNBXVZ2J0PQS2HT87Q8 (min time: 1699488000002, max time: 1700697600000) 01HGAG0TKWJ09CS7SBAXGXTBME (min time: 1700402400000, max time: 1700697600000)]"

Our current thanos compact config:

    Args:
      compact
      --wait
      --log.level=info
      --log.format=logfmt
      --objstore.config=$(OBJSTORE_CONFIG)
      --data-dir=/var/thanos/compact
      --debug.accept-malformed-index
      --retention.resolution-raw=7d
      --retention.resolution-5m=30d
      --retention.resolution-1h=545d
      --delete-delay=48h
      --deduplication.replica-label=prometheus_replica
      --compact.enable-vertical-compaction
      --deduplication.func=penalty

We have 6 shards and 2 replicas for Prometheus.


Kot-o-pes commented Jun 13, 2024

Hi there, I faced this issue too:
thanos-compactor[2635699]: {"caller":"compact.go:527","err":"compaction: group 300000@7488097868448971783: failed to run pre compaction callback for plan: [01HZN4SYGV9HE0ZJC6ZMY31JVK (min time: 1710374400000, max time: 1711411200000) 01J06BGV9K11S7TV8JB35HAQRK (min time: 1711065600000, max time: 1711584000000)]: overlapping sources detected for plan [01HZN4SYGV9HE0ZJC6ZMY31JVK (min time: 1710374400000, max time: 1711411200000) 01J06BGV9K11S7TV8JB35HAQRK (min time: 1711065600000, max time: 1711584000000)]","level":"error","msg":"critical error detected; halting","ts":"2024-06-13T07:41:53.210408962Z"}
Prometheus has 3 replicas. The two blocks in the plan are:

|            ULID            |           FROM            |           UNTIL           |     RANGE      |   UNTIL-DOWN    |  #SERIES    |    #SAMPLES     |   #CHUNKS     | COMP-LEVEL | COMP-FAILED |                                                           LABELS                                                                                        | RESOLUTION |  SOURCE   |
|----------------------------|---------------------------|---------------------------|----------------|-----------------|-------------|-----------------|---------------|------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|------------|-----------|
| 01HZN4SYGV9HE0ZJC6ZMY31JVK | 2024-03-14T03:00:00+03:00 | 2024-03-26T03:00:00+03:00 | 288h0m0s       | -48h0m0s        | 66,629,361  | 54,953,985,276  | 511,345,472   | 5          | false       | cluster=k8s.prod,environment=prod,manage_by=flux,prometheus=monitoring/prom-operator-prometheus                                                                         | 5m0s       | compactor |
| 01J06BGV9K11S7TV8JB35HAQRK | 2024-03-22T03:00:00+03:00 | 2024-03-28T03:00:00+03:00 | 144h0m0s       | 96h0m0s         | 40,684,997  | 27,949,523,210  | 232,661,558   | 5          | false       | cluster=k8s.prod,environment=prod,manage_by=flux,prometheus=monitoring/prom-operator-prometheus |

I tried adding a no-compaction mark, and I also found this issue about compaction marks being ignored: #5603
