Thanos Compactor Failure : overlaps found while gathering blocks. #469
Manually deleting the blocks from the bucket fixed the issue.
Thanks for reporting! In your case I would love to know what happened. Did you do anything manually with blocks, or perform ANY manual operation on the object storage? I think the repair you propose makes sense, in the end, to actually unblock users and investigate later (: Maybe even automated mitigation in the compactor would be necessary. One idea would be to improve TSDB compaction: prometheus-junkyard/tsdb#90. But the investigation part is really necessary!
What happened exactly is a good question. I never made any manual changes or prior repairs to the blocks directly; everything is managed by Thanos. It's a fairly standard setup.
Sorry for the delayed answer. Can you still reproduce it? It might be fixed in v0.1.0. There is one important comparison you could make: check whether those blocks come from the same scrapers or not. Let's reopen if you can repro.
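For anyone trying to make that comparison, here is a rough sketch of one way to do it, assuming an S3-compatible bucket reachable with the aws CLI and jq installed; the bucket name and the two block ULIDs are placeholders to be filled in from the compactor's "overlaps found" error:

```sh
# Placeholders: fill in the bucket name and the two overlapping block ULIDs.
BUCKET=example-thanos-bucket
BLOCK_A=01ABCDEFGHJKMNPQRSTVWXYZ0A
BLOCK_B=01ABCDEFGHJKMNPQRSTVWXYZ0B

# Each block's meta.json records the external labels of the scraper that
# produced it, so differing label sets mean different Prometheus instances.
aws s3 cp "s3://${BUCKET}/${BLOCK_A}/meta.json" - | jq '.thanos.labels'
aws s3 cp "s3://${BUCKET}/${BLOCK_B}/meta.json" - | jq '.thanos.labels'
```

If the two label sets are identical, the blocks were produced by the same scraper and something uploaded the same time range twice.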
Same here @kedare, with Thanos version 0.4.0; my pod in Kubernetes is crashing every time.
Had the same problem, same messages as @Lord-Y. Update: this was just a temporary fix; I think there's an underlying problem. In my scenario I have two instances of Prometheus as StatefulSets.
@amartorelli it's definitely something that needs to be fixed. I also have StatefulSets for my Prometheus instances. This issue happened after migrating from version 0.3.2 to 0.4.0.
I've noticed how the overlaps are checked. Is it safe to say that, even if the timestamps overlap, when the meta files inside the folders contain unique labels the data has been pushed by two different instances of Prometheus (different external_labels sets), and hence they should pass the OverlappingBlocks check? Specifically if:
I also have the same issue as @Lord-Y & @amartorelli. As of now I don't have a lot of clues, but I guess this has to be linked to the fact that I'm currently performing tests/updates on this Prometheus (notably adding/refactoring scrape configs), which leads to lots of Prometheus pod restarts. Edit: I discovered an issue with the persistent storage of my Prometheus deployment. Now that this is fixed, I don't have any Thanos Compactor errors or duplicated blocks.
Hi. Thanos Compactor is complaining about the same error, and it references two blobs that, according to a bucket inspect, belong to the same Prometheus replica:

| ULID | FROM | UNTIL | RANGE | UNTIL-DOWN | #SERIES | #SAMPLES | #CHUNKS | COMP-LEVEL | COMP-FAILED | LABELS | RESOLUTION | SOURCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01DDAT9V8SHHTAYJXNJPTKVSP6 | 14-06-2019 08:00:00 | 14-06-2019 10:00:00 | 2h0m0s | 38h0m0s | 345,454 | 81,584,495 | 684,780 | 1 | false | cluster=ci,env=ci,prometheus=monitoring/k8s,prometheus_replica=prometheus-k8s-0 | 0s | sidecar |
| 01DDAT9VYEZ4QTVRJC4NJBT27F | 14-06-2019 08:00:00 | 14-06-2019 10:00:00 | 2h0m0s | 38h0m0s | 58,675 | 13,924,099 | 116,993 | 1 | false | cluster=ci,env=ci,prometheus=monitoring/k8s,prometheus_replica=prometheus-k8s-0 | 0s | sidecar |

As you can see, there are two blobs for the same time range and the same Prometheus replica, with the same compaction level but a different number of series, samples, and chunks. We are using 0.3.2 and Prometheus 2.5.0 (from Prometheus-operator 0.30). We have deleted all the blobs in the storage account (Azure) and we are still getting the overlapping error. Has this been solved in newer versions? Thank you
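For anyone else trying to produce a table like the one above, a minimal sketch of the inspect invocation; `bucket.yml` is a placeholder for your objstore configuration file, and on older releases the subcommand is `thanos bucket inspect` (without `tools`):

```sh
# List all blocks in the bucket with their time ranges, labels and sources.
thanos tools bucket inspect --objstore.config-file=bucket.yml

# Plain grep on the table output is enough to spot two blocks covering the
# same FROM/UNTIL window for a single replica.
thanos tools bucket inspect --objstore.config-file=bucket.yml \
  | grep 'prometheus_replica=prometheus-k8s-0'
```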
@bwplotka I just faced the same issue. Nothing special about my setup (2 Prometheus instances independently scraping the same targets).
Same here with Thanos version 0.6.0.
Still experiencing this issue in the 0.6.x series.
Getting the same issue on versions v0.8.1 and v0.9.0.
We are having the same issue, running with 3 Prometheus StatefulSet pods.
Let me revisit this ticket again. All known causes of overlaps are misconfiguration. We tried our best to explain all potential problems and solutions here: https://thanos.io/operating/troubleshooting.md/ Super happy we finally have a nice doc about that, thanks to @daixiang0. Let's iterate on it if there is something missing there. 🤗
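To make the most common misconfiguration concrete: two Prometheus replicas uploading blocks with identical external label sets will produce overlapping blocks that the compactor refuses to merge. A minimal sketch of the intended setup is below, with hypothetical label values; each replica needs at least one label (here `prometheus_replica`) that is unique to it:

```sh
# Hypothetical external_labels for two Prometheus replicas feeding one bucket.
# Only the replica label differs; if both replicas uploaded identical label
# sets, their blocks would overlap and halt the compactor.
cat > prometheus-replica-0.yml <<'EOF'
global:
  external_labels:
    cluster: ci
    prometheus: monitoring/k8s
    prometheus_replica: prometheus-k8s-0
EOF

cat > prometheus-replica-1.yml <<'EOF'
global:
  external_labels:
    cluster: ci
    prometheus: monitoring/k8s
    prometheus_replica: prometheus-k8s-1
EOF
```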
@bwplotka I'd like to revisit this issue, if you have a moment. We have a very simple stack set up using this
After running for just a couple of days, we're running into the "error executing compaction: compaction ..." failure. The troubleshooting document suggests the following reasons for this error:
If you have a minute (or @daixiang0 or someone else) I would appreciate some insight into what could be causing this problem. We're running:
With:
Thanks for the very clean write-up, @larsks!
Well, I actually think that is the issue. I remember someone else reporting that Swift is not strongly consistent. Eventual consistency actually creates thousands of issues. You have a single producer, yes, but imagine the Compactor creating a block. Then it removes the old block, because it just created the new one, right? So all good! So it deletes the block and starts a new iteration. Now we can have so many different cases:
Overall we spent so much time on a design solution that will work for the fairly rare case of eventually consistent storages... so trust us. Together with @squat and @khyatisoneji we can elaborate on what more can go wrong in such cases... And in the end you can read more details on what was done and what is still planned here: https://thanos.io/proposals/201901-read-write-operations-bucket.md/ Overall the new Thanos version will help you a lot, but still, there is an issue with the compactor replicating blocks by accident on eventually consistent storages. We are missing this item: #2283 In the meantime I think you can try enabling vertical compaction (a rough sketch of the flags follows below). This will ensure that the compactor handles overlaps... by simply compacting them again into one block. This is experimental though. cc @kakkoyun @metalmatze Ideally, I would suggest using anything other than Swift, as other object storages have no issues like this.
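In case it helps anyone reading along, a rough sketch of what enabling the experimental vertical compaction might look like on the compactor; `bucket.yml` is a placeholder, and the flag names are from roughly the v0.12-era docs and may differ in your release, so confirm with `thanos compact --help` before relying on them:

```sh
# Experimental: let the compactor merge (vertically compact) overlapping
# blocks instead of halting on them.
thanos compact \
  --wait \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --compact.enable-vertical-compaction \
  --deduplication.replica-label=prometheus_replica
```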
Thanks for the response! I have a few questions for you:
Do you mean "a future version of Thanos" or do you mean we should simply upgrade to
Is that as simple as adding
Is it okay to use filesystem-backed storage? It has all sorts of warnings in https://github.com/thanos-io/thanos/blob/master/docs/storage.md, but we're not interested in a paid option like GCS or S3, and we don't have a local S3-analog other than Swift. I guess we could set up something like MinIO, but requiring a separate storage service just for Thanos isn't a great option.
I mean
Yes! In fact, you don't need a replica label, you can even put
It really depends on your needs and the amount of data. It has warnings to avoid cases like users being too clever and running it on NFS (: etc. If your data fits on disk and you are fine with manually backing up and resizing the disk, then filesystem storage should do just fine. Please feel free to test it out; it is definitely production-grade, tested, and maintained. We can rephrase the docs to state so.
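For reference, a minimal sketch of what the filesystem-backed setup might look like; both paths are placeholders:

```sh
# Local-directory "bucket" instead of Swift/S3/GCS.
cat > bucket.yml <<'EOF'
type: FILESYSTEM
config:
  directory: /var/thanos/bucket
EOF

# The same file is then passed to the sidecar, store gateway and compactor.
thanos compact \
  --wait \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml
```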
I switched over to using filesystem-backed storage on 4/10, and it's survived the past several days without error. Looks like it was a storage issue, so hopefully things will be stable for a bit now.
Yes, let us know how it goes. I am pretty sure it should be stable, so we can remove the `experimental` mention in the docs (:

Bartek
Hey everyone 👋🏼 I would like to explore the possibility of adding some sort of fixing command to the
and
@bwplotka Thanks for all the details shared on this issue! While I understand why the team decided to avoid fixing the problem without knowing the origin of the issue, in the end users do need to act somehow to clean up the bucket so the Thanos compactor can get back to work on newly added data and everything else that isn't overlapping, even if the root cause of the problem was a misconfiguration. Depending on the time window and the amount of data stored in the bucket, the effort required to clean it up can get pretty big, forcing users to write their own scripts and risking ending up in an even worse situation. I would like to propose adding a complementary command, or even evolving the current one (`thanos tools bucket verify --repair`), with the ability to move the affected blocks to a backup bucket, or something similar, so that users could get the compactor back to running and then decide what to do with the affected data. We could also consider implementing a way to move the data back once it's sorted out 🤔 If that's something that makes sense for the project, I would love to explore the topic and contribute. I would appreciate some feedback and guidance on this (should I open a new issue?). Thanks
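For context, a rough sketch of the existing verify/repair flow the proposal would build on; `bucket.yml` and `backup-bucket.yml` are placeholders, and the exact flags are worth double-checking against `thanos tools bucket verify --help` for your version:

```sh
# Dry run: report issues (including overlapping blocks) without touching data.
thanos tools bucket verify --objstore.config-file=bucket.yml

# Repair run: offending blocks are removed from the main bucket and backed up
# to a second bucket so they can be inspected or restored later.
thanos tools bucket verify --repair \
  --objstore.config-file=bucket.yml \
  --objstore-backup.config-file=backup-bucket.yml
```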
@B0go that would be awesome. We experience this issue every few weeks on one of our clusters; it seems to happen randomly. We already have a dedicated alert in monitoring with a runbook for what to delete. I guess we could just run the repair command proactively as a cron job in the future to prevent that kind of manual toil.
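If anyone does go the proactive route, a hypothetical cron setup along those lines (paths and schedule are made up, and whether running repair unattended is safe for your data is worth deciding deliberately):

```sh
# Append a nightly 03:00 repair pass to the current crontab; the existing
# crontab entries are preserved.
( crontab -l 2>/dev/null
  echo '0 3 * * * thanos tools bucket verify --repair --objstore.config-file=/etc/thanos/bucket.yml --objstore-backup.config-file=/etc/thanos/backup-bucket.yml >> /var/log/thanos-verify.log 2>&1'
) | crontab -
```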
Thanos, Prometheus and Golang version used
Docker images version: master-2018-08-04-8b7169b (Was also affecting an older version before)
What happened
The Thanos compactor is failing to run (it crashes).
What you expected to happen
The Thanos compactor to compact :)
How to reproduce it (as minimally and precisely as possible):
Good question. It was running before, and when checking it on the server I found it restarting in a loop (because of the Docker restart policy, which is set to restart on crash).
Full logs to relevant components
Related configuration from our DSC (Salt)
Let me know if you need any more information.