Compactor: Deadlock on S3 error during meta sync #7514
Comments
When using
@palamvmw In general that would help getting unstuck, at the very least. I am trying to wrap my head around what happened in the first place, and I believe this is caused by a large number of errors "clogging up" the work queue. Essentially, any Exists call that throws an error takes out one of the worker threads with it, and there is no check for all of them going away, so when we reach the 64th Exists error everything just stops: no more workers picking up items, and no more items pushed to the channel. Since we are not on a timed context, we also never get ctx.Done().

I would think that if a worker errors out, we should start a new one in its place (or just not stop it in the first place, and send errors to a different channel to be collected). I am also not quite sure that this line is correct.

We have had S3 503 SlowDown responses previously, so it is not at all unlikely that the reason we clogged up the 64 workers is that all of a sudden all of them started getting 503 SlowDown responses from AWS.
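To illustrate the suspected failure mode described above, here is a minimal, self-contained Go sketch. It is not the actual Thanos fetcher code: `exists`, the channel layout, and the worker count are stand-ins assumed for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const concurrency = 64

// exists stands in for an object-store Exists call that suddenly starts
// failing, e.g. with S3 "503 SlowDown" responses.
func exists(name string) (bool, error) {
	return false, errors.New("503 SlowDown")
}

func main() {
	ch := make(chan string)

	// 64 workers, each of which returns on the first Exists error and is
	// never replaced.
	for i := 0; i < concurrency; i++ {
		go func() {
			for name := range ch {
				if _, err := exists(name); err != nil {
					return // worker dies with the error
				}
			}
		}()
	}

	finished := make(chan struct{})
	go func() {
		defer close(finished)
		// Producer: once the 64th worker has died there is no receiver left,
		// so this send blocks forever; with no deadline on the context there
		// is no ctx.Done() to break the stalemate either.
		for i := 0; i < 1_000_000; i++ {
			ch <- fmt.Sprintf("block-%d/meta.json", i)
		}
		close(ch)
	}()

	select {
	case <-finished:
		fmt.Println("producer finished (only happens if Exists stops failing)")
	case <-time.After(2 * time.Second):
		fmt.Println("all workers exited on errors; producer is stuck on send: deadlock")
	}
}
```

With an unbuffered channel, the first 64 sends each hand one item to a worker that then dies; the 65th send has no receiver left, so the producer blocks indefinitely and the sync never completes.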
I think I am facing the same issue, but on the Azure object store, so I tend to agree with the comment on your draft PR.
I believe I just hit the same issue on GCP using v0.36.1. The compactor suddenly just stopped doing any work after completing a delete cycle. It was still responding to metric scrapes.
Fix: errors no longer take out the worker thread with them; instead they are collected into a multierror. Follow-up commits fixed a test case condition, made sure a multierror is only unwrapped when it actually is a multierror, and addressed review comments.
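As a rough sketch of that approach (not the actual PR code; `multiError`, `exists`, and the channel layout are simplified assumptions), workers report errors on a separate channel and keep draining the work queue, while a collector merges them into a single multierror:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// multiError is a stand-in for the multierror type used in Thanos; the real
// implementation differs, this only illustrates the shape of the fix.
type multiError []error

func (m multiError) Error() string {
	parts := make([]string, 0, len(m))
	for _, err := range m {
		parts = append(parts, err.Error())
	}
	return strings.Join(parts, "; ")
}

// exists stands in for an object-store Exists call that keeps failing.
func exists(name string) (bool, error) {
	return false, errors.New("503 SlowDown: " + name)
}

func main() {
	names := make(chan string)
	errCh := make(chan error)

	// Workers: an error is reported instead of killing the worker, so the
	// work queue never loses its consumers.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				if _, err := exists(name); err != nil {
					errCh <- err
				}
			}
		}()
	}

	// Collector: gathers all worker errors into a single multierror.
	var merr multiError
	var collect sync.WaitGroup
	collect.Add(1)
	go func() {
		defer collect.Done()
		for err := range errCh {
			merr = append(merr, err)
		}
	}()

	// Producer: sends never block for good, because the workers stay alive.
	for i := 0; i < 3; i++ {
		names <- fmt.Sprintf("block-%d/meta.json", i)
	}
	close(names)
	wg.Wait()
	close(errCh)
	collect.Wait()

	if len(merr) > 0 {
		fmt.Println("meta sync failed:", merr.Error())
	}
}
```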
Thanos, Prometheus and Golang version used:
Thanos v0.35.1 086a698
imageTag: bitnami/thanos:0.35.1-debian-12-r1
Object Storage Provider:
Amazon S3
What happened:
I am running the Compactor on a 6 TiB S3 bucket full of raw blocks, expecting it to eventually compact and downsample everything, but after 6 days of operation it got into a deadlock.
ts=2024-07-04T05:45:13.602826109Z caller=compact.go:1488 level=info msg="start sync of metas"
thanos/pkg/block/fetcher.go, line 271 (at commit 086a698)
Full pprof output:
What you expected to happen:
The compaction operation to progress normally.
How to reproduce it (as minimally and precisely as possible):
Simulating a random S3 failure on an Exists call should work, but I have not yet put together an easy repro.
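One way to approximate such a repro could be to wrap the bucket the compactor uses so that Exists fails with some probability. The sketch below is hypothetical and does not use the real thanos-io/objstore package; the `bucket` interface, `flakyBucket`, and `memBucket` types are simplified stand-ins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
)

// bucket is a minimal stand-in for the object-store interface the meta
// fetcher uses; the real thanos-io/objstore Bucket interface is larger.
type bucket interface {
	Exists(ctx context.Context, name string) (bool, error)
}

// flakyBucket wraps another bucket and makes Exists fail with the given
// probability, roughly simulating a burst of S3 503 SlowDown responses.
type flakyBucket struct {
	wrapped  bucket
	failRate float64
}

func (b flakyBucket) Exists(ctx context.Context, name string) (bool, error) {
	if rand.Float64() < b.failRate {
		return false, errors.New("simulated 503 SlowDown")
	}
	return b.wrapped.Exists(ctx, name)
}

// memBucket is a toy in-memory bucket, just enough for this sketch.
type memBucket map[string]struct{}

func (m memBucket) Exists(_ context.Context, name string) (bool, error) {
	_, ok := m[name]
	return ok, nil
}

func main() {
	b := flakyBucket{wrapped: memBucket{"block-1/meta.json": {}}, failRate: 0.5}
	for i := 0; i < 5; i++ {
		ok, err := b.Exists(context.Background(), "block-1/meta.json")
		fmt.Println(ok, err)
	}
}
```

Putting a wrapper like this in front of the real bucket during a meta sync should make the error path reproducible without depending on the object store actually misbehaving.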
Full logs to relevant components:
The last entry in the logs is from about 5 minutes before the deadlock. No log entries were produced as a result of the error.
Anything else we need to know:
I still have the system in the deadlocked state. If you need me to run some pprof commands through the web interface, I still can.
Environment: