
Thanos receive running on ext4 FS experiencing compaction failures #7455

Closed

abelsimonn opened this issue Jun 13, 2024 · 10 comments

Comments

@abelsimonn
Contributor

Thanos version:
0.35.0

Object Storage Provider:
Azure blob. For receivers, azure disk

What happened:
Seemingly at random, head block compaction fails, leading to never-ending head accumulation and eventually an OOM kill.

The error itself is unrecoverable, leaving the container in a crash loop and OOMing; the only solution is to delete the WAL.

How to reproduce it (as minimally and precisely as possible):
Every time it reproduces, it's the same error with a different metric. It seems like a duplicate series slips through validation and then blocks later compaction.

Full logs to relevant components:

"caller":"db.go:1014","component":"multi-tsdb","err":"add series: out-of-order series added with label set <sets of labels> ","level":"error","msg":"compaction failed","tenant":"default-tenant","ts":"2024-06-13T07:25:59.212754459Z"}

[screenshot attached]

Environment:

  • OS: 6.1.77-flatcar

  • Kernel (e.g. uname -a): Linux thanos-receive-11 6.1.77-flatcar #1 SMP PREEMPT_DYNAMIC Mon Feb 12 19:37:08 -00 2024 x86_64 GNU/Linux

  • Others:
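
For reference on where this error originates: the "out-of-order series added with label set ..." message matches the ordering check in the Prometheus TSDB index writer, which requires series to be written in strictly increasing label-set order when a block is built during head compaction. The Go sketch below only illustrates that invariant (the type and method names are made up, not the real Thanos/Prometheus code path); a duplicate label set that slipped past head validation would fail it in exactly this way.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

// indexSketch mimics the ordering invariant enforced when series are written
// into a block index: each label set must sort strictly after the previous
// one. The type and method are illustrative only.
type indexSketch struct {
	last    labels.Labels
	hasLast bool
}

func (w *indexSketch) addSeries(lset labels.Labels) error {
	if w.hasLast && labels.Compare(lset, w.last) <= 0 {
		// This is the condition behind the error seen in the logs above.
		return fmt.Errorf("add series: out-of-order series added with label set %q", lset)
	}
	w.last = lset
	w.hasLast = true
	return nil
}

func main() {
	w := &indexSketch{}
	s := labels.FromStrings("__name__", "up", "job", "kubelet")
	_ = w.addSeries(s)
	// Writing the same (or a lexicographically smaller) label set again fails,
	// which is what a duplicate series slipping through validation looks like.
	if err := w.addSeries(s); err != nil {
		fmt.Println(err)
	}
}
```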

@douglascamata
Contributor

Do you have out-of-order blocks enabled in your setup? I don't think the support for out-of-order blocks is stable yet.
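
"Enabled" here boils down to a time window that Receive passes to the embedded Prometheus TSDB per tenant; a zero window keeps the feature off. A minimal Go sketch of the underlying TSDB option (the field name is from github.com/prometheus/prometheus/tsdb, the 30m value is just an example, and the Receive flag wiring itself is not shown):

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/prometheus/tsdb"
)

func main() {
	// Out-of-order ingestion is governed by a time window (in milliseconds)
	// on the TSDB options; 0 means the feature is off. Receive forwards its
	// out-of-order setting into these options for each tenant's TSDB.
	opts := tsdb.DefaultOptions()
	opts.OutOfOrderTimeWindow = int64(30 * time.Minute / time.Millisecond) // example value only

	fmt.Printf("out-of-order window: %d ms (0 disables the feature)\n", opts.OutOfOrderTimeWindow)
}
```

If memory serves, the corresponding Receive flag is --tsdb.out-of-order.time-window, but double-check that against the flags of your Thanos version.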

@abelsimonn
Contributor Author

Do you have out-of-order blocks enabled in your setup? I don't think the support for out-of-order blocks is stable yet.

I did have it enabled early this week.

When this screenshot was taken it was already disabled, but this might have come from an env/context where the changes had not propagated yet (or at least I'm hoping so :) ).

We do have alerts on high head series.

Will monitor those and update the issue if it reproduces. Hopefully it was just that :)

@douglascamata
Contributor

douglascamata commented Jun 17, 2024

@hanem100k you will have to delete the out-of-order blocks manually if they made it to object storage. Otherwise, every time the compactor sees them it might have issues. I'm not sure if the Compactor can gracefully skip out-of-order blocks.
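
One way to locate such blocks, sketched below under the assumption that Prometheus marks blocks compacted from the out-of-order head with a "from-out-of-order" compaction hint in their meta.json: scan block directories (a receiver's local data dir, or blocks synced down from the bucket) and print the ULIDs carrying that hint. This is an illustration, not a supported Thanos tool, and removing blocks from object storage should still be done the usual way (deletion marks rather than deleting objects by hand).

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// meta mirrors only the fields of a block's meta.json that we need here.
type meta struct {
	ULID       string `json:"ulid"`
	Compaction struct {
		Hints []string `json:"hints"`
	} `json:"compaction"`
}

// listOOOBlocks scans a directory of TSDB blocks and prints the ULIDs of
// blocks whose meta.json carries the "from-out-of-order" compaction hint.
func listOOOBlocks(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(dir, e.Name(), "meta.json"))
		if err != nil {
			continue // not a block dir (e.g. wal/, chunks_head/)
		}
		var m meta
		if err := json.Unmarshal(raw, &m); err != nil {
			continue
		}
		for _, h := range m.Compaction.Hints {
			if h == "from-out-of-order" {
				fmt.Println(m.ULID)
			}
		}
	}
	return nil
}

func main() {
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	if err := listOOOBlocks(dir); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```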

@abelsimonn
Contributor Author

I only had 6 hours of retention on receivers, so the old blocks were simply deleted. I'm not sure whether they would cause downsampling issues once in object storage.

Either way, the good news is that I haven't seen this reoccur in the past week across many environments.

Closing the issue, thanks for your time and for taking a look!

@jkroepke

Hi @douglascamata, I'm experiencing the same issue:

Thanos version:
0.35.1

Object Storage Provider:
Azure blob. For receivers, azure disk

{"caller":"db.go:1014","component":"multi-tsdb","err":"add series: out-of-order series added with label set \"{__name__=\\\"aggregator_discovery_aggregation_count_total\\\", cluster=\\\"opsstack\\\", endpoint=\\\"https-metrics\\\", instance=\\\"10.0.240.10:10250\\\", job=\\\"kubelet\\\", metrics_path=\\\"/metrics\\\", namespace=\\\"opsstack\\\", node=\\\"aks-opsstack-40597296-vmss000001\\\", prometheus=\\\"opsstack/opsstack-prom-stack-prometheus\\\", prometheus_replica=\\\"prometheus-opsstack-prom-stack-prometheus-0\\\", service=\\\"opsstack-prom-stack-kubelet\\\"}\"","level":"error","msg":"compaction failed","tenant":"opsstack","ts":"2024-06-19T22:04:59.161392545Z"}

Each error has the exact same set of labels.

In our case, receive does not OOM thanks to high memory limits, but I have been seeing what looks like a memory leak since the error started occurring:

[screenshot attached]

The error has been appearing since a restart of the pod; the version (v0.35.1) has not changed.


In our case, we have out-of-order enabled.

I guess deleting something might help (but what exactly?), and it could just appear again at any time?

@douglascamata
Contributor

@jkroepke out-of-order is not stable in Thanos, it's still experimental. There might be known and unknown rough edges and bugs. We do not recommend turning it on in production.

@jkroepke

There might be known and unknown rough edges and bugs.

That's fine, but a bug report is still fine, right? Or do you intend to close all bugs because the feature is experimental?

@douglascamata
Contributor

That's fine, but a bug report is still fine, right? Or do you intend to close all bugs because the feature is experimental?

What are you implying with these questions? Did I say a bug report is not fine? Did I say this should be closed? Did I close it myself?

What I said is: if this feature causes you trouble, disable it. It's experimental, not stable, and potentially buggy. I didn't say anything else.

@jkroepke

jkroepke commented Jun 20, 2024

Did I say a bug report is not fine?

I had at least that feeling. Like: thanks for the report, please disable that feature. Feels like a denial.

@douglascamata
Contributor

A denial would be me closing the issue, which I didn't do. The author closed it themselves. Me saying "thanks for the report, please disable that feature to avoid issues while it's experimental" is 100% fine. I'm a triager and contributor. Unfortunately I don't know enough about out-of-order to contribute a fix.

So I'm doing some triage and "thanks for the report, please disable that feature to avoid issues while it's experimental" is all I can do as a triager.
