
Thanos receive running on ext4 FS experiencing compaction failures #7455

Closed

abelsimonn opened this issue Jun 13, 2024 · 10 comments

Comments

@abelsimonn
Contributor

Thanos version:
0.35.0

Object Storage Provider:
Azure blob. For receivers, azure disk

What happened:
Seemingly at random, head block compaction fails, leading to never-ending head accumulation and eventually an OOM kill.

The error itself is unrecoverable, leaving the container in a crash loop and OOMing; the only solution is to delete the WAL.

How to reproduce it (as minimally and precisely as possible):
Every time it reproduces, it's the same error with a different metric. It seems like a duplicate series slips through validation and then blocks later compaction.

Full logs to relevant components:

"caller":"db.go:1014","component":"multi-tsdb","err":"add series: out-of-order series added with label set <sets of labels> ","level":"error","msg":"compaction failed","tenant":"default-tenant","ts":"2024-06-13T07:25:59.212754459Z"}

[screenshot attached]

Environment:

  • OS: 6.1.77-flatcar

  • Kernel (e.g. uname -a): Linux thanos-receive-11 6.1.77-flatcar #1 SMP PREEMPT_DYNAMIC Mon Feb 12 19:37:08 -00 2024 x86_64 GNU/Linux

  • Others:
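
For reference on where this error originates: the "out-of-order series added with label set ..." message matches the ordering check in the Prometheus TSDB index writer, which requires series to be written in strictly increasing label-set order when a block is built during head compaction. The Go sketch below only illustrates that invariant (the type and method names are made up, not the real Thanos/Prometheus code path); a duplicate label set that slipped past head validation would fail it in exactly this way.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
)

// indexSketch mimics the ordering invariant enforced when series are written
// into a block index: each label set must sort strictly after the previous
// one. The type and method are illustrative only.
type indexSketch struct {
	last    labels.Labels
	hasLast bool
}

func (w *indexSketch) addSeries(lset labels.Labels) error {
	if w.hasLast && labels.Compare(lset, w.last) <= 0 {
		// This is the condition behind the error seen in the logs above.
		return fmt.Errorf("add series: out-of-order series added with label set %q", lset)
	}
	w.last = lset
	w.hasLast = true
	return nil
}

func main() {
	w := &indexSketch{}
	s := labels.FromStrings("__name__", "up", "job", "kubelet")
	_ = w.addSeries(s)
	// Writing the same (or a lexicographically smaller) label set again fails,
	// which is what a duplicate series slipping through validation looks like.
	if err := w.addSeries(s); err != nil {
		fmt.Println(err)
	}
}
```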

@douglascamata
Contributor

Do you have out-of-order blocks enabled in your setup? I don't think the support for out-of-order blocks is stable yet.
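
"Enabled" here boils down to a time window that Receive passes to the embedded Prometheus TSDB per tenant; a zero window keeps the feature off. A minimal Go sketch of the underlying TSDB option (the field name is from github.com/prometheus/prometheus/tsdb, the 30m value is just an example, and the Receive flag wiring itself is not shown):

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/prometheus/tsdb"
)

func main() {
	// Out-of-order ingestion is governed by a time window (in milliseconds)
	// on the TSDB options; 0 means the feature is off. Receive forwards its
	// out-of-order setting into these options for each tenant's TSDB.
	opts := tsdb.DefaultOptions()
	opts.OutOfOrderTimeWindow = int64(30 * time.Minute / time.Millisecond) // example value only

	fmt.Printf("out-of-order window: %d ms (0 disables the feature)\n", opts.OutOfOrderTimeWindow)
}
```

If memory serves, the corresponding Receive flag is --tsdb.out-of-order.time-window, but double-check that against the flags of your Thanos version.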

@abelsimonn
Contributor Author

Do you have out-of-order blocks enabled in your setup? I don't think the support for out-of-order blocks is stable yet.

I did have it enabled early this week.

When this screenshot was taken it was already disabled, but this might have come from an env/context where the changes had not propagated yet (or at least I'm hoping so :) ).

We do have alerts on high head series.

Will monitor those and update the issue if it reproduces. Hopefully it was just that :)

@douglascamata
Contributor

douglascamata commented Jun 17, 2024

@hanem100k you will have to delete the out-of-order blocks manually if they made it to object storage. Otherwise, every time the compactor sees them it might have issues. I'm not sure if the Compactor can gracefully skip out-of-order blocks.
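
One way to locate such blocks, sketched below under the assumption that Prometheus marks blocks compacted from the out-of-order head with a "from-out-of-order" compaction hint in their meta.json: scan block directories (a receiver's local data dir, or blocks synced down from the bucket) and print the ULIDs carrying that hint. This is an illustration, not a supported Thanos tool, and removing blocks from object storage should still be done the usual way (deletion marks rather than deleting objects by hand).

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// meta mirrors only the fields of a block's meta.json that we need here.
type meta struct {
	ULID       string `json:"ulid"`
	Compaction struct {
		Hints []string `json:"hints"`
	} `json:"compaction"`
}

// listOOOBlocks scans a directory of TSDB blocks and prints the ULIDs of
// blocks whose meta.json carries the "from-out-of-order" compaction hint.
func listOOOBlocks(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(dir, e.Name(), "meta.json"))
		if err != nil {
			continue // not a block dir (e.g. wal/, chunks_head/)
		}
		var m meta
		if err := json.Unmarshal(raw, &m); err != nil {
			continue
		}
		for _, h := range m.Compaction.Hints {
			if h == "from-out-of-order" {
				fmt.Println(m.ULID)
			}
		}
	}
	return nil
}

func main() {
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	if err := listOOOBlocks(dir); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```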

@abelsimonn
Contributor Author

I only had 6 hours of retention on receivers, so the old blocks were simply deleted. I'm not sure whether they would cause downsampling issues once in object storage.

Either way, the good news is that I haven't seen this reoccur in the past week across many environments.

Closing the issue, thanks for your time and for taking a look!

@jkroepke

Hi @douglascamata, I'm experiencing the same issue:

Thanos version:
0.35.1

Object Storage Provider:
Azure blob. For receivers, azure disk

{"caller":"db.go:1014","component":"multi-tsdb","err":"add series: out-of-order series added with label set \"{__name__=\\\"aggregator_discovery_aggregation_count_total\\\", cluster=\\\"opsstack\\\", endpoint=\\\"https-metrics\\\", instance=\\\"10.0.240.10:10250\\\", job=\\\"kubelet\\\", metrics_path=\\\"/metrics\\\", namespace=\\\"opsstack\\\", node=\\\"aks-opsstack-40597296-vmss000001\\\", prometheus=\\\"opsstack/opsstack-prom-stack-prometheus\\\", prometheus_replica=\\\"prometheus-opsstack-prom-stack-prometheus-0\\\", service=\\\"opsstack-prom-stack-kubelet\\\"}\"","level":"error","msg":"compaction failed","tenant":"opsstack","ts":"2024-06-19T22:04:59.161392545Z"}

Each error has the exact same set of labels.

In our case, receive does not OOM thanks to high memory limits, but I have been seeing what looks like a memory leak since the error started occurring:

[screenshot attached]

The error has been appearing since a restart of the pod; the version (v0.35.1) has not changed.


In our case, we have out-of-order enabled.

I guess deleting something might help (but what exactly?), and it could just appear again at any time?

@douglascamata
Contributor

@jkroepke out-of-order is not stable in Thanos, it's still experimental. There might be known and unknown rough edges and bugs. We do not recommend turning it on in production.

@jkroepke

There might be known and unknown rough edges and bugs.

That's fine, but a bug report is still fine, right? Or do you intend to close all bugs because the feature is experimental?

@douglascamata
Contributor

That's fine, but a bug report is still fine, right? Or do you intend to close all bugs because the feature is experimental?

What are you implying with these questions? Did I say a bug report is not fine? Did I say this should be closed? Did I close it myself?

What I said is: if this feature causes you trouble, disable it. It's experimental, not stable, and potentially buggy. I didn't say anything else.

@jkroepke

jkroepke commented Jun 20, 2024

Did I say a bug report is not fine?

I had at least that feeling. Like: thanks for the report, please disable that feature. Feels like a denial.

@douglascamata
Contributor

A denial would be me closing the issue, which I didn't do. The author closed it themselves. Me saying "thanks for the report, please disable that feature to avoid issues while it's experimental" is 100% fine. I'm a triager and contributor. Unfortunately I don't know enough about out-of-order to contribute a fix.

So I'm doing some triage and "thanks for the report, please disable that feature to avoid issues while it's experimental" is all I can do as a triager.
