Thanos, Prometheus and Golang version used:
thanos, version 0.32.3 (branch: HEAD, revision: 3d98d7ce7a254b893e4c8ee8122f7f6edd3174bd)
build user: root@0b3c549e9dae
build date: 20230920-07:27:32
go version: go1.20.8
platform: linux/amd64
tags: netgo
Object Storage Provider:
AWS S3
What happened:
After upgrading from 0.31.0 to 0.35.0 we saw greatly increased sidecar memory usage and narrowed it down to a change between 0.32.2 and 0.32.3 (possibly the Prometheus update).
Memory usage shoots up for certain queries, in our case most likely the recording rules evaluated by the ruler, so we observed constantly high usage.
What you expected to happen:
No significant change in memory usage.
How to reproduce it (as minimally and precisely as possible):
Run `{job=".+"}` on Prometheus with some metrics for either version and compare memory usage.
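For reference, a minimal sketch (not from the report) of how such a query could be fired at a query endpoint backed by the sidecar, so sidecar memory can be compared between 0.32.2 and 0.32.3. The URL/port is a placeholder for your setup, and the regex matcher form (`=~`) is my assumption about what is meant to select every series:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Hypothetical endpoint; point this at your Thanos Query (or Prometheus) instance.
	base := "http://thanos-query.example:10902/api/v1/query"

	// The heavy matcher from the report; the regex form presumably selects all series
	// and triggers the large response that the sidecar now sorts in memory.
	q := url.Values{}
	q.Set("query", `{job=~".+"}`)

	resp, err := http.Get(base + "?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Drain the body so the query runs to completion, then report its size.
	n, _ := io.Copy(io.Discard, resp.Body)
	fmt.Printf("status=%s, body=%d bytes\n", resp.Status, n)
}
```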
Full logs to relevant components:
Anything else we need to know:
Heap profiles for 0.32.2 and 0.32.3 with the same query on the same Prometheus node:
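In case it helps anyone reproduce the comparison, here is a rough sketch of how such profiles could be collected. It assumes the pprof handlers on the sidecar's HTTP (not gRPC) port, 10902 by default, and the host name and file names are made up:

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Placeholder address for the sidecar's HTTP port.
	const profileURL = "http://prometheus-0.example:10902/debug/pprof/heap"

	// Pass the sidecar version as an argument so the files can be told apart,
	// e.g. `go run . 0.32.2` and `go run . 0.32.3`.
	version := "unknown"
	if len(os.Args) > 1 {
		version = os.Args[1]
	}

	resp, err := http.Get(profileURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Save the gzipped protobuf heap profile to disk for later comparison.
	out, err := os.Create("heap-" + version + ".pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```

The two files can then be compared with `go tool pprof -base heap-0.32.2.pb.gz heap-0.32.3.pb.gz` to see where the extra allocations come from.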
I think it's a consequence of #6706. We had to fix a correctness bug and, as a consequence, responses need to be sorted in memory before being sent off. Unfortunately, Prometheus sometimes produces an unsorted response, and that needs to be fixed upstream, or the external labels functionality has to be completely reworked. See prometheus/prometheus#12605.
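A simplified, self-contained illustration of that interaction as I understand it (not Thanos's actual code; label names and values are made up): when an external label overrides a label the series were originally ordered by, the response Prometheus produced stops being sorted, so it has to be buffered and re-sorted in memory before it can be sent off.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// series is a label set; render gives its canonical, name-sorted form,
// e.g. "env=prod,instance=b", which we compare lexicographically as a
// stand-in for real label-set ordering.
type series map[string]string

func render(s series) string {
	names := make([]string, 0, len(s))
	for n := range s {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, n+"="+s[n])
	}
	return strings.Join(parts, ",")
}

func main() {
	// Series in the order Prometheus returns them: sorted by their full label sets
	// ("prod" < "stage", so this order is correct).
	resp := []series{
		{"env": "prod", "instance": "b"},
		{"env": "stage", "instance": "a"},
	}

	// Hypothetical external label that collides with an existing label name.
	for _, s := range resp {
		s["env"] = "eu-1"
	}

	// Both series now start with env=eu-1, so ordering depends on `instance`,
	// and the original order (instance=b before instance=a) is no longer sorted.
	sorted := sort.SliceIsSorted(resp, func(i, j int) bool {
		return render(resp[i]) < render(resp[j])
	})
	fmt.Println("still sorted after external labels:", sorted) // prints: false
}
```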
Ouch, I see. Upgrading in environments like Kubernetes comes with a considerable new risk of OOMs for pods running Prometheus with a Thanos sidecar, because it gets really hard to estimate the maximum memory requirements for the sidecar containers 🤔