[Compactor] panic: unexpected seriesToChunkEncoder lack of iterations #6775
Comments
I also tried with vertical compaction enabled on another environment and am still seeing the same panic.
Is this the same with the newest |
Hi @GiedriusS, upgrading to the latest version didn't resolve the issue.
As suggested on Slack, we added the deduplication function, because in our case applications are scraped by multiple Prometheus instances. This stopped the errors. However, it also seems to have caused problems with compaction, which has now been stuck on a single block for more than 3 days. Current configuration is below
Why is the block stuck? Did you see any errors?
Hey - I've also seen a similar error on 0.32.4
When searching for |
Hi, thanks for all the bug reports. Could someone share the problematic block? I don't have a good way to reproduce this issue locally. Please let me know. You can reach out to me on Slack.
Seeing this panic on Would be happy to provide data if I knew how to find the correct blocks. |
Hey @bison, I think I narrowed this down to Thanos trying to do vertical compaction on already compacted blocks. This could be the case if you have not previously had vertical compaction enabled.

If you want to try a hacky fix, you can try disabling compaction for all the blocks from before you enabled compaction. (That's presuming we have the same issue; it could be something different.)

In compact, look at the logs from just before it crashed. It should have started to compact several blocks. You'll need to mark these, and you might need to do it many times to cover all the blocks that have already been compacted.
Hi @vCra, thanks for the investigation.
That is interesting to know. How did you figure this out? Ideally it shouldn't matter to the compactor whether blocks have already been compacted or not, so it shouldn't panic. Maybe we are missing something.
@vCra wow thanks, that's exactly what's happening. Just upgraded this stack and vertical compaction got enabled where it wasn't before. Now the first time the compactor encounters two previously compacted blocks at 5m resolution, it panics. If I mark the same blocks (and all other similar blocks) with no-compact, then compaction completes.

Edit: Actually I guess it's any previously compacted block. I originally thought it was only at that resolution for some reason.
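To make the workaround above concrete, here is a minimal Go sketch that picks out candidate blocks from their `meta.json` contents. It assumes that "previously compacted" can be approximated by a `compaction.level` greater than 1 in the block's `meta.json`. That heuristic, the `blockMeta` type, the `previouslyCompacted` helper, and the example ULIDs are all hypothetical and not taken from Thanos itself. Marking the selected blocks is left to whatever tooling you already use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blockMeta holds only the meta.json fields this sketch needs.
type blockMeta struct {
	ULID       string `json:"ulid"`
	Compaction struct {
		Level int `json:"level"`
	} `json:"compaction"`
}

// previouslyCompacted returns the ULIDs of blocks whose compaction level is
// above 1. Treating "level > 1" as "previously compacted" is an assumption
// made for illustration, not a rule stated in the thread.
func previouslyCompacted(metas [][]byte) ([]string, error) {
	var ids []string
	for _, raw := range metas {
		var m blockMeta
		if err := json.Unmarshal(raw, &m); err != nil {
			return nil, fmt.Errorf("decode meta.json: %w", err)
		}
		if m.Compaction.Level > 1 {
			ids = append(ids, m.ULID)
		}
	}
	return ids, nil
}

func main() {
	// Hypothetical meta.json contents; real files carry many more fields.
	metas := [][]byte{
		[]byte(`{"ulid":"01EXAMPLEBLOCKA","compaction":{"level":1}}`),
		[]byte(`{"ulid":"01EXAMPLEBLOCKB","compaction":{"level":3}}`),
	}
	ids, err := previouslyCompacted(metas)
	if err != nil {
		panic(err)
	}
	for _, id := range ids {
		fmt.Println("candidate for no-compact mark:", id)
	}
}
```

Reading the level from `meta.json` avoids guessing from block age, but it only narrows the candidates. The compactor logs from just before the crash remain the more direct source for the affected block IDs.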
I'm only guessing that this is the issue. The compactor kept crashing, and I noticed that we were managing to vertically compact all the new blocks without issue, but the old blocks were not getting vertically compacted; in bucket-web it was quite clear. Looking at bucket-web, we still have the old blocks, just not vertically compacted. We don't care too much, as we won't use this data too frequently (10 is with vertical compaction). The discussion in https://cloud-native.slack.com/archives/CK5RSSC10/p1681966324787459 helped too.
I spotted this in prod. Looking into it 👁️ |
For #6775, it would be useful to know the exact block IDs to aid debugging. Signed-off-by: Giedrius Statkevičius <[email protected]>
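The idea in this commit message, recovering from a panic so the affected block IDs can be reported, can be sketched in Go roughly as follows. This is not the code from the referenced change. `compactGroup` and its callback signature are hypothetical names chosen for illustration, and the example ULIDs are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// compactGroup runs compact over the given block IDs. If compact panics,
// the deferred function recovers and returns an error naming those blocks
// instead of letting the process crash without that context.
func compactGroup(blockIDs []string, compact func([]string) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic while compacting blocks %v: %v", blockIDs, r)
		}
	}()
	return compact(blockIDs)
}

func main() {
	ids := []string{"01EXAMPLEBLOCKA", "01EXAMPLEBLOCKB"}
	err := compactGroup(ids, func([]string) error {
		// Stand-in for the real compaction step, which panicked in this issue.
		panic(errors.New("unexpected seriesToChunkEncoder lack of iterations"))
	})
	fmt.Println(err)
}
```

The named return value is what lets the deferred function replace the result after a panic. Without it, the recovered value could be logged but not returned to the caller.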
Adding a minimal test case for issue #6775 - reproduces the panic in the compactor. Signed-off-by: Giedrius Statkevičius <[email protected]>
* compact: recover from panics (#7318)
  For #6775, it would be useful to know the exact block IDs to aid debugging.
  Signed-off-by: Giedrius Statkevičius <[email protected]>
* Sidecar: wait for prometheus on startup (#7323)
  Signed-off-by: Michael Hoffmann <[email protected]>
* Receive: fix serverAsClient.Series goroutines leak (#6948)
  * fix serverAsClient goroutines leak
  * fix lint
  * update changelog
  * delete invalid comment
  * remove temp dev test
  * remove timer channel drain
  Signed-off-by: Thibault Mange <[email protected]>
* Receive: fix stats (#7373)
  If we account stats for remote write and local writes we will count them twice since the remote write will be counted locally again by the remote receiver instance.
  Signed-off-by: Michael Hoffmann <[email protected]>
* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline (#7382)
  * small fix
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Query: dont pass query hints to avoid triggering pushdown (#7392)
  If we have a new querier it will create query hints even without the pushdown feature being present anymore. Old sidecars will then trigger query pushdown which leads to broken max,min,max_over_time and min_over_time.
  Signed-off-by: Michael Hoffmann <[email protected]>
* Cut patch release v0.35.1
  Signed-off-by: Saswata Mukherjee <[email protected]>

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: Thibault Mange <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Co-authored-by: Giedrius Statkevičius <[email protected]>
Co-authored-by: Michael Hoffmann <[email protected]>
Co-authored-by: Thibault Mange <[email protected]>
Thanos, Prometheus and Golang version used:
Object Storage Provider: S3
What happened:
Thanos compact throws
panic: unexpected seriesToChunkEncoder lack of iterations
and exits.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components:
Anything else we need to know: