Compactor: Deadlock on S3 error during meta sync #7514
Comments
When using
@palamvmw In general that would help getting unstuck, at the very least. I am trying to wrap my head around what happened in the first place, and I believe this is caused by a large number of errors "clogging up" the work queue. Essentially, any Exists call that throws an error takes out one of the worker threads with it, and there is no check for all of them going away, so when we reach the 64th Exists error everything just stops: no more workers picking up items, and no more items pushed to the channel. Since we are not on a timed context, we also never get ctx.Done().

I would think that if a worker errors out, we should start a new one in its place (or just not stop it in the first place, and send errors to a different channel to be collected). I am also not quite sure that this line is correct.

We have had S3 503 SlowDown responses previously, so it is not at all unlikely that the reason we clogged up the 64 workers is that all of a sudden all of them started getting 503 SlowDown responses from AWS.
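To illustrate the suspected failure mode described above, here is a minimal, self-contained Go sketch. It is not the actual Thanos fetcher code: `exists`, the channel layout, and the worker count are stand-ins assumed for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const concurrency = 64

// exists stands in for an object-store Exists call that suddenly starts
// failing, e.g. with S3 "503 SlowDown" responses.
func exists(name string) (bool, error) {
	return false, errors.New("503 SlowDown")
}

func main() {
	ch := make(chan string)

	// 64 workers, each of which returns on the first Exists error and is
	// never replaced.
	for i := 0; i < concurrency; i++ {
		go func() {
			for name := range ch {
				if _, err := exists(name); err != nil {
					return // worker dies with the error
				}
			}
		}()
	}

	finished := make(chan struct{})
	go func() {
		defer close(finished)
		// Producer: once the 64th worker has died there is no receiver left,
		// so this send blocks forever; with no deadline on the context there
		// is no ctx.Done() to break the stalemate either.
		for i := 0; i < 1_000_000; i++ {
			ch <- fmt.Sprintf("block-%d/meta.json", i)
		}
		close(ch)
	}()

	select {
	case <-finished:
		fmt.Println("producer finished (only happens if Exists stops failing)")
	case <-time.After(2 * time.Second):
		fmt.Println("all workers exited on errors; producer is stuck on send: deadlock")
	}
}
```

With an unbuffered channel, the first 64 sends each hand one item to a worker that then dies; the 65th send has no receiver left, so the producer blocks indefinitely and the sync never completes.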
I think I am facing the same issue, but on the Azure object store, so I tend to agree with the comment on your draft PR.
I believe I just hit the same issue on GCP using v0.36.1. The compactor suddenly just stopped doing any work after completing a delete cycle. It was still responding to metric scrapes.
Fix: errors no longer take out the worker thread with them; instead they are collected into a multierror. Follow-up commits fixed a test case condition, made sure a multierror is only unwrapped when it actually is a multierror, and addressed review comments.
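As a rough sketch of that approach (not the actual PR code; `multiError`, `exists`, and the channel layout are simplified assumptions), workers report errors on a separate channel and keep draining the work queue, while a collector merges them into a single multierror:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// multiError is a stand-in for the multierror type used in Thanos; the real
// implementation differs, this only illustrates the shape of the fix.
type multiError []error

func (m multiError) Error() string {
	parts := make([]string, 0, len(m))
	for _, err := range m {
		parts = append(parts, err.Error())
	}
	return strings.Join(parts, "; ")
}

// exists stands in for an object-store Exists call that keeps failing.
func exists(name string) (bool, error) {
	return false, errors.New("503 SlowDown: " + name)
}

func main() {
	names := make(chan string)
	errCh := make(chan error)

	// Workers: an error is reported instead of killing the worker, so the
	// work queue never loses its consumers.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				if _, err := exists(name); err != nil {
					errCh <- err
				}
			}
		}()
	}

	// Collector: gathers all worker errors into a single multierror.
	var merr multiError
	var collect sync.WaitGroup
	collect.Add(1)
	go func() {
		defer collect.Done()
		for err := range errCh {
			merr = append(merr, err)
		}
	}()

	// Producer: sends never block for good, because the workers stay alive.
	for i := 0; i < 3; i++ {
		names <- fmt.Sprintf("block-%d/meta.json", i)
	}
	close(names)
	wg.Wait()
	close(errCh)
	collect.Wait()

	if len(merr) > 0 {
		fmt.Println("meta sync failed:", merr.Error())
	}
}
```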
Thanos, Prometheus and Golang version used:
Thanos v0.35.1 086a698
imageTag: bitnami/thanos:0.35.1-debian-12-r1
Object Storage Provider:
Amazon S3
What happened:
I am running the Compactor on a 6 TiB S3 bucket full of raw blocks, expecting it to eventually compact and downsample everything, but after 6 days of operation it got into a deadlock.
ts=2024-07-04T05:45:13.602826109Z caller=compact.go:1488 level=info msg="start sync of metas"
thanos/pkg/block/fetcher.go, line 271 (at commit 086a698)
Full pprof output:
What you expected to happen:
The compaction operation to progress normally.
How to reproduce it (as minimally and precisely as possible):
Simulating a random S3 failure on an Exists call should work, but I have not yet put together an easy repro.
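One way to approximate such a repro could be to wrap the bucket the compactor uses so that Exists fails with some probability. The sketch below is hypothetical and does not use the real thanos-io/objstore package; the `bucket` interface, `flakyBucket`, and `memBucket` types are simplified stand-ins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
)

// bucket is a minimal stand-in for the object-store interface the meta
// fetcher uses; the real thanos-io/objstore Bucket interface is larger.
type bucket interface {
	Exists(ctx context.Context, name string) (bool, error)
}

// flakyBucket wraps another bucket and makes Exists fail with the given
// probability, roughly simulating a burst of S3 503 SlowDown responses.
type flakyBucket struct {
	wrapped  bucket
	failRate float64
}

func (b flakyBucket) Exists(ctx context.Context, name string) (bool, error) {
	if rand.Float64() < b.failRate {
		return false, errors.New("simulated 503 SlowDown")
	}
	return b.wrapped.Exists(ctx, name)
}

// memBucket is a toy in-memory bucket, just enough for this sketch.
type memBucket map[string]struct{}

func (m memBucket) Exists(_ context.Context, name string) (bool, error) {
	_, ok := m[name]
	return ok, nil
}

func main() {
	b := flakyBucket{wrapped: memBucket{"block-1/meta.json": {}}, failRate: 0.5}
	for i := 0; i < 5; i++ {
		ok, err := b.Exists(context.Background(), "block-1/meta.json")
		fmt.Println(ok, err)
	}
}
```

Putting a wrapper like this in front of the real bucket during a meta sync should make the error path reproducible without depending on the object store actually misbehaving.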
Full logs to relevant components:
The last entry in the logs is from about 5 minutes before the deadlock. No log entries were produced as a result of the error.
Anything else we need to know:
I still have the system in the deadlocked state. If you need me to run some pprof commands through the web interface, I still can.
Environment: