
thanos-compact exited 408 Request Time-out #3878

Closed
rrraditya opened this issue Mar 5, 2021 · 10 comments
@rrraditya

/bin/thanos --version
thanos, version 0.17.2 (branch: HEAD, revision: 37e6ef61566c7c70793ba6d128f00c4c66cb2402)
  build user:       root@92283ccb0bc0
  build date:       20201208-10:00:57
  go version:       go1.15
  platform:         linux/amd64

/usr/local/bin/prometheus --version
prometheus, version 2.19.1 (branch: HEAD, revision: eba3fdcbf0d378b66600281903e3aab515732b39)
  build user:       root@62700b3d0ef9
  build date:       20200618-16:35:26
  go version:       go1.14.4

Component: Thanos Compact
Object Storage Provider: OpenIO

What happened:
Our compact component suddenly shut down when it received a 408 Request Time-out error code. The relevant log is attached below. What we would like to know is: what causes this error, why does it kill the thanos-compact service, and is the error critical enough that the compactor has to be terminated?

What you expected to happen:
thanos-compact should not shut down.

How to reproduce it (as minimally and precisely as possible):
These are our thanos compact flags:

/bin/thanos compact \
        --data-dir=/var/lib/prometheus-compact/ \
        --objstore.config-file=/etc/prometheus/bucket.yml \
        --wait \
        --wait-interval=5m \
        --retention.resolution-raw=90d \
        --retention.resolution-5m=180d \
        --retention.resolution-1h=400d \
        --log.level=info

Full logs to relevant components:

Logs

Mar  5 03:21:16 temng01thanos01 thanos: level=error ts=2021-03-04T20:21:16.546888255Z caller=compact.go:434 msg="retriable error" err="syncing metas: incomplete view: meta.json file exists: 01EVDGQTW64A7QDNX4KG2975RF/meta.json: stat s3 object: 408 Request Time-out"
Mar  5 03:21:16 temng01thanos01 thanos: level=warn ts=2021-03-04T20:21:16.546967953Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
Mar  5 03:21:16 temng01thanos01 thanos: level=info ts=2021-03-04T20:21:16.546986005Z caller=http.go:65 service=http/server component=compact msg="internal server is shutting down" err=null
Mar  5 03:21:17 temng01thanos01 thanos: level=info ts=2021-03-04T20:21:17.047181015Z caller=http.go:84 service=http/server component=compact msg="internal server is shutdown gracefully" err=null
Mar  5 03:21:17 temng01thanos01 thanos: level=info ts=2021-03-04T20:21:17.047299537Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null
Mar  5 03:21:17 temng01thanos01 thanos: level=info ts=2021-03-04T20:21:17.047379923Z caller=main.go:160 msg=exiting2

Anything else we need to know:
Currently we're applying a workaround in the systemd unit file, setting Restart=on-failure to avoid this issue (a minimal sketch of the override is shown below). We're still monitoring it.
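
For reference, a minimal sketch of the systemd drop-in for this workaround (the unit name, file path, and RestartSec value are illustrative):

  # /etc/systemd/system/thanos-compact.service.d/restart.conf
  [Service]
  Restart=on-failure
  RestartSec=30s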

Environment:

  • OS (e.g. from /etc/os-release): CentOS Linux release 7.8.2003 (Core)
  • Kernel (e.g. uname -a): Linux temng01thanos01 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
@rrraditya
Author

Hi all,

Can anyone give me some insights into this issue?

Thank you.

@GiedriusS
Member

The newest RC version is able to calculate checksums of files, so in such a case the Thanos Compactor wouldn't have to re-download all files and it wouldn't cause any issues. Also, we are working on adding retries to the objstore clients. Does this help? 🤗

@bwplotka
Member

Thanks for reporting. It looks like a server-side timeout? I wonder what the reason could be; is there any server-specific issue? I would check your object storage documentation for why a 408 error occurs.

@GiedriusS It's the meta sync operation, so checksums won't help, I think?
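
For reference, a minimal bucket.yml sketch, assuming the OpenIO endpoint is S3-compatible and accessed through the Thanos S3 objstore client (the endpoint, credentials, and timeout values are illustrative); the http_config block is where the client-side HTTP timeouts can be tuned:

  type: S3
  config:
    bucket: thanos
    endpoint: openio.example.local:6007
    access_key: <access-key>
    secret_key: <secret-key>
    insecure: true
    http_config:
      idle_conn_timeout: 90s
      response_header_timeout: 2m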

@rrraditya
Author

Hi @GiedriusS and @bwplotka, thank you for your insights.
It seems that the issue is purely due to a network connection timeout, either because of an intermittent network connection or failing DNS resolution. Either way, what I would like to know is the compactor's behavior in response to this incident.

Why does it shut down the service on its own? Is there any flag that I'm missing in order to avoid this behavior?

@rrraditya
Author

Hi @bwplotka, is there any update you can share?

@rrraditya
Author

Hi @bwplotka, any more insights you can share?

@rrraditya
Author

Hi @bwplotka,
I would like to follow up on this ticket whenever you're available.
Thank you!

@rrraditya
Author

Hi @bwplotka, any more insights you can share?

@stale

stale bot commented Jun 28, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 28, 2021
@stale

stale bot commented Jul 13, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jul 13, 2021