
thanos-compact exited 408 Request Time-out #3878

Closed
rrraditya opened this issue Mar 5, 2021 · 10 comments
@rrraditya

/bin/thanos --version
thanos, version 0.17.2 (branch: HEAD, revision: 37e6ef61566c7c70793ba6d128f00c4c66cb2402)
  build user:       root@92283ccb0bc0
  build date:       20201208-10:00:57
  go version:       go1.15
  platform:         linux/amd64

/usr/local/bin/prometheus --version
prometheus, version 2.19.1 (branch: HEAD, revision: eba3fdcbf0d378b66600281903e3aab515732b39)
  build user:       root@62700b3d0ef9
  build date:       20200618-16:35:26
  go version:       go1.14.4

Component: Thanos Compact
Object Storage Provider: OpenIO

What happened:
Our compact component suddenly shut down when it received a 408 Request Time-out error code. The relevant log is attached below. What we would like to know is: what causes this error, why does it kill the thanos-compact service, and is the error critical enough that the compactor has to be terminated?

What you expected to happen:
thanos-compact should not shut down.

How to reproduce it (as minimally and precisely as possible):
These are our thanos compact flags:

/bin/thanos compact \
        --data-dir=/var/lib/prometheus-compact/ \
        --objstore.config-file=/etc/prometheus/bucket.yml \
        --wait \
        --wait-interval=5m \
        --retention.resolution-raw=90d \
        --retention.resolution-5m=180d \
        --retention.resolution-1h=400d \
        --log.level=info

Full logs to relevant components:

Logs

Mar  5 03:21:16 temng01thanos01 thanos: level=error ts=2021-03-04T20:21:16.546888255Z caller=compact.go:434 msg="retriable error" err="syncing metas: incomplete view: meta.json file exists: 01EVDGQTW64A7QDNX4KG2975RF/meta.json: stat s3 object: 408 Request Time-out"
Mar  5 03:21:16 temng01thanos01 thanos: level=warn ts=2021-03-04T20:21:16.546967953Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
Mar  5 03:21:16 temng01thanos01 thanos: level=info ts=2021-03-04T20:21:16.546986005Z caller=http.go:65 service=http/server component=compact msg="internal server is shutting down" err=null
Mar  5 03:21:17 temng01thanos01 thanos: level=info ts=2021-03-04T20:21:17.047181015Z caller=http.go:84 service=http/server component=compact msg="internal server is shutdown gracefully" err=null
Mar  5 03:21:17 temng01thanos01 thanos: level=info ts=2021-03-04T20:21:17.047299537Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null
Mar  5 03:21:17 temng01thanos01 thanos: level=info ts=2021-03-04T20:21:17.047379923Z caller=main.go:160 msg=exiting2

Anything else we need to know:
Currently we're applying a workaround in the systemd unit file, setting Restart=on-failure to avoid this issue (a minimal sketch of the override is shown below). We're still monitoring it.
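
For reference, a minimal sketch of the systemd drop-in for this workaround (the unit name, file path, and RestartSec value are illustrative):

  # /etc/systemd/system/thanos-compact.service.d/restart.conf
  [Service]
  Restart=on-failure
  RestartSec=30s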

Environment:

  • OS (e.g. from /etc/os-release): CentOS Linux release 7.8.2003 (Core)
  • Kernel (e.g. uname -a): Linux temng01thanos01 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
@rrraditya
Author

Hi all,

Can anyone give me some insights into this issue?

Thank you.

@GiedriusS
Member

The newest RC version is able to calculate checksums of files, so in such a case the Thanos Compactor wouldn't have to re-download all files and it wouldn't cause any issues. Also, we are working on adding retries to the objstore clients. Does this help? 🤗

@bwplotka
Member

Thanks for reporting. It looks like a server-side timeout? I wonder what the reason could be; is there any server-specific issue? I would check your object storage documentation for why a 408 error occurs.

@GiedriusS It's the meta sync operation, so checksums won't help, I think?
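
For reference, a minimal bucket.yml sketch, assuming the OpenIO endpoint is S3-compatible and accessed through the Thanos S3 objstore client (the endpoint, credentials, and timeout values are illustrative); the http_config block is where the client-side HTTP timeouts can be tuned:

  type: S3
  config:
    bucket: thanos
    endpoint: openio.example.local:6007
    access_key: <access-key>
    secret_key: <secret-key>
    insecure: true
    http_config:
      idle_conn_timeout: 90s
      response_header_timeout: 2m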

@rrraditya
Author

Hi @GiedriusS and @bwplotka, thank you for your insights.
It seems that the issue is purely due to a network connection timeout, either because of an intermittent network connection or failing DNS resolution. Either way, what I would like to know is the compactor's behavior in response to this incident.

Why does it shut down the service on its own? Is there any flag that I'm missing in order to avoid this behavior?

@rrraditya
Author

Hi @bwplotka, is there any update you can share?

@rrraditya
Author

Hi @bwplotka, any more insights you can share?

@rrraditya
Author

Hi @bwplotka,
I would like to follow up on this ticket whenever you're available.
Thank you!

@rrraditya
Author

Hi @bwplotka, any more insights you can share?

@stale

stale bot commented Jun 28, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 28, 2021
@stale

stale bot commented Jul 13, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jul 13, 2021