compact: unexpected termination #2653

Closed
jsedy7 opened this issue May 25, 2020 · 4 comments

@jsedy7 commented May 25, 2020

Hello, I have a problem with Thanos Compact.

Thanos, Prometheus and Golang version used:

thanos, version 0.12.2 (branch: HEAD, revision: 52e10c6e0f644ea98fd057e7fbece828d8dd07c7)
  build user:       root@c1a6cf60f03d
  build date:       20200430-16:24:03
  go version:       go1.13.6

Object Storage Provider: S3 (ceph)

What happened: Thanos Compact runs as a pod in our k8s cluster, and after some time it was unexpectedly terminated and I don't know why. The pod is terminated, but not marked as Failed. Probably some block of data is corrupted, but I can't see which one. Because the whole compaction process never completes, the storage on S3 is gradually filling up.

What you expected to happen: I have debug logging turned on and I would like to know where the problem is.

How to reproduce it (as minimally and precisely as possible):
I don't know how to reproduce it, but I have a similar setup in another k8s cluster and everything is OK there.

Full logs to relevant components:

Logs

level=info ts=2020-05-24T12:18:00.380444142Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.943834156s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:19:00.577219766Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.140707411s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:20:00.065621278Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.629082705s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:21:00.181151836Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.744559354s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:22:00.482910124Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.046368517s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:23:00.34656505Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.909983369s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:24:00.730194781Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.293674641s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:24:34.090968775Z caller=main.go:223 msg="caught signal. Exiting." signal=terminated
level=warn ts=2020-05-24T12:24:34.091079439Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
level=info ts=2020-05-24T12:24:34.091359294Z caller=http.go:81 service=http/server component=compact msg="internal server shutdown" err=null
level=info ts=2020-05-24T12:24:34.091389922Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null

Anything else we need to know:

Environment:

Additional information:

    spec:
      containers:
      - args:
        - compact
        - --log.level=debug
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/opt/s3.yaml
        - --retention.resolution-raw=15d
        - --retention.resolution-5m=30d
        - --retention.resolution-1h=90d
        - --consistency-delay=1h
        - --wait

Thank you! :)

@GiedriusS (Member) commented:

What health checks do you have configured on the container? Are there any logs before what you have pasted?
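
It can also help to look at what Kubernetes itself reports about the termination. A rough sketch, assuming kubectl access to the cluster (the pod name and namespace below are placeholders):

    # Pod status, last state, and the termination/eviction reason
    kubectl describe pod thanos-compact-0 -n monitoring

    # Recent events in the namespace, including evictions and their reasons
    kubectl get events -n monitoring --sort-by=.lastTimestamp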

@jsedy7 (Author) commented May 25, 2020

I have configured these health checks:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

I'm using a StatefulSet definition, but I created a Deployment with the same configuration and got this answer:

Status:               Failed
Reason:               Evicted
Message:              Pod The node had condition: [DiskPressure].

The problem was on my side. Maybe the blocks are too large. 🤔
Is there a way to limit the size of the temporarily stored data?
I have a server with about 750 GB of free space.

@GiedriusS (Member) commented:

> The problem was on my side. Maybe the blocks are too large. 🤔
> Is there a way to limit the size of the temporarily stored data?

Exactly - no, since that depends on the nature of the data. There's this, though: #1550. I really need to finish it one day. Let's close this in favor of that?
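
In the meantime, the usage can at least be bounded on the Kubernetes side rather than in Thanos itself. A rough sketch only (sizes, names, and the volume layout below are illustrative, not taken from your manifest):

    spec:
      containers:
      - name: thanos-compact
        args:
        - compact
        - --data-dir=/var/thanos/store
        # ... remaining flags as above ...
        volumeMounts:
        - name: compact-scratch
          mountPath: /var/thanos/store
        resources:
          limits:
            # If the pod's local writes exceed this, only this pod is evicted
            # with an explicit reason, instead of the node going into
            # DiskPressure and evicting workloads at random.
            ephemeral-storage: 100Gi
      volumes:
      - name: compact-scratch
        emptyDir:
          # The kubelet evicts the pod if this volume grows beyond the limit.
          sizeLimit: 100Gi

This only caps what the pod may write locally; the scratch space still has to be large enough for the blocks being compacted, so a dedicated PersistentVolumeClaim for --data-dir, sized for the largest expected blocks, is the more robust option.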

@jsedy7 (Author) commented May 25, 2020

Yes, we can close this in favor of that.
Thank you for your help! :)
