compact: unexpected termination #2653

Closed
jsedy7 opened this issue May 25, 2020 · 4 comments

@jsedy7 commented May 25, 2020

Hello, I have a problem with Thanos Compact.

Thanos, Prometheus and Golang version used:

thanos, version 0.12.2 (branch: HEAD, revision: 52e10c6e0f644ea98fd057e7fbece828d8dd07c7)
  build user:       root@c1a6cf60f03d
  build date:       20200430-16:24:03
  go version:       go1.13.6

Object Storage Provider: S3 (ceph)

What happened: Thanos Compact runs as a pod in our k8s cluster, and after some time it was unexpectedly terminated and I don't know why. The pod is terminated, but not marked as Failed. Probably some block of data is corrupted, but I can't see which one. Because the whole compaction process never completes, the storage on S3 is gradually filling up.

What you expected to happen: I have debug logging turned on and I would like to know where the problem is.

How to reproduce it (as minimally and precisely as possible):
I don't know how to reproduce it, but I have a similar setup in another k8s cluster and everything is OK there.

Full logs to relevant components:

Logs

level=info ts=2020-05-24T12:18:00.380444142Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.943834156s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:19:00.577219766Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.140707411s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:20:00.065621278Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.629082705s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:21:00.181151836Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.744559354s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:22:00.482910124Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.046368517s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:23:00.34656505Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.909983369s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:24:00.730194781Z caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.293674641s cached=3402 returned=3402 partial=0
level=info ts=2020-05-24T12:24:34.090968775Z caller=main.go:223 msg="caught signal. Exiting." signal=terminated
level=warn ts=2020-05-24T12:24:34.091079439Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
level=info ts=2020-05-24T12:24:34.091359294Z caller=http.go:81 service=http/server component=compact msg="internal server shutdown" err=null
level=info ts=2020-05-24T12:24:34.091389922Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null

Anything else we need to know:

Environment:

Additional information:

    spec:
      containers:
      - args:
        - compact
        - --log.level=debug
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/opt/s3.yaml
        - --retention.resolution-raw=15d
        - --retention.resolution-5m=30d
        - --retention.resolution-1h=90d
        - --consistency-delay=1h
        - --wait

Thank you! :)

@GiedriusS (Member) commented:

What health checks do you have configured on the container? Are there any logs before what you have pasted?
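
It can also help to look at what Kubernetes itself reports about the termination. A rough sketch, assuming kubectl access to the cluster (the pod name and namespace below are placeholders):

    # Pod status, last state, and the termination/eviction reason
    kubectl describe pod thanos-compact-0 -n monitoring

    # Recent events in the namespace, including evictions and their reasons
    kubectl get events -n monitoring --sort-by=.lastTimestamp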

@jsedy7 (Author) commented May 25, 2020

I have configured these health checks:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/healthy
            port: http
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/ready
            port: http
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

I'm using a StatefulSet definition, but I created a Deployment with the same configuration and got this answer:

Status:               Failed
Reason:               Evicted
Message:              Pod The node had condition: [DiskPressure].

The problem was on my side. Maybe the blocks are too large. 🤔
Is there a way to limit the size of the temporarily stored data?
I have a server with about 750 GB of free space.

@GiedriusS (Member) commented:

> The problem was on my side. Maybe the blocks are too large. 🤔
> Is there a way to limit the size of the temporarily stored data?

Exactly - no, since that depends on the nature of the data. There's this, though: #1550. I really need to finish it one day. Let's close this in favor of that?
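
In the meantime, the usage can at least be bounded on the Kubernetes side rather than in Thanos itself. A rough sketch only (sizes, names, and the volume layout below are illustrative, not taken from your manifest):

    spec:
      containers:
      - name: thanos-compact
        args:
        - compact
        - --data-dir=/var/thanos/store
        # ... remaining flags as above ...
        volumeMounts:
        - name: compact-scratch
          mountPath: /var/thanos/store
        resources:
          limits:
            # If the pod's local writes exceed this, only this pod is evicted
            # with an explicit reason, instead of the node going into
            # DiskPressure and evicting workloads at random.
            ephemeral-storage: 100Gi
      volumes:
      - name: compact-scratch
        emptyDir:
          # The kubelet evicts the pod if this volume grows beyond the limit.
          sizeLimit: 100Gi

This only caps what the pod may write locally; the scratch space still has to be large enough for the blocks being compacted, so a dedicated PersistentVolumeClaim for --data-dir, sized for the largest expected blocks, is the more robust option.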

@jsedy7 (Author) commented May 25, 2020

Yes, we can close this in favor of that.
Thank you for your help! :)
