Store Gateway increased startup/baseline memory usage from v0.7.0 to v0.8.1 #1655

Closed
bvis opened this issue Oct 16, 2019 · 20 comments · Fixed by #1952

Comments

@bvis (Contributor) commented Oct 16, 2019

Ping: @bwplotka

Thanos, Prometheus and Golang version used:

Thanos: quay.io/thanos/thanos:v0.8.1
Prometheus: prom/prometheus:v2.13.0

Object Storage Provider:

$ cat /config/thanos.yaml
type: GCS
...

What happened:
I've updated the whole Thanos stack I'm using from v0.7.0 to v0.8.1. Everything works as expected (I've seen incredible memory savings in Thanos Sidecar, amazing job!) except the Store Gateway component.

It simply does not start and dies with exit code 2.

What you expected to happen:
It starts as before!

How to reproduce it:
It's happening in our production environment; our staging stack, which uses its own bucket, works correctly with the same parameters.

Parameters:

      --data-dir=/data
      --log.level=debug
      --index-cache-size=500MB
      --chunk-pool-size=2GB
      --grpc-server-tls-cert=/etc/pki/thanos/tls.crt
      --grpc-server-tls-client-ca=/etc/pki/thanos/tls.ca
      --grpc-server-tls-key=/etc/pki/thanos/tls.key
      --objstore.config-file=/config/thanos.yaml

Full logs to relevant components:
Excerpt of the `kubectl describe pod` output:

Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 15 Oct 2019 19:12:56 +0200
      Finished:     Tue, 15 Oct 2019 19:23:23 +0200
Logs of v0.8.1

level=debug ts=2019-10-16T08:00:41.916594638Z caller=main.go:120 msg="maxprocs: Leaving GOMAXPROCS=[8]: CPU quota undefined"
level=info  ts=2019-10-16T08:00:41.924413863Z caller=main.go:168 msg="Tracing will be disabled"
level=info  ts=2019-10-16T08:00:41.924630438Z caller=factory.go:39 msg="loading bucket configuration"
level=info  ts=2019-10-16T08:00:41.929235164Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=262144000 maxSizeBytes=524288000 maxItems=math.MaxInt64
level=debug ts=2019-10-16T08:00:41.981473465Z caller=store.go:162 msg="initializing bucket store"

When I go back to the earlier version it works.

Logs of v0.7.0

```
level=debug ts=2019-10-16T08:00:41.916594638Z caller=main.go:120 msg="maxprocs: Leaving GOMAXPROCS=[8]: CPU quota undefined"
level=info ts=2019-10-16T08:00:41.924413863Z caller=main.go:168 msg="Tracing will be disabled"
level=info ts=2019-10-16T08:00:41.924630438Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-10-16T08:00:41.929235164Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=262144000 maxSizeBytes=524288000 maxItems=math.MaxInt64
level=debug ts=2019-10-16T08:00:41.981473465Z caller=store.go:162 msg="initializing bucket store"
level=debug ts=2019-10-16T08:08:37.879383906Z caller=store.go:166 msg="bucket store ready" init_duration=7m55.871106642s
level=info ts=2019-10-16T08:08:38.655822385Z caller=main.go:305 msg="enabled gRPC server side TLS"
level=info ts=2019-10-16T08:08:38.657352493Z caller=main.go:322 msg="gRPC server TLS client verification enabled"
level=info ts=2019-10-16T08:08:38.697338919Z caller=store.go:209 msg="starting store node"
level=info ts=2019-10-16T08:08:38.729108826Z caller=main.go:341 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2019-10-16T08:08:38.729094252Z caller=store.go:199 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2019-10-16T08:21:49.523658907Z caller=bucket.go:771 msg="Blocks source resolutions" blocks=2 MaximumResolution=0 mint=1571192190000 maxt=1571214090000 lset="{kubernetes_cluster=\"kube-eu-west\",kubernetes_environment=\"production\"}" spans="Range: 1571191200000-1571205600000 Resolution: 0"
level=debug ts=2019-10-16T08:21:51.538999187Z caller=bucket.go:871 msg="stats query processed" stats="&{blocksQueried:2 postingsTouched:94 postingsTouchedSizeSum:1237720 postingsToFetch:0 postingsFetched:94 postingsFetchedSizeSum:1238440 postingsFetchCount:4 postingsFetchDurationSum:849477981 seriesTouched:93 seriesTouchedSizeSum:13619 seriesFetched:93 seriesFetchedSizeSum:20945152 seriesFetchCount:7 seriesFetchDurationSum:2209205202 chunksTouched:368 chunksTouchedSizeSum:243771 chunksFetched:368 chunksFetchedSizeSum:9558188 chunksFetchCount:76 chunksFetchDurationSum:19410111834 getAllDuration:1848805649 mergedSeriesCount:47 mergedChunksCount:368 mergeDuration:60013656}" err=null
```

It works!

Anything else we need to know:
I've configured the compact component as:

compact
      --log.level=info
      --retention.resolution-raw=30d
      --retention.resolution-5m=30d
      --retention.resolution-1h=10y
      --consistency-delay=30m
      --data-dir=/data
      --objstore.config-file=/config/thanos.yaml
      --wait

And as far as I can see, it's doing its work:

Logs of compact

```
level=info ts=2019-10-16T08:15:49.780661885Z caller=main.go:170 msg="Tracing will be disabled"
level=info ts=2019-10-16T08:15:49.781448872Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-10-16T08:15:49.794832718Z caller=compact.go:251 msg="retention policy of raw samples is enabled" duration=720h0m0s
level=info ts=2019-10-16T08:15:49.794912434Z caller=compact.go:254 msg="retention policy of 5 min aggregated samples is enabled" duration=720h0m0s
level=info ts=2019-10-16T08:15:49.794926786Z caller=compact.go:257 msg="retention policy of 1 hour aggregated samples is enabled" duration=87600h0m0s
level=info ts=2019-10-16T08:15:49.794940895Z caller=compact.go:341 msg="starting compact node"
level=info ts=2019-10-16T08:15:49.794949212Z caller=prober.go:114 msg="changing probe status" status=ready
level=info ts=2019-10-16T08:15:49.795056156Z caller=main.go:353 msg="listening for requests and metrics" component=compact address=0.0.0.0:10902
level=info ts=2019-10-16T08:15:49.795165313Z caller=prober.go:143 msg="changing probe status" status=healthy
level=info ts=2019-10-16T08:15:49.795233513Z caller=compact.go:1063 msg="start sync of metas"
level=info ts=2019-10-16T08:15:50.312216312Z caller=compact.go:1069 msg="start of GC"
level=info ts=2019-10-16T08:15:50.315245595Z caller=compact.go:1075 msg="start of compaction"
level=info ts=2019-10-16T08:15:50.658389331Z caller=compact.go:264 msg="compaction iterations done"
level=info ts=2019-10-16T08:15:50.658457513Z caller=compact.go:271 msg="start first pass of downsampling"
level=info ts=2019-10-16T08:15:53.992421826Z caller=compact.go:277 msg="start second pass of downsampling"
level=info ts=2019-10-16T08:15:56.741387181Z caller=compact.go:282 msg="downsampling iterations done"
```

Environment:

  • OS:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel:
$ uname -a
Linux gke-kube-eu-west-highmem-8-fbfa03ef-lvvr 4.15.0-1034-gke #36-Ubuntu SMP Thu Jun 6 11:16:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Others:
    I stopped the compact component for a moment and verified the status of the bucket:
Logs

$ thanos bucket verify --objstore.config-file=/config/thanos.yaml
level=info ts=2019-10-16T10:02:43.886890278Z caller=main.go:170 msg="Tracing will be disabled"
level=info ts=2019-10-16T10:02:43.887322586Z caller=factory.go:39 msg="loading bucket configuration"
level=warn ts=2019-10-16T10:02:43.888069338Z caller=verify.go:49 msg="GLOBAL COMPACTOR SHOULD __NOT__ BE RUNNING ON THE SAME BUCKET" issues=2 repair=false
level=info ts=2019-10-16T10:02:43.888110263Z caller=index_issue.go:29 msg="started verifying issue" with-repair=false issue=index_issue
level=info ts=2019-10-16T10:26:30.883476024Z caller=index_issue.go:130 msg="verified issue" with-repair=false issue=index_issue
level=info ts=2019-10-16T10:26:30.88374969Z caller=overlapped_blocks.go:26 msg="started verifying issue" with-repair=false issue=overlapped_blocks
level=info ts=2019-10-16T10:26:35.466468212Z caller=verify.go:68 msg="verify completed" issues=2 repair=false
level=info ts=2019-10-16T10:26:35.466838815Z caller=main.go:218 msg=exiting

It took more than 20 minutes, but finally worked.

@bwplotka bwplotka added the bug label Oct 16, 2019
@bwplotka (Member) commented:

Hi? Any probes configured? (:

@bvis (Contributor, Author) commented Oct 16, 2019

True!

Indeed, I found the liveness probe was a problem and I added initialDelaySeconds: 600. I'll try again with initialDelaySeconds: 1800 and let you know whether it works.

In any case, is it now doing anything else that justifies the longer startup time?
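
For reference, here is roughly what that probe tweak looks like in the store manifest (a sketch only; the container name, port name, and timings here are illustrative, not my exact values):

```yaml
# Hypothetical excerpt of the thanos-store container spec.
containers:
  - name: thanos-store            # illustrative container name
    ports:
      - name: http
        containerPort: 10902      # Thanos HTTP/metrics port (matches the logs above)
    livenessProbe:
      tcpSocket:
        port: http                # only checks that the HTTP port is open
      initialDelaySeconds: 1800   # give the bucket store time to finish its initial sync
      periodSeconds: 30
```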

@metalmatze (Contributor) commented:

In case you use kube-thanos: we recently added liveness probes and found that they were not being used properly, so we removed liveness probes from kube-thanos entirely.
Before that, we simply didn't have liveness probes on the store and therefore never saw this.

Even if you don't use kube-thanos, you might have been bitten by the same scenario?!

@bwplotka (Member) commented:

👍

@bvis (Contributor, Author) commented Oct 16, 2019

I'm using this helm chart instead: https://github.com/kiwigrid/helm-charts/tree/master/charts/prometheus-thanos

AFAIK it's a best practice to have liveness/readiness checks on your services. On the store I'm just checking that the HTTP port is responding.

I've tried again removing the probes and it still does not start. After 10 minutes I just got a new pod starting the process from scratch.

@bwplotka (Member) commented:

We are working on a fix to allow proper probing: #1656

> I've tried again removing the probes and it still does not start. After 10 minutes I just got a new pod starting the process from scratch.

Startup can be long, and we are working on a long-term change that will improve this: #1471
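
For context, the kind of probe split this enables could look roughly like the following (a sketch only; the /-/healthy and /-/ready paths and port 10902 are inferred from the prober and metrics log lines in this thread, so double-check them against your version):

```yaml
# Hypothetical split: readiness gates traffic, liveness only checks the process is alive.
readinessProbe:
  httpGet:
    path: /-/ready          # flips to ready only after "bucket store ready"
    port: 10902
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /-/healthy        # healthy as soon as the HTTP server is up
    port: 10902
  periodSeconds: 30
  failureThreshold: 10      # be generous so a slow start doesn't get the pod killed
```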

@bvis (Contributor, Author) commented Oct 16, 2019

OK, for now I'll stay on v0.7.0; please don't make any breaking changes until you improve the startup! ;)

@bwplotka (Member) commented:

So you experience a longer startup with v0.8.1, right? What's the difference? Or is it an OOM?

@bvis (Contributor, Author) commented Oct 16, 2019

The difference is that even without any memory limit or probes it does not start. When the process has been running for roughly 9 minutes it stops working and is restarted by Kubernetes.

When I apply the same config with v0.7.0 it starts serving traffic after 5 minutes.

@bwplotka (Member) commented:

Thanks, investigating.

@bwplotka bwplotka reopened this Oct 16, 2019
@bvis (Contributor, Author) commented Oct 16, 2019

I've not seen any OOMKill event in my cluster logs :(

@bwplotka (Member) commented Oct 16, 2019

Spent some time on reproducing this, but unfortunately no luck. Startup time and queries do not seem to be affected by v0.8.1 on my setup. We will keep an eye on this, though.

Plus, hopefully the new loading will improve startup to be nearly instant.

@bwplotka bwplotka removed the bug label Oct 17, 2019
@bvis (Contributor, Author) commented Oct 17, 2019

Yesterday I removed the probes to be sure they weren't causing the problem. 😔

@bvis (Contributor, Author) commented Oct 17, 2019

It would be good to have an extra flag or a more verbose logging level than the current debug. Would that be possible? I could run a process from an image built from a branch and check it in production.

@bwplotka (Member) commented:

We can do that indeed if you want to help.

@bvis (Contributor, Author) commented Oct 17, 2019

Of course!

@bwplotka bwplotka changed the title Store Gateway does not start after image version upgrade from v0.7.0 to v0.8.1 Store Gateway increased startup memory usage from v0.7.0 to v0.8.1 Oct 30, 2019
@bwplotka bwplotka changed the title Store Gateway increased startup memory usage from v0.7.0 to v0.8.1 Store Gateway increased startup/baseline memory usage from v0.7.0 to v0.8.1 Oct 30, 2019
@bwplotka (Member) commented:

Hi again,

I investigated this a bit on my own and bisected different commits using thanosbench.

The bisection clearly showed almost 2x baseline memory usage against the test data after this commit: PR #1482 (commit).

I think some dependency upgrades could have introduced this, but I did not have time to look into profiles. It might be worth grabbing this problematic commit and comparing profiles with v0.7.0 or the commit before. We need to dig deeper to know.

However, we saw different regression results on a different dataset: https://cloud-native.slack.com/archives/CK5RSSC10/p1572426624394000

It's interesting what this commit could have changed. I need to focus on other things, so help is wanted to investigate deeper. I don't think it's a blocker, but it's not nice indeed.
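
If someone wants to dig in, one approach would be to capture heap profiles from a v0.7.0 store and from a store built at the suspect commit and diff them. A sketch (assuming the store's HTTP port serves the standard Go pprof endpoints; the hostnames here are placeholders):

```
# Capture heap profiles from both builds (hostnames and port are placeholders).
curl -s http://store-v0-7-0:10902/debug/pprof/heap > heap-v0.7.0.pb.gz
curl -s http://store-suspect:10902/debug/pprof/heap > heap-suspect.pb.gz

# Diff the two profiles; positive entries in `top` are allocations that grew.
go tool pprof -base heap-v0.7.0.pb.gz heap-suspect.pb.gz
```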

@bwplotka (Member) commented Oct 30, 2019

Note that it does not affect the memory used per request during my tests.

@bvis (Contributor, Author) commented Nov 27, 2019

Yesterday I gave the master version of the image (quay.io/thanos/thanos:master-2019-11-26-825f1199) a try and it started as expected.

It warns me about a data inconsistency, but it works:

level=info ts=2019-11-26T11:40:47.434709813Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2019-11-26T11:40:47.434842525Z caller=factory.go:43 msg="loading bucket configuration"
level=info ts=2019-11-26T11:40:47.435441223Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=262144000 maxSizeBytes=524288000 maxItems=math.MaxInt64
level=info ts=2019-11-26T11:40:47.435616829Z caller=options.go:24 protocol=gRPC msg="enabling server side TLS"
level=info ts=2019-11-26T11:40:47.43599483Z caller=options.go:54 protocol=gRPC msg="server TLS client verification enabled"
level=info ts=2019-11-26T11:40:47.436281653Z caller=store.go:261 msg="starting store node"
level=info ts=2019-11-26T11:40:47.436359936Z caller=prober.go:127 msg="changing probe status" status=healthy
level=info ts=2019-11-26T11:40:47.436379946Z caller=store.go:216 msg="initializing bucket store"
level=info ts=2019-11-26T11:40:47.436417247Z caller=http.go:53 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2019-11-26T11:43:45.657080817Z caller=store.go:222 msg="bucket store ready" init_duration=2m58.220639491s
level=info ts=2019-11-26T11:43:45.657279816Z caller=prober.go:107 msg="changing probe status" status=ready
level=info ts=2019-11-26T11:43:45.657375664Z caller=grpc.go:98 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=warn ts=2019-11-26T19:37:46.542771402Z caller=store.go:227 msg="syncing blocks failed" err="iter: load meta: download meta.json: get file 01DTMKDDAP04S8X6BB3XZN8DQY/meta.json: storage: object doesn't exist"
