Store Gateway increased startup/baseline memory usage from v0.7.0 to v0.8.1 #1655

Closed
bvis opened this issue Oct 16, 2019 · 20 comments · Fixed by #1952

Comments

@bvis (Contributor) commented Oct 16, 2019

Ping: @bwplotka

Thanos, Prometheus and Golang version used:

Thanos: quay.io/thanos/thanos:v0.8.1
Prometheus: prom/prometheus:v2.13.0

Object Storage Provider:

$ cat /config/thanos.yaml
type: GCS
...

What happened:
I've updated the whole Thanos stack I'm using from v0.7.0 to v0.8.1. Everything works as expected (I've seen incredible memory savings in Thanos Sidecar, amazing job!) except the Store Gateway component.

It simply does not start and dies with exit code 2.

What you expected to happen:
It starts as before!

How to reproduce it:
It's happening in our production environment; our staging stack, which uses its own bucket, works correctly with the same parameters.

Parameters:

      --data-dir=/data
      --log.level=debug
      --index-cache-size=500MB
      --chunk-pool-size=2GB
      --grpc-server-tls-cert=/etc/pki/thanos/tls.crt
      --grpc-server-tls-client-ca=/etc/pki/thanos/tls.ca
      --grpc-server-tls-key=/etc/pki/thanos/tls.key
      --objstore.config-file=/config/thanos.yaml

Full logs to relevant components:
Excerpt of the `kubectl describe pod` output:

Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 15 Oct 2019 19:12:56 +0200
      Finished:     Tue, 15 Oct 2019 19:23:23 +0200
Logs of v0.8.1

level=debug ts=2019-10-16T08:00:41.916594638Z caller=main.go:120 msg="maxprocs: Leaving GOMAXPROCS=[8]: CPU quota undefined"
level=info  ts=2019-10-16T08:00:41.924413863Z caller=main.go:168 msg="Tracing will be disabled"
level=info  ts=2019-10-16T08:00:41.924630438Z caller=factory.go:39 msg="loading bucket configuration"
level=info  ts=2019-10-16T08:00:41.929235164Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=262144000 maxSizeBytes=524288000 maxItems=math.MaxInt64
level=debug ts=2019-10-16T08:00:41.981473465Z caller=store.go:162 msg="initializing bucket store"

When I go back to the earlier version it works.

Logs of v0.7.0

```
level=debug ts=2019-10-16T08:00:41.916594638Z caller=main.go:120 msg="maxprocs: Leaving GOMAXPROCS=[8]: CPU quota undefined"
level=info ts=2019-10-16T08:00:41.924413863Z caller=main.go:168 msg="Tracing will be disabled"
level=info ts=2019-10-16T08:00:41.924630438Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-10-16T08:00:41.929235164Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=262144000 maxSizeBytes=524288000 maxItems=math.MaxInt64
level=debug ts=2019-10-16T08:00:41.981473465Z caller=store.go:162 msg="initializing bucket store"
level=debug ts=2019-10-16T08:08:37.879383906Z caller=store.go:166 msg="bucket store ready" init_duration=7m55.871106642s
level=info ts=2019-10-16T08:08:38.655822385Z caller=main.go:305 msg="enabled gRPC server side TLS"
level=info ts=2019-10-16T08:08:38.657352493Z caller=main.go:322 msg="gRPC server TLS client verification enabled"
level=info ts=2019-10-16T08:08:38.697338919Z caller=store.go:209 msg="starting store node"
level=info ts=2019-10-16T08:08:38.729108826Z caller=main.go:341 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2019-10-16T08:08:38.729094252Z caller=store.go:199 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2019-10-16T08:21:49.523658907Z caller=bucket.go:771 msg="Blocks source resolutions" blocks=2 MaximumResolution=0 mint=1571192190000 maxt=1571214090000 lset="{kubernetes_cluster=\"kube-eu-west\",kubernetes_environment=\"production\"}" spans="Range: 1571191200000-1571205600000 Resolution: 0"
level=debug ts=2019-10-16T08:21:51.538999187Z caller=bucket.go:871 msg="stats query processed" stats="&{blocksQueried:2 postingsTouched:94 postingsTouchedSizeSum:1237720 postingsToFetch:0 postingsFetched:94 postingsFetchedSizeSum:1238440 postingsFetchCount:4 postingsFetchDurationSum:849477981 seriesTouched:93 seriesTouchedSizeSum:13619 seriesFetched:93 seriesFetchedSizeSum:20945152 seriesFetchCount:7 seriesFetchDurationSum:2209205202 chunksTouched:368 chunksTouchedSizeSum:243771 chunksFetched:368 chunksFetchedSizeSum:9558188 chunksFetchCount:76 chunksFetchDurationSum:19410111834 getAllDuration:1848805649 mergedSeriesCount:47 mergedChunksCount:368 mergeDuration:60013656}" err=null
```

It works!

Anything else we need to know:
I've configured the compact component as:

compact
      --log.level=info
      --retention.resolution-raw=30d
      --retention.resolution-5m=30d
      --retention.resolution-1h=10y
      --consistency-delay=30m
      --data-dir=/data
      --objstore.config-file=/config/thanos.yaml
      --wait

And as far as I can see, it's doing its work:

Logs of compact

```
level=info ts=2019-10-16T08:15:49.780661885Z caller=main.go:170 msg="Tracing will be disabled"
level=info ts=2019-10-16T08:15:49.781448872Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-10-16T08:15:49.794832718Z caller=compact.go:251 msg="retention policy of raw samples is enabled" duration=720h0m0s
level=info ts=2019-10-16T08:15:49.794912434Z caller=compact.go:254 msg="retention policy of 5 min aggregated samples is enabled" duration=720h0m0s
level=info ts=2019-10-16T08:15:49.794926786Z caller=compact.go:257 msg="retention policy of 1 hour aggregated samples is enabled" duration=87600h0m0s
level=info ts=2019-10-16T08:15:49.794940895Z caller=compact.go:341 msg="starting compact node"
level=info ts=2019-10-16T08:15:49.794949212Z caller=prober.go:114 msg="changing probe status" status=ready
level=info ts=2019-10-16T08:15:49.795056156Z caller=main.go:353 msg="listening for requests and metrics" component=compact address=0.0.0.0:10902
level=info ts=2019-10-16T08:15:49.795165313Z caller=prober.go:143 msg="changing probe status" status=healthy
level=info ts=2019-10-16T08:15:49.795233513Z caller=compact.go:1063 msg="start sync of metas"
level=info ts=2019-10-16T08:15:50.312216312Z caller=compact.go:1069 msg="start of GC"
level=info ts=2019-10-16T08:15:50.315245595Z caller=compact.go:1075 msg="start of compaction"
level=info ts=2019-10-16T08:15:50.658389331Z caller=compact.go:264 msg="compaction iterations done"
level=info ts=2019-10-16T08:15:50.658457513Z caller=compact.go:271 msg="start first pass of downsampling"
level=info ts=2019-10-16T08:15:53.992421826Z caller=compact.go:277 msg="start second pass of downsampling"
level=info ts=2019-10-16T08:15:56.741387181Z caller=compact.go:282 msg="downsampling iterations done"
```

Environment:

  • OS:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Kernel:
$ uname -a
Linux gke-kube-eu-west-highmem-8-fbfa03ef-lvvr 4.15.0-1034-gke #36-Ubuntu SMP Thu Jun 6 11:16:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Others:
    I stopped the compact component for a moment and verified the status of the bucket:
Logs

$ thanos bucket verify --objstore.config-file=/config/thanos.yaml
level=info ts=2019-10-16T10:02:43.886890278Z caller=main.go:170 msg="Tracing will be disabled"
level=info ts=2019-10-16T10:02:43.887322586Z caller=factory.go:39 msg="loading bucket configuration"
level=warn ts=2019-10-16T10:02:43.888069338Z caller=verify.go:49 msg="GLOBAL COMPACTOR SHOULD __NOT__ BE RUNNING ON THE SAME BUCKET" issues=2 repair=false
level=info ts=2019-10-16T10:02:43.888110263Z caller=index_issue.go:29 msg="started verifying issue" with-repair=false issue=index_issue
level=info ts=2019-10-16T10:26:30.883476024Z caller=index_issue.go:130 msg="verified issue" with-repair=false issue=index_issue
level=info ts=2019-10-16T10:26:30.88374969Z caller=overlapped_blocks.go:26 msg="started verifying issue" with-repair=false issue=overlapped_blocks
level=info ts=2019-10-16T10:26:35.466468212Z caller=verify.go:68 msg="verify completed" issues=2 repair=false
level=info ts=2019-10-16T10:26:35.466838815Z caller=main.go:218 msg=exiting

It took more than 20 minutes, but finally worked.

@bwplotka bwplotka added the bug label Oct 16, 2019
@bwplotka (Member) commented:

Hi? Any probes configured? (:

@bvis (Contributor, Author) commented Oct 16, 2019

True!

Indeed, I found the liveness probe was a problem and I added initialDelaySeconds: 600. I'll try again with initialDelaySeconds: 1800 and let you know whether it works.

In any case, is it now doing anything else that justifies the longer startup time?
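
For reference, here is roughly what that probe tweak looks like in the store manifest (a sketch only; the container name, port name, and timings here are illustrative, not my exact values):

```yaml
# Hypothetical excerpt of the thanos-store container spec.
containers:
  - name: thanos-store            # illustrative container name
    ports:
      - name: http
        containerPort: 10902      # Thanos HTTP/metrics port (matches the logs above)
    livenessProbe:
      tcpSocket:
        port: http                # only checks that the HTTP port is open
      initialDelaySeconds: 1800   # give the bucket store time to finish its initial sync
      periodSeconds: 30
```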

@metalmatze (Contributor) commented:

In case you use kube-thanos: we recently added liveness probes and found that they were not being used properly, so we removed liveness probes from kube-thanos entirely.
Before that, we simply didn't have liveness probes on the store and therefore never saw this.

Even if you don't use kube-thanos, you might have been bitten by the same scenario?!

@bwplotka (Member) commented:

👍

@bvis (Contributor, Author) commented Oct 16, 2019

I'm using this helm chart instead: https://github.com/kiwigrid/helm-charts/tree/master/charts/prometheus-thanos

AFAIK it's a best practice to have liveness/readiness checks on your services. On the store I'm just checking that the HTTP port is responding.

I've tried again removing the probes and it still does not start. After 10 minutes I just got a new pod starting the process from scratch.

@bwplotka (Member) commented:

We are working on a fix to allow proper probing: #1656

> I've tried again removing the probes and it still does not start. After 10 minutes I just got a new pod starting the process from scratch.

Startup can be long, and we are working on a long-term change that will improve this: #1471
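
For context, the kind of probe split this enables could look roughly like the following (a sketch only; the /-/healthy and /-/ready paths and port 10902 are inferred from the prober and metrics log lines in this thread, so double-check them against your version):

```yaml
# Hypothetical split: readiness gates traffic, liveness only checks the process is alive.
readinessProbe:
  httpGet:
    path: /-/ready          # flips to ready only after "bucket store ready"
    port: 10902
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /-/healthy        # healthy as soon as the HTTP server is up
    port: 10902
  periodSeconds: 30
  failureThreshold: 10      # be generous so a slow start doesn't get the pod killed
```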

@bvis (Contributor, Author) commented Oct 16, 2019

OK, for now I'll stay on v0.7.0; please don't make any breaking changes until you improve the startup! ;)

@bwplotka (Member) commented:

So you experience a longer startup with v0.8.1, right? What's the difference? Or is it an OOM?

@bvis (Contributor, Author) commented Oct 16, 2019

The difference is that even without any memory limit or probes it does not start. When the process has been running for roughly 9 minutes it stops working and is restarted by Kubernetes.

When I apply the same config with v0.7.0 it starts serving traffic after 5 minutes.

@bwplotka (Member) commented:

Thanks, investigating.

@bwplotka bwplotka reopened this Oct 16, 2019
@bvis (Contributor, Author) commented Oct 16, 2019

I've not seen any OOMKill event in my cluster logs :(

@bwplotka (Member) commented Oct 16, 2019

Spent some time on reproducing this, but unfortunately no luck. Startup time and queries do not seem to be affected by v0.8.1 on my setup. We will keep an eye on this, though.

Plus, hopefully the new loading will improve startup to be nearly instant.

@bwplotka bwplotka removed the bug label Oct 17, 2019
@bvis (Contributor, Author) commented Oct 17, 2019

Yesterday I removed the probes to be sure they weren't causing the problem. 😔

@bvis (Contributor, Author) commented Oct 17, 2019

It would be good to have an extra flag or a more verbose logging level than the current debug. Would that be possible? I could run a process from an image built from a branch and check it in production.

@bwplotka (Member) commented:

We can do that indeed if you want to help.

@bvis (Contributor, Author) commented Oct 17, 2019

Of course!

@bwplotka bwplotka changed the title Store Gateway does not start after image version upgrade from v0.7.0 to v0.8.1 Store Gateway increased startup memory usage from v0.7.0 to v0.8.1 Oct 30, 2019
@bwplotka bwplotka changed the title Store Gateway increased startup memory usage from v0.7.0 to v0.8.1 Store Gateway increased startup/baseline memory usage from v0.7.0 to v0.8.1 Oct 30, 2019
@bwplotka (Member) commented:

Hi again,

I investigated this a bit on my own and bisected different commits using thanosbench.

The bisection clearly showed almost 2x baseline memory usage against the test data after this commit: PR #1482 (commit).

I think some dependency upgrades could have introduced this, but I did not have time to look into profiles. It might be worth grabbing this problematic commit and comparing profiles with v0.7.0 or the commit before. We need to dig deeper to know.

However, we saw different regression results on a different dataset: https://cloud-native.slack.com/archives/CK5RSSC10/p1572426624394000

It's interesting what this commit could have changed. I need to focus on other things, so help is wanted to investigate deeper. I don't think it's a blocker, but it's not nice indeed.
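
If someone wants to dig in, one approach would be to capture heap profiles from a v0.7.0 store and from a store built at the suspect commit and diff them. A sketch (assuming the store's HTTP port serves the standard Go pprof endpoints; the hostnames here are placeholders):

```
# Capture heap profiles from both builds (hostnames and port are placeholders).
curl -s http://store-v0-7-0:10902/debug/pprof/heap > heap-v0.7.0.pb.gz
curl -s http://store-suspect:10902/debug/pprof/heap > heap-suspect.pb.gz

# Diff the two profiles; positive entries in `top` are allocations that grew.
go tool pprof -base heap-v0.7.0.pb.gz heap-suspect.pb.gz
```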

@bwplotka (Member) commented Oct 30, 2019

Note that it does not affect the memory used per request during my tests.

@bvis (Contributor, Author) commented Nov 27, 2019

Yesterday I gave the master version of the image (quay.io/thanos/thanos:master-2019-11-26-825f1199) a try and it started as expected.

It warns me about a data inconsistency, but it works:

level=info ts=2019-11-26T11:40:47.434709813Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2019-11-26T11:40:47.434842525Z caller=factory.go:43 msg="loading bucket configuration"
level=info ts=2019-11-26T11:40:47.435441223Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=262144000 maxSizeBytes=524288000 maxItems=math.MaxInt64
level=info ts=2019-11-26T11:40:47.435616829Z caller=options.go:24 protocol=gRPC msg="enabling server side TLS"
level=info ts=2019-11-26T11:40:47.43599483Z caller=options.go:54 protocol=gRPC msg="server TLS client verification enabled"
level=info ts=2019-11-26T11:40:47.436281653Z caller=store.go:261 msg="starting store node"
level=info ts=2019-11-26T11:40:47.436359936Z caller=prober.go:127 msg="changing probe status" status=healthy
level=info ts=2019-11-26T11:40:47.436379946Z caller=store.go:216 msg="initializing bucket store"
level=info ts=2019-11-26T11:40:47.436417247Z caller=http.go:53 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2019-11-26T11:43:45.657080817Z caller=store.go:222 msg="bucket store ready" init_duration=2m58.220639491s
level=info ts=2019-11-26T11:43:45.657279816Z caller=prober.go:107 msg="changing probe status" status=ready
level=info ts=2019-11-26T11:43:45.657375664Z caller=grpc.go:98 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=warn ts=2019-11-26T19:37:46.542771402Z caller=store.go:227 msg="syncing blocks failed" err="iter: load meta: download meta.json: get file 01DTMKDDAP04S8X6BB3XZN8DQY/meta.json: storage: object doesn't exist"
