Store Gateway increased startup/baseline memory usage from v0.7.0 to v0.8.1 #1655
Comments
Hi! Any probes configured? (:
True! Indeed I found the liveness probe was a problem and I adjusted it. In any case, is it now doing anything else that justifies the longer startup time?
In case you use kube-thanos: we recently added liveness probes there and found that they were not being used properly, so we removed liveness probes from kube-thanos entirely. Even if you don't use kube-thanos, you might have been bitten by the same scenario?!
👍
I'm using this helm chart instead: https://github.com/kiwigrid/helm-charts/tree/master/charts/prometheus-thanos. AFAIK it is a best practice to have liveness/readiness checks in your services; for the store I'm just checking that the HTTP port is responding. I've tried again with the probes removed and it still does not start. After 10 minutes I just got a new pod starting the process from scratch.
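If the chart's probes turn out to be too aggressive for a store that needs many minutes to build its index, one rough workaround (a sketch only; the namespace, StatefulSet name, and container index below are placeholders for whatever the chart actually creates) is to relax the liveness probe in place:

```
# Relax the store gateway's liveness probe so slow index loading is not treated as a crash.
# (Names below are placeholders; adjust to the StatefulSet the chart actually creates.)
kubectl -n monitoring patch statefulset prometheus-thanos-store --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 900},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 10}
]'
```

That gives the process roughly 15 minutes plus several failed checks before Kubernetes kills it, which is closer to the startup times reported in this thread.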
OK, for now I'll keep working with v0.7.0. Please don't make any breaking changes until you improve the startup! ;)
So you experience a longer startup with v0.8.1, right? What's the difference? Or is it OOM?
The difference is that even without any memory limit or probes it does not start. When the process has been running for roughly 9 minutes it stops working and is restarted by Kubernetes. When I apply the same config with v0.7.0 it starts serving traffic after 5 minutes.
Thanks, investigating.
I've not seen any OOMKill event in my cluster logs :(
Spent some time on reproducing this but unfortunately no luck: startup time and queries do not seem to be affected on my side. Plus, hopefully the new loading mechanism will bring startup close to instant.
Cannot reproduce it here either. https://github.com/kiwigrid/helm-charts/blob/master/charts/prometheus-thanos/templates/store-gateway-statefulset.yaml#L67-L83 might be the culprit.
Yesterday I removed the probes to be sure they weren't causing the problem. 😔
It would be good to have an extra flag or a more verbose level than the current debug. Would that be possible? I could start a process using an image built from a branch and check it in production.
We can do that indeed if you want to help.
Of course!
Hi again, I investigated this a bit on my own and bisected different commits using thanosbench. The bisection clearly showed almost 2x baseline memory usage against the testing data after this commit: #1482 PR (commit). I think some dependency upgrades could have introduced this, but I did not have time to look into profiles. It might be worth grabbing this problematic commit and comparing profiles with v0.7.0 or the commit just before it. We need to dig deeper to know. However, we could see different regression results on a different dataset: https://cloud-native.slack.com/archives/CK5RSSC10/p1572426624394000 It's interesting what this commit could have changed. I need to focus on other things, so help is wanted to investigate deeper. I don't think it's a blocker, but it's not nice indeed.
Note that it does not affect the memory used per request during my tests.
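For anyone who wants to dig deeper, one way to compare the two versions is to capture heap profiles from the store gateway's HTTP/metrics port (Thanos exposes the standard Go pprof handlers there) under v0.7.0 and then under v0.8.1 or the suspect commit, and diff them. A sketch, with a placeholder namespace and pod name:

```
# Forward the store gateway's HTTP/metrics port (namespace and pod name are examples).
kubectl -n monitoring port-forward pod/thanos-store-0 10902:10902 &

# Capture a heap profile while running v0.7.0, then repeat after deploying v0.8.1
# (or the commit flagged by the bisection) and save it under a different name.
curl -s http://localhost:10902/debug/pprof/heap > heap-v0.7.0.pb.gz
curl -s http://localhost:10902/debug/pprof/heap > heap-v0.8.1.pb.gz

# Show only the difference between the two profiles.
go tool pprof -base heap-v0.7.0.pb.gz heap-v0.8.1.pb.gz
```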
Yesterday I gave the master version of the image (quay.io/thanos/thanos:master-2019-11-26-825f1199) a try and it started as expected. It warned me about a data inconsistency, but it works:
Ping: @bwplotka
Thanos, Prometheus and Golang version used:
Thanos: quay.io/thanos/thanos:v0.8.1
Prometheus: prom/prometheus:v2.13.0
Object Storage Provider:
What happened:
I've updated the whole Thanos stack I'm using from v0.7.0 to v0.8.1. Everything works as expected (I've seen incredible memory savings in the Thanos Sidecar, amazing job!) except the Store Gateway component.
It simply does not start and dies with exit code 2.
What you expected to happen:
It starts as before!
How to reproduce it:
It's happening in our production environment; our staging stack, which uses its own bucket, works correctly with the same parameters.
Parameters:
Full logs to relevant components:
Excerpt of the pod describe:
When I go back to the earlier version it works.
```
level=debug ts=2019-10-16T08:00:41.916594638Z caller=main.go:120 msg="maxprocs: Leaving GOMAXPROCS=[8]: CPU quota undefined"
level=info ts=2019-10-16T08:00:41.924413863Z caller=main.go:168 msg="Tracing will be disabled"
level=info ts=2019-10-16T08:00:41.924630438Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-10-16T08:00:41.929235164Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=262144000 maxSizeBytes=524288000 maxItems=math.MaxInt64
level=debug ts=2019-10-16T08:00:41.981473465Z caller=store.go:162 msg="initializing bucket store"
level=debug ts=2019-10-16T08:08:37.879383906Z caller=store.go:166 msg="bucket store ready" init_duration=7m55.871106642s
level=info ts=2019-10-16T08:08:38.655822385Z caller=main.go:305 msg="enabled gRPC server side TLS"
level=info ts=2019-10-16T08:08:38.657352493Z caller=main.go:322 msg="gRPC server TLS client verification enabled"
level=info ts=2019-10-16T08:08:38.697338919Z caller=store.go:209 msg="starting store node"
level=info ts=2019-10-16T08:08:38.729108826Z caller=main.go:341 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2019-10-16T08:08:38.729094252Z caller=store.go:199 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2019-10-16T08:21:49.523658907Z caller=bucket.go:771 msg="Blocks source resolutions" blocks=2 MaximumResolution=0 mint=1571192190000 maxt=1571214090000 lset="{kubernetes_cluster=\"kube-eu-west\",kubernetes_environment=\"production\"}" spans="Range: 1571191200000-1571205600000 Resolution: 0"
level=debug ts=2019-10-16T08:21:51.538999187Z caller=bucket.go:871 msg="stats query processed" stats="&{blocksQueried:2 postingsTouched:94 postingsTouchedSizeSum:1237720 postingsToFetch:0 postingsFetched:94 postingsFetchedSizeSum:1238440 postingsFetchCount:4 postingsFetchDurationSum:849477981 seriesTouched:93 seriesTouchedSizeSum:13619 seriesFetched:93 seriesFetchedSizeSum:20945152 seriesFetchCount:7 seriesFetchDurationSum:2209205202 chunksTouched:368 chunksTouchedSizeSum:243771 chunksFetched:368 chunksFetchedSizeSum:9558188 chunksFetchCount:76 chunksFetchDurationSum:19410111834 getAllDuration:1848805649 mergedSeriesCount:47 mergedChunksCount:368 mergeDuration:60013656}" err=null
```
It works!
Anything else we need to know:
I've configured the compact component as:
And as far as I see it's doing its work:
```
level=info ts=2019-10-16T08:15:49.780661885Z caller=main.go:170 msg="Tracing will be disabled"
level=info ts=2019-10-16T08:15:49.781448872Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-10-16T08:15:49.794832718Z caller=compact.go:251 msg="retention policy of raw samples is enabled" duration=720h0m0s
level=info ts=2019-10-16T08:15:49.794912434Z caller=compact.go:254 msg="retention policy of 5 min aggregated samples is enabled" duration=720h0m0s
level=info ts=2019-10-16T08:15:49.794926786Z caller=compact.go:257 msg="retention policy of 1 hour aggregated samples is enabled" duration=87600h0m0s
level=info ts=2019-10-16T08:15:49.794940895Z caller=compact.go:341 msg="starting compact node"
level=info ts=2019-10-16T08:15:49.794949212Z caller=prober.go:114 msg="changing probe status" status=ready
level=info ts=2019-10-16T08:15:49.795056156Z caller=main.go:353 msg="listening for requests and metrics" component=compact address=0.0.0.0:10902
level=info ts=2019-10-16T08:15:49.795165313Z caller=prober.go:143 msg="changing probe status" status=healthy
level=info ts=2019-10-16T08:15:49.795233513Z caller=compact.go:1063 msg="start sync of metas"
level=info ts=2019-10-16T08:15:50.312216312Z caller=compact.go:1069 msg="start of GC"
level=info ts=2019-10-16T08:15:50.315245595Z caller=compact.go:1075 msg="start of compaction"
level=info ts=2019-10-16T08:15:50.658389331Z caller=compact.go:264 msg="compaction iterations done"
level=info ts=2019-10-16T08:15:50.658457513Z caller=compact.go:271 msg="start first pass of downsampling"
level=info ts=2019-10-16T08:15:53.992421826Z caller=compact.go:277 msg="start second pass of downsampling"
level=info ts=2019-10-16T08:15:56.741387181Z caller=compact.go:282 msg="downsampling iterations done"
```
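For reference, the three retention lines in that log correspond to compactor flags along these lines (reconstructed from the output above, not the exact invocation; the data dir and objstore config paths are placeholders):

```
# Retention values taken from the "retention policy ... is enabled" log lines above.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yaml \
  --retention.resolution-raw=720h \
  --retention.resolution-5m=720h \
  --retention.resolution-1h=87600h \
  --wait
```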
Environment:
I've stopped the compactor for a moment and verified the status of the bucket.
It took more than 20 minutes, but finally worked.
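A sketch of the kind of check meant here, assuming the v0.8.x CLI layout and a placeholder objstore config path:

```
# List the blocks and their metadata, then verify the bucket for issues
# such as broken or overlapping blocks, while the compactor is paused.
thanos bucket inspect --objstore.config-file=/etc/thanos/bucket.yaml
thanos bucket verify --objstore.config-file=/etc/thanos/bucket.yaml
```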