sidecar: clustering does not join properly - race condition ? #373

Closed
cwolfinger opened this issue Jun 11, 2018 · 1 comment

@cwolfinger

thanos, version v0.0.1 (branch: HEAD, revision: 2c63665)
build user: root@0af42dc4266a
build date: 20180602-20:00:24
go version: go1.10.2

What happened
Started a single Thanos sidecar and two query nodes. The cluster did not set up until the Thanos sidecar was restarted (a retry sketch follows the first log below).

level=info ts=2018-06-11T17:05:54.3869957Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=10.1.2.84:10901
level=debug ts=2018-06-11T17:05:54.5421866Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=thanos-peers-hi-res.default.svc.cluster.local:10900
level=info ts=2018-06-11T17:05:54.5480826Z caller=sidecar.go:232 msg="No GCS or S3 bucket was configured, uploads will be disabled"
level=info ts=2018-06-11T17:05:54.5482496Z caller=sidecar.go:269 msg="starting sidecar" peer=
level=info ts=2018-06-11T17:05:54.5507041Z caller=main.go:226 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T17:05:54.5521783Z caller=sidecar.go:214 component=store msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2018-06-11T17:05:54.5506985Z caller=reloader.go:77 component=reloader msg="started watching config file for changes" in=/etc/prometheus/prometheus.yml.tmpl out=/etc/prometheus-shared/prometheus.yml
level=error ts=2018-06-11T17:05:54.5620611Z caller=runutil.go:43 component=reloader msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post http://127.0.0.1:9090/-/reload: dial tcp 127.0.0.1:9090: connect: connection refused"
level=warn ts=2018-06-11T17:05:54.5665254Z caller=sidecar.go:130 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="request config against http://127.0.0.1:9090/api/v1/status/config: Get http://127.0.0.1:9090/api/v1/status/config: dial tcp 127.0.0.1:9090: connect: connection refused"
level=debug ts=2018-06-11T17:05:56.6062056Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWZAVFMYR6VJXYQD9A44C7 addr=10.1.2.84:10900
level=debug ts=2018-06-11T17:05:56.6310602Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=0 peerType=source
level=info ts=2018-06-11T17:05:59.59841Z caller=reloader.go:188 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yml.tmpl cfg_out=/etc/prometheus-shared/prometheus.yml rule_dir=
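
In the first run above, the sidecar "resolved" only the headless Service name itself (thanos-peers-hi-res.default.svc.cluster.local:10900) rather than any peer IPs, and then joined the gossip cluster with peers=0. The following is a minimal sketch of one possible mitigation, not the actual Thanos code (resolvePeersWithRetry and the addresses are hypothetical): keep re-resolving the peer DNS name until it yields at least one address before joining.

```go
// Hypothetical sketch, not the Thanos implementation: keep re-resolving the
// headless peer Service until DNS returns at least one address, instead of
// joining the gossip cluster with an empty peer list.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolvePeersWithRetry re-resolves host:port until the lookup yields at
// least one IP or ctx expires. All names here are illustrative.
func resolvePeersWithRetry(ctx context.Context, addr string, interval time.Duration) ([]string, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ips, err := net.DefaultResolver.LookupIPAddr(ctx, host)
		if err == nil && len(ips) > 0 {
			peers := make([]string, 0, len(ips))
			for _, ip := range ips {
				peers = append(peers, net.JoinHostPort(ip.IP.String(), port))
			}
			return peers, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("resolving %q: %v", addr, ctx.Err())
		case <-ticker.C:
			// The Service may have no Endpoints yet while the peer pods start; retry.
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	peers, err := resolvePeersWithRetry(ctx, "thanos-peers-hi-res.default.svc.cluster.local:10900", 5*time.Second)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("joining cluster with peers:", peers)
}
```

With something along these lines, the sidecar would wait until the query pods' Endpoints exist instead of committing to an empty peer list after the first lookup.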

On restart of the Thanos sidecar, the cluster formed properly:

level=info ts=2018-06-11T17:11:43.5448783Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=10.1.2.84:10901
level=debug ts=2018-06-11T17:11:43.5589156Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=10.1.2.85:10900,10.1.2.86:10900
level=info ts=2018-06-11T17:11:43.5594216Z caller=sidecar.go:232 msg="No GCS or S3 bucket was configured, uploads will be disabled"
level=info ts=2018-06-11T17:11:43.5596431Z caller=sidecar.go:269 msg="starting sidecar" peer=
level=info ts=2018-06-11T17:11:43.5600137Z caller=main.go:226 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T17:11:43.5600899Z caller=reloader.go:77 component=reloader msg="started watching config file for changes" in=/etc/prometheus/prometheus.yml.tmpl out=/etc/prometheus-shared/prometheus.yml
level=info ts=2018-06-11T17:11:43.5600343Z caller=sidecar.go:214 component=store msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2018-06-11T17:11:43.5973595Z caller=reloader.go:188 component=reloader msg="Prometheus reload triggered" cfg_in=/etc/prometheus/prometheus.yml.tmpl cfg_out=/etc/prometheus-shared/prometheus.yml rule_dir=
level=debug ts=2018-06-11T17:11:43.6295098Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQX9ZP6SXDF624765BPVWPX addr=10.1.2.84:10900
level=debug ts=2018-06-11T17:11:43.6725086Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWZB4Y3RC90RDKSVCX97MB addr=10.1.2.86:10900
level=debug ts=2018-06-11T17:11:43.6743459Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWZAV9E841Z7X8S49TQ8KA addr=10.1.2.85:10900
level=debug ts=2018-06-11T17:11:43.6744618Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=2 peerType=source

What you expected to happen
The cluster should form regardless of the startup order of the sidecar and query nodes.

How to reproduce it (as minimally and precisely as possible):
Hard to reproduce; it appears to be a race condition in the startup ordering of the K8s resources.

Full logs to relevant components
See the logs in the previous section.

Anything else we need to know
No

Environment:

@bwplotka added the bug label on Jun 11, 2018
@bwplotka (Member)

Potential fix landed #383

fpetkovski added a commit to fpetkovski/thanos that referenced this issue on Oct 17, 2024: Optimize label usage for stringlabels