sidecar: delayed gossip startup if peer is not working #372

Closed
pharaujo opened this issue Jun 11, 2018 · 7 comments

@pharaujo (Contributor)

Thanos, Prometheus and Golang version used
Thanos built from master.

$ thanos --version
thanos, version v0.1.0-rc.1 (branch: master, revision: 761c177820b6bc5c54677ab5a23e64f7dd6c1de1)
  build user:       root@buildhost
  build date:       20180605-17:23:40
  go version:       go1.10

What happened
Running thanos (sidecar in this case) with a list of 4 cluster peers, 2 of which weren't running yet, gossip startup took around 1 minute. Starting thanos query before the cluster was up resulted in it never connecting, even after the cluster was formed. Restarting thanos query after that resulted in immediate cluster membership.

What you expected to happen
I expected the cluster to be joinable right away, and the query node to join it automatically once it was formed.

How to reproduce it (as minimally and precisely as possible):
Run 2 sidecars with 4 cluster peers, two of which resolve but are not running any Thanos component; then try to connect to the cluster with a query node.
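
Condensed, the setup looks like this (placeholders as in the logs below; grafana1/grafana2 resolve but run no Thanos component yet):

# on prometheus1 and prometheus2 (peer lists adjusted accordingly)
thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /var/lib/prometheus/ \
  --log.level=debug \
  --cluster.peers=prometheus2.internaldomain:10900 \
  --cluster.peers=grafana1.internaldomain:10900 \
  --cluster.peers=grafana2.internaldomain:10900

# on the query node, started before the sidecar cluster has formed
thanos query --query.replica-label=prom --log.level=debug \
  --cluster.peers <prometheus1_ipaddress>:10900 \
  --cluster.peers <prometheus2_ipaddress>:10900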

Full logs of relevant components

prometheus 1

prometheus1$ thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /var/lib/prometheus/ --log.level=debug --cluster.peers=prometheus2.internaldomain:10900 --cluster.peers=grafana1.internaldomain:10900 --cluster.peers=grafana2.internaldomain:10900
level=info ts=2018-06-11T16:57:32.098971221Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=<prometheus1_ipaddress>:10901
level=debug ts=2018-06-11T16:58:32.132202479Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=<prometheus2_ipaddress>:10900,grafana1.internaldomain:10900,grafana2.internaldomain:10900
level=info ts=2018-06-11T16:58:32.132593081Z caller=sidecar.go:232 msg="No GCS or S3 bucket was configured, uploads will be disabled"
level=info ts=2018-06-11T16:58:32.132625982Z caller=sidecar.go:269 msg="starting sidecar" peer=
level=info ts=2018-06-11T16:58:32.133036684Z caller=main.go:226 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T16:58:32.133117184Z caller=sidecar.go:214 component=store msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-06-11T16:58:32.148029166Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4RF0XVKRY0A7T1HNB addr=<prometheus1_ipaddress>:10900
level=debug ts=2018-06-11T16:58:32.153771197Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4X4D25HSNZ1SFGZV6 addr=<prometheus2_ipaddress>:10900

level=debug ts=2018-06-11T16:59:21.533959013Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWKB1P0G41NRR4T21ZHPCJ addr=<querynode_ipaddress>:10900
level=debug ts=2018-06-11T16:59:36.176237191Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=1 peerType=source

prometheus 2

prometheus2$ thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /var/lib/prometheus/ --log.level=debug --cluster.peers=prometheus1.internaldomain:10900 --cluster.peers=grafana1.internaldomain:10900 --cluster.peers=grafana2.internaldomain:10900
level=info ts=2018-06-11T16:57:32.098361835Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=<prometheus2_ipaddress>:10901
level=debug ts=2018-06-11T16:58:32.132461707Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=<prometheus1_ipaddress>:10900,grafana1.internaldomain:10900,grafana2.internaldomain:10900
level=info ts=2018-06-11T16:58:32.132873105Z caller=sidecar.go:232 msg="No GCS or S3 bucket was configured, uploads will be disabled"
level=info ts=2018-06-11T16:58:32.132916105Z caller=sidecar.go:269 msg="starting sidecar" peer=
level=info ts=2018-06-11T16:58:32.133212804Z caller=main.go:226 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T16:58:32.133302303Z caller=sidecar.go:214 component=store msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-06-11T16:58:32.14727114Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4X4D25HSNZ1SFGZV6 addr=<prometheus2_ipaddress>:10900
level=debug ts=2018-06-11T16:58:32.152650415Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4RF0XVKRY0A7T1HNB addr=<prometheus1_ipaddress>:10900

level=debug ts=2018-06-11T16:59:21.534619043Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWKB1P0G41NRR4T21ZHPCJ addr=<querynode_ipaddress>:10900
level=debug ts=2018-06-11T16:59:36.180468054Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=1 peerType=source

query node

querynode$ thanos query --cluster.peers <prometheus1_ipaddress>:10900 --cluster.peers <prometheus2_ipaddress>:10900 --query.replica-label=prom --log.level=debug
level=info ts=2018-06-11T16:57:41.770445675Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=<querynode_ipaddress>:10901
level=info ts=2018-06-11T16:57:41.776103299Z caller=flags.go:66 msg="QueryAPI address that will be propagated through gossip" address=<querynode_ipaddress>:10902
level=debug ts=2018-06-11T16:57:41.783712432Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=<prometheus1_ipaddress>:10900,<prometheus2_ipaddress>:10900
level=info ts=2018-06-11T16:57:41.785963242Z caller=query.go:240 msg="starting query node"
level=info ts=2018-06-11T16:57:41.786153142Z caller=query.go:215 msg="Listening for query and metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T16:57:41.786297043Z caller=query.go:233 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-06-11T16:57:41.792732171Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWG9MQS50V0TNWZJ3TB1A9 addr=<querynode_ipaddress>:10900
level=debug ts=2018-06-11T16:57:41.794295077Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=0 peerType=query
^Clevel=info ts=2018-06-11T16:59:19.380614171Z caller=main.go:154 msg="caught signal. Exiting." signal=interrupt
level=debug ts=2018-06-11T16:59:19.380742972Z caller=delegate.go:88 component=cluster received=NotifyLeave node=01CFQWG9MQS50V0TNWZJ3TB1A9 addr=<querynode_ipaddress>:10900
level=info ts=2018-06-11T16:59:19.380879172Z caller=main.go:146 msg=exiting

querynode$ thanos query --cluster.peers <prometheus1_ipaddress>:10900 --cluster.peers <prometheus2_ipaddress>:10900 --query.replica-label=prom --log.level=debug
level=info ts=2018-06-11T16:59:21.516868029Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=<querynode_ipaddress>:10901
level=info ts=2018-06-11T16:59:21.521570049Z caller=flags.go:66 msg="QueryAPI address that will be propagated through gossip" address=<querynode_ipaddress>:10902
level=debug ts=2018-06-11T16:59:21.526095768Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=<prometheus1_ipaddress>:10900,<prometheus2_ipaddress>:10900
level=info ts=2018-06-11T16:59:21.528217578Z caller=query.go:240 msg="starting query node"
level=info ts=2018-06-11T16:59:21.529224282Z caller=query.go:215 msg="Listening for query and metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T16:59:21.529737684Z caller=query.go:233 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-06-11T16:59:21.534204603Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWKB1P0G41NRR4T21ZHPCJ addr=<querynode_ipaddress>:10900
level=debug ts=2018-06-11T16:59:21.53570681Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4X4D25HSNZ1SFGZV6 addr=<prometheus2_ipaddress>:10900
level=debug ts=2018-06-11T16:59:21.53581941Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4RF0XVKRY0A7T1HNB addr=<prometheus1_ipaddress>:10900
level=debug ts=2018-06-11T16:59:21.537509517Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=2 peerType=query
level=info ts=2018-06-11T16:59:41.532793835Z caller=storeset.go:223 component=storeset msg="adding new store to query storeset" address=<prometheus2_ipaddress>:10901
level=info ts=2018-06-11T16:59:41.532854735Z caller=storeset.go:223 component=storeset msg="adding new store to query storeset" address=<prometheus1_ipaddress>:10901

Environment:

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
@bwplotka added the bug label Jun 11, 2018
@cwolfinger

Hi, have you found a workaround for this issue? I probably hit this a few times a day in development.

@povilasv (Member)

@pharaujo / @cwolfinger I think PR #383 should fix it; if you want, you can try running my Docker image (tagged master): https://hub.docker.com/r/povilasv/thanos/tags/
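
One possible way to try it (a sketch only; it assumes the image's entrypoint is the thanos binary and reuses the query flags from the logs above):

docker run --rm --net=host povilasv/thanos:master query \
  --query.replica-label=prom --log.level=debug \
  --cluster.peers <prometheus1_ipaddress>:10900 \
  --cluster.peers <prometheus2_ipaddress>:10900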

@bwplotka (Member)

Yea, @cwolfinger @pharaujo, did you have a chance to try the fixed Thanos Docker image?

BTW, is grafana1.internaldomain in the sidecar's --cluster.peers a thanos query domain?

@pharaujo (Contributor, Author)

I haven't had the chance to test it out yet.

@Bplotka yes, I'm running thanos query on the grafana nodes.

@bwplotka (Member)

bwplotka commented Jun 28, 2018

Fix landed on master, so let us know if you can still repro (#383).

@midnightconman (Contributor)

midnightconman commented Jul 26, 2018

I was able to reproduce this today; it took about 2 minutes for my store-gateway instance to start:

]$ kubectl logs thanos-storegateway
level=info ts=2018-07-26T22:51:32.199333285Z caller=flags.go:53 msg="StoreAPI address that will be propagated through gossip" address=192.168.117.5:10901
level=debug ts=2018-07-26T22:51:32.222762126Z caller=cluster.go:132 component=cluster msg="resolved peers to following addresses" peers=192.168.0.181:10900
level=debug ts=2018-07-26T22:51:32.223640198Z caller=store.go:112 msg="initializing bucket store"
level=debug ts=2018-07-26T22:53:31.204141171Z caller=store.go:116 msg="bucket store ready" init_duration=1m58.980499976s
level=info ts=2018-07-26T22:53:31.205165666Z caller=store.go:176 msg="starting store node"
level=info ts=2018-07-26T22:53:31.205263124Z caller=store.go:145 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901

I am using v0.1.0-rc.2

@bwplotka (Member)

@midnightconman This is totally unrelated. We are talking about gossip startup, not store gateway init, which is known to take time. "bucket store ready" init_duration=1m58.980499976s <- that is just the store gateway preparing its cache.
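
For reference, a rough way to separate the two when reading logs is to grep for the messages shown in this thread, e.g.:

kubectl logs thanos-storegateway | grep -E 'resolved peers|joined cluster|bucket store'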

Closing this then for now.

fpetkovski added a commit to fpetkovski/thanos that referenced this issue Oct 17, 2024: Disable dedup proxy in multi-tsdb