sidecar: delayed gossip startup if peer is not working #372

Closed
pharaujo opened this issue Jun 11, 2018 · 7 comments

@pharaujo (Contributor)

Thanos, Prometheus and Golang version used
Thanos built from master.

$ thanos --version
thanos, version v0.1.0-rc.1 (branch: master, revision: 761c177820b6bc5c54677ab5a23e64f7dd6c1de1)
  build user:       root@buildhost
  build date:       20180605-17:23:40
  go version:       go1.10

What happened
Running thanos (sidecar in this case) with a list of 4 cluster peers, 2 of which weren't running yet, gossip startup took around 1 minute. Starting thanos query before the cluster was up resulted in it never connecting, even after the cluster was formed. Restarting thanos query after that resulted in immediate cluster membership.

What you expected to happen
I expected the cluster to be joinable right away, and the query node to join it automatically once it was formed.

How to reproduce it (as minimally and precisely as possible):
Run 2 sidecars with 4 cluster peers, two of which resolve but are not running any Thanos component; then try to connect to the cluster with a query node.
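
Condensed, the setup looks like this (placeholders as in the logs below; grafana1/grafana2 resolve but run no Thanos component yet):

# on prometheus1 and prometheus2 (peer lists adjusted accordingly)
thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /var/lib/prometheus/ \
  --log.level=debug \
  --cluster.peers=prometheus2.internaldomain:10900 \
  --cluster.peers=grafana1.internaldomain:10900 \
  --cluster.peers=grafana2.internaldomain:10900

# on the query node, started before the sidecar cluster has formed
thanos query --query.replica-label=prom --log.level=debug \
  --cluster.peers <prometheus1_ipaddress>:10900 \
  --cluster.peers <prometheus2_ipaddress>:10900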

Full logs of relevant components

prometheus 1

prometheus1$ thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /var/lib/prometheus/ --log.level=debug --cluster.peers=prometheus2.internaldomain:10900 --cluster.peers=grafana1.internaldomain:10900 --cluster.peers=grafana2.internaldomain:10900
level=info ts=2018-06-11T16:57:32.098971221Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=<prometheus1_ipaddress>:10901
level=debug ts=2018-06-11T16:58:32.132202479Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=<prometheus2_ipaddress>:10900,grafana1.internaldomain:10900,grafana2.internaldomain:10900
level=info ts=2018-06-11T16:58:32.132593081Z caller=sidecar.go:232 msg="No GCS or S3 bucket was configured, uploads will be disabled"
level=info ts=2018-06-11T16:58:32.132625982Z caller=sidecar.go:269 msg="starting sidecar" peer=
level=info ts=2018-06-11T16:58:32.133036684Z caller=main.go:226 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T16:58:32.133117184Z caller=sidecar.go:214 component=store msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-06-11T16:58:32.148029166Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4RF0XVKRY0A7T1HNB addr=<prometheus1_ipaddress>:10900
level=debug ts=2018-06-11T16:58:32.153771197Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4X4D25HSNZ1SFGZV6 addr=<prometheus2_ipaddress>:10900

level=debug ts=2018-06-11T16:59:21.533959013Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWKB1P0G41NRR4T21ZHPCJ addr=<querynode_ipaddress>:10900
level=debug ts=2018-06-11T16:59:36.176237191Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=1 peerType=source

prometheus 2

prometheus2$ thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /var/lib/prometheus/ --log.level=debug --cluster.peers=prometheus1.internaldomain:10900 --cluster.peers=grafana1.internaldomain:10900 --cluster.peers=grafana2.internaldomain:10900
level=info ts=2018-06-11T16:57:32.098361835Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=<prometheus2_ipaddress>:10901
level=debug ts=2018-06-11T16:58:32.132461707Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=<prometheus1_ipaddress>:10900,grafana1.internaldomain:10900,grafana2.internaldomain:10900
level=info ts=2018-06-11T16:58:32.132873105Z caller=sidecar.go:232 msg="No GCS or S3 bucket was configured, uploads will be disabled"
level=info ts=2018-06-11T16:58:32.132916105Z caller=sidecar.go:269 msg="starting sidecar" peer=
level=info ts=2018-06-11T16:58:32.133212804Z caller=main.go:226 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T16:58:32.133302303Z caller=sidecar.go:214 component=store msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-06-11T16:58:32.14727114Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4X4D25HSNZ1SFGZV6 addr=<prometheus2_ipaddress>:10900
level=debug ts=2018-06-11T16:58:32.152650415Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4RF0XVKRY0A7T1HNB addr=<prometheus1_ipaddress>:10900

level=debug ts=2018-06-11T16:59:21.534619043Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWKB1P0G41NRR4T21ZHPCJ addr=<querynode_ipaddress>:10900
level=debug ts=2018-06-11T16:59:36.180468054Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=1 peerType=source

query node

querynode$ thanos query --cluster.peers <prometheus1_ipaddress>:10900 --cluster.peers <prometheus2_ipaddress>:10900 --query.replica-label=prom --log.level=debug
level=info ts=2018-06-11T16:57:41.770445675Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=<querynode_ipaddress>:10901
level=info ts=2018-06-11T16:57:41.776103299Z caller=flags.go:66 msg="QueryAPI address that will be propagated through gossip" address=<querynode_ipaddress>:10902
level=debug ts=2018-06-11T16:57:41.783712432Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=<prometheus1_ipaddress>:10900,<prometheus2_ipaddress>:10900
level=info ts=2018-06-11T16:57:41.785963242Z caller=query.go:240 msg="starting query node"
level=info ts=2018-06-11T16:57:41.786153142Z caller=query.go:215 msg="Listening for query and metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T16:57:41.786297043Z caller=query.go:233 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-06-11T16:57:41.792732171Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWG9MQS50V0TNWZJ3TB1A9 addr=<querynode_ipaddress>:10900
level=debug ts=2018-06-11T16:57:41.794295077Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=0 peerType=query
^Clevel=info ts=2018-06-11T16:59:19.380614171Z caller=main.go:154 msg="caught signal. Exiting." signal=interrupt
level=debug ts=2018-06-11T16:59:19.380742972Z caller=delegate.go:88 component=cluster received=NotifyLeave node=01CFQWG9MQS50V0TNWZJ3TB1A9 addr=<querynode_ipaddress>:10900
level=info ts=2018-06-11T16:59:19.380879172Z caller=main.go:146 msg=exiting

querynode$ thanos query --cluster.peers <prometheus1_ipaddress>:10900 --cluster.peers <prometheus2_ipaddress>:10900 --query.replica-label=prom --log.level=debug
level=info ts=2018-06-11T16:59:21.516868029Z caller=flags.go:51 msg="StoreAPI address that will be propagated through gossip" address=<querynode_ipaddress>:10901
level=info ts=2018-06-11T16:59:21.521570049Z caller=flags.go:66 msg="QueryAPI address that will be propagated through gossip" address=<querynode_ipaddress>:10902
level=debug ts=2018-06-11T16:59:21.526095768Z caller=cluster.go:128 component=cluster msg="resolved peers to following addresses" peers=<prometheus1_ipaddress>:10900,<prometheus2_ipaddress>:10900
level=info ts=2018-06-11T16:59:21.528217578Z caller=query.go:240 msg="starting query node"
level=info ts=2018-06-11T16:59:21.529224282Z caller=query.go:215 msg="Listening for query and metrics" address=0.0.0.0:10902
level=info ts=2018-06-11T16:59:21.529737684Z caller=query.go:233 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=debug ts=2018-06-11T16:59:21.534204603Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWKB1P0G41NRR4T21ZHPCJ addr=<querynode_ipaddress>:10900
level=debug ts=2018-06-11T16:59:21.53570681Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4X4D25HSNZ1SFGZV6 addr=<prometheus2_ipaddress>:10900
level=debug ts=2018-06-11T16:59:21.53581941Z caller=delegate.go:82 component=cluster received=NotifyJoin node=01CFQWHTT4RF0XVKRY0A7T1HNB addr=<prometheus1_ipaddress>:10900
level=debug ts=2018-06-11T16:59:21.537509517Z caller=cluster.go:190 component=cluster msg="joined cluster" peers=2 peerType=query
level=info ts=2018-06-11T16:59:41.532793835Z caller=storeset.go:223 component=storeset msg="adding new store to query storeset" address=<prometheus2_ipaddress>:10901
level=info ts=2018-06-11T16:59:41.532854735Z caller=storeset.go:223 component=storeset msg="adding new store to query storeset" address=<prometheus1_ipaddress>:10901

Environment:

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
@bwplotka added the bug label Jun 11, 2018
@cwolfinger

Hi, have you found a workaround for this issue? I probably hit this a few times a day in development.

@povilasv (Member)

@pharaujo / @cwolfinger I think PR #383 should fix it; if you want, you can try running my Docker image (tagged master): https://hub.docker.com/r/povilasv/thanos/tags/
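
One possible way to try it (a sketch only; it assumes the image's entrypoint is the thanos binary and reuses the query flags from the logs above):

docker run --rm --net=host povilasv/thanos:master query \
  --query.replica-label=prom --log.level=debug \
  --cluster.peers <prometheus1_ipaddress>:10900 \
  --cluster.peers <prometheus2_ipaddress>:10900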

@bwplotka (Member)

Yea, @cwolfinger @pharaujo, did you have a chance to try the fixed Thanos Docker image?

BTW, is grafana1.internaldomain in the sidecar's --cluster.peers a thanos query domain?

@pharaujo (Contributor, Author)

I haven't had the chance to test it out yet.

@Bplotka yes, I'm running thanos query on the grafana nodes.

@bwplotka (Member)

bwplotka commented Jun 28, 2018

Fix landed on master, so let us know if you can still repro (#383).

@midnightconman (Contributor)

midnightconman commented Jul 26, 2018

I was able to reproduce this today; it took about 2 minutes for my store-gateway instance to start:

]$ kubectl logs thanos-storegateway
level=info ts=2018-07-26T22:51:32.199333285Z caller=flags.go:53 msg="StoreAPI address that will be propagated through gossip" address=192.168.117.5:10901
level=debug ts=2018-07-26T22:51:32.222762126Z caller=cluster.go:132 component=cluster msg="resolved peers to following addresses" peers=192.168.0.181:10900
level=debug ts=2018-07-26T22:51:32.223640198Z caller=store.go:112 msg="initializing bucket store"
level=debug ts=2018-07-26T22:53:31.204141171Z caller=store.go:116 msg="bucket store ready" init_duration=1m58.980499976s
level=info ts=2018-07-26T22:53:31.205165666Z caller=store.go:176 msg="starting store node"
level=info ts=2018-07-26T22:53:31.205263124Z caller=store.go:145 msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901

I am using v0.1.0-rc.2

@bwplotka (Member)

@midnightconman This is totally unrelated. We are talking about gossip startup, not store gateway init, which is known to take time. "bucket store ready" init_duration=1m58.980499976s <- that is just the store gateway preparing its cache.
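
For reference, a rough way to separate the two when reading logs is to grep for the messages shown in this thread, e.g.:

kubectl logs thanos-storegateway | grep -E 'resolved peers|joined cluster|bucket store'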

Closing this then for now.

fpetkovski added a commit to fpetkovski/thanos that referenced this issue Oct 17, 2024: Disable dedup proxy in multi-tsdb