
Query: SRV Service Discovery improperly parsing IP / Port combination. #752

Closed
encee opened this issue Jan 21, 2019 · 5 comments · Fixed by #865

Comments

@encee

encee commented Jan 21, 2019

v0.2.1

thanos, version 0.2.1 (branch: HEAD, revision: 30e7cbd)
build user: root@79ffcf51ff9b
build date: 20181227-15:44:56
go version: go1.11.4

What happened

When parsing SRV records, the query node ends up with store addresses whose IPs carry a trailing dot.

./thanos2 query --http-address 0.0.0.0:19193 --grpc-address 0.0.0.0:19093 --cluster.address 0.0.0.0:10903 --store=dnssrv+_thanosstores.dev.cpe.alz.ninja
level=info ts=2019-01-14T18:24:40.554440491Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=10.64.12.187:19093
level=info ts=2019-01-14T18:24:40.55950516Z caller=flags.go:105 msg="QueryAPI address that will be propagated through gossip" address=10.64.12.187:19193
level=info ts=2019-01-14T18:24:40.567725384Z caller=main.go:256 component=query msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-01-14T18:24:40.567810857Z caller=query.go:427 msg="starting query node"
level=info ts=2019-01-14T18:24:40.57449789Z caller=query.go:396 msg="Listening for query and metrics" address=0.0.0.0:19193
level=info ts=2019-01-14T18:24:40.574796239Z caller=query.go:419 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:19093
level=warn ts=2019-01-14T18:24:55.568475476Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.60.15.183.:19091
level=warn ts=2019-01-14T18:24:55.568769031Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.60.13.226.:19091
^Clevel=info ts=2019-01-14T18:24:56.437269473Z caller=main.go:192 msg="caught signal. Exiting." signal=interrupt
level=warn ts=2019-01-14T18:24:56.437583512Z caller=runutil.go:69 component=query msg="detected close error" err="store gRPC listener: close tcp [::]:19093: use of closed network connection"
level=warn ts=2019-01-14T18:24:56.437721702Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = Canceled desc = context canceled" address=10.60.15.183.:19091
level=warn ts=2019-01-14T18:24:56.43783222Z caller=storeset.go:305 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = Canceled desc = context canceled" address=10.60.13.226.:19091
level=info ts=2019-01-14T18:24:56.437936685Z caller=main.go:184 msg=exiting

What you expected to happen

When using the IP addresses directly as static endpoints, the connection is made correctly. I expect this behavior to be consistent between SRV lookups and static peering.

./thanos2 query --http-address 0.0.0.0:19193 --grpc-address 0.0.0.0:19093 --cluster.address 0.0.0.0:10903 --store=10.60.13.226:19091 --store=10.60.15.183:19091
level=info ts=2019-01-14T18:25:21.007417285Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=10.64.12.187:19093
level=info ts=2019-01-14T18:25:21.013324205Z caller=flags.go:105 msg="QueryAPI address that will be propagated through gossip" address=10.64.12.187:19193
level=info ts=2019-01-14T18:25:21.019817368Z caller=main.go:256 component=query msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-01-14T18:25:21.019905066Z caller=query.go:427 msg="starting query node"
level=info ts=2019-01-14T18:25:21.027077368Z caller=query.go:396 msg="Listening for query and metrics" address=0.0.0.0:19193
level=info ts=2019-01-14T18:25:21.027201816Z caller=query.go:419 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:19093
level=info ts=2019-01-14T18:25:26.026133096Z caller=storeset.go:247 component=storeset msg="adding new store to query storeset" address=10.60.13.226:19091
level=info ts=2019-01-14T18:25:26.02617529Z caller=storeset.go:247 component=storeset msg="adding new store to query storeset" address=10.60.15.183:19091
^Clevel=info ts=2019-01-14T18:25:28.852415885Z caller=main.go:192 msg="caught signal. Exiting." signal=interrupt
level=warn ts=2019-01-14T18:25:28.852743928Z caller=runutil.go:69 component=query msg="detected close error" err="store gRPC listener: close tcp [::]:19093: use of closed network connection"
level=info ts=2019-01-14T18:25:28.85291679Z caller=main.go:184 msg=exiting

How to reproduce it (as minimally and precisely as possible):

Unsure how to reproduce. The DNS is registered within Route53 and I have not seen any other issues or comments regarding this problem.

Full logs to relevant components

See above

Anything else we need to know

OS: Amazon Linux 2

@bwplotka
Member

Yeah, I remember this issue being mentioned in a Slack thread.

Basically, some providers add a dot at the very end of the domain to make sure it is fully qualified. (This is actually correct behavior: https://serverfault.com/questions/803033/should-i-append-a-dot-at-the-end-of-my-dns-urls)

This means the SRV lookup returns `10.60.13.226.:19091`, and `10.60.13.226.` is actually a domain, not an IP. I wonder if it is some Route53 configuration for SRV where the domain happens to be the same as the IP, just with the trailing dot.

Now the question is... since this is a fully valid domain, a lookup for `10.60.13.226.` should give us the correct IP. Can you confirm this by exec'ing into the container and running `dig 10.60.13.226.`?

In Thanos, the lookup after the SRV query is done by gRPC itself, so we can't see whether the gRPC client tries to look up `10.60.13.226.` and what happens then.

One option would be for Thanos's SRV logic to do the SRV lookup and then an A lookup for each target. That would fix your case only if `10.60.13.226.` is resolvable by the Go resolver. In my opinion it is not...
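The SRV-then-A-lookup idea above can be sketched roughly like this. This is not Thanos code, just a minimal standalone illustration: the service name is taken from the logs above, and `joinTarget` is a hypothetical helper that falls back to simply trimming the trailing dot when the A lookup fails.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// joinTarget builds a dialable host:port from an SRV target and port,
// trimming the trailing dot that fully-qualified SRV targets carry.
// Without the trim, gRPC treats "10.60.13.226." as a hostname to
// resolve rather than as an IP literal.
func joinTarget(target string, port uint16) string {
	host := strings.TrimSuffix(target, ".")
	return net.JoinHostPort(host, strconv.Itoa(int(port)))
}

func main() {
	// SRV lookup for the record from the logs above...
	_, srvs, err := net.LookupSRV("", "", "_thanosstores.dev.cpe.alz.ninja")
	if err != nil {
		fmt.Println("SRV lookup failed:", err)
		return
	}
	for _, srv := range srvs {
		// ...followed by an A/AAAA lookup for each target, as suggested.
		ips, err := net.LookupIP(srv.Target)
		if err != nil {
			// Fall back to trimming the dot and using the target as-is.
			fmt.Println(joinTarget(srv.Target, srv.Port))
			continue
		}
		for _, ip := range ips {
			fmt.Println(net.JoinHostPort(ip.String(), strconv.Itoa(int(srv.Port))))
		}
	}
}
```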

@encee
Author

encee commented Jan 22, 2019

Correct, it appears to do that:

;; ANSWER SECTION:
10.60.13.226. 0 IN A 10.60.13.226

@bwplotka
Member

Most likely the Go DNS resolver is not able to do that? Not sure right now.

I would suggest adding some debug code to our DNS provider to resolve it explicitly and see the result. Are you familiar with Go? (: Otherwise we can help with such a debug branch.
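A minimal standalone debug program along these lines could look as follows. The dotted address comes from the logs above; whether the explicit lookup succeeds depends on the Route53 setup, but the `isIPLiteral` check shows why Go treats the dotted form as a domain in the first place.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// isIPLiteral reports whether Go parses s as an IP literal. The dotted
// form from the SRV answer is NOT a literal, so it falls through to a
// DNS lookup instead of being dialed directly.
func isIPLiteral(s string) bool {
	return net.ParseIP(s) != nil
}

func main() {
	const dotted = "10.60.13.226." // address from the logs above
	fmt.Println("treated as IP literal:", isIPLiteral(dotted))

	// Resolve the dotted name explicitly, the way the gRPC client
	// would have to after the SRV lookup.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	addrs, err := net.DefaultResolver.LookupIPAddr(ctx, dotted)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, a := range addrs {
		fmt.Println("resolved to:", a.IP)
	}
}
```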

@encee
Author

encee commented Jan 23, 2019

I am not familiar with Go; I can give it a stab at some point, but would definitely welcome the help!

@mjd95
Contributor

mjd95 commented Feb 23, 2019

I had a look into this and I think doing an A lookup after the SRV lookup is the way to go. I raised #865; https://github.com/mjd95/thanos/tree/do-a-lookups-after-srv-lookups is the branch if you want to try it out.
