Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

querier: do A lookups after SRV lookups for store discovery #865

Merged
merged 3 commits into from
Mar 8, 2019

Conversation

mjd95
Copy link
Contributor

@mjd95 mjd95 commented Feb 23, 2019

Changes

This adds an A lookup after the SRV lookups in order to resolve #752.

Verification

I made ran the query with --store set to a random SRV record (i.e., not something running a store node) then immediately killed the process. During close down, we see what address the querier was attempting to use for the store node.

On master:

martin@swift:~/go/src/github.com/improbable-eng/thanos$ ./thanos query --store=dnssrv+_xmpp-server._tcp.google.com.
level=info ts=2019-02-23T10:17:17.886702366Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=192.168.0.14:10901
level=info ts=2019-02-23T10:17:17.891416133Z caller=flags.go:105 msg="QueryAPI address that will be propagated through gossip" address=192.168.0.14:10902
level=info ts=2019-02-23T10:17:17.897047229Z caller=main.go:256 component=query msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-02-23T10:17:17.897083198Z caller=query.go:468 msg="starting query node"
level=info ts=2019-02-23T10:17:17.897113872Z caller=query.go:437 msg="Listening for query and metrics" address=0.0.0.0:10902
level=info ts=2019-02-23T10:17:17.897256901Z caller=query.go:460 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
^Clevel=info ts=2019-02-23T10:17:18.997878046Z caller=main.go:192 msg="caught signal. Exiting." signal=interrupt
level=warn ts=2019-02-23T10:17:18.998304451Z caller=runutil.go:107 component=query msg="detected close error" err="store gRPC listener: close tcp [::]:10901: use of closed netw
ork connection"     
level=warn ts=2019-02-23T10:17:18.998382583Z caller=storeset.go:308 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code =
 Canceled desc = context canceled" address=xmpp-server.l.google.com.:5269                                          
level=warn ts=2019-02-23T10:17:18.998385267Z caller=storeset.go:308 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code =
 Canceled desc = context canceled" address=alt3.xmpp-server.l.google.com.:5269                                                                           
level=warn ts=2019-02-23T10:17:18.998465318Z caller=storeset.go:308 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code =
 Canceled desc = context canceled" address=alt2.xmpp-server.l.google.com.:5269            
level=warn ts=2019-02-23T10:17:18.998606842Z caller=storeset.go:308 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code =
 Canceled desc = context canceled" address=alt4.xmpp-server.l.google.com.:5269                                                          
level=warn ts=2019-02-23T10:17:18.998649891Z caller=storeset.go:308 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code =
 Canceled desc = context canceled" address=alt1.xmpp-server.l.google.com.:5269                                                                                                  level=info ts=2019-02-23T10:17:18.998730396Z caller=main.go:184 msg=exiting

On this branch

martin@swift:~/go/src/github.com/improbable-eng/thanos$ ./thanos query --store=dnssrv+_xmpp-server._tcp.google.com.
level=info ts=2019-02-23T10:18:45.607836325Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=192.168.0.14:10901
level=info ts=2019-02-23T10:18:45.615876888Z caller=flags.go:105 msg="QueryAPI address that will be propagated through gossip" address=192.168.0.14:10902
level=info ts=2019-02-23T10:18:45.632843468Z caller=main.go:256 component=query msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-02-23T10:18:45.632973588Z caller=query.go:468 msg="starting query node"
level=info ts=2019-02-23T10:18:45.633066084Z caller=query.go:437 msg="Listening for query and metrics" address=0.0.0.0:10902
level=info ts=2019-02-23T10:18:45.633309377Z caller=query.go:460 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
^Clevel=info ts=2019-02-23T10:18:47.534010903Z caller=main.go:192 msg="caught signal. Exiting." signal=interrupt
level=warn ts=2019-02-23T10:18:47.534306619Z caller=storeset.go:308 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = Canceled desc = context canceled" address=74.125.195.125:5269
level=warn ts=2019-02-23T10:18:47.534338676Z caller=storeset.go:308 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = Canceled desc = context canceled" address=108.177.125.125:5269

The second one is what we want, as it is the correct <ip>:>port> list that we can also pass in directly

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like this, it is the common way of dealing with SRV lookup, even though grpc does it as well. The problem is that there are use cases when you don't want to do A lookup after SRV, because you want to use grpc client loadbalancing etc.. Wonder if we can make this optional per target? E.g as dnssrv and dnsnoasrv (that's bad naming)

PTAL @ivan-valkov

pkg/discovery/dns/resolver.go Outdated Show resolved Hide resolved
pkg/discovery/dns/resolver_test.go Show resolved Hide resolved
@mjd95
Copy link
Contributor Author

mjd95 commented Mar 5, 2019

I'm definitely not an expert, but my understanding is that normally the response includes a domain so an additional A lookup would normally be necessary. However I could imagine that there are cases (sounds like gRPC loadbalancing would be one) where SRV records are entered with an ip in the response and the additional A lookup would not be needed.

I would be happy to extend this PR to support both cases. I don't think we should do anything magic, so as @bwplotka suggests it should probably be explicit from the flag name whether or not we will do the A lookup. I'm not sure what would be better naming than dnssrv+ and dnssrvnoa+ though (naming is hard :))

Does that sounds reasonable @bwplotka @ivan-valkov? I would increase the test coverage so that all cases are covered as well of course.

Copy link

@ivan-valkov ivan-valkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with having both options supported. The names dnssrv and dnssrvnoa seems good enough to me given the flag explains that one does a subsequent IP lookup.

@bwplotka
Copy link
Member

bwplotka commented Mar 5, 2019

That would be awesome!

@mjd95 mjd95 force-pushed the do-a-lookups-after-srv-lookups branch from d2807fa to a72cca1 Compare March 8, 2019 10:03
@mjd95
Copy link
Contributor Author

mjd95 commented Mar 8, 2019

Updated to make the old behaviour available by using dnssrvnoa+

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LEGIT! LGTM, thanks. (:

@bwplotka bwplotka merged commit 432785e into thanos-io:master Mar 8, 2019
earthdiaosi pushed a commit to earthdiaosi/thanos that referenced this pull request Mar 14, 2019
…o#865)

* do A lookups after SRV lookups for store discovery

* add missing trailing period

* re-add old behaviour under dnssrvnoa+
@mjd95 mjd95 deleted the do-a-lookups-after-srv-lookups branch April 12, 2019 10:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Query: SRV Service Discovery improperly parsing IP / Port combination.
3 participants