fix issue when OS' nameserver is loopback with --iterative #415
Conversation
src/zdns/resolver.go (Outdated)

```
@@ -86,6 +88,10 @@ type ResolverConfig struct {

// PopulateAndValidate checks if the ResolverConfig is valid and populates any missing fields with default values.
func (rc *ResolverConfig) PopulateAndValidate() error {
	// when creating multiple resolvers, each will call this function against the same config.
	rc.Lock()
```
I don't understand why this needs to be locked. Shouldn't anything for the shared thing be done once ahead of time? Or simply, copy and use. It feels weird to me that a downstream thing is changing the config.
So it is. The CLI calls config.PopulateAndValidate once, and no lock is required from the CLI's point of view. However, since someone can use the library without the CLI, inside InitResolver we also check that the config is populated with defaults and is valid. When used with the CLI, this is a redundant check. The initialization of workers happens in a multi-threaded fashion, so the lock is needed for the case when many threads are checking the config at once.
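To make the scenario concrete, here's a minimal, hypothetical sketch (not the actual zdns code; the `ResolverConfig` fields and default value are made up) of many worker goroutines sharing one config, each calling PopulateAndValidate during its own setup. The embedded mutex mirrors the `rc.Lock()` call in the diff above:

```go
package main

import "sync"

// ResolverConfig is a hypothetical stand-in for the real zdns config; the
// embedded mutex mirrors the rc.Lock() call in the diff above.
type ResolverConfig struct {
	sync.Mutex
	LocalAddrs []string // illustrative field only
}

// PopulateAndValidate fills defaults and checks the config; every worker
// calls it against the same shared instance.
func (rc *ResolverConfig) PopulateAndValidate() error {
	rc.Lock()
	defer rc.Unlock()
	if len(rc.LocalAddrs) == 0 {
		rc.LocalAddrs = []string{"192.0.2.1"} // populate a default (documentation IP)
	}
	return nil
}

func main() {
	cfg := &ResolverConfig{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // worker initialization is multi-threaded
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = cfg.PopulateAndValidate() // concurrent re-checks of the shared config
		}()
	}
	wg.Wait()
}
```

Without the mutex, the concurrent calls race on the config's fields even though each call would leave the config in the same state.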
Even if multiple threads are looking at the config at once, I'm not sure I understand why they should have any need to modify the config.
That's fair; ideally, running it again on an already populated/valid config should be idempotent.

The issue I saw was around LocalAddrs: a single worker would complain that, after validation, localAddrs was empty, and it would panic.

I think the issue is here, where we checked for IPv6 local addresses (since we don't support IPv6 in main) and replaced the IPv4 local addresses with all local addresses that pass validation. When called on a valid config, this just replaces the localAddrs array with an equivalent one, but it's still a write to a shared data structure.

My thinking was that I could either try to ensure that a valid config is never modified at all (even to replace a config field with an equal value, as I think is happening here), to guarantee idempotency in multi-threaded scenarios, or just add a lock. Figured the lock was easier.
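For illustration, a hedged reconstruction of the pattern being described, reusing the hypothetical `ResolverConfig` from the sketch above (`filterLocalAddrs` is a made-up name, not the real zdns function): even on an already-valid config, validation rebuilds the LocalAddrs slice, and the write-back races with concurrent readers whether or not the value actually changes:

```go
import (
	"errors"
	"net"
)

// filterLocalAddrs is a hypothetical stand-in for the validation step described
// above: it keeps only IPv4 addresses (main doesn't support IPv6) and writes the
// result back, replacing localAddrs with an equivalent slice on a valid config.
func (rc *ResolverConfig) filterLocalAddrs() error {
	var validated []string
	for _, addr := range rc.LocalAddrs {
		ip := net.ParseIP(addr)
		if ip == nil || ip.To4() == nil {
			continue // drop IPv6 and unparseable entries
		}
		validated = append(validated, addr)
	}
	if len(validated) == 0 {
		return errors.New("no valid IPv4 local addresses")
	}
	rc.LocalAddrs = validated // a write to shared state, even when value-equal
	return nil
}
```

A worker that reads `rc.LocalAddrs` between the reset and the write-back can observe an empty (or half-built) slice, which matches the panic described.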
I think structurally, though, this isn't really what we want. If the receiver gets a bad config, then it should error. We should probably validate the config before we pass it in if we're the ones who populated it. I don't think it should try to modify the config downstream; this just seems like it's going to set us up for confusion.
The specific IPv4/IPv6 error described here sounds more like an architectural issue we need to address.
That's very fair. It's been rather non-trivial to do both
- populate missing values
- verify the config is valid

in the same function correctly, without causing issues in some edge case.

Splitting the concerns, so that the CLI (or an outside caller) populates the ResolverConfig and InitResolver uses a ValidateConfig function to be sure the config is valid, seems cleaner, and we don't have this weirdness. I'll make the change.
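A rough sketch of what that split could look like, again extending the hypothetical `ResolverConfig` above (names, defaults, and checks are illustrative, not the final API): the caller populates once, and InitResolver only validates, never writing to the shared config:

```go
// PopulateDefaults fills in missing fields. The CLI (or an outside library
// caller) would invoke this exactly once, before any Resolver is created.
func (rc *ResolverConfig) PopulateDefaults() {
	if len(rc.LocalAddrs) == 0 {
		rc.LocalAddrs = []string{"192.0.2.1"} // illustrative default
	}
}

// ValidateConfig is read-only, so InitResolver can call it from many
// workers concurrently without taking a lock.
func (rc *ResolverConfig) ValidateConfig() error {
	if len(rc.LocalAddrs) == 0 {
		return errors.New("no local addresses set; populate the config first")
	}
	for _, addr := range rc.LocalAddrs {
		ip := net.ParseIP(addr)
		if ip == nil || ip.To4() == nil {
			return errors.New("invalid or IPv6 local address: " + addr)
		}
	}
	return nil
}
```

Because validation never mutates the config, idempotency and thread-safety fall out for free, and a bad config surfaces as an error at InitResolver rather than being silently patched downstream.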
…le separation of concerns
When #412 was merged, it removed the assumption we had that in --iterative mode the root servers were hard-coded as the 13 global root servers. #412 added the ability for a user to specify --name-servers in --iterative mode, to test iteration against a specified name server. Thus, the assumption was broken.

This meant that with main, if the OS-configured resolver was loopback, iterative lookups would fail, since we were setting the root-servers as Cloudflare/Google here.
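To sketch the intent of the fix (a hedged illustration, not the actual zdns implementation; `chooseNameServers`, `rootServers`, and `publicRecursives` are made-up names): user-supplied --name-servers take precedence per #412, iterative mode always starts from the root servers, and the loopback substitution only applies on the recursive path:

```go
package nameservers

import "net"

// Hypothetical defaults for illustration only.
var (
	rootServers      = []string{"198.41.0.4:53", "199.9.14.201:53"} // a/b root (subset)
	publicRecursives = []string{"1.1.1.1:53", "8.8.8.8:53"}         // Cloudflare/Google
)

// allLoopback reports whether every OS-configured nameserver is a loopback address.
func allLoopback(nameServers []string) bool {
	for _, ns := range nameServers {
		host, _, err := net.SplitHostPort(ns)
		if err != nil {
			host = ns // entry had no port
		}
		ip := net.ParseIP(host)
		if ip == nil || !ip.IsLoopback() {
			return false
		}
	}
	return len(nameServers) > 0
}

// chooseNameServers captures the behavior this PR is after: user-supplied
// --name-servers win (#412); iterative lookups start at the roots even when
// /etc/resolv.conf points at loopback; only the recursive path swaps a
// loopback OS resolver for public recursives.
func chooseNameServers(userNS, osNS []string, iterative bool) []string {
	if len(userNS) > 0 {
		return userNS
	}
	if iterative {
		return rootServers // previously a loopback OS resolver wrongly triggered Cloudflare/Google here
	}
	if allLoopback(osNS) {
		return publicRecursives // loopback can't serve as an external recursive target
	}
	return osNS
}
```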
Other Changes

Added a lock to ResolverConfig: since each Resolver attempts to populateAndValidate the config at once in worker_manager.go, this led to some race conditions, easily prevented by locking before validation. This lock is only acquired during Resolver setup, so there shouldn't be performance degradation after initial setup.

Testing
After having this issue crop up, and after working through edge cases with Phillip/ipv6, I tried to be more systematic with my testing for this. Sorry for the formatting; GitHub is not letting us use all available screen real estate here 😢
Variables
- /etc/resolv.conf using loopback nameservers, or not
- --iterative mode used, or not
- --name-servers specified: unset, loopback, or non-loopback
- --localAddr specified: unset, loopback, or non-loopback

2 * 2 * 3 * 3 = 36 possibilities
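For completeness, a tiny sketch of enumerating that matrix mechanically; the value sets are assumptions inferred from the list above (in particular, treating --name-servers and --localAddr as the two three-way variables):

```go
package main

import "fmt"

func main() {
	resolvConfLoopback := []bool{true, false}                     // /etc/resolv.conf loopback or not
	iterative := []bool{true, false}                              // --iterative or not
	nameServers := []string{"unset", "loopback", "non-loopback"}  // assumed values
	localAddrs := []string{"unset", "loopback", "non-loopback"}   // assumed values

	n := 0
	for _, rc := range resolvConfLoopback {
		for _, it := range iterative {
			for _, ns := range nameServers {
				for _, la := range localAddrs {
					n++
					fmt.Printf("case %2d: resolv.conf loopback=%-5v iterative=%-5v name-servers=%-12s local-addr=%s\n",
						n, rc, it, ns, la)
				}
			}
		}
	}
	fmt.Println("total:", n) // 2 * 2 * 3 * 3 = 36
}
```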