Compress all DNS responses #971

Closed · wants to merge 2 commits

Conversation


@epipho epipho commented May 22, 2015

No description provided.

@ryanbreen
Contributor

Could we get some tests for this?

@armon
Member

armon commented May 23, 2015

This requires a lot more testing. In my experience, the DNS library has caused corruption issues in the past when compression was enabled.

@grobie

grobie commented May 23, 2015

I'd be very interested to hear more about the corruption issues you've seen in order to test for such cases in our infrastructure. Can you remember under which circumstances these might happen? We're using miekg/dns as well for our DNS server implementation (on top of consul) and to my knowledge haven't had any issues so far with compression.

@epipho
Author

epipho commented May 23, 2015

Sorry about that, meant to open a local PR between branches. Tests will be coming soon.

I have been running a good number of manual tests so far and have seen no hint of corruption. The example given in #854 compresses identically to the result returned by the 8.8.8.8 DNS server. Do you know of any specific examples that caused corruption in the past? It may also only be necessary to enable compression on the recursor queries, leaving it disabled for all other response messages.


@armon
Member

armon commented May 27, 2015

@grobie @epipho It's been a while... But if memory serves, the compression issue hit when doing recursion for clients. The actual responses we generate didn't cause issues, but I think some of the more edge case record types did. Never saw it with A, AAAA, CNAME or SRV.

@grobie

grobie commented May 29, 2015

Thanks Armon. We don't do any recursion on our service discovery DNS servers (yet), good to know.

@discordianfish
Contributor

Is there an ETA for getting this merged and released? See my comment on #854 (comment) - it's IMO a really severe issue.

@sstarcher

@discordianfish I built consul 0.5.2 with this patch, removed my "search ec2.internal" from /etc/resolv.conf, and set only a single nameserver of 127.0.0.1.

Even with compression turned on I still get "FATA[0000] Get https://registry-1.docker.io/v1/repositories/library/busybox/tags: dial tcp: lookup registry-1.docker.io on 127.0.0.1:53: cannot unmarshal DNS message" from docker

@discordianfish
Contributor

@sstarcher Yes, looks like the responses are still too large. This PR doesn't fix the actual issue, it just makes it less likely to hit it.

@epipho
Author

epipho commented Jun 5, 2015

I have not been able to craft a decent test that validates this fix as I cannot get the test DNS server to reliably return a response that is too large. We have been using it without issues but it sounds like @sstarcher still sees some.

@discordianfish do you believe miekg/dns#216 will fix that issue as well?

@discordianfish
Contributor

Oh I missed that, I tried the fix and it worked for me - at least for my test tool.
@sstarcher Are you sure you used the patched version?

Besides that, I'd consider this a bug in miekg/dns (see the issue I opened). It seems idiomatic to write the DNS message received from Exchange() to the client, but since miekg/dns doesn't set the compress flag even if the original response was compressed, it yields a message that is too large.

@sstarcher

@discordianfish I'm sure I compiled turbine's compress-dns branch:

./command/agent/dns.go: m.Compress = true
./command/agent/dns.go: m.Compress = true
./command/agent/dns.go: r.Compress = true
./command/agent/dns.go: m.Compress = true

I also checked the version of my binary; it was something along the lines of 0.5.2.#####

I will re-attempt on Monday.

@discordianfish
Contributor

@sstarcher I just confirmed that it fixes the problems for me: master is broken but with this branch it works.

@sstarcher

@discordianfish I again rebuilt it using the compress-dns branch from turbine and I still get
FATA[0000] Get https://registry-1.docker.io/v1/repositories/library/busybox/tags: dial tcp: lookup registry-1.docker.io on 127.0.0.1:53: cannot unmarshal DNS message

root@ip-10-0-10-59:~/.gvm/pkgsets/go1.4/global/src/github.com/turbine/consul# ./bin/consul version
Consul v0.5.2-13-g08b7313
Consul Protocol: 3 (Understands back to: 1)
root@ip-10-0-10-59:~/.gvm/pkgsets/go1.4/global/src/github.com/turbine/consul# ./bin/consul info
WARNING: It is highly recommended to set GOMAXPROCS higher than 1

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 3
build:
    prerelease =
    revision = 08b7313d
    version = 0.5.2

@discordianfish
Contributor

@sstarcher That's weird... Maybe you can provide a tcpdump?

@discordianfish
Contributor

@miekg suggests setting the compress flag in Consul rather than waiting for him to change this upstream.
@armon Can we get this merged soon? Even if it doesn't fix the issue for everyone, it's definitely a pretty clear bug which breaks everyone using Docker with Consul as a recursor.

@sstarcher

I edited my last post; I should've said it was the unmarshal error again, not the "no such host" error.

Here is a tcpdump of the traffic on port 53:

17:22:09.017814 IP 10.0.10.4.domain > ip-10-0-10-199.node.us-east-1.consul.52227: 24183 3/0/0 A 10.0.20.188, A 10.0.10.15, A 10.0.20.239 (89)
17:22:09.017856 IP 10.0.10.4.domain > ip-10-0-10-199.node.us-east-1.consul.52227: 24457 0/0/0 (41)
17:22:11.012427 IP ip-10-0-10-199.node.us-east-1.consul.59108 > 10.0.0.2.domain: 9408+ A? registry-1.docker.io. (38)
17:22:11.012528 IP ip-10-0-10-199.node.us-east-1.consul.56231 > 10.0.0.2.domain: 39053+ AAAA? registry-1.docker.io. (38)
17:22:11.013844 IP 10.0.0.2.domain > ip-10-0-10-199.node.us-east-1.consul.59108: 9408 7/0/0 CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., A 54.172.137.222, A 107.21.22.134, A 52.0.31.125, A 52.5.4.145, A 52.6.136.158, A 54.164.219.90 (214)
17:22:11.014081 IP 10.0.0.2.domain > ip-10-0-10-199.node.us-east-1.consul.56231: 39053 1/1/0 CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. (200)
17:22:11.014445 IP ip-10-0-10-199.node.us-east-1.consul.59633 > 10.0.0.2.domain: 35794+ A? registry-1.docker.io. (38)
17:22:11.014817 IP 10.0.0.2.domain > ip-10-0-10-199.node.us-east-1.consul.59633: 35794 7/0/0 CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., A 54.164.219.90, A 54.172.137.222, A 107.21.22.134, A 52.0.31.125, A 52.5.4.145, A 52.6.136.158 (214)
17:22:11.015868 IP ip-10-0-10-199.node.us-east-1.consul.40609 > 10.0.0.2.domain: 21181+ A? index.docker.io. (33)
17:22:11.015943 IP ip-10-0-10-199.node.us-east-1.consul.59027 > 10.0.0.2.domain: 38952+ AAAA? index.docker.io. (33)
17:22:11.016956 IP 10.0.0.2.domain > ip-10-0-10-199.node.us-east-1.consul.59027: 38952 0/1/0 (103)
17:22:11.017117 IP 10.0.0.2.domain > ip-10-0-10-199.node.us-east-1.consul.40609: 21181 1/0/0 A 162.242.195.84 (49)
17:22:11.144753 IP ip-10-0-10-199.node.us-east-1.consul.56472 > 10.0.0.2.domain: 11298+ A? registry-1.docker.io. (38)
17:22:11.145042 IP ip-10-0-10-199.node.us-east-1.consul.48252 > 10.0.0.2.domain: 10235+ AAAA? registry-1.docker.io. (38)
17:22:11.145112 IP 10.0.0.2.domain > ip-10-0-10-199.node.us-east-1.consul.56472: 11298 7/0/0 CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., A 52.6.136.158, A 54.164.219.90, A 54.172.137.222, A 107.21.22.134, A 52.0.31.125, A 52.5.4.145 (214)
17:22:11.145408 IP 10.0.0.2.domain > ip-10-0-10-199.node.us-east-1.consul.48252: 10235 1/1/0 CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com. (200)
17:22:11.145646 IP ip-10-0-10-199.node.us-east-1.consul.59490 > 10.0.0.2.domain: 25933+ A? registry-1.docker.io. (38)
17:22:11.145967 IP 10.0.0.2.domain > ip-10-0-10-199.node.us-east-1.consul.59490: 25933 7/0/0 CNAME us-east-1-elbio-rm5bon1qaeo4-623296237.us-east-1.elb.amazonaws.com., A 52.5.4.145, A 52.6.136.158, A 54.164.219.90, A 54.172.137.222, A 107.21.22.134, A 52.0.31.125 (214)

@miekg

miekg commented Jun 8, 2015

I don't really think this is a bug in go dns. Compressing a message is basically a "per-client-connection" decision. You should check the client's EDNS0 buffer size and the message size (the Len() function) to see if it fits. If not, set Compress to true (or just do this always). If it still does not fit, set the TC bit (when the transport is UDP).
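
A minimal Go sketch of the check described here, assuming the github.com/miekg/dns API; the package, function name, and handler shape are illustrative, not Consul's actual code:

package dnsutil

import "github.com/miekg/dns"

// fitResponse tries compression when the reply is larger than the client's
// advertised buffer, and sets the TC bit if a UDP reply still does not fit.
func fitResponse(req, resp *dns.Msg, isUDP bool) {
    size := 512 // default UDP payload size per RFC 1035
    if opt := req.IsEdns0(); opt != nil && int(opt.UDPSize()) > size {
        size = int(opt.UDPSize()) // honor the client's EDNS0 buffer size
    }
    if !isUDP || resp.Len() <= size {
        return // already fits, or TCP where the size limit is much larger
    }
    resp.Compress = true // Len() now reports the compressed wire size
    if resp.Len() > size {
        resp.Truncated = true // still too big: client should retry over TCP
    }
}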

@@ -625,6 +627,7 @@ func (d *DNSServer) handleRecurse(resp dns.ResponseWriter, req *dns.Msg) {
 	var err error
 	for _, recursor := range d.recursors {
 		r, rtt, err = c.Exchange(req, recursor)
+		r.Compress = true

This should probably be moved after the error check, because r may be nil in that case and cause a nil pointer dereference.
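
A minimal sketch of the reordering being suggested (the package and function name are illustrative, not the exact Consul code): only use r once the error from Exchange has been checked, since r is nil on failure.

package dnsutil

import "github.com/miekg/dns"

// recurseOnce forwards a query to one recursor and marks the reply for
// compression only after Exchange is known to have succeeded.
func recurseOnce(c *dns.Client, req *dns.Msg, recursor string) (*dns.Msg, error) {
    r, _, err := c.Exchange(req, recursor)
    if err != nil {
        return nil, err // r is nil here; touching it would panic
    }
    r.Compress = true
    return r, nil
}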

@epipho
Author

epipho commented Jun 24, 2015

Hello everyone. Just fixed a silly nil pointer crash in 04bba18 where we were accessing the response returned by Exchange before we had checked the error. If a DNS recursion failed, it would cause Consul to panic and crash.

If you are using this patch, please upgrade immediately.

@epipho
Author

epipho commented Jun 25, 2015

@ryanuber awesome, it looks like you reviewed and posted literally right as we were discovering that issue.

@ryanuber
Member

Just checking in here - I think we can get this in if some tests could be crafted to ensure that we are actually compressing the messages. I also think it would be worthwhile to make the adjustments @miekg suggests rather than just compressing everything and hoping the messages fit into the buffer. This would also ease the hesitation, since most messages should be small enough to fit without compression.

@epipho
Author

epipho commented Aug 20, 2015

I will look into both after I get back from vacation on 9/1.

@slackpad
Contributor

Hi @epipho checking in on this one - is there still interest in getting this merged?

@sstarcher

I'm still interested as I had to stop using consul as the direct DNS due to this and another issue.

@epipho
Author

epipho commented Jan 27, 2016

@slackpad indeed. Sorry I had github notifications turned off and didn't see it. I just merged 0.6.3 into my fork and am doing testing to make sure there are no regressions.

@jordan-acosta

@epipho, I found this PR after running into this problem at work. Wondering if this will be merged soon; is there anything I can do to help push it along?

This is kind of an edge case for us (it only happens in our dev environment, and we're not sure why), but it would make our lives a lot easier if this were merged.

@jordan-acosta

It doesn't look like this change checks the length of the response. Is it possible that, even with the compression, it could still exceed 512 bytes?

@slackpad
Contributor

slackpad commented Mar 9, 2016

I'm going to close this out in favor of #1734, which was just merged. It directly enforces the UDP size limit, which I think was the root problem here, and avoids any extra client burden / complexity around compression. I think we'd need to enforce the limit even if we did enable compression, so this is a good first step, even if compression is warranted later. It doesn't seem necessary, though: because of random shuffling, even getting 1 or 2 responses back instead of the usual 3 should work fine in most cases.
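
For readers following along, a rough sketch of that trimming idea (not the actual #1734 code), assuming the github.com/miekg/dns API:

package dnsutil

import "github.com/miekg/dns"

// trimUDPAnswers drops answer records until the reply fits in a plain
// 512-byte UDP datagram, optionally flagging the reply as truncated.
func trimUDPAnswers(resp *dns.Msg, setTruncated bool) {
    trimmed := false
    for len(resp.Answer) > 0 && resp.Len() > 512 {
        resp.Answer = resp.Answer[:len(resp.Answer)-1]
        trimmed = true
    }
    if trimmed && setTruncated {
        resp.Truncated = true
    }
}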

@slackpad slackpad closed this Mar 9, 2016
@epipho
Author

epipho commented Mar 11, 2016

I don't believe #1734 addresses all the cases. It looks like it will address a service lookup returning a response that is too large but will still fail if an upstream DNS server returns a large response. We found this happens when querying large ELBs (10+ A records sometimes) where the upstream DNS server will send a compressed UDP response (that is under 512 bytes) but Consul will decompress it before forwarding it on, making it too large for Go to accept.
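
An illustrative snippet of that failure mode (the upstream address and query name are placeholders): Exchange decodes the upstream reply, and re-packing it without Compress loses the name compression the upstream applied, so the forwarded reply can exceed 512 bytes even though the original fit.

package main

import (
    "fmt"

    "github.com/miekg/dns"
)

func main() {
    req := new(dns.Msg)
    req.SetQuestion("registry-1.docker.io.", dns.TypeA)

    // Ask an upstream recursor directly (address is a placeholder).
    r, _, err := new(dns.Client).Exchange(req, "10.0.0.2:53")
    if err != nil {
        panic(err)
    }
    fmt.Println(r.Len()) // size if re-packed uncompressed; may exceed 512 for large ELBs
    r.Compress = true
    fmt.Println(r.Len()) // size with compression, roughly what the upstream actually sent
}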

@slackpad
Contributor

Ah interesting - I'll reopen so we can look at that case as well.

@slackpad slackpad reopened this Mar 11, 2016
@slackpad
Contributor

Since Consul-generated messages should be sized by the current fix, perhaps we should just compress in the forwarding case if the incoming message is also compressed.

@epipho
Author

epipho commented Mar 11, 2016

Seems reasonable. I was also thinking we may want to do some of the following to ensure correctness / usefulness in as many cases as possible.

  • Pass through the truncation flag from the incoming message regardless of the value of enable_truncate. Since we don't know what the upstream DNS server is doing, silently dropping this flag could cause undesirable behavior.
  • Compress UDP messages that come in already over 512 bytes (non-conforming server?); see the sketch after this list.
  • Use the trim method from #1734 (Make sure UDP DNS responses aren't larger than allowed) to trim records / set truncation if the compressed size is still over. Not sure on this one; it could violate the principle of least surprise.
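
A sketch of the second bullet, assuming the github.com/miekg/dns API (the helper name is illustrative): compress a forwarded reply only when it would not otherwise fit in a plain 512-byte UDP datagram.

package dnsutil

import "github.com/miekg/dns"

// writeForwarded compresses an oversized reply before writing it back
// to the client; small replies are left untouched.
func writeForwarded(w dns.ResponseWriter, r *dns.Msg) error {
    if r.Len() > 512 {
        r.Compress = true
    }
    return w.WriteMsg(r)
}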

@slackpad slackpad added the type/bug Feature does not function as expected label Mar 18, 2016
@slackpad slackpad self-assigned this Mar 18, 2016
@alexjh

alexjh commented May 31, 2016

@slackpad I've been doing some testing as I'm hitting this same docker + consul bug. I don't think the DNS library makes the compressed state of the incoming message available.

@miekg

miekg commented May 31, 2016

No it doesn't, because compression is hop by hop, so it doesn't make sense to relay this when you are a proxy.

I don't get why Consul doesn't just compress everything. Compression has been in the DNS standard since 1984.

@alexjh

alexjh commented May 31, 2016

@epipho:

Compress udp messages that come in that are already over 512 bytes (non-conforming server?)

Like miekg said, it makes sense to compress everything. I think that would be a good start and would fix the docker bug that a lot of people are seeing.

@ryanuber: I don't think tests for compression are needed as that's already covered by the tests in the dns library.

@slackpad
Contributor

slackpad commented Jun 1, 2016

I'd be ok with compressing everything now that we do have the trim fix for the responses being too large (compression wasn't a true fix for that). We could add a config option to turn off compression for people with obscure setups who might be affected by the default change, but default it to on.

@beneshed

Is there a workaround?
