-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compress all DNS responses #971
Conversation
limit from RFC 1035
Could we get some tests for this? |
This requires a lot more testing. In my experience before, the DNS library would cause corruption issues when compression was enabled. |
I'd be very interested to hear more about the corruption issues you've seen in order to test for such cases in our infrastructure. Can you remember under which circumstances these might happen? We're using miekg/dns as well for our DNS server implementation (on top of consul) and to my knowledge haven't had any issues so far with compression. |
Sorry about that, meant to open a local PR between branches. Tests will be coming soon. I have been running a good number of manual tests so far and have seen no hint of corruption. The example given in #854 compresses identically to the result returned by the 8.8.8.8 DNS server. Do you know of any specific examples that caused corruption in the past? It may also only be necessary to enable compression on the recursor queries, leaving it disabled for all other response messages. . |
Thanks Armon. We don't do any recursion on our service discovery DNS
|
Is there a ETA for getting this merged and release? See my comment on #854 (comment) - it's IMO really severe issue. |
@discordianfish I build consul 0.5.2 with this patch and removed my "search ec2.internal" from /etc/resolv.conf along with setting only a single nameserver of 127.0.0.1. Even with compression turned on I still get "FATA[0000] Get https://registry-1.docker.io/v1/repositories/library/busybox/tags: dial tcp: lookup registry-1.docker.io on 127.0.0.1:53: cannot unmarshal DNS message" from docker |
@sstarcher Yes, looks like the responses are still too large. This PR doesn't fix the actual issue, it just makes it less likely to hit it. |
I have not been able to craft a decent test that validates this fix as I cannot get the test DNS server to reliably return a response that is too large. We have been using it without issues but it sounds like @sstarcher still sees some. @discordianfish do you believe miekg/dns#216 will fix that issue as well? |
Oh I missed that, I tried the fix and it worked for me - at least for my test tool. Beside that, I'd consider this a bug in miekg/dns (see the issue I opened). It seems idiomatic to write the dns message received from Exchange() to the client, but since miekg/dns doesn't set the compress flag even if it was compressed, it yields a message to large. |
@discordianfish I'm sure I compiled turbine's compress-dns branch I also checked the version of my binary it was something along the lines of 0.5.2.##### I will re-attempt on monday. |
@sstarcher I just confirmed that it fixes the problems for me: master is broken but with this branch it works. |
@discordianfish I again rebuilt it using the compress-dns branch from turbine and I still get
|
@sstarcher That's weird.. Maybe you can provide a tcpdump? |
I edited my last post I should've said it was the unmarshall error again not the no such host error. Here is a tcpdump of traffic on 53
|
I don't really think this is a bug in go dns. Compressing a message is basically a "per-client-connection' decision. You should check the client's EDNS0 buffer size and the message size (the Len() function) to see if "it fits". It not set Compress to true (or just do this always). If it still does not fit set the TC bit (when transport != UDP) |
@@ -625,6 +627,7 @@ func (d *DNSServer) handleRecurse(resp dns.ResponseWriter, req *dns.Msg) { | |||
var err error | |||
for _, recursor := range d.recursors { | |||
r, rtt, err = c.Exchange(req, recursor) | |||
r.Compress = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be moved after the error check, because r
may be nil in that case and cause a nil pointer dereference.
Hello everyone. Just fixed a silly mistake nil pointer crash in 04bba18 where we were accessing the response returned by exchange before we had checked the error. If a DNS recursion failed, it would cause consul to panic and crash. If you are using this patch, please upgrade immediatly. |
@ryanuber awesome, it looks like you reviewed and posted literally right as we were discovering that issue. |
Just checking in here - I think we can get this in if some tests could be crafted to ensure that we are actually compressing the messages. I also think it would be worthwhile to make the adjustments @miekg suggests rather than just compressing everything and hoping the messages fit into the buffer. This would also ease the hesitation, since most messages should be small enough to fit without compression. |
I will look into both after I get back from vacation on 9/1. |
Hi @epipho checking in on this one - is there still interest in getting this merged? |
I'm still interested as I had to stop using consul as the direct DNS due to this and another issue. |
@slackpad indeed. Sorry I had github notifications turned off and didn't see it. I just merged 0.6.3 into my fork and am doing testing to make sure there are no regressions. |
@epipho , found this PR after running into this problem at work. Wondering if this will be merged soon. Is there anything I can do to help push this along? This is kind of an edge case for us (only happens in our dev environment, and we're not sure why,) but it would make our lives a lot easier if this were merged? |
It doesn't look like this change checks the length of the response. Is it possible that, even with the compression, it could still exceed 512 bytes? |
I'm going to close this out in favor of #1734 which was just merged. It directly enforces the UDP size limit which I think was the root problem here, and avoids any extra client burden / complexity around compression. I think we'd need to enforce the limit even if we did enable compression, so this is a good first step, even if compression is warranted later. It doesn't seem necessary, though, because of random shuffling, even getting 1 or 2 responses back instead of the usual 3 should work fine in most cases. |
I don't believe #1734 addresses all the cases. It looks like it will address a service lookup returning a response that is too large but will still fail if an upstream DNS server returns a large response. We found this happens when querying large ELBs (10+ A records sometimes) where the upstream DNS server will send a compressed UDP response (that is under 512 bytes) but Consul will decompress it before forwarding it on, making it too large for Go to accept. |
Ah interesting - I'll reopen so we can look at that case as well. |
Since Consul-generated messages should be sized by the current fix perhaps we should just compress in the forwarding case if the incoming message is also compressed. |
Seems reasonable. I was also thinking we may want to do some of the following to ensure correctness / usefulness in as many cases as possible.
|
@slackpad I've been doing some testing as I'm hitting this same docker + consul bug. I don't think the DNS library makes the compressed state of the incoming message available. |
No it doesn't because compression is hop by hop so it doesn't make sense to I don't get why consul just compresses everything. Compression has been in
|
Like miekg said, it makes sense to compress everything. I think that would be a good start and would fix the docker bug that a lot of people are seeing. @ryanuber: I don't think tests for compression are needed as that's already covered by the tests in the dns library. |
I'd be ok with compressing everything now that we do have the trim fix for the responses being too large (compression wasn't a true fix for that). We could add a config option to turn off compression for people with obscure setups who might be affected by the default change, but default it to on. |
is there a work around? |
No description provided.