-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consul not returning NXDOMAIN for DNS queries to a non-existent datacenter #8218
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice finding, yes, did not expected RPC not going to a server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, this was a good find and an excellent PR.
🍒✅ Cherry pick of commit e0f9e4a onto |
Glad to help, folks! 🎩 |
btw, using errors chains by wrapping it with |
This would involve refactoring of error handling across the whole project (Consul is from the pre pkg/errors age). That was not the aim of this PR, however. Surely can be done as a follow up. |
sure, it's not related to the PR. I was bringing attention of maintainers. |
Overview
Recently we found an issue when a DNS query sent to Consul for a non-existent domain makes client agent to keep querying servers indefinitely for that domain, when DNS cache is enabled, i.e.
dns_config.use_cache
is set totrue
. We were able to reproduce the issue locally with the most recent versions (1.7.4, 1.8.0):In addition to be an annoying bug, this is also a DDoS vector allowing to quickly exhaust memory on all Consul servers in the cluster by flooding a single Consul agent with random queries in non-existent domains. Once DNS cache is disabled, the agent sends only a single query to the servers available.
All of this is happening due to the bug where
SERVFAIL
is returned whenNXDOMAIN
should be. The bug was supposed to be fixed in #8103, but unfortunately it's not for the case where a DNS query goes through a non-server agent first:Configuration used to reproduce the issue: https://gist.github.com/yurkeen/7bc4bc12c7551b8c979cbf01a3cb5bea
Reasons
There are two consequent reasons why this issue is happening:
I. The test provided in #8103,
TestDNS_NonExistingDC
checks the DNS response received from the agent, when the agent is a Consul server itself. However, this test does not cover the case when DNS resolution happens over RPC through client agent reaching a server agent.II. The error being provided for the
computeRCode()
method in dns.go is compared to the"No path to datacenter"
string. In case when resolution happens through a Consul agent, the error looks slightly different:"rpc error making call: No path to datacenter"
, i.e. RPC layer adds more context to this error causing the condition to fail and slip through to returning the defaultdns.RcodeServerFailure
.Solution
This PR is split into three commits:
3b4ddaa
adds a test to make sure NXDOMAIN is returned when Consul server agent is queried over RPC by a non-server agent. If ran without applying the next commit8d18422
, the test will fail.8d18422
adds a fix to compare errors properly, so NXDOMAIN is returned for both cases — direct on the server agent and over RPC;10361dd
is a bonus track adding small refactoring of theErrQueryNotFound
error and adding theIsErrQueryNotFound(err)
helper for it to align it with the rest of the errors.