Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consul not returning NXDOMAIN for DNS queries to a non-existent datacenter #8218

Merged
merged 3 commits into from
Jul 1, 2020

Conversation

yurkeen
Copy link
Contributor

@yurkeen yurkeen commented Jul 1, 2020

Overview

Recently we found an issue when a DNS query sent to Consul for a non-existent domain makes client agent to keep querying servers indefinitely for that domain, when DNS cache is enabled, i.e. dns_config.use_cache is set to true. We were able to reproduce the issue locally with the most recent versions (1.7.4, 1.8.0):

2020-07-01T02:00:49.159+0100 [WARN]  agent.server.rpc: RPC request for DC is currently failing as no path was found: datacenter=locxx method=Health.ServiceNodes
2020-07-01T02:00:49.159+0100 [ERROR] agent.client: RPC failed to server: method=Health.ServiceNodes server=127.0.0.103:8310 error="rpc error making call: No path to datacenter"
...

In addition to be an annoying bug, this is also a DDoS vector allowing to quickly exhaust memory on all Consul servers in the cluster by flooding a single Consul agent with random queries in non-existent domains. Once DNS cache is disabled, the agent sends only a single query to the servers available.
All of this is happening due to the bug where SERVFAIL is returned when NXDOMAIN should be. The bug was supposed to be fixed in #8103, but unfortunately it's not for the case where a DNS query goes through a non-server agent first:

$ consul version
Consul 1.8.0-dev
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
$ dig -p 8600  +noall +answer +comment srv consul.service.locXX.consul-testing @127.0.0.201
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 34400
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096

Configuration used to reproduce the issue: https://gist.github.com/yurkeen/7bc4bc12c7551b8c979cbf01a3cb5bea

Reasons

There are two consequent reasons why this issue is happening:
I. The test provided in #8103, TestDNS_NonExistingDC checks the DNS response received from the agent, when the agent is a Consul server itself. However, this test does not cover the case when DNS resolution happens over RPC through client agent reaching a server agent.
II. The error being provided for the computeRCode() method in dns.go is compared to the "No path to datacenter" string. In case when resolution happens through a Consul agent, the error looks slightly different: "rpc error making call: No path to datacenter", i.e. RPC layer adds more context to this error causing the condition to fail and slip through to returning the default dns.RcodeServerFailure.

Solution

This PR is split into three commits:

  • 3b4ddaa adds a test to make sure NXDOMAIN is returned when Consul server agent is queried over RPC by a non-server agent. If ran without applying the next commit 8d18422, the test will fail.
  • 8d18422 adds a fix to compare errors properly, so NXDOMAIN is returned for both cases — direct on the server agent and over RPC;
  • 10361dd is a bonus track adding small refactoring of the ErrQueryNotFound error and adding the IsErrQueryNotFound(err) helper for it to align it with the rest of the errors.

@yurkeen yurkeen changed the title Consul not returning NXDOMAIN For DNS queries in a non-existent domain Consul not returning NXDOMAIN For DNS queries in a non-existent datacetner Jul 1, 2020
@yurkeen yurkeen changed the title Consul not returning NXDOMAIN For DNS queries in a non-existent datacetner Consul not returning NXDOMAIN For DNS queries in a non-existent datacenter Jul 1, 2020
@yurkeen yurkeen changed the title Consul not returning NXDOMAIN For DNS queries in a non-existent datacenter Consul not returning NXDOMAIN for DNS queries to a non-existent datacenter Jul 1, 2020
Copy link
Contributor

@pierresouchay pierresouchay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice finding, yes, did not expected RPC not going to a server

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, this was a good find and an excellent PR.

@mkeeler mkeeler merged commit e0f9e4a into hashicorp:master Jul 1, 2020
@hashicorp-ci
Copy link
Contributor

🍒✅ Cherry pick of commit e0f9e4a onto release/1.8.x succeeded!

@hashicorp-ci
Copy link
Contributor

🍒❌ Cherry pick of commit e0f9e4a onto release/1.7.x failed! Build Log

@yurkeen
Copy link
Contributor Author

yurkeen commented Jul 1, 2020

Glad to help, folks! 🎩

@xjewer
Copy link

xjewer commented Jul 6, 2020

btw, using errors chains by wrapping it with github.com/pkg/errors https://github.com/pkg/errors/blob/master/errors.go#L184, particularly RPC
could simplify error matching with targets https://pkg.go.dev/github.com/pkg/errors?tab=doc#example-Wrap

@yurkeen
Copy link
Contributor Author

yurkeen commented Jul 6, 2020

btw, using errors chains by wrapping it with github.com/pkg/errors https://github.com/pkg/errors/blob/master/errors.go#L184, particularly RPC

could simplify error matching with targets https://pkg.go.dev/github.com/pkg/errors?tab=doc#example-Wrap

This would involve refactoring of error handling across the whole project (Consul is from the pre pkg/errors age). That was not the aim of this PR, however. Surely can be done as a follow up.

@xjewer
Copy link

xjewer commented Jul 6, 2020

This would involve refactoring of error handling across the whole project (Consul is from the pre pkg/errors age). That was not the aim of this PR, however. Surely can be done as a follow up.

sure, it's not related to the PR. I was bringing attention of maintainers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants