
Magicdns #79

Merged: merged 9 commits into main on Jan 12, 2025

Conversation

@aojea (Owner) commented Oct 27, 2024

No description provided.

}

// create a connection with the upstream DNS server
dnsTCPConn, err := dialer.Dial("tcp", net.JoinHostPort(d.nameServer, "53"))
Contributor:

It's great that this upstreams over TCP!

if len(questions) == 0 {
	return dnsErrorMessage(hdr.ID, dnsmessage.RCodeFormatError, dnsmessage.Question{})
}
// it is supported but not wildly implemented, at least not in golang stdlib
Contributor:

Nit: s/wildly/widely

}
switch q.Type {
case dnsmessage.TypeA:
	klog.V(7).Infof("DNS A request for %s", q.Name.String())
Contributor:

I'd be really interested to see what DNS requests come from your average pod. If we could "fix" the ndots situation, that would be great. That's got me thinking that we might want to (somehow?) warn about namespaces that collide with TLDs; I quickly scanned through B - both .blog and .build are TLDs and not-unlikely namespace-names!

Owner Author (@aojea):

This was reported recently in sig-network; I think you want something like https://coredns.io/plugins/kubernetes/#autopath . But once we have TCP pipelining working with a pool of connections, I'm curious whether that will still be relevant ... out of 5 requests, 4 will be NXDOMAINs.
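
For context, a typical pod resolv.conf looks something like this (illustrative values; the exact search list varies per cluster and node):

search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

With ndots:5, a lookup of a name like example.com is first tried against every search domain (example.com.default.svc.cluster.local., example.com.svc.cluster.local., ...), each returning NXDOMAIN, before the bare name is finally sent as-is, which is where the extra queries come from.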

switch q.Type {
case dnsmessage.TypeA:
	klog.V(7).Infof("DNS A request for %s", q.Name.String())
	addrs, err := d.lookupIP(context.Background(), "ip4", q.Name.String())
Contributor:

Idea for the future: I do think that sometimes DNS returns additional records, to avoid a round-trip. Maybe lookupIP should return a set of DNS answers.

Owner Author (@aojea):

yeah, I think what I'm doing has some limitations; I should probably work with the DNS records directly instead of performing the translation
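
A minimal sketch (not the PR's code) of what working with the records directly could look like with golang.org/x/net/dns/dnsmessage, building the A answer in place rather than translating from lookupIP's net.IPs; the names and values here are hypothetical:

package main

import (
	"fmt"

	"golang.org/x/net/dns/dnsmessage"
)

// aAnswer builds an A answer resource for a question, keeping everything in
// dnsmessage types instead of converting through net.IP.
func aAnswer(q dnsmessage.Question, ip [4]byte, ttl uint32) dnsmessage.Resource {
	return dnsmessage.Resource{
		Header: dnsmessage.ResourceHeader{
			Name:  q.Name,
			Type:  dnsmessage.TypeA,
			Class: dnsmessage.ClassINET,
			TTL:   ttl,
		},
		Body: &dnsmessage.AResource{A: ip},
	}
}

func main() {
	q := dnsmessage.Question{
		Name:  dnsmessage.MustNewName("example.default.svc.cluster.local."),
		Type:  dnsmessage.TypeA,
		Class: dnsmessage.ClassINET,
	}
	msg := dnsmessage.Message{
		Header:    dnsmessage.Header{Response: true, Authoritative: true},
		Questions: []dnsmessage.Question{q},
		Answers:   []dnsmessage.Resource{aAnswer(q, [4]byte{10, 96, 0, 1}, 30)},
	}
	packed, err := msg.Pack()
	if err != nil {
		panic(err)
	}
	fmt.Printf("packed response: %d bytes\n", len(packed))
}

Working at this level also preserves TTLs and additional records that a translation through net.IP throws away.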

PreferGo: true,
Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
	select {
	case conn, ok := <-d.connCh:
Contributor:

Nit: I do wonder if we should put the dial logic in this function, so we can recover if we drop a TCP connection? I assume we don't because we want to reuse it?

Owner Author (@aojea):

This turns out to be more involved (https://datatracker.ietf.org/doc/html/rfc7766#ref-edns-tcp-keepalive): we want to keep a pool of connections and work on keeping them open; then we can implement pipelining
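
A rough sketch of that direction, assuming a hypothetical dnsProxy type whose connCh and nameServer fields loosely mirror the ones in the diff; the redial-on-empty behavior is illustrative, not the PR's exact recovery logic:

package main

import (
	"context"
	"net"
)

// dnsProxy is a hypothetical holder for the fields visible in the diff.
type dnsProxy struct {
	nameServer string        // upstream DNS server IP
	connCh     chan net.Conn // pool of persistent upstream TCP connections
}

// getConn takes a pooled connection if one is available, otherwise dials a
// fresh one, so a dropped TCP connection does not wedge the resolver.
func (d *dnsProxy) getConn(ctx context.Context) (net.Conn, error) {
	select {
	case conn, ok := <-d.connCh:
		if ok && conn != nil {
			return conn, nil
		}
	default:
	}
	var dialer net.Dialer
	return dialer.DialContext(ctx, "tcp", net.JoinHostPort(d.nameServer, "53"))
}

// putConn returns a healthy connection to the pool, closing it if the pool
// is already full.
func (d *dnsProxy) putConn(conn net.Conn) {
	select {
	case d.connCh <- conn:
	default:
		conn.Close()
	}
}

func main() {
	d := &dnsProxy{nameServer: "10.96.0.10", connCh: make(chan net.Conn, 3)}
	_ = d // wiring this into the resolver's Dial hook is out of scope here
}

The channel capacity caps the sockets per node, and RFC 7766's edns-tcp-keepalive reference is the standard mechanism for negotiating how long idle connections stay open.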

LocalAddr: &net.UDPAddr{IP: net.ParseIP(d.nameServer), Port: 53},
Control: func(network, address string, c syscall.RawConn) error {
	return c.Control(func(fd uintptr) {
		// Mark connections so thet are not processed by the netfilter TPROXY rules
Contributor:

Nit: "... so that they are not processed ... " ?

klog.V(7).Info("starting parsing packet")
hdr, err := p.Start(b)
if err != nil {
	return dnsErrorMessage(hdr.ID, dnsmessage.RCodeFormatError, dnsmessage.Question{})
Contributor:

Nit: maybe add logging for the error cases here?
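
Something like this, for instance (the klog verbosity level is chosen arbitrarily):

hdr, err := p.Start(b)
if err != nil {
	klog.V(2).Infof("failed to parse DNS packet header: %v", err)
	return dnsErrorMessage(hdr.ID, dnsmessage.RCodeFormatError, dnsmessage.Question{})
}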

@justinsb (Contributor) commented Nov 5, 2024

:shipit:

@aojea (Owner Author) commented Dec 24, 2024

@justinsb I have this fundamental question: should we cache, or should we serve the DNS directly from each node? Imagine we embed kube-proxy and coredns; both reuse the same information from EndpointSlices and Services, you save 2 components, and kindnet becomes the only networking component needed in the cluster. My point is that I don't see much value in caching if you can serve directly from the nodes: the latter completely removes the problem, while the former just mitigates it and introduces complexity that can be hard to support. Transparent proxies are complicated if they start misbehaving.

@aojea (Owner Author) commented Jan 10, 2025

This is next; I just settled on the algorithm. The main problems of Kubernetes DNS are p99 latency and conntrack exhaustion. To solve that, kindnet will:

  • use nfqueue to divert the pods' UDP packets destined to port 53 to userspace (TCP does not have conntrack problems and will not benefit from caching)
  • use a persistent TCP connection with the upstream DNS server (dnsmasq has problems with the number of TCP connections, so we'll add latency metrics to evaluate how many connections are needed with the upstream server)
  • only process A and AAAA records to begin with; we are going to add metrics to get stats on all record types and reevaluate later (SRV, PTR and CNAME may be interesting). If it is not an A or an AAAA record, it will just pass the packet back to the kernel
  • check the cache (keeping each record for 30s, the same as GKE uses, with a fixed-size LRU of 1024 records to bound memory consumption; a rough sketch follows below) and reply if the record is present
  • if the record is not in the cache, use the upstream TCP connection to query the apiserver
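
A minimal sketch of that cache, assuming nothing beyond the parameters above; the type and method names are hypothetical, and real code would key on question name, type, and class:

package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

const (
	cacheTTL  = 30 * time.Second // per-record TTL, same as GKE uses
	cacheSize = 1024             // fixed size to bound memory consumption
)

type entry struct {
	key     string
	answer  []byte // packed DNS response
	expires time.Time
}

// dnsCache is a fixed-size LRU with a per-entry TTL.
type dnsCache struct {
	mu    sync.Mutex
	ll    *list.List               // front = most recently used
	items map[string]*list.Element // question key -> list element
}

func newDNSCache() *dnsCache {
	return &dnsCache{ll: list.New(), items: map[string]*list.Element{}}
}

func (c *dnsCache) Get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if time.Now().After(e.expires) { // expired: evict and report a miss
		c.ll.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.ll.MoveToFront(el)
	return e.answer, true
}

func (c *dnsCache) Put(key string, answer []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok { // refresh an existing entry
		e := el.Value.(*entry)
		e.answer, e.expires = answer, time.Now().Add(cacheTTL)
		c.ll.MoveToFront(el)
		return
	}
	if c.ll.Len() >= cacheSize { // full: evict the least recently used
		lru := c.ll.Back()
		c.ll.Remove(lru)
		delete(c.items, lru.Value.(*entry).key)
	}
	c.items[key] = c.ll.PushFront(&entry{key: key, answer: answer, expires: time.Now().Add(cacheTTL)})
}

func main() {
	c := newDNSCache()
	c.Put("example.default.svc.cluster.local.|A", []byte{0x01})
	if _, ok := c.Get("example.default.svc.cluster.local.|A"); ok {
		fmt.Println("cache hit")
	}
}

An LRU bounded at 1024 entries keeps the worst-case memory footprint predictable regardless of the query mix.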

@aojea force-pushed the magicdns branch 12 times, most recently from 4e5d358 to 7cc54d6 on January 12, 2025
@aojea (Owner Author) commented Jan 12, 2025

@justinsb Quick test with https://github.com/guessi/dnsperf-bench
I'm super happy with how this works now. It has good guardrails (#79 (comment)), it allows caching requests and reducing conntrack entries, and it also provides great visibility into the protocol with good metrics: number of requests, latency per request, and type of record requested.

DNS Performance Testing Tool
Version 2.14.0

[Status] Command line: dnsperf -f any -m udp -s 10.96.0.10 -p 53 -d /opt/records.txt -c 1 -T 1 -l 30 -t 5 -Q 100000
[Status] Sending queries (to 10.96.0.10:53)
[Status] Started at: Sun Jan 12 20:45:41 2025
[Status] Stopping after 30.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         387037
  Queries completed:    387037 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 387037 (100.00%)
  Average packet size:  request 43, response 67
  Run time (s):         30.006926
  Queries per second:   12898.255556

  Average Latency (s):  0.007742 (min 0.000133, max 0.013404)
  Latency StdDev (s):   0.000808

DNS Performance Testing Tool
Version 2.14.0

[Status] Command line: dnsperf -f any -m udp -s 10.96.0.10 -p 53 -d /opt/records.txt -c 1 -T 1 -l 30 -t 5 -Q 100000
[Status] Sending queries (to 10.96.0.10:53)
[Status] Started at: Sun Jan 12 20:49:54 2025
[Status] Stopping after 30.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         2203514
  Queries completed:    2203514 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 2203514 (100.00%)
  Average packet size:  request 43, response 160
  Run time (s):         30.007927
  Queries per second:   73431.063732

  Average Latency (s):  0.001351 (min 0.000025, max 0.042577)
  Latency StdDev (s):   0.001283


With cache enabled, the min latency is faster without the cache, it seems I need to

Independently, each DNS request now does NOT generate any UDP conntrack entry; a TCP connection is created and reused from a pool, with a maximum of 3 TCP connections per node.
Before, a name resolution generated one request per family (A and AAAA), plus one request per DNS search domain in resolv.conf; now it generates at most one request, and the rest are pipelined through the TCP connection. cc: @thockin
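
For reference, what makes that pipelining possible is DNS-over-TCP framing (RFC 1035 §4.2.2, RFC 7766): each message carries a 2-byte length prefix, so queries can be written back to back on one connection and responses matched up by message ID. A minimal sketch, with helper names that are mine rather than the PR's:

package main

import (
	"encoding/binary"
	"io"
	"net"
)

// writeTCPQuery frames a packed DNS query with the 2-byte big-endian length
// prefix required on TCP, allowing several queries to be written back to back.
func writeTCPQuery(conn net.Conn, packed []byte) error {
	buf := make([]byte, 2+len(packed))
	binary.BigEndian.PutUint16(buf, uint16(len(packed)))
	copy(buf[2:], packed)
	_, err := conn.Write(buf)
	return err
}

// readTCPResponse reads one length-prefixed DNS message from the connection.
func readTCPResponse(conn net.Conn) ([]byte, error) {
	var hdr [2]byte
	if _, err := io.ReadFull(conn, hdr[:]); err != nil {
		return nil, err
	}
	msg := make([]byte, binary.BigEndian.Uint16(hdr[:]))
	_, err := io.ReadFull(conn, msg)
	return msg, err
}

func main() {
	// Wiring these helpers into the pooled connection is out of scope here;
	// see the pool sketch earlier in the thread.
	_ = writeTCPQuery
	_ = readTCPResponse
}

Matching responses by message ID also tolerates an upstream that answers pipelined queries out of order, which RFC 7766 explicitly permits.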

TODO:

[image]

@aojea merged commit 3c8c2cd into main on Jan 12, 2025
6 checks passed