Ready plugin continues to answer OK during lameduck period on existing connections #4099

UnwashedMeme · 2020-08-31T19:06:36Z

What happened:

The CoreDNS server continued to report it was "ready" during the lameduck period and only stopped once the server had shut down completely.

After some investigation (and lots of frustration) I found that the load balancer (Azure's) was keeping an existing TCP connection to CoreDNS open and reusing it (HTTP keep-alive) to issue new probe requests. CoreDNS closes the listening port right away on entering lameduck, so testing with curl, which opens a new connection for each request, made it look like ready requests were being refused.

What you expected to happen:

The server to stop reporting it was ready, not just stop listening for new connections.

How to reproduce it (as minimally and precisely as possible):

Set up CoreDNS with the ready plugin and the health plugin, with lameduck configured (for a suitably long period)
Run netcat localhost 8181 to open a connection, and leave it open
pkill -SIGTERM coredns
In the netcat session:

GET /ready HTTP/1.1
Host: localhost:8181
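
The steps above can also be scripted. The following is a minimal Go sketch, not taken from the report: it uses a stand-in HTTP server rather than CoreDNS, and the names `probe` and `demo` are made up for illustration. It opens one TCP connection, probes `/ready`, closes only the listener (which is what the report says happens on entering lameduck), and probes again on the same connection.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"net"
	"net/http"
)

// probe sends one GET /ready over an already-open connection and returns
// the HTTP status code. It never dials, so it only exercises the existing
// connection, like the netcat session in the steps above.
func probe(conn net.Conn, br *bufio.Reader, host string) (int, error) {
	req, err := http.NewRequest(http.MethodGet, "http://"+host+"/ready", nil)
	if err != nil {
		return 0, err
	}
	if err := req.Write(conn); err != nil {
		return 0, err
	}
	resp, err := http.ReadResponse(br, req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	// Drain the body so the next response starts on a clean boundary.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	return resp.StatusCode, nil
}

// demo starts a stand-in server (an assumption, NOT CoreDNS), probes it,
// closes only the listener, and probes again on the same connection.
func demo() (before, after int, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/ready", func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, "OK")
	})
	go http.Serve(ln, mux) // returns once ln is closed

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, 0, err
	}
	defer conn.Close()
	br := bufio.NewReader(conn)
	host := ln.Addr().String()

	if before, err = probe(conn, br, host); err != nil {
		return 0, 0, err
	}
	// Stop accepting new connections; the handler stays registered and
	// the established connection is still served.
	ln.Close()
	after, err = probe(conn, br, host)
	return before, after, err
}

func main() {
	before, after, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("before listener close: %d, after: %d\n", before, after)
}
```

With Go's net/http, closing the listener only ends the accept loop, so a client holding a connection keeps getting answers from the handler. To test a real CoreDNS instance the same way, `probe` could be pointed at a connection dialed to port 8181 before sending SIGTERM.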

Environment:

the version of CoreDNS: 1.7.0
Corefile:

. {
    errors
    ready                  # what the LB will probe
    health {
        lameduck 60s
    }
    forward . 168.63.129.16
}

logs, if applicable:
I stopped CoreDNS about halfway through this session: I issued the request twice first to make sure I could repeat queries on the same connection, then stopped CoreDNS and issued it again.

root@coredns100000C:~# netcat localhost 8181
GET /ready HTTP/1.1
Host: localhost:8181

HTTP/1.1 200 OK
Date: Mon, 31 Aug 2020 18:08:05 GMT
Content-Length: 2
Content-Type: text/plain; charset=utf-8

OKGET /ready HTTP/1.1
Host: localhost:8181

HTTP/1.1 200 OK
Date: Mon, 31 Aug 2020 18:08:19 GMT
Content-Length: 2
Content-Type: text/plain; charset=utf-8

OKGET /ready HTTP/1.1
Host: localhost:8181

HTTP/1.1 200 OK
Date: Mon, 31 Aug 2020 18:09:28 GMT
Content-Length: 2
Content-Type: text/plain; charset=utf-8

OK

What finally clued me in was when I stopped looking at listening ports and looked at open connections instead:

root@coredns100000C:~# ss -tnp
State   Recv-Q   Send-Q   Local Address:Port            Peer Address:Port              Process
ESTAB   0        0        [::ffff:172.20.0.18]:8181     [::ffff:168.63.129.16]:57406   users:(("coredns",pid=7928,fd=13))

168.63.129.16 is the Azure LB probe IP.

OS: Ubuntu Focal Fossa 20.04
Azure LB config

UnwashedMeme · 2020-08-31T19:16:47Z

During onFinalShutdown the listening port is closed on L71; but the handler function doesn't appear to be unregistered. I'm not fluent in Go, but this looks fairly straightforward.

It looks like the ready plugin is interrogating the status of other plugins on L46 but apparently no one has said "no". Perhaps the health plugin should implement the readiness interface and report false during lameduck? https://github.com/coredns/coredns/blob/master/plugin/health/health.go#L59

chrisohaver · 2020-08-31T19:20:02Z

Perhaps the health plugin should implement the readiness interface and report false during lameduck?

Yes, I was about to suggest the same thing.

miekg · 2020-09-01T07:15:20Z

Perhaps the health plugin should implement the readiness interface and report false during lameduck? Yes, I was about to suggest the same thing.

From https://coredns.io/plugins/ready/: "Once a plugin has signaled it is ready it will not be queried again." This is done to prevent CoreDNS from going down because some random plugin is failing, while caching can happily survive.

…

-- Miek Gieben
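
The "will not be queried again" behavior quoted above can be illustrated with a short Go sketch. This is an illustration under assumptions, not the ready plugin's code: the `Readiness` interface, `stickyList`, and `toggle` names are hypothetical. Each plugin is polled until it first reports ready, after which it is dropped from the list and can no longer make the aggregate unready.

```go
package main

import "fmt"

// Readiness is a hypothetical one-method interface standing in for whatever
// a plugin implements to report readiness.
type Readiness interface {
	Ready() bool
}

// stickyList polls each plugin until it first reports ready, then drops it.
// Not safe for concurrent use; a real implementation would need locking.
type stickyList struct {
	names   []string
	plugins []Readiness // a nil entry has already reported ready
}

func (l *stickyList) add(name string, p Readiness) {
	l.names = append(l.names, name)
	l.plugins = append(l.plugins, p)
}

// ready reports overall readiness plus the names of plugins still not ready.
func (l *stickyList) ready() (bool, []string) {
	ok := true
	var notReady []string
	for i, p := range l.plugins {
		if p == nil {
			continue // reported ready earlier; never asked again
		}
		if p.Ready() {
			l.plugins[i] = nil
			continue
		}
		ok = false
		notReady = append(notReady, l.names[i])
	}
	return ok, notReady
}

// toggle is a test plugin whose readiness can be flipped by hand.
type toggle struct{ ok bool }

func (t *toggle) Ready() bool { return t.ok }

// demo polls three times: before the plugin is ready, once it is ready,
// and after it has gone unready again.
func demo() (first, second, third bool) {
	t := &toggle{}
	l := &stickyList{}
	l.add("example", t)

	first, _ = l.ready()
	t.ok = true
	second, _ = l.ready()
	t.ok = false
	third, _ = l.ready()
	return first, second, third
}

func main() {
	fmt.Println(demo())
}
```

In this sketch the third poll still reports ready even though the plugin has flipped back, which is why a plugin reporting "not ready" during lameduck would have no effect under this design.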

chrisohaver · 2020-09-01T13:41:57Z

Once a plugin has signaled it is ready it will not be queried again.

Ah right, I forgot about that counterintuitive behavior. How about if we allow the ready plugin to flip back to unready, but also add an "ignored" option to the ready plugin (a list of plugins that will not be polled for readiness). This way, if there is a plugin we don't want to control the ready state, it can be excluded.

miekg · 2020-09-01T14:27:36Z

Once a plugin has signaled it is ready it will not be queried again. Ah right, I forgot about that counterintuitive behavior. How about if we allow the ready plugin to flip back to unready, but also add an "ignored" option to the ready plugin (a list of plugins that will not be polled for readiness). This way, if there is a plugin we don't want to control the ready state, it can be excluded.

Flipping global health like that was never a good idea, and it's also not a good idea for readiness. Cache can help you survive broken stuff for up to 30s, and upstream resolving may still fully work. Nacking readiness because some plugin breaks can't be used to kill CoreDNS, because that's usually the end of your cluster.

UnwashedMeme · 2020-09-01T14:29:25Z

If we don't ever want to switch readiness to false can the ready plugin, during onFinalShutdown, unregister the ready handler so that it doesn't return a 200?

chrisohaver · 2020-09-01T14:31:42Z

If we don't ever want to switch readiness to false can the ready plugin, during onFinalShutdown, unregister the ready handler so that it doesn't return a 200?

SGTM

miekg · 2020-09-02T05:43:29Z

If we don't ever want to switch readiness to false can the ready plugin, during onFinalShutdown, unregister the ready handler so that it doesn't return a 200?

that makes sense, although for readiness to be picked up by k8s it needs to be nacking 3 times in a row, so this may be not enough? OTOH: it's def. the correct thing to do. /Miek

…

-- Miek Gieben

chrisohaver · 2020-09-02T12:52:01Z

although for readiness to be picked up by k8s it needs to be nacking 3 times in a row, so this may be not enough?

In context of k8s, it depends on the readiness probe settings of the Deployment; 3 is the default but can be changed.
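
As a rough illustration of how the probe settings interact with the lameduck period, here is a small Go sketch. The function name `detectionWindow` is hypothetical, the 10-second period is the Kubernetes default for `periodSeconds` (not stated in this thread), and the calculation ignores probe timeouts and scheduling jitter.

```go
package main

import "fmt"

// detectionWindow returns rough lower and upper bounds, in seconds, on how
// long an endpoint must keep failing before failureThreshold consecutive
// probes have failed. It ignores probe timeouts and jitter.
func detectionWindow(failureThreshold, periodSeconds int) (lo, hi int) {
	// Best case: the first failing probe fires immediately.
	lo = (failureThreshold - 1) * periodSeconds
	// Worst case: a probe succeeded just before the endpoint started failing.
	hi = failureThreshold * periodSeconds
	return lo, hi
}

func main() {
	// failureThreshold 3 is the default mentioned above; periodSeconds 10 is
	// assumed here as the Kubernetes default.
	lo, hi := detectionWindow(3, 10)
	fmt.Printf("unready after roughly %d to %d seconds\n", lo, hi)
}
```

Under these assumptions the window is about 20 to 30 seconds, which fits inside the 60s lameduck from the Corefile above; a shorter lameduck or a longer probe period would need checking the same way.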

miekg · 2020-09-02T12:58:17Z

although for readiness to be picked up by k8s it needs to be nacking 3 times in a row, so this may be not enough? In context of k8s, it depends on the readiness probe settings of the Deployment; 3 is the default but can be changed.

ack. let's make the last proposed change, then you can play with #readiness if you so like as an admin
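
One way to picture the agreed direction (stop returning 200 once shutdown has begun) is the Go sketch below. It is an assumption-laden illustration, not the change merged for this issue: `readyHandler`, `beginShutdown`, and the "Shutting down" body are hypothetical, and instead of unregistering the handler it keeps it registered and flips a flag, which has the same effect for clients on existing connections.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readyHandler is a hypothetical stand-in for a readiness endpoint.
type readyHandler struct {
	shuttingDown atomic.Bool // requires Go 1.19 or later
}

func (h *readyHandler) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if h.shuttingDown.Load() {
		// Answer on existing connections too, but no longer with 200.
		w.WriteHeader(http.StatusServiceUnavailable)
		io.WriteString(w, "Shutting down")
		return
	}
	io.WriteString(w, "OK")
}

// beginShutdown would be called from the shutdown hook, alongside closing
// the listener.
func (h *readyHandler) beginShutdown() { h.shuttingDown.Store(true) }

// status runs one in-memory request against h and returns the status code.
func status(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	return rec.Code
}

// demo returns the status code before and after shutdown begins.
func demo() (before, after int) {
	h := &readyHandler{}
	before = status(h)
	h.beginShutdown()
	after = status(h)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Printf("before shutdown: %d, after: %d\n", before, after)
}
```

A flag is used here because Go's `http.ServeMux` has no call for removing a registered pattern; swapping the handler's behavior is the simpler route in a sketch.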

UnwashedMeme added the bug label Aug 31, 2020

miekg added enhancement plugin/ready and removed bug labels Sep 17, 2020

chrisohaver mentioned this issue Oct 1, 2020

plugin/ready: Don't return 200 OK during shutdown #4167

Merged

chrisohaver closed this as completed Oct 1, 2020

Miciah mentioned this issue Oct 9, 2020

Bug 1884053: Configure CoreDNS to shut down gracefully openshift/cluster-dns-operator#205

Merged

TBBle mentioned this issue Jan 7, 2021

[EKS] [request]: API flag to initialize completely bare EKS cluster aws/containers-roadmap#923

Closed
