Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ready plugin continues to answer OK during lameduck period on existing connections #4099

Closed
UnwashedMeme opened this issue Aug 31, 2020 · 10 comments

Comments

@UnwashedMeme
Copy link

What happened:

The CoreDNS server continued to report it was "ready" during the lameduck period and only stopped after the server fully shutdown.

After some investigation (and lots of frustration) I found that the loadbalancer (Azure's) was maintaining an existing TCP connection to CoreDNS and using HTTP pipelining to issue new requests. CoreDNS closes the listening port right away on entering lameduck so testing with curl looked like it was denying ready requests.

What you expected to happen:

The server to stop reporting it was ready, not just stop listening for new connections.

How to reproduce it (as minimally and precisely as possible):

  1. Setup CoreDNS with ready plugin, health plugin, lamduck configured (for suitable long period)
  2. netcat localhost 8181 to open a connection, leave this open
  3. pkill -SIGTERM coredns
  4. In the netcat session:
GET /ready HTTP/1.1
Host: localhost:8181


Environment:

  • the version of CoreDNS: 1.7.0
  • Corefile:
. {
    errors
    ready                  # what the LB will probe
    health {
        lameduck 60s
    }
    forward . 168.63.129.16
}
  • logs, if applicable:
    I stopped coredns about half way through here -- did it twice to ensure I could issue repeat queries. Then stopped CoreDNS and did it again.
root@coredns100000C:~# netcat localhost 8181
GET /ready HTTP/1.1
Host: localhost:8181

HTTP/1.1 200 OK
Date: Mon, 31 Aug 2020 18:08:05 GMT
Content-Length: 2
Content-Type: text/plain; charset=utf-8

OKGET /ready HTTP/1.1
Host: localhost:8181

HTTP/1.1 200 OK
Date: Mon, 31 Aug 2020 18:08:19 GMT
Content-Length: 2
Content-Type: text/plain; charset=utf-8

OKGET /ready HTTP/1.1
Host: localhost:8181

HTTP/1.1 200 OK
Date: Mon, 31 Aug 2020 18:09:28 GMT
Content-Length: 2
Content-Type: text/plain; charset=utf-8

OK

What finally clued me in was when i stopped asking for listening ports, and show open connections:

root@coredns100000C:~# ss -tnp
State   Recv-Q    Send-Q              Local Address:Port                 Peer Address:Port
Process
ESTAB   0         0            [::ffff:172.20.0.18]:8181       [::ffff:168.63.129.16]:57406
 users:(("coredns",pid=7928,fd=13))

168.63.129.16 is the Azure LB probe IP.

  • OS: Ubuntu Focal Fossa 20.04
  • Azure LB config
    image
@UnwashedMeme
Copy link
Author

During onFinalShutdown the listening port is closed on L71; but the handler function doesn't appear to be unregistered. I'm not fluent in Go, but this looks fairly straightforward.

It looks like the ready plugin is interrogating the status of other plugins on L46 but apparently no one has said "no". Perhaps the health plugin should implement the readiness interface and report false during lameduck? https://github.com/coredns/coredns/blob/master/plugin/health/health.go#L59

@chrisohaver
Copy link
Member

Perhaps the health plugin should implement the readiness interface and report false during lameduck?

Yes, I was about to suggest the same thing.

@miekg
Copy link
Member

miekg commented Sep 1, 2020 via email

@chrisohaver
Copy link
Member

Once a plugin has signaled it is ready it will not be queried again.

Ah right, I forgot about that counterintuitive behavior. How about if we allow the ready plugin to flip back to unready, but also add an "ignored" option to the ready plugin (a list of plugins that will not be polled for readiness). This way, if there is a plugin we don't want to control the ready state, it can be excluded.

@miekg
Copy link
Member

miekg commented Sep 1, 2020 via email

@UnwashedMeme
Copy link
Author

If we don't ever want to switch readiness to false can the ready plugin, during onFinalShutdown, unregister the ready handler so that it doesn't return a 200?

@chrisohaver
Copy link
Member

If we don't ever want to switch readiness to false can the ready plugin, during onFinalShutdown, unregister the ready handler so that it doesn't return a 200?

SGTM

@miekg
Copy link
Member

miekg commented Sep 2, 2020 via email

@chrisohaver
Copy link
Member

although for readiness to be picked up by k8s it needs to be nacking 3 times in a row, so this may be not enough?

In context of k8s, it depends on the readiness probe settings of the Deployment; 3 is the default but can be changed.

@miekg
Copy link
Member

miekg commented Sep 2, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants