-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ready plugin continues to answer OK during lameduck period on existing connections #4099
Comments
During It looks like the ready plugin is interrogating the status of other plugins on L46 but apparently no one has said "no". Perhaps the health plugin should implement the readiness interface and report false during lameduck? https://github.com/coredns/coredns/blob/master/plugin/health/health.go#L59 |
Yes, I was about to suggest the same thing. |
[ Quoting <[email protected]> in "Re: [coredns/coredns] Ready plugin ..." ]
Perhaps the health plugin should implement the readiness interface and
report false during lameduck?
Yes, I was about to suggest the same thing.
From https://coredns.io/plugins/ready/
Once a plugin has signaled it is ready it will not be queried again.
this is done to prevent coredns going down because some random plugin is failing, while
caching can happily survive.
…--
Miek Gieben
|
Ah right, I forgot about that counterintuitive behavior. How about if we allow the ready plugin to flip back to unready, but also add an "ignored" option to the ready plugin (a list of plugins that will not be polled for readiness). This way, if there is a plugin we don't want to control the ready state, it can be excluded. |
[ Quoting <[email protected]> in "Re: [coredns/coredns] Ready plugin ..." ]
Once a plugin has signaled it is ready it will not be queried again.
Ah right, I forgot about that counterintuitive behavior. How about if we allow
the ready plugin to flip back to unready, but also add an "ignored" option to
the ready plugin (a list of plugins that will not be polled for readiness).
This way, if there is a plugin we don't want to control the ready state, it can
be excluded.
flipping global health like that was never a good idea, and it's also not a good idea for
readiness. Cache can help you survive broken stuff for up to 30s. Upstream resolving may
still fully work, nacking readiness because some plugin breaks can't be used to kill
coredns, because that's usually the end of your cluster
|
If we don't ever want to switch readiness to false can the ready plugin, during |
SGTM |
[ Quoting <[email protected]> in "Re: [coredns/coredns] Ready plugin ..." ]
If we don't ever want to switch readiness to false can the ready plugin, during
onFinalShutdown, unregister the ready handler so that it doesn't return a 200?
that makes sense, although for readiness to be picked up by k8s it needs to be nacking 3
times in a row, so this may be not enough? OTOH: it's def. the correct thing to do.
/Miek
…--
Miek Gieben
|
In context of k8s, it depends on the readiness probe settings of the Deployment; 3 is the default but can be changed. |
[ Quoting <[email protected]> in "Re: [coredns/coredns] Ready plugin ..." ]
although for readiness to be picked up by k8s it needs to be nacking 3
times in a row, so this may be not enough?
In context of k8s, it depends on the readiness probe settings of the
Deployment; 3 is the default but can be changed.
ack. let's make the last proposed change, then you can play with #readiness if you so like
as an admin
|
What happened:
The CoreDNS server continued to report it was "ready" during the lameduck period and only stopped after the server fully shutdown.
After some investigation (and lots of frustration) I found that the loadbalancer (Azure's) was maintaining an existing TCP connection to CoreDNS and using HTTP pipelining to issue new requests. CoreDNS closes the listening port right away on entering lameduck so testing with curl looked like it was denying ready requests.
What you expected to happen:
The server to stop reporting it was ready, not just stop listening for new connections.
How to reproduce it (as minimally and precisely as possible):
netcat localhost 8181
to open a connection, leave this openpkill -SIGTERM coredns
Environment:
I stopped coredns about half way through here -- did it twice to ensure I could issue repeat queries. Then stopped CoreDNS and did it again.
What finally clued me in was when i stopped asking for listening ports, and show open connections:
168.63.129.16
is the Azure LB probe IP.The text was updated successfully, but these errors were encountered: