Activator endpoint probing can race with pre-stop hook #9355
for full picture -- we can ignore reprobing those but:
We had a bit of a discussion in Slack today, so here are more datapoints: when a pod is deleted (i.e. scaled down) it isn't actually moved into the notReady state. With that in mind, is this even still a problem? I mean there is somewhat of a race, sure, but we shouldn't be reprobing and reconsidering pods, so I'd think the race isn't better (or worse) than the race seen in any load balancer when removing a pod, and our shutdown grace period should fix any errors here.
Yeah, and there's always a race due to informers not being instantaneous anyway...
Yeah, thinking about this, there's a race, but given that terminating pods immediately exit without hitting notReady, I don't think it's nearly so important a race (and, selfishly, I don't think it's the race causing the problem I was initially investigating :D). I'm going to keep digging a little into this code path, but I'm gonna close this for now since I don't think this would be a worthwhile change, given our current understanding.

/close
@julz: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(Writing up some Slack conversations.)
Wordy version: the activator has logic to probe the QP (queue-proxy) when it sees a pod go to the notReady state; if the probe succeeds, it treats the pod as ready regardless of its reported status. This is particularly good at startup because it means we can see the pod is ready and start forwarding traffic before etcd and the Kubernetes Rube Goldberg machine have caught up. However, during termination it is possible for the activator to see the pod go notReady before the QP has received the pre-stop signal (both Endpoints and the kubelet watch for the Terminating state independently, so the pre-stop signal is not guaranteed to reach the QP before the pod goes notReady in the activator's endpoints lister; see https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes). In this case we'll never actually consider the pod notReady (because we saw a good probe), even though it's now in a terminating state and will at some point start failing requests we forward to it.
tl;dr: activator probing of notReady pods should be resilient to the case where the QP hasn't yet received the pre-stop hook.
Potential fix: the probing logic is only really needed on startup. When a pod goes from never-seen-before to notReady we should probe to keep cold-start times as low as possible, but when transitioning ready -> notReady we can't distinguish readiness blips from terminating pods, and should err on the side of avoiding the race by waiting for the k8s API to report the pod ready.
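The proposed fix boils down to gating the probe on the pod's previous state. A minimal sketch under the same hypothetical names as above (not the actual knative/serving implementation):

```go
package main

import "fmt"

// podState is what the activator believes about a pod (hypothetical type).
type podState int

const (
	unseen podState = iota
	ready
	notReady
)

// shouldProbe sketches the proposed fix: probe the QP directly only for
// pods the activator has never seen before (the cold-start case). For a
// ready -> notReady transition, a readiness blip is indistinguishable
// from a terminating pod, so we skip the probe and wait for the k8s API
// to report the pod ready again.
func shouldProbe(prev podState) bool {
	return prev == unseen
}

func main() {
	fmt.Println(shouldProbe(unseen))   // cold start: probe for fast readiness
	fmt.Println(shouldProbe(ready))    // possible termination: don't probe
	fmt.Println(shouldProbe(notReady)) // already notReady: rely on the API
}
```

This keeps the cold-start latency win while removing the pre-stop race from the scale-down path.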
Note: I found this while looking at a bug where, intermittently, sending a lot of curls in a row produces errors while things are scaling down. I haven't verified yet that this is the problem there -- but it does seem like a problem :).