-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(ingress) Use /readyz for readinessProbe #716
Conversation
To prevent pods from becoming ready and serving 404s prior to the `ingress-controller` first syncing config to the `proxy`. This can be demonstrated by running a load test whilst deploying a change to the chart values that causes pods to be rolled: % (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) & [1] 99452 * % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) && fg Release "test" has been upgraded. Happy Helming! … [1] + 99452 running ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration ) % vegeta report vegeta.out Requests [total, rate, throughput] 1500, 50.03, 48.33 Duration [total, attack, wait] 29.983s, 29.98s, 3.297ms Latencies [min, mean, 50, 90, 95, 99, max] 1.765ms, 3.823ms, 3.505ms, 5.353ms, 6ms, 8.102ms, 36.237ms Bytes In [total, mean] 2448, 1.63 Bytes Out [total, mean] 0, 0.00 Success [ratio] 96.60% Status Codes [code:count] 201:1449 404:51 Error Set: 404 Not Found When tested with this change supplied to the latest chart: % (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) & [1] 4518 * % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) --set ingressController.readinessProbe.httpGet.path=/readyz && fg Release "test" has been upgraded. Happy Helming! … [1] + 4518 running ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration ) % vegeta report vegeta.out Requests [total, rate, throughput] 1500, 50.03, 50.03 Duration [total, attack, wait] 29.983s, 29.98s, 3.651ms Latencies [min, mean, 50, 90, 95, 99, max] 1.569ms, 4.745ms, 3.502ms, 5.851ms, 21.034ms, 22.15ms, 41.41ms Bytes In [total, mean] 0, 0.00 Bytes Out [total, mean] 0, 0.00 Success [ratio] 100.00% Status Codes [code:count] 201:1500 Error Set: The functionality of this endpoint was extended in KIC 2.0.0: - Kong/kubernetes-ingress-controller@b0ff612 I thought that the problem might have been introduced by the startup sleep in KIC 2.2.1 but the first version that I can replicate it with is KIC 2.3.x: - Kong/kubernetes-ingress-controller@30ff318 This problem was caught by our own CI tests. A similar approach could be used in `scripts/test-upgrade.sh` to prevent future regressions, but I'd like some feedback on the idea before implementing it in a separate PR.
@pmalek @rainest Thanks for the quick review and merge!
Do you have any thoughts on this? I'm happy to raise a PR but would like to agree on the implementation before working on it. |
Do you have anything specific in mind? Like running an http load test and expecting 0 404s? I'm wondering if this wouldn't end up flaky but we could try this out to verify our current setup and gather conclusions about it and re-iterate on it. |
We do have an internal initiative (there's no public tracking for it yet, but for any Kong staff that want to review it, the ticket is FTI-3081) that should address this by making the gateway able to report its own config readiness. This is tentatively planned for gateway 3.3, but I can't confirm if it will definitely make it. The gateway reporting its own status will make it easier to ensure it doesn't return 404s due to lack of configuration. What we've encountered in the past, and that you may be seeing, is that there are cases where the controller reports that it's sent configuration and marks itself ready even though Kong isn't ready to serve requests yet. While this is usually brief, Controller-side readiness reporting can't avoid these because the controller only knows whether the update request has completed, not the internal gateway-side updates that need to complete after that request. The gateway does not route requests using the raw configuration sent by the controller. It builds a separate set of derived configuration (the "router", "balancer", and "plugins iterator") that are optimized for handling requests quickly--using the raw JSON config would result in much longer request processing times. Building this derived configuration does take some time, and while it doesn't take long, there is a brief window after the config update occurs where Kong can still serve 404s. Moving readiness reporting into the gateway will let it report that it has completed building those internal structures and avoid this. Absent this feature, the workaround we have is to create a proxy route and use that as Kong's readiness endpoint instead of the default |
What this PR does / why we need it:
To prevent pods from becoming ready and serving 404s prior to the
ingress-controller
first syncing config to theproxy
.This can be demonstrated by running a load test whilst deploying a change to the chart values that causes pods to be rolled:
When tested with this change supplied to the latest chart:
Special notes for your reviewer:
The functionality of this endpoint was extended in KIC 2.0.0:
I thought that the problem might have been introduced by the startup sleep in KIC 2.2.1 but the first version that I can replicate it with is KIC 2.3.x:
This problem was caught by our own CI tests. A similar approach could be used in
scripts/test-upgrade.sh
to prevent future regressions, but I'd like some feedback on the idea before implementing it in a separate PR.Checklist
main
branch.