fix(ingress) Use /readyz for readinessProbe #716

dcarley · 2023-01-24T13:39:47Z

What this PR does / why we need it:

To prevent pods from becoming ready and serving 404s prior to the ingress-controller first syncing config to the proxy.

This can be demonstrated by running a load test whilst deploying a change to the chart values that causes pods to be rolled:

% (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) &
[1] 99452

* % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) && fg
Release "test" has been upgraded. Happy Helming!
…
[1]  + 99452 running    ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration  )

% vegeta report vegeta.out
Requests      [total, rate, throughput]         1500, 50.03, 48.33
Duration      [total, attack, wait]             29.983s, 29.98s, 3.297ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.765ms, 3.823ms, 3.505ms, 5.353ms, 6ms, 8.102ms, 36.237ms
Bytes In      [total, mean]                     2448, 1.63
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           96.60%
Status Codes  [code:count]                      201:1449  404:51
Error Set:
404 Not Found

When tested with this change supplied to the latest chart:

% (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) &
[1] 4518

* % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) --set ingressController.readinessProbe.httpGet.path=/readyz && fg
Release "test" has been upgraded. Happy Helming!
…
[1]  + 4518 running    ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration  )

% vegeta report vegeta.out
Requests      [total, rate, throughput]         1500, 50.03, 50.03
Duration      [total, attack, wait]             29.983s, 29.98s, 3.651ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.569ms, 4.745ms, 3.502ms, 5.851ms, 21.034ms, 22.15ms, 41.41ms
Bytes In      [total, mean]                     0, 0.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      201:1500
Error Set:

Special notes for your reviewer:

The functionality of this endpoint was extended in KIC 2.0.0:

Kong/kubernetes-ingress-controller@b0ff612

I thought that the problem might have been introduced by the startup sleep in KIC 2.2.1 but the first version that I can replicate it with is KIC 2.3.x:

Kong/kubernetes-ingress-controller@30ff318

This problem was caught by our own CI tests. A similar approach could be used in scripts/test-upgrade.sh to prevent future regressions, but I'd like some feedback on the idea before implementing it in a separate PR.

Checklist

PR is based off the current tip of the main branch.
Changes are documented under the "Unreleased" header in CHANGELOG.md
Commits follow the Kong commit message guidelines

To prevent pods from becoming ready and serving 404s prior to the `ingress-controller` first syncing config to the `proxy`. This can be demonstrated by running a load test whilst deploying a change to the chart values that causes pods to be rolled: % (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) & [1] 99452 * % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) && fg Release "test" has been upgraded. Happy Helming! … [1] + 99452 running ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration ) % vegeta report vegeta.out Requests [total, rate, throughput] 1500, 50.03, 48.33 Duration [total, attack, wait] 29.983s, 29.98s, 3.297ms Latencies [min, mean, 50, 90, 95, 99, max] 1.765ms, 3.823ms, 3.505ms, 5.353ms, 6ms, 8.102ms, 36.237ms Bytes In [total, mean] 2448, 1.63 Bytes Out [total, mean] 0, 0.00 Success [ratio] 96.60% Status Codes [code:count] 201:1449 404:51 Error Set: 404 Not Found When tested with this change supplied to the latest chart: % (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) & [1] 4518 * % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) --set ingressController.readinessProbe.httpGet.path=/readyz && fg Release "test" has been upgraded. Happy Helming! … [1] + 4518 running ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration ) % vegeta report vegeta.out Requests [total, rate, throughput] 1500, 50.03, 50.03 Duration [total, attack, wait] 29.983s, 29.98s, 3.651ms Latencies [min, mean, 50, 90, 95, 99, max] 1.569ms, 4.745ms, 3.502ms, 5.851ms, 21.034ms, 22.15ms, 41.41ms Bytes In [total, mean] 0, 0.00 Bytes Out [total, mean] 0, 0.00 Success [ratio] 100.00% Status Codes [code:count] 201:1500 Error Set: The functionality of this endpoint was extended in KIC 2.0.0: - Kong/kubernetes-ingress-controller@b0ff612 I thought that the problem might have been introduced by the startup sleep in KIC 2.2.1 but the first version that I can replicate it with is KIC 2.3.x: - Kong/kubernetes-ingress-controller@30ff318 This problem was caught by our own CI tests. A similar approach could be used in `scripts/test-upgrade.sh` to prevent future regressions, but I'd like some feedback on the idea before implementing it in a separate PR.

dcarley · 2023-01-25T09:53:32Z

@pmalek @rainest Thanks for the quick review and merge!

This problem was caught by our own CI tests. A similar approach could be used in scripts/test-upgrade.sh to prevent future regressions, but I'd like some feedback on the idea before implementing it in a separate PR.

Do you have any thoughts on this? I'm happy to raise a PR but would like to agree on the implementation before working on it.

pmalek · 2023-01-25T10:00:32Z

Do you have anything specific in mind? Like running an http load test and expecting 0 404s? I'm wondering if this wouldn't end up flaky but we could try this out to verify our current setup and gather conclusions about it and re-iterate on it.

rainest · 2023-01-25T18:24:12Z

We do have an internal initiative (there's no public tracking for it yet, but for any Kong staff that want to review it, the ticket is FTI-3081) that should address this by making the gateway able to report its own config readiness. This is tentatively planned for gateway 3.3, but I can't confirm if it will definitely make it.

The gateway reporting its own status will make it easier to ensure it doesn't return 404s due to lack of configuration. What we've encountered in the past, and that you may be seeing, is that there are cases where the controller reports that it's sent configuration and marks itself ready even though Kong isn't ready to serve requests yet. While this is usually brief, Controller-side readiness reporting can't avoid these because the controller only knows whether the update request has completed, not the internal gateway-side updates that need to complete after that request.

The gateway does not route requests using the raw configuration sent by the controller. It builds a separate set of derived configuration (the "router", "balancer", and "plugins iterator") that are optimized for handling requests quickly--using the raw JSON config would result in much longer request processing times. Building this derived configuration does take some time, and while it doesn't take long, there is a brief window after the config update occurs where Kong can still serve 404s. Moving readiness reporting into the gateway will let it report that it has completed building those internal structures and avoid this.

Absent this feature, the workaround we have is to create a proxy route and use that as Kong's readiness endpoint instead of the default /status. This ensures the router is ready (the route will return a 404 if it isn't), it just isn't something we can do out of the box, since there's no default set of routes that are guaranteed to be available.

dcarley requested a review from a team as a code owner January 24, 2023 13:39

dcarley mentioned this pull request Jan 24, 2023

Sleep-based synchronization between cache warmup and the first sync to Kong Kong/kubernetes-ingress-controller#2315

Open

2 tasks

pmalek approved these changes Jan 24, 2023

View reviewed changes

rainest merged commit 6b75b59 into Kong:main Jan 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ingress) Use /readyz for readinessProbe #716

fix(ingress) Use /readyz for readinessProbe #716

dcarley commented Jan 24, 2023

dcarley commented Jan 25, 2023

pmalek commented Jan 25, 2023

rainest commented Jan 25, 2023 •

edited

Loading

fix(ingress) Use /readyz for readinessProbe #716

fix(ingress) Use /readyz for readinessProbe #716

Conversation

dcarley commented Jan 24, 2023

What this PR does / why we need it:

Special notes for your reviewer:

Checklist

dcarley commented Jan 25, 2023

pmalek commented Jan 25, 2023

rainest commented Jan 25, 2023 • edited Loading

rainest commented Jan 25, 2023 •

edited

Loading