Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingress) Use /readyz for readinessProbe #716

Merged
merged 1 commit into from
Jan 24, 2023

Conversation

dcarley
Copy link
Contributor

@dcarley dcarley commented Jan 24, 2023

What this PR does / why we need it:

To prevent pods from becoming ready and serving 404s prior to the ingress-controller first syncing config to the proxy.

This can be demonstrated by running a load test whilst deploying a change to the chart values that causes pods to be rolled:

% (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) &
[1] 99452

* % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) && fg
Release "test" has been upgraded. Happy Helming!
…
[1]  + 99452 running    ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration  )

% vegeta report vegeta.out
Requests      [total, rate, throughput]         1500, 50.03, 48.33
Duration      [total, attack, wait]             29.983s, 29.98s, 3.297ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.765ms, 3.823ms, 3.505ms, 5.353ms, 6ms, 8.102ms, 36.237ms
Bytes In      [total, mean]                     2448, 1.63
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           96.60%
Status Codes  [code:count]                      201:1449  404:51
Error Set:
404 Not Found

When tested with this change supplied to the latest chart:

% (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) &
[1] 4518

* % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) --set ingressController.readinessProbe.httpGet.path=/readyz && fg
Release "test" has been upgraded. Happy Helming!
…
[1]  + 4518 running    ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration  )

% vegeta report vegeta.out
Requests      [total, rate, throughput]         1500, 50.03, 50.03
Duration      [total, attack, wait]             29.983s, 29.98s, 3.651ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.569ms, 4.745ms, 3.502ms, 5.851ms, 21.034ms, 22.15ms, 41.41ms
Bytes In      [total, mean]                     0, 0.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      201:1500
Error Set:

Special notes for your reviewer:

The functionality of this endpoint was extended in KIC 2.0.0:

I thought that the problem might have been introduced by the startup sleep in KIC 2.2.1 but the first version that I can replicate it with is KIC 2.3.x:

This problem was caught by our own CI tests. A similar approach could be used in scripts/test-upgrade.sh to prevent future regressions, but I'd like some feedback on the idea before implementing it in a separate PR.

Checklist

  • PR is based off the current tip of the main branch.
  • Changes are documented under the "Unreleased" header in CHANGELOG.md
  • Commits follow the Kong commit message guidelines

To prevent pods from becoming ready and serving 404s prior to the
`ingress-controller` first syncing config to the `proxy`.

This can be demonstrated by running a load test whilst deploying a
change to the chart values that causes pods to be rolled:

    % (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) &
    [1] 99452

    * % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) && fg
    Release "test" has been upgraded. Happy Helming!
    …
    [1]  + 99452 running    ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration  )

    % vegeta report vegeta.out
    Requests      [total, rate, throughput]         1500, 50.03, 48.33
    Duration      [total, attack, wait]             29.983s, 29.98s, 3.297ms
    Latencies     [min, mean, 50, 90, 95, 99, max]  1.765ms, 3.823ms, 3.505ms, 5.353ms, 6ms, 8.102ms, 36.237ms
    Bytes In      [total, mean]                     2448, 1.63
    Bytes Out     [total, mean]                     0, 0.00
    Success       [ratio]                           96.60%
    Status Codes  [code:count]                      201:1449  404:51
    Error Set:
    404 Not Found

When tested with this change supplied to the latest chart:

    % (echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration 30s -timeout 5s -http2=false -keepalive=false -output=vegeta.out) &
    [1] 4518

    * % helm upgrade test kong/kong --install --version 2.14.0 --wait --set env.restart=$(date +%s) --set ingressController.readinessProbe.httpGet.path=/readyz && fg
    Release "test" has been upgraded. Happy Helming!
    …
    [1]  + 4518 running    ( echo "GET http://$PROXY_IP/status/201" | vegeta attack -rate 50 -duration  )

    % vegeta report vegeta.out
    Requests      [total, rate, throughput]         1500, 50.03, 50.03
    Duration      [total, attack, wait]             29.983s, 29.98s, 3.651ms
    Latencies     [min, mean, 50, 90, 95, 99, max]  1.569ms, 4.745ms, 3.502ms, 5.851ms, 21.034ms, 22.15ms, 41.41ms
    Bytes In      [total, mean]                     0, 0.00
    Bytes Out     [total, mean]                     0, 0.00
    Success       [ratio]                           100.00%
    Status Codes  [code:count]                      201:1500
    Error Set:

The functionality of this endpoint was extended in KIC 2.0.0:

- Kong/kubernetes-ingress-controller@b0ff612

I thought that the problem might have been introduced by the startup
sleep in KIC 2.2.1 but the first version that I can replicate it with is
KIC 2.3.x:

- Kong/kubernetes-ingress-controller@30ff318

This problem was caught by our own CI tests. A similar approach could be
used in `scripts/test-upgrade.sh` to prevent future regressions, but I'd
like some feedback on the idea before implementing it in a separate PR.
@rainest rainest merged commit 6b75b59 into Kong:main Jan 24, 2023
@dcarley
Copy link
Contributor Author

dcarley commented Jan 25, 2023

@pmalek @rainest Thanks for the quick review and merge!

This problem was caught by our own CI tests. A similar approach could be used in scripts/test-upgrade.sh to prevent future regressions, but I'd like some feedback on the idea before implementing it in a separate PR.

Do you have any thoughts on this? I'm happy to raise a PR but would like to agree on the implementation before working on it.

@pmalek
Copy link
Member

pmalek commented Jan 25, 2023

Do you have anything specific in mind? Like running an http load test and expecting 0 404s? I'm wondering if this wouldn't end up flaky but we could try this out to verify our current setup and gather conclusions about it and re-iterate on it.

@rainest
Copy link
Contributor

rainest commented Jan 25, 2023

We do have an internal initiative (there's no public tracking for it yet, but for any Kong staff that want to review it, the ticket is FTI-3081) that should address this by making the gateway able to report its own config readiness. This is tentatively planned for gateway 3.3, but I can't confirm if it will definitely make it.

The gateway reporting its own status will make it easier to ensure it doesn't return 404s due to lack of configuration. What we've encountered in the past, and that you may be seeing, is that there are cases where the controller reports that it's sent configuration and marks itself ready even though Kong isn't ready to serve requests yet. While this is usually brief, Controller-side readiness reporting can't avoid these because the controller only knows whether the update request has completed, not the internal gateway-side updates that need to complete after that request.

The gateway does not route requests using the raw configuration sent by the controller. It builds a separate set of derived configuration (the "router", "balancer", and "plugins iterator") that are optimized for handling requests quickly--using the raw JSON config would result in much longer request processing times. Building this derived configuration does take some time, and while it doesn't take long, there is a brief window after the config update occurs where Kong can still serve 404s. Moving readiness reporting into the gateway will let it report that it has completed building those internal structures and avoid this.

Absent this feature, the workaround we have is to create a proxy route and use that as Kong's readiness endpoint instead of the default /status. This ensures the router is ready (the route will return a 404 if it isn't), it just isn't something we can do out of the box, since there's no default set of routes that are guaranteed to be available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants