Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

controller indicates ready only after verifying proxy configured work #1720

Merged
merged 14 commits into from
Aug 26, 2021

Conversation

rainest
Copy link
Contributor

@rainest rainest commented Aug 17, 2021

What this PR does / why we need it:
Only mark the controller ready after it can assume that the proxy has configuration available, either immediately if the proxy has a database or after the first config push if the proxy has no database.

Replaces the no-op Ping function with a dynamically-scoped ConfigReady() function. ConfigReady() returns a nil error if the previously instantiated Proxy's IsReady() function returns true or returns a "proxy not yet configured" error if IsReady() returns false

IsReady() returns true if:

  • proxy.dbmode == "off" and len(proxy.lastConfigSHA) > 0 (i.e. we have applied config at least once).
  • proxy.dbmode is any other value (we assume config is probably already available in the DB). This doesn't handle the initial start edge case where the database is empty, but is usually good enough. Handling the edge case requires some complex nonsense with figuring out if the cluster controller leader has applied configuration (since non-leaders never apply config themselves).

It returns false otherwise.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #1504

Special notes for your reviewer:
I haven't the foggiest how you test this consistently and meaningfully.

Unit tests seem not particularly useful since we'd need to stuff a fake value into lastConfigSHA rather than allowing it to populate naturally. Effectively all we confirm is that string comparisons and slice length checks work as expected.

Integration tests are complicated by many factors:

  • Empty configuration is valid and will populate lastConfigSHA. The controller will sync even before you've created any Ingresses and such. We can only make it not sync by either not starting the proxy (more or less what we had before, but not easily doable in our test environment because we don't get to let Kubernetes restart the container if the initial connection attempt fails) or by creating resources that result in invalid configuration before anything else applies (not sure how to do this with the test parallelization), confirm not ready, rectify the error, and confirm ready.
  • We don't have any way to observe the proxy internals even if we can control the expected config state with initially intentionally broken config. I think we either scrape logs and check for the success message (not sure if we can do this), look at the config diagnostic failed/successful outputs (reasonable enough, I guess), or look at the proxy admin API (can't distinguish between successful empty config and unsuccessful config).

A manual test of the image does at least confirm that the controller becomes ready, but even in manual runs I don't know how I can confirm its behavior that it doesn't become ready when it shouldn't.

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

Only mark the controller ready after it can assume that the proxy has
configuration available, either immediately if the proxy has a database
or after the first config push if the proxy has no database.
@github-actions
Copy link

Licenses differ between commit 2609005350500696c825a5647805f0f482b39a57 and base:

+++ pr_licenses.csv	2021-08-17 20:43:46.326653223 +0000
@@ -1,5 +1,4 @@
-cloud.google.com/go/compute/metadata,Unknown,Apache-2.0
-cloud.google.com/go/container/apiv1,Unknown,Apache-2.0
+cloud.google.com/go,Unknown,Apache-2.0
 github.com/Masterminds/goutils,https://github.com/Masterminds/goutils/blob/master/LICENSE.txt,Apache-2.0
 github.com/Masterminds/semver,https://github.com/Masterminds/semver/blob/master/LICENSE.txt,MIT
 github.com/Masterminds/sprig,https://github.com/Masterminds/sprig/blob/master/LICENSE.txt,MIT```

@rainest rainest temporarily deployed to Configure ci August 17, 2021 20:54 Inactive
Copy link
Contributor

@ccfishk ccfishk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function comments

Copy link
Contributor

@mflendrich mflendrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my idea for testing this bit of functionality:

  • Implement a unit test with a mock proxy, and then:
    • after initialization, ensure that ready == false
    • report readiness in the mock
    • ensure that ready == true
  • modify the existing integration test harness to block until the proxy reports readiness. This will check only the "positive" half of the readiness check, though, but the negative case should be sufficiently covered by the unit test.

Several tags were incorrect in the changelog footer links. Align tags
with their actual names and the appropriate comparison versions.
@rainest rainest temporarily deployed to Configure ci August 18, 2021 18:26 Inactive
@rainest rainest temporarily deployed to Configure ci August 19, 2021 18:34 Inactive
@rainest
Copy link
Contributor Author

rainest commented Aug 19, 2021

Tests more or less follow #1720 (review). It doesn't actually block testing on readiness, but does confirm that readiness is eventually good, which is roughly equivalent. Most of the other tests imply readiness because they fail if config didn't post, so it doesn't make much sense to block them.

I'm still a bit iffy on that strategy since IMO the negative case in integration is what matters for verifying the feature. That seems like it'd require some weird refactoring where we pull the go proxy.startProxyUpdateServer() out of NewCacheBasedProxyWithStagger() into its own Proxy.Start() function, call that after mgr.Start(ctx) in manager.Run() and also somehow pause between those calls when running the negative test ☹️

@rainest rainest marked this pull request as ready for review August 19, 2021 18:40
@rainest rainest requested a review from a team as a code owner August 19, 2021 18:40
@rainest rainest requested review from mflendrich and ccfishk August 19, 2021 18:40
Copy link
Contributor

@ccfishk ccfishk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, just typo.

@ccfishk ccfishk changed the title Only become ready after proxy configured controller indicates ready only after verifying proxy configured work Aug 19, 2021
Copy link
Contributor

@mflendrich mflendrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing the discussion from https://github.com/Kong/kubernetes-ingress-controller/pull/1720/files#discussion_r691494074 here because apparently GH doesn't allow responding to that thread anymore.

What I meant is that we could just define the func ConfigReady(_ *http.Request) error function as a plain old top-level function in the global scope of the file, and then use ConfigReady as an argument to mgr.AddReadyzCheck. ConfigReady won't work directly as a value that will be accepted as an argument thereto (because of a type mismatch), so it'd need a type cast: healthz.Checker(ConfigReady).

- Add a dedicated value for having applied config at least once
- Add a lock for the config applied flag
- Use dedicate True/False test assertions
Copy link
Contributor

@ccfishk ccfishk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor semantics

@rainest
Copy link
Contributor Author

rainest commented Aug 20, 2021

#1720 (comment) might work to go to the inline discussion? That one still shows replies available to me, but yeah, the GH UI has gotten confused.

As I read the last comment though, it looks like that suggestion was about using the var to make it a Checker, not about doing it to make the dynamic-scoped function, so without further comment on the new inline version, I'm assuming that's resolved and handles both concerns (using the proxy instance and not allowing variable substitution) and have marked the thread such.

@rainest rainest temporarily deployed to Configure ci August 20, 2021 18:55 Inactive
@rainest rainest requested review from mflendrich and ccfishk August 20, 2021 19:13
ccfishk
ccfishk previously approved these changes Aug 20, 2021
Copy link
Contributor

@ccfishk ccfishk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 💯

mflendrich
mflendrich previously approved these changes Aug 26, 2021
@rainest rainest dismissed stale reviews from mflendrich and ccfishk via ea82563 August 26, 2021 18:40
@rainest rainest enabled auto-merge (squash) August 26, 2021 18:42
@ccfishk ccfishk self-requested a review August 26, 2021 18:45
Copy link
Contributor

@ccfishk ccfishk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@rainest rainest merged commit b0ff612 into main Aug 26, 2021
@rainest rainest deleted the feat/ready-config branch August 26, 2021 18:49
@rainest rainest temporarily deployed to Configure ci August 26, 2021 18:49 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make pod ready only after the first successful sync to Kong
4 participants