Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sleep-based synchronization between cache warmup and the first sync to Kong #2249

Closed
1 task done
rainest opened this issue Feb 4, 2022 · 3 comments
Closed
1 task done

Comments

@rainest
Copy link
Contributor

rainest commented Feb 4, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

After acquiring a leadership lease, the manager starts all controllers and the sync loop. The controllers begin populating the proxy cache with the current set of resources available from the API server, and the sync loop begins syncing. The proxy cache is initially empty, so its first sync will attempt to reconcile an empty target state, deleting existing resources, and then create them again on the next sync:

I0204 17:46:58.087283       1 leaderelection.go:258] successfully acquired lease kong/5b374a9e.konghq.com
...
time="2022-02-04T17:46:58Z" level=info msg="Starting EventSource" logger=controller.ingress reconciler group=networking.k8s.io reconciler kind=Ingress source="kind source: *v1.Ingress"
time="2022-02-04T17:46:58Z" level=info msg="Starting Controller" logger=controller.ingress reconciler group=networking.k8s.io reconciler kind=Ingress
...
time="2022-02-04T17:46:58Z" level=info msg="reconciling resource" NetV1Ingress="{\"Namespace\":\"default\",\"Name\":\"httpbin\"}" logger=controllers.Ingress.netv1 name=httpbin namespace=default
...
deleting route default.httpbin.00
time="2022-02-04T17:46:58Z" level=info msg="successfully synced configuration to kong." subsystem=proxy-cache-resolver
...
creating route default.httpbin.00

Note that while we have reconciled the resource before the deleting route log from deck, that deck log fires after the sync loop has actually determined the target state. We don't have further logging in the sync loop that would tell us the exact time it finalized the state to sync.

Expected Behavior

KIC leaves Kong resources that represent Kubernetes resources present at the time KIC gains leadership in place.

Steps To Reproduce

1. Create at least one Ingress+Service combination.
2. Install a single-replica Kong+KIC Deployment in Postgres mode. Wait for a successful sync.
3. Perform a rolling restart and watch the replacement Pod's controller logs.
4. After some lag before obtaining leadership, you should observe that the controller deletes resources and then creates them again.

Kong Ingress Controller version

v2.1, v2.2

Kubernetes version

No response

Anything else?

This behavior was present prior to 2.1, but was not observed because defaults usually meant that all instances ran with leader election disabled. #2053 fixed some issues with leader election and made it the only option for DB-backed instances.

Initial review does not offer obvious easy solutions:

  • Separation between the controller cache (the client cache controller-runtime creates when instantiating a Manager) and proxy cache (our store module, used as the source of truth for the sync loop) and the need to ferry resources from one to the other is the root of this issue, though it's unclear if there's any viable means of unifying them. Managers do not expose their cache, and AFAIK we can only receive watch-based events from them, not call Reconcile() on a timer.
  • Individual resource controllers do have an exported queue, and this should support a Len() method, but the list of controllers is not available from the Manager. Furthermore, trying to use this would get us into problem situations if there's any activity in the cluster: we cannot rely on the queue dropping to 0 (though in practice it probably would eventually unless you're intentionally trying to flood it constantly with events--the reconciles don't do much, so they're quick).
  • A hacky approach would be to allow controllers to run before leader election--while not guaranteed to reconcile everything, they'd have a chance of ingesting all resources before starting the sync. However, it does not appear that controllers' leadership election requirements are configurable, and the default unconfigured behavior is to require leadership.
@rainest rainest added the bug Something isn't working label Feb 4, 2022
@rainest
Copy link
Contributor Author

rainest commented Feb 4, 2022

Between the above and conversation with controller-runtime devs, I think the only option that solves this completely is to remove the secondary cache, i.e remove most of the store package functionality and replace it with something that uses the same client as the Manager, at which point we can just call List().

I expect this also obsoletes all the individual controllers other than GWAPI controllers, which have other tasks beyond making resources available to the store/parser.

@rainest
Copy link
Contributor Author

rainest commented Feb 7, 2022

The PoC in #2250 indicates that we should be able to rework the store to use the manager client. The PoC injects the manager client into a specific store function to support it and the traditional approach in the other functions, but ultimately we'd use it to replace the Store.stores set in store.New() after converting everything over.

Parser and store tests will need to be reworked to stuff their objects into https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/client/fake

rainest added a commit that referenced this issue Feb 9, 2022
Insert a 5-second delay in the proxy start function. This allows time
for the controllers to populate the proxy cache before the first sync,
as a temporary mitigation for #2249.
rainest added a commit that referenced this issue Feb 9, 2022
Insert a 5-second delay in the proxy start function. This allows time
for the controllers to populate the proxy cache before the first sync,
as a temporary mitigation for #2249.

Add unreleased 2.2.1 changelog entry.

Add missing 2.2.0 changelog TOC entry.
rainest added a commit that referenced this issue Feb 10, 2022
Insert a 5-second delay in the proxy start function. This allows time
for the controllers to populate the proxy cache before the first sync,
as a temporary mitigation for #2249.

Add unreleased 2.2.1 changelog entry.

Add missing 2.2.0 changelog TOC entry.

Co-authored-by: Shane Utt <[email protected]>
@mflendrich mflendrich added area/debt priority/low and removed bug Something isn't working priority/high labels Mar 10, 2022
@mflendrich mflendrich changed the title Controller and sync loop race deletes resources upon acquiring leadership Sleep-based synchronization between cache warmup and the first sync to Kong Mar 10, 2022
@mflendrich
Copy link
Contributor

Closed in favor of #2315

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants