-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
summarize: Consider obj status condition in result #703
Conversation
80a0261
to
f7dd60a
Compare
afterFunc: func(t *WithT, obj conditions.Setter, patchOpts *patch.HelperOptions) { | ||
t.Expect(patchOpts.IncludeStatusObservedGeneration).To(BeFalse()) | ||
}, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate test case.
f7dd60a
to
d64def3
Compare
SummarizeAndPatch() should also consider the object's status conditions when computing and returning the runtime results to avoid any inconsistency in the runtime result and status condition of the object. When an object's Ready condition is False, the reconciler should retry unless it's in stalled condition. Signed-off-by: Sunny <[email protected]>
d64def3
to
2240106
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @darkowlzz 🏅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really good stuff @darkowlzz
There are several FluxCD issues which were attempted to be fixed by implementing recovery logic into app framework. A new issue was observed that may be hit more often because of the recovery logic. Flux won't attempt to finish updating a HelmRelease resource to Ready=True but only display a message containing this signature: '''the object has been modified...'''. The resource must be correctly updated by Flux. Aside from this the recovery logic was put in place for HelmChart resources that are in Ready=False state, but this should have been handled by Flux. I believe the 2 issues described above map to [1] and [2]. To pull [1] and [2] updated FluxCD release to latest available [3]. Release is composed of manifests and container image used. Update containers used: helm-controller from v0.15.0 to v0.27.0, source-controller from v0.20.1 to v0.32.1. Update flux crds, rbac, deployment. Changelog claims we get k8s 1.25 support part of the upversion. Disclaimer for tests: 1) Recovery logic concerned with flipping spec.suspend was removed. Introduced by ongoing [4] 2) optimization by flipping spec.suspend was removed Introduced by ongoing[4] 3) cert-manager, nginx-ingress-controller, platform-integ-apps had the reconciliation interval decreased to 1m to allow Flux to manage the resources by itself in a reasonable time interval. There will be future commits per app updating reconciliation interval. Tests on AIO-SX: PASS: bootstrap PASS: unlocked enabled available PASS: apps applied PASS: tested on k8s 1.24.4 and 1.21.8, inherit that versions between should be OK PASS: delete pods and see they are recreated and no errors PASS: inspect flux pod logs for errors PASS: re-test known trigger for 1996747 and 1995748 [1]: fluxcd/source-controller#703 [2]: fluxcd/source-controller#202 [3]: https://github.com/fluxcd/flux2/releases/tag/v0.37.0 [4]: https://review.opendev.org/c/starlingx/config/+/866862 Related-Bug: 1995748 Related-Bug: 1996747 Closes-Bug: 1999032 Signed-off-by: Dan Voiculeasa <[email protected]> Change-Id: Id295599a2946f48081e7d27e2ab8e06063c3c88d
SummarizeAndPatch()
should also consider the object's status conditionswhen computing and returning the runtime results to avoid any
inconsistency in the runtime result and status condition of the object.
When an object's Ready conditions are False, the reconciler should retry
unless it's in stalled state.
Detailed description
Summarize and patch takes the abstracted result and error, and computes the runtime result from it. Separately, it takes the object with status conditions and calculates the status summary and patches the object status.
This can result in a situation where the (sub)reconcilers return a successful result without any error, resulting in a success runtime result. But a negative status condition on the object set by one of the sub-reconcilers would result in the Ready status condition to be False. As a result of this, an object would have Ready=False status and the reconciler wouldn't be attempting any retry to reconcile it. We expect this behavior to only happen when the status condition Stalled=True. Unless Ready=True, the reconciler should always retry.
To make the model consistent, the summarize and patch process can take into consideration the status condition summary for Ready condition.
Runtime result behavior with respect to Ready condition:
If Ready=True, but the abstracted result is
ResultRequeue
and nil error, the runtime result should haveRequeue: true
.If Ready=True, but the abstracted result is
ResultSuccess
and nil error, the runtime result should haveRequeueAfter: <interval>
.If Ready=True, but the reconciler error is not nil, a runtime error is returned, resulting in runtime requeuing the object for reconciliation. Since an error in itself can't be used to set the Ready=False with reason and message, we can't change the status condition to reflect this failure. The retry will happen in the background without anything visible in the object status.
[New behavior] If Ready=False, but the abstracted result is Success and reconciler error is nil, the reconciler error can be changed to reflect the reason for Ready=False by setting the error to be the Ready condition message. This will result in the object to have Ready=False and the reconciler performing a retry.