Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow marking releases stuck in a pending state as failed #16

Merged
merged 8 commits into from
Jun 29, 2021

Conversation

misberner
Copy link
Contributor

@misberner misberner commented Jun 26, 2021

This PR adds an option WithMarkFailedAfter(duration) which allows marking a release seemingly stuck in a pending state (pending-install, pending-upgrade, pending-rollback) as "failed" after a given timeout (measured from the "last deployed" timestamp). The "failed" state will allow the next "upgrade" operation to succeed.

@misberner misberner changed the title Allow auto-rollback after a timeout for stuck releases Allow marking releases stuck in a pending state as failed Jun 28, 2021
@misberner misberner requested a review from SimonBaeumer June 28, 2021 20:42
@@ -755,6 +777,35 @@ func (r *Reconciler) doUpgrade(actionClient helmclient.ActionInterface, u *updat
return rel, nil
}

func (r *Reconciler) handlePending(actionClient helmclient.ActionInterface, rel *release.Release, u *updater.Updater, log logr.Logger) (ctrl.Result, error) {
Copy link
Member

@SimonBaeumer SimonBaeumer Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a new test case for a reconciliation with a MarkFailedAfter which tests the transisition from a pending to a failed release (and maybe even the roll-forward)?
As these changes affect the core logic of our reconciliations it is important from my point of view.

}
u.UpdateStatus(
updater.EnsureCondition(conditions.Irreconcilable(corev1.ConditionTrue, conditions.ReasonPendingError, err)))
return ctrl.Result{}, err
Copy link
Member

@SimonBaeumer SimonBaeumer Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think adding an interval to set the RequeueAfter property is useful here?
My fear is that the reconciliations are really fast in production systems and could create bigger loads on the API server which in turn leads to a slower API server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline: the operator already uses a builtin rate-limiter to prevent overloading the API server. The situation we are handling here should be rare, and we expect that the primary cause are people running manual Helm operations. For these, the pending state will only last for ~10s at most, and we don't want to block the operator for the next 2m if they encounter such a state.

Note: with dependent watches, this might be less of an issue, but we currently don't have those.

@@ -133,6 +140,11 @@ func (c *ActionClient) Upgrade(name, namespace string, chrt *chart.Chart, vals m
return c.HandleUpgrade()
}

func (c *ActionClient) MarkFailed(rel *release.Release, reason string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea to mark the release as failed and being reconciled by the existing logic. Nice! 💯

@misberner misberner merged commit c7be7f8 into main Jun 29, 2021
@misberner misberner deleted the mi/handle-pending branch June 29, 2021 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants