Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0114: Resolve the Flaky Test - TestWaitCustomTask_PipelineRun #5658

Merged
merged 1 commit into from
Oct 25, 2022

Conversation

XinruZhang
Copy link
Member

@XinruZhang XinruZhang commented Oct 18, 2022

Changes

Fix #5653

I'm openning this PR as a tentative solution to address #5653:

TL;DR: TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout has been flaky for a while. This PR stops the PipelineRun reconciler from cancelling Run when it detects that the task-level Timeout configured for the Run has passed, which will address the flake (similar to #5134 which addresses TestPipelineRunTimeout).

Some other useful information, many thanks to @abayer, @lbernick and @jerop:

  • Run should only be canceled when
    • The PipelineRun as a whole has timed out, either due to PipelineRun.Spec.Timeout's value or PipelineRun.Spec.Timeouts.Pipeline's value.
    • It's a non-finally PipelineTask, PipelineRun.Spec.Timeouts.Tasks is set, and that value has elapsed.
    • It's a finally PipelineTask, PipelineRun.Spec.Timeouts.Finally is set, and that value has elapsed.
  • For TaskRun, pipelinerun reconciler don't cancel it when it times out.

This change only applies to task-level Timeout, i.e. when PipelineTask.Timeout is set for a Run.

cc @pritidesai Will this commit have any other side effect?

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

ACTION REQUIRED: Starting from this release, Custom Task Runs controllers need to implement the `Timeout` on your own,  PipelineRun reconciler would not set `Run.Spec.Status == RunCancelled` upon Run timeout.

@tekton-robot tekton-robot added release-note-none Denotes a PR that doesnt merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 18, 2022
@tekton-robot tekton-robot requested review from dibyom and jerop October 18, 2022 20:39
@XinruZhang
Copy link
Member Author

/kind flake

@tekton-robot tekton-robot added the kind/flake Categorizes issue or PR as related to a flakey test label Oct 18, 2022
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 85.2% 85.6% 0.4

@XinruZhang XinruZhang marked this pull request as draft October 18, 2022 20:48
@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 18, 2022
@XinruZhang XinruZhang force-pushed the flaky-custom-run-test branch from 6afa586 to 19d451c Compare October 19, 2022 17:12
@XinruZhang XinruZhang marked this pull request as ready for review October 19, 2022 19:46
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 19, 2022
@tekton-robot tekton-robot requested a review from abayer October 19, 2022 19:46
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 85.0% 85.4% 0.4

@abayer
Copy link
Contributor

abayer commented Oct 19, 2022

  • Run should only be canceled when its uplevel PipelineRun is timed out

A clarification: a Run is cancelled for timing out when any of the following are true:

  • The PipelineRun as a whole has timed out, either due to PipelineRun.Spec.Timeout's value or PipelineRun.Spec.Timeouts.Pipeline's value.
  • It's a non-finally PipelineTask, PipelineRun.Spec.Timeouts.Tasks is set, and that value has elapsed.
  • It's a finally PipelineTask, PipelineRun.Spec.Timeouts.Finally is set, and that value has elapsed.

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @XinruZhang for looking into this flake!

Please update the release notes to reflect that, with this change, implementors of Runs would have to fully implement Run timeouts themselves - PipelineRun reconciler would not set Run.Spec.Status == RunCancelled upon Run timeout

Might be also worth clarifying that this change applies only to task-level timeouts, which should be handled by run reconcilers for custom tasks, in the same way that taskrun reconcilers handles task-level timeouts for tasks (since #5134)

Pipeline-level timeouts continue to work as is - through cancelling TaskRuns and Runs:

for _, runName := range runNames {
logger.Infof("cancelling Run %s for timeout", runName)
if err := timeoutRun(ctx, runName, pr.Namespace, clientSet); err != nil {
errs = append(errs, fmt.Errorf("Failed to patch Run `%s` with cancellation: %s", runName, err).Error())
continue
}
}

A similar change (#5134) addressed flakiness in TestPipelineRunTimeout

cc @tektoncd/core-maintainers

@tekton-robot tekton-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed release-note-none Denotes a PR that doesnt merit a release note. labels Oct 20, 2022
@XinruZhang
Copy link
Member Author

Appreciate all the info @jerop! Updated the PR description and release note.

Copy link
Member

@lbernick lbernick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more descriptive commit message could be something like "Remove PipelineRun cancelation of Runs when Pipeline Task timeout is reached"

pkg/reconciler/pipelinerun/pipelinerun_test.go Outdated Show resolved Hide resolved
@XinruZhang XinruZhang force-pushed the flaky-custom-run-test branch from 19d451c to 37d3bb3 Compare October 20, 2022 14:31
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 85.0% 85.4% 0.4

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for investigating and addressing this flake @XinruZhang!

I support this change, but will wait for @ScrapCodes (who may depend on this functionality) to chime in - @tektoncd/core-maintainers please take a look

For future reference, the wait custom task that we use for testing has fully implemented timeouts itself (can serve as an example of not relying on pipelinerun reconciler) -

timeout := r.GetTimeout()
if duration == timeout {
r.Status.MarkRunFailed("InvalidTimeOut", "Spec.Timeout shouldn't equal duration")
return nil
}
elapsed := c.Clock.Since(r.Status.StartTime.Time)
// Custom Task is running and not timed out
if r.Status.StartTime != nil && elapsed <= duration && elapsed <= timeout {
logger.Infof("The Custom Task Run %s is running", r.GetName())
waitTime := duration.Nanoseconds()
if timeout.Nanoseconds() < waitTime {
waitTime = timeout.Nanoseconds()
}
return controller.NewRequeueAfter(time.Duration(waitTime))
}
if r.Status.StartTime != nil && elapsed > duration && elapsed <= timeout {
logger.Infof("The Custom Task Run %v finished", r.GetName())
r.Status.CompletionTime = &metav1.Time{Time: c.Clock.Now()}
r.Status.MarkRunSucceeded("DurationElapsed", "The wait duration has elapsed")
return nil
}
// Custom Task timed out
if r.Status.StartTime != nil && elapsed > timeout {
logger.Infof("The Custom Task Run %v timed out", r.GetName())
r.Status.CompletionTime = &metav1.Time{Time: c.Clock.Now()}
r.Status.MarkRunFailed("TimedOut", WaitTaskCancelledByRunTimeoutMsg)
// Retry if the current RetriesStatus hasn't reached the retries limit
if r.Spec.Retries > len(r.Status.RetriesStatus) {
logger.Infof("Run timed out, retrying... %#v", r.Status)
retryRun(r)
return controller.NewRequeueImmediately()
}
return nil
}

@jerop jerop added this to the Pipelines v0.41 milestone Oct 25, 2022
@@ -1772,7 +1765,7 @@ status:
}, {
Operation: "add",
Path: "/spec/statusMessage",
Value: string(v1alpha1.RunCancelledByPipelineMsg),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this constant be removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel It's better if we keep it same as other test cases. Oh actually, we may want to update the previous one as v1alpha1.RunReasonCancelled, nice catch!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I meant can RunCancelledByPipelineMsg be removed from the v1alpha1 package?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, you are totally right, sorry I should've asked. We'll need to move all related usage of v1alpha1.Run to v1beta1 later on as part of #5153, added #5153 (comment) to make sure this will happen.

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2022
…reached

TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout has been
flaky for a while. This commit stops the PipelineRun reconciler from
cancelling Run when it detects that the task-level Timeout configured
for the Run has passed, which will address the flake (similar to tektoncd#5134
which addresses TestPipelineRunTimeout).
@XinruZhang XinruZhang force-pushed the flaky-custom-run-test branch from 37d3bb3 to a1da340 Compare October 25, 2022 13:29
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 85.0% 85.4% 0.4

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2022
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop, lbernick

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jerop
Copy link
Member

jerop commented Oct 25, 2022

/test tekton-pipeline-unit-tests

cloud events flake

@tekton-robot tekton-robot merged commit 867fe2d into tektoncd:main Oct 25, 2022
@ScrapCodes
Copy link
Contributor

ScrapCodes commented Oct 26, 2022

@jerop and @XinruZhang I was wondering, without this feature how will custom task controller "reconcile" and take action on timed-out task. Basically, this processTimeout causes them to trigger reconcile.

@XinruZhang
Copy link
Member Author

XinruZhang commented Oct 26, 2022

Hi Prashant, thanks for keeping eye on this issue!

If the Run controller handles the timeout correctly -- mark the Run as Failed on timeout, then we wouldn't have any problem.
If the Run controller doesn't implement the TimeOut functionality, the entire PipelineRun will eventually fail on pipeline-level timeout.

@ScrapCodes
Copy link
Contributor

Hi Xinru,
Thanks!, that part is understood. Suppose a custom task is performing an action and is stuck for some reason. The custom task controller will not get to "reconcile" it until it is updated in someway.
The reconcile is triggered not periodically but whenever there is an update. I need to test this out, to be sure.

@XinruZhang
Copy link
Member Author

Oh I see, great point! After thinking more about it, I think that would be a concern of Custom Run controller right? A custom controller can requeue a key after a period of time, similar to

// Custom Task is running and not timed out
if r.Status.StartTime != nil && elapsed <= duration && elapsed <= timeout {
logger.Infof("The Custom Task Run %s is running", r.GetName())
waitTime := duration.Nanoseconds()
if timeout.Nanoseconds() < waitTime {
waitTime = timeout.Nanoseconds()
}
return controller.NewRequeueAfter(time.Duration(waitTime))
}

@ScrapCodes
Copy link
Contributor

ScrapCodes commented Oct 27, 2022

Is this scalable? when number of objects are large. Do you think the overhead of examining a large number of run object by custom task controller, every object's "timeout time" can impact performance?

This is already used by pipeline controller, so we are good.

@jerop jerop changed the title Resolve the Flaky Test - TestWaitCustomTask_PipelineRun TEP-0114: Resolve the Flaky Test - TestWaitCustomTask_PipelineRun Nov 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/flake Categorizes issue or PR as related to a flakey test lgtm Indicates that a PR is ready to be merged. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PipelineRun reconciler shouldn't care about how the Timeout field is implemented in Custom Task Run.
6 participants