diff --git a/teps/0123-specify-on-demand-retry-in-pipelinetask.md b/teps/0123-specify-on-demand-retry-in-pipelinetask.md
new file mode 100644
index 000000000..a451046e2
--- /dev/null
+++ b/teps/0123-specify-on-demand-retry-in-pipelinetask.md
@@ -0,0 +1,119 @@
+---
+status: proposed
+title: Specifying on-demand-retry in a PipelineTask
+creation-date: '2022-09-16'
+last-updated: '2022-09-16'
+authors:
+- '@pritidesai'
+---
+
+# TEP-0123: Specifying on-demand-retry in a PipelineTask
+
+
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases](#use-cases)
+    - [CD Use Case](#cd-use-case)
+    - [CI Use Case](#ci-use-case)
+- [References](#references)
+
+
+## Summary
+
+This TEP proposes a mechanism that lets users choose when to retry a failed `pipelineTask` of a pipeline.
+An on-demand retry of a `pipelineTask` is executed from scratch with the original `params` and `specifications`.
+On-demand retry allows users to keep retrying a `pipelineTask` until it succeeds. After the offending task
+succeeds, the rest of the `pipeline` continues executing as usual until completion.
+
+
+## Motivation
+
+The primary motivation of this proposal is to help pipeline authors overcome flakes and to allow running
+the same instance of the pipeline until it finishes.
+
+### Goals
+
+* Provide a mechanism to enable on-demand-retry of a `pipelineTask` in a `pipeline`.
+* Provide a mechanism to signal a failed `taskRun` to start executing with the original specifications. This mechanism
+will only be supported for a `pipelineTask` which opts in for on-demand-retry.
+* The resolved specifications (`params`, the resolved results from a parent task, `taskRef` or `taskSpec`, etc.) cannot
+be updated during such a retry.
+* A pipeline author can specify more than one on-demand-retry `pipelineTask`, at any level in the `DAG`, including
+the root node and leaf nodes.
+* If an on-demand-retry `pipelineTask` fails again after its `taskRun` is signalled to start executing, it goes back into the same
+on-demand-retry mode. The users have the option to retry such a task again.
+* If an on-demand-retry `pipelineTask` does not succeed before the pipeline hits its timeout, the failed tasks will
+eventually be canceled.
+* Until all on-demand-retry `pipelineTasks` succeed and the rest of the `pipelineTasks` finish executing, the `pipelineRun`
+status stays as `running`.
+
+
+### Non-Goals
+
+* Pause/Resume:
+  * Support pausing a running `taskRun` as that would require pausing the `pod` itself.
+  * Support suspending a running `taskRun` for some manual intervention such as approval before resuming the same `taskRun`
+    and the rest of the `pipeline`.
+  * Support pausing a running `taskRun` until a condition is met.
+* Ignoring a task failure and continuing to run the rest of the `pipeline`, which is proposed in [TEP-0050](0050-ignore-task-failures.md).
+* Partial pipeline execution - [TEP-0077](https://github.com/tektoncd/community/pull/484) (not merged) and [tektoncd/pipeline issue#50](https://github.com/tektoncd/pipeline/issues/50)
+  * Create a separate `pipelineRun` with an option of choosing different pipeline params.
+* Retry failed tasks on demand in a pipeline - [TEP-0065](https://github.com/tektoncd/community/pull/422) (not merged)
+  * Even though the title sounds almost the same, the focus of this TEP is to declare a `pipelineTask` with on-demand-retry
+as part of the `pipeline` specification rather than relying on any manual intervention during runtime to decide whether
+to rerun or not.
+* [Disabling a task in a pipeline](https://docs.google.com/document/d/1rleshixafJy4n1CwFlfbAJuZjBL1PQSm3b0Q9s1B_T8/edit#heading=h.jz9jia3av6h1)
+  * Disabling a task in a pipeline is driven by runtime configuration rather than being part of the `pipeline` specification itself.
+* Update existing `retry` functionality.
+
+### Use Cases
+
+#### CD Use Case
+
+Let's take an example of a CD `pipeline` to deploy an application to production. The CD `pipeline` is configured to
+trigger a new run for every new change request, i.e. a PR created in the `application-deployment` repo.
+
+The release manager creates a new change request by creating a PR with the details needed for the deployment such as
+an application GitHub repo and a specific branch or commit. Creation of a PR triggers a new `pipelineRun` in which the
+application source is cloned, built, and deployed to the production cluster. After the deployment succeeds, a flaky
+acceptance test is executed against the production deployment. The acceptance tests are flaky and sometimes result in
+failure. This failure prevents updating the change request with the deployment evidence.
+
+Now, the release manager has two choices to work around this flakiness:
+
+1) Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun` which
+will start from the beginning, i.e. clone, build, and deploy the same application source.
+
+2) Create an asynchronous pipeline with just the two tasks `acceptance-test` and `submit-evidence`. Trigger this `pipeline`
+manually with the data (deployment evidence and other configuration needed for the test) from the failed `pipelineRun`.
+
+![CD Use Case](images/0123-cd-use-case.png)
+
+#### CI Use Case
+
+Let's take an example of a CI `pipeline` to validate and test the changes being proposed. Tekton projects have a set of
+checks defined to test the changes from any contributor before merging them upstream. To understand this use case,
+let's assume Tekton projects have not enabled `prow` functionalities.
+
+A contributor creates a PR with the changes, and creation of the PR triggers a new `pipelineRun` in which a branch from the
+contributor's forked repo is cloned, the coverage report is generated, and a flaky unit test (just like one of our projects)
+and other tests are executed. The flaky unit test often fails and requires a couple of retries before it can
+succeed.
+
+Without access to the prow command `/test unit-test`, a Tekton contributor has to close and reopen the
+PR to trigger a new `pipelineRun`. Once the `pipelineRun` is triggered, it runs all the tasks including `coverage`, `unit test`,
+`build` and `deploy` followed by other tests, all of which ran successfully in the first attempt.
+
+![CI Use Case](images/0123-ci-use-case.png)
+
+
+## References
+
+* [How can I pause the running task and later resume it or retry the failed task even if it exceeds the Retry times?](https://github.com/tektoncd/pipeline/issues/5348)
+* [Add pending/pause task in pipeline](https://github.com/tektoncd/pipeline/issues/3796)
+* [TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns](https://github.com/tektoncd/community/pull/203)
+* [How to retry only failed tests in the CI job run on Gitlab?](https://stackoverflow.com/questions/63612992/how-to-retry-only-failed-tests-in-the-ci-job-run-on-gitlab)
+* [Retrying failing jobs](https://docs.bullmq.io/guide/retrying-failing-jobs)
\ No newline at end of file
diff --git a/teps/README.md b/teps/README.md
index ce653dfc7..90f37f4c1 100644
--- a/teps/README.md
+++ b/teps/README.md
@@ -286,3 +286,4 @@ This is the complete list of Tekton teps:
 |[TEP-0118](0118-matrix-with-explicit-combinations-of-parameters.md) | Matrix with Explicit Combinations of Parameters | implementable | 2022-08-08 |
 |[TEP-0119](0119-add-taskrun-template-in-pipelinerun.md) | Add taskRun template in PipelineRun | implementable | 2022-09-01 |
 |[TEP-0120](0120-canceling-concurrent-pipelineruns.md) | Canceling Concurrent PipelineRuns | proposed | 2022-08-19 |
+|[TEP-0123](0123-specify-on-demand-retry-in-pipelinetask.md) | Specifying on-demand-retry in a PipelineTask | proposed | 2022-09-16 |
diff --git a/teps/images/0123-cd-use-case.png b/teps/images/0123-cd-use-case.png
new file mode 100644
index 000000000..fcb537a0a
Binary files /dev/null and b/teps/images/0123-cd-use-case.png differ
diff --git a/teps/images/0123-ci-use-case.png b/teps/images/0123-ci-use-case.png
new file mode 100644
index 000000000..7d551d76b
Binary files /dev/null and b/teps/images/0123-ci-use-case.png differ