[TEP-0123] proposal to specify on-demand-retry in a pipelineTask

Add a proposal to help pipeline authors overcome flakes by allowing the same instance of the pipeline to keep running until it finishes. This TEP proposes a mechanism that lets users choose when to retry a failed `pipelineTask` of a pipeline.
tektoncd · Sep 17, 2022 · 9a5a535
1 parent 5630eaf
commit 9a5a535
Showing 4 changed files with 120 additions and 0 deletions.
diff --git a/teps/0123-specify-on-demand-retry-in-pipelinetask.md b/teps/0123-specify-on-demand-retry-in-pipelinetask.md
@@ -0,0 +1,119 @@
+---
+status: proposed 
+title: Specify on-demand-retry in a PipelineTask
+creation-date: '2022-09-16'
+last-updated: '2022-09-16'
+authors:
+- '@pritidesai'
+---
+
+# TEP-0123: Specifying on-demand-retry in a PipelineTask
+
+
+<!-- toc -->
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Use Cases](#use-cases)
+    - [CD Use Case](#cd-use-case)
+    - [CI Use Case](#ci-use-case)
+- [References](#references)
+<!-- /toc -->
+
+## Summary
+
+This TEP proposes a mechanism that lets users choose when to retry a failed `pipelineTask` of a pipeline.
+An on-demand retry re-executes the `pipelineTask` from scratch with the original `params` and `specifications`.
+Users can keep retrying the `pipelineTask` on demand until it succeeds. After the offending task
+succeeds, the rest of the `pipeline` continues executing as usual until completion.
+
+
+## Motivation
+
+The primary motivation of this proposal is to help pipeline authors overcome flaky failures by allowing them to keep
+running the same instance of the pipeline until it finishes.
+
+### Goals
+
+* Provide a mechanism to enable on-demand-retry of a `pipelineTask` in a `pipeline`.
+* Provide a mechanism to signal a failed `taskRun` to start executing with the original specifications. This mechanism
+will only be supported for a `pipelineTask` which opts in for on-demand-retry.
+* The resolved specifications (`params`, the resolved results from a parent task, `taskRef` or `taskSpec`, etc.) cannot
+be updated during such a retry.
+* A pipeline author can specify more than one on-demand-retry `pipelineTask`, at any level in the `DAG`, including
+root nodes and leaf nodes.
+* If an on-demand-retry `pipelineTask` fails again after its `taskRun` is signalled to start executing, it goes back into
+the same on-demand-retry mode, and users have the option to retry the task again.
+* If an on-demand-retry `pipelineTask` has not succeeded by the time the pipeline hits its timeout, the failed tasks are
+eventually canceled.
+* Until every on-demand-retry `pipelineTask` succeeds and the rest of the `pipelineTasks` finish executing, the `pipelineRun`
+status stays as `running`.
+
+
+### Non-Goals
+
+* Pause/Resume:
+  * Support pausing a running `taskRun` as that would require pausing the `pod` itself.
+  * Support suspending a running `taskRun` for some manual intervention such as approval before resuming the same `taskRun`
+  and the rest of the `pipeline`.
+  * Support pausing a running `taskRun` until a condition is met.
+* Ignoring a task failure and continuing to run the rest of the `pipeline`, which is proposed in [TEP-0050](0050-ignore-task-failures.md).
+* Partial pipeline execution - [TEP-0077](https://github.com/tektoncd/community/pull/484) (not merged) and [tektoncd/pipeline issue#50](https://github.com/tektoncd/pipeline/issues/50)
+  * Create a separate `pipelineRun` with an option of choosing different pipeline params.
+* Retry failed tasks on demand in a pipeline - [TEP-0065](https://github.com/tektoncd/community/pull/422) (not merged)
+  * Even though the title sounds almost the same, the focus of this TEP is to declare a `pipelineTask` with on-demand-retry
+as part of the `pipeline` specification rather than relying on any manual intervention during runtime to decide whether
+to rerun or not.
+* [Disabling a task in a pipeline](https://docs.google.com/document/d/1rleshixafJy4n1CwFlfbAJuZjBL1PQSm3b0Q9s1B_T8/edit#heading=h.jz9jia3av6h1)
+  * Disabling a task in a pipeline is driven by runtime configuration rather than being part of the `pipeline` specification itself.
+* Update the existing `retry` functionality.
+
+### Use Cases
+
+#### CD Use Case
+
+Let's take an example of a CD `pipeline` to deploy an application to production. The CD `pipeline` is configured to
+trigger a new run for every new change request i.e. a PR created in the `application-deployment` repo.
+
+The release manager creates a new change request by creating a PR with the details needed for the deployment such as
+an application GitHub repo and a specific branch or commit. Creation of a PR triggers a new `pipelineRun` in which the
+application source is cloned, built, and deployed to the production cluster. After the deployment succeeds, an
+acceptance test is executed against the production deployment. The acceptance tests are flaky and sometimes fail.
+Such a failure prevents updating the change request with the deployment evidence.
+
+Now, the release manager has two choices to work around this flakiness:
+
+1) Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun` which
+will start from the beginning i.e. clone, build, and deploy the same application source.
+
+2) Create an asynchronous pipeline with just the two tasks `acceptance-test` and `submit-evidence`. Trigger this `pipeline`
+manually with the data (deployment evidence and other configuration needed for the test) from the failed `pipelineRun`.
+
+![CD Use Case](images/0123-cd-use-case.png)
+
+#### CI Use Case
+
+Let's take an example of a CI `pipeline` to validate and test the changes being proposed. Tekton projects have a set of
+checks defined to test the changes from any contributor before merging them upstream. To understand this use case,
+let's assume Tekton projects have not enabled `prow` functionalities.
+
+A contributor creates a PR with the changes, and creation of the PR triggers a new `pipelineRun` in which a branch from the
+contributor's forked repo is cloned, the coverage report is generated, and a flaky unit test (just like in one of our projects)
+and other tests are executed. The flaky unit test often fails and requires a couple of retries before it
+succeeds.
+
+A Tekton contributor without access to the prow command `/test unit-test` has to close and reopen the
+PR to trigger a new `pipelineRun`. Once triggered, the `pipelineRun` runs all the tasks again, including `coverage`, `unit test`,
+`build`, and `deploy`, followed by other tests, all of which had already run successfully in the first attempt.
+
+![CI Use Case](images/0123-ci-use-case.png)
+
+
+## References
+
+* [How can I pause the running task and later resume it or retry the failed task even if it exceeds the Retry times?](https://github.com/tektoncd/pipeline/issues/5348)
+* [Add pending/pause task in pipeline](https://github.com/tektoncd/pipeline/issues/3796)
+* [TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns](https://github.com/tektoncd/community/pull/203)
+* [How to retry only failed tests in the CI job run on Gitlab?](https://stackoverflow.com/questions/63612992/how-to-retry-only-failed-tests-in-the-ci-job-run-on-gitlab)
+* [Retrying failing jobs](https://docs.bullmq.io/guide/retrying-failing-jobs)
diff --git a/teps/README.md b/teps/README.md
@@ -286,3 +286,4 @@ This is the complete list of Tekton teps:
 |[TEP-0118](0118-matrix-with-explicit-combinations-of-parameters.md) | Matrix with Explicit Combinations of Parameters | implementable | 2022-08-08 |
 |[TEP-0119](0119-add-taskrun-template-in-pipelinerun.md) | Add taskRun template in PipelineRun | implementable | 2022-09-01 |
 |[TEP-0120](0120-canceling-concurrent-pipelineruns.md) | Canceling Concurrent PipelineRuns | proposed | 2022-08-19 |
+|[TEP-0123](0123-specify-on-demand-retry-in-pipelinetask.md) | Specifying on-demand-retry in a PipelineTask | proposed | 2022-09-16 |
diff --git a/teps/images/0123-cd-use-case.png b/teps/images/0123-cd-use-case.png
diff --git a/teps/images/0123-ci-use-case.png b/teps/images/0123-ci-use-case.png