Commit
[TEP-0123] proposal to specify on-demand-retry in a pipelineTask
Adds a proposal to help pipeline authors overcome flaky failures by allowing the same instance of the pipeline to keep running until it finishes. This TEP proposes a mechanism that lets users choose when to retry a failed `pipelineTask` of a pipeline.
1 parent 5630eaf, commit 9a5a535
Showing 4 changed files with 120 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
--- | ||
status: proposed | ||
title: Specify on-demand-retry in a PipelineTask | ||
creation-date: '2022-09-16' | ||
last-updated: '2022-09-16' | ||
authors: | ||
- '@pritidesai' | ||
--- | ||
|
||
# TEP-0123: Specifying on-demand-retry in a PipelineTask | ||
|
||
|
||
<!-- toc --> | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
- [Goals](#goals) | ||
- [Non-Goals](#non-goals) | ||
- [Use Cases](#use-cases) | ||
- [CD Use Case](#cd-use-case) | ||
- [CI Use Case](#ci-use-case) | ||
- [References](#references) | ||
<!-- /toc --> | ||
|
||
## Summary | ||
|
||
This TEP proposes a mechanism that lets users choose when to retry a failed `pipelineTask` of a pipeline. | |
An on-demand retry of a `pipelineTask` is executed from scratch with the original `params` and `specifications`. | |
On-demand retry of a `pipelineTask` allows users to keep retrying until the task succeeds. After the offending task | |
succeeds, the rest of the `pipeline` continues executing as usual until completion. | |
|
||
|
||
## Motivation | ||
|
||
The primary motivation of this proposal is to help pipeline authors overcome flaky failures by letting them keep running | |
the same instance of the pipeline until it finishes. | |
|
||
### Goals | ||
|
||
* Provide a mechanism to enable on-demand-retry of a `pipelineTask` in a `pipeline`. | ||
* Provide a mechanism to signal a failed `taskRun` to start executing with the original specifications. This mechanism | ||
will only be supported for a `pipelineTask` which opts in for on-demand-retry. | ||
* The resolved specifications (`params`, the resolved results from a parent task, `taskRef` or `taskSpec`, etc.) cannot | |
be updated during such a retry. | |
* A pipeline author can specify more than one on-demand-retry `pipelineTask`, at any level in the `DAG`, including | |
the root node and leaf nodes. | |
* If an on-demand-retry `pipelineTask` fails again after a `taskRun` has been signalled to start executing, it goes back into the same | |
on-demand-retry mode. Users have the option to retry such a task again. | |
* If an on-demand-retry `pipelineTask` does not succeed before the pipeline hits its timeout, the failed tasks will | |
eventually be canceled. | |
* Until all on-demand-retry `pipelineTasks` succeed and the rest of the `pipelineTasks` finish executing, the `pipelineRun` | |
status stays as `running`. | |
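The behavior listed in the goals above can be pictured with a small simulation. This is only an illustrative sketch, not part of the proposal and not Tekton code: the class `OnDemandRetryTask`, the functions `on_timeout` and `pipeline_run_status`, the state names other than `running`, and the final `cancelled` run status are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class OnDemandRetryTask:
    """Hypothetical model of a pipelineTask that opted in to on-demand-retry."""
    name: str
    params: Dict[str, str]                 # resolved once; never updated on retry
    run: Callable[[Dict[str, str]], bool]  # returns True when the task succeeds
    state: str = "pending"                 # pending | awaiting-retry | succeeded | cancelled
    attempts: int = 0

    def execute(self) -> None:
        """Run, or re-run on demand, from scratch with the original params."""
        if self.state in ("succeeded", "cancelled"):
            return  # nothing left to retry
        self.attempts += 1
        ok = self.run(dict(self.params))   # pass a copy so params cannot drift
        self.state = "succeeded" if ok else "awaiting-retry"


def on_timeout(tasks: List[OnDemandRetryTask]) -> None:
    """When the pipeline timeout hits, tasks still awaiting a retry are cancelled."""
    for task in tasks:
        if task.state == "awaiting-retry":
            task.state = "cancelled"


def pipeline_run_status(tasks: List[OnDemandRetryTask]) -> str:
    """The run stays 'running' until every task succeeds or one is cancelled."""
    if all(t.state == "succeeded" for t in tasks):
        return "succeeded"
    if any(t.state == "cancelled" for t in tasks):
        return "cancelled"  # assumption: the TEP does not name this final status
    return "running"


# A flaky test that fails twice and then passes.
outcomes = iter([False, False, True])
flaky = OnDemandRetryTask("acceptance-test", {"target": "production"},
                          lambda params: next(outcomes))

flaky.execute()                                             # first attempt fails
status_after_first_failure = pipeline_run_status([flaky])   # run is still "running"
flaky.execute()                                             # on-demand retry, fails again
flaky.execute()                                             # on-demand retry, succeeds
final_status = pipeline_run_status([flaky])
```

In this sketch a failed task never moves the run to a failed state by itself; it only leaves the `running` state once every task has succeeded or the timeout has cancelled the tasks that were still waiting for a retry.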
|
||
|
||
### Non-Goals | ||
|
||
* Pause/Resume: | ||
* Support pausing a running `taskRun` as that would require pausing the `pod` itself. | ||
* Support suspending a running `taskRun` for some manual intervention such as approval before resuming the same `taskRun` | ||
and the rest of the `pipeline`. | ||
* Support pausing a running `taskRun` until a condition is met. | ||
* Ignoring a task failure and continuing to run the rest of the `pipeline`, which is proposed in [TEP-0050](0050-ignore-task-failures.md). | |
* Partial pipeline execution - [TEP-0077](https://github.com/tektoncd/community/pull/484) (not merged) and [tektoncd/pipeline issue#50](https://github.com/tektoncd/pipeline/issues/50) | ||
* Create a separate `pipelineRun` with an option of choosing different pipeline params. | ||
* Retry failed tasks on demand in a pipeline - [TEP-0065](https://github.com/tektoncd/community/pull/422) (not merged) | ||
* Even though the title sounds almost the same, the focus of this TEP is on declaring a `pipelineTask` with on-demand-retry | |
as part of the `pipeline` specification rather than relying on manual intervention at runtime to decide whether | |
to rerun or not. | |
* [Disabling a task in a pipeline](https://docs.google.com/document/d/1rleshixafJy4n1CwFlfbAJuZjBL1PQSm3b0Q9s1B_T8/edit#heading=h.jz9jia3av6h1) | |
* Disabling a task in a pipeline is driven by runtime configuration rather than being part of the `pipeline` specification itself. | |
* Updating the existing `retry` functionality. | |
|
||
### Use Cases | ||
|
||
#### CD Use Case | ||
|
||
Let's take an example of a CD `pipeline` that deploys an application to production. The CD `pipeline` is configured to | |
trigger a new run for every new change request, i.e. a PR created in the `application-deployment` repo. | |
|
||
The release manager creates a new change request by creating a PR with the details needed for the deployment, such as | |
an application GitHub repo and a specific branch or commit. Creation of the PR triggers a new `pipelineRun` in which the | |
application source is cloned, built, and deployed to the production cluster. After the deployment succeeds, an | |
acceptance test is executed against the production deployment. The acceptance test is flaky and sometimes fails. | |
Such a failure prevents updating the change request with the deployment evidence. | |
|
||
Now, the release manager has two choices to work around this flakiness: | ||
|
||
1) Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun`, which | |
will start from the beginning, i.e. clone, build, and deploy the same application source. | |
|
||
2) Create an asynchronous pipeline with just the two tasks `acceptance-test` and `submit-evidence`. Trigger this `pipeline` | ||
manually with the data (deployment evidence and other configuration needed for the test) from the failed `pipelineRun`. | ||
|
||
![CD Use Case](images/0123-cd-use-case.png) | ||
|
||
#### CI Use Case | ||
|
||
Let's take an example of a CI `pipeline` that validates and tests proposed changes. Tekton projects have a set of | |
checks defined to test the changes from any contributor before merging them upstream. To understand this use case, | |
let's assume Tekton projects have not enabled `prow` functionality. | |
|
||
A contributor creates a PR with the changes, and creation of the PR triggers a new `pipelineRun` in which a branch from the | |
contributor's forked repo is cloned, the coverage report is generated, and a flaky unit test (just like one of our projects) | |
and other tests are executed. The flaky unit test often fails and requires a couple of retries before it | |
succeeds. | |
|
||
A Tekton contributor without access to the prow command `/test unit-test` has to close and reopen the | |
PR to trigger a new `pipelineRun`. Once triggered, the `pipelineRun` runs all the tasks again, including `coverage`, `unit test`, | |
`build`, and `deploy`, followed by the other tests, all of which already ran successfully in the first attempt. | |
|
||
![CI Use Case](images/0123-ci-use-case.png) | ||
|
||
|
||
## References | ||
|
||
* [How can I pause the running task and later resume it or retry the failed task even if it exceeds the Retry times?](https://github.com/tektoncd/pipeline/issues/5348) | ||
* [Add pending/pause task in pipeline](https://github.com/tektoncd/pipeline/issues/3796) | ||
* [TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns](https://github.com/tektoncd/community/pull/203) | ||
* [How to retry only failed tests in the CI job run on Gitlab?](https://stackoverflow.com/questions/63612992/how-to-retry-only-failed-tests-in-the-ci-job-run-on-gitlab) | ||
* [Retrying failing jobs](https://docs.bullmq.io/guide/retrying-failing-jobs) |