[TEP-0123] proposal to specify on-demand-retry in a pipelineTask
Adding a proposal to support pipeline authors to overcome any flakes and to
allow running the same instance of the pipeline until it finishes. This TEP
proposes a mechanism to allow the users to choose when to retry a failed
`pipelineTask` of a pipeline.
pritidesai committed Sep 17, 2022
1 parent 5630eaf commit 9a5a535
Showing 4 changed files with 120 additions and 0 deletions.
119 changes: 119 additions & 0 deletions teps/0123-specify-on-demand-retry-in-pipelinetask.md
@@ -0,0 +1,119 @@
---
status: proposed
title: Specify on-demand-retry in a PipelineTask
creation-date: '2022-09-16'
last-updated: '2022-09-16'
authors:
- '@pritidesai'
---

# TEP-0123: Specifying on-demand-retry in a PipelineTask


<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
  - [Use Cases](#use-cases)
    - [CD Use Case](#cd-use-case)
    - [CI Use Case](#ci-use-case)
- [References](#references)
<!-- /toc -->

## Summary

This TEP proposes a mechanism that lets users choose when to retry a failed `pipelineTask` of a pipeline.
An on-demand retry executes the `pipelineTask` from scratch with its original `params` and specifications.
Users can keep retrying the `pipelineTask` until it succeeds. After the offending task
succeeds, the rest of the `pipeline` continues executing as usual until completion.


## Motivation

The primary motivation of this proposal is to help pipeline authors overcome flakes and to let
the same instance of the pipeline keep running until it finishes.

### Goals

* Provide a mechanism to enable on-demand-retry of a `pipelineTask` in a `pipeline`.
* Provide a mechanism to signal a failed `taskRun` to start executing again with the original specifications. This mechanism
will only be supported for a `pipelineTask` which opts in to on-demand-retry.
* The resolved specifications (`params`, the resolved results from a parent task, `taskRef` or `taskSpec`, etc.) cannot
be updated during such a retry.
* A pipeline author can specify more than one on-demand-retry `pipelineTask`, at any level in the `DAG`, including
the root and leaf nodes.
* If an on-demand-retry `pipelineTask` fails again after its `taskRun` is signalled to start executing, it goes back into the same
on-demand-retry mode, and users have the option to retry it again.
* If an on-demand-retry `pipelineTask` does not succeed before the pipeline hits its timeout, the failed tasks are
eventually canceled.
* Until all on-demand-retry `pipelineTasks` succeed and the rest of the `pipelineTasks` finish executing, the `pipelineRun`
status stays as `running`.
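
The lifecycle these goals describe can be read as a small state machine. The following Python sketch is only an illustration under stated assumptions: the class, the state names (in particular `AWAITING_RETRY`), and the helper `is_pipeline_run_running` are hypothetical and are not part of this proposal or of any Tekton API.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    AWAITING_RETRY = "awaiting-retry"  # hypothetical name for the on-demand-retry mode
    CANCELED = "canceled"


@dataclass
class OnDemandRetryTask:
    """Hypothetical model of a pipelineTask that opted in to on-demand-retry."""

    name: str
    params: dict  # resolved once and never updated on retry
    state: TaskState = TaskState.RUNNING
    attempts: int = 1

    def finish(self, succeeded: bool) -> None:
        # A failure does not end the task; it parks it until the user retries.
        if self.state is not TaskState.RUNNING:
            raise RuntimeError(f"{self.name} is not running")
        self.state = TaskState.SUCCEEDED if succeeded else TaskState.AWAITING_RETRY

    def retry(self) -> None:
        # The retry signal restarts the task from scratch with the same params.
        if self.state is not TaskState.AWAITING_RETRY:
            raise RuntimeError(f"{self.name} is not awaiting a retry")
        self.attempts += 1
        self.state = TaskState.RUNNING

    def on_pipeline_timeout(self) -> None:
        # Tasks that have not succeeded by the pipeline timeout are canceled.
        if self.state is not TaskState.SUCCEEDED:
            self.state = TaskState.CANCELED


def is_pipeline_run_running(tasks) -> bool:
    """True while any task is still executing or waiting for a retry signal."""
    return any(
        t.state in (TaskState.RUNNING, TaskState.AWAITING_RETRY) for t in tasks
    )
```

In this sketch a failed task never moves to a terminal failure state on its own; only success or the pipeline timeout ends it, which is why the modeled `pipelineRun` keeps reporting `running` in the meantime.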


### Non-Goals

* Pause/Resume:
  * Support pausing a running `taskRun`, as that would require pausing the `pod` itself.
  * Support suspending a running `taskRun` for some manual intervention, such as approval, before resuming the same `taskRun`
    and the rest of the `pipeline`.
  * Support pausing a running `taskRun` until a condition is met.
* Ignoring a task failure and continuing to run the rest of the `pipeline`, which is proposed in [TEP-0050](0050-ignore-task-failures.md).
* Partial pipeline execution - [TEP-0077](https://github.com/tektoncd/community/pull/484) (not merged) and [tektoncd/pipeline issue#50](https://github.com/tektoncd/pipeline/issues/50)
  * Create a separate `pipelineRun` with an option of choosing different pipeline params.
* Retry failed tasks on demand in a pipeline - [TEP-0065](https://github.com/tektoncd/community/pull/422) (not merged)
  * Even though the title sounds almost the same, the focus of this TEP is to declare a `pipelineTask` with an on-demand-retry
    as part of the `pipeline` specification rather than relying on any manual intervention during runtime to decide whether
    to rerun or not.
* [Disabling a task in a pipeline](https://docs.google.com/document/d/1rleshixafJy4n1CwFlfbAJuZjBL1PQSm3b0Q9s1B_T8/edit#heading=h.jz9jia3av6h1)
  * Disabling a task in a pipeline is driven by runtime configuration rather than being part of the `pipeline` specification itself.
* Update existing `retry` functionality.

### Use Cases

#### CD Use Case

Let's take an example of a CD `pipeline` to deploy an application to production. The CD `pipeline` is configured to
trigger a new run for every new change request i.e. a PR created in the `application-deployment` repo.

The release manager creates a new change request by creating a PR with the details needed for the deployment, such as
the application GitHub repo and a specific branch or commit. Creation of the PR triggers a new `pipelineRun` in which the
application source is cloned, built, and deployed to the production cluster. After the deployment succeeds, an
acceptance test is executed against the production deployment. The acceptance tests are flaky and sometimes fail.
Such a failure prevents updating the change request with the deployment evidence.

Now, the release manager has two choices to work around this flakiness:

1) Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun` which
will start from the beginning i.e. clone, build, and deploy the same application source.

2) Create an asynchronous pipeline with just the two tasks `acceptance-test` and `submit-evidence`. Trigger this `pipeline`
manually with the data (deployment evidence and other configuration needed for the test) from the failed `pipelineRun`.

![CD Use Case](images/0123-cd-use-case.png)
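
To make the cost of the first workaround concrete, here is a minimal Python sketch that compares which tasks have to execute after `acceptance-test` fails. It assumes a strictly linear pipeline; the task names `clone`, `build`, and `deploy` and the helper `remaining_tasks` are illustrative choices, not names defined by this TEP.

```python
# Task order of the CD pipeline described above (names are illustrative).
CD_PIPELINE = ["clone", "build", "deploy", "acceptance-test", "submit-evidence"]


def remaining_tasks(pipeline, failed_task, on_demand_retry):
    """List the tasks that must execute after `failed_task` fails.

    Assumes a linear pipeline in which every task before `failed_task`
    has already succeeded.
    """
    if on_demand_retry:
        # Only the failed task and the tasks that never started need to run.
        return pipeline[pipeline.index(failed_task):]
    # A brand-new pipelineRun starts again from the beginning.
    return list(pipeline)


if __name__ == "__main__":
    print(remaining_tasks(CD_PIPELINE, "acceptance-test", on_demand_retry=False))
    print(remaining_tasks(CD_PIPELINE, "acceptance-test", on_demand_retry=True))
```

With a new PR, all five tasks run again, including a second deployment to production. With an on-demand retry in the same `pipelineRun`, only `acceptance-test` and `submit-evidence` would run.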

#### CI Use Case

Let's take an example of a CI `pipeline` to validate and test the changes being proposed. Tekton projects have a set of
checks defined to test the changes from any contributor before merging them upstream. To understand this use case,
let's assume Tekton projects have not enabled `prow` functionalities.

A contributor creates a PR with the changes, and creation of the PR triggers a new `pipelineRun` in which a branch from the
contributor's forked repo is cloned, the coverage report is generated, and a flaky unit test (just like in one of our projects)
and other tests are executed. The flaky unit test often fails and requires a couple of retries before it
succeeds.

A Tekton contributor without access to the prow command `/test unit-test` has to close and reopen the
PR to trigger a new `pipelineRun`. Once triggered, the `pipelineRun` reruns all the tasks, including `coverage`, `unit test`,
`build`, and `deploy`, followed by the other tests, all of which had already succeeded on the first attempt.

![CI Use Case](images/0123-ci-use-case.png)


## References

* [How can I pause the running task and later resume it or retry the failed task even if it exceeds the Retry times?](https://github.com/tektoncd/pipeline/issues/5348)
* [Add pending/pause task in pipeline](https://github.com/tektoncd/pipeline/issues/3796)
* [TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns](https://github.com/tektoncd/community/pull/203)
* [How to retry only failed tests in the CI job run on Gitlab?](https://stackoverflow.com/questions/63612992/how-to-retry-only-failed-tests-in-the-ci-job-run-on-gitlab)
* [Retrying failing jobs](https://docs.bullmq.io/guide/retrying-failing-jobs)
1 change: 1 addition & 0 deletions teps/README.md
@@ -286,3 +286,4 @@ This is the complete list of Tekton teps:
|[TEP-0118](0118-matrix-with-explicit-combinations-of-parameters.md) | Matrix with Explicit Combinations of Parameters | implementable | 2022-08-08 |
|[TEP-0119](0119-add-taskrun-template-in-pipelinerun.md) | Add taskRun template in PipelineRun | implementable | 2022-09-01 |
|[TEP-0120](0120-canceling-concurrent-pipelineruns.md) | Canceling Concurrent PipelineRuns | proposed | 2022-08-19 |
|[TEP-0123](0123-specify-on-demand-retry-in-pipelinetask.md) | Specifying on-demand-retry in a PipelineTask | proposed | 2022-09-16 |
Binary file added teps/images/0123-cd-use-case.png
Binary file added teps/images/0123-ci-use-case.png
