status | title | creation-date | last-updated | authors
---|---|---|---|---
proposed | Specifying on-demand-retry in a PipelineTask | 2022-09-16 | 2022-09-16 |

This TEP proposes a mechanism to allow users to choose when to retry a failed `pipelineTask` of a `pipeline`. This kind of `on-demand` retry of a `pipelineTask` is executed with the original `params` and `specifications`. The `on-demand` retry of a `pipelineTask` allows users to keep retrying until the task succeeds. After the offending task succeeds, the rest of the `pipeline` continues executing as usual until completion.

The primary motivation of this proposal is to let pipeline authors overcome flakes by re-running only the flaky tasks, rather than the whole pipeline, and then continuing execution until the pipeline finishes. A further motivation is to let pipeline authors specify when to re-run such flakes.

The Tekton controller does support `retries`, in which a pipeline author can specify the number of times a task can be retried when it fails. The `retries` are designed such that once a task fails, it is retried immediately. Other CI/CD systems support additional retry strategies, such as:

- Argo Workflows offers a variety of retry policies. One of them is back-off, in which a user can configure the delay between retries.
- A GitHub Action supports retrying an action and allows setting a delay between attempts through `attempt_delay`.

These additional retry strategies allow users to specify a reasonable delay during which a transient failure can be fixed.
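
The difference between an immediate retry and the delayed strategies described above can be illustrated with a minimal sketch. The function name `retry_delays`, its parameters, and the strategy labels are hypothetical and chosen for illustration only; they do not correspond to any Tekton, Argo Workflows, or GitHub Actions API.

```python
from typing import List


def retry_delays(strategy: str, retries: int,
                 base_delay: float = 0.0, factor: float = 2.0) -> List[float]:
    """Return the wait (in seconds) before each retry attempt.

    Hypothetical helper for illustration only; not part of any real CI/CD API.
    """
    if retries < 0:
        raise ValueError("retries must be non-negative")
    if strategy == "immediate":
        # Retry as soon as the task fails: no wait between attempts.
        return [0.0] * retries
    if strategy == "fixed":
        # Wait the same configured delay before every attempt.
        return [float(base_delay)] * retries
    if strategy == "backoff":
        # Multiply the wait by `factor` after every failed attempt.
        return [base_delay * factor ** attempt for attempt in range(retries)]
    raise ValueError(f"unknown strategy: {strategy}")
```

With three retries, the immediate strategy leaves no time for a transient failure to be fixed, while a back-off with a 10-second base delay and a factor of 2 waits 10, 20, and then 40 seconds. None of these strategies lets a user decide when the retry happens, which is the gap this proposal addresses.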
Jenkins takes `retry` one step further: a user can specify `conditions` under which a task/stage must be retried, for example when a stage fails because of an underlying infrastructure issue rather than the execution of the `script`. A Jenkins stage author can specify the `agent` condition, which allows a retry once the connection to an agent is fixed.

Jenkins declarative pipelines support restarting stages with the same parameters and actions, as documented in restarting stages:

> When long-running Pipelines fail intermittently for environmental purposes the developer must be able to restart the execution of the stage that failed within the Pipeline. This allows the developer to recoup time lost running the pipeline to the point of failure.

The goals of this proposal are:

- Provide a mechanism to enable on-demand-retry of a `pipelineTask` in a `pipeline`.
- Provide a mechanism to signal a failed `taskRun` to start executing with the original specifications. This mechanism will only be supported for a `pipelineTask` which opts in for on-demand-retry.

The following are not goals of this proposal:

- Pause/Resume:
  - Support pausing a running `taskRun`, as that would require pausing the `pod` itself.
  - Support suspending a running `taskRun` for some manual intervention, such as approval, before resuming the same `taskRun` and the rest of the `pipeline`.
  - Support pausing a running `taskRun` until a condition is met.
- Ignoring a task failure and continuing to run the rest of the `pipeline`, which is proposed in TEP-0050.
- Partial pipeline execution - TEP-0077 (not merged) and tektoncd/pipeline issue#50
  - Create a separate `pipelineRun` with an option of choosing different pipeline params.
- Retry failed tasks on demand in a pipeline - TEP-0065 (not merged)
  - Even though the title sounds almost the same, the focus of this TEP is to declare a `pipelineTask` with on-demand-retry as part of the `pipeline` specification rather than relying on manual intervention at runtime to decide whether to rerun or not.
- Disabling a task in a pipeline
  - Disabling a task in a pipeline is driven by runtime configuration rather than being part of the `pipeline` specification itself.
- Update existing `retry` functionality.

The proposal must satisfy the following requirements:

- The resolved specifications (`params`, the resolved results from a parent task, `taskRef` or `taskSpec`, etc.) of an `on-demand-retry` pipelineTask cannot be updated during such a retry.
- A pipeline author can specify one or more on-demand-retry `pipelineTasks` in a pipeline.
- A pipeline author can choose any task from the `tasks` section or any `finally` task for on-demand-retry.
- When an on-demand-retry `pipelineTask` fails again after a retry, it goes back into the same on-demand-retry mode. Users have the option to retry such a task again.
- If an on-demand-retry `pipelineTask` does not succeed before the pipeline hits its timeout, the failed tasks are eventually canceled.
- Until all on-demand-retry `pipelineTasks` succeed and the rest of the pipelineTasks finish executing, the `pipelineRun` status stays as `running`.
- In a `pipeline` with multiple branches, a new `pipelineTask` from an independent branch can be scheduled while an `on-demand-retry` task is waiting to be restarted. For example, if the following pipeline has `B` defined as `on-demand-retry` and `B` fails while `D` is running, the `pipeline` continues to schedule `E` and `F` while waiting on user input for `B`. If there is a `finally` section in the pipeline, it will wait until `B` is retried and succeeds. The user has an option to stop retrying and execute finally by [gracefully cancelling a pipelineRun](How can the user request to stop retrying and execute finally).

┌─────────────────────────────────────┐   ┌────────────┐
│  Main                               │   │  Finally   │
│  ┌───┐    ┌───┐                     │   │            │
│  │   │    │   │                     │   │            │
│  │ A ├────┤ B   (failed)            │   │            │
│  │   │    │ - │                     │   │            │
│  └───┘    └───┘                     │   │   ┌───┐    │
│                                     │   │   │   │    │
│                                     │   │   │ G │    │
│  ┌───┐    ┌───┐    ┌───┐    ┌───┐   │   │   │   │    │
│  │   │    │   │    │   │    │   │   │   │   └───┘    │
│  │ C ├────┤ D ├────┤ E ├────┤ F │   │   │            │
│  │   │    │   │    │   │    │   │   │   │            │
│  └───┘    └───┘    └───┘    └───┘   │   │            │
│                                     │   │            │
└─────────────────────────────────────┘   └────────────┘
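
The scheduling behaviour required above can be sketched as a small state model. This is only an illustration under stated assumptions: the class `PipelineSketch`, its methods, and the state names (in particular `awaiting-retry`) are hypothetical and are not part of the Tekton API. The TEP does not prescribe how the controller tracks these states, and the pipeline timeout is not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical state names used only by this sketch.
PENDING = "pending"
RUNNING = "running"
SUCCEEDED = "succeeded"
FAILED = "failed"
AWAITING_RETRY = "awaiting-retry"


@dataclass
class PipelineSketch:
    deps: Dict[str, List[str]]   # main-section task -> tasks it depends on
    finally_tasks: List[str]     # tasks in the finally section
    on_demand_retry: Set[str]    # tasks that opted in to on-demand-retry
    state: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for name in list(self.deps) + list(self.finally_tasks):
            self.state.setdefault(name, PENDING)

    def schedulable(self) -> List[str]:
        """Tasks that may start now."""
        ready = [
            task for task, parents in self.deps.items()
            if self.state[task] == PENDING
            and all(self.state[p] == SUCCEEDED for p in parents)
        ]
        # finally waits while any main task is unfinished or awaiting a retry.
        main_done = all(self.state[t] in (SUCCEEDED, FAILED) for t in self.deps)
        if main_done:
            ready += [t for t in self.finally_tasks if self.state[t] == PENDING]
        return ready

    def start(self, task: str) -> None:
        if task not in self.schedulable():
            raise ValueError(f"{task} is not schedulable")
        self.state[task] = RUNNING

    def finish(self, task: str, ok: bool) -> None:
        if self.state[task] != RUNNING:
            raise ValueError(f"{task} is not running")
        if ok:
            self.state[task] = SUCCEEDED
        elif task in self.on_demand_retry:
            # A failure (first or repeated) parks the task until the user retries.
            self.state[task] = AWAITING_RETRY
        else:
            self.state[task] = FAILED

    def retry(self, task: str) -> None:
        """User signal: re-run the task with its original, unchanged specification."""
        if self.state[task] != AWAITING_RETRY:
            raise ValueError(f"{task} is not awaiting a retry")
        self.state[task] = RUNNING

    def run_status(self) -> str:
        states = list(self.state.values())
        if any(s in (RUNNING, AWAITING_RETRY) for s in states) or self.schedulable():
            return RUNNING
        return SUCCEEDED if all(s == SUCCEEDED for s in states) else FAILED
```

Applied to the pipeline in the diagram (`B` depends on `A`; `C`, `D`, `E`, `F` form a chain; `G` is the `finally` task; `B` opts in), the sketch keeps `E` and `F` schedulable while `B` is in `awaiting-retry`, reports the run as `running` throughout, and makes `G` schedulable only after `B` is retried and succeeds.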
Let's take an example of a CD `pipeline` that deploys an application to production. The CD `pipeline` is configured to trigger a new run for every new change request, i.e. a PR created in the `application-deployment` repo.

The release manager creates a new change request by creating a PR with the details needed for the deployment, such as an application GitHub repo and a specific branch or commit. Creating the PR triggers a new `pipelineRun` in which the application source is cloned, built, and deployed to the production cluster. After the deployment succeeds, acceptance tests are executed against the production deployment. The acceptance tests are flaky and sometimes fail. Such a failure prevents updating the change request with the deployment evidence.

Now, the release manager has three choices to work around this flakiness:

- Configure the `acceptance-tests` task with a reasonable number of `retries`. This might or might not work depending on the cause of the flakiness. Retrying a task as soon as it fails might address a network connectivity issue, as mentioned in the pipeline specification. If the flakiness is caused by an issue which requires a fix from the users, this kind of retry does not help.
- Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun` which will start from the beginning, i.e. clone, build, and deploy the same application source.
- Create an asynchronous pipeline with just the two tasks `acceptance-test` and `submit-evidence`. Trigger this `pipeline` manually with the data (deployment evidence and other configuration needed for the test) from the failed `pipelineRun`.

Let's take an example of a CI `pipeline` that validates and tests proposed changes. Tekton projects have a set of checks defined to test the changes from any contributor before merging them upstream. To understand this use case, let's assume Tekton projects have not enabled `prow` functionalities.

A contributor creates a PR with the changes, and the creation of the PR triggers a new `pipelineRun` in which a branch from the contributor's forked repo is cloned, the coverage report is generated, and a flaky unit test (just like one of our projects) and other tests are executed. Now, the flaky unit test often fails and requires a couple of retries before it can succeed.

A Tekton contributor without access to the prow command `/test unit-test` has to close and reopen the PR to trigger a `pipelineRun`. Once the `pipelineRun` is triggered, it runs all the tasks, including `coverage`, `unit test`, `build`, and `deploy`, followed by other tests, all of which ran successfully in the first attempt.

- How can I pause the running task and later resume it or retry the failed task even if it exceeds the Retry times?
- Add pending/pause task in pipeline
- TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns
- How to retry only failed tests in the CI job run on Gitlab?
- Retrying failing jobs
- Restart At Stage