TEP-0123 - proposal to specify on-demand-retry in a pipelineTask #823
Conversation
/assign @dibyom @XinruZhang
@lbernick: GitHub didn't allow me to assign the following users: xinruzhang. Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
acceptance test is executed against the production deployment. The acceptance tests are flaky and result in failure sometimes. This failure prevents updating the change request with the deployment evidence.

Now, the release manager has two choices to work around this flakiness:
Wouldn't another choice be to configure the known flaky task with a reasonable amount of retries? If that is not a valid choice for this scenario, could you explain why?
Yup, configuring the known flaky task with a reasonable amount of retries will work in some scenarios. In cases where the failure is the result of a momentary bandwidth bottleneck, a reasonable amount of retries might not be a feasible choice. In case of a bandwidth issue or any other infrastructure issue, the desire is to be able to retry once the issue is resolved. Also, for a cluster with 100s or 1000s of pipelineRuns of the same CD pipeline, I would want to avoid retrying all flaky taskRuns at the same time.
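For context, the automatic retry being discussed here is the existing per-`pipelineTask` `retries` field in Tekton; a minimal sketch (pipeline and task names are examples):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: cd-pipeline
spec:
  tasks:
    - name: acceptance-test
      retries: 3               # existing behavior: automatic, back-to-back retries with no delay and no user control
      taskRef:
        name: acceptance-test
    - name: submit-evidence
      runAfter:
        - acceptance-test      # only runs once acceptance-test ultimately succeeds
      taskRef:
        name: submit-evidence
```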
and other tests are executed. Now, a flaky unit test often fails and requires a couple of retries before it can succeed.

A Tekton contributor without access to the prow command `/test unit-test` has to close and reopen the
ok, this is a compelling use case that I think we'll run into for our dogfooding pipeline (assuming we have a single presubmit pipeline that runs build, unit test, coverage etc)
cc @abayer
I am not sure I understand this use case. A user that doesn't have access to the `/test …` command will not have tests running when opening a PR (it "needs" the label). If the test fails due to flakiness, "someone" (a reviewer) will look into it and run the `/test`. Ideally it will be relatively quick, so I am not sure I see a problem to solve there.
Right - where I was coming from was the assumption that the entire presubmit is one tekton pipeline and so /test or /re-test will rerun everything. Like you said below, one alternative would be to have smaller pipelines chained together
The setup we have now is one pipeline per CI job.
The main disadvantage of this approach is that we need to clone the repo for each CI job (but that is not different from what prow does today).
There are a few ways we could improve on that, one of which is to have a single pipeline, but I did not consider it really until now because of a few reasons:
- CI jobs may be provided / maintained by different teams. So the granularity at best could be one pipeline per group of jobs from the same team.
- We don't have partial pipeline execution, so a re-run would always be a re-run of the whole pipeline
I might not have done a good job of explaining this use case.
The assumption for this use case is that it's one single tekton pipeline we are talking about.
With most of the reviewers working on multiple projects, it will be beneficial to the contributors to be able to avoid such dependencies on the reviewers.
This proposal addresses re-running from a failed task. I might have to go back and update this use case to explain it better.
Thank you @pritidesai
## Motivation

The primary motivation of this proposal is to support pipeline authors to overcome any flakes and to allow running
Why is on demand retry the best way to solve flakes? Why not automated retries? Are there other alternatives?
I think one approach (imo) is to create smaller `Pipeline`s that are connected through events. That way you can easily re-run a `Pipeline(Run)` (that failed) and it would "continue" the flow.
Yup, creating smaller pipelines and connecting them through events could be an alternative.
But the disadvantages of this alternative are very painful for the users. I have a user who has deployed one pipeline for 10 different applications. If he has to split up the pipeline for each possible flake, he will end up in a management nightmare of creating additional triggers for each application and managing pieces of pipeline (which might not even make sense if the pieces contain just one single task).
As a user of the pipeline controller, it's reasonable to expect a solution to handle such failures.
I agree that smaller pipelines can be an alternative here, but as @pritidesai highlighted, it's not necessarily applicable to all users and all cases. It's a bit like the many small repos vs. large mono-repo divide - we should support users that require large pipelines for their use cases.
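A rough sketch of the "smaller pipelines connected through events" alternative discussed above, using Tekton Triggers to start the second, smaller pipeline on demand; resource names and the service account are examples, and the event-source/interceptor wiring is omitted:

```yaml
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: acceptance-and-evidence
spec:
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: acceptance-and-evidence-run-
      spec:
        pipelineRef:
          name: acceptance-and-evidence   # the smaller "second half" pipeline (acceptance-test + submit-evidence)
---
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: rerun-acceptance
spec:
  serviceAccountName: tekton-triggers-sa  # example service account with permission to create PipelineRuns
  triggers:
    - name: rerun-acceptance
      template:
        ref: acceptance-and-evidence
```

Every application would need its own split-out pipeline plus trigger wiring like this, which is the management overhead described above.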
I am not sure I understand the proposal here.
Is it:
- an "infinite" retry, i.e. retry until it succeeds
- a new PipelineRun that is created from a previous one, but will only run the "selected" tasks?

In the case of the latter, I am not sure this is something `tektoncd/pipeline` has to solve. But in any case, we may want to explore how other CI/CD tools (Jenkins, GitHub workflows, …) handle those cases, if they do at all.
1) Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun` which will start from the beginning, i.e. clone, build, and deploy the same application source.

2) Create an asynchronous pipeline with just the two tasks `acceptance-test` and `submit-evidence`. Trigger this `pipeline`
One could argue that `submit-evidence` should be executed regardless, to track the deployment and failed tests.
Having a retry on the test task would allow submitting evidence that does not include the flake, or that at least includes both runs of the tests.
thanks @vdemeester I will look into how other CI/CD systems are solving this issue. This proposal is a retry There is no new
thank you @dibyom, @vdemeester, and @afrittoli for the reviews 🙏 I have addressed some of the comments, will continue addressing the rest of the comments in a day or two. Thanks!
Hey @vdemeester I have added a small section on how other CI/CD systems such as Argo Workflow, GitHub Actions, and Jenkins handle this use case.
Hey @afrittoli, @vdemeester, and @dibyom, please take another look at the changes, thanks! 🙏
/assign @afrittoli
Thanks @pritidesai. I'd like to better understand the motivation behind restarting the same run vs. creating a new run with the same specs.
One way to experiment that I can think of is with a custom run that takes a failed pipelineRun and the task to be restarted, and starts a new pipelineRun.
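A sketch of the experiment suggested here, using Tekton's Custom Task `Run` type to delegate the restart to a hypothetical controller; the `PipelineTaskRestart` kind, its apiVersion, and the parameter names are all illustrative, not part of the proposal:

```yaml
apiVersion: tekton.dev/v1alpha1
kind: Run
metadata:
  name: retry-acceptance-test
spec:
  ref:
    apiVersion: example.dev/v1alpha1   # hypothetical custom-task controller
    kind: PipelineTaskRestart
  params:
    - name: pipelineRun
      value: cd-pipeline-run-7h2kq     # example name of the failed PipelineRun
    - name: pipelineTask
      value: acceptance-test           # the pipelineTask whose taskRun should be recreated
```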
### Non-Goals

* Pause/Resume:
  * Support pausing a running `taskRun` as that would require pausing the `pod` itself.
Not sure what this means...couldn't we keep the pod running but signal the entrypoint to stop executing steps?
Yeah, we probably don't need to specify why "Pause/resume" is not a goal.
There may be ways to achieve that functionality, but it's not what the TEP is about.
It doesn't have to be here, but I think it's worth specifying why Pause/Resume isn't a valid alternative (or at least why that isn't the chosen path for the proposal).
This proposal is about restarting a failed pipelineTask. Is "Pause/resume" limited to just failures? Wouldn't some use cases require pausing a pipeline even without any failures?
These additional retry strategies allow users to specify a reasonable delay during which a transient failure can be fixed.

Jenkins takes `retry` one step further where a user can specify [conditions](https://www.jenkins.io/doc/pipeline/steps/workflow-basic-steps/#retry-retry-the-body-up-to-n-times)
Interesting... but it seems like the retry logic is still not on demand though... Jenkins is still responsible for retrying if it thinks there is a connection issue to the agent.
Right, the retry logic is not on demand. Jenkins is responsible for retrying in case of a connection issue.
But Jenkins does support catching failures and further processing somehow:

Retry the block (up to N times) if any exception happens during its body execution. If an exception happens on the final attempt then it will lead to aborting the build (unless it is caught and processed somehow). User aborts of the build are not caught.

I have not looked into it further though.
Also, Jenkins supports `conditions`, which is very advanced in terms of specifying what condition triggers the retry. Jenkins retry logic understands the difference between an infrastructure failure (the connection to the agent itself is broken) vs. a build failure and allows the pipeline author to choose to retry in case of an infrastructure failure. Jenkins retry logic can also distinguish between pod termination vs. pod deletion.

conditions (optional): Conditions under which the block should be retried. If none match, the block will fail. If there are no specified conditions, the block will always be retried except in case of user aborts.
- agent: Detects that a node block, or certain steps inside it such as sh, failed for reasons which are likely due to infrastructure rather than the behavior of the build. If the connection to an agent is broken or the agent is removed from the list of executors while in use (typically in response to the disappearance of underlying cloud resources), this condition will allow retry to allocate a fresh agent and try the whole block again.
- kubernetesAgent: Similar to agent() (Agent errors) but tailored to agents provisioned from a Kubernetes cloud. Unlike the generic agent error condition, this will ignore certain pod termination reasons which are likely to be under the control of the Pipeline author (e.g., OOMKilled) while still allowing retry to recover after common cases of pod deletion.
  - handleNonKubernetes (boolean, optional): Behave like the generic agent() (Agent errors) when applied to a non-Kubernetes agent. Useful in cases where it is hard to predict in a job definition whether a Kubernetes or other sort of agent will be used.
- nonresumable: The Jenkins controller was restarted while the build was running a step which cannot be resumed. Some steps like sh or input are written to survive a Jenkins restart and simply pick up where they left off when the build resumes. Others like checkout or junit normally complete promptly but cannot tolerate a restart. In case one of these steps happened to be in progress when Jenkins shut down, the resumed build will throw an error; using this condition with retry allows the step (or the whole enclosing node block) to be rerun.

What we are trying to build in this PR addresses a limitation in Jenkins that users are experiencing (https://issues.jenkins.io/browse/JENKINS-54735), and something on their roadmap.
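For comparison, the delay-based retry strategies mentioned in the quoted TEP text map to something like the following in Argo Workflows; this is a sketch using Argo's documented `retryStrategy`/`backoff` fields, with an example image and test command:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: acceptance-test-
spec:
  entrypoint: acceptance-test
  templates:
    - name: acceptance-test
      retryStrategy:
        limit: "3"               # at most three retries
        retryPolicy: OnFailure   # retry only when the step fails
        backoff:
          duration: "1m"         # wait before the first retry
          factor: "2"            # multiply the delay on each subsequent retry
          maxDuration: "30m"     # stop retrying after this much elapsed time
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["./run-acceptance-tests.sh"]
```

This is still automatic rather than on demand, which is the gap this TEP targets.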
rather than the execution of the `script`. A Jenkins stage author can specify a condition `agent` which will allow retry once the connection to an agent is fixed.

Jenkins declarative pipeline supports restarting stages with the same parameters and actions as documented in
I found some docs here: https://www.jenkins.io/doc/book/pipeline/running-pipelines/#restart-from-a-stage
Some takeaways for me:
You can restart any completed Declarative Pipeline from any top-level stage which ran in that Pipeline. This allows you to rerun a Pipeline from a stage which failed due to transient or environmental considerations, for example. All inputs to the Pipeline will be the same. This includes SCM information, build parameters, and the contents of any stash step calls in the original Pipeline, if specified.
I think this bit about the stash is important. We don't have anything similar. Also later in the docs:
Normally, when you run the stash step in your Pipeline, the resulting stash of artifacts is cleared when the Pipeline completes, regardless of the result of the Pipeline. Since stash artifacts aren’t accessible outside of the Pipeline run that created them, this has not created any limitations on usage. But with Declarative stage restarting, you may want to be able to unstash artifacts from a stage which ran before the stage you’re restarting from.
So, it seems like Jenkins preserves the artifacts that may have been created in a previous stage and makes it optionally accessible to another build/pipeline.
Once you choose a stage to restart from and click submit, a new build, with a new build number, will be started. All stages before the selected stage will be skipped, and the Pipeline will start executing at the selected stage. From that point on, the Pipeline will run as normal.
This TEP does not consider the partial execution of a pipeline that reached completion, whether successful or not. The key assumption for the retries in this TEP is that the pipeline is still running.
If a `PipelineTask` marked for "on-demand-retry" fails, the execution of the `Pipeline` continues until either of the following happens:
- the `PipelineTask` is successful, and the associated branch of execution is resumed (this may happen if a retry is requested and is successful)
- the timeout is reached
- the `PipelineRun` is cancelled
Reading through this section again, it seems like other CI systems provide a way to configure a delay before further retries. Jenkins allows users to specify conditional logic before retrying. Jenkins Declarative pipelines do support restarting stages on demand, but like you pointed out, that does not keep the current pipeline running.
So, I think the current proposal is a bit unique in that respect. I don't think this by itself should mean we should not pursue on-demand retries, just that we should consider the alternative solutions for the use cases as well before implementing this.
The clear distinction here is "You can restart any completed Declarative Pipeline"; this proposal is not applicable if the pipeline is completed/failed/cancelled.
Like @afrittoli explained this proposal is only applicable until one of those three criteria is met.
thanks @dibyom, please elaborate on this.

Regarding "other CI systems provide a way to configure a delay before further retries": the delay applies to retries. If a stage fails and is asked to retry, a delay can be introduced to schedule the stage for execution after that interval.

Regarding "Jenkins allows users to specify conditional logic before retrying": conditional logic is part of `retry`. Jenkins supports `try-catch` in addition to `retry`, which is much more robust when used along with `retry`.

Regarding "we should consider the alternative solutions for the use cases as well before implementing this": what are the alternative solutions here apart from a custom task? This is a very common use case, and if we decide to support it, it must live as a first-class solution rather than a custom task.
### Goals

* Provide a mechanism to enable on-demand-retry of a `pipelineTask` in a `pipeline`.
* Provide a mechanism to signal a failed `taskRun` to start executing with the original specifications. This mechanism
Why restart a taskRun and not just create a new taskRun?
It seems to me that this belongs more to the design details than to the goals.
The goal is to allow for on-demand retries of a `PipelineTask`:
- if a task fails, the `PipelineRun` execution does not end
- execution of `PipelineTasks` that do not depend on the failed one continues normally
- one or more retries of the failed `PipelineTask` may be requested, until either
  - the `PipelineTask` is successful
  - the `PipelineRun` times out
  - the `PipelineRun` is cancelled
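To make that goal concrete, here is a purely illustrative sketch of what opting a `pipelineTask` into on-demand retry might look like; the `onDemandRetry` field name is hypothetical and is not part of the proposal text quoted in this thread:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: ci-pipeline
spec:
  tasks:
    - name: unit-test
      onDemandRetry: true      # hypothetical field: on failure, hold this branch and wait for a user-requested retry
      taskRef:
        name: unit-test
    - name: lint
      taskRef:
        name: lint             # independent of unit-test, so it keeps running even if unit-test fails
    - name: coverage
      runAfter:
        - unit-test            # blocked until unit-test eventually succeeds (or the run times out / is cancelled)
      taskRef:
        name: coverage
```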
* Support pausing a running `taskRun` until a condition is met.
* Ignoring a task failure and continuing to run the rest of the `pipeline`, which is proposed in [TEP-0050](0050-ignore-task-failures.md).
* Partial pipeline execution - [TEP-0077](https://github.com/tektoncd/community/pull/484) (not merged) and [tektoncd/pipeline issue#50](https://github.com/tektoncd/pipeline/issues/50)
* Create a separate `pipelineRun` with an option of choosing different pipeline params.
Why is this a non-goal? It seems like it would be an alternative at least?
The scope of this TEP is to retry a task in a running pipeline run.
We could consider partial re-execution of a finished pipeline run, possibly with different inputs, but that would be a different TEP. Partial execution after completion is a whole different story as we would need to make sure we are able to recreate the required inputs at a certain point of execution.
[specification](https://github.com/tektoncd/pipeline/blob/main/docs/pipelines.md#using-the-retries-field). If the flakiness is caused by an issue which requires a fix from the users, this kind of retry does not help.

2) Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun` which
I don't think close/re-open are the only alternatives. Pushing a new commit or a `/retest`-like chat command can both work here.
I agree that closing/re-opening is not the only option here.
I think @pritidesai's point was not about the specific way the pipeline is triggered, but rather that a completely new `PipelineRun` is required. On-demand retries of a task would change things so that the `PipelineRun` would not terminate right away upon failure, giving the engineer an option to fix the underlying condition and retry the specific task without having to repeat the execution of the rest of the pipeline.
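How a user would actually request the retry is not specified in the text quoted here. As one purely hypothetical illustration (the annotation key is invented for this sketch), the signal could be an annotation added to the still-running `PipelineRun`, for example with `kubectl annotate pipelinerun <name> example.tekton.dev/retry-task=acceptance-test`, which the controller watching the run would act on:

```yaml
# Resulting (hypothetical) annotation on the running PipelineRun
metadata:
  annotations:
    example.tekton.dev/retry-task: "acceptance-test"   # hypothetical key naming the failed pipelineTask to retry
```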
Let's take this further: in an organization, there is often a single resource to manage CI of multiple applications. Now instead of writing multiple CI pipelines for each application, it's easier to manage one common CI pipeline for all applications. The CI pipeline is fairly simple and common for all applications. It includes `build`, `test`, and `deploy`. Now, when a `unit-test` fails, it's always desirable to just re-run `unit-test` rather than re-running the whole
Wouldn't this mean that if a user does not notice that the unit test failed and does not trigger an on-demand retry, the entire pipeline will be kept running, and so if you were relying on notifications at the end of a pipelineRun, you wouldn't get them?
Yes, it would keep running until timeout.
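For runs like this, the existing `PipelineRun` timeout is what bounds the wait; a minimal sketch using Tekton's documented `timeout` field (the duration is just an example):

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: ci-pipeline-run-
spec:
  pipelineRef:
    name: ci-pipeline
  timeout: 4h          # upper bound on the whole run, including any time spent waiting for an on-demand retry
```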
1) Configure the acceptance-tests task with a reasonable amount of retries. But this might or might not work depending on the cause of the flakiness. Retrying a task once it fails might address a network connectivity issue as mentioned in the pipeline [specification](https://github.com/tektoncd/pipeline/blob/main/docs/pipelines.md#using-the-retries-field). If the flakiness is caused by an issue which requires a fix from the users, this kind of retry does not help.
One could configure a high-enough number of retries - so that the user could still go and fix the underlying cause while the retries are running. This solution would be less efficient and user-friendly though - as it is likely to run many more retries compared to the on-demand one, and also it forces the users to wait for the next failure and retry to verify their fix.
Thanks for the updates @pritidesai
I left some comments but nothing blocking for the proposed state.
One thing that we might want to give users is the option to just say "continue" and let the pipeline fail, in cases where no solution is available, but that's something we could add at a later stage too.
/approve
@pritidesai @afrittoli I'm ok merging this as proposed but I think we should consider retry strategies such as allowing users to specify a backoff (or even some kind of a condition) as a more general purpose solution to the use cases here. /approve
Adding a proposal to support pipeline authors to overcome any flakes and to allow running the same instance of the pipeline until it finishes. This TEP proposes a mechanism to allow the users to choose when to retry a failed `pipelineTask` of a pipeline.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: afrittoli, dibyom.
thanks a bunch @dibyom and @afrittoli for the approval 🎉 we can certainly add
Merging at API WG /lgtm
@pritidesai In light of what we agreed upon in TEP-0121, retries for
thanks @afrittoli 👍 My understanding is the
Any kind of
Yup, let's discuss this further since I think, all
Adding a proposal to support pipeline authors to overcome any flakes and to allow running the same instance of the pipeline until it finishes. This PR introduces a TEP with real-world use cases.
/kind tep