
TEP-0123 - proposal to specify on-demand-retry in a pipelineTask #823

Conversation

pritidesai (Member)

Adding a proposal to support pipeline authors in overcoming flakes and to allow running the same instance of the pipeline until it finishes. This PR introduces a TEP with real-world use cases.

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Sep 17, 2022
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 17, 2022
@lbernick (Member)

/assign @dibyom @XinruZhang

@tekton-robot (Contributor)

@lbernick: GitHub didn't allow me to assign the following users: xinruzhang.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @dibyom @XinruZhang

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

teps/0123-specifying-on-demand-retry-in-pipelinetask.md
acceptance test is executed against the production deployment. The acceptance tests are flaky and sometimes result
in failure. This failure prevents updating the change request with the deployment evidence.

Now, the release manager has two choices to work around this flakiness:
Member

Wouldn't another choice be to configure the known flaky task with a reasonable number of retries? If that is not a valid choice for this scenario, could you explain why?

pritidesai (Member Author), Sep 22, 2022

Yup, configuring the known flaky task with a reasonable number of retries will work in some scenarios. In cases where the failure is the result of a momentary bandwidth bottleneck, a fixed number of retries might not be a feasible choice. In the case of a bandwidth issue or any other infrastructure issue, the desire is to be able to retry once the issue is resolved. Also, for a cluster with 100s or 1000s of pipelineRuns of the same CD pipeline, I would want to avoid retrying all flaky taskRuns at the same time.

and other tests are executed. Now, a flaky unit test often fails and requires a couple of retries before it can
succeed.

A Tekton contributor without access to the prow command `/test unit-test` has to close and reopen the
Member

ok, this is a compelling use case that I think we'll run into for our dogfooding pipeline (assuming we have a single presubmit pipeline that runs build, unit test, coverage etc)

cc @abayer

Member

I am not sure I understand this use case. A user that doesn't have access to the /test … command will not have tests running when opening a PR (it "needs" the label). If the test fails due to flakiness, "someone" (a reviewer) will look into it and run the /test. Ideally it will be relatively quick, so I am not sure I see a problem to solve there.

Member

Right - where I was coming from was the assumption that the entire presubmit is one tekton pipeline and so /test or /re-test will rerun everything. Like you said below, one alternative would be to have smaller pipelines chained together

Member

The setup we have now is one pipeline per CI job.
The main disadvantage of this approach is that we need to clone the repo for each CI job (but that is no different from what prow does today).

There are a few ways we could improve on that, one of which is to have a single pipeline, but I did not consider it really until now because of a few reasons:

  • CI jobs may be provided / maintained by different teams. So the granularity at best could be one pipeline per group of jobs from the same team.
  • We don't have partial pipeline execution, so a re-run would always be a re-run of the whole pipeline

pritidesai (Member Author), Sep 23, 2022

I might not have done a good job of explaining this use case.

The assumption for this use case is that it's one single tekton pipeline we are talking about.

With most of the reviewers working on multiple projects, it will be beneficial to the contributors to be able to avoid such dependencies on the reviewers.

This proposal is addressing re-run from a failed task. I might have to go back and update this use case to better explain.

Member

Thank you @pritidesai


## Motivation

The primary motivation of this proposal is to support pipeline authors to overcome any flakes and to allow running
Member

Why is on demand retry the best way to solve flakes? Why not automated retries? Are there other alternatives?

Member

I think one approach (imo) is to create smaller Pipelines that are connected through events. That way you can easily re-run a Pipeline(Run) that failed and it would "continue" the flow.

Member Author

yup, creating smaller pipelines and connecting them through events could be an alternative.

But the disadvantages of this alternative are very painful for the users. I have a user who has deployed one pipeline for 10 different applications. If he has to split up the pipeline for each possible flake, he will end up with a management nightmare of creating additional triggers for each application and managing pieces of the pipeline (which might not even make sense if a piece contains just one single task).

As a user of the pipeline controller, it's reasonable to expect a solution to handle such failures.

Member

I agree that smaller pipelines can be an alternative here, but as @pritidesai highlighted, it's not necessarily applicable to all users and all cases. It's a bit like the many small repos vs. large mono-repo divide - we should support users that require large pipelines for their use cases.

vdemeester (Member) left a comment

I am not sure I understand the proposal here.
Is it:

  • an "infinite" retry — retry until it succeeds
  • a new PipelineRun that is created from a previous one, but will only run the "selected" tasks?

In the case of the latter, I am not sure this is something tektoncd/pipeline has to solve. But in any case, we may want to explore how other CI/CD tools (Jenkins, GitHub workflows, …) handle those cases – if they do at all.



1) Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun` which
will start from the beginning i.e. clone, build, and deploy the same application source.

2) Create an asynchronous pipeline with just the two tasks `acceptance-test` and `submit-evidence`. Trigger this `pipeline`
Member

One could argue that submit-evidence should be executed regardless, to track the deployment and the failed tests.
Having a retry on the test task would allow submitting evidence that does not include the flake, or that at least includes both runs of the tests.

@pritidesai pritidesai force-pushed the 0123-specify-on-demand-retry-in-pipelinetask branch from 71cf794 to 8935c4b Compare September 23, 2022 06:30
pritidesai (Member Author)

I am not sure I understand the proposal here. Is it:

  • an "infinite" retry — retry until it succeeds
  • a new PipelineRun that is created from a previous one, but will only run the "selected" tasks?

In the case of the latter, I am not sure this is something tektoncd/pipeline has to solve. But in any case, we may want to explore how other CI/CD tools (Jenkins, GitHub workflows, …) handle those cases – if they do at all.

thanks @vdemeester I will look into how other CI/CD systems are solving this issue.

This proposal is to retry a pipelineTask until it succeeds or the pipelineRun is cancelled (the original cancellation implementation takes higher precedence).

There is no new pipelineRun created with the "selected task". This is somewhat similar to pending pipelines but extends that idea to failed tasks. It's the same pipelineRun being utilized.
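To make the shape of the proposal concrete, here is a minimal sketch of what this might look like in a Pipeline definition. The field name `onDemandRetry` is purely illustrative and not part of the Tekton API; the actual API surface is what the TEP itself has to decide:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: cd-pipeline
spec:
  tasks:
    - name: deploy
      taskRef:
        name: deploy
    - name: acceptance-test
      runAfter: ["deploy"]
      taskRef:
        name: acceptance-test
      # Illustrative field, not part of the Tekton API: on failure, keep the
      # pipelineRun running and wait for a user-requested retry instead of
      # failing the run or retrying a fixed number of times.
      onDemandRetry: true
    - name: submit-evidence
      runAfter: ["acceptance-test"]
      taskRef:
        name: submit-evidence
```

Under this sketch, a failed acceptance-test would leave the same pipelineRun running until the task succeeds on a requested retry, the run times out, or the run is cancelled.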

pritidesai (Member Author)

thank you @dibyom, @vdemeester, and @afrittoli for the reviews 🙏 I have addressed some of the comments, will continue addressing the rest of the comments in a day or two. Thanks!

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 4, 2022
@pritidesai pritidesai force-pushed the 0123-specify-on-demand-retry-in-pipelinetask branch from 8935c4b to 54415e0 Compare October 7, 2022 22:13
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2022
pritidesai (Member Author) commented Oct 7, 2022

But in any case, we may want to explore how other CI/CD tools (Jenkins, GitHub workflows, …) handle those cases – if they do at all.

Hey @vdemeester, I have added a small section on how other CI/CD systems such as Argo Workflows, GitHub Actions, and Jenkins handle this use case.

@pritidesai pritidesai force-pushed the 0123-specify-on-demand-retry-in-pipelinetask branch from 54415e0 to 7ec861f Compare October 7, 2022 23:04
pritidesai (Member Author)

Hey @afrittoli, @vdemeester, and @dibyom, please take another look at the changes, thanks! 🙏

@pritidesai pritidesai force-pushed the 0123-specify-on-demand-retry-in-pipelinetask branch from 7ec861f to 63dbd49 Compare October 7, 2022 23:08
afrittoli (Member)

/assign @afrittoli

@pritidesai pritidesai force-pushed the 0123-specify-on-demand-retry-in-pipelinetask branch 3 times, most recently from 331754e to cedbe4f Compare October 12, 2022 18:54
dibyom (Member) left a comment

Thanks @pritidesai, I'd like to better understand the motivation behind restarting the same run vs. creating a new run with the same specs.
One way to experiment I can think of is with a custom run that takes a failed pipelineRun and the task to be restarted, and starts a new pipelineRun.

### Non-Goals

* Pause/Resume:
* Support pausing a running `taskRun` as that would require pausing the `pod` itself.
Member

Not sure what this means...couldn't we keep the pod running but signal the entrypoint to stop executing steps?

Member

Yeah, we probably don't need to specify why "Pause/resume" is not a goal.
There may be ways to achieve that functionality, but it's not what the TEP is about.

Member

It doesn't have to be here, but I think it's worth specifying why Pause/Resume isn't a valid alternative (or at least why that isn't the chosen path for the proposal).

Member Author

This proposal is about restarting a failed pipelineTask. Is "Pause/resume" limited to just failures? Wouldn't some use cases require pausing a pipeline even without any failures?

These additional retry strategies allow users to specify a reasonable delay during which a transient failure can be
fixed.

Jenkins takes `retry` one step further where a user can specify [conditions](https://www.jenkins.io/doc/pipeline/steps/workflow-basic-steps/#retry-retry-the-body-up-to-n-times)
Member

Interesting... but it seems like the retry logic is still not on demand... Jenkins is still responsible for retrying if it thinks there is a connection issue to the agent.

pritidesai (Member Author), Nov 14, 2022

Right, retry logic is not on demand. Jenkins is responsible for retrying in case of a connection issue.

But, Jenkins does support catching failures and further processing somehow:

Retry the block (up to N times) if any exception happens during its body execution. If an exception happens on the final attempt then it will lead to aborting the build (unless it is caught and processed somehow). User aborts of the build are not caught.

I have not looked into it further though.

Also, Jenkins supports conditions, which are very advanced in terms of specifying what condition triggers the retry. Jenkins retry logic understands the difference between an infrastructure failure (the connection to the agent itself is broken) vs. a build failure, and allows the pipeline author to choose to retry in case of an infrastructure failure. Jenkins retry logic can also distinguish between pod termination vs. pod deletion.

conditions (optional)
Conditions under which the block should be retried. If none match, the block will fail. If there are no specified conditions, the block will always be retried except in case of user aborts.
Array / List of Nested Choice of Objects
agent
Detects that a node block, or certain steps inside it such as sh, failed for reasons which are likely due to infrastructure rather than the behavior of the build. If the connection to an agent is broken or the agent is removed from the list of executors while in use (typically in response to the disappearance of underlying cloud resources), this condition will allow retry to allocate a fresh agent and try the whole block again.
kubernetesAgent
Similar to agent() (Agent errors) but tailored to agents provisioned from a Kubernetes cloud. Unlike the generic agent error condition, this will ignore certain pod termination reasons which are likely to be under the control of the Pipeline author (e.g., OOMKilled) while still allowing retry to recover after common cases of pod deletion.
handleNonKubernetes : boolean (optional)
Behave like the generic agent() (Agent errors) when applied to a non-Kubernetes agent. Useful in cases where it is hard to predict in a job definition whether a Kubernetes or other sort of agent will be used.
nonresumable
The Jenkins controller was restarted while the build was running a step which cannot be resumed. Some steps like sh or input are written to survive a Jenkins restart and simply pick up where they left off when the build resumes. Others like checkout or junit normally complete promptly but cannot tolerate a restart. In case one of these steps happened to be in progress when Jenkins shut down, the resumed build will throw an error; using this condition with retry allows the step (or the whole enclosing node block) to be rerun.

What we are trying to build in this PR addresses a limitation in Jenkins that users are experiencing: https://issues.jenkins.io/browse/JENKINS-54735 and something on their roadmap.

rather than the execution of the `script`. A Jenkins stage author can specify a condition `agent` which will allow
a retry once the connection to an agent is fixed.

Jenkins declarative pipeline supports restarting stages with the same parameters and actions as documented in
Member

I found some docs here: https://www.jenkins.io/doc/book/pipeline/running-pipelines/#restart-from-a-stage

Some takeaways for me:

You can restart any completed Declarative Pipeline from any top-level stage which ran in that Pipeline. This allows you to rerun a Pipeline from a stage which failed due to transient or environmental considerations, for example. All inputs to the Pipeline will be the same. This includes SCM information, build parameters, and the contents of any stash step calls in the original Pipeline, if specified.

I think this bit about the stash is important. We don't have anything similar. Also later in the docs:

Normally, when you run the stash step in your Pipeline, the resulting stash of artifacts is cleared when the Pipeline completes, regardless of the result of the Pipeline. Since stash artifacts aren’t accessible outside of the Pipeline run that created them, this has not created any limitations on usage. But with Declarative stage restarting, you may want to be able to unstash artifacts from a stage which ran before the stage you’re restarting from.

So, it seems like Jenkins preserves the artifacts that may have been created in a previous stage and makes it optionally accessible to another build/pipeline.

Once you choose a stage to restart from and click submit, a new build, with a new build number, will be started. All stages before the selected stage will be skipped, and the Pipeline will start executing at the selected stage. From that point on, the Pipeline will run as normal.

Member

This TEP does not consider the partial execution of a pipeline that reached completion, whether successful or not. The key assumption for the retries in this TEP is that the pipeline is still running.

If a PipelineTask marked for "on-demand-retry" fails, the execution of the Pipeline continues until one of the following happens:

  • the PipelineTask is successful, and the associated branch of execution is resumed. This may happen if a retry is requested and is successful.
  • timeout is reached
  • the PipelineRun is cancelled

Member

Reading through this section again, it seems like other CI systems provide a way to configure a delay before triggering further retries. Jenkins allows users to specify conditional logic before retrying. Jenkins Declarative Pipelines do support restarting stages on demand but, like you pointed out, that does not keep the current pipeline running.
So, I think the current proposal is a bit unique in that respect. I don't think this by itself should mean we should not pursue on-demand retries, just that we should consider the alternative solutions for the use cases as well before implementing this.

pritidesai (Member Author), Nov 14, 2022

The clear distinction here is "You can restart any completed Declarative Pipeline"; this proposal is not applicable once the pipeline is completed/failed/cancelled.

Member Author

Like @afrittoli explained this proposal is only applicable until one of those three criteria is met.

Member Author

thanks @dibyom, please elaborate on this

other CI systems provide a way to configure a delay before triggering further retries.

Delay is applicable in terms of retries: if a stage fails and is asked to retry, a delay can be introduced to schedule the stage for execution after that interval.

Jenkins allows users to specify conditional logic before retrying.

Conditional logic is part of retry. Jenkins supports try-catch in addition to retry, which is much more robust when used along with retry.

https://www.jenkins.io/doc/pipeline/steps/workflow-basic-steps/#catcherror-catch-error-and-set-build-result-to-failure

I don't think this by itself should mean we should not pursue on-demand retries, just that we should consider the alternative solutions for the use cases as well before implementing this

What are the alternative solutions here apart from a custom task? This is a very common use case, and if we decide to support it, it must live as a first-class solution rather than a custom task.

### Goals

* Provide a mechanism to enable on-demand-retry of a `pipelineTask` in a `pipeline`.
* Provide a mechanism to signal a failed `taskRun` to start executing with the original specifications. This mechanism
Member

Why restart a taskRun and not just recreate a new taskRun?

Member

It seems to me that this belongs more to the design details than to the goals.
The goal is to allow for on-demand retries of a PipelineTask:

  • if a task fails the PipelineRun execution does not end
  • execution of PipelineTasks that do not depend on the failed one continues normally
  • one or more retries of the failed PipelineTask may be requested, until one of the following:
    • the PipelineTask is successful
    • the PipelineRun times out
    • the PipelineRun is cancelled

* Support pausing a running `taskRun` until a condition is met.
* Ignoring a task failure and continuing to run the rest of the `pipeline`, which is proposed in [TEP-0050](0050-ignore-task-failures.md).
* Partial pipeline execution - [TEP-0077](https://github.com/tektoncd/community/pull/484) (not merged) and [tektoncd/pipeline issue#50](https://github.com/tektoncd/pipeline/issues/50)
* Create a separate `pipelineRun` with an option of choosing different pipeline params.
Member

Why is this a non-goal? It seems like it would be an alternative at least?

Member

The scope of this TEP is to retry a task in a running pipeline run.
We could consider partial re-execution of a finished pipeline run, possibly with different inputs, but that would be a different TEP. Partial execution after completion is a whole different story as we would need to make sure we are able to recreate the required inputs at a certain point of execution.

[specification](https://github.com/tektoncd/pipeline/blob/main/docs/pipelines.md#using-the-retries-field). If the
flakiness is caused by an issue which requires a fix from the users, this kind of retry does not help.

2) Close the existing PR and create a new PR for the same request. This new PR will trigger a new `pipelineRun` which
Member

I don't think close/re-open are the only alternatives. Pushing a new commit or a /retest-like chat command can both work here.

Member

I agree that closing/re-opening is not the only option here.

I think @pritidesai's point was not about the specific way the pipeline is triggered, but rather that a completely new PipelineRun is required. On-demand retries of a task would change things so that the PipelineRun would not terminate right away upon failure, giving the engineer an option to fix the underlying condition and retry the specific task without having to repeat the execution of the rest of the pipeline.

Let's take this further, in an organization, there is often a single resource to manage CI of multiple applications.
Now instead of writing multiple CI pipelines for each application, it's easier to manage one common CI pipeline for all
applications. The CI pipeline is fairly simple and common for all applications. It includes `build`, `test`, and `deploy`.
Now, when a `unit-test` fails, it's always desirable to just re-run `unit-test` rather than re-running the whole
Member

Wouldn't this mean that if a user does not detect that the unit test failed and does not trigger an on-demand retry, the entire pipeline will be kept running, and so if you were relying on notifications at the end of a pipelineRun, you wouldn't get them?

Member

Yes, it would keep running until timeout.

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2022
Comment on lines +140 to +143
1) Configure acceptance-tests task with a reasonable amount of retries. But this might or might not work depending on the
cause of the flakiness. Retrying a task once it fails might address network connectivity issue as mentioned in the pipeline
[specification](https://github.com/tektoncd/pipeline/blob/main/docs/pipelines.md#using-the-retries-field). If the
flakiness is caused by an issue which requires a fix from the users, this kind of retry does not help.
Member

One could configure a high-enough number of retries so that the user could still go and fix the underlying cause while the retries are running. This solution would be less efficient and user-friendly though, as it is likely to run many more retries compared to the on-demand one, and it also forces the users to wait for the next failure and retry to verify their fix.
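For comparison, the existing automatic `retries` field referenced above is specified per pipelineTask in the Pipeline spec (the task name here is illustrative):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: cd-pipeline
spec:
  tasks:
    - name: acceptance-tests   # illustrative task name
      # Tekton retries the taskRun up to 5 times automatically on failure;
      # the count is fixed up front, with no delay, condition, or user input.
      retries: 5
      taskRef:
        name: acceptance-tests
```

This is the fixed-count retry that the discussion above contrasts with an on-demand one.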

afrittoli (Member) left a comment

Thanks for the updates @pritidesai
I left some comments but nothing blocking for the proposed state.
One thing that we might want to give users is the option to just say "continue" and let the pipeline fail, in cases where no solution is available, but that's something we could add at a later stage too.

/approve

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2022
dibyom (Member) commented Nov 11, 2022

@pritidesai @afrittoli I'm ok merging this as proposed but I think we should consider retry strategies such as allowing users to specify a backoff (or even some kind of a condition) as a more general purpose solution to the use cases here.

/approve

Adding a proposal to support pipeline authors to overcome any flakes and to
allow running the same instance of the pipeline until it finishes. This TEP
proposes a mechanism to allow the users to choose when to retry a failed
`pipelineTask` of a pipeline.
@pritidesai pritidesai force-pushed the 0123-specify-on-demand-retry-in-pipelinetask branch from cedbe4f to 9652bda Compare November 14, 2022 07:38
@tekton-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, dibyom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 14, 2022
pritidesai (Member Author)

thanks a bunch @dibyom and @afrittoli for the approval 🎉 we can certainly add a "continue" kind of option and allow a backoff along with a condition, in addition to on-demand-retry. I will update the TEP in the next PR. Thanks!

jerop (Member) commented Nov 14, 2022

Merging at API WG

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 14, 2022
@tekton-robot tekton-robot merged commit af11661 into tektoncd:main Nov 14, 2022
afrittoli (Member)

@pritidesai In light of what we agreed upon in TEP-0121, retries for TaskRuns and CustomRuns will be managed by the respective controllers.
On-demand retries of a pipelineTask are, by nature, a pipeline feature, so this kind of retry will require implementing a pipeline controller driven retry, which MUST then create new TaskRuns when running.
The same retry mechanism could be used to implement pipeline driven retries like described in my tep-0121 comment, and we'll have to decide which of the two to implement first.

pritidesai (Member Author)

@pritidesai In light of what we agreed upon in TEP-0121, retries for TaskRuns and CustomRuns will be managed by the respective controllers.

thanks @afrittoli 👍 My understanding is that the retries for a pipelineTask in a pipeline will be managed by the respective controllers, but the pipelineRun controller will continue watching the taskRuns created for a pipelineTask. Right?

On-demand retries of a pipelineTask are, by nature, a pipeline feature, so this kind of retry will require implementing a pipeline controller driven retry, which MUST then create new TaskRuns when running.

Any kind of retry for a pipelineTask (for example, a number of attempts, on-demand retry, delayed retry) is still a pipeline feature. Since a pipelineTask is part of a pipeline, the implementer could be any controller, but in the end the pipelineRun controller is the one checking whether a pipelineTask is done executing, and it has control of scheduling the next pipelineTask in the pipeline. Yup, I agree, it's the respective controller that implements this feature, and we will limit this feature to the first-class entity, which is taskRun, since this feature can be implemented by a customRun controller from any custom controller provider if needed.

The same retry mechanism could be used to implement pipeline driven retries like described in my #879 (comment), and we'll have to decide which of the two to implement first.

Yup, let's discuss this further since I think all retries for a pipelineTask are pipeline driven but could be implemented by the respective controllers.

@pritidesai pritidesai deleted the 0123-specify-on-demand-retry-in-pipelinetask branch November 16, 2022 17:25