TEP-0121: Modify metrics recorder for TaskRun Retries #5853

XinruZhang · 2022-12-08T17:29:44Z

Changes

Openning this PR originally because a DATA RACE (Dirty Read) error
occurs when I was implementing #5844, see error msg.

Basically, the reason is that durationAndCountMetrics() is reading
the taskRun object in one goroutine, while retryTaskRun() (which is
called in finishReconcileUpdateEmitEvents) is writing the taskRun object.

After digging into it, there's one more concern which looks like a bug:
durationAndCountMetrics() in reconcile() only counts the taskrun numbers
where the taskruns that have no preparation error / not cancelled right after
scheduled / not timed out before calling reconcile() (for example,
be pending for a long time.).

To address the problem mentioned above, this PR moves
durationAndCountMetrics() outside of reconcile() to ReconcileKind()
to make sure it is reading the taskRun object that has no further update.

Related Background Info

The recording metrics logic was originally in the ReconcileKind() function,
but the commit 1375219 addressing the recount issue moves that logic
to reconcile() [1] (probably for better code style? 🤔).

This is a blocker for #5844.

[1] 1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

~~Has Docs included if any changes are user facing~~
Has Tests included if any functionality added or changed
Follows the commit message standard
Meets the Tekton contributor standards (including
functionality, content, code)
Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
~~Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)~~
~~Release notes contains the string "action required" if the change requires additional action from users switching to the new release~~

Release Notes

NONE

tekton-robot · 2022-12-08T17:29:45Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

jerop · 2022-12-08T17:36:54Z

/test all

XinruZhang · 2022-12-09T13:22:49Z

/assign @jerop

cc @khrm
cc @tektoncd/core-collaborators

JeromeJu · 2022-12-09T21:03:50Z

After digging into it, there's one more concern:
durationAndCountMetrics() in reconcile() only counts the taskrun numbers
where the taskrun has no preparation error / not cancelled right after
scheduled / not timed out before calling reconcile() (for example,
be pending for a long time.).

Wondering if adding the status check here in DurationAndCount will help only counts the taskRuns that are not failed from https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/taskrun/taskrun.go#L110-L180?

If this route is valid, shall we also change the similar logics for reconciler/pipelinerun.go

XinruZhang · 2022-12-12T14:57:53Z

Thanks @JeromeJu for taking time to looking into the issue and commenting on the issue :)

If I'm not mistaken, are you suggesting that we should not count those failed taskruns for metrics? If so, why would we want to do that?

If this route is valid, shall we also change the similar logics for reconciler/pipelinerun.go

That's true, I do think we also need to change the logic for reconciler/pipelinerun.go

JeromeJu · 2022-12-12T15:39:48Z

If I'm not mistaken, are you suggesting that we should not count those failed taskruns for metrics? If so, why would we want to do that?

That was referring to the previous logics of putting durationAndCountMetrics() in reconcile function. But I totally agree that we should have it at ReconcileKind level which we can keep count of failed taskRuns(taking the current route).

khrm · 2022-12-12T16:03:57Z

We need to check whether the count in this logical route is correct or not. I will check why this is so. We want to avoid recount issues.

XinruZhang · 2022-12-13T15:19:34Z

cc @afrittoli this is the PR I opened for the issue we discussed today :P

XinruZhang · 2022-12-13T17:39:22Z

@khrm I believe this commit 1375219 is addressing the recount issue by comparing beforeCondition and afterCondition, only record when the two conditions are different).

The recording metrics logic was originally in the ReconcileKind() function, but the commit 1375219 addressing the recount issue moves that logic to reconcile() [1] (probably for better code style? 🤔).

cc @afrittoli @jerop @dibyom

[1] 1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145

Prior to this commit, metrics recording durationAndCountMetrics() is in reconcile() (instead of ReconcileKind()), which only counts the taskrun numbers where the taskruns that have no preparation error / not cancelled right after scheduled / not timed out before calling reconcile() (for example, be pending for a long time.). This commit moves durationAndCountMetrics() into ReconcileKind() to address the issue above. It's worth mentioning that the recording metrics logic was originally in the ReconcileKind() function, but the commit tektoncd@1375219, aming at addressing the recount issue, **moves that logic to `reconcile()`** [1] (probably for better code style? 🤔). [1] tektoncd@1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145

dibyom · 2022-12-14T04:26:10Z

Makes sense, thanks for digging in @XinruZhang

/approve

We need to check whether the count in this logical route is correct or not. I will check why this is so. We want to avoid recount issues.

@khrm - Agreed as well. Will the current set of tests count this?

JeromeJu · 2022-12-14T21:21:06Z

/test check-pr-has-kind-label

tekton-robot · 2022-12-14T21:21:07Z

@JeromeJu: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-tekton-pipeline-alpha-integration-tests
/test pull-tekton-pipeline-build-tests
/test pull-tekton-pipeline-integration-tests
/test tekton-pipeline-unit-tests

The following commands are available to trigger optional jobs:

/test pull-tekton-pipeline-go-coverage

Use /test all to run all jobs.

In response to this:

/test check-pr-has-kind-label

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

afrittoli · 2022-12-15T11:55:04Z

NIT Submitter checklist is not filled in

afrittoli

Thank you @XinruZhang for addressing this!

It looks like we don't have any reconcile unit test that checks metrics, which is problematic because we must rely on our understanding of the code alone to know whether the counting of TaskRuns is ok or duplicates will exist or cases will be missed.

The current logic for counting metrics relies on the assumption that we shall only count a TaskRun when isDone() is true at the end of the reconciliation cycle and some change happened in the ConditionSucceeded condition. These two conditions identify a single reconciliation cycle as long as no change is done to the ConditionSucceeded after a TaskRun is marked as done.

These two conditions do not change when moving the logic from reconcile to ReconcileKind, and in fact, the new location of the code captures some cases that were missing before, as @XinruZhang noted already.

There are a couple of improvements that could be made, but we could do them as a follow-up:

the code under the isDone check should never touch the conditions of theTaskRun. We invoke setDefaults and later updateLabelsAndAnnotations neither of which will touch the condition, so it's ok now, but it would be nice to enforce that somehow to avoid inadvertently breaking this logic here in the future. We could start adding a comment to it. Reconcile metrics tests would ensure that this is not broken in future
The code in the durantionAndCountMetrics uses a lister to get the before condition. We should not do that, listers are expensive for the API server, besides we already store the before condition, so we should pass that to durantionAndCountMetrics, similar to what we do for events.

/approve

tekton-robot · 2022-12-15T12:18:45Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, chitrangpatel, dibyom, jerop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [afrittoli,dibyom,jerop]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

afrittoli

/lgtm

tekton-robot added release-note-none Denotes a PR that doesnt merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Dec 8, 2022

tekton-robot requested review from imjasonh and lbernick December 8, 2022 17:29

tekton-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Dec 8, 2022

XinruZhang mentioned this pull request Dec 8, 2022

TEP-0121: Retries - Reconciler Implementation #5844

Merged

7 tasks

jerop added the kind/bug Categorizes issue or PR as related to a bug. label Dec 8, 2022

XinruZhang force-pushed the TEP-0121-pre-fix branch from a9ed76f to ea60b64 Compare December 9, 2022 13:21

XinruZhang marked this pull request as ready for review December 9, 2022 13:21

tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 9, 2022

tekton-robot requested review from bobcatfish and vdemeester December 9, 2022 13:21

tekton-robot assigned jerop Dec 9, 2022

afrittoli added this to the Pipelines v0.43 milestone Dec 13, 2022

chitrangpatel approved these changes Dec 13, 2022

View reviewed changes

XinruZhang force-pushed the TEP-0121-pre-fix branch 2 times, most recently from caad78b to 49224af Compare December 13, 2022 21:27

tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 13, 2022

XinruZhang force-pushed the TEP-0121-pre-fix branch from 49224af to 8de0e26 Compare December 13, 2022 21:31

tekton-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 13, 2022

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2022

jerop approved these changes Dec 14, 2022

View reviewed changes

afrittoli reviewed Dec 15, 2022

View reviewed changes

tekton-robot assigned afrittoli Dec 15, 2022

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2022

tekton-robot merged commit ed8a2c2 into tektoncd:main Dec 15, 2022

This was referenced Dec 15, 2022

TaskRun/PipelineRuns that are failed on some reasons can't be recored in Metrics #5866

Closed

Improve Test Coverage on Metrics for TaskRun and PipelineRun Reconcilers #5875

Closed

XinruZhang deleted the TEP-0121-pre-fix branch December 15, 2022 14:50

This was referenced Dec 15, 2022

Stop Metric Recorder Listing TaskRuns When Records Data #5876

Merged

Mark TEP-0121 As Implemented tektoncd/community#905

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TEP-0121: Modify metrics recorder for TaskRun Retries #5853

TEP-0121: Modify metrics recorder for TaskRun Retries #5853

XinruZhang commented Dec 8, 2022 •

edited

Loading

tekton-robot commented Dec 8, 2022

jerop commented Dec 8, 2022

XinruZhang commented Dec 9, 2022

JeromeJu commented Dec 9, 2022 •

edited

Loading

XinruZhang commented Dec 12, 2022

JeromeJu commented Dec 12, 2022

khrm commented Dec 12, 2022 •

edited

Loading

XinruZhang commented Dec 13, 2022

XinruZhang commented Dec 13, 2022 •

edited

Loading

dibyom commented Dec 14, 2022

JeromeJu commented Dec 14, 2022

tekton-robot commented Dec 14, 2022

afrittoli commented Dec 15, 2022

afrittoli left a comment

tekton-robot commented Dec 15, 2022

afrittoli left a comment

TEP-0121: Modify metrics recorder for TaskRun Retries #5853

TEP-0121: Modify metrics recorder for TaskRun Retries #5853

Conversation

XinruZhang commented Dec 8, 2022 • edited Loading

Changes

Submitter Checklist

Release Notes

tekton-robot commented Dec 8, 2022

jerop commented Dec 8, 2022

XinruZhang commented Dec 9, 2022

JeromeJu commented Dec 9, 2022 • edited Loading

XinruZhang commented Dec 12, 2022

JeromeJu commented Dec 12, 2022

khrm commented Dec 12, 2022 • edited Loading

XinruZhang commented Dec 13, 2022

XinruZhang commented Dec 13, 2022 • edited Loading

dibyom commented Dec 14, 2022

JeromeJu commented Dec 14, 2022

tekton-robot commented Dec 14, 2022

afrittoli commented Dec 15, 2022

afrittoli left a comment

Choose a reason for hiding this comment

tekton-robot commented Dec 15, 2022

afrittoli left a comment

Choose a reason for hiding this comment

XinruZhang commented Dec 8, 2022 •

edited

Loading

JeromeJu commented Dec 9, 2022 •

edited

Loading

khrm commented Dec 12, 2022 •

edited

Loading

XinruZhang commented Dec 13, 2022 •

edited

Loading