-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEP-0121: Modify metrics recorder for TaskRun Retries #5853
Conversation
Skipping CI for Draft Pull Request. |
/test all |
a9ed76f
to
ea60b64
Compare
Wondering if adding the If this route is valid, shall we also change the similar logics for reconciler/pipelinerun.go |
Thanks @JeromeJu for taking time to looking into the issue and commenting on the issue :) If I'm not mistaken, are you suggesting that we should not count those failed taskruns for metrics? If so, why would we want to do that?
That's true, I do think we also need to change the logic for reconciler/pipelinerun.go |
That was referring to the previous logics of putting durationAndCountMetrics() in reconcile function. But I totally agree that we should have it at |
We need to check whether the count in this logical route is correct or not. I will check why this is so. We want to avoid recount issues. |
cc @afrittoli this is the PR I opened for the issue we discussed today :P |
@khrm I believe this commit 1375219 is addressing the recount issue by comparing The recording metrics logic was originally in the ReconcileKind() function, but the commit 1375219 addressing the recount issue moves that logic to [1] 1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145 |
caad78b
to
49224af
Compare
Prior to this commit, metrics recording durationAndCountMetrics() is in reconcile() (instead of ReconcileKind()), which only counts the taskrun numbers where the taskruns that have no preparation error / not cancelled right after scheduled / not timed out before calling reconcile() (for example, be pending for a long time.). This commit moves durationAndCountMetrics() into ReconcileKind() to address the issue above. It's worth mentioning that the recording metrics logic was originally in the ReconcileKind() function, but the commit tektoncd@1375219, aming at addressing the recount issue, **moves that logic to `reconcile()`** [1] (probably for better code style? 🤔). [1] tektoncd@1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145
49224af
to
8de0e26
Compare
Makes sense, thanks for digging in @XinruZhang /approve
@khrm - Agreed as well. Will the current set of tests count this? |
/test check-pr-has-kind-label |
@JeromeJu: The specified target(s) for
The following commands are available to trigger optional jobs:
Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
NIT Submitter checklist is not filled in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @XinruZhang for addressing this!
It looks like we don't have any reconcile unit test that checks metrics, which is problematic because we must rely on our understanding of the code alone to know whether the counting of TaskRuns
is ok or duplicates will exist or cases will be missed.
The current logic for counting metrics relies on the assumption that we shall only count a TaskRun
when isDone()
is true
at the end of the reconciliation cycle and some change happened in the ConditionSucceeded
condition. These two conditions identify a single reconciliation cycle as long as no change is done to the ConditionSucceeded
after a TaskRun
is marked as done.
These two conditions do not change when moving the logic from reconcile
to ReconcileKind
, and in fact, the new location of the code captures some cases that were missing before, as @XinruZhang noted already.
There are a couple of improvements that could be made, but we could do them as a follow-up:
- the code under the
isDone
check should never touch the conditions of theTaskRun
. We invokesetDefaults
and laterupdateLabelsAndAnnotations
neither of which will touch the condition, so it's ok now, but it would be nice to enforce that somehow to avoid inadvertently breaking this logic here in the future. We could start adding a comment to it. Reconcile metrics tests would ensure that this is not broken in future - The code in the
durantionAndCountMetrics
uses a lister to get the before condition. We should not do that, listers are expensive for the API server, besides we already store thebefore
condition, so we should pass that todurantionAndCountMetrics
, similar to what we do for events.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: afrittoli, chitrangpatel, dibyom, jerop The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Changes
Openning this PR originally because a
DATA RACE
(Dirty Read) erroroccurs when I was implementing #5844, see error msg.
Basically, the reason is that durationAndCountMetrics() is reading
the
taskRun
object in one goroutine, while retryTaskRun() (which iscalled in finishReconcileUpdateEmitEvents) is writing the
taskRun
object.After digging into it, there's one more concern which looks like a bug:
durationAndCountMetrics() in reconcile() only counts the taskrun numbers
where the taskruns that have no preparation error / not cancelled right after
scheduled / not timed out before calling
reconcile()
(for example,be pending for a long time.).
To address the problem mentioned above, this PR moves
durationAndCountMetrics()
outside of reconcile() to ReconcileKind()to make sure it is reading the taskRun object that has no further update.
Related Background Info
The recording metrics logic was originally in the ReconcileKind() function,
but the commit 1375219 addressing the recount issue moves that logic
to
reconcile()
[1] (probably for better code style? 🤔).This is a blocker for #5844.
[1] 1375219#diff-6e67e9c647bbe2a08807ff5ccbdd7dc9036df373e56b9774d3996f92ab7ceabaL138-L145
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
Has Docs included if any changes are user facingfunctionality, content, code)
/kind <type>
. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)Release notes contains the string "action required" if the change requires additional action from users switching to the new releaseRelease Notes