
TEP-0073 Simplify Metrics #466

Merged

merged 2 commits into tektoncd:main from the carry-simplify-metrics branch on Aug 23, 2021

Conversation

vdemeester
Copy link
Member Author

Carrying "Simplify Metrics" TEP (#286) as this is something that we should tackle.

I need to rephrase some stuff, but I am carrying this proposal as it is getting a higher priority for us to fix.

/kind tep

Signed-off-by: Vincent Demeester [email protected]

@tekton-robot tekton-robot added kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 23, 2021
@tekton-robot tekton-robot requested review from hrishin and a user June 23, 2021 10:19
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 23, 2021
@vdemeester vdemeester force-pushed the carry-simplify-metrics branch from 355553f to dbd006c Compare June 23, 2021 13:24
@vdemeester vdemeester changed the title WIP: TEP-0073 Simplify Metrics TEP-0073 Simplify Metrics Jul 7, 2021
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 7, 2021
@vdemeester vdemeester force-pushed the carry-simplify-metrics branch from dbd006c to be56640 Compare July 7, 2021 13:21
@jerop
Member

jerop commented Jul 12, 2021

/assign @sbwsg
/assign @afrittoli

@tekton-robot tekton-robot assigned afrittoli and ghost Jul 12, 2021
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 13, 2021
@bobcatfish
Contributor

haha is it on purpose or a coincidence that this TEP is 73 when the original was 37??


either way i like it XD


we can add a `config-observability` option to switch the `duration_seconds` metric type. The default value is `histogram`.
It can be set to `gauge` if users only care about the execution time of individual `TaskRun`s.
It can't be set to `gauge` when `metrics.namespace-level` is `true`, because a gauge can't be aggregated at the namespace level.
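
For illustration only, a minimal sketch of what such a setting could look like in the `config-observability` ConfigMap; the key names follow the wording in this proposal (`metrics.namespace-level`, `metrics.duration-seconds-type`) and are hypothetical, not necessarily the final API:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  # Hypothetical keys taken from the proposal text, not a shipped API.
  # Aggregate run metrics at the namespace level (requires histogram):
  metrics.namespace-level: "false"
  # "histogram" (default) or "gauge"; "gauge" is invalid while
  # metrics.namespace-level is "true":
  metrics.duration-seconds-type: "gauge"
```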

What's the fallout of accidentally setting both metrics.namespace-level to true and metrics.duration-seconds-type to gauge, and how do I discover my mistake? Controller halts / won't start? Error messages in logs? Failed taskruns?

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 23, 2021
@khrm
Contributor

khrm commented Jul 28, 2021

I think we need gauge for PipelineRun as well, to avoid a cardinality explosion from the pipelinerun label on tekton_pipelines_controller_pipelinerun_duration_seconds_bucket.
With just 8k PipelineRuns, cardinality grows by 8k because every run contributes a unique label value, which is well over the recommended limits: https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
Having the PipelineRun/TaskRun name in labels makes cardinality unbounded: even if the Pipeline and Task stay the same, our metrics cardinality keeps increasing.
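
To make the cardinality math concrete, a rough sketch of the exposed series (metric name as quoted above; label values, bucket boundaries and sample values are invented): every unique `pipelinerun` label value drags along its own full set of `_bucket`, `_sum` and `_count` series, so 8k runs turn into on the order of 100k+ time series.

```
# a single PipelineRun "build-42" already contributes ~15 series
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="ci",pipelinerun="build-42",le="10"}   0
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="ci",pipelinerun="build-42",le="30"}   1
# ... one series per remaining bucket boundary ...
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="ci",pipelinerun="build-42",le="+Inf"} 1
tekton_pipelines_controller_pipelinerun_duration_seconds_sum{namespace="ci",pipelinerun="build-42"}              27.3
tekton_pipelines_controller_pipelinerun_duration_seconds_count{namespace="ci",pipelinerun="build-42"}            1
# "build-43", "build-44", ... each repeat the whole set with a new label value
```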

@vdemeester vdemeester force-pushed the carry-simplify-metrics branch from 5ec6849 to 4676fd4 Compare July 29, 2021 14:41
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2021
@dgrisonnet dgrisonnet left a comment

I have a suggestion for the histograms. In the comments I made, I noted that it might not be meaningful to have them at the TaskRun & PipelineRun levels. I don't have much knowledge of the project, but to me the only places where these duration metrics really make sense are Task and Pipeline.

How about having only 2 histograms:

  - `tekton_pipelines_controller_pipelines_duration_seconds` with `namespace` and `pipeline` labels
  - `tekton_pipelines_controller_pipeline_tasks_duration_seconds` with `namespace`, `pipeline` and `task` labels
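
For example (names and labels taken from the two bullets above; bucket boundaries and values are purely illustrative), the label sets would then be bounded by the number of Pipelines and Tasks rather than by the number of runs:

```
tekton_pipelines_controller_pipelines_duration_seconds_count{namespace="ci",pipeline="build-and-push"}                        14
tekton_pipelines_controller_pipelines_duration_seconds_bucket{namespace="ci",pipeline="build-and-push",le="60"}               12
tekton_pipelines_controller_pipeline_tasks_duration_seconds_count{namespace="ci",pipeline="build-and-push",task="unit-test"}  14
tekton_pipelines_controller_pipeline_tasks_duration_seconds_bucket{namespace="ci",pipeline="build-and-push",task="unit-test",le="60"} 13
```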

- changing `PipelineRun` or `TaskRun` level metrics to `Task` or `Pipeline`
level
- changing `PipelineRun` or `TaskRun` level metrics to namespace level
- changing metrics type from histogram to gauge in case of `TaskRun` or


I think that the duration metric could even be removed here. You most likely only see a task run once so it is not very meaningful to have an average duration over only one element.

### Change metrics type

Large metrics volume is mostly caused by histogram metrics. A single histogram metric produces 15 time series. These metrics could be changed to gauges, which would reduce those 15 series to one.
For example, metrics like `tekton_pipelinerun_duration_seconds`, `taskrun_duration_seconds` and `tekton_pipelinerun_taskrun_duration_seconds` are histograms, but they do not provide much information at the `TaskRun` level and can be changed to gauges.
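
As a rough sketch of the reduction described above (metric name from the TEP text; label values and bucket layout are illustrative), the same single observation exposed as a histogram versus as a gauge:

```
# histogram: bucket series + _sum + _count ≈ 15 series per label set
tekton_pipelinerun_duration_seconds_bucket{pipelinerun="build-42",le="+Inf"} 1
tekton_pipelinerun_duration_seconds_sum{pipelinerun="build-42"}              27.3
tekton_pipelinerun_duration_seconds_count{pipelinerun="build-42"}            1
# gauge: one series per label set
tekton_pipelinerun_duration_seconds{pipelinerun="build-42"} 27.3
```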
@dgrisonnet dgrisonnet Jul 29, 2021

As per my previous comment, I would advise removing them. The problem with these metrics can be seen in the Before example you've shared below. These metrics only record the duration of one run per time series, so it is not really meaningful to have percentiles or averages based on only one element.

Member

I agree! Data on how long it took to execute a specific PipelineRun should be taken from tektoncd/results if needed.

@khrm
Contributor

khrm commented Jul 30, 2021

Yes, removing the pipelinerun and taskrun labels altogether makes sense. Each of these labels gives us an additional time series that only ever receives one update. That data can be computed from the logging stack instead.

Member

@afrittoli afrittoli left a comment

Thanks for this!!
Since this is in proposed state, I'm happy to approve. You could also address comments on the design at this stage if you wanted to keep the conversation context.

/approve

Comment on lines +144 to +145
- changing `PipelineRun` or `TaskRun` level metrics to `Task` or `Pipeline`
level
Member

Is the aggregation going to be based on the name of the Task or Pipeline, or what do you have in mind?
The task may be deployed to the cluster but it also may come from a bundle. Would every new version of a bundle start a new metric?

Member Author

That's a good question 🤔 Naively I would think every new version (of a bundle, …) would start a new metric. Just like if you were to "name version" your task inside your cluster.

Comment on lines +149 to +152
The latter is mutually exclusive with the former two. If users care about
an individual `TaskRun`'s or `PipelineRun`'s performance, they can change the
`duration_seconds` type from histogram to gauge. If users care
about overall performance, they can collect metrics at the namespace level.
Member

NIT: Should this be in the alternatives section?

### Change metrics type

Large metrics volume is mostly caused by histogram metrics. A single histogram metric produces 15 time series. These metrics could be changed to gauges, which would reduce those 15 series to one.
For example, metrics like `tekton_pipelinerun_duration_seconds`, `taskrun_duration_seconds` and `tekton_pipelinerun_taskrun_duration_seconds` are histograms, but they do not provide much information at the `TaskRun` level and can be changed to gauges.
Member

I agree! Data on how long it took to execute a specific PipelineRun should be taken from tektoncd/results if needed.

Comment on lines +401 to +407
## Upgrade & Migration Strategy (optional)

<!--
Use this section to detail whether this feature needs an upgrade or
migration strategy. This is especially useful when we modify a
behavior or add a feature that may replace and deprecate a current one.
-->
Member

I think this needs to be filled in.

Do we need to continue to support the current behaviour?
Will it be opt-in or opt-out? Does that comply with our stability policy?

Contributor

From what I see in the docs, metrics are still experimental. We can support the current metrics for one release by deprecating them and then removing them.

@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, sbwsg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 2, 2021
vdemeester and others added 2 commits August 17, 2021 15:44
Carrying "Simplify Metrics"
TEP (tektoncd#286) as this is
something that we should tackle.

Signed-off-by: Vincent Demeester <[email protected]>
@vdemeester vdemeester force-pushed the carry-simplify-metrics branch from 4676fd4 to 78a6465 Compare August 17, 2021 13:44
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2021
@jerop
Member

jerop commented Aug 23, 2021

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2021
@tekton-robot tekton-robot merged commit b7cf199 into tektoncd:main Aug 23, 2021
@vdemeester vdemeester deleted the carry-simplify-metrics branch August 24, 2021 06:00
@lbernick lbernick mentioned this pull request Jan 11, 2022