
TEP-0073 Simplify Metrics #466

Merged

merged 2 commits into tektoncd:main from the carry-simplify-metrics branch on Aug 23, 2021

Conversation

vdemeester
Copy link
Member Author

Carrying "Simplify Metrics" TEP (#286) as this is something that we should tackle.

I need to rephrase some stuff, but I am carrying this proposal as it is getting a higher priority for us to fix.

/kind tep

Signed-off-by: Vincent Demeester [email protected]

@tekton-robot tekton-robot added kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 23, 2021
@tekton-robot tekton-robot requested review from hrishin and a user June 23, 2021 10:19
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 23, 2021
@vdemeester vdemeester force-pushed the carry-simplify-metrics branch from 355553f to dbd006c Compare June 23, 2021 13:24
@vdemeester vdemeester changed the title WIP: TEP-0073 Simplify Metrics TEP-0073 Simplify Metrics Jul 7, 2021
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 7, 2021
@vdemeester vdemeester force-pushed the carry-simplify-metrics branch from dbd006c to be56640 Compare July 7, 2021 13:21
@jerop
Member

jerop commented Jul 12, 2021

/assign @sbwsg
/assign @afrittoli

@tekton-robot tekton-robot assigned afrittoli and ghost Jul 12, 2021
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 13, 2021
@bobcatfish
Contributor

haha is it on purpose or a coincidence that this TEP is 73 when the original was 37??


either way i like it XD


we can add a `config-observability` option to switch the `duration_seconds` metric type. The default value is `histogram`.
It can be set to `gauge` if users only care about the execution time of individual `TaskRun`s.
It can't be set to `gauge` when `metrics.namespace-level` is `true`, because a gauge can't be aggregated at the namespace level.
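
For illustration only, a minimal sketch of what such a setting could look like in the `config-observability` ConfigMap; the key names follow the wording in this proposal (`metrics.namespace-level`, `metrics.duration-seconds-type`) and are hypothetical, not necessarily the final API:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  # Hypothetical keys taken from the proposal text, not a shipped API.
  # Aggregate run metrics at the namespace level (requires histogram):
  metrics.namespace-level: "false"
  # "histogram" (default) or "gauge"; "gauge" is invalid while
  # metrics.namespace-level is "true":
  metrics.duration-seconds-type: "gauge"
```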

What's the fallout of accidentally setting both metrics.namespace-level to true and metrics.duration-seconds-type to gauge, and how do I discover my mistake? Controller halts / won't start? Error messages in logs? Failed taskruns?

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 23, 2021
@khrm
Contributor

khrm commented Jul 28, 2021

I think we need gauge for PipelineRun as well, to avoid a cardinality explosion from the pipelinerun label on tekton_pipelines_controller_pipelinerun_duration_seconds_bucket.
With just 8k PipelineRuns, cardinality grows by 8k because every run contributes a unique label value, which is well over the recommended limits: https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
Having the PipelineRun/TaskRun name in labels makes cardinality unbounded: even if the Pipeline and Task stay the same, our metrics cardinality keeps increasing.
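
To make the cardinality math concrete, a rough sketch of the exposed series (metric name as quoted above; label values, bucket boundaries and sample values are invented): every unique `pipelinerun` label value drags along its own full set of `_bucket`, `_sum` and `_count` series, so 8k runs turn into on the order of 100k+ time series.

```
# a single PipelineRun "build-42" already contributes ~15 series
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="ci",pipelinerun="build-42",le="10"}   0
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="ci",pipelinerun="build-42",le="30"}   1
# ... one series per remaining bucket boundary ...
tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{namespace="ci",pipelinerun="build-42",le="+Inf"} 1
tekton_pipelines_controller_pipelinerun_duration_seconds_sum{namespace="ci",pipelinerun="build-42"}              27.3
tekton_pipelines_controller_pipelinerun_duration_seconds_count{namespace="ci",pipelinerun="build-42"}            1
# "build-43", "build-44", ... each repeat the whole set with a new label value
```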

@vdemeester vdemeester force-pushed the carry-simplify-metrics branch from 5ec6849 to 4676fd4 Compare July 29, 2021 14:41
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2021
@dgrisonnet dgrisonnet left a comment

I have a suggestion for the histograms. In the comments I made, I noted that it might not be meaningful to have them at the TaskRun & PipelineRun levels. I don't have much knowledge of the project, but to me the only places where these duration metrics really make sense are Task and Pipeline.

How about having only 2 histograms:

  - `tekton_pipelines_controller_pipelines_duration_seconds` with `namespace` and `pipeline` labels
  - `tekton_pipelines_controller_pipeline_tasks_duration_seconds` with `namespace`, `pipeline` and `task` labels
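
For example (names and labels taken from the two bullets above; bucket boundaries and values are purely illustrative), the label sets would then be bounded by the number of Pipelines and Tasks rather than by the number of runs:

```
tekton_pipelines_controller_pipelines_duration_seconds_count{namespace="ci",pipeline="build-and-push"}                        14
tekton_pipelines_controller_pipelines_duration_seconds_bucket{namespace="ci",pipeline="build-and-push",le="60"}               12
tekton_pipelines_controller_pipeline_tasks_duration_seconds_count{namespace="ci",pipeline="build-and-push",task="unit-test"}  14
tekton_pipelines_controller_pipeline_tasks_duration_seconds_bucket{namespace="ci",pipeline="build-and-push",task="unit-test",le="60"} 13
```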

- changing `PipelineRun` or `TaskRun` level metrics to `Task` or `Pipeline`
level
- changing `PipelineRun` or `TaskRun` level metrics to namespace level
- changing metrics type from histogram to gauge in case of `TaskRun` or


I think that the duration metric could even be removed here. You most likely only see a task run once so it is not very meaningful to have an average duration over only one element.

### Change metrics type

Large metrics volume is mostly caused by histogram metrics. A single histogram metric produces 15 time series. These metrics could be changed to gauges, which would reduce those 15 series to one.
For example, metrics like `tekton_pipelinerun_duration_seconds`, `taskrun_duration_seconds` and `tekton_pipelinerun_taskrun_duration_seconds` are histograms, but they do not provide much information at the `TaskRun` level and can be changed to gauges.
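
As a rough sketch of the reduction described above (metric name from the TEP text; label values and bucket layout are illustrative), the same single observation exposed as a histogram versus as a gauge:

```
# histogram: bucket series + _sum + _count ≈ 15 series per label set
tekton_pipelinerun_duration_seconds_bucket{pipelinerun="build-42",le="+Inf"} 1
tekton_pipelinerun_duration_seconds_sum{pipelinerun="build-42"}              27.3
tekton_pipelinerun_duration_seconds_count{pipelinerun="build-42"}            1
# gauge: one series per label set
tekton_pipelinerun_duration_seconds{pipelinerun="build-42"} 27.3
```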
@dgrisonnet dgrisonnet Jul 29, 2021

As per my previous comment, I would advise removing them. The problem with these metrics can be seen in the Before example you've shared below. These metrics only record the duration of one run per time series, so it is not really meaningful to have percentiles or averages based on only one element.

Member

I agree! Data on how long it took to execute a specific PipelineRun should be taken from tektoncd/results if needed.

@khrm
Contributor

khrm commented Jul 30, 2021

Yes, removing the pipelinerun and taskrun labels altogether makes sense. Each of these labels gives us an additional time series that only ever receives one update. That data can be computed from the logging stack instead.

Member

@afrittoli afrittoli left a comment

Thanks for this!!
Since this is in proposed state, I'm happy to approve. You could also address comments on the design at this stage if you wanted to keep the conversation context.

/approve

Comment on lines +144 to +145
- changing `PipelineRun` or `TaskRun` level metrics to `Task` or `Pipeline`
level
Member

Is the aggregation going to be based on the name of the Task or Pipeline, or what do you have in mind?
The task may be deployed to the cluster but it also may come from a bundle. Would every new version of a bundle start a new metric?

Member Author

That's a good question 🤔 Naively I would think every new version (of a bundle, …) would start a new metric. Just like if you were to "name version" your task inside your cluster.

Comment on lines +149 to +152
The latter is mutually exclusive with the former two. If users care about
an individual `TaskRun`'s or `PipelineRun`'s performance, they can change the
`duration_seconds` type from histogram to gauge. If users care
about overall performance, they can collect metrics at the namespace level.
Member

NIT: Should this be in the alternatives section?

### Change metrics type

Large metrics volume is mostly caused by histogram metrics. A single histogram metric produces 15 time series. These metrics could be changed to gauges, which would reduce those 15 series to one.
For example, metrics like `tekton_pipelinerun_duration_seconds`, `taskrun_duration_seconds` and `tekton_pipelinerun_taskrun_duration_seconds` are histograms, but they do not provide much information at the `TaskRun` level and can be changed to gauges.
Member

I agree! Data on how long it took to execute a specific PipelineRun should be taken from tektoncd/results if needed.

Comment on lines +401 to +407
## Upgrade & Migration Strategy (optional)

<!--
Use this section to detail whether this feature needs an upgrade or
migration strategy. This is especially useful when we modify a
behavior or add a feature that may replace and deprecate a current one.
-->
Member

I think this needs to be filled in.

Do we need to continue to support the current behaviour?
Will it be opt-in or opt-out? Does that comply with our stability policy?

Contributor

From what I see in the docs, metrics are still experimental. We can support the current metrics for one release by deprecating them and then removing them.

@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, sbwsg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 2, 2021
vdemeester and others added 2 commits August 17, 2021 15:44
Carrying "Simplify Metrics"
TEP (tektoncd#286) as this is
something that we should tackle.

Signed-off-by: Vincent Demeester <[email protected]>
@vdemeester vdemeester force-pushed the carry-simplify-metrics branch from 4676fd4 to 78a6465 Compare August 17, 2021 13:44
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2021
@jerop
Member

jerop commented Aug 23, 2021

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2021
@tekton-robot tekton-robot merged commit b7cf199 into tektoncd:main Aug 23, 2021
@vdemeester vdemeester deleted the carry-simplify-metrics branch August 24, 2021 06:00
@lbernick lbernick mentioned this pull request Jan 11, 2022