TEP-0073 Simplify Metrics #466
Conversation
Force-pushed from 355553f to dbd006c
Force-pushed from dbd006c to be56640
/assign @sbwsg
we can add a `config-observability` option to switch the `duration_seconds` type. The default value is `histogram`.
It can be set to `gauge` if users only care about the execution time of individual `TaskRun`s.
It can't be set to `gauge` when `metrics.namespace-level` is `true`, because a gauge can't be aggregated at the namespace level.
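To make the proposed option concrete, here is a minimal sketch of how a controller could parse and validate these settings when loading the `config-observability` ConfigMap. The key names (`metrics.duration-seconds-type`, `metrics.namespace-level`) are taken from the proposal text; the function, struct, and wiring are hypothetical and not the actual Tekton code.

```go
package config

import (
	"fmt"
	"strconv"
)

// MetricsConfig is a hypothetical view of the metrics-related keys in the
// config-observability ConfigMap discussed in this proposal.
type MetricsConfig struct {
	DurationSecondsType string // "histogram" (default) or "gauge"
	NamespaceLevel      bool   // aggregate run metrics at namespace level
}

// newMetricsConfigFromMap parses and validates the proposed keys. Rejecting
// the invalid combination here means a misconfiguration surfaces as a config
// error in the controller logs instead of silently producing unusable metrics.
func newMetricsConfigFromMap(data map[string]string) (*MetricsConfig, error) {
	cfg := &MetricsConfig{DurationSecondsType: "histogram"}

	if v, ok := data["metrics.duration-seconds-type"]; ok {
		if v != "histogram" && v != "gauge" {
			return nil, fmt.Errorf("invalid metrics.duration-seconds-type %q: must be histogram or gauge", v)
		}
		cfg.DurationSecondsType = v
	}
	if v, ok := data["metrics.namespace-level"]; ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return nil, fmt.Errorf("invalid metrics.namespace-level %q: %w", v, err)
		}
		cfg.NamespaceLevel = b
	}
	// A gauge cannot be meaningfully aggregated per namespace, so the two
	// options are mutually exclusive.
	if cfg.NamespaceLevel && cfg.DurationSecondsType == "gauge" {
		return nil, fmt.Errorf("metrics.duration-seconds-type=gauge cannot be combined with metrics.namespace-level=true")
	}
	return cfg, nil
}
```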
What's the fallout of accidentally setting both `metrics.namespace-level` to `true` and `metrics.duration-seconds-type` to `gauge`, and how do I discover my mistake? Does the controller halt or fail to start? Error messages in the logs? Failed TaskRuns?
I think we need gauge for `PipelineRun` also, to avoid a cardinality explosion.
Force-pushed from 5ec6849 to 4676fd4
I have a suggestion for the histogram. In the comments that I made, I noticed that it might not be meaningful to have them at the `taskRun` & `pipelineRun` levels. I don't have much knowledge of the project, but to me the only places where these duration metrics really make sense are for `Task` and `Pipeline`.
How about having only 2 histograms:
- `tekton_pipelines_controller_pipelines_duration_seconds` with `namespace` and `pipeline` labels
- `tekton_pipelines_controller_pipeline_tasks_duration_seconds` with `namespace`, `pipeline` and `task` labels
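As a rough illustration of what this suggestion would mean in code, here is a sketch that registers the two views with OpenCensus. The metric names and label sets come from the comment above; the measure names, bucket boundaries, and wiring are illustrative assumptions and may not match the project's actual metrics plumbing.

```go
package metrics

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// One measure per suggested histogram; durations are recorded in seconds.
	pipelineDuration = stats.Float64(
		"pipelines_duration_seconds",
		"Duration of completed PipelineRuns, aggregated per Pipeline", "s")
	pipelineTaskDuration = stats.Float64(
		"pipeline_tasks_duration_seconds",
		"Duration of completed TaskRuns, aggregated per Pipeline task", "s")

	namespaceKey = tag.MustNewKey("namespace")
	pipelineKey  = tag.MustNewKey("pipeline")
	taskKey      = tag.MustNewKey("task")

	// Illustrative bucket boundaries in seconds.
	durationBuckets = view.Distribution(10, 30, 60, 300, 900, 1800, 3600)
)

// RegisterViews registers the two histograms suggested in the review:
// one keyed by (namespace, pipeline) and one by (namespace, pipeline, task).
func RegisterViews() error {
	return view.Register(
		&view.View{
			Name:        "tekton_pipelines_controller_pipelines_duration_seconds",
			Description: "PipelineRun durations per Pipeline",
			Measure:     pipelineDuration,
			Aggregation: durationBuckets,
			TagKeys:     []tag.Key{namespaceKey, pipelineKey},
		},
		&view.View{
			Name:        "tekton_pipelines_controller_pipeline_tasks_duration_seconds",
			Description: "TaskRun durations per Pipeline task",
			Measure:     pipelineTaskDuration,
			Aggregation: durationBuckets,
			TagKeys:     []tag.Key{namespaceKey, pipelineKey, taskKey},
		},
	)
}
```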
- changing `PipelineRun` or `TaskRun` level metrics to `Task` or `Pipeline` level
- changing `PipelineRun` or `TaskRun` level metrics to namespace level
- changing metrics type from histogram to gauge in case of `TaskRun` or
I think that the duration metric could even be removed here. You most likely only see a task run once so it is not very meaningful to have an average duration over only one element.
### Change metrics type

Large metrics volume is mostly caused by histogram metrics. A single histogram metric produces 15 time series (one per bucket plus `_sum` and `_count`). These metrics could be changed to gauge, which would reduce the 15 series to one.
For example, metrics like `tekton_pipelinerun_duration_seconds`, `taskrun_duration_seconds`, and `tekton_pipelinerun_taskrun_duration_seconds` are histograms. But these metrics do not provide much information at the `TaskRun` level and can be changed to gauge.
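For reference, the 15-to-1 reduction follows from how a Prometheus histogram is exported: each bucket becomes its own `_bucket` series, plus `_sum` and `_count`. The sketch below contrasts the two aggregation choices in OpenCensus terms; the bucket boundaries and the `durationMeasure` name are assumptions for illustration, not the project's actual values.

```go
package metrics

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// durationMeasure is a hypothetical stand-in for a run duration measure.
var durationMeasure = stats.Float64("duration_seconds", "run duration", "s")

// 12 finite bucket boundaries (illustrative). Exported to Prometheus this
// becomes 13 _bucket series (the exporter adds +Inf) plus _sum and _count:
// 15 time series per label combination.
var histogramAgg = view.Distribution(
	10, 30, 60, 300, 900, 1800, 3600, 5400, 10800, 21600, 43200, 86400)

// Switching the aggregation to LastValue exports a single gauge series per
// label combination instead: the 15-to-1 reduction described above.
var gaugeAgg = view.LastValue()

// aggregationFor picks the aggregation based on the proposed config value.
func aggregationFor(durationType string) *view.Aggregation {
	if durationType == "gauge" {
		return gaugeAgg
	}
	return histogramAgg
}
```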
As per my previous comment, I would advise removing them. The problem with these metrics can be seen in the `Before` example you've shared below. These metrics only record the duration of one run per time series, so it is not really meaningful to have percentiles or averages based on only one element.
I agree! Data on how long it took to execute a specific `PipelineRun` should be taken from tektoncd/results if needed.
Yes, removing.
Thanks for this!!
Since this is in `proposed` state, I'm happy to approve. You could also address comments on the design at this stage if you wanted to keep the conversation context.
/approve
- changing `PipelineRun` or `TaskRun` level metrics to `Task` or `Pipeline` level
Is the aggregation going to be based on the name of the `Task` or `Pipeline`, or what do you have in mind?
The task may be deployed to the cluster but it also may come from a bundle. Would every new version of a bundle start a new metric?
That's a good question 🤔 Naively I would think every new version (of a bundle, …) would start a new metric. Just like if you were to "name version" your task inside your cluster.
The latter is mutually exclusive with the former two. If users care about the performance of individual `TaskRun`s or `PipelineRun`s, they can set the `duration_seconds` type from histogram to gauge. If users care about overall performance, they can collect metrics at the namespace level.
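To illustrate the two modes described here, the sketch below shows how a duration measurement could be recorded either with a per-run tag (individual `TaskRun` performance, where a gauge is sufficient) or with only a `namespace` tag (the aggregated, namespace-level view). The tag keys, measure, and function names are hypothetical and only meant to show the difference in label cardinality.

```go
package metrics

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

var (
	taskRunDuration = stats.Float64("taskrun_duration_seconds", "TaskRun duration", "s")

	namespaceKey = tag.MustNewKey("namespace")
	taskRunKey   = tag.MustNewKey("taskrun")
)

// recordRunLevel keeps one series per TaskRun name: fine for a gauge, but a
// cardinality problem for a histogram.
func recordRunLevel(ctx context.Context, ns, taskRun string, d time.Duration) error {
	return stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(namespaceKey, ns), tag.Upsert(taskRunKey, taskRun)},
		taskRunDuration.M(d.Seconds()))
}

// recordNamespaceLevel drops the per-run tag, so all runs in a namespace
// share one set of histogram series.
func recordNamespaceLevel(ctx context.Context, ns string, d time.Duration) error {
	return stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(namespaceKey, ns)},
		taskRunDuration.M(d.Seconds()))
}
```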
NIT: Should this be in the alternative sections?
## Upgrade & Migration Strategy (optional)

<!--
Use this section to detail whether this feature needs an upgrade or
migration strategy. This is especially useful when we modify a
behavior or add a feature that may replace and deprecate a current one.
-->
I think this needs to be filled in.
Do we need to continue to support the current behaviour?
Will it be opt-in or opt-out? Does that comply with our stability policy?
From what I see in the docs, metrics are still experimental. We can support the current metrics for one release by deprecating them and then removing them.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, sbwsg
Carrying "Simplify Metrics" TEP (tektoncd#286) as this is something that we should tackle. Signed-off-by: Vincent Demeester <[email protected]>
Force-pushed from 4676fd4 to 78a6465
/lgtm
Carrying "Simplify Metrics" TEP (#286) as this is something that we should tackle.
I need to rephrase some stuff, but I am carrying this proposal as it is getting a higher priority for us to fix.
/kind tep
Signed-off-by: Vincent Demeester [email protected]