-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance metrics endpoint #195
Enhance metrics endpoint #195
Conversation
Removed UID labels for cardinality, swithed labels to snake_case
Hi @fsniper , Thank you so much for the PR. Since it was a weekend, we haven't gotten a chance to look into it. We shall try to review it ASAP. Thanks & Regards, |
I am nearly finished with the MachineDeployment Metrics. But, I can't decide if exposing a failed_machines metric is necessary or not (for both MachineSets and MachineDeployments). What do you think of this? |
* mcm_machine_deployment_[created|info|status_condition|condition|failed_machines] * mcm_machine_set_failed_machines
hi @fsniper , one minor thing regarding the release note block: please remove the linebreak before you can also copy paste the text below and replace it with yours (btw. the header will disappear on |
@petersutter Thank you, This is still a WIP and, I am in the process of adding cloud provider api call metrics. |
I have added cloud api requests metrics. I could only test azure ones, so all the other drivers need testing. Also I don't have much experience with these apis (gcp,alicloud,openstack,aws). It would be really great if the respected owners review, or better yet test them. |
Hi @fsniper , I took one quick look at the PR. The metrics for cloud provider was something I think should be split into the counts for different create/list/delete calls. It should be fine for now though, and overall looks acceptable to me. @dkistner has more experience with adding metrics. Can you kindly have a second look at this? Thanks & Regards, |
And regarding testing them for different providers, I shall definitely test them out before merging. |
@prashanth26 adding another metric partition would solve that issue. I'll have a look later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fsniper Thanks for the contribution.
I had only a short look and wrote some comments directly to the code. For a detailed look I need a little bit more time.
In general I would recommend to avoid to many labels for a metric. https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
pkg/metrics/metrics.go
Outdated
Name: "mcm_machine_deployment_failed_machines", | ||
Help: "Information of the mcm managed Machinedeployments' failed machines.", | ||
}, []string{"name", "namespace", "uid", "failed_Machine_name", "failed_machine_provider_id", "failed_machine_owner_ref", | ||
"failed_Machine_last_operation_description", "failed_machine_last_operation_last_update_time", "failed_machine_last_operation_state", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we write the label names completely in lowercases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
pkg/metrics/metrics.go
Outdated
Name: "mcm_machine_deployment_status", | ||
Help: "Information of the mcm managed Machinedeployments' status conditions.", | ||
}, []string{"name", "namespace", "uid", "available_replicas", "unavailable_replicas", "ready_replicas", | ||
"updated_replicas", "collision_count", "replicas"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Labels are not intended to transport any kind of measurements. If you need for example a metric which expose the amount of replicas for a MachineDeployment
then we should rather add a new metric for that e.g. mcm_machine_deployment_replicas
or mcm_machine_deployment_available_replicas
and those metrics should have labels like name
, namespace
and uid
.
Keep in mind the Caution mentioned here: https://prometheus.io/docs/practices/naming/#labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I handled these correctly. You are right on every count. I think I was a bit rushing. I removed these labels, and put more metrics exposing these measurements.
pkg/metrics/metrics.go
Outdated
Name: "mcm_machine_set_failed_machines", | ||
Help: "Information of the mcm managed Machinesets' failed machines.", | ||
}, []string{"name", "namespace", "uid", "failed_Machine_name", "failed_machine_provider_id", "failed_machine_owner_ref", | ||
"failed_Machine_last_operation_description", "failed_machine_last_operation_last_update_time", "failed_machine_last_operation_state", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not use labels to transport any kind of operation messages and try to avoid labels which can have many different values e.g. timestamps. In general labels should have a fixed value set, otherwise it would be very hard to query for those metrics in Prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fsniper for applying the feedback. I had now a more detailed look. Hope I did not oversee something. Could you kindly have another look?
Generally I would assume that we don't need the uid
label for all metrics, because the combination of name
and namespace
makes the metric for Machines
, MachineSets
and MachineDeployments
already unique.
pkg/controller/metrics.go
Outdated
|
||
// CollectMachines is method to collect Machine related metrics. | ||
func (c *controller) CollectMachineControllerFrozenStatus(ch chan<- prometheus.Metric) { | ||
frozen_status := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not having the variable frozen_status
directly as float64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/controller/metrics.go
Outdated
"condition": string(condition.Type), | ||
"status": string(condition.Status)}).Set(float64(status)) | ||
|
||
phase := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
phase
could also be directly a float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/controller/metrics.go
Outdated
"name": mMeta.Name, | ||
"namespace": mMeta.Namespace, | ||
"uid": string(mMeta.UID), | ||
"phase": string(machine.Status.CurrentStatus.Phase)}).Set(float64(phase)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need the phase
label? The current phase is represented by the values of the metric, right?
pkg/metrics/metrics.go
Outdated
MachineCSPhase = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
Name: "mcm_machine_current_status_phase", | ||
Help: "Current status phase of the Machines currently managed by the mcm.", | ||
}, []string{"name", "namespace", "uid", "phase"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The phase is represented by the value of the metrics, therefore no need for the phase
label.
pkg/metrics/metrics.go
Outdated
MachineInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
Name: "mcm_machine_info", | ||
Help: "Information of the Machines currently managed by the mcm.", | ||
}, []string{"name", "namespace", "uid", "generation", "kind", "api_version", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we need a metric for the generation
we should also create a dedicated counter for that. Because those label value will not stay stable. With every new spec version of the machine object we will generate a new timeseries in Prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if it's really needed. I am removing all generation labels for now. If we decide it's needed. I can add them as new metrics.
pkg/metrics/metrics.go
Outdated
MachineDeploymentInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
Name: "mcm_machine_deployment_info", | ||
Help: "Information of the Machinedeployments currently managed by the mcm.", | ||
}, []string{"name", "namespace", "uid", "generation", "kind", "api_version", "spec_replicas", "spec_strategy_type", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the generation
. If needed, we should create a dedicated metric for that.
I would also create dedicated metrics for spec_paused
e. g. mcm_machine_deployment_paused{"name", namespace, "uid"} -> 0=paused, 1=unpaused
. It's the pretty similar for the other labels spec_strategy_rolling_update_max_surge
, spec_strategy_rolling_update_max_unavailable
, spec_min_ready_seconds
, spec_min_ready_seconds
.
I think the kind
label is not required, because the resource kind is already expressed through the metric name "mcm_machine_deployment_*". I'm also not sure about the api_version
label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/controller/metrics.go
Outdated
"spec_class_name": mSpec.Class.Name}).Set(float64(1)) | ||
|
||
for _, condition := range machine.Status.Conditions { | ||
status := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status
could be directly a float64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/controller/metrics.go
Outdated
"uid": string(msMeta.UID)}).Set(float64(msSpec.MinReadySeconds)) | ||
|
||
for _, condition := range machineSet.Status.Conditions { | ||
status := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status
could be directly a float64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/controller/metrics.go
Outdated
metrics.MachineDeploymentInfo.With(infoLabels).Set(float64(1)) | ||
|
||
for _, condition := range machineDeployment.Status.Conditions { | ||
status := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status
could be directly a float64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
"spec_class_kind": mSpec.Class.Kind, | ||
"spec_class_name": mSpec.Class.Name}).Set(float64(1)) | ||
|
||
for _, condition := range machine.Status.Conditions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conditions are handled same for Machines
, MachineSets
and MachineDeployments
. Could we move that into a separate function to which we pass the metric, label information and the array of conditions. Then we do not need to duplicate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that, but because all condition types are different structs (also not interfaced) it is not working. I tried accepting them as interface{} , but this also leads to a non-iterable.
The Status fields are not the same type. 1 of them is from k8s api, 2 of them are from mcm api.
So I don't think that's a good idea to invest much time into this.
Removed observation data Splitted MachineSetStatus metric used float64 types instead of using auto typing.
pkg/metrics/metrics.go
Outdated
"failed_machine_last_operation_machine_operation_type"}) | ||
|
||
MachineSetStatusAvailableReplicas = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
Name: "mcm_machine_set_status_availabla_replicas", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: s/mcm_machine_set_status_availabla_replicas/mcm_machine_set_status_available_replicas/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @hardikdr for the dalay. A few small things are still open.
When those things are done then it looks good to me.
Here the list of open things. @fsniper Could you please have another look?
-
Many metrics use as value for the
name
label the value ofmetadata.labels["name"]
, but those kubernetes labels does not exists on the respective objects. We should usemetadata.name
instead. -
We should remove the
uid
label from all metrics. Think this it is not required, because the combination ofmetadata.name
+metadata.namespace
should be already unique. -
We have a typo in a metric name. See here: https://github.com/gardener/machine-controller-manager/pull/195/files#diff-7cbe8e056d62a2de30c7066e359bd9c9R58
-
Please rename the label
created
tocreatedAt
. See here:- https://github.com/gardener/machine-controller-manager/pull/195/files#diff-7cbe8e056d62a2de30c7066e359bd9c9R19
- https://github.com/gardener/machine-controller-manager/pull/195/files#diff-7cbe8e056d62a2de30c7066e359bd9c9R32
- https://github.com/gardener/machine-controller-manager/pull/195/files#diff-7cbe8e056d62a2de30c7066e359bd9c9R81
pkg/metrics/metrics.go
Outdated
MachineDeploymentInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
Name: "mcm_machine_deployment_info", | ||
Help: "Information of the Machinedeployments currently managed by the mcm.", | ||
}, []string{"name", "namespace", "uid", "created", "spec_strategy_type"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename label.
s/created/createdAt/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/metrics/metrics.go
Outdated
MachineInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
Name: "mcm_machine_info", | ||
Help: "Information of the Machines currently managed by the mcm.", | ||
}, []string{"name", "namespace", "uid", "created", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename label.
s/created/createdAt/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
MachineSetCountDesc = prometheus.NewDesc("mcm_machineset_items_total", "Count of machinesets currently managed by the mcm.", nil, nil) | ||
|
||
MachineSetInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename label.
s/created/createdAt/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
used meta.Name instead of meta.labels[Name] Removed uid labels renamed created labels as createdAt
pkg/driver/driver_alicloud.go
Outdated
@@ -90,7 +92,9 @@ func (c *AlicloudDriver) Create() (string, string, error) { | |||
response, err := client.RunInstances(request) | |||
if err != nil { | |||
return "", "", err | |||
metrics.ApiFailedRequestCount.With(prometheus.Labels{"provider": "alicloud", "service": "ecs"}).Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you will have to move this statement above the return as mentioned by @dkistner. The lint checks are failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, cool catch. Fixing.
A few more lint suggestions. I apologize, the CI doesn't pass unless the lint checks are fixed.
|
How can I run the lint checks locally? Make file has a verify target, but it seems to be spesific for CI operations. I ran golint on directories I touched, this also lead to too many lint errors. |
Hi @fsniper , There are still a few more lint checks. I can see at least 1 more. You can run lint checks locally by running
Thanks & Regards, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks once again @fsniper for the PR. We see that all tests are passing now. LGTM.
What this PR does / why we need it:
This PR is a WIP for #189 . Extending the metrics endpoint of mcm
Which issue(s) this PR fixes:
Fixes #189
Special notes for your reviewer:
This is a WIP. Any reviews/comments/criticism are welcome.
Release note: