Enhance metrics endpoint #195

fsniper · 2018-11-23T12:06:44Z

What this PR does / why we need it:
This PR is a WIP for #189 . Extending the metrics endpoint of mcm

Which issue(s) this PR fixes:
Fixes #189

Special notes for your reviewer:
This is a WIP. Any reviews/comments/criticism are welcome.

Release note:

Metrics endpoint is enhanced.

…eated

Removed UID labels for cardinality, switched labels to snake_case

CLAassistant · 2018-11-23T12:06:51Z

All committers have signed the CLA.

prashanth26 · 2018-11-25T16:39:21Z

Hi @fsniper ,

Thank you so much for the PR. Since it was a weekend, we haven't gotten a chance to look into it. We shall try to review it ASAP.

Thanks & Regards,
Prashanth

fsniper · 2018-11-26T13:28:23Z

I am nearly finished with the MachineDeployment Metrics. But, I can't decide if exposing a failed_machines metric is necessary or not (for both MachineSets and MachineDeployments). What do you think of this?

* mcm_machine_deployment_[created|info|status_condition|condition|failed_machines]
* mcm_machine_set_failed_machines

petersutter · 2018-11-27T22:56:51Z

hi @fsniper , one minor thing regarding the release note block: please remove the linebreak before improvement user and change it to lower case, thanks :)

you can also copy paste the text below and replace it with yours (btw. the header will disappear on Preview/ after editing)
```improvement user
Metrics endpoint is enhanced.
```

fsniper · 2018-11-27T23:04:24Z

@petersutter Thank you. This is still a WIP, and I am in the process of adding cloud provider API call metrics.

fsniper · 2018-11-28T15:05:20Z

I have added cloud API request metrics. I could only test the Azure ones, so all the other drivers need testing.

Also, I don't have much experience with these APIs (GCP, Alicloud, OpenStack, AWS). It would be really great if the respective owners could review them or, better yet, test them.

prashanth26 · 2018-11-28T16:27:37Z

Hi @fsniper ,

I took a quick look at the PR. I think the cloud provider metrics should be split into separate counts for the different create/list/delete calls. It should be fine for now though, and overall it looks acceptable to me.

@dkistner has more experience with adding metrics. Can you kindly have a second look at this?

Thanks & Regards,
Prashanth

prashanth26 · 2018-11-28T16:29:04Z

And regarding testing them for different providers, I shall definitely test them out before merging.

fsniper · 2018-11-28T16:52:38Z

@prashanth26 adding another metric partition would solve that issue. I'll have a look later.

dkistner

@fsniper Thanks for the contribution.
I had only a short look and wrote some comments directly on the code. For a detailed look I need a little more time.

In general I would recommend avoiding too many labels for a metric. https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels

dkistner · 2018-11-30T15:19:59Z

pkg/metrics/metrics.go

+		Name: "mcm_machine_deployment_failed_machines",
+		Help: "Information of the mcm managed Machinedeployments' failed machines.",
+	}, []string{"name", "namespace", "uid", "failed_Machine_name", "failed_machine_provider_id", "failed_machine_owner_ref",
+		"failed_Machine_last_operation_description", "failed_machine_last_operation_last_update_time", "failed_machine_last_operation_state",


Could we write the label names completely in lowercase?

dkistner · 2018-11-30T15:26:30Z

pkg/metrics/metrics.go

+		Name: "mcm_machine_deployment_status",
+		Help: "Information of the mcm managed Machinedeployments' status conditions.",
+	}, []string{"name", "namespace", "uid", "available_replicas", "unavailable_replicas", "ready_replicas",
+		"updated_replicas", "collision_count", "replicas"})


Labels are not intended to transport any kind of measurement. If you need, for example, a metric which exposes the number of replicas for a MachineDeployment, then we should rather add a new metric for that, e.g. mcm_machine_deployment_replicas or mcm_machine_deployment_available_replicas, and those metrics should have labels like name, namespace and uid.

Keep in mind the Caution mentioned here: https://prometheus.io/docs/practices/naming/#labels

I hope I handled these correctly. You are right on every count. I think I was rushing a bit. I removed these labels and added more metrics exposing these measurements.
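To illustrate the point about measurements belonging in the sample value rather than in labels, here is a minimal sketch. It deliberately avoids the Prometheus client library so that it stays self-contained; the `sample` type and the `replicaSamples` helper are hypothetical and not part of this PR. Only the two metric names are taken from the review comment above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is a hypothetical stand-in for one Prometheus sample:
// the labels identify the object, the value carries the measurement.
type sample struct {
	name   string
	labels map[string]string
	value  float64
}

// format renders the sample in the style of the Prometheus text format,
// with label names sorted so the output is deterministic.
func (s sample) format() string {
	keys := make([]string, 0, len(s.labels))
	for k := range s.labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", s.name, strings.Join(pairs, ","), s.value)
}

// replicaSamples emits one sample per measurement instead of packing
// the replica counts into label values.
func replicaSamples(name, namespace string, replicas, available int32) []sample {
	id := func() map[string]string {
		return map[string]string{"name": name, "namespace": namespace}
	}
	return []sample{
		{"mcm_machine_deployment_replicas", id(), float64(replicas)},
		{"mcm_machine_deployment_available_replicas", id(), float64(available)},
	}
}

func main() {
	for _, s := range replicaSamples("worker", "default", 3, 2) {
		fmt.Println(s.format())
	}
}
```

With this shape the label set stays constant for a given MachineDeployment, and only the sample values change when replicas are added or removed.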

dkistner · 2018-11-30T15:32:10Z

pkg/metrics/metrics.go

+		Name: "mcm_machine_set_failed_machines",
+		Help: "Information of the mcm managed Machinesets' failed machines.",
+	}, []string{"name", "namespace", "uid", "failed_Machine_name", "failed_machine_provider_id", "failed_machine_owner_ref",
+		"failed_Machine_last_operation_description", "failed_machine_last_operation_last_update_time", "failed_machine_last_operation_state",


Please do not use labels to transport any kind of operation messages, and try to avoid labels which can have many different values, e.g. timestamps. In general, labels should have a fixed set of values; otherwise it would be very hard to query for those metrics in Prometheus.

Removed these.
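The cardinality concern can be made concrete with a small simulation. This is only a sketch under stated assumptions: it models each distinct label value as its own time series, and the state names `Processing` and `Failed` are hypothetical placeholders for a label with a fixed value set.

```go
package main

import (
	"fmt"
	"time"
)

// seriesCount returns the number of distinct values seen for one label.
// Each distinct label value becomes its own time series in Prometheus.
func seriesCount(labelValues []string) int {
	seen := make(map[string]struct{}, len(labelValues))
	for _, v := range labelValues {
		seen[v] = struct{}{}
	}
	return len(seen)
}

// simulate models n updates of one failed machine and reports how many
// series a timestamp label and a state label would each produce.
func simulate(n int) (timestampSeries, stateSeries int) {
	// Hypothetical fixed value set for an operation state label.
	states := []string{"Processing", "Failed"}
	start := time.Date(2018, 11, 30, 0, 0, 0, 0, time.UTC)

	var timestampValues, stateValues []string
	for i := 0; i < n; i++ {
		ts := start.Add(time.Duration(i) * time.Minute)
		timestampValues = append(timestampValues, ts.Format(time.RFC3339))
		stateValues = append(stateValues, states[i%len(states)])
	}
	return seriesCount(timestampValues), seriesCount(stateValues)
}

func main() {
	ts, st := simulate(100)
	fmt.Printf("timestamp label: %d series, state label: %d series\n", ts, st)
}
```

The timestamp label grows by one series per update, while the state label stays bounded by the size of its value set.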

dkistner

Thanks @fsniper for applying the feedback. I have now had a more detailed look. I hope I did not overlook anything. Could you kindly have another look?

Generally I would assume that we don't need the uid label for all metrics, because the combination of name and namespace makes the metric for Machines, MachineSets and MachineDeployments already unique.

dkistner · 2018-12-04T07:12:06Z

pkg/controller/metrics.go

+
+// CollectMachines is method to collect Machine related metrics.
+func (c *controller) CollectMachineControllerFrozenStatus(ch chan<- prometheus.Metric) {
+	frozen_status := 0


Why not declare the variable frozen_status directly as float64?

dkistner · 2018-12-04T07:21:27Z

pkg/controller/metrics.go

+				"condition": string(condition.Type),
+				"status":    string(condition.Status)}).Set(float64(status))
+
+			phase := 0


phase could also be directly a float64

dkistner · 2018-12-04T07:22:12Z

pkg/controller/metrics.go

+				"name":      mMeta.Name,
+				"namespace": mMeta.Namespace,
+				"uid":       string(mMeta.UID),
+				"phase":     string(machine.Status.CurrentStatus.Phase)}).Set(float64(phase))


Why do we need the phase label? The current phase is represented by the value of the metric, right?

dkistner · 2018-12-04T07:25:47Z

pkg/metrics/metrics.go

+	MachineCSPhase = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+		Name: "mcm_machine_current_status_phase",
+		Help: "Current status phase of the Machines currently managed by the mcm.",
+	}, []string{"name", "namespace", "uid", "phase"})


The phase is represented by the value of the metric, so there is no need for the phase label.
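A sketch of what "the phase is represented by the value" can look like follows. The phase names and the numeric encoding here are assumptions chosen for illustration; the PR's actual mapping is not shown in this thread.

```go
package main

import "fmt"

// phaseValue maps a machine phase to a gauge value.
// Both the phase names and the numbers are hypothetical.
func phaseValue(phase string) float64 {
	switch phase {
	case "Running":
		return 1
	case "Pending":
		return 0
	case "Terminating":
		return -1
	case "Failed":
		return -2
	default:
		// Anything unrecognized is reported as unknown.
		return -3
	}
}

func main() {
	for _, p := range []string{"Pending", "Running", "Failed", "SomethingElse"} {
		fmt.Printf("%s -> %g\n", p, phaseValue(p))
	}
}
```

Because the value already encodes the phase, the series is identified by name and namespace alone, and a phase change updates one series instead of creating a new one.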

dkistner · 2018-12-04T07:28:58Z

pkg/metrics/metrics.go

+	MachineInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+		Name: "mcm_machine_info",
+		Help: "Information of the Machines currently managed by the mcm.",
+	}, []string{"name", "namespace", "uid", "generation", "kind", "api_version",


If we need a metric for the generation, we should also create a dedicated counter for that, because those label values will not stay stable. With every new spec version of the machine object we would generate a new time series in Prometheus.

I am not sure if it's really needed. I am removing all generation labels for now. If we decide it's needed, I can add them as new metrics.

dkistner · 2018-12-04T08:13:18Z

pkg/metrics/metrics.go

+	MachineDeploymentInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+		Name: "mcm_machine_deployment_info",
+		Help: "Information of the Machinedeployments currently managed by the mcm.",
+	}, []string{"name", "namespace", "uid", "generation", "kind", "api_version", "spec_replicas", "spec_strategy_type",


Same for the generation. If needed, we should create a dedicated metric for that.

I would also create dedicated metrics for spec_paused, e.g. mcm_machine_deployment_paused{"name", "namespace", "uid"} -> 0=paused, 1=unpaused. It's pretty similar for the other labels spec_strategy_rolling_update_max_surge, spec_strategy_rolling_update_max_unavailable and spec_min_ready_seconds.

I think the kind label is not required, because the resource kind is already expressed through the metric name "mcm_machine_deployment_*". I'm also not sure about the api_version label.

dkistner · 2018-12-04T08:36:08Z

pkg/controller/metrics.go

+			"spec_class_name":      mSpec.Class.Name}).Set(float64(1))
+
+		for _, condition := range machine.Status.Conditions {
+			status := 0


status could be directly a float64.

dkistner · 2018-12-04T08:36:48Z

pkg/controller/metrics.go

+			"uid":       string(msMeta.UID)}).Set(float64(msSpec.MinReadySeconds))
+
+		for _, condition := range machineSet.Status.Conditions {
+			status := 0


status could be directly a float64.

dkistner · 2018-12-04T08:37:23Z

pkg/controller/metrics.go

+		metrics.MachineDeploymentInfo.With(infoLabels).Set(float64(1))
+
+		for _, condition := range machineDeployment.Status.Conditions {
+			status := 0


status could be directly a float64.

dkistner · 2018-12-04T08:40:26Z

pkg/controller/metrics.go

+			"spec_class_kind":      mSpec.Class.Kind,
+			"spec_class_name":      mSpec.Class.Name}).Set(float64(1))
+
+		for _, condition := range machine.Status.Conditions {


The conditions are handled the same way for Machines, MachineSets and MachineDeployments. Could we move that into a separate function to which we pass the metric, the label information and the array of conditions? Then we would not need to duplicate code.

I tried that, but because all the condition types are different structs (and share no interface), it does not work. I tried accepting them as interface{}, but that also leads to a non-iterable value.

The Status fields are not the same type: one of them is from the k8s API, two of them are from the mcm API.

So I don't think it's a good idea to invest much time in this.
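For readers weighing the duplication against the type mismatch, one possible middle ground is to convert each condition slice into a small shared view before handing it to a common helper. This is a sketch only and was not part of the PR: every type and function name below is hypothetical, and the encoding of a "True" status as 1 and everything else as 0 is an assumption.

```go
package main

import "fmt"

// Hypothetical stand-ins for two unrelated condition struct types,
// e.g. one from the k8s API and one from the mcm API.
type nodeCondition struct{ Type, Status string }
type machineSetCondition struct{ Type, Status string }

// condition is the shared view that the common helper works on.
type condition struct{ Type, Status string }

// conditionValues returns a gauge value per condition type:
// 1 when the status is "True", 0 otherwise (assumed encoding).
func conditionValues(conds []condition) map[string]float64 {
	out := make(map[string]float64, len(conds))
	for _, c := range conds {
		var status float64
		if c.Status == "True" {
			status = 1
		}
		out[c.Type] = status
	}
	return out
}

// fromNode adapts the first condition type to the shared view.
func fromNode(in []nodeCondition) []condition {
	out := make([]condition, 0, len(in))
	for _, c := range in {
		out = append(out, condition{Type: c.Type, Status: c.Status})
	}
	return out
}

// fromMachineSet adapts the second condition type to the shared view.
func fromMachineSet(in []machineSetCondition) []condition {
	out := make([]condition, 0, len(in))
	for _, c := range in {
		out = append(out, condition{Type: c.Type, Status: c.Status})
	}
	return out
}

func main() {
	nodes := []nodeCondition{{"Ready", "True"}, {"DiskPressure", "False"}}
	sets := []machineSetCondition{{"Available", "True"}}
	fmt.Println(conditionValues(fromNode(nodes)))
	fmt.Println(conditionValues(fromMachineSet(sets)))
}
```

The trade-off is one small adapter per type in exchange for a single place that decides how a condition becomes a number; whether that is worth it for three call sites is a judgment call, as the reply above notes.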

Removed observation data
Split MachineSetStatus metric
Used float64 types instead of using auto typing

hardikdr · 2018-12-24T09:02:32Z

@dkistner thanks for the review, and @fsniper thanks for the contribution.
I am wondering if there is anything else to be discussed here, or could we otherwise merge it?

dkistner · 2019-01-02T08:20:45Z

pkg/metrics/metrics.go

+		"failed_machine_last_operation_machine_operation_type"})
+
+	MachineSetStatusAvailableReplicas = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+		Name: "mcm_machine_set_status_availabla_replicas",


Typo: s/mcm_machine_set_status_availabla_replicas/mcm_machine_set_status_available_replicas/

dkistner

Sorry @hardikdr for the delay. A few small things are still open.
When those things are done, it looks good to me.

Here is the list of open things. @fsniper Could you please have another look?

* Many metrics use the value of metadata.labels["name"] as the value for the name label, but those Kubernetes labels do not exist on the respective objects. We should use metadata.name instead.
* We should remove the uid label from all metrics. I think it is not required, because the combination of metadata.name + metadata.namespace should already be unique.
* We have a typo in a metric name. See here: https://github.com/gardener/machine-controller-manager/pull/195/files#diff-7cbe8e056d62a2de30c7066e359bd9c9R58
* Please rename the label created to createdAt. See here:

dkistner · 2019-01-02T09:34:22Z

pkg/metrics/metrics.go

+	MachineDeploymentInfo      = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+		Name: "mcm_machine_deployment_info",
+		Help: "Information of the Machinedeployments currently managed by the mcm.",
+	}, []string{"name", "namespace", "uid", "created", "spec_strategy_type"})


Please rename label.
s/created/createdAt/

dkistner · 2019-01-02T09:34:54Z

pkg/metrics/metrics.go

+	MachineInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
+		Name: "mcm_machine_info",
+		Help: "Information of the Machines currently managed by the mcm.",
+	}, []string{"name", "namespace", "uid", "created",


Please rename label.
s/created/createdAt/

dkistner · 2019-01-02T09:35:32Z

pkg/metrics/metrics.go

+
+	MachineSetCountDesc = prometheus.NewDesc("mcm_machineset_items_total", "Count of machinesets currently managed by the mcm.", nil, nil)
+
+	MachineSetInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{


Please rename label.
s/created/createdAt/

Used meta.Name instead of meta.labels[Name]
Removed uid labels
Renamed created labels as createdAt
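The difference between the object name and a label that happens to be called "name" can be shown with a short sketch. The `objectMeta` type below is a hypothetical, trimmed-down stand-in for Kubernetes object metadata, and `identityLabels` is an illustrative helper rather than code from the PR.

```go
package main

import "fmt"

// objectMeta is a hypothetical, reduced stand-in for Kubernetes metadata.
type objectMeta struct {
	Name      string
	Namespace string
	Labels    map[string]string
}

// identityLabels builds the metric labels from metadata.name and
// metadata.namespace, which together identify the object.
func identityLabels(m objectMeta) map[string]string {
	return map[string]string{
		"name":      m.Name,
		"namespace": m.Namespace,
	}
}

func main() {
	m := objectMeta{
		Name:      "machine-a",
		Namespace: "default",
		Labels:    map[string]string{"role": "worker"},
	}

	// Looking up a missing key in a Go map yields the zero value,
	// so this prints an empty string rather than failing.
	fmt.Printf("labels[\"name\"] = %q\n", m.Labels["name"])
	fmt.Println(identityLabels(m))
}
```

A missing Kubernetes label therefore does not raise an error; it silently produces an empty name label, which is why the review asked for metadata.name.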

prashanth26 · 2019-01-03T16:04:17Z

pkg/driver/driver_alicloud.go

@@ -90,7 +92,9 @@ func (c *AlicloudDriver) Create() (string, string, error) {
 	response, err := client.RunInstances(request)
 	if err != nil {
 		return "", "", err
+		metrics.ApiFailedRequestCount.With(prometheus.Labels{"provider": "alicloud", "service": "ecs"}).Inc()


I think you will have to move this statement above the return as mentioned by @dkistner. The lint checks are failing.

Oh, cool catch. Fixing.

prashanth26 · 2019-01-04T08:31:30Z

A few more lint suggestions. I apologize, the CI doesn't pass unless the lint checks are fixed.

./pkg/controller/metrics.go:62:24: should drop = 0 from declaration of var paused; it is the zero value
./pkg/controller/metrics.go:103:25: should drop = 0 from declaration of var status; it is the zero value
./pkg/controller/metrics.go:135:11: don't use underscores in Go names; range var failed_machine should be failedMachine
./pkg/controller/metrics.go:186:25: should drop = 0 from declaration of var status; it is the zero value
./pkg/controller/metrics.go:225:11: don't use underscores in Go names; range var failed_machine should be failedMachine
./pkg/controller/metrics.go:262:25: should drop = 0 from declaration of var status; it is the zero value
./pkg/controller/metrics.go:279:23: should drop = 0 from declaration of var phase; it is the zero value
./pkg/controller/metrics.go:312:6: don't use underscores in Go names; var frozen_status should be frozenStatus
./pkg/controller/metrics.go:312:30: should drop = 0 from declaration of var frozen_status; it is the zero value
Found 9 lint suggestions; failing.
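The lint findings above come down to two Go conventions: rely on the zero value instead of writing `= 0`, and use camelCase rather than underscores in names. A short sketch of a declaration that satisfies both, and that is a float64 from the start as suggested earlier in the review, might look like this. The `frozenValue` helper and the choice of 1 for "frozen" are assumptions for illustration.

```go
package main

import "fmt"

// frozenValue returns the gauge value for the controller's frozen state.
// The encoding (1 = frozen, 0 = not frozen) is an assumption.
func frozenValue(frozen bool) float64 {
	// camelCase name, declared as float64, relying on the zero value (0).
	var frozenStatus float64
	if frozen {
		frozenStatus = 1
	}
	return frozenStatus
}

func main() {
	fmt.Println(frozenValue(false), frozenValue(true))
}
```

Declaring the variable as float64 also removes the need for a `float64(...)` conversion at the point where the gauge is set.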

fsniper · 2019-01-04T08:51:11Z

How can I run the lint checks locally? The Makefile has a verify target, but it seems to be specific to CI operations. I ran golint on the directories I touched, but this also led to too many lint errors.

prashanth26 · 2019-01-04T10:05:53Z

Hi @fsniper ,

There are still a few more lint checks. I can see at least 1 more. You can run lint checks locally by running make check. I apologize for the last minute lint issues.

./pkg/metrics/metrics.go:8:2: exported var MachineControllerFrozenDesc should have comment or be unexported
Found 1 lint suggestions; failing.

Thanks & Regards,
Prashanth

prashanth26

Thanks once again @fsniper for the PR. We see that all tests are passing now. LGTM.

fsniper added 3 commits November 22, 2018 16:38

Added mcm_[machineset|machinedeployments]_items_total, mcm_machine_cr…

88bece5

…eated

added mcm_machine_[info|status_condition] metrics

b3c921d

added mcm_machine_current_status_phase

884c8e1

Removed UID labels for cardinality, switched labels to snake_case

fsniper requested review from ggaurav10 and a team as code owners November 23, 2018 12:06

added mcm_machine_set_[created|info|status|status_condition] metrics

1b49e62

fsniper added 2 commits November 26, 2018 15:15

Added more metrics

a99468e

* mcm_machine_deployment_[created|info|status_condition|condition|failed_machines]
* mcm_machine_set_failed_machines

added mcm_machine_controller_frozen metric

514e820

fsniper added 2 commits November 28, 2018 15:00

Moved metric declarations to their own package

332a404

Added mcm_cloud_api_[|failed_]requests metric to drivers

d5d8cd8

fsniper changed the title ~~[WIP] Enhance metrics endpoint~~ Enhance metrics endpoint Nov 28, 2018

dkistner reviewed Nov 30, 2018


Changed machine[set|deployment] measure related labels into metrics

48c5b29

dkistner requested changes Dec 4, 2018


Updated the code according to Dominic Kistner's review

1ff7169

Removed observation data
Split MachineSetStatus metric
Used float64 types instead of using auto typing

dkistner reviewed Jan 2, 2019


Merged origin/master

da76a59

dkistner requested changes Jan 2, 2019


fsniper added 2 commits January 2, 2019 09:44

fixed typo

a91c167

Review changes

a674992

Used meta.Name instead of meta.labels[Name]
Removed uid labels
Renamed created labels as createdAt

gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 3, 2019

prashanth26 reviewed Jan 3, 2019


Fixed unreachable code

69e00a1

prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019

gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019

Fix some linting issues

74ef405

prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019

gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019

Added comments for export vars

f496898

prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019

gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019

prashanth26 approved these changes Jan 4, 2019


prashanth26 merged commit 5554b50 into gardener:master Jan 4, 2019

ghost added the component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) label Mar 7, 2020

