Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance metrics endpoint #195

Merged
merged 16 commits into from
Jan 4, 2019

Conversation

fsniper
Copy link
Contributor

@fsniper fsniper commented Nov 23, 2018

What this PR does / why we need it:
This PR is a WIP for #189 . Extending the metrics endpoint of mcm

Which issue(s) this PR fixes:
Fixes #189

Special notes for your reviewer:
This is a WIP. Any reviews/comments/criticism are welcome.

Release note:

Metrics endpoint is enhanced.

@fsniper fsniper requested review from ggaurav10 and a team as code owners November 23, 2018 12:06
@CLAassistant
Copy link

CLAassistant commented Nov 23, 2018

CLA assistant check
All committers have signed the CLA.

@prashanth26
Copy link
Contributor

Hi @fsniper ,

Thank you so much for the PR. Since it was a weekend, we haven't gotten a chance to look into it. We shall try to review it ASAP.

Thanks & Regards,
Prashanth

@fsniper
Copy link
Contributor Author

fsniper commented Nov 26, 2018

I am nearly finished with the MachineDeployment Metrics. But, I can't decide if exposing a failed_machines metric is necessary or not (for both MachineSets and MachineDeployments). What do you think of this?

* mcm_machine_deployment_[created|info|status_condition|condition|failed_machines]
* mcm_machine_set_failed_machines
@petersutter
Copy link
Member

hi @fsniper , one minor thing regarding the release note block: please remove the linebreak before improvement user and change it to lower case, thanks :)

you can also copy paste the text below and replace it with yours (btw. the header will disappear on Preview/ after editing)
```improvement user
Metrics endpoint is enhanced.
```

@fsniper
Copy link
Contributor Author

fsniper commented Nov 27, 2018

@petersutter Thank you, This is still a WIP and, I am in the process of adding cloud provider api call metrics.

@fsniper fsniper changed the title [WIP] Enhance metrics endpoint Enhance metrics endpoint Nov 28, 2018
@fsniper
Copy link
Contributor Author

fsniper commented Nov 28, 2018

I have added cloud api requests metrics. I could only test azure ones, so all the other drivers need testing.

Also I don't have much experience with these apis (gcp,alicloud,openstack,aws). It would be really great if the respected owners review, or better yet test them.

@prashanth26
Copy link
Contributor

Hi @fsniper ,

I took one quick look at the PR. The metrics for cloud provider was something I think should be split into the counts for different create/list/delete calls. It should be fine for now though, and overall looks acceptable to me.

@dkistner has more experience with adding metrics. Can you kindly have a second look at this?

Thanks & Regards,
Prashanth

@prashanth26
Copy link
Contributor

And regarding testing them for different providers, I shall definitely test them out before merging.

@fsniper
Copy link
Contributor Author

fsniper commented Nov 28, 2018

@prashanth26 adding another metric partition would solve that issue. I'll have a look later.

Copy link
Member

@dkistner dkistner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fsniper Thanks for the contribution.
I had only a short look and wrote some comments directly to the code. For a detailed look I need a little bit more time.

In general I would recommend to avoid to many labels for a metric. https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels

Name: "mcm_machine_deployment_failed_machines",
Help: "Information of the mcm managed Machinedeployments' failed machines.",
}, []string{"name", "namespace", "uid", "failed_Machine_name", "failed_machine_provider_id", "failed_machine_owner_ref",
"failed_Machine_last_operation_description", "failed_machine_last_operation_last_update_time", "failed_machine_last_operation_state",
Copy link
Member

@dkistner dkistner Nov 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we write the label names completely in lowercases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Name: "mcm_machine_deployment_status",
Help: "Information of the mcm managed Machinedeployments' status conditions.",
}, []string{"name", "namespace", "uid", "available_replicas", "unavailable_replicas", "ready_replicas",
"updated_replicas", "collision_count", "replicas"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labels are not intended to transport any kind of measurements. If you need for example a metric which expose the amount of replicas for a MachineDeployment then we should rather add a new metric for that e.g. mcm_machine_deployment_replicas or mcm_machine_deployment_available_replicas and those metrics should have labels like name, namespace and uid.

Keep in mind the Caution mentioned here: https://prometheus.io/docs/practices/naming/#labels

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I handled these correctly. You are right on every count. I think I was a bit rushing. I removed these labels, and put more metrics exposing these measurements.

Name: "mcm_machine_set_failed_machines",
Help: "Information of the mcm managed Machinesets' failed machines.",
}, []string{"name", "namespace", "uid", "failed_Machine_name", "failed_machine_provider_id", "failed_machine_owner_ref",
"failed_Machine_last_operation_description", "failed_machine_last_operation_last_update_time", "failed_machine_last_operation_state",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use labels to transport any kind of operation messages and try to avoid labels which can have many different values e.g. timestamps. In general labels should have a fixed value set, otherwise it would be very hard to query for those metrics in Prometheus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these.

Copy link
Member

@dkistner dkistner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fsniper for applying the feedback. I had now a more detailed look. Hope I did not oversee something. Could you kindly have another look?

Generally I would assume that we don't need the uid label for all metrics, because the combination of name and namespace makes the metric for Machines, MachineSets and MachineDeployments already unique.


// CollectMachines is method to collect Machine related metrics.
func (c *controller) CollectMachineControllerFrozenStatus(ch chan<- prometheus.Metric) {
frozen_status := 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not having the variable frozen_status directly as float64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"condition": string(condition.Type),
"status": string(condition.Status)}).Set(float64(status))

phase := 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phase could also be directly a float64

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"name": mMeta.Name,
"namespace": mMeta.Namespace,
"uid": string(mMeta.UID),
"phase": string(machine.Status.CurrentStatus.Phase)}).Set(float64(phase))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need the phase label? The current phase is represented by the values of the metric, right?

MachineCSPhase = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "mcm_machine_current_status_phase",
Help: "Current status phase of the Machines currently managed by the mcm.",
}, []string{"name", "namespace", "uid", "phase"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The phase is represented by the value of the metrics, therefore no need for the phase label.

MachineInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "mcm_machine_info",
Help: "Information of the Machines currently managed by the mcm.",
}, []string{"name", "namespace", "uid", "generation", "kind", "api_version",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need a metric for the generation we should also create a dedicated counter for that. Because those label value will not stay stable. With every new spec version of the machine object we will generate a new timeseries in Prometheus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if it's really needed. I am removing all generation labels for now. If we decide it's needed. I can add them as new metrics.

MachineDeploymentInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "mcm_machine_deployment_info",
Help: "Information of the Machinedeployments currently managed by the mcm.",
}, []string{"name", "namespace", "uid", "generation", "kind", "api_version", "spec_replicas", "spec_strategy_type",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for the generation. If needed, we should create a dedicated metric for that.

I would also create dedicated metrics for spec_paused e. g. mcm_machine_deployment_paused{"name", namespace, "uid"} -> 0=paused, 1=unpaused. It's the pretty similar for the other labels spec_strategy_rolling_update_max_surge, spec_strategy_rolling_update_max_unavailable, spec_min_ready_seconds, spec_min_ready_seconds.

I think the kind label is not required, because the resource kind is already expressed through the metric name "mcm_machine_deployment_*". I'm also not sure about the api_version label.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"spec_class_name": mSpec.Class.Name}).Set(float64(1))

for _, condition := range machine.Status.Conditions {
status := 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status could be directly a float64.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"uid": string(msMeta.UID)}).Set(float64(msSpec.MinReadySeconds))

for _, condition := range machineSet.Status.Conditions {
status := 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status could be directly a float64.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

metrics.MachineDeploymentInfo.With(infoLabels).Set(float64(1))

for _, condition := range machineDeployment.Status.Conditions {
status := 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

status could be directly a float64.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"spec_class_kind": mSpec.Class.Kind,
"spec_class_name": mSpec.Class.Name}).Set(float64(1))

for _, condition := range machine.Status.Conditions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conditions are handled same for Machines, MachineSets and MachineDeployments. Could we move that into a separate function to which we pass the metric, label information and the array of conditions. Then we do not need to duplicate code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that, but because all condition types are different structs (also not interfaced) it is not working. I tried accepting them as interface{} , but this also leads to a non-iterable.

The Status fields are not the same type. 1 of them is from k8s api, 2 of them are from mcm api.

So I don't think that's a good idea to invest much time into this.

Removed observation data
Splitted MachineSetStatus metric
used float64 types instead of using auto typing.
@hardikdr
Copy link
Member

hardikdr commented Dec 24, 2018

@dkistner thanks for the review, and @fsniper thanks for the contribution.
I am wondering if there is anything else to be discussed here,or could we otherwise merge it?

"failed_machine_last_operation_machine_operation_type"})

MachineSetStatusAvailableReplicas = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "mcm_machine_set_status_availabla_replicas",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: s/mcm_machine_set_status_availabla_replicas/mcm_machine_set_status_available_replicas/

Copy link
Member

@dkistner dkistner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @hardikdr for the dalay. A few small things are still open.
When those things are done then it looks good to me.

Here the list of open things. @fsniper Could you please have another look?

MachineDeploymentInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "mcm_machine_deployment_info",
Help: "Information of the Machinedeployments currently managed by the mcm.",
}, []string{"name", "namespace", "uid", "created", "spec_strategy_type"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename label.
s/created/createdAt/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

MachineInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "mcm_machine_info",
Help: "Information of the Machines currently managed by the mcm.",
}, []string{"name", "namespace", "uid", "created",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename label.
s/created/createdAt/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


MachineSetCountDesc = prometheus.NewDesc("mcm_machineset_items_total", "Count of machinesets currently managed by the mcm.", nil, nil)

MachineSetInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename label.
s/created/createdAt/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

used meta.Name instead of meta.labels[Name]
Removed uid labels
renamed created labels as createdAt
@prashanth26 prashanth26 added component/machine-controller-manager area/monitoring Monitoring (including availability monitoring and alerting) related reviewed/lgtm Has approval for merging size/s Size of pull request is small (see gardener-robot robot/bots/size.py) topology/seed Affects Seed clusters labels Jan 3, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 3, 2019
@@ -90,7 +92,9 @@ func (c *AlicloudDriver) Create() (string, string, error) {
response, err := client.RunInstances(request)
if err != nil {
return "", "", err
metrics.ApiFailedRequestCount.With(prometheus.Labels{"provider": "alicloud", "service": "ecs"}).Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you will have to move this statement above the return as mentioned by @dkistner. The lint checks are failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, cool catch. Fixing.

@prashanth26 prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019
@prashanth26
Copy link
Contributor

A few more lint suggestions. I apologize, the CI doesn't pass unless the lint checks are fixed.

./pkg/controller/metrics.go:62:24: should drop = 0 from declaration of var paused; it is the zero value
./pkg/controller/metrics.go:103:25: should drop = 0 from declaration of var status; it is the zero value
./pkg/controller/metrics.go:135:11: don't use underscores in Go names; range var failed_machine should be failedMachine
./pkg/controller/metrics.go:186:25: should drop = 0 from declaration of var status; it is the zero value
./pkg/controller/metrics.go:225:11: don't use underscores in Go names; range var failed_machine should be failedMachine
./pkg/controller/metrics.go:262:25: should drop = 0 from declaration of var status; it is the zero value
./pkg/controller/metrics.go:279:23: should drop = 0 from declaration of var phase; it is the zero value
./pkg/controller/metrics.go:312:6: don't use underscores in Go names; var frozen_status should be frozenStatus
./pkg/controller/metrics.go:312:30: should drop = 0 from declaration of var frozen_status; it is the zero value
Found 9 lint suggestions; failing.

@fsniper
Copy link
Contributor Author

fsniper commented Jan 4, 2019

How can I run the lint checks locally? Make file has a verify target, but it seems to be spesific for CI operations. I ran golint on directories I touched, this also lead to too many lint errors.

@prashanth26 prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019
@prashanth26
Copy link
Contributor

Hi @fsniper ,

There are still a few more lint checks. I can see at least 1 more. You can run lint checks locally by running make check. I apologize for the last minute lint issues.

./pkg/metrics/metrics.go:8:2: exported var MachineControllerFrozenDesc should have comment or be unexported
Found 1 lint suggestions; failing.

Thanks & Regards,
Prashanth

@prashanth26 prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jan 4, 2019
Copy link
Contributor

@prashanth26 prashanth26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks once again @fsniper for the PR. We see that all tests are passing now. LGTM.

@prashanth26 prashanth26 merged commit 5554b50 into gardener:master Jan 4, 2019
@ghost ghost added the component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) label Mar 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) kind/enhancement Enhancement, improvement, extension needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging size/s Size of pull request is small (see gardener-robot robot/bots/size.py) status/accepted Issue was accepted as something we need to work on topology/seed Affects Seed clusters
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance metrics endpoint
7 participants