Add support for custom Prometheus metrics. #137

Merged
merged 1 commit into from
Aug 22, 2018

dkistner (Member)

What this PR does / why we need it:
Adds the code needed to integrate custom Prometheus metrics into the mcm.
As a first step, it adds a basic metric for the number of managed machines.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
Check out, build, and run the mcm locally. Then curl its metrics endpoint and grep for metrics prefixed with 'mcm_':
curl localhost:10258/metrics | grep mcm_
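For reviewers who prefer to script this check, the curl-and-grep step can be sketched in Go. This is a hedged illustration, not part of the PR: it assumes the endpoint is plain HTTP on localhost:10258 as in the command above, and the helper names `grepLines` and `fetchMetrics` are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// grepLines keeps only the lines of text that contain pattern,
// mirroring `grep <pattern>` on the scraped payload.
func grepLines(text, pattern string) []string {
	var out []string
	for _, line := range strings.Split(text, "\n") {
		if strings.Contains(line, pattern) {
			out = append(out, line)
		}
	}
	return out
}

// fetchMetrics GETs a Prometheus metrics endpoint and returns the lines
// containing pattern (the equivalent of `curl <url> | grep <pattern>`).
func fetchMetrics(url, pattern string) ([]string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return grepLines(string(body), pattern), nil
}

func main() {
	// Address taken from the PR description; assumes a locally running mcm.
	lines, err := fetchMetrics("http://localhost:10258/metrics", "mcm_")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	for _, line := range lines {
		fmt.Println(line)
	}
}
```

Like grep, this matches `mcm_` anywhere in a line, so the `# HELP` and `# TYPE` lines of each mcm metric are kept alongside its samples.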

Release note:

The mcm now supports custom Prometheus metrics. A metric exposing the number of managed machines is already included.

dkistner requested a review from a team as a code owner on August 13, 2018 07:35
prashanth26 (Contributor)

Hi @dkistner,
Thanks for the PR. I'll check it and get back to you.

Regards,
Prashanth

prashanth26 (Contributor) left a comment

Sorry for the delayed response. The functionality works as described. However, I'm not familiar with which custom metrics we need in our case. @mvladev, could you take a quick look at this?

dkistner (Member, Author)

When we monitor the mcm from outside via Prometheus, we want a time series that shows how many machines the mcm manages at any given point in time. Other metrics could also improve the mcm's observability, and they can be implemented the same way as the mcm_machine_items_total metric. This change provides the necessary code, which can easily be extended with additional metrics.
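The idea of computing a metric from the controller's current state each time Prometheus scrapes can be sketched without external dependencies. This is a hedged, stdlib-only illustration, not the PR's code: the PR uses the Prometheus Go client (`prometheus.NewDesc`, `prometheus.MustRegister`), whereas `metricDesc`, `machineLister`, `staticLister`, and `collect` here are hypothetical stand-ins. Exposing the count as a gauge is also an assumption, since the number of machines can go down as well as up.

```go
package main

import (
	"fmt"
	"strings"
)

// metricDesc mirrors the arguments passed to prometheus.NewDesc in the PR:
// a metric name and a help string, with no labels.
type metricDesc struct {
	name string
	help string
}

var machineCountDesc = metricDesc{
	name: "mcm_machine_items_total",
	help: "Count of machines currently managed by the mcm.",
}

// machineLister is a hypothetical stand-in for however the controller
// lists the Machine objects it manages.
type machineLister interface {
	ListMachineNames() []string
}

// staticLister is a fixed list of machine names, used for illustration.
type staticLister []string

func (s staticLister) ListMachineNames() []string { return s }

// collect is meant to be called on every scrape: the count is derived from
// the state at that moment, so no counter has to be kept in sync by hand.
// It renders the Prometheus text exposition format.
func collect(l machineLister) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", machineCountDesc.name, machineCountDesc.help)
	fmt.Fprintf(&b, "# TYPE %s gauge\n", machineCountDesc.name)
	fmt.Fprintf(&b, "%s %d\n", machineCountDesc.name, len(l.ListMachineNames()))
	return b.String()
}

func main() {
	fmt.Print(collect(staticLister{"machine-a", "machine-b", "machine-c"}))
}
```

With client_golang, the equivalent would typically be a type implementing the `Collector` interface (`Describe` and `Collect`) that sends a const metric built from `machineCountDesc`; `prometheus.MustRegister(c)` registers such a collector.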

hardikdr self-requested a review on August 20, 2018 08:52
@@ -0,0 +1,41 @@
package controller
Member

Can we please add the license header here?

Member Author

Yes, done.

)

var (
machineCountDesc = prometheus.NewDesc("mcm_machine_items_total", "Count of machines currently managed by the mcm.", nil, nil)
Member

I think it would be more meaningful per MachineDeployment, though this seems OK for a first cut.
As Prashanth mentioned, is there already a planned consumer of these metrics?

Member Author

Sure, the metrics can be extended. If you need this information, please extend them :) It would be nice to see how many machines exist overall and per MachineDeployment at a given point in time.
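A per-MachineDeployment breakdown, as suggested in the review, could be expressed by adding a label to the metric. This Go sketch is hypothetical: the `machine` struct, the `machine_deployment` label name, and reusing the `mcm_machine_items_total` name with a label are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// machine is a hypothetical, trimmed-down view of a Machine object:
// its name and the MachineDeployment it belongs to.
type machine struct {
	name       string
	deployment string
}

// renderPerDeployment counts machines per MachineDeployment and renders one
// gauge sample per deployment in the Prometheus text exposition format.
func renderPerDeployment(machines []machine) string {
	counts := map[string]int{}
	for _, m := range machines {
		counts[m.deployment]++
	}
	deployments := make([]string, 0, len(counts))
	for d := range counts {
		deployments = append(deployments, d)
	}
	sort.Strings(deployments) // stable output order across scrapes

	const name = "mcm_machine_items_total"
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
	for _, d := range deployments {
		// Kubernetes object names contain no quotes or backslashes,
		// so the label value is written without escaping.
		fmt.Fprintf(&b, "%s{machine_deployment=\"%s\"} %d\n", name, d, counts[d])
	}
	return b.String()
}

func main() {
	fmt.Print(renderPerDeployment([]machine{
		{name: "m1", deployment: "workers-b"},
		{name: "m2", deployment: "workers-a"},
		{name: "m3", deployment: "workers-b"},
	}))
}
```

With such a label, the overall machine count is the sum of the samples over all label values, so both views suggested in the review come from a single metric.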

For Gardener monitoring: we plan to display how many machines actually exist across all Shoots at a given point in time. The mcm is the component that knows how many machines exist in a Shoot. These metrics should be collected by the Shoot monitoring and then exposed in aggregated form to the Gardener monitoring itself. This change is a starting point toward that.
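For the Gardener-wide view, the per-Shoot machine counts have to be added up. In practice this aggregation would more likely be done in the monitoring stack (for example with a PromQL `sum(...)`); the Go sketch only illustrates the arithmetic under simplifying assumptions: each Shoot's scrape is available as a text payload keyed by a hypothetical Shoot name, and only unlabeled `name value` samples are parsed.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// sampleValue returns the value of an unlabeled sample named metric in a
// Prometheus text-format payload. Only the simple "<name> <value>" form is
// handled (no labels, no timestamps); comment lines never match because
// their first field is "#".
func sampleValue(payload, metric string) (float64, bool) {
	for _, line := range strings.Split(payload, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == metric {
			if v, err := strconv.ParseFloat(fields[1], 64); err == nil {
				return v, true
			}
		}
	}
	return 0, false
}

// totalAcrossShoots sums mcm_machine_items_total over all Shoots. Shoots
// whose payload lacks the sample are returned separately (sorted) instead
// of silently counting as zero machines.
func totalAcrossShoots(scrapes map[string]string) (float64, []string) {
	var total float64
	var missing []string
	for shoot, payload := range scrapes {
		v, ok := sampleValue(payload, "mcm_machine_items_total")
		if !ok {
			missing = append(missing, shoot)
			continue
		}
		total += v
	}
	sort.Strings(missing)
	return total, missing
}

func main() {
	// Hypothetical Shoot names and payloads.
	scrapes := map[string]string{
		"shoot-a": "# TYPE mcm_machine_items_total gauge\nmcm_machine_items_total 3\n",
		"shoot-b": "mcm_machine_items_total 5\n",
		"shoot-c": "go_goroutines 10\n",
	}
	total, missing := totalAcrossShoots(scrapes)
	fmt.Printf("machines across Shoots: %v (missing: %v)\n", total, missing)
}
```

Reporting Shoots without the metric separately keeps a scrape gap from looking like a Shoot with zero machines.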

@@ -410,6 +411,7 @@ func (c *controller) Run(workers int, stopCh <-chan struct{}) {

glog.V(1).Info("Starting machine-controller-manager")
handlers.UpdateHealth(true)
prometheus.MustRegister(c)
Member

It would also be nice to have a very short doc that just mentions Prometheus and the metrics currently exposed. We plan a documentation run soon to fill in many of the gaps, so this can also be taken care of there.

Member Author

Added a little explanation here.

Start with a basic metric about the amount of managed machines.
hardikdr (Member) left a comment

lgtm

prashanth26 added the reviewed/ok-to-test label on Aug 22, 2018
gardener-robot-ci-1 added the needs/ok-to-test label and removed the reviewed/ok-to-test label on Aug 22, 2018
prashanth26 added the kind/enhancement, reviewed/ok-to-test, size/xl, status/accepted, component/machine-controller-manager, platform/all, area/monitoring, and topology/seed labels and removed the needs/ok-to-test and reviewed/ok-to-test labels on Aug 22, 2018
prashanth26 (Contributor)

Hi @dkistner,
The CI lint checks are failing on this PR. Can you please fix them? You can ignore the test error; we have fixed that on our side.

hardikdr merged commit 51e5ad4 into gardener:master on Aug 22, 2018
ghost added the component/mcm label on Mar 7, 2020
dkistner deleted the metrics branch on July 25, 2020 19:44
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) kind/enhancement Enhancement, improvement, extension platform/all size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) status/accepted Issue was accepted as something we need to work on topology/seed Affects Seed clusters
4 participants