Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Prometheus metrics support to Swarm Client #246

Merged
merged 7 commits into from
Oct 30, 2020

Conversation

nre-ableton
Copy link
Contributor

This commit adds preliminary support to the Swarm Client for
Prometheus metrics (see https://prometheus.io). Currently, only stats
about JVM usage are reported, but in the future we could expand this
to report stats specific to Swarm Client.

One reason for adding this feature is to facilitate monitoring of
Swarm Client nodes. If the Swarm Client service itself crashes, then
alertmanager (see https://github.com/prometheus/alertmanager) can be
used to send alerts about the service being down.

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from adding a few more JARs to the equation, I do not see anything wrong here. Could be a good addition for those who wants to do direct monitoring of the agent nodes. requesting changes, because some user documentation is needed for this new feature

client/pom.xml Outdated Show resolved Hide resolved
@nre-ableton nre-ableton force-pushed the nre/master/prometheus branch from 1eb3414 to e0205f8 Compare August 17, 2020 14:09
@nre-ableton
Copy link
Contributor Author

@oleg-nenashev Thanks for the feedback! I've added some documentation in e0205f8.

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@nre-ableton
Copy link
Contributor Author

@oleg-nenashev Thanks! Who is responsible for merging PRs in this repo?

@oleg-nenashev
Copy link
Member

@basil is the current maintainer

@basil
Copy link
Member

basil commented Aug 18, 2020

Thanks for the PR, @nre-ableton! I definitely want to see monitoring support added to Swarm. This has been requested by users since 2017 (see JENKINS-43119). I am not very familiar with this problem space, so I need to educate myself further about best practices before I commit to maintaining any specific solution.

One of my open questions from a design perspective is how to support multiple reporting backends. Since Swarm is a general-purpose tool that is intended to be used in a variety of configurations, I want to make sure that it can be extended with support for multiple backends, such as Graphite, CloudWatch (requested in JENKINS-43119), Prometheus (used by your company), Datadog (used by my company), etc. To that end, I want to make sure there is a clean API/implementation separation with an API that is backend-agnostic and implementations that are backend-specific. Note that I am not necessarily asking you to implement support for a second reporting backend right now, but I am asking that the API (including any relevant command-line options) be designed with future extensibility in mind, and the best way to prove out such an API would be to implement support for a second reporting backend.

I did some preliminary research on this topic and found some Java libraries like Dropwizard Metrics and Micrometer which advertise themselves as delivering such a vendor-neutral API. For example, Dropwizard can integrate with Prometheus. At face value, it seems that one of these solutions would be desirable for a general-purpose tool like Swarm that is intended to be used in a variety of configurations. But I am not sure about the details, since I have not worked with these technologies directly. Some prototyping would be helpful here.

Are you familiar with any of these technologies, Nik? If so, maybe you could help me answer some of these questions and make the implementation more generic, ideally by adding support for at least a second reporting backend. If not, I will need to do some more research about this and possibly prototype some solutions using the other backends and libraries mentioned above. As this is a volunteer effort for me, it may take several weeks or months for me to get to this.

@nre-ableton
Copy link
Contributor Author

@basil Thanks for the detailed feedback! I agree that having a more general metrics reporting framework would be beneficial for users with other metric collection platforms. I am not familiar with the libraries that you listed, but I had a quick look at their webpages and think that Micrometer looks quite promising. I'll rework this PR to use Micrometer instead.

@nre-ableton nre-ableton marked this pull request as draft August 27, 2020 10:00
@nre-ableton nre-ableton force-pushed the nre/master/prometheus branch 2 times, most recently from b2e4879 to 1ceb5b8 Compare September 9, 2020 08:01
@nre-ableton nre-ableton marked this pull request as ready for review September 9, 2020 08:02
@nre-ableton
Copy link
Contributor Author

@basil Ok, I've reworked this PR to use Micrometer. It currently still only supports Prometheus, but of course this leaves the door open to easily support other platforms in the future should we choose. Mind having another look? Thanks!

@nre-ableton
Copy link
Contributor Author

Hmm, it seems that CI is broken, though it was working before. I'll have a look at this.

@nre-ableton nre-ableton force-pushed the nre/master/prometheus branch from 1ceb5b8 to 9f4ce71 Compare October 1, 2020 16:39
@nre-ableton
Copy link
Contributor Author

Ok, CI is fixed! Mind taking another look @basil ? Thanks 👍

@nre-ableton
Copy link
Contributor Author

re-ping @basil, in case the notification got lost. 😄

@basil
Copy link
Member

basil commented Oct 28, 2020

Hey @nre-ableton, sorry for the delay on this. I meant to test this out but got busy with other things. It occurs to me that if there was a test in SwarmClientIntegrationTest that I wouldn't need to spend as much time testing this manually. That would speed things up if you were able to add such a test.

@nre-ableton nre-ableton force-pushed the nre/master/prometheus branch from 9f4ce71 to b5c7496 Compare October 28, 2020 16:58
@nre-ableton
Copy link
Contributor Author

Hey @nre-ableton, sorry for the delay on this. I meant to test this out but got busy with other things. It occurs to me that if there was a test in SwarmClientIntegrationTest that I wouldn't need to spend as much time testing this manually. That would speed things up if you were able to add such a test.

Sounds good, I'll try to come up with some good unit tests and ping you when this PR has them.

This commit adds preliminary support to the Swarm Client for
Prometheus metrics (see https://prometheus.io), by way of Micrometer
(see https://micrometer.io). Currently, only process resource usage
and JVM statistics are reported, but in the future we could expand
this to report stats specific to Swarm Client.

One reason for adding this feature is to facilitate monitoring of
Swarm Client nodes. If the Swarm Client service itself crashes, then
alertmanager (see https://github.com/prometheus/alertmanager) can be
used to send alerts about the service being down.

At the moment, only Prometheus is configured as an endpoint, but in
the future other metrics platforms could also be supported.
@nre-ableton nre-ableton force-pushed the nre/master/prometheus branch from b5c7496 to 3131c1b Compare October 29, 2020 17:20
@nre-ableton
Copy link
Contributor Author

@basil Ok, I've added a unit test which attempts to fetch the metrics page from the client after starting it up.

As it stands, it's more of a sanity test of the feature rather than a comprehensive test. We could add separate unit tests to try to get different metrics names for the various reporters, but I worry that this will be a bit brittle and prone to breakage since these names change periodically.

Should I add any additional tests, or is this one sufficient?

Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for adding the integration test. This test is sufficient. Just a few more minor comments and this should be ready to merge.

client/pom.xml Outdated Show resolved Hide resolved
client/pom.xml Outdated
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
<version>1.5.4</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order for Dependabot to be able to automatically update this library, the version number needs to be factored out into a Maven property as in jenkinsci/email-ext-plugin#229. Please also update this to the latest version (1.6.0).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 0c282d7.

client/pom.xml Outdated
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
<version>1.5.4</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order for Dependabot to be able to automatically update this library, the version number needs to be factored out into a Maven property as in jenkinsci/email-ext-plugin#229. Please also update this to the latest version (1.6.0).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 0c282d7.

client/src/main/java/hudson/plugins/swarm/SwarmClient.java Outdated Show resolved Hide resolved
@nre-ableton nre-ableton requested a review from basil October 30, 2020 08:20
@nre-ableton
Copy link
Contributor Author

@basil Thanks for the suggestions; I've applied them as fixups to the branch and also made another fixup for the Micrometer version in pom.xml. Please let me know if this looks good to you and I'll squash all fixups before merging.

@basil basil merged commit 9cf7f5b into jenkinsci:master Oct 30, 2020
@basil basil added the feature New features and improvements label Oct 30, 2020
@nre-ableton nre-ableton deleted the nre/master/prometheus branch October 30, 2020 15:50
@nre-ableton
Copy link
Contributor Author

@basil thanks! 👍

@basil
Copy link
Member

basil commented Oct 30, 2020

@basil thanks! +1

Thank you for the PR! If you can confirm that the final build is working then I'll go ahead and cut a release of this.

@nre-ableton
Copy link
Contributor Author

@basil thanks! +1

Thank you for the PR! If you can confirm that the final build is working then I'll go ahead and cut a release of this.

@basil ok, I just did a quick test with the JAR you linked to from that Jenkins server, and the Swarm Client itself still seems to work, and the new Prometheus feature is working as well. So I think that we're good to go as far as a release is concerned.

@basil
Copy link
Member

basil commented Nov 2, 2020

Released in 3.24.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New features and improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants