
Multiple rule configuration sources #107

Closed
carlpett opened this issue Aug 30, 2018 · 20 comments

Comments

@carlpett

As discussed briefly with @DirectXMan12 on Slack, we want/need to let multiple teams supply rules to the k8s-prometheus-adapter. The outcome of that discussion was to move configuration from the config file to a CRD.
I'd be happy to work on this, but it should probably be scoped first.

Thoughts and things to be considered?
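
For context, all rules currently live in a single config file mounted into the adapter; a discovery rule looks roughly like this (the series and label names below are only illustrative):

```yaml
# Excerpt of the adapter's current file-based configuration (illustrative values).
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Moving to a CRD would essentially mean splitting entries like this across namespaced API objects that different teams can own.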

@DirectXMan12
Contributor

The Prometheus Operator could be a source of inspiration.

Things to consider:

  • Is there any way to "safely" allow normal users to specify rules without worrying about cross-namespace metrics reading? (probably would require query rewriting, or at least query parsing for verification :-/)
  • Do we want one object per "rule" (as in the current config), or should one object be allowed to hold multiple rules? (See the sketch below.)
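
To make that second point concrete, a rough sketch of the two shapes might look like this (the API group, kinds, and metric names are entirely hypothetical; nothing like this exists yet):

```yaml
# Option A: one object per rule (hypothetical group/kind names).
apiVersion: custom-metrics.example.k8s.io/v1alpha1
kind: MetricRule
metadata:
  name: http-requests
  namespace: team-a
spec:
  seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
# Option B: one object holding a list of rules.
apiVersion: custom-metrics.example.k8s.io/v1alpha1
kind: MetricRuleSet
metadata:
  name: team-a-rules
  namespace: team-a
spec:
  rules:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
    - seriesQuery: 'worker_queue_depth{namespace!="",pod!=""}'
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```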

@DirectXMan12
Contributor

@brancz might be interested in this.

@tomkerkhove
Contributor

We are considering the same for the Azure metrics adapter (Azure/azure-k8s-metrics-adapter#19), FYI @jsturtevant

@carlpett
Author

@DirectXMan12 Is the safety worry about reading other namespaces' "sensitive metrics", or about expensive queries? I would guess that those who decide to run a shared Prometheus installation already have some trust in the parties using it (since those parties could already graph any such metrics). Expensive queries are a real problem, but I'm not sure that is solvable. Are there other dimensions to this?

I would reflexively propose allowing a list of rules, though without any particular argument other than not wanting to create dozens of objects... Are there good arguments for going with single rules?

One potential issue I can see is de-duplication of rules. Does it need to be handled? What do we do about duplicate names, for example, especially if the name is the same but the other parameters differ: error out, warn and continue, or ignore?

Regarding taking inspiration from the Prometheus Operator: I believe it predates the operator-framework. Should we prefer to go hand-crafted, or use the framework?

@brancz
Contributor

brancz commented Aug 31, 2018

I'm not sure how I feel about custom rules for the Prometheus case. My feeling is that most useful rules will use recording rules anyway, so I'd prefer to explore how the adapter could work with multiple Prometheus instances, or, even more generically, how we could essentially federate metrics adapters so people can use a mix of adapters in a cluster. I imagine this will be very common, especially in multi-tenancy cases; then everyone can set up their own adapter to their liking.

@carlpett
Author

I'm probably missing some context here, because I didn't really understand that :)
Why are useful rules (with recording rules?) something that works against, or is harmed by, allowing multiple users in the cluster to define custom metrics?
Also, why would useful rules mostly be based on recording rules?

(That there are two or three(?) different meanings to "rules" does not help here 🙂)

@brancz
Contributor

brancz commented Aug 31, 2018

(That there are two or three(?) different meanings to "rules" does not help here 🙂)

Haha, definitely.

I can imagine essentially unlimited scenarios or queries that users want to run against Prometheus for autoscaling. While this adapter's rules can help a bit with doing that on the fly, they're very prone to requiring customization. Instead, I think I would prefer that users just specify recording rules in Prometheus directly, making sure the resulting time series has the appropriate labelling; the arbitrarily complex query can then be used through the adapter, since it's accessible as if it were a normal single metric/time series.
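
For example (the rule and label names here are made up), a recording rule like the following does the heavy lifting inside Prometheus and leaves the adapter with a single, well-labelled series to expose:

```yaml
# Prometheus recording rule (illustrative names).
groups:
  - name: autoscaling.rules
    rules:
      - record: namespace_pod:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace, pod)
```

The adapter then only needs a trivial rule (or its existing discovery) to pick up namespace_pod:http_requests:rate5m, since the namespace and pod labels already map cleanly onto Kubernetes resources.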

Sorry if I'm not being detailed enough, I'm still recovering from GopherCon.

@DirectXMan12
Contributor

I've been talking a bit offline with @brancz, and I'm thinking about switching to not having "discovery", and instead requiring metrics and associations to be manually specified in CRD objects. Most people seem to care about a small subset of metrics anyway, and seem to mostly be using the custom rules to select one or two specific metrics, as opposed to all metrics in their cluster.

What do people think about that?
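
As a rough illustration of what "no discovery" could look like (everything below is hypothetical, just to anchor the discussion), each object would explicitly name one metric and its resource associations:

```yaml
# Hypothetical CRD: explicitly name the metric and its resource association.
apiVersion: custom-metrics.example.k8s.io/v1alpha1
kind: ExportedMetric
metadata:
  name: worker-queue-depth
  namespace: team-a
spec:
  # The exact Prometheus series to expose -- nothing is auto-discovered.
  series: worker_queue_depth
  # How Prometheus labels map onto Kubernetes resources.
  resources:
    namespace: namespace
    pod: pod
  # Optional query; defaults to the raw series if omitted.
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```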

@aabed @zioproto @tanner-bruce @discordianfish @itskingori @atomaras you've all been vocal on stuff before

@DirectXMan12
Contributor

@antoinne85 this may affect your work as well...

@atomaras

From our perspective, we have a limited set of custom metrics that we'd want to autoscale on: CPU, memory, request metrics, and worker-queue-size metrics. We'd prefer manually specifying only the metrics we need.

@itskingori

Most people seem to care about a small subset of metrics anyway, and seem to mostly be using the custom rules to select one or two specific metrics, as opposed to all metrics in their cluster.

@DirectXMan12 I agree with this statement. I started with the auto-discovery but realised that, ultimately, what I want is the custom metrics I'm exporting to Prometheus. In my case I wanted to scale one component based on the web-server queue and another based on the worker queue, so I've manually specified the metrics I want.
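
For what it's worth, the consuming side then stays a plain HPA that references exactly that metric, something like the following (the metric, Deployment, and target values are just examples):

```yaml
# HPA scaling a worker Deployment on a manually exposed queue metric (illustrative names).
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: worker_queue_depth
        target:
          type: AverageValue
          averageValue: "30"
```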

I'd also like to point out that another motivation for this was that the Prometheus adapter was using a significant amount of resources on all the auto-discovered metrics. Limiting it to just what I need reduced resource usage significantly.

@tomkerkhove
Contributor

I agree with @DirectXMan12 & @itskingori - people are only interested in a subset of metrics, not all of them. And even if they are interested in all of them, that doesn't mean they should have access to all of them by default.

@itskingori

I've been talking a bit offline with @brancz, and I'm thinking about switching to not having "discovery", and instead requiring manually specifying metrics and associations in CRD objects.

@DirectXMan12 I didn't comment on this part, but my thoughts are: if moving to CRD objects can reduce the complexity of the config and simplify association, I'm all for that. This approach also means that one doesn't need to restart the adapter for it to pick up new configuration.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 30, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 1, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fredr

fredr commented Mar 31, 2021

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@fredr: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot removed the lifecycle/rotten label on Mar 31, 2021
@fredr

fredr commented Mar 31, 2021

I think this is still a very interesting feature, and since no one seemed to be against it three years ago, I would like to reinitiate the discussion.

Having custom and external metrics configured via CRDs would make it a lot easier to understand which metrics are available where.
Three years later 😄, what are your opinions on this issue now?
