-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add prometheus metrics & endpoint #60
Conversation
This looks amazing! I need to get some proper sleep before looking at this. I will definitely do a full review tomorrow morning! For now, just one note in regards to the need for https://github.com/cloudbase/garm/blob/main/auth/context.go#L219-L227 It's mainly used for testing now, but the intention for it was to be used for your exact use case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a couple of comments. Will look at everything a bit closer tomorrow. This is a great feature to have!
apiserver/controllers/controllers.go
Outdated
log.Printf("got not found error from DispatchWorkflowJob. webhook not meant for us?: %q", err) | ||
return | ||
} else if strings.Contains(err.Error(), "signature") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific error type we return when the signature does not match? We could use errors.Is()
if there is. It seems like a significant event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw .. this error never come. There is a bug in the webhook validation which has the effect that every webhook for orgs end enterprises are always valid. see #61
aae1886
to
d654738
Compare
3f5030c
to
4fe31b4
Compare
a07e32f
to
4f9075e
Compare
24d21d0
to
dbaa191
Compare
Hi, I changed the code so now you can:
the cli now has
with this git you can curl the metrics endpoint, but not the normal server endpoints, vice versa the admin token I skipped adding a dedicated metrics user in the db. the configuration in the garm/config.toml looks like this:
I did not update the docs and examples. I wanted to check with you if this is going in the right direction?! |
I think we can skip having to add another listener for metrics. In some cases, that's just another port to justify to the security team, and while it does seem to be optional, I think it's best we don't add a moving part unless absolutely unavoidable. In this case, if we really want to, we can do this by putting nginx in front of garm. Garm can listen on Our job should be to make sure that authentication and authorization is properly fleshed out. Tokens need to have just enough privileges to do the things they need to do. There's no need for a dedicated "metrics" user. In fact, the user ID can be left blank, or a new claims type can be created. At a later time, it may be useful to create a unified claims type that allows us to set access levels. At this point, we can simply add a new field in the claims that states the token is only valid for metrics. We parse the token, and set the context appropriately. Almost all current handlers and runner functions check if the context is that of Same can be done for metrics. I will look over this PR today in more detail and can give more constructive feedback point by point. But this is shaping up nicely and is a great addition to garm. Can't wait to see it in action 😄. |
added some nginx samples: mercedes-benz#3 This should mitigate the need for a separate listener, just for metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally had a chance to do a proper review 😄
apiserver/controllers/metrics.go
Outdated
} | ||
} | ||
|
||
hostname, controllerID := c.apiController.GetControllerInfo() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be c.store.ControllerInfo()
if you give the collector access to the store. If you want to limit the scope of the dependencies we're injecting, you could (although not mandatory) give it access just to that function if you want to be restrictive by defining a struct field as:
type GarmCollector struct {
instanceMetric *prometheus.Desc
apiController *APIController
ctrlInfoFn func() (params.ControllerInfo, error)
}
And just pass the store.ControllerInfo
function in when initializing the collector in main.go
. I would give it access to the store to make things simple. Either way is fine with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the dependency in to apicontroller in the collector .. but still I think this func makes sense, as described above
config/config.go
Outdated
@@ -476,14 +476,30 @@ func (t *TLSConfig) Validate() error { | |||
return nil | |||
} | |||
|
|||
type MetricsConfig struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a Validate()
function to this config type. All config types have validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you want to use the same router, there is no port to validate .. so basically the only real config toggle is a bool (no_auth). Validation makes not much sense here ..
do you still want a validate func ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
viper would really be nice here :)
We'll get there. Eventually 😄. It's not worth the effort for now.
do you still want a validate func ?
It's a good idea to have a Validate()
func, even if it's a noop, if we have a separate type for a particular set of settings. If we embed the metrics auth in an existing type, and it's a bool, there is no need to validate it.
The idea is that we may extend a config at some point, and may forget to add a Validate()
func later on.
btw, I removed the
disable metrics
toggle, mainly because in the router I don't have access to the config and I don't know if one would want to disable metrics.
You can pass in a new param to NewAPIRouter()
which includes whichever config you need. It didn't make sense to pass in a config to that func until now, but if that changes, I have nothing against adding a param that makes sense.
I think it may be useful to have a toggle to disable metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.. I now pass the config into NewAPIRouter
runner/runner.go
Outdated
@@ -265,6 +266,14 @@ type Runner struct { | |||
credentials map[string]config.Github | |||
} | |||
|
|||
func (r *Runner) GetControllerID() (uuid.UUID, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed if you give the collector access to the store.
As a side note, all functions should accept context.Context
, for at least the following reasons:
- The context may get canceled and if we pass it along, it allows the various libraries we use to cancel their tasks and return gracefully.
- It allows us to do some authorization checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't give the collector access directly to the store, for now this is not necessary, access to the runner seems to be a good level of abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. See above wall of text 😄
Thank you, I'm a bit busy next week, but I'll fix everything as soon as I can. |
No worries at all! Thanks for all the work! This will be a great addition to garm! |
I fixed some linting errors and added a new workflow that checks things like gofmt, linting, vendoring consistency, etc. There are now a few makefile targets that should help out. make verify will run the lint, gofmt, vendor checks. make test will run verify and the go tests. This is the last time I cause conflicts for your PR 😄. Sorry about that! |
The last two comments are the last nitpicking I have 😄 |
.. so .. finally, I think you should be good to do another review .. |
runner/runner.go
Outdated
} | ||
|
||
func (r *Runner) GetControllerInfo(ctx context.Context) (params.ControllerInfo, error) { | ||
if r.controllerInfo == (params.ControllerInfo{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the controller ID already set in the Runner.poolManagerCtrl.controllerID
. You should be able to access it. You can also add it to the Runner{}
itself if you wish. The runner.NewRunner()
function, errors out if it can't get the controller ID, so having the controller ID set is guaranteed.
Which means we only have to get the hostname, which shouldn't error out.
So this could be as simple as:
return params.ControllerInfo{
ControllerID: r.poolManagerCtrl.controllerID,
Hostname: os.Hostname(),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything else looks fine!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh. My mistake. os.Hostname()
can return an error. So that would be something like:
hostname, err := os.Hostname()
if err != nil {
return params.ControllerInfo{}, errors.Wrap(err, "getting hostname")
}
return params.ControllerInfo{
ControllerID: r.poolManagerCtrl.controllerID,
Hostname: hostname,
}, nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm .. accessing the id through Runner.poolManagerCtrl.controllerID is not that easy .. as you decoupled the access via an interface, I didn't want to extend this .. and type assertion seems not the right way ..
os.Hostname will probably not really error .. and if so we will ignore it anyway .. we cannot crash, right?
So I went for having no error result, but there's still the context in the runner.GetControllerInfo .. do you think that still makes much sense ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Then it might be worth adding a new un-exported field to Runner{}
like controllerID uuid.UUID
, and cache the controller ID there for later use.
Note, the NewRunner()
function is only called once. If you want to catch a hostname change, you probably want to call that in GetControllerInfo()
. The call to os.Hostname()
shouldn't error, but if they added that value as a return, it may happen, in which case you risk getting an empty string there. Is that ok in your current code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I went for having no error result, but there's still the context in the runner.GetControllerInfo .. do you think that still makes much sense ?
The context is useful if we want to limit who gets to fetch this info. There is no real sensitive info returned by this function, but we should probably still check for authorization here. In the future we may extend this function to return more metadata about the controller, especially if we will decouple the code and possibly deploy multiple instances of that particular component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will not nitpick too much on this. I am okay with merging as is, and making changes later as needed. The only thing I think is worth changing is the location of the call to os.Hostname()
. It's worth moving it into GetControllerInfo()
as that should resolve the hostname on every call to GetControllerInfo()
, as opposed to caching it for the duration of the lifetime of the current process.
You mentioned previously that you would rather pick up on hostname changes without restarting garm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned previously that you would rather pick up on hostname changes without restarting garm.
right .. thanks for your reviews and your patience with me 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the work you put into this!
This adds a prometheus metrics endpoint and 2 simple metrics to allow garm being monitored.
Remarks:
ListAllInstances
andListAllPools
now have a version without the admin check that can be used internally.I didn't want to fake admin
Michael Kuhnt [email protected] Mercedes-Benz Tech Innovation GmbH (ProviderInformation)