-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Create metric for flux manifest errors #2535
Conversation
edf514f
to
464f11d
Compare
I think the metric should be named Labels:
With this metric and labels one can compose alerts with |
It seems like the main thrust of this PR is to be able to know the number of manifests that failed to apply, so you can alert on manifests failing to apply. There's a metric which record how long each sync takes, and whether it was successful or not -- So I think there's room another metric to record specifically the application step. A gauge counting how many things were applied is appropriate -- I would record how many were attempted, and label according to whether they were successful applied or not. Tangentially for your purpose, a metric timing (and labelling the success) of each phase of the syncing might be useful too. E.g., generating the manifests, parsing the manifests, doing the application, doing the garbage collection. |
I'm not a fan of metrics that are there just so you can write a very specific alert as simply as possible, but provide no other information. What if you want to alert on the proportion of apply errors to successes? Or an alert on whether the number of manifests being applied is not what you expected -- a mistake in configuration could mean you suddenly have no manifests, and no errors. |
Main idea of initial request - #2199 - is to provide possibility to get information about current flux state - is everything is in sync and works as expected or something can't be synced and manual intervention is needed. flux_demon_sync_duration_seconds doesn't give flux current state information. |
Sorry, this difference of opinion is going to hold up the PR. @stefanprodan Can we agree a minimal change that makes it mergeable? I suggest: make the metric a gauge of attempted manifest applications, labelled by success or failure. In other words,
If you want to alert on _any_failures, you can use
|
I'm ok with |
So if sync was successful we put flux_daemon_sync_manifests{success=true}=manifest_number, if not - flux_daemon_sync_manifests{success=false}=manifest_errors |
I'm not sure of the most convenient way to get these numbers, but
manifest_errors is the number of resource errors received back from the sync, as you have it now (sync.go L163). |
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add the metric to https://github.com/fluxcd/flux/blob/master/docs/references/monitoring.md
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @mpashka
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's wrong to use a metric which is a count (manifests, here) also as a binary 1/0. Rely on another metric to determine the success or failure of the whole sync.
That's correct. Success or failure of the overall sync process can be obtained by running queries: So we need only measure the case where manifests are correct but there were some problems applying manifests to kubernetes - e.g. configmap size is too big to fit in annotations or immutable field (like label selector) was changed. Probably it is better to put glux alarming info into documentation as well. I suppose https://github.com/fluxcd/flux/blob/master/docs/references/monitoring.md is a good place for it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks all squared away (I have a cosmetic suggestion only). Thank you for your diligent work @mpashka! Can you rebase+squash it into one commit (or I can do that before merging, if you have better things to do :-)
75368ec
to
45c19b1
Compare
45c19b1
to
5c026fa
Compare
#2199
Expose Prometheus metric that allows alerting on invalid Kubernetes manifests:
flux_daemon_sync_manifests{success='false'} > 0
- if true then there are either some problems with applying git manifests to kubernetes - e.g. configmap size is too big to fit in annotations or immutable field (like label selector) was changed.