Create metric for flux manifest errors #2535

mpashka · 2019-10-21T21:02:40Z

Expose a Prometheus metric that allows alerting on invalid Kubernetes manifests:
flux_daemon_sync_manifests{success='false'} > 0 - if true, there were problems applying the manifests from git to Kubernetes, e.g. a ConfigMap is too big to fit in annotations, or an immutable field (like a label selector) was changed.

pkg/daemon/sync_test.go

stefanprodan · 2019-10-22T09:29:07Z

I think the metric should be named sync_error and it should act like a boolean, values can be 0 or 1.

Labels:

fetch (records git errors)
generate (records .flux.yaml generator errors)
parse (records manifestsStore errors)
apply (records kubectl apply errors)

With this metric and these labels one can compose alerts with avg_over_time(flux_daemon_sync_error[5m]) > 0 and compose messages like flux sync apply failed or flux sync parse failed.

squaremo · 2019-10-22T09:37:25Z

It seems like the main thrust of this PR is to be able to know the number of manifests that failed to apply, so you can alert on manifests failing to apply.

There's a metric which records how long each sync takes, and whether it was successful or not -- flux_daemon_sync_duration_seconds -- but that might be too broad for your purposes, since it will label a sample as unsuccessful if any error came up while syncing, e.g., it failed to post a webhook.

So I think there's room for another metric to record specifically the application step. A gauge counting how many things were applied is appropriate -- I would record how many were attempted, and label according to whether they were successfully applied or not.

Tangentially for your purpose, a metric timing (and labelling the success) of each phase of the syncing might be useful too. E.g., generating the manifests, parsing the manifests, doing the application, doing the garbage collection.

squaremo · 2019-10-22T10:02:24Z

I think the metric should be named sync_error and it should act like a boolean

I'm not a fan of metrics that are there just so you can write a very specific alert as simply as possible, but provide no other information. What if you want to alert on the proportion of apply errors to successes? Or an alert on whether the number of manifests being applied is not what you expected -- a mistake in configuration could mean you suddenly have no manifests, and no errors.
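As a rough illustration of the two questions raised here, the sketch below shows what a count-based metric makes answerable and a 0/1 metric does not. It is not code from this PR; the `SyncCounts` type and its methods are hypothetical names, and it assumes one count of successfully applied manifests and one of failed manifests per sync.

```go
package main

import "fmt"

// SyncCounts is a hypothetical snapshot of the per-sync manifest counts
// (one value per success label). It is not a type from the flux codebase.
type SyncCounts struct {
	Succeeded float64 // manifests applied successfully in the last sync
	Failed    float64 // manifests that failed to apply in the last sync
}

// FailureRatio returns failed / (failed + succeeded).
// The second result is false when nothing was attempted, so callers
// never divide by zero.
func (c SyncCounts) FailureRatio() (float64, bool) {
	total := c.Succeeded + c.Failed
	if total == 0 {
		return 0, false
	}
	return c.Failed / total, true
}

// NoManifests reports the "suddenly no manifests, and no errors" case,
// which a boolean error metric cannot distinguish from a healthy sync.
func (c SyncCounts) NoManifests() bool {
	return c.Succeeded+c.Failed == 0
}

func main() {
	c := SyncCounts{Succeeded: 18, Failed: 2}
	if ratio, ok := c.FailureRatio(); ok {
		fmt.Printf("failure ratio: %.2f\n", ratio)
	}
	fmt.Println("no manifests:", SyncCounts{}.NoManifests())
}
```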

mpashka · 2019-10-22T10:09:22Z

The main idea of the initial request - #2199 - is to make it possible to get information about the current flux state: is everything in sync and working as expected, or is there something that can't be synced and needs manual intervention? flux_daemon_sync_duration_seconds doesn't give information about flux's current state.
I suppose a metric is the proper way to get the flux state.

squaremo · 2019-10-22T10:15:37Z

Sorry, this difference of opinion is going to hold up the PR. @stefanprodan Can we agree a minimal change that makes it mergeable?

I suggest: make the metric a gauge of attempted manifest applications, labelled by success or failure. In other words,

flux_daemon_sync_manifests{success=true|false}

If you want to alert on any failures, you can use

flux_daemon_sync_manifests{success=false} > 0

stefanprodan · 2019-10-22T10:29:37Z

I'm ok with flux_daemon_sync_manifests{success=true|false}. It's easy to reason about and create alerts.

mpashka · 2019-10-22T12:26:15Z

So if the sync was successful we set flux_daemon_sync_manifests{success=true}=manifest_number; if not, flux_daemon_sync_manifests{success=false}=manifest_errors.
If there was an error reading or parsing the manifests, do we set flux_daemon_sync_manifests{success=true}=0 and flux_daemon_sync_manifests{success=false}=number_of_parse_errors? Or just flux_daemon_sync_manifests{success=false}=1?

@stefanprodan , @squaremo ?

squaremo · 2019-10-22T12:53:39Z

I'm not sure of the most convenient way to get these numbers, but

flux_daemon_sync_manifests{success=false}=manifest_errors
flux_daemon_sync_manifests{success=true}=total_manifest_number - manifest_errors

manifest_errors is the number of resource errors received back from the sync, as you have it now (sync.go L163).
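A minimal sketch of the arithmetic described above, assuming the sync step yields a total manifest count and a count of resource errors. The function names `manifestGauges` and `render` are illustrative only and are not taken from sync.go; the output uses the Prometheus text exposition format to show how the two samples would look when scraped.

```go
package main

import "fmt"

// manifestGauges derives the two gauge values from a sync result:
//   success=false -> manifestErrors
//   success=true  -> totalManifests - manifestErrors
// Both parameter names are assumptions made for this sketch.
func manifestGauges(totalManifests, manifestErrors int) (succeeded, failed float64) {
	failed = float64(manifestErrors)
	succeeded = float64(totalManifests - manifestErrors)
	return succeeded, failed
}

// render formats the two samples in the Prometheus text exposition format.
func render(succeeded, failed float64) string {
	return fmt.Sprintf(
		"flux_daemon_sync_manifests{success=\"true\"} %g\n"+
			"flux_daemon_sync_manifests{success=\"false\"} %g\n",
		succeeded, failed)
}

func main() {
	// Example: 10 manifests attempted, 2 resource errors reported.
	succeeded, failed := manifestGauges(10, 2)
	fmt.Print(render(succeeded, failed))
}
```

With these values, the alert expression flux_daemon_sync_manifests{success=false} > 0 would match, because the failure sample is 2.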

mpashka · 2019-10-22T20:14:09Z

Done

stefanprodan

Can you please add the metric to https://github.com/fluxcd/flux/blob/master/docs/references/monitoring.md?

mpashka · 2019-10-23T12:32:26Z

Done

stefanprodan

LGTM

Thanks @mpashka

squaremo

It's wrong to use a metric which is a count (manifests, here) also as a binary 1/0. Rely on another metric to determine the success or failure of the whole sync.

pkg/daemon/sync.go

mpashka · 2019-10-23T13:40:44Z

That's correct. Success or failure of the overall sync process can be obtained by running queries:
delta(flux_daemon_sync_duration_seconds_count{success='true'}[6m]) < 1
or
rate(flux_daemon_sync_duration_seconds_count{success='false'}[6m]) > 0

So we only need to measure the case where the manifests are correct but there were problems applying them to Kubernetes, e.g. a ConfigMap is too big to fit in annotations, or an immutable field (like a label selector) was changed.

It is probably better to put the flux alerting info into the documentation as well. I suppose https://github.com/fluxcd/flux/blob/master/docs/references/monitoring.md is a good place for it.
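To make the first query above concrete, here is a simplified sketch of what delta(...[6m]) < 1 checks on the success counter. It deliberately ignores Prometheus's extrapolation and counter-reset handling, and the function name `noSuccessfulSync` is hypothetical. The input is assumed to be the scraped values of flux_daemon_sync_duration_seconds_count{success='true'} inside the window, oldest first.

```go
package main

import "fmt"

// noSuccessfulSync approximates `delta(counter[window]) < 1`:
// it compares the last and first samples in the window and reports
// whether fewer than one successful sync was recorded between them.
// With fewer than two samples there is nothing to compare, so the
// condition is treated as not met.
func noSuccessfulSync(samples []float64) bool {
	if len(samples) < 2 {
		return false
	}
	return samples[len(samples)-1]-samples[0] < 1
}

func main() {
	stalled := []float64{42, 42, 42, 42} // counter did not move in the window
	healthy := []float64{42, 43, 43, 44} // two successful syncs in the window

	fmt.Println("stalled window alerts:", noSuccessfulSync(stalled))
	fmt.Println("healthy window alerts:", noSuccessfulSync(healthy))
}
```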

squaremo

This looks all squared away (I have a cosmetic suggestion only). Thank you for your diligent work @mpashka! Can you rebase+squash it into one commit? (Or I can do that before merging, if you have better things to do :-)

docs/references/monitoring.md

mpashka force-pushed the errors_metrics branch from edf514f to 464f11d Compare October 21, 2019 21:06

mpashka changed the title ~~Create metric for flux manifest errors -~~ Create metric for flux manifest errors Oct 21, 2019

stefanprodan mentioned this pull request Oct 22, 2019

🚧 Provide basic observability over synchronisation errors. #2534

Closed

stefanprodan suggested changes Oct 22, 2019

pkg/daemon/sync_test.go

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 22, 2019

Code review - fluxcd#2535 (comment)

32ddd86

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 22, 2019

Code review - fluxcd#2535 (comment)

18472e2

stefanprodan suggested changes Oct 23, 2019

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 23, 2019

Code review - fluxcd#2535 (review)

6a327f5

stefanprodan approved these changes Oct 23, 2019

squaremo suggested changes Oct 23, 2019

pkg/daemon/sync.go

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 24, 2019

Code review - fluxcd#2535 (review)

ec57ec9

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 24, 2019

Code review - fluxcd#2535 (review) (rollback syncDuration metric check)

afa64ee

squaremo approved these changes Oct 24, 2019

docs/references/monitoring.md

mpashka added a commit to pulsepointinc/flux that referenced this pull request Oct 24, 2019

Code review - fluxcd#2535 (review) (rollback syncDuration metric check)

f3ddde4

mpashka force-pushed the errors_metrics branch 2 times, most recently from 75368ec to 45c19b1 Compare October 24, 2019 11:59

Create metric for flux manifest errors - fluxcd#2199

5c026fa

mpashka force-pushed the errors_metrics branch from 45c19b1 to 5c026fa Compare October 24, 2019 12:02

squaremo merged commit 28db2f5 into fluxcd:master Oct 24, 2019

2opremio mentioned this pull request Nov 12, 2019

Create metric for flux manifest errors #2199

Closed

stefanprodan mentioned this pull request Nov 13, 2019

Add a resource sync error counter #2608

Closed

2opremio added this to the 1.16.0 milestone Nov 21, 2019

2opremio mentioned this pull request Jan 10, 2020

Report errors at kubernetes level #2695

Closed

2opremio mentioned this pull request Feb 20, 2020

Flux aborts synchronization on manifest syntax errors #2861

Closed
