This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

🚧 Provide basic observability over synchronisation errors. #2534

Closed


bmcustodio
Contributor

@bmcustodio bmcustodio commented Oct 21, 2019

This PR implements basic observability over git-to-cluster synchronisation errors, as requested in #1340. The implementation introduces a new metric:

  • sync_error_count, as suggested in the original issue, which provides a counter of synchronisation errors.

While I believe sync_error_count has its own merit and use cases, I am also introducing two additional metrics:

  • last_sync_timestamp, which provides the timestamp at which synchronisation was last attempted; and
  • last_successful_sync_timestamp, which provides the timestamp at which synchronisation was last attempted successfully.
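As a rough illustration of the semantics described above, the following Python sketch models how the three metrics would move on each synchronisation attempt. It is not the code from this PR: the `SyncMetrics` class and its `record_attempt` and `is_failing` methods are hypothetical names, and storing timestamps as nanoseconds since the Unix epoch is an assumption based on the magnitude of the sample values shown further down.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncMetrics:
    """Hypothetical in-memory model of the three metrics (not the PR's code)."""

    sync_error_count: int = 0                # counter: only ever increases
    last_sync_timestamp: int = 0             # gauge: last attempt, any outcome
    last_successful_sync_timestamp: int = 0  # gauge: last attempt that succeeded

    def record_attempt(self, success: bool, now_ns: Optional[int] = None) -> None:
        """Update the metrics after one git-to-cluster sync attempt."""
        if now_ns is None:
            # Assumption: timestamps are nanoseconds since the Unix epoch.
            now_ns = time.time_ns()
        self.last_sync_timestamp = now_ns
        if success:
            # Both gauges get the same value, so they compare equal.
            self.last_successful_sync_timestamp = now_ns
        else:
            self.sync_error_count += 1

    def is_failing(self) -> bool:
        """True when the most recent attempt was not a successful one."""
        return self.last_sync_timestamp != self.last_successful_sync_timestamp


metrics = SyncMetrics()
metrics.record_attempt(success=True, now_ns=100)
metrics.record_attempt(success=False, now_ns=200)
print(metrics.is_failing(), metrics.sync_error_count)
```

Because a successful attempt writes the same value to both gauges, a plain inequality check is enough to tell the two states apart in this model.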

It follows that whenever these two metrics differ, the git-to-cluster synchronisation is currently failing. As an example, this is what the metrics look like after a successful synchronisation:

(...)
# HELP flux_daemon_last_successful_sync_timestamp The timestamp at which git-to-cluster synchronisation was last successfully attempted.
# TYPE flux_daemon_last_successful_sync_timestamp gauge
flux_daemon_last_successful_sync_timestamp 1.5716875893542927e+18
# HELP flux_daemon_last_sync_timestamp The timestamp at which git-to-cluster synchronisation was last attempted.
# TYPE flux_daemon_last_sync_timestamp gauge
flux_daemon_last_sync_timestamp 1.5716875893542927e+18
(...)
# HELP flux_daemon_sync_error_count Count of git-to-cluster synchronisation errors.
# TYPE flux_daemon_sync_error_count counter
flux_daemon_sync_error_count 0
(...)

And this is what happens after a failed synchronisation attempt (in this case, I had pushed a malformed YAML file to the repository being synced):

# HELP flux_daemon_last_successful_sync_timestamp The timestamp at which git-to-cluster synchronisation was last successfully attempted.
# TYPE flux_daemon_last_successful_sync_timestamp gauge
flux_daemon_last_successful_sync_timestamp 1.5716877452956073e+18
# HELP flux_daemon_last_sync_timestamp The timestamp at which git-to-cluster synchronisation was last attempted.
# TYPE flux_daemon_last_sync_timestamp gauge
flux_daemon_last_sync_timestamp 1.5716877717151754e+18
(...)
# HELP flux_daemon_sync_error_count Count of git-to-cluster synchronisation errors.
# TYPE flux_daemon_sync_error_count counter
flux_daemon_sync_error_count 1
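For anyone who wants to check this condition outside Prometheus, here is a minimal Python sketch that reads the exposition text above and compares the two gauges. The helper names `parse_samples` and `sync_status` are made up for illustration, the parser only handles unlabelled `name value` lines, and the conversion to wall-clock time assumes the values are nanoseconds since the Unix epoch, which their magnitude suggests.

```python
from datetime import datetime, timezone

# The sample values from the failed synchronisation attempt above.
SAMPLE = """\
flux_daemon_last_successful_sync_timestamp 1.5716877452956073e+18
flux_daemon_last_sync_timestamp 1.5716877717151754e+18
(...)
flux_daemon_sync_error_count 1
"""


def parse_samples(text):
    """Collect unlabelled `name value` samples, skipping comments and noise."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue  # elision marker or labelled series: out of scope here
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return samples


def sync_status(samples):
    """Summarise sync health from the three metrics in this PR."""
    last = samples["flux_daemon_last_sync_timestamp"]
    ok = samples["flux_daemon_last_successful_sync_timestamp"]
    return {
        "failing": last != ok,
        # Gap between the last attempt and the last success (assumes ns).
        "seconds_behind": (last - ok) / 1e9,
        "last_success_utc": datetime.fromtimestamp(ok / 1e9, tz=timezone.utc),
        "errors": int(samples["flux_daemon_sync_error_count"]),
    }


status = sync_status(parse_samples(SAMPLE))
print(status)
```

With the values above, the two gauges differ by roughly 26 seconds, so the sketch reports the synchronisation as failing.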

I'd very much like to hear everyone's thoughts on this. 🙂 If everyone's happy with this in its current form, I'll work on adding the proper documentation.

Closes #1340.

@bmcustodio
Contributor Author

A Docker image containing a build of this PR is available at bmcstdio/flux:pr-2534 if anyone's interested in taking it for a spin.

@stefanprodan
Member

@bmcstdio there is no need for the _timestamp metrics; the existing histogram sync_duration_seconds already has that information. I'm inclined to close this in favour of #2535.

@stefanprodan
Member

Closing this in favour of #2535. Docs are here: https://docs.fluxcd.io/en/stable/references/monitoring.html

Development

Successfully merging this pull request may close these issues.

Feedback on invalid manifests