diff --git a/docs/monitoring.md b/docs/monitoring.md
new file mode 100644
index 00000000..314d915f
--- /dev/null
+++ b/docs/monitoring.md
@@ -0,0 +1,116 @@
+## Monitoring the Upjet Runtime
+The [Kubernetes controller-runtime] library provides a Prometheus metrics
+endpoint by default. The Upjet-based providers, including
+[upbound/provider-aws], [upbound/provider-azure], [upbound/provider-azuread] and
+[upbound/provider-gcp], expose [various
+metrics](https://book.kubebuilder.io/reference/metrics-reference.html)
+from the controller-runtime to help monitor the health of the various runtime
+components, such as the [`controller-runtime` client], the [leader election
+client], the [controller workqueues], etc. In addition to these metrics, each
+controller also
+[exposes](https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/internal/controller/metrics/metrics.go#L25)
+various metrics related to the reconciliation of the custom resources and active
+reconciliation worker goroutines.
+
+In addition to these metrics exposed by the `controller-runtime`, the
+Upjet-based providers also expose metrics specific to the Upjet runtime. The
+Upjet runtime registers some custom metrics using the [available extension
+mechanism](https://book.kubebuilder.io/reference/metrics.html#publishing-additional-metrics),
+and they are available from the default `/metrics` endpoint of the provider
+pod. The custom metrics exposed by the Upjet runtime are:
+- `upjet_terraform_cli_duration`: A histogram metric that reports
+  statistics, in seconds, on how long it takes a Terraform CLI invocation to
+  complete.
+- `upjet_terraform_active_cli_invocations`: A gauge metric that reports the
+  number of active (running) Terraform CLI invocations.
+- `upjet_terraform_running_processes`: A gauge metric that reports the
+  number of running Terraform CLI and Terraform provider processes.
+- `upjet_resource_ttr`: A histogram metric that measures, in seconds,
+  the time-to-readiness for managed resources.
+
+Prometheus metrics can have [labels] associated with them to differentiate the
+characteristics of the measurements being made, such as differentiating between
+the CLI processes and the Terraform provider processes when counting the number
+of active Terraform processes running. Here is a list of labels associated with
+each of the above custom Upjet metrics:
+- Labels associated with the `upjet_terraform_cli_duration` metric:
+  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`,
+    `apply`, `plan`, `destroy`.
+  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI
+    was invoked synchronously as part of a reconcile loop) or `async` (the CLI
+    was invoked asynchronously; the reconciler goroutine will poll and collect
+    the results later).
+- Labels associated with the `upjet_terraform_active_cli_invocations` metric:
+  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`,
+    `apply`, `plan`, `destroy`.
+  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI
+    was invoked synchronously as part of a reconcile loop) or `async` (the CLI
+    was invoked asynchronously; the reconciler goroutine will poll and collect
+    the results later).
+- Labels associated with the `upjet_terraform_running_processes` metric:
+  - `type`: Either `cli` for Terraform CLI processes (the `terraform` process)
+    or `provider` for the Terraform provider processes. Please note that this
+    is a best-effort metric that may not be able to precisely catch & report
+    all relevant processes. We may, in the future, improve this if needed by,
+    for example, watching the `fork` system calls. Even so, it can already be
+    useful for spotting rogue Terraform provider processes.
+- Labels associated with the `upjet_resource_ttr` metric:
+  - `group`, `version`, `kind`: These labels record the [API group, version and
+    kind](https://kubernetes.io/docs/reference/using-api/api-concepts/) of
+    the managed resource whose
+    [time-to-readiness](https://github.com/crossplane/terrajet/issues/55#issuecomment-929494212)
+    measurement is captured.
+
+## Examples
+You can [export](https://book.kubebuilder.io/reference/metrics.html) all these
+custom metrics and the `controller-runtime` metrics from the provider pod for
+Prometheus. Here are some examples showing the custom metrics in action from the
+Prometheus console:
+
+- `upjet_terraform_active_cli_invocations` gauge metric showing the sync & async
+  `terraform init/apply/plan/destroy` invocations: image
+
+- `upjet_terraform_running_processes` gauge metric showing both `cli` and
+  `provider` labels: image
+
+- `upjet_terraform_cli_duration` histogram metric, showing average Terraform CLI
+  running times for the last 5m: image
+
+- The medians (0.5-quantiles) for these observations aggregated by the mode and
+  Terraform subcommand being invoked: image
+
+- `upjet_resource_ttr` histogram metric, showing average resource TTR for the
+  last 10m: image
+
+- The median (0.5-quantile) for these TTR observations:
+
+These samples were collected by provisioning 10 [upbound/provider-aws]
+`cognitoidp.UserPool` resources while running the provider with a poll interval
+of 1m. In these examples, one can observe that the resources were polled
+(reconciled) twice after they acquired the `Ready=True` condition and were then
+destroyed.
+
+## Reference
+You can find a full reference of the exposed metrics from the Upjet-based
+providers [here](provider_metrics_help.txt).
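
The reference file uses the Prometheus text exposition format, where each metric is announced by a `# HELP` and a `# TYPE` comment line. As a minimal sketch, assuming you have the body of a `/metrics` response (or the reference file) available as a string, the following Go program lists the Upjet metrics and their types. The helper name `metricTypes` and the sample input are illustrative only and are not part of Upjet or controller-runtime.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// metricTypes scans Prometheus text-format output and returns a map from
// metric name to metric type for every "# TYPE <name> <type>" line whose
// name starts with prefix. Hypothetical helper, not part of Upjet.
func metricTypes(exposition, prefix string) map[string]string {
	types := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected shape: ["#", "TYPE", "<name>", "<type>"].
		if len(fields) != 4 || fields[0] != "#" || fields[1] != "TYPE" {
			continue
		}
		if strings.HasPrefix(fields[2], prefix) {
			types[fields[2]] = fields[3]
		}
	}
	return types
}

func main() {
	// Sample input copied from provider_metrics_help.txt.
	const sample = `# HELP upjet_terraform_cli_duration Measures in seconds how long it takes a Terraform CLI invocation to complete
# TYPE upjet_terraform_cli_duration histogram
# HELP upjet_terraform_running_processes The number of running Terraform CLI and Terraform provider processes
# TYPE upjet_terraform_running_processes gauge
# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge
`
	types := metricTypes(sample, "upjet_")

	// Sort the names so the output order is stable.
	names := make([]string, 0, len(types))
	for name := range types {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %s\n", name, types[name])
	}
}
```

Filtering on the `upjet_` prefix separates the Upjet runtime metrics from the `controller-runtime`, Go runtime and client metrics that share the same endpoint.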
+
+[Kubernetes controller-runtime]:
+  https://github.com/kubernetes-sigs/controller-runtime
+[upbound/provider-aws]: https://github.com/upbound/provider-aws
+[upbound/provider-azure]: https://github.com/upbound/provider-azure
+[upbound/provider-azuread]: https://github.com/upbound/provider-azuread
+[upbound/provider-gcp]: https://github.com/upbound/provider-gcp
+[`controller-runtime` client]:
+  https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/client_go_adapter.go#L40
+[leader election client]:
+  https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/leaderelection.go#L12
+[controller workqueues]:
+  https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/workqueue.go#L40
+[labels]: https://prometheus.io/docs/practices/naming/#labels
diff --git a/docs/provider_metrics_help.txt b/docs/provider_metrics_help.txt
new file mode 100644
index 00000000..638a829c
--- /dev/null
+++ b/docs/provider_metrics_help.txt
@@ -0,0 +1,147 @@
+# HELP upjet_terraform_cli_duration Measures in seconds how long it takes a Terraform CLI invocation to complete
+# TYPE upjet_terraform_cli_duration histogram
+
+# HELP upjet_terraform_running_processes The number of running Terraform CLI and Terraform provider processes
+# TYPE upjet_terraform_running_processes gauge
+
+# HELP upjet_resource_ttr Measures in seconds the time-to-readiness (TTR) for managed resources
+# TYPE upjet_resource_ttr histogram
+
+# HELP upjet_terraform_active_cli_invocations The number of active (running) Terraform CLI invocations
+# TYPE upjet_terraform_active_cli_invocations gauge
+
+# HELP certwatcher_read_certificate_errors_total Total number of certificate read errors
+# TYPE certwatcher_read_certificate_errors_total counter
+
+# HELP certwatcher_read_certificate_total Total number of certificate reads
+# TYPE certwatcher_read_certificate_total counter
+
+# HELP controller_runtime_active_workers Number of currently used workers per controller
+# TYPE controller_runtime_active_workers gauge
+
+# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
+# TYPE controller_runtime_max_concurrent_reconciles gauge
+
+# HELP controller_runtime_reconcile_errors_total Total number of reconciliation errors per controller
+# TYPE controller_runtime_reconcile_errors_total counter
+
+# HELP controller_runtime_reconcile_time_seconds Length of time per reconciliation per controller
+# TYPE controller_runtime_reconcile_time_seconds histogram
+
+# HELP controller_runtime_reconcile_total Total number of reconciliations per controller
+# TYPE controller_runtime_reconcile_total counter
+
+# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
+# TYPE go_gc_duration_seconds summary
+
+# HELP go_goroutines Number of goroutines that currently exist.
+# TYPE go_goroutines gauge
+
+# HELP go_info Information about the Go environment.
+# TYPE go_info gauge
+
+# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
+# TYPE go_memstats_alloc_bytes gauge
+
+# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
+# TYPE go_memstats_alloc_bytes_total counter
+
+# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
+# TYPE go_memstats_buck_hash_sys_bytes gauge
+
+# HELP go_memstats_frees_total Total number of frees.
+# TYPE go_memstats_frees_total counter
+
+# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
+# TYPE go_memstats_gc_sys_bytes gauge
+
+# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
+# TYPE go_memstats_heap_alloc_bytes gauge
+
+# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
+# TYPE go_memstats_heap_idle_bytes gauge
+
+# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
+# TYPE go_memstats_heap_inuse_bytes gauge
+
+# HELP go_memstats_heap_objects Number of allocated objects.
+# TYPE go_memstats_heap_objects gauge
+
+# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
+# TYPE go_memstats_heap_released_bytes gauge
+
+# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
+# TYPE go_memstats_heap_sys_bytes gauge
+
+# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
+# TYPE go_memstats_last_gc_time_seconds gauge
+
+# HELP go_memstats_lookups_total Total number of pointer lookups.
+# TYPE go_memstats_lookups_total counter
+
+# HELP go_memstats_mallocs_total Total number of mallocs.
+# TYPE go_memstats_mallocs_total counter
+
+# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
+# TYPE go_memstats_mcache_inuse_bytes gauge
+
+# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
+# TYPE go_memstats_mcache_sys_bytes gauge
+
+# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
+# TYPE go_memstats_mspan_inuse_bytes gauge
+
+# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
+# TYPE go_memstats_mspan_sys_bytes gauge
+
+# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
+# TYPE go_memstats_next_gc_bytes gauge
+
+# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
+# TYPE go_memstats_other_sys_bytes gauge
+
+# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
+# TYPE go_memstats_stack_inuse_bytes gauge
+
+# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
+# TYPE go_memstats_stack_sys_bytes gauge
+
+# HELP go_memstats_sys_bytes Number of bytes obtained from system.
+# TYPE go_memstats_sys_bytes gauge
+
+# HELP go_threads Number of OS threads created.
+# TYPE go_threads gauge
+
+# HELP rest_client_request_duration_seconds Request latency in seconds. Broken down by verb, and host.
+# TYPE rest_client_request_duration_seconds histogram
+
+# HELP rest_client_request_size_bytes Request size in bytes. Broken down by verb and host.
+# TYPE rest_client_request_size_bytes histogram
+
+# HELP rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host.
+# TYPE rest_client_requests_total counter
+
+# HELP rest_client_response_size_bytes Response size in bytes. Broken down by verb and host.
+# TYPE rest_client_response_size_bytes histogram
+
+# HELP workqueue_adds_total Total number of adds handled by workqueue
+# TYPE workqueue_adds_total counter
+
+# HELP workqueue_depth Current depth of workqueue
+# TYPE workqueue_depth gauge
+
+# HELP workqueue_longest_running_processor_seconds How many seconds has the longest running processor for workqueue been running.
+# TYPE workqueue_longest_running_processor_seconds gauge
+
+# HELP workqueue_queue_duration_seconds How long in seconds an item stays in workqueue before being requested
+# TYPE workqueue_queue_duration_seconds histogram
+
+# HELP workqueue_retries_total Total number of retries handled by workqueue
+# TYPE workqueue_retries_total counter
+
+# HELP workqueue_unfinished_work_seconds How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
+# TYPE workqueue_unfinished_work_seconds gauge
+
+# HELP workqueue_work_duration_seconds How long in seconds processing an item from workqueue takes.
+# TYPE workqueue_work_duration_seconds histogram
+