Merge branch 'main' into add-eks-cluster-name-tag
rschalo authored Jul 30, 2024
2 parents 20cb63b + 7872669 commit e797ea2
Showing 3 changed files with 93 additions and 1 deletion.
2 changes: 1 addition & 1 deletion go.mod
@@ -22,6 +22,7 @@ require (
github.com/samber/lo v1.46.0
go.uber.org/multierr v1.11.0
go.uber.org/zap v1.27.0
golang.org/x/exp v0.0.0-20231006140011-7918f672742d
golang.org/x/sync v0.7.0
k8s.io/api v0.30.3
k8s.io/apiextensions-apiserver v0.30.3
@@ -90,7 +91,6 @@ require (
github.com/spf13/pflag v1.0.5 // indirect
go.opencensus.io v0.24.0 // indirect
go.uber.org/automaxprocs v1.5.3 // indirect
golang.org/x/exp v0.0.0-20231006140011-7918f672742d // indirect
golang.org/x/net v0.25.0 // indirect
golang.org/x/oauth2 v0.18.0 // indirect
golang.org/x/sys v0.21.0 // indirect
19 changes: 19 additions & 0 deletions hack/docs/metrics_gen/main.go
@@ -20,6 +20,7 @@ import (
"go/ast"
"go/parser"
"go/token"
"golang.org/x/exp/slices"
"io/fs"
"log"
"os"
@@ -39,6 +40,16 @@ type metricInfo struct {
help string
}

var (
stableMetrics = []string{"controller_runtime", "aws_sdk_go", "client_go", "leader_election", "interruption", "cluster_state", "workqueue", "karpenter_build_info", "karpenter_nodepool_usage", "karpenter_nodepool_limit",
"karpenter_nodeclaims_terminated_total", "karpenter_nodeclaims_created_total", "karpenter_nodes_terminated_total", "karpenter_nodes_created_total", "karpenter_pods_startup_duration_seconds",
"karpenter_provisioner_scheduling_simulation_duration_seconds", "karpenter_provisioner_scheduling_duration_seconds", "karpenter_nodepool_allowed_disruptions", "karpenter_disruption_decisions_total"}
betaMetrics = []string{"status_condition", "cloudprovider", "cloudprovider_batcher", "karpenter_nodeclaims_termination_duration_seconds", "karpenter_nodeclaims_instance_termination_duration_seconds",
"karpenter_nodes_total_pod_requests", "karpenter_nodes_total_pod_limits", "karpenter_nodes_total_daemon_requests", "karpenter_nodes_total_daemon_limits", "karpenter_nodes_termination_time_seconds",
"karpenter_nodes_system_overhead", "karpenter_nodes_allocatable", "karpenter_pods_state", "karpenter_provisioner_scheduling_queue_depth", "karpenter_disruption_queue_failures_total",
"karpenter_disruption_evaluation_duration_seconds", "karpenter_disruption_eligible_nodes", "karpenter_disruption_consolidation_timeouts_total"}
)

func (i metricInfo) qualifiedName() string {
return strings.Join(lo.Compact([]string{i.namespace, i.subsystem, i.name}), "_")
}
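
To illustrate how `qualifiedName` assembles a metric name, here is a minimal standalone sketch. It assumes that `lo.Compact` simply drops empty strings, and it replaces that call with a hypothetical stdlib-only `compact` helper so the example runs without third-party modules. The `metricInfo` type below carries only the three fields the method reads.

```go
package main

import (
	"fmt"
	"strings"
)

// metricInfo mirrors only the fields that qualifiedName reads.
type metricInfo struct {
	namespace string
	subsystem string
	name      string
}

// compact is a hypothetical stdlib stand-in for lo.Compact on []string:
// it returns the input with empty strings removed.
func compact(parts []string) []string {
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		if p != "" {
			out = append(out, p)
		}
	}
	return out
}

// qualifiedName joins the non-empty name parts with underscores.
func (i metricInfo) qualifiedName() string {
	return strings.Join(compact([]string{i.namespace, i.subsystem, i.name}), "_")
}

func main() {
	full := metricInfo{namespace: "karpenter", subsystem: "nodepool", name: "usage"}
	noNamespace := metricInfo{subsystem: "workqueue", name: "depth"}

	fmt.Println(full.qualifiedName())        // prints karpenter_nodepool_usage
	fmt.Println(noNamespace.qualifiedName()) // prints workqueue_depth
}
```

Dropping empty parts before joining is what keeps metrics without a namespace, such as `workqueue_depth`, from gaining a leading underscore.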
@@ -119,6 +130,14 @@ description: >
}
fmt.Fprintf(f, "### `%s`\n", metric.qualifiedName())
fmt.Fprintf(f, "%s\n", metric.help)
switch {
case slices.Contains(stableMetrics, metric.subsystem) || slices.Contains(stableMetrics, metric.qualifiedName()):
fmt.Fprintf(f, "- Stability Level: %s\n", "STABLE")
case slices.Contains(betaMetrics, metric.subsystem) || slices.Contains(betaMetrics, metric.qualifiedName()):
fmt.Fprintf(f, "- Stability Level: %s\n", "BETA")
default:
fmt.Fprintf(f, "- Stability Level: %s\n", "ALPHA")
}
fmt.Fprintln(f)
}
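
The stability lookup added in this hunk can be sketched in isolation as follows. This is an illustrative reduction, not the generator itself: the `stabilityLevel` function name and the shortened `stable` and `beta` lists are assumptions, and the standard-library `slices` package (Go 1.21+) is used in place of `golang.org/x/exp/slices`.

```go
package main

import (
	"fmt"
	"slices"
)

// Shortened, illustrative lists. Entries may be either a subsystem
// (matching every metric in it) or a fully qualified metric name.
var (
	stable = []string{"workqueue", "karpenter_nodepool_usage"}
	beta   = []string{"cloudprovider", "karpenter_pods_state"}
)

// stabilityLevel returns STABLE or BETA when the subsystem or the
// qualified name is listed, and falls back to ALPHA otherwise.
// The stable list is checked first, so it wins if both lists match.
func stabilityLevel(subsystem, qualifiedName string) string {
	switch {
	case slices.Contains(stable, subsystem) || slices.Contains(stable, qualifiedName):
		return "STABLE"
	case slices.Contains(beta, subsystem) || slices.Contains(beta, qualifiedName):
		return "BETA"
	default:
		return "ALPHA"
	}
}

func main() {
	fmt.Println(stabilityLevel("workqueue", "workqueue_depth"))                      // prints STABLE
	fmt.Println(stabilityLevel("pods", "karpenter_pods_state"))                      // prints BETA
	fmt.Println(stabilityLevel("nodeclaims", "karpenter_nodeclaims_launched_total")) // prints ALPHA
}
```

Because unlisted metrics default to ALPHA, a newly added metric is documented at the lowest stability level until someone promotes it explicitly.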

73 changes: 73 additions & 0 deletions website/content/en/preview/reference/metrics.md
@@ -10,254 +10,327 @@ description: >
Karpenter exposes several metrics in Prometheus format for monitoring cluster provisioning status. By default, these metrics are available at `karpenter.karpenter.svc.cluster.local:8000/metrics`; the port is configurable via the `METRICS_PORT` environment variable, documented [here](../settings).
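
As a rough illustration of consuming this endpoint, the sketch below filters Prometheus text-format output for a single metric name. The `sampleLines` helper and the sample body are hypothetical; in practice the body would come from an HTTP GET against the metrics endpoint, and a full client would more likely use a Prometheus parsing library than this line-based filter.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleLines returns the sample lines whose metric name equals metric,
// skipping blank lines and "# HELP" / "# TYPE" comment lines.
func sampleLines(body, metric string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		if name == metric {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Made-up scrape output; the label value and counts are illustrative only.
	body := `# HELP karpenter_build_info A metric with a constant '1' value labeled by version from which karpenter was built.
# TYPE karpenter_build_info gauge
karpenter_build_info{version="0.0.0-example"} 1
karpenter_cluster_state_node_count 3
`
	for _, l := range sampleLines(body, "karpenter_build_info") {
		fmt.Println(l)
	}
}
```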
### `karpenter_build_info`
A metric with a constant '1' value labeled by version from which karpenter was built.
- Stability Level: STABLE

## Nodepool Metrics

### `karpenter_nodepool_usage`
The amount of resources that have been provisioned for a nodepool. Labeled by nodepool name and resource type.
- Stability Level: STABLE

### `karpenter_nodepool_limit`
Limits specified on the nodepool that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type.
- Stability Level: STABLE

### `karpenter_nodepool_allowed_disruptions`
The number of nodes for a given NodePool that can be concurrently disrupting at a point in time. Labeled by NodePool. Note that allowed disruptions can change very rapidly, as new nodes may be created and others may be deleted at any point.
- Stability Level: STABLE

## Nodeclaims Metrics

### `karpenter_nodeclaims_termination_duration_seconds`
Duration of NodeClaim termination in seconds.
- Stability Level: BETA

### `karpenter_nodeclaims_terminated_total`
Number of nodeclaims terminated in total by Karpenter. Labeled by reason the nodeclaim was terminated and the owning nodepool.
- Stability Level: STABLE

### `karpenter_nodeclaims_registered_total`
Number of nodeclaims registered in total by Karpenter. Labeled by the owning nodepool.
- Stability Level: ALPHA

### `karpenter_nodeclaims_launched_total`
Number of nodeclaims launched in total by Karpenter. Labeled by the owning nodepool.
- Stability Level: ALPHA

### `karpenter_nodeclaims_instance_termination_duration_seconds`
Duration of CloudProvider Instance termination in seconds.
- Stability Level: BETA

### `karpenter_nodeclaims_initialized_total`
Number of nodeclaims initialized in total by Karpenter. Labeled by the owning nodepool.
- Stability Level: ALPHA

### `karpenter_nodeclaims_drifted_total`
Number of nodeclaims drifted in total by Karpenter. Labeled by drift type of the nodeclaim and the owning nodepool.
- Stability Level: ALPHA

### `karpenter_nodeclaims_disrupted_total`
Number of nodeclaims disrupted in total by Karpenter. Labeled by disruption type of the nodeclaim and the owning nodepool.
- Stability Level: ALPHA

### `karpenter_nodeclaims_created_total`
Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool.
- Stability Level: STABLE

## Nodes Metrics

### `karpenter_nodes_total_pod_requests`
Node total pod requests are the resources requested by non-DaemonSet pods bound to nodes.
- Stability Level: BETA

### `karpenter_nodes_total_pod_limits`
Node total pod limits are the resources specified by non-DaemonSet pod limits.
- Stability Level: BETA

### `karpenter_nodes_total_daemon_requests`
Node total daemon requests are the resources requested by DaemonSet pods bound to nodes.
- Stability Level: BETA

### `karpenter_nodes_total_daemon_limits`
Node total daemon limits are the resources specified by DaemonSet pod limits.
- Stability Level: BETA

### `karpenter_nodes_termination_time_seconds`
The time taken between a node's deletion request and the removal of its finalizer
- Stability Level: BETA

### `karpenter_nodes_terminated_total`
Number of nodes terminated in total by Karpenter. Labeled by owning nodepool.
- Stability Level: STABLE

### `karpenter_nodes_system_overhead`
Node system daemon overhead is the resources reserved for system overhead: the difference between the node's capacity and allocatable values, as reported by the status.
- Stability Level: BETA

### `karpenter_nodes_leases_deleted_total`
Number of deleted leaked leases.
- Stability Level: ALPHA

### `karpenter_nodes_created_total`
Number of nodes created in total by Karpenter. Labeled by owning nodepool.
- Stability Level: STABLE

### `karpenter_nodes_allocatable`
Node allocatable are the resources allocatable by nodes.
- Stability Level: BETA

## Pods Metrics

### `karpenter_pods_state`
Pod state is the current state of pods. This metric can be used several ways as it is labeled by the pod name, namespace, owner, node, nodepool name, zone, architecture, capacity type, instance type and pod phase.
- Stability Level: BETA

### `karpenter_pods_startup_duration_seconds`
The time from pod creation until the pod is running.
- Stability Level: STABLE

## Provisioner Metrics

### `karpenter_provisioner_scheduling_simulation_duration_seconds`
Duration of scheduling simulations used for deprovisioning and provisioning in seconds.
- Stability Level: STABLE

### `karpenter_provisioner_scheduling_queue_depth`
The number of pods currently waiting to be scheduled.
- Stability Level: BETA

### `karpenter_provisioner_scheduling_duration_seconds`
Duration of scheduling process in seconds.
- Stability Level: STABLE

## Interruption Metrics

### `karpenter_interruption_received_messages_total`
Count of messages received from the SQS queue. Broken down by message type and whether the message was actionable.
- Stability Level: STABLE

### `karpenter_interruption_message_queue_duration_seconds`
Length of time between message creation in queue and an action taken on the message by the controller.
- Stability Level: STABLE

### `karpenter_interruption_deleted_messages_total`
Count of messages deleted from the SQS queue.
- Stability Level: STABLE

## Disruption Metrics

### `karpenter_disruption_replacement_nodeclaim_initialized_seconds`
Amount of time required for a replacement nodeclaim to become initialized.
- Stability Level: ALPHA

### `karpenter_disruption_queue_failures_total`
The number of times that an enqueued disruption decision failed. Labeled by disruption method.
- Stability Level: BETA

### `karpenter_disruption_pods_disrupted_total`
Total number of reschedulable pods disrupted on nodes. Labeled by NodePool, disruption action, method, and consolidation type.
- Stability Level: ALPHA

### `karpenter_disruption_evaluation_duration_seconds`
Duration of the disruption evaluation process in seconds. Labeled by method and consolidation type.
- Stability Level: BETA

### `karpenter_disruption_eligible_nodes`
Number of nodes eligible for disruption by Karpenter. Labeled by disruption method and consolidation type.
- Stability Level: BETA

### `karpenter_disruption_decisions_total`
Number of disruption decisions performed. Labeled by disruption action, method, and consolidation type.
- Stability Level: STABLE

### `karpenter_disruption_consolidation_timeouts_total`
Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type.
- Stability Level: BETA

## Consistency Metrics

### `karpenter_consistency_errors_total`
Number of consistency checks that have failed.
- Stability Level: ALPHA

## Cluster State Metrics

### `karpenter_cluster_state_synced`
Returns 1 if cluster state is synced and 0 otherwise. Synced checks that nodeclaims and nodes that are stored in the APIServer have the same representation as Karpenter's cluster state.
- Stability Level: STABLE

### `karpenter_cluster_state_node_count`
Current count of nodes in cluster state
- Stability Level: STABLE

## Cloudprovider Metrics

### `karpenter_cloudprovider_instance_type_offering_price_estimate`
Instance type offering estimated hourly price used when making informed decisions on node cost calculation, based on instance type, capacity type, and zone.
- Stability Level: BETA

### `karpenter_cloudprovider_instance_type_offering_available`
Instance type offering availability, based on instance type, capacity type, and zone
- Stability Level: BETA

### `karpenter_cloudprovider_instance_type_memory_bytes`
Memory, in bytes, for a given instance type.
- Stability Level: BETA

### `karpenter_cloudprovider_instance_type_cpu_cores`
vCPU cores for a given instance type.
- Stability Level: BETA

### `karpenter_cloudprovider_errors_total`
Total number of errors returned from CloudProvider calls.
- Stability Level: BETA

### `karpenter_cloudprovider_duration_seconds`
Duration of cloud provider method calls. Labeled by the controller, method name and provider.
- Stability Level: BETA

## Cloudprovider Batcher Metrics

### `karpenter_cloudprovider_batcher_batch_time_seconds`
Duration of the batching window per batcher
- Stability Level: BETA

### `karpenter_cloudprovider_batcher_batch_size`
Size of the request batch per batcher
- Stability Level: BETA

## Controller Runtime Metrics

### `controller_runtime_terminal_reconcile_errors_total`
Total number of terminal reconciliation errors per controller
- Stability Level: STABLE

### `controller_runtime_reconcile_total`
Total number of reconciliations per controller
- Stability Level: STABLE

### `controller_runtime_reconcile_time_seconds`
Length of time per reconciliation per controller
- Stability Level: STABLE

### `controller_runtime_reconcile_errors_total`
Total number of reconciliation errors per controller
- Stability Level: STABLE

### `controller_runtime_max_concurrent_reconciles`
Maximum number of concurrent reconciles per controller
- Stability Level: STABLE

### `controller_runtime_active_workers`
Number of currently used workers per controller
- Stability Level: STABLE

## Workqueue Metrics

### `workqueue_work_duration_seconds`
How long in seconds processing an item from workqueue takes.
- Stability Level: STABLE

### `workqueue_unfinished_work_seconds`
How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
- Stability Level: STABLE
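
The stuck-thread deduction mentioned above can be sketched as simple arithmetic, assuming each stuck worker adds roughly one second of unfinished work per second of wall-clock time. The `estimateStuckWorkers` helper and the sample readings are hypothetical.

```go
package main

import "fmt"

// estimateStuckWorkers approximates the number of stuck workers from two
// readings of workqueue_unfinished_work_seconds taken elapsedSeconds apart.
// Under the stated assumption, the slope of the gauge equals the number of
// workers that stayed busy on the same item for the whole interval.
func estimateStuckWorkers(prev, curr, elapsedSeconds float64) float64 {
	if elapsedSeconds <= 0 {
		return 0
	}
	return (curr - prev) / elapsedSeconds
}

func main() {
	// The gauge rose from 120s to 300s over a 60s window:
	// (300 - 120) / 60 = 3 workers apparently stuck.
	fmt.Println(estimateStuckWorkers(120, 300, 60)) // prints 3
}
```

A slope near zero suggests that in-progress items are completing normally; a sustained positive slope is the signal worth alerting on.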

### `workqueue_retries_total`
Total number of retries handled by workqueue
- Stability Level: STABLE

### `workqueue_queue_duration_seconds`
How long in seconds an item stays in workqueue before being requested
- Stability Level: STABLE

### `workqueue_longest_running_processor_seconds`
How many seconds has the longest running processor for workqueue been running.
- Stability Level: STABLE

### `workqueue_depth`
Current depth of workqueue
- Stability Level: STABLE

### `workqueue_adds_total`
Total number of adds handled by workqueue
- Stability Level: STABLE

## Status Condition Metrics

### `operator_status_condition_transition_seconds`
The amount of time a condition was in a given state before transitioning. e.g. Alarm := P99(Updated=False) > 5 minutes
- Stability Level: BETA

### `operator_status_condition_count`
The number of conditions for a given object, type and status. e.g. Alarm := Available=False > 0
- Stability Level: BETA

## Client Go Metrics

### `client_go_request_total`
Number of HTTP requests, partitioned by status code and method.
- Stability Level: STABLE

### `client_go_request_duration_seconds`
Request latency in seconds. Broken down by verb, group, version, kind, and subresource.
- Stability Level: STABLE

## AWS SDK Go Metrics

### `aws_sdk_go_request_total`
The total number of AWS SDK Go requests
- Stability Level: STABLE

### `aws_sdk_go_request_retry_count`
The total number of AWS SDK Go retry attempts per request
- Stability Level: STABLE

### `aws_sdk_go_request_duration_seconds`
Latency of AWS SDK Go requests
- Stability Level: STABLE

### `aws_sdk_go_request_attempt_total`
The total number of AWS SDK Go request attempts
- Stability Level: STABLE

### `aws_sdk_go_request_attempt_duration_seconds`
Latency of AWS SDK Go request attempts
- Stability Level: STABLE

## Leader Election Metrics

### `leader_election_slowpath_total`
Total number of slow path exercised in renewing leader leases. 'name' is the string used to identify the lease. Please make sure to group by name.
- Stability Level: STABLE

### `leader_election_master_status`
Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. Please make sure to group by name.
- Stability Level: STABLE
