A Prometheus exporter for Google Stackdriver Monitoring metrics. It acts as a proxy that queries the Stackdriver API for each metric's time series every time Prometheus scrapes it.
Download the prebuilt binary for your platform and run it:
$ ./stackdriver_exporter <flags>
Using the standard go install (Go must already be installed on your local machine):
$ go install github.com/prometheus-community/stackdriver_exporter@latest
$ stackdriver_exporter <flags>
To run the stackdriver exporter as a Docker container, run:
$ docker run -p 9255:9255 prometheuscommunity/stackdriver-exporter <flags>
You can find a helm chart in the prometheus-community charts repository at https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-stackdriver-exporter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install [RELEASE_NAME] prometheus-community/prometheus-stackdriver-exporter
The exporter can be deployed to an already existing Cloud Foundry environment:
$ git clone https://github.com/prometheus-community/stackdriver_exporter.git
$ cd stackdriver_exporter
Modify the included application manifest file to include the desired properties. Then you can push the exporter to your Cloud Foundry environment:
$ cf push
This exporter can be deployed using the Prometheus BOSH Release.
The Google Stackdriver Exporter uses the Google Golang Client Library, which offers a variety of ways to provide credentials. Please refer to the Google Application Default Credentials documentation to see how the credentials can be provided.
If you are using IAM roles, the roles/monitoring.viewer IAM role contains the required permissions. See the Access Control Guide for more information.
If you are still using the legacy Access scopes, the https://www.googleapis.com/auth/monitoring.read scope is required.
Flag | Required | Default | Description |
---|---|---|---|
google.project-id | No | GCloud SDK auto-discovery | Comma-separated list of Google Project IDs |
monitoring.metrics-ingest-delay | No | | Offsets metric collection by a delay appropriate for each metric type, e.g. because BigQuery metrics are slow to appear |
monitoring.drop-delegated-projects | No | No | Drop metrics from attached projects and fetch project_id only |
monitoring.metrics-type-prefixes | Yes | | Comma-separated Google Stackdriver Monitoring Metric Type prefixes (see example and available metrics) |
monitoring.metrics-interval | No | 5m | Metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API. Only the most recent data point is used |
monitoring.metrics-offset | No | 0s | Offset (into the past) for the metric's timestamp interval to request from the Google Stackdriver Monitoring Metrics API, to handle latency in published metrics |
monitoring.filters | No | | Formatted string to allow filtering on certain metric types |
monitoring.aggregate-deltas | No | | If enabled, treats all DELTA metrics as an in-memory counter instead of a gauge. Be sure to read what to know about aggregating DELTA metrics |
monitoring.aggregate-deltas-ttl | No | 30m | How long a delta metric should continue to be exported and stored after GCP stops producing it. Read slow moving metrics to understand the problem this attempts to solve |
monitoring.descriptor-cache-ttl | No | 0s | How long the metric descriptors for a prefix should be cached |
stackdriver.max-retries | No | 0 | Max number of retries that should be attempted on 503 errors from Stackdriver |
stackdriver.http-timeout | No | 10s | How long stackdriver_exporter should wait for a result from the Stackdriver API |
stackdriver.max-backoff | No | | Max time between each request in an exponential backoff scenario |
stackdriver.backoff-jitter | No | 1s | The amount of jitter to introduce in an exponential backoff scenario |
stackdriver.retry-statuses | No | 503 | The HTTP statuses that should trigger a retry |
web.config.file | No | | [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication |
web.listen-address | No | :9255 | Address to listen on for web interface and telemetry. Repeatable for multiple addresses |
web.systemd-socket | No | | Use systemd socket activation listeners instead of port listeners (Linux only) |
web.stackdriver-telemetry-path | No | /metrics | Path under which to expose Stackdriver metrics |
web.telemetry-path | No | /metrics | Path under which to expose Prometheus metrics |
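As a rough illustration of how monitoring.metrics-interval and monitoring.metrics-offset relate, the following Python sketch computes a request window. It assumes the window ends offset before the current time and spans interval; the function name request_window is hypothetical, and any additional shift applied by monitoring.metrics-ingest-delay is not modelled.

```python
from datetime import datetime, timedelta, timezone


def request_window(now, interval=timedelta(minutes=5), offset=timedelta(0)):
    """Return the (start, end) of the time range to request.

    Assumption: the window ends `offset` before `now` and is `interval` long.
    The defaults mirror the 5m and 0s defaults listed in the flag table.
    """
    end = now - offset
    start = end - interval
    return start, end


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = request_window(now, offset=timedelta(minutes=1))
print(start.isoformat(), "->", end.isoformat())
```

With a one-minute offset the whole window moves one minute into the past, which is the mechanism the offset flag uses to tolerate late-published data points.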
The Stackdriver Exporter supports TLS and basic authentication. To use TLS and/or basic authentication, you need to pass a configuration file using the --web.config.file parameter. The format of the file is described in the exporter-toolkit repository.
The exporter returns the following metrics:
Metric | Description | Labels |
---|---|---|
stackdriver_monitoring_api_calls_total | Total number of Google Stackdriver Monitoring API calls made | project_id |
stackdriver_monitoring_scrapes_total | Total number of Google Stackdriver Monitoring metrics scrapes | project_id |
stackdriver_monitoring_scrape_errors_total | Total number of Google Stackdriver Monitoring metrics scrape errors | project_id |
stackdriver_monitoring_last_scrape_error | Whether the last metrics scrape from Google Stackdriver Monitoring resulted in an error (1 for error, 0 for success) | project_id |
stackdriver_monitoring_last_scrape_timestamp | Time of the last metrics scrape from Google Stackdriver Monitoring, in seconds since 1970 | project_id |
stackdriver_monitoring_last_scrape_duration_seconds | Duration of the last metrics scrape from Google Stackdriver Monitoring | project_id |
Metrics gathered from Google Stackdriver Monitoring are converted to Prometheus metrics:
- Metric names are normalized according to the Prometheus specification using the pattern namespace_subsystem_name:
  - namespace is a constant prefix (stackdriver)
  - subsystem is the normalized monitored resource type (e.g. gce_instance)
  - name is the normalized metric type (e.g. compute_googleapis_com_instance_cpu_usage_time)
- Labels attached to each metric are an aggregation of:
  - the unit in which the metric value is reported
  - the metric type labels (see Metrics List)
  - the monitored resource labels (see Monitored Resource Types)
- For each time series, only the most recent data point is exported.
- Stackdriver GAUGE metric kinds are reported as Prometheus Gauge metrics.
- Stackdriver CUMULATIVE metric kinds are reported as Prometheus Counter metrics.
- Stackdriver DELTA metric kinds are reported as Prometheus Gauge metrics, or as an accumulating Counter if monitoring.aggregate-deltas is set.
- Only BOOL, INT64, DOUBLE and DISTRIBUTION metric types are supported; other types (STRING and MONEY) are discarded.
- The DISTRIBUTION metric type is reported as a Prometheus Histogram, except that the _sum time series is not supported.
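The naming rule above can be sketched in a few lines of Python. This is a simplified illustration, not the exporter's actual normalization code: it assumes that every run of non-alphanumeric characters becomes a single underscore, and the helper names normalize and prometheus_name are hypothetical.

```python
import re


def normalize(value):
    # Simplifying assumption: collapse each run of characters outside
    # [a-zA-Z0-9] into one underscore and trim underscores at the ends.
    return re.sub(r"[^a-zA-Z0-9]+", "_", value).strip("_")


def prometheus_name(resource_type, metric_type, namespace="stackdriver"):
    # Join namespace, subsystem (resource type) and name (metric type).
    return "_".join([namespace, normalize(resource_type), normalize(metric_type)])


name = prometheus_name(
    "gce_instance", "compute.googleapis.com/instance/cpu/usage_time"
)
print(name)
```

Under these assumptions the Stackdriver metric type compute.googleapis.com/instance/cpu/usage_time on a gce_instance resource maps to a name that starts with the stackdriver prefix, followed by the resource type and the flattened metric type.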
If we want to get all CPU (compute.googleapis.com/instance/cpu) and Disk (compute.googleapis.com/instance/disk) metrics for all Google Compute Engine instances, we can run the exporter with the following options:
stackdriver_exporter \
--google.project-id=my-test-project \
--monitoring.metrics-type-prefixes "compute.googleapis.com/instance/cpu,compute.googleapis.com/instance/disk"
Using extra filters:
stackdriver_exporter \
--google.project-id=my-test-project \
--monitoring.metrics-type-prefixes='pubsub.googleapis.com/subscription' \
--monitoring.filters='pubsub.googleapis.com/subscription:resource.labels.subscription_id=monitoring.regex.full_match("us-west4.*my-team-subs.*")'
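Judging from the example above, a monitoring.filters value pairs a metric type prefix with a filter expression, separated by the first colon. The Python sketch below splits such a value under that assumption; parse_filter is a hypothetical helper, not part of the exporter.

```python
def parse_filter(raw):
    """Split '<metric type prefix>:<filter expression>' at the first colon.

    Assumption: the prefix itself never contains a colon, so everything
    after the first one belongs to the filter expression.
    """
    prefix, separator, expression = raw.partition(":")
    if not separator or not prefix or not expression:
        raise ValueError("expected '<prefix>:<filter>', got %r" % raw)
    return prefix, expression


raw = (
    "pubsub.googleapis.com/subscription:"
    'resource.labels.subscription_id=monitoring.regex.full_match("us-west4.*my-team-subs.*")'
)
prefix, expression = parse_filter(raw)
print(prefix)
print(expression)
```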
The stackdriver_exporter collects all configured metric type prefixes by default.
For advanced uses, the collection can be filtered with a repeatable URL parameter called collect. In the Prometheus configuration, you can use this syntax under the scrape config:
params:
  collect:
    - compute.googleapis.com/instance/cpu
    - compute.googleapis.com/instance/disk
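To see what the repeated collect parameter looks like on the wire, the Python sketch below builds the URL that such a scrape would request. The host localhost, the default port 9255 and the /metrics path are assumptions taken from the flag defaults, and scrape_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def scrape_url(base, prefixes):
    # doseq=True emits one collect=... pair per list element,
    # which is what "repeatable URL parameter" means here.
    return base + "?" + urlencode({"collect": prefixes}, doseq=True)


url = scrape_url(
    "http://localhost:9255/metrics",
    [
        "compute.googleapis.com/instance/cpu",
        "compute.googleapis.com/instance/disk",
    ],
)
print(url)

# Decoding the query string recovers the original list of prefixes.
print(parse_qs(urlsplit(url).query)["collect"])
```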
Treating DELTA metrics as a gauge produces data which is wildly inaccurate and not very useful (see prometheus-community#116). However, aggregating the DELTA metrics over time is not a perfect solution; it is intended to produce data which mirrors GCP's data as closely as possible.
The biggest challenge to producing a correct result is that a Prometheus counter does not start at 0; it starts at the first value which is exported. This can cause inconsistencies when the exporter first starts, and for slow moving metrics, which are described below.
When the exporter first starts, it has no persisted counter information and the stores will be empty. When the first sample is received for a series, it is intended to be a change from a previous value according to GCP, a delta. But the Prometheus counter is not initialized to 0, so it does not export this as a change from 0; it exports that the counter started at the sample value. Since the series exported are dynamic, it's not possible to export an initial 0 value to account for this issue. The end result is that it can take a few cycles for aggregated metrics to start showing exactly the same rates as GCP.
As an example, consider the Prometheus query sum by(backend_target_name) (rate(stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_request_bytes_count[1m])), which aggregates 5 series. All 5 series will need to have two samples from GCP in order for the query to produce the same result as GCP.
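The startup effect can be reproduced with a small Python sketch. It is a deliberately simplified model, not the exporter's implementation: DeltaCounter and observed_increase are hypothetical names, and the increase is computed as last minus first exported value, which is only a coarse stand-in for how Prometheus evaluates rate().

```python
class DeltaCounter:
    """Accumulates DELTA samples into a running total for one series."""

    def __init__(self):
        self.total = None  # nothing exported yet

    def add(self, delta):
        # The first delta becomes the counter's starting value;
        # later deltas are added on top of it.
        self.total = delta if self.total is None else self.total + delta
        return self.total


def observed_increase(exported):
    # Coarse model of what a counter query can see: the difference
    # between the last and the first exported value.
    return exported[-1] - exported[0]


deltas = [40, 10, 25]  # made-up DELTA samples reported by GCP
counter = DeltaCounter()
exported = [counter.add(delta) for delta in deltas]

print(exported)                     # values exported on each scrape
print(sum(deltas))                  # change according to GCP
print(observed_increase(exported))  # change visible from the counter
```

The first delta sets the counter's starting point rather than counting as an increase, so the visible increase is smaller than GCP's total by exactly that first sample.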
A slow moving metric is a metric which does not change with every sample from GCP. GCP does not consistently report slow moving DELTA metrics. If a series goes unreported for too long (default 5m), Prometheus will mark it as stale. The end result is that the next reported sample will be treated as the start of a new series and not as an increment from the previous value.
There are two features which attempt to combat this issue:
- monitoring.aggregate-deltas-ttl, which controls how long a metric is persisted in the data store after it is no longer being reported by GCP
- Metrics which were not collected during a scrape are still exported at their current counter value
The default configuration when using monitoring.aggregate-deltas gives a 30 minute buffer to slower moving metrics, and monitoring.aggregate-deltas-ttl can be adjusted to trade memory requirements against correctness. Storing the data for longer results in a higher memory cost.
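The two features can be pictured with the following Python sketch of a TTL-bound store. It is an illustrative model under stated assumptions, not the exporter's data store: DeltaStore, its collect method and the use of plain seconds for time are all hypothetical.

```python
class DeltaStore:
    """Keeps running totals per series and drops series unseen for `ttl` seconds."""

    def __init__(self, ttl_seconds=30 * 60):  # mirrors the 30m default
        self.ttl = ttl_seconds
        self.entries = {}  # series name -> [running total, last time seen]

    def collect(self, now, deltas):
        # Fold freshly reported deltas into the running totals.
        for series, delta in deltas.items():
            entry = self.entries.get(series)
            if entry is None:
                self.entries[series] = [delta, now]
            else:
                entry[0] += delta
                entry[1] = now
        # Evict series that GCP has not reported for longer than the TTL.
        expired = [s for s, (_, seen) in self.entries.items() if now - seen > self.ttl]
        for series in expired:
            del self.entries[series]
        # Export every remaining series, including those with no new delta.
        return {s: total for s, (total, _) in self.entries.items()}


store = DeltaStore(ttl_seconds=1800)
first = store.collect(0, {"requests": 5})     # first sample
quiet = store.collect(600, {})                # not reported, still exported
later = store.collect(900, {"requests": 2})   # increments the same counter
gone = store.collect(900 + 1801, {})          # past the TTL, evicted
print(first, quiet, later, gone)
```

A longer TTL keeps quiet series alive across larger reporting gaps, at the cost of holding more entries in memory.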
The feature which continues to export metrics which are not collected can cause the error "the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested" if your scrape config for the exporter has honor_timestamps enabled (this is the default value). This happens because it's not possible to tell the difference between GCP having late arriving data and GCP not exporting a value. The underlying counter is still incremented when this happens, so the next reported sample will show a higher rate than expected.
Refer to the contributing guidelines.
Apache License 2.0, see LICENSE.