diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md
index 3c0d9902c6..649577ebc1 100644
--- a/.github/ISSUE_TEMPLATE.md
+++ b/.github/ISSUE_TEMPLATE.md
@@ -9,7 +9,7 @@ about what components it touches e.g "query:" or ".*:"
In case of issues related to exact bucket implementation, please ping corresponded maintainer from list here: https://github.com/thanos-io/thanos/blob/master/docs/storage.md -->
-**Thanos, Prometheus and Golang version used**
+**Thanos, Prometheus and Golang version used**:
-**What happened**
+**Object Storage Provider**:
-**What you expected to happen**
+**What happened**:
+
+**What you expected to happen**:
**How to reproduce it (as minimally and precisely as possible)**:
-**Full logs to relevant components**
+**Full logs to relevant components**:
-**Anything else we need to know**
+**Anything else we need to know**:
-* [ ] CHANGELOG entry if change is relevant to the end user.
+
+
+* [ ] I added CHANGELOG entry for this change.
+* [ ] Change is not relevant to the end user.

## Changes

diff --git a/README.md b/README.md
index 50bea12953..3e4446f9ac 100644
--- a/README.md
+++ b/README.md
@@ -12,6 +12,8 @@ Thanos is a set of components that can be composed into a highly available metri
system with unlimited storage capacity, which can be added seamlessly on top of existing Prometheus deployments.

+Thanos is a [CNCF](https://www.cncf.io/) Sandbox project.
+
Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies. Additionally, it provides a global query view across all Prometheus installations and can merge data from Prometheus

@@ -23,16 +25,12 @@ Concretely the aims of the project are:

1. Unlimited retention of metrics.
1. High availability of components, including Prometheus.

-## Architecture Overview
-
-![architecture_overview](docs/img/arch.jpg)
-
## Getting Started

* **[Getting Started](https://thanos.io/getting-started.md/)**
* [Design](https://thanos.io/design.md/)
-* [Prom Meetup Slides](https://www.slideshare.net/BartomiejPotka/thanos-global-durable-prometheus-monitoring)
-* [Introduction blog post](https://improbable.io/games/blog/thanos-prometheus-at-scale)
+* [Blog posts](docs/getting-started.md#blog-posts)
+* [Talks](docs/getting-started.md#talks)
* [Proposals](docs/proposals)
* [Integrations](docs/integrations.md)

@@ -48,6 +46,10 @@ Concretely the aims of the project are:

* Simple gRPC "Store API" for unified data access across all metric data
* Easy integration points for custom metric providers

+## Architecture Overview
+
+![architecture_overview](docs/img/arch.jpg)
+
## Thanos Philosophy

The philosophy of Thanos and our community is borrowing much from UNIX philosophy and the golang programming language.

diff --git a/docs/components/query.md b/docs/components/query.md
index 283d624c88..9e164d85b7 100644
--- a/docs/components/query.md
+++ b/docs/components/query.md
@@ -4,14 +4,15 @@
type: docs
menu: components
---

-# Query
+# Querier/Query

-The query component implements the Prometheus HTTP v1 API to query data in a Thanos cluster via PromQL.
+The Querier component (also known as "Query") implements the [Prometheus HTTP v1 API](https://prometheus.io/docs/prometheus/latest/querying/api/) to query data in a Thanos cluster via PromQL.

-It gathers the data needed to evaluate the query from underlying StoreAPIs. See [here](../service-discovery.md)
-on how to connect querier with desired StoreAPIs.
+In short, it gathers the data needed to evaluate the query from underlying [StoreAPIs](../../pkg/store/storepb/rpc.proto), evaluates the query and returns the result.

-Querier currently is fully stateless and horizontally scalable.
+Querier is fully stateless and horizontally scalable.
+
+Example command to run Querier:

```bash
$ thanos query \
    --store "<store-ip>:<grpc-port>" \
    --store "<store-ip>:<grpc-port>"
```

@@ -19,8 +20,51 @@ $ thanos query \
+## Querier use cases, why do I need this component?
+
+Thanos Querier essentially allows you to aggregate and optionally deduplicate multiple metrics backends under a single Prometheus Query endpoint.
+
+### Global View
+
+Since, for the Querier, "a backend" is anything that implements the gRPC StoreAPI, we can aggregate data from any number of different storages, like:
+
+* Prometheus (see [Sidecar](sidecar.md))
+* Object Storage (see [Store Gateway](store.md))
+* Global alerting/recording rules evaluations (see [Ruler](rule.md))
+* Metrics received from Prometheus remote write streams (see [Thanos Receiver](../proposals/201812_thanos-remote-receive.md))
+* Another Querier (you can stack Queriers on top of each other)
+* Non-Prometheus systems!
+    * e.g. [OpenTSDB](../integrations.md#opentsdb)
+
+Thanks to that, you can run queries (manually, from Grafana or via an alerting rule) that aggregate metrics from a mix of those sources.
+
+Some examples:
+
+* `sum(cpu_used{cluster=~"cluster-(eu1|eu2|eu3|us1|us2|us3)", job="service1"})` will give you the sum of CPU used inside all listed clusters for service `service1`. This will work even if those clusters run multiple Prometheus servers each. The Querier will know which data sources to query.
+
+* In a single cluster you might shard Prometheus functionally or run different Prometheus instances for different tenants. You can spin up a Querier to have access to both within a single query evaluation.
+
+### Run-time deduplication of HA groups

-## Deduplication

+Prometheus is stateful and does not allow replicating its database. This means that increasing high availability by running multiple Prometheus replicas is not straightforward.
+Simple load balancing will not work: after a crash, for example, a replica might be up again, but querying it will return a small gap for the period it was down. Meanwhile, the second replica might be up now, but it could have been down at another moment (e.g. during a rolling restart), so load balancing on top of those replicas does not work well.
+
+Thanos Querier instead pulls the data from both replicas and deduplicates those signals, filling the gaps, if any, transparently for the Querier consumer.
+
+## Metric Query Flow Overview
+
+![querier-steps](../img/querier.svg)
+
+Overall, the QueryAPI exposed by Thanos is guaranteed to be compatible with the [Prometheus 2.x API](https://prometheus.io/docs/prometheus/latest/querying/api/).
+The above diagram shows what Querier does for each Prometheus query request.
+
+See [here](../service-discovery.md) on how to connect Querier with desired StoreAPIs.
+
+### Deduplication

The query layer can deduplicate series that were collected from high-availability pairs of data sources such as Prometheus. A fixed single or multiple replica labels must be chosen for the entire cluster and can then be passed to query nodes on startup.

@@ -73,16 +117,17 @@ $ thanos query \

This logic can also be controlled via parameter on QueryAPI. More details below.

-## Query API
-
-Overall QueryAPI exposed by Thanos is guaranteed to be compatible with Prometheus 2.x.
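Because deduplication is exposed as a request parameter, you can observe its effect directly from the command line. A minimal sketch, assuming a Querier listening on `localhost:9090` (an assumption; use your own address) with `replica` configured as a replica label:

```bash
# Deduplicated series (the default behaviour when replica labels are configured).
curl -s 'http://localhost:9090/api/v1/query?query=up&dedup=true'

# Raw series: each replica's series is returned separately,
# still carrying its `replica` external label.
curl -s 'http://localhost:9090/api/v1/query?query=up&dedup=false'
```

Comparing the two responses shows exactly which series get merged.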
+## Query API Overview

-However, for additional Thanos features, Thanos, on top of Prometheus adds
+As mentioned, the Query API exposed by Thanos is guaranteed to be compatible with the [Prometheus 2.x API](https://prometheus.io/docs/prometheus/latest/querying/api/).
+However, for additional Thanos features on top of Prometheus, Thanos adds:

* partial response behaviour
* several additional parameters listed below
* custom response fields.

+Let's walk through all of those extensions:
+
### Partial Response

QueryAPI and StoreAPI has additional behaviour controlled via query parameter called [PartialResponseStrategy](/pkg/store/storepb/rpc.pb.go).

@@ -169,7 +214,6 @@ type queryData struct {

Additional field is `Warnings` that contains every error that occurred that is assumed non critical. `partial_response` option controls if storeAPI unavailability is considered critical.
-
## Expose UI on a sub-path

It is possible to expose thanos-query UI and optionally API on a sub-path.

diff --git a/docs/design.md b/docs/design.md
index f993237b7d..c9bd09be0b 100644
--- a/docs/design.md
+++ b/docs/design.md
@@ -147,7 +147,6 @@ For example, rule sets can be divided across multiple HA pairs of rule nodes. St
Overall, first-class horizontal sharding is possible but will not be considered for the time being since there's no evidence that it is required in practical setups.
-
## Cost

The only extra cost Thanos adds to an existing Prometheus setup is essentially the price of storing and querying data from the object storage and running of the store node.

diff --git a/docs/getting-started.md b/docs/getting-started.md
index 29665e5212..1100b4f43e 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -8,17 +8,26 @@ slug: /getting-started.md

# Getting started

-Thanos provides a global query view, data backup, and historical data access as its core features in a single binary. All three features can be run independently of each other. This allows you to have a subset of Thanos features ready for immediate benefit or testing, while also making it flexible for gradual roll outs in more complex environments.
+Thanos provides a global query view, high availability, data backup, and cheap historical data access as its core features in a single binary.

-In this quick-start guide, we will configure Thanos and all components mentioned to work against an object storage cloud provider.
-Thanos is able to use [different storage providers](storage.md), with the ability to add more providers as necessary.
+Those features can be deployed independently of each other. This allows you to have a subset of Thanos features ready
+for immediate benefit or testing, while also making it flexible for gradual rollouts in more complex environments.

-Thanos will work in cloud native environments as well as more traditional ones. Some users run Thanos in Kubernetes while others on bare metal.
+In this quick-start guide, we will explain:

-Thanos aims for simple deployment and maintenance model. The only dependencies are:
+* How to ask questions, build and contribute to Thanos.
+* A few common ways of deploying Thanos.
+* Links for further reading.

-* One or more [Prometheus](https://prometheus.io) v2.2.1+ installations with persistent disk
-* Optional object storage
+Thanos will work in cloud native environments as well as more traditional ones. Some users run Thanos in Kubernetes while others on bare metal.
+
+## Dependencies
+
+Thanos aims for a simple deployment and maintenance model.
The only dependencies are: + +* One or more [Prometheus](https://prometheus.io) v2.2.1+ installations with persistent disk. +* Optional object storage + * Thanos is able to use [many different storage providers](storage.md), with the ability to add more providers as necessary. ## Get Thanos! @@ -33,10 +42,6 @@ During that, we build tarballs for major platforms and release docker images. See [release process docs](release-process.md) for details. -## Running Thanos - -For detailed, free in-browser interactive tutorial please visit our [Katacoda Thanos Course](https://katacoda.com/bwplotka/courses/thanos) - ## Building from source: Thanos is built purely in [Golang](https://golang.org/), thus allowing to run Thanos on various x64 operating systems. @@ -67,210 +72,42 @@ of the community. Here are ways to get in touch with the community: See [MAINTAINERS.md](/MAINTAINERS.md) -## Quick Overview - -## Architecture - -architecture overview - -### Prometheus - -Thanos bases itself on vanilla [Prometheus](https://prometheus.io/) (v2.2.1+). - -To find out the Prometheus' versions Thanos is tested against, look at the value of the `PROM_VERSIONS` variable in the [Makefile](/Makefile). - -### Components - -Following the [KISS](https://en.wikipedia.org/wiki/KISS_principle) and Unix philosophies, Thanos is made of a set of components with each filling a specific role. - -* Sidecar: connects to Prometheus and reads its data for query and/or upload it to cloud storage -* Store Gateway: exposes the content of a cloud storage bucket -* Compactor: compact and downsample data stored in remote storage -* Receiver: receives data from Prometheus' remote-write WAL, exposes it and/or upload it to cloud storage -* Ruler: evaluates recording and alerting rules against data in Thanos for exposition and/or upload -* Query Gateway: implements Prometheus' v1 API to aggregate data from the underlying components - -### [Sidecar](components/sidecar.md) - -Thanos integrates with existing Prometheus servers through a [Sidecar process](https://docs.microsoft.com/en-us/azure/architecture/patterns/sidecar#solution), which runs on the same machine or in the same pod as the Prometheus server. - -The purpose of the Sidecar is to backup Prometheus data into an Object Storage bucket, and giving other Thanos components access to the Prometheus instance the Sidecar is attached to via a gRPC API. - -The Sidecar makes use of the `reload` Prometheus endpoint. Make sure it's enabled with the flag `--web.enable-lifecycle`. - -#### External storage - -The following configures the sidecar to write Prometheus' data into a configured object storage: +## Community Thanos Kubernetes Applications -```bash -thanos sidecar \ - --tsdb.path /var/prometheus \ # TSDB data directory of Prometheus - --prometheus.url "http://localhost:9090" \ # Be sure that the sidecar can use this url! - --objstore.config-file bucket_config.yaml \ # Storage configuration for uploading data -``` - -The format of YAML file depends on the provider you choose. Examples of config and up-to-date list of storage types Thanos supports is available [here](storage.md). - -Rolling this out has little to zero impact on the running Prometheus instance. It is a good start to ensure you are backing up your data while figuring out the other pieces of Thanos. - -If you are not interested in backing up any data, the `--objstore.config-file` flag can simply be omitted. +Thanos is **not** tied to Kubernetes. 
However, Kubernetes, Thanos and Prometheus are all part of the CNCF, so the most popular applications run on top of Kubernetes.

* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar.yaml)_
* _[Example Kubernetes manifest with Minio upload](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar-lts.yaml)_
* _[Example Deploying sidecar using official Prometheus Helm Chart](/tutorials/kubernetes-helm/README.md)_
* _[Details & Config for other object stores](storage.md)_

Our friendly community maintains a few different ways of installing Thanos on Kubernetes. See those below:

#### Store API

The Sidecar component implements and exposes a gRPC _[Store API](/pkg/store/storepb/rpc.proto#L19)_. The sidecar implementation allows you to query the metric data stored in Prometheus.

Let's extend the Sidecar in the previous section to connect to a Prometheus server, and expose the Store API.

```bash
thanos sidecar \
    --tsdb.path /var/prometheus \
    --objstore.config-file bucket_config.yaml \       # Bucket config file to send data to
    --prometheus.url http://localhost:9090 \          # Location of the Prometheus HTTP server
    --http-address 0.0.0.0:19191 \                    # HTTP endpoint for collecting metrics on the Sidecar
    --grpc-address 0.0.0.0:19090                      # GRPC endpoint for StoreAPI
```

+* [prometheus-operator](https://github.com/coreos/prometheus-operator): The Prometheus Operator has support for deploying Prometheus with Thanos.
+* [kube-thanos](https://github.com/thanos-io/kube-thanos): Jsonnet-based Kubernetes templates.
+* [Community Helm charts](https://hub.helm.sh/charts?q=thanos)

* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar.yaml)_
* _[Example Kubernetes manifest with GCS upload](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar-lts.yaml)_

+If you want to add yourself to this list, let us know!

#### External Labels

+## Deploying Thanos

Prometheus allows the configuration of "external labels" of a given Prometheus instance. These are meant to globally identify the role of that instance. As Thanos aims to aggregate data across all instances, providing a consistent set of external labels becomes crucial!

+* [WIP] Detailed, free, in-browser interactive tutorial [as Katacoda Thanos Course](https://katacoda.com/bwplotka/courses/thanos)
+* [Quick Tutorial](./quick-tutorial.md) on the Thanos website.

Every Prometheus instance must have a globally unique set of identifying labels. For example, in Prometheus's configuration file:

+## Operating

```yaml
global:
  external_labels:
    region: eu-west
    monitor: infrastructure
    replica: A
```

### [Query Gateway](components/query.md)

Now that we have setup the Sidecar for one or more Prometheus instances, we want to use Thanos' global [Query Layer](components/query.md) to evaluate PromQL queries against all instances at once.

The Query component is stateless and horizontally scalable and can be deployed with any number of replicas. Once connected to the Sidecars, it automatically detects which Prometheus servers need to be contacted for a given PromQL query.

Query also implements Prometheus's official HTTP API and can thus be used with external tools such as Grafana. It also serves a derivative of Prometheus's UI for ad-hoc querying and stores status.

Below, we will set up a Query to connect to our Sidecars, and expose its HTTP UI.
- -```bash -thanos query \ - --http-address 0.0.0.0:19192 \ # HTTP Endpoint for Query UI - --store 1.2.3.4:19090 \ # Static gRPC Store API Address for the query node to query - --store 1.2.3.5:19090 \ # Also repeatable - --store dnssrv+_grpc._tcp.thanos-store.monitoring.svc # Supports DNS A & SRV records -``` - -Go to the configured HTTP address that should now show a UI similar to that of Prometheus. If the cluster formed correctly you can now query across all Prometheus instances within the cluster. You can also check the Stores page to check up on your stores. - -#### Deduplicating Data from Prometheus HA pairs - -The Query component is also capable of deduplicating data collected from Prometheus HA pairs. This requires configuring Prometheus's `global.external_labels` configuration block (as mentioned in the [External Labels section](getting-started.md#external-labels)) to identify the role of a given Prometheus instance. - -A typical choice is simply the label name "replica" while letting the value be whatever you wish. For example, you might set up the following in Prometheus's configuration file: - -```yaml -global: - external_labels: - region: eu-west - monitor: infrastructure - replica: A -# ... -``` - -In a Kubernetes stateful deployment, the replica label can also be the pod name. - -Reload your Prometheus instances, and then, in Query, we will define `replica` as the label we want to enable deduplication to occur on: - -```bash -thanos query \ - --http-address 0.0.0.0:19192 \ - --store 1.2.3.4:19090 \ - --store 1.2.3.5:19090 \ - --query.replica-label replica # Replica label for de-duplication - --query.replica-label replicaX # Supports multiple replica labels for de-duplication -``` - -Go to the configured HTTP address, and you should now be able to query across all Prometheus instances and receive de-duplicated data. - -* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/thanos-querier.yaml)_ - -#### Communication Between Components - -The only required communication between nodes is for Thanos Querier to be able to reach gRPC storeAPIs you provide. Thanos Querier periodically calls Info endpoint to collect up-to-date metadata as well as checking the health of given StoreAPI. -The metadata includes the information about time windows and external labels for each node. - -There are various ways to tell query component about the StoreAPIs it should query data from. The simplest way is to use a static list of well known addresses to query. -These are repeatable so can add as many endpoint as needed. You can put DNS domain prefixed by `dns+` or `dnssrv+` to have Thanos Query do an `A` or `SRV` lookup to get all required IPs to communicate with. - -```bash -thanos query \ - --http-address 0.0.0.0:19192 \ # Endpoint for Query UI - --grpc-address 0.0.0.0:19092 \ # gRPC endpoint for Store API - --store 1.2.3.4:19090 \ # Static gRPC Store API Address for the query node to query - --store 1.2.3.5:19090 \ # Also repeatable - --store dns+rest.thanos.peers:19092 # Use DNS lookup for getting all registered IPs as separate StoreAPIs -``` - -Read more details [here](service-discovery.md). 
* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar.yaml)_
* _[Example Kubernetes manifest with GCS upload](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar-lts.yaml)_

### [Store Gateway](components/store.md)

As the sidecar backs up data into the object storage of your choice, you can decrease Prometheus retention and store less locally. However we need a way to query all that historical data again.
The store gateway does just that by implementing the same gRPC data API as the sidecars but backing it with data it can find in your object storage bucket.
Just like sidecars and query nodes, the store gateway exposes StoreAPI and needs to be discovered by Thanos Querier.

```bash
thanos store \
    --data-dir /var/thanos/store \                    # Disk space for local caches
    --objstore.config-file bucket_config.yaml \       # Bucket to fetch data from
    --http-address 0.0.0.0:19191 \                    # HTTP endpoint for collecting metrics on the Store Gateway
    --grpc-address 0.0.0.0:19090                      # GRPC endpoint for StoreAPI
```

The store gateway occupies small amounts of disk space for caching basic information about data in the object storage. This will rarely exceed more than a few gigabytes and is used to improve restart times. It is useful but not required to preserve it across restarts.

* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/thanos-store-gateway.yaml)_

### [Compactor](components/compact.md)

A local Prometheus installation periodically compacts older data to improve query efficiency. Since the sidecar backs up data as soon as possible, we need a way to apply the same process to data in the object storage.

The compactor component simple scans the object storage and processes compaction where required. At the same time it is responsible for creating downsampled copies of data to speed up queries.

```bash
thanos compact \
    --data-dir /var/thanos/compact \                  # Temporary workspace for data processing
    --objstore.config-file bucket_config.yaml \       # Bucket where to apply the compacting
    --http-address 0.0.0.0:19191                      # HTTP endpoint for collecting metrics on the Compactor
```

The compactor is not in the critical path of querying or data backup. It can either be run as a periodic batch job or be left running to always compact data as soon as possible. It is recommended to provide 100-300GB of local disk space for data processing.

+See up-to-date [jsonnet mixins](https://github.com/thanos-io/kube-thanos/tree/master/jsonnet/thanos-mixin).
+We also have example Grafana dashboards [here](/examples/grafana/monitoring.md) and some [alerts](/examples/alerts/alerts.md) to get you started.
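To turn those mixins into rule files you can load into Prometheus, one possible workflow is to vendor them with jsonnet-bundler and render the standard mixin entry point. A rough sketch, assuming `jb` and `jsonnet` are installed; the package path and `mixin.libsonnet` entry point below are assumptions based on the common monitoring-mixin layout, so check the repository for the exact paths:

```bash
# Vendor the mixin into ./vendor (package path is an assumption; verify it).
jb init
jb install github.com/thanos-io/kube-thanos/jsonnet/thanos-mixin

# Render the Prometheus alerting rules; JSON output is also valid YAML,
# so Prometheus can load the resulting file directly.
jsonnet -J vendor -e '(import "thanos-mixin/mixin.libsonnet").prometheusAlerts' > thanos-alerts.json
```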
_NOTE: The compactor must be run as a **singleton** and must not run when manually modifying data in the bucket._

## Talks

### [Ruler](components/rule.md)

* 02.2018: [Very first Prometheus Meetup Slides](https://www.slideshare.net/BartomiejPotka/thanos-global-durable-prometheus-monitoring)
* 02.2019: [FOSDEM + demo](https://fosdem.org/2019/schedule/event/thanos_transforming_prometheus_to_a_global_scale_in_a_seven_simple_steps/)
* 09.2019: [CloudNative Warsaw Slides](https://docs.google.com/presentation/d/1cKpbJY3jIAtr03M-zcNujwBA38_LDj7NqE4LjNfvglE/edit?usp=sharing)

In case of Prometheus with Thanos sidecar does not have enough retention, or if you want to have alerts or recording rules that requires global view, Thanos has just the component for that: the [Ruler](components/rule.md),
which does rule and alert evaluation on top of a given Thanos Querier.

## Blog posts

## Extras

* 2018: [Introduction blog post](https://improbable.io/games/blog/thanos-prometheus-at-scale)
* 2019: [Metric monitoring architecture](https://improbable.io/blog/thanos-architecture-at-improbable)

See this [talk](https://fosdem.org/2019/schedule/event/thanos_transforming_prometheus_to_a_global_scale_in_a_seven_simple_steps/) to see quick example of running Thanos on Kubernetes.

## Integrations

We also have example Grafana dashboards [here](/examples/grafana/monitoring.md) and some [alerts](/examples/alerts/alerts.md) to get you started.

See [Integrations page](./integrations.md)

## Testing Thanos on Single Host

diff --git a/docs/img/querier.svg b/docs/img/querier.svg
new file mode 100644
index 0000000000..f3ccf47f98
--- /dev/null
+++ b/docs/img/querier.svg
@@ -0,0 +1 @@
+
\ No newline at end of file

diff --git a/docs/integrations.md b/docs/integrations.md
index 862c949979..befbb9fe4e 100644
--- a/docs/integrations.md
+++ b/docs/integrations.md
@@ -18,4 +18,3 @@ Below you can find some public integrations with other systems through StoreAPI:
[Geras](https://github.com/G-Research/geras) is an OpenTSDB integration service which can connect your OpenTSDB cluster to Thanos. Geras exposes the Thanos Storage API, thus other Thanos components can query OpenTSDB via Geras, providing a unified query interface over OpenTSDB and Prometheus.

Although OpenTSDB is able to aggregate the data, it's not supported by Geras at the moment.
-

diff --git a/docs/quick-tutorial.md b/docs/quick-tutorial.md
new file mode 100644
index 0000000000..b7867ef0b7
--- /dev/null
+++ b/docs/quick-tutorial.md
@@ -0,0 +1,224 @@
+---
+title: Quick Tutorial
+type: docs
+menu: thanos
+weight: 1
+slug: /quick-tutorial.md
+---
+
+# Quick Tutorial
+
Feel free to check the free, in-browser interactive tutorial [as Katacoda Thanos Course](https://katacoda.com/bwplotka/courses/thanos).
We will be progressively updating our Katacoda Course with more scenarios.

On top of this, feel free to go through the tutorial presented here:

### Prometheus

Thanos is based on Prometheus. With Thanos you use more or fewer Prometheus features depending on the deployment model; however,
Prometheus always stays the integral foundation for *collecting metrics* and alerting using local data.

Thanos bases itself on vanilla [Prometheus](https://prometheus.io/) (v2.2.1+). We plan to support *all* Prometheus versions beyond this one.

NOTE: It is highly recommended to use Prometheus 2.13 (available in the next Prometheus release) due to Prometheus remote read improvements.
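If you are unsure which version a given installation runs, the binary reports it directly. A quick check, assuming `prometheus` is on your PATH:

```bash
# Print the Prometheus version and build information.
prometheus --version
```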
Always make sure to run Prometheus as recommended by the Prometheus team, so:

* Put Prometheus in the same failure domain. This means the same network and the same datacenter as the services it is monitoring.
* Use a persistent disk to persist data across Prometheus restarts.
* Use local compaction for longer retentions.
* Do not change the minimum TSDB block durations.
* Do not scale out Prometheus unless necessary. A single Prometheus is highly efficient (:

We recommend using Thanos when you need to scale out your Prometheus instance.

### Components

Following the [KISS](https://en.wikipedia.org/wiki/KISS_principle) and Unix philosophies, Thanos is made of a set of components with each filling a specific role.

* Sidecar: connects to Prometheus, reads its data for query and/or uploads it to cloud storage.
* Store Gateway: serves metrics inside of a cloud storage bucket.
* Compactor: compacts, downsamples and applies retention on the data stored in a cloud storage bucket.
* Receiver: receives data from Prometheus' remote-write WAL, exposes it and/or uploads it to cloud storage.
* Ruler/Rule: evaluates recording and alerting rules against data in Thanos for exposition and/or upload.
* Querier/Query: implements Prometheus' v1 API to aggregate data from the underlying components.

See those components on this diagram:

![architecture overview](img/arch.jpg)

### [Sidecar](components/sidecar.md)

Thanos integrates with existing Prometheus servers through a [Sidecar process](https://docs.microsoft.com/en-us/azure/architecture/patterns/sidecar#solution), which runs on the same machine or in the same pod as the Prometheus server.

The purpose of the Sidecar is to back up Prometheus data into an Object Storage bucket, and give other Thanos components access to the Prometheus metrics via a gRPC API.

The Sidecar makes use of the `reload` Prometheus endpoint. Make sure it's enabled with the flag `--web.enable-lifecycle`.

#### External storage

The following configures the sidecar to write Prometheus' data into a configured object storage:

```bash
thanos sidecar \
    --tsdb.path /var/prometheus \                     # TSDB data directory of Prometheus
    --prometheus.url "http://localhost:9090" \        # Be sure that the sidecar can use this url!
    --objstore.config-file bucket_config.yaml \       # Storage configuration for uploading data
```

The format of the YAML file depends on the provider you choose. Example configurations and an up-to-date list of storage types Thanos supports are available [here](storage.md).

Rolling this out has little to zero impact on the running Prometheus instance. It is a good start to ensure you are backing up your data while figuring out the other pieces of Thanos.

If you are not interested in backing up any data, the `--objstore.config-file` flag can simply be omitted.

* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar.yaml)_
* _[Example Kubernetes manifest with Minio upload](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar-lts.yaml)_
* _[Example Deploying sidecar using official Prometheus Helm Chart](/tutorials/kubernetes-helm/README.md)_
* _[Details & Config for other object stores](storage.md)_

#### Store API

The Sidecar component implements and exposes a gRPC _[Store API](/pkg/store/storepb/rpc.proto#L19)_. The sidecar implementation allows you to query the metric data stored in Prometheus.

Let's extend the Sidecar in the previous section to connect to a Prometheus server, and expose the Store API.
```bash
thanos sidecar \
    --tsdb.path /var/prometheus \
    --objstore.config-file bucket_config.yaml \       # Bucket config file to send data to
    --prometheus.url http://localhost:9090 \          # Location of the Prometheus HTTP server
    --http-address 0.0.0.0:19191 \                    # HTTP endpoint for collecting metrics on the Sidecar
    --grpc-address 0.0.0.0:19090                      # GRPC endpoint for StoreAPI
```

* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar.yaml)_
* _[Example Kubernetes manifest with GCS upload](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar-lts.yaml)_

#### External Labels

Prometheus allows the configuration of "external labels" of a given Prometheus instance. These are meant to globally identify the role of that instance. As Thanos aims to aggregate data across all instances, providing a consistent set of external labels becomes crucial!

Every Prometheus instance must have a globally unique set of identifying labels. For example, in Prometheus's configuration file:

```yaml
global:
  external_labels:
    region: eu-west
    monitor: infrastructure
    replica: A
```

### [Querier/Query](components/query.md)

Now that we have set up the Sidecar for one or more Prometheus instances, we want to use Thanos' global [Query Layer](components/query.md) to evaluate PromQL queries against all instances at once.

The Query component is stateless and horizontally scalable and can be deployed with any number of replicas. Once connected to the Sidecars, it automatically detects which Prometheus servers need to be contacted for a given PromQL query.

Query also implements Prometheus's official HTTP API and can thus be used with external tools such as Grafana. It also serves a derivative of Prometheus's UI for ad-hoc querying and stores status.

Below, we will set up a Query to connect to our Sidecars, and expose its HTTP UI.

```bash
thanos query \
    --http-address 0.0.0.0:19192 \                    # HTTP Endpoint for Query UI
    --store 1.2.3.4:19090 \                           # Static gRPC Store API Address for the query node to query
    --store 1.2.3.5:19090 \                           # Also repeatable
    --store dnssrv+_grpc._tcp.thanos-store.monitoring.svc  # Supports DNS A & SRV records
```

Go to the configured HTTP address; it should now show a UI similar to that of Prometheus. If the cluster formed correctly, you can now query across all Prometheus instances within the cluster. You can also check the Stores page to check up on your stores.

#### Deduplicating Data from Prometheus HA pairs

The Query component is also capable of deduplicating data collected from Prometheus HA pairs. This requires configuring Prometheus's `global.external_labels` configuration block (as mentioned in the [External Labels section](getting-started.md#external-labels)) to identify the role of a given Prometheus instance.

A typical choice is simply the label name "replica" while letting the value be whatever you wish. For example, you might set up the following in Prometheus's configuration file:

```yaml
global:
  external_labels:
    region: eu-west
    monitor: infrastructure
    replica: A
# ...
```

In a Kubernetes stateful deployment, the replica label can also be the pod name.
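For the deduplication below to have anything to merge, the second replica of the HA pair needs identical external labels except for `replica`. A minimal sketch of the second replica's configuration, assuming the pair from the example above (the file name here is a hypothetical choice):

```bash
# Write the configuration for the second replica of the HA pair:
# same region and monitor labels, only `replica` differs.
cat > prometheus-replica-b.yml <<'EOF'
global:
  external_labels:
    region: eu-west
    monitor: infrastructure
    replica: B
EOF
```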
Reload your Prometheus instances, and then, in Query, we will define `replica` as the label we want to enable deduplication to occur on:

```bash
thanos query \
    --http-address 0.0.0.0:19192 \
    --store 1.2.3.4:19090 \
    --store 1.2.3.5:19090 \
    --query.replica-label replica \                   # Replica label for de-duplication
    --query.replica-label replicaX                    # Supports multiple replica labels for de-duplication
```

Go to the configured HTTP address, and you should now be able to query across all Prometheus instances and receive de-duplicated data.

* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/thanos-querier.yaml)_

#### Communication Between Components

The only required communication between nodes is for Thanos Querier to be able to reach the gRPC StoreAPIs you provide. Thanos Querier periodically calls the Info endpoint to collect up-to-date metadata as well as to check the health of the given StoreAPI.
The metadata includes information about the time windows and external labels for each node.

There are various ways to tell the query component about the StoreAPIs it should query data from. The simplest way is to use a static list of well-known addresses to query.
These are repeatable, so you can add as many endpoints as needed. You can also give a DNS domain prefixed by `dns+` or `dnssrv+` to have Thanos Query do an `A` or `SRV` lookup to get all the required IPs to communicate with.

```bash
thanos query \
    --http-address 0.0.0.0:19192 \                    # Endpoint for Query UI
    --grpc-address 0.0.0.0:19092 \                    # gRPC endpoint for Store API
    --store 1.2.3.4:19090 \                           # Static gRPC Store API Address for the query node to query
    --store 1.2.3.5:19090 \                           # Also repeatable
    --store dns+rest.thanos.peers:19092               # Use DNS lookup for getting all registered IPs as separate StoreAPIs
```

Read more details [here](service-discovery.md).

* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar.yaml)_
* _[Example Kubernetes manifest with GCS upload](/tutorials/kubernetes-demo/manifests/prometheus-ha-sidecar-lts.yaml)_

### [Store Gateway](components/store.md)

As the sidecar backs up data into the object storage of your choice, you can decrease Prometheus retention and store less locally. However, we need a way to query all that historical data again.
The store gateway does just that by implementing the same gRPC data API as the sidecars, but backing it with data it can find in your object storage bucket.
Just like sidecars and query nodes, the store gateway exposes StoreAPI and needs to be discovered by Thanos Querier.

```bash
thanos store \
    --data-dir /var/thanos/store \                    # Disk space for local caches
    --objstore.config-file bucket_config.yaml \       # Bucket to fetch data from
    --http-address 0.0.0.0:19191 \                    # HTTP endpoint for collecting metrics on the Store Gateway
    --grpc-address 0.0.0.0:19090                      # GRPC endpoint for StoreAPI
```

The store gateway occupies small amounts of disk space for caching basic information about data in the object storage. This will rarely exceed more than a few gigabytes and is used to improve restart times. It is useful but not required to preserve it across restarts.

* _[Example Kubernetes manifest](/tutorials/kubernetes-demo/manifests/thanos-store-gateway.yaml)_

### [Compactor](components/compact.md)

A local Prometheus installation periodically compacts older data to improve query efficiency. Since the sidecar backs up data as soon as possible, we need a way to apply the same process to data in the object storage.
The compactor component simply scans the object storage and processes compaction where required. At the same time, it is responsible for creating downsampled copies of data to speed up queries.

```bash
thanos compact \
    --data-dir /var/thanos/compact \                  # Temporary workspace for data processing
    --objstore.config-file bucket_config.yaml \       # Bucket where to apply the compacting
    --http-address 0.0.0.0:19191                      # HTTP endpoint for collecting metrics on the Compactor
```

The compactor is not in the critical path of querying or data backup. It can either be run as a periodic batch job or be left running to always compact data as soon as possible. It is recommended to provide 100-300GB of local disk space for data processing.

_NOTE: The compactor must be run as a **singleton** and must not run when manually modifying data in the bucket._

### [Ruler/Rule](components/rule.md)

In case Prometheus with the Thanos sidecar does not have enough retention, or if you want alerts or recording rules that require a global view, Thanos has just the component for that: the [Ruler](components/rule.md),
which does rule and alert evaluation on top of a given Thanos Querier.

diff --git a/docs/release-process.md b/docs/release-process.md
index 2a13c549b5..7f9e33ab92 100644
--- a/docs/release-process.md
+++ b/docs/release-process.md
@@ -28,10 +28,10 @@ Release shepherd responsibilities:

* Perform releases (from first RC to actual release).
* Announce all releases on all communication channels.

-
| Release   | Time of first RC         | Shepherd (Github handle) |
|-----------|--------------------------|--------------------------|
-| v0.8.0    | (planned) 9.10.2019      | TBC                      |
+| v0.9.0    | (planned) 20.11.2019     | TBC                      |
+| v0.8.0    | (planned) 9.10.2019      | `@bwplotka`              |
| v0.7.0    | (planned) 28.08.2019     | `@domgreen`              |
| v0.6.0    | (planned) 12.07.2019     | `@GiedriusS`             |
| v0.5.0    | 31.06.2019               | `@bwplotka`              |

diff --git a/website/layouts/index.html b/website/layouts/index.html
index 5dccddc412..ffb4d7f183 100644
--- a/website/layouts/index.html
+++ b/website/layouts/index.html
@@ -7,7 +7,7 @@
-Highly available Prometheus setup with long term storage capabilities.
+Open source, highly available Prometheus setup with long term storage capabilities.
diff --git a/website/static/cloud-native-computing.svg b/website/static/cloud-native-computing.svg
new file mode 100644
index 0000000000..c04343849e
--- /dev/null
+++ b/website/static/cloud-native-computing.svg
@@ -0,0 +1,79 @@