From d41032644e46096516e882d4ef8a08d6eac5025d Mon Sep 17 00:00:00 2001 From: Leonhard Markert Date: Thu, 11 Nov 2021 13:42:12 +0000 Subject: [PATCH 1/4] Link to Prometheus "main" branch (#4854) ... instead of its previous name, "master". Signed-off-by: curiousleo --- docs/design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design.md b/docs/design.md index e7fc9e088b..1b6ba731fa 100644 --- a/docs/design.md +++ b/docs/design.md @@ -2,7 +2,7 @@ Thanos is a set of components that can be composed into a highly available Prometheus setup with long term storage capabilities. Its main goals are operation simplicity and retaining of Prometheus's reliability properties. -The Prometheus metric data model and the 2.0 storage format ([spec](https://github.com/prometheus/prometheus/tree/master/tsdb/docs/format), [slides](https://www.slideshare.net/FabianReinartz/storing-16-bytes-at-scale-81282712)) are the foundational layers of all components in the system. +The Prometheus metric data model and the 2.0 storage format ([spec](https://github.com/prometheus/prometheus/tree/main/tsdb/docs/format), [slides](https://www.slideshare.net/FabianReinartz/storing-16-bytes-at-scale-81282712)) are the foundational layers of all components in the system. 
## Architecture From aee7c97b8b66b7d203d43b60a877ab8dee9f2c4d Mon Sep 17 00:00:00 2001 From: Manuel Hutter Date: Thu, 11 Nov 2021 14:43:58 +0100 Subject: [PATCH 2/4] docs: Proof-read compactor docs (#4849) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Proof-read compactor docs Signed-off-by: Manuel Hutter * Apply suggestions Co-authored-by: Giedrius Statkevičius Signed-off-by: Manuel Hutter Co-authored-by: Giedrius Statkevičius --- docs/components/compact.md | 66 +++++++++++++++++++------------------- 1 file changed, 33 insertions(+), 33 deletions(-) diff --git a/docs/components/compact.md b/docs/components/compact.md index ee3f9a14a1..af39d1bf5a 100644 --- a/docs/components/compact.md +++ b/docs/components/compact.md @@ -21,23 +21,23 @@ config: bucket: example-bucket ``` -By default `thanos compact` will run to completion which makes it possible to execute in a cronjob. Using the arguments `--wait` and `--wait-interval=5m` it's possible to keep it running. +By default, `thanos compact` will run to completion, which makes it possible to execute it as a cronjob. Using the arguments `--wait` and `--wait-interval=5m` it's possible to keep it running. -**Compactor, Sidecar, Receive and Ruler are the only Thanos component which should have a write access to object storage, with only Compactor being able to delete data.** +**Compactor, Sidecar, Receive and Ruler are the only Thanos components which should have write access to object storage, with only Compactor being able to delete data.** -> **NOTE:** High availability for Compactor is generally not required. See [Availability](#availability) section. +> **NOTE:** High availability for Compactor is generally not required. See the [Availability](#availability) section. ## Compaction The Compactor, among other things, is responsible for compacting multiple blocks into one. -Why even compacting? 
This is a process, also done in Prometheus, to reduce number of blocks and compact index indices. We can compact index quite well in most cases, because series usually live longer than the duration of the smallest block, so 2 hours. +Why even compacting? This is a process, also done by Prometheus, to reduce the number of blocks and compact indices. We can compact an index quite well in most cases, because series usually live longer than the duration of the smallest blocks (2 hours). ### Compaction Groups / Block Streams -Usually those blocks come through the same source. We call blocks from a single source, a "stream" of blocks or compaction group. We distinguish streams by `external labels`. Blocks with the same labels are considered as produced by a single source. +Usually those blocks come from the same source. We call blocks from a single source a "stream" of blocks or "compaction group". We distinguish streams by **external labels**. Blocks with the same labels are considered as produced by the same source. -This is because `external_labels` are added by the Prometheus which produced the block. +This is because `external_labels` are added by the Prometheus instance which produced the block. ⚠ This is why those labels on block must be both *unique* and *persistent* across different Prometheus instances. ⚠ 
+> **NOTE:** In default mode the state of two or more blocks having the same external labels and overlapping in time is assumed as an unhealthy situation. Refer to [Overlap Issue Troubleshooting](../operating/troubleshooting.md#overlaps) for more info. This results in compactor [halting](#halting). -#### Warning: Only one Instance has to run against single stream of blocks in single Object Storage. +#### Warning: Only one instance of Compactor may run against a single stream of blocks in a single object storage. :warning: :warning: :warning: -Because there is no safe locking mechanism for all object storage provides, currently, you need to ensure on your own that only single Compactor is running against single stream of blocks on single bucket. Running more can result with [Overlap Issues](../operating/troubleshooting.md#overlaps) that has to be resolved manually. +Because not all object storage providers implement a safe locking mechanism, you need to ensure on your own that only a single Compactor is running against a single stream of blocks on a single bucket. Running more than one Compactor may result in [Overlap Issues](../operating/troubleshooting.md#overlaps) which have to be resolved manually. -This rule, means also that there could be a problem when both compacted and non compacted blocks are being uploaded by sidecar. This is why "upload compacted" flag is still under a separate `--shipper.upload-compacted` flag that helps to ensure that compacted blocks are uploaded before anything else. The singleton rule is also why local Prometheus compaction has to be disabled in order to use sidecar with upload option. Use hidden `--shipper.ignore-unequal-block-size` to override this check (on your own risk). +This rule also means that there could be a problem when both compacted and non-compacted blocks are being uploaded by a sidecar. 
This is why the "upload compacted" function still lives under a separate `--shipper.upload-compacted` flag that helps to ensure that compacted blocks are uploaded before anything else. The singleton rule is also why local Prometheus compaction has to be disabled in order to use Thanos Sidecar with the upload option. Use - at your own risk! - the hidden `--shipper.ignore-unequal-block-size` flag to disable this check. -> **NOTE:** In further Thanos version it's possible that both restrictions will be removed with production status of [vertical compaction](#vertical-compactions) which is worked on. +> **NOTE:** In future versions of Thanos it's possible that both restrictions will be removed once [vertical compaction](#vertical-compactions) reaches production status. -You can though run multiple Compactors against single Bucket as long as for separate streams of blocks. You can do it in order to [scale compaction process](#scalability). +You can, though, run multiple Compactors against a single Bucket as long as each instance compacts a separate stream of blocks. You can do this in order to [scale the compaction process](#scalability). ### Vertical Compactions -Thanos and Prometheus supports vertical compaction, so process of compacting multiple streams of blocks into one. +Thanos and Prometheus support vertical compaction, the process of compacting multiple streams of blocks into one. -In Prometheus, this can be triggered by setting hidden flag in Prometheus and putting additional TSDB blocks within Prometheus local directory. Extra blocks can overlap with existing ones. When Prometheus detects that situation it performs `vertical compaction` which compacts overlapping blocks into single one. This is mainly used for **backfilling** purposes. +In Prometheus, this can be triggered by setting a hidden flag in Prometheus and putting additional TSDB blocks in Prometheus' local data directory. Extra blocks can overlap with existing ones. 
When Prometheus detects this situation, it performs `vertical compaction` which compacts overlapping blocks into a single one. This is mainly used for **backfilling**. -In Thanos, it works similarly, but on bigger scale and using external labels for grouping as explained in [Compaction section](#compaction). +In Thanos, this works similarly, but on a bigger scale and using external labels for grouping as explained in the ["Compaction" section](#compaction). -In both systems, series with the same labels are merged together. In prometheus, merging samples is **naive**. It works by deduplicating samples within exactly the same timestamps. Otherwise samples are added in sorted by time order. Thanos also support a new penalty based samples merger and it is explained in [Deduplication](#vertical-compaction-use-cases). +In both systems, series with the same labels are merged together. In Prometheus, merging samples is **naive**. It works by deduplicating samples within exactly the same timestamps. Otherwise samples are merged and sorted by timestamp. Thanos also supports a new penalty based samples merging strategy, which is explained in [Deduplication](#vertical-compaction-use-cases). -> **NOTE:** Both Prometheus and Thanos default behaviour is to fail compaction if any overlapping blocks are spotted. (For Thanos, within the same external labels). +> **NOTE:** Both Prometheus' and Thanos' default behaviour is to fail compaction if any overlapping blocks are spotted. (For Thanos, with the same external labels). #### Vertical Compaction Use Cases -There can be few valid use cases for vertical compaction: +The following are valid use cases for vertical compaction: -* Races between multiple compactions, for example multiple compactors or between compactor and Prometheus compactions. While this will have extra computation overhead for Compactor it's safe to enable vertical compaction for this case. -* Backfilling. 
If you want to add blocks of data to any stream where there is existing data already there for the time range, you will need enabled vertical compaction. -* Offline deduplication of series. It's very common to have the same data replicated into multiple streams. We can distinguish two common series deduplications, `one-to-one` and `penalty`: - * `one-to-one` deduplication is when same series (series with the same labels from different blocks) for the same range have **exactly** the same samples: Same values and timestamps. This is very common while using [Receivers](receive.md) with replication greater than 1 as receiver replication copies exactly the same timestamps and values to different receive instances. - * `penalty` deduplication is when same series data is **logically duplicated**. For example, it comes from the same application, but scraped by two different Prometheus-es. Ideally this requires more complex deduplication algorithms. For example one that is used to [deduplicate on the fly on the Querier](query.md#run-time-deduplication-of-ha-groups). This is common case when Prometheus HA replicas are used. You can enable this deduplication via `--deduplication.func=penalty` flag. +* **Races** between multiple compactions, for example multiple Thanos compactors or between Thanos and Prometheus compactions. While this will cause extra computational overhead for Compactor it's safe to enable vertical compaction for this case. +* **Backfilling**. If you want to add blocks of data to any stream where there already is existing data for some time range, you will need to enable vertical compaction. +* **Offline deduplication** of series. It's very common to have the same data replicated into multiple streams. 
We can distinguish two common deduplication strategies, `one-to-one` and `penalty`: + * `one-to-one` deduplication is when multiple series (with the same labels) from different blocks for the same time range have **exactly** the same samples: Same values and timestamps. This is very common when using [Receivers](receive.md) with replication greater than 1 as receiver replication copies samples exactly (same timestamps and values) to different receive instances. + * `penalty` deduplication is when the same data is **duplicated logically**, i.e. the same application is scraped by two different Prometheis. This usually requires more complex deduplication algorithms. For example, one that is used to [deduplicate on the fly on the Querier](query.md#run-time-deduplication-of-ha-groups). This is a common case when Prometheus HA replicas are used. You can enable this deduplication strategy via the `--deduplication.func=penalty` flag. #### Vertical Compaction Risks The main risk is the **irreversible** implications of potential configuration errors: -* If you accidentally upload block with the same external labels but produced by totally different Prometheus for totally different applications, some metrics can overlap and potentially can merge together making such series useless. +* If you accidentally upload blocks with the same external labels but produced by totally different Prometheis for totally different applications, some metrics can overlap and potentially merge together, making the series useless. * If you merge disjoint series in multiple of blocks together, there is currently no easy way to split them back. 
If you'd like to enable this deduplication algorithm, please take the risk and make sure you back up your data. +* The `penalty` offline deduplication algorithm has its own limitations. Even though it has been battle-tested for quite a long time, a few issues still come up from time to time (such as [breaking rate/irate](https://github.com/thanos-io/thanos/issues/2890)). If you'd like to enable this deduplication algorithm, do so at your own risk and back up your data first! #### Enabling Vertical Compaction -NOTE: See [risks](#vertical-compaction-risks) section to understand the implications and experimental nature of this feature. +**NOTE:** See the ["risks" section](#vertical-compaction-risks) to understand the implications and experimental nature of this feature. -You can enable vertical compaction using a hidden flag `--compact.enable-vertical-compaction` +You can enable vertical compaction using the hidden flag `--compact.enable-vertical-compaction`. -If you want to "virtually" group blocks differently for deduplication use case, use hidden flag `deduplication.replica-label` to set one or many flags to be ignored during block loading. +If you want to "virtually" group blocks differently for the deduplication use case, use `--deduplication.replica-label=LABEL` to set one or more labels to be ignored during block loading. 
For example if you have following set of block streams: @@ -107,7 +107,7 @@ external_labels: {cluster="us1", replica="1", receive="true", environment="produ external_labels: {cluster="us1", replica="1", receive="true", environment="staging"} ``` -and set `--deduplication.replica-label="replica"`, compactor will assume those as: +and set `--deduplication.replica-label="replica"`, Compactor will assume those as: ``` external_labels: {cluster="eu1", receive="true", environment="production"} (2 streams, resulted in one) @@ -115,15 +115,15 @@ external_labels: {cluster="us1", receive="true", environment="production"} external_labels: {cluster="us1", receive="true", environment="staging"} ``` -On next compaction multiple streams' blocks will be compacted into one. +On the next compaction, multiple streams' blocks will be compacted into one. -If you need a different deduplication algorithm, use `deduplication.func` flag. The default value is the original `one-to-one` deduplication. +If you need a different deduplication algorithm, use the `--deduplication.func=FUNC` flag. The default value is the original `one-to-one` deduplication. ## Enforcing Retention of Data -By default, there is NO retention set for object storage data. This means that you store data for unlimited time, which is a valid and recommended way of running Thanos. +By default, there is NO retention set for object storage data. This means that you store data forever, which is a valid and recommended way of running Thanos. -You can set retention by different resolutions using `--retention.resolution-raw` `--retention.resolution-5m` and `--retention.resolution-1h` flag. Not setting them or setting to `0s` means no retention. +You can configure retention by using the `--retention.resolution-raw`, `--retention.resolution-5m` and `--retention.resolution-1h` flags. Not setting them or setting to `0s` means no retention. **NOTE:** ⚠ ️Retention is applied right after Compaction and Downsampling loops. 
If those are failing, data will be never deleted. From aa7e9f33aaf0ff63963c4b17bc0ce67db8f97a38 Mon Sep 17 00:00:00 2001 From: Aditi Ahuja <48997495+metonymic-smokey@users.noreply.github.com> Date: Fri, 12 Nov 2021 10:39:38 +0530 Subject: [PATCH 3/4] Compactor: Add metrics for retention progress (#4848) --- CHANGELOG.md | 1 + cmd/thanos/compact.go | 12 ++- pkg/compact/compact.go | 49 ++++++++++++ pkg/compact/compact_test.go | 152 ++++++++++++++++++++++++++++++++++-- 4 files changed, 208 insertions(+), 6 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index d7dacbd316..35869f0266 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -24,6 +24,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re - [#4576](https://github.com/thanos-io/thanos/pull/4576) UI: add filter compaction level to the Block UI. - [#4731](https://github.com/thanos-io/thanos/pull/4731) Rule: add stateless mode to ruler according to https://thanos.io/tip/proposals-accepted/202005-scalable-rule-storage.md/. Continue https://github.com/thanos-io/thanos/pull/4250. - [#4612](https://github.com/thanos-io/thanos/pull/4612) Sidecar: add `--prometheus.http-client` and `--prometheus.http-client-file` flag for sidecar to connect Prometheus with basic auth or TLS. +- [#4848](https://github.com/thanos-io/thanos/pull/4848) Compactor: added Prometheus metric for tracking the progress of retention. 
### Fixed diff --git a/cmd/thanos/compact.go b/cmd/thanos/compact.go index 111af4912e..2ca0331938 100644 --- a/cmd/thanos/compact.go +++ b/cmd/thanos/compact.go @@ -462,6 +462,7 @@ func runCompact( if conf.compactionProgressMetrics { g.Add(func() error { ps := compact.NewCompactionProgressCalculator(reg, tsdbPlanner) + rs := compact.NewRetentionProgressCalculator(reg, retentionByResolution) var ds *compact.DownsampleProgressCalculator if !conf.disableDownsampling { ds = compact.NewDownsampleProgressCalculator(reg) @@ -476,13 +477,22 @@ func runCompact( metas := sy.Metas() groups, err := grouper.Groups(metas) if err != nil { - return errors.Wrapf(err, "could not group metadata") + return errors.Wrapf(err, "could not group metadata for compaction") } if err = ps.ProgressCalculate(ctx, groups); err != nil { return errors.Wrapf(err, "could not calculate compaction progress") } + retGroups, err := grouper.Groups(metas) + if err != nil { + return errors.Wrapf(err, "could not group metadata for retention") + } + + if err = rs.ProgressCalculate(ctx, retGroups); err != nil { + return errors.Wrapf(err, "could not calculate retention progress") + } + if !conf.disableDownsampling { groups, err = grouper.Groups(metas) if err != nil { diff --git a/pkg/compact/compact.go b/pkg/compact/compact.go index 0beb8c2b4e..aebe2bf050 100644 --- a/pkg/compact/compact.go +++ b/pkg/compact/compact.go @@ -667,6 +667,55 @@ func (ds *DownsampleProgressCalculator) ProgressCalculate(ctx context.Context, g return nil } +// RetentionProgressMetrics contains Prometheus metrics related to retention progress. +type RetentionProgressMetrics struct { + NumberOfBlocksToDelete *prometheus.GaugeVec +} + +// RetentionProgressCalculator contains RetentionProgressMetrics, which are updated during the retention simulation process. 
+type RetentionProgressCalculator struct { + *RetentionProgressMetrics + retentionByResolution map[ResolutionLevel]time.Duration +} + +// NewRetentionProgressCalculator creates a new RetentionProgressCalculator. +func NewRetentionProgressCalculator(reg prometheus.Registerer, retentionByResolution map[ResolutionLevel]time.Duration) *RetentionProgressCalculator { + return &RetentionProgressCalculator{ + retentionByResolution: retentionByResolution, + RetentionProgressMetrics: &RetentionProgressMetrics{ + NumberOfBlocksToDelete: promauto.With(reg).NewGaugeVec(prometheus.GaugeOpts{ + Name: "thanos_compact_todo_deletion_blocks", + Help: "number of blocks that have crossed their retention period", + }, []string{"group"}), + }, + } +} + +// ProgressCalculate calculates the number of blocks to be deleted for the given groups. +func (rs *RetentionProgressCalculator) ProgressCalculate(ctx context.Context, groups []*Group) error { + groupBlocks := make(map[string]int, len(groups)) + + for _, group := range groups { + for _, m := range group.metasByMinTime { + retentionDuration := rs.retentionByResolution[ResolutionLevel(m.Thanos.Downsample.Resolution)] + if retentionDuration.Seconds() == 0 { + continue + } + maxTime := time.Unix(m.MaxTime/1000, 0) + if time.Now().After(maxTime.Add(retentionDuration)) { + groupBlocks[group.key]++ + } + } + } + + rs.RetentionProgressMetrics.NumberOfBlocksToDelete.Reset() + for key, blocks := range groupBlocks { + rs.RetentionProgressMetrics.NumberOfBlocksToDelete.WithLabelValues(key).Add(float64(blocks)) + } + + return nil +} + // Planner returns blocks to compact. type Planner interface { // Plan returns a list of blocks that should be compacted into single one. 
diff --git a/pkg/compact/compact_test.go b/pkg/compact/compact_test.go index 8119a66429..03edf34d7b 100644 --- a/pkg/compact/compact_test.go +++ b/pkg/compact/compact_test.go @@ -26,7 +26,6 @@ import ( "github.com/thanos-io/thanos/pkg/errutil" "github.com/thanos-io/thanos/pkg/extprom" "github.com/thanos-io/thanos/pkg/objstore" - "github.com/thanos-io/thanos/pkg/receive" "github.com/thanos-io/thanos/pkg/testutil" ) @@ -205,6 +204,149 @@ func createBlockMeta(id uint64, minTime, maxTime int64, labels map[string]string return m } +func TestRetentionProgressCalculate(t *testing.T) { + logger := log.NewNopLogger() + reg := prometheus.NewRegistry() + + var bkt objstore.Bucket + temp := promauto.With(reg).NewCounter(prometheus.CounterOpts{Name: "test_metric_for_group", Help: "this is a test metric for compact progress tests"}) + grouper := NewDefaultGrouper(logger, bkt, false, false, reg, temp, temp, temp, "") + + type groupedResult map[string]float64 + + type retInput struct { + meta []*metadata.Meta + resMap map[ResolutionLevel]time.Duration + } + + keys := make([]string, 3) + m := make([]metadata.Meta, 3) + m[0].Thanos.Labels = map[string]string{"a": "1"} + m[0].Thanos.Downsample.Resolution = downsample.ResLevel0 + m[1].Thanos.Labels = map[string]string{"b": "2"} + m[1].Thanos.Downsample.Resolution = downsample.ResLevel1 + m[2].Thanos.Labels = map[string]string{"a": "1", "b": "2"} + m[2].Thanos.Downsample.Resolution = downsample.ResLevel2 + for ind, meta := range m { + keys[ind] = DefaultGroupKey(meta.Thanos) + } + + ps := NewRetentionProgressCalculator(reg, nil) + + for _, tcase := range []struct { + testName string + input retInput + expected groupedResult + }{ + { + // In this test case, blocks belonging to multiple groups are tested. All blocks in the first group and the first block in the second group are beyond their retention period. 
In the second group, the second block still has some time before its retention period and hence, is not marked to be deleted. + testName: "multi_group_test", + input: retInput{ + meta: []*metadata.Meta{ + createBlockMeta(6, 1, int64(time.Now().Add(-6*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1"}, downsample.ResLevel0, []uint64{}), + createBlockMeta(9, 1, int64(time.Now().Add(-9*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1"}, downsample.ResLevel0, []uint64{}), + createBlockMeta(7, 1, int64(time.Now().Add(-4*30*24*time.Hour).Unix()*1000), map[string]string{"b": "2"}, downsample.ResLevel1, []uint64{}), + createBlockMeta(8, 1, int64(time.Now().Add(-1*30*24*time.Hour).Unix()*1000), map[string]string{"b": "2"}, downsample.ResLevel1, []uint64{}), + createBlockMeta(10, 1, int64(time.Now().Add(-4*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1", "b": "2"}, downsample.ResLevel2, []uint64{}), + }, + resMap: map[ResolutionLevel]time.Duration{ + ResolutionLevel(downsample.ResLevel0): 5 * 30 * 24 * time.Hour, // 5 months retention. + ResolutionLevel(downsample.ResLevel1): 3 * 30 * 24 * time.Hour, // 3 months retention. + ResolutionLevel(downsample.ResLevel2): 6 * 30 * 24 * time.Hour, // 6 months retention. + }, + }, + expected: groupedResult{ + keys[0]: 2.0, + keys[1]: 1.0, + keys[2]: 0.0, + }, + }, { + // In this test case, all the blocks are retained since they have not yet crossed their retention period. 
+ testName: "retain_test", + input: retInput{ + meta: []*metadata.Meta{ + createBlockMeta(6, 1, int64(time.Now().Add(-6*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1"}, downsample.ResLevel0, []uint64{}), + createBlockMeta(7, 1, int64(time.Now().Add(-4*30*24*time.Hour).Unix()*1000), map[string]string{"b": "2"}, downsample.ResLevel1, []uint64{}), + createBlockMeta(8, 1, int64(time.Now().Add(-7*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1", "b": "2"}, downsample.ResLevel2, []uint64{}), + }, + resMap: map[ResolutionLevel]time.Duration{ + ResolutionLevel(downsample.ResLevel0): 10 * 30 * 24 * time.Hour, // 10 months retention. + ResolutionLevel(downsample.ResLevel1): 12 * 30 * 24 * time.Hour, // 12 months retention. + ResolutionLevel(downsample.ResLevel2): 16 * 30 * 24 * time.Hour, // 16 months retention. + }, + }, + expected: groupedResult{ + keys[0]: 0, + keys[1]: 0, + keys[2]: 0, + }, + }, + { + // In this test case, all the blocks are deleted since they are past their retention period. + testName: "delete_test", + input: retInput{ + meta: []*metadata.Meta{ + createBlockMeta(6, 1, int64(time.Now().Add(-6*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1"}, downsample.ResLevel0, []uint64{}), + createBlockMeta(7, 1, int64(time.Now().Add(-4*30*24*time.Hour).Unix()*1000), map[string]string{"b": "2"}, downsample.ResLevel1, []uint64{}), + createBlockMeta(8, 1, int64(time.Now().Add(-7*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1", "b": "2"}, downsample.ResLevel2, []uint64{}), + }, + resMap: map[ResolutionLevel]time.Duration{ + ResolutionLevel(downsample.ResLevel0): 3 * 30 * 24 * time.Hour, // 3 months retention. + ResolutionLevel(downsample.ResLevel1): 1 * 30 * 24 * time.Hour, // 1 month retention. + ResolutionLevel(downsample.ResLevel2): 6 * 30 * 24 * time.Hour, // 6 months retention. 
+ }, + }, + expected: groupedResult{ + keys[0]: 1, + keys[1]: 1, + keys[2]: 1, + }, + }, + { + // In this test case, none of the blocks are marked for deletion since the retention period is 0d i.e. indefinitely long retention. + testName: "zero_day_test", + input: retInput{ + meta: []*metadata.Meta{ + createBlockMeta(6, 1, int64(time.Now().Add(-6*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1"}, downsample.ResLevel0, []uint64{}), + createBlockMeta(7, 1, int64(time.Now().Add(-4*30*24*time.Hour).Unix()*1000), map[string]string{"b": "2"}, downsample.ResLevel1, []uint64{}), + createBlockMeta(8, 1, int64(time.Now().Add(-7*30*24*time.Hour).Unix()*1000), map[string]string{"a": "1", "b": "2"}, downsample.ResLevel2, []uint64{}), + }, + resMap: map[ResolutionLevel]time.Duration{ + ResolutionLevel(downsample.ResLevel0): 0, + ResolutionLevel(downsample.ResLevel1): 0, + ResolutionLevel(downsample.ResLevel2): 0, + }, + }, + expected: groupedResult{ + keys[0]: 0, + keys[1]: 0, + keys[2]: 0, + }, + }, + } { + if ok := t.Run(tcase.testName, func(t *testing.T) { + blocks := make(map[ulid.ULID]*metadata.Meta, len(tcase.input.meta)) + for _, meta := range tcase.input.meta { + blocks[meta.ULID] = meta + } + groups, err := grouper.Groups(blocks) + testutil.Ok(t, err) + ps.retentionByResolution = tcase.input.resMap + err = ps.ProgressCalculate(context.Background(), groups) + testutil.Ok(t, err) + metrics := ps.RetentionProgressMetrics + testutil.Ok(t, err) + for key := range tcase.expected { + a, err := metrics.NumberOfBlocksToDelete.GetMetricWithLabelValues(key) + testutil.Ok(t, err) + testutil.Equals(t, tcase.expected[key], promtestutil.ToFloat64(a)) + } + }); !ok { + return + } + } +} + func TestCompactProgressCalculate(t *testing.T) { type planResult struct { compactionBlocks, compactionRuns float64 @@ -213,7 +355,6 @@ func TestCompactProgressCalculate(t *testing.T) { logger := log.NewNopLogger() reg := prometheus.NewRegistry() - unRegisterer := 
&receive.UnRegisterer{Registerer: reg} planner := NewTSDBBasedPlanner(logger, []int64{ int64(1 * time.Hour / time.Millisecond), int64(2 * time.Hour / time.Millisecond), @@ -231,6 +372,8 @@ func TestCompactProgressCalculate(t *testing.T) { keys[ind] = DefaultGroupKey(meta.Thanos) } + ps := NewCompactionProgressCalculator(reg, planner) + var bkt objstore.Bucket temp := promauto.With(reg).NewCounter(prometheus.CounterOpts{Name: "test_metric_for_group", Help: "this is a test metric for compact progress tests"}) grouper := NewDefaultGrouper(logger, bkt, false, false, reg, temp, temp, temp, "") @@ -316,7 +459,6 @@ func TestCompactProgressCalculate(t *testing.T) { } groups, err := grouper.Groups(blocks) testutil.Ok(t, err) - ps := NewCompactionProgressCalculator(unRegisterer, planner) err = ps.ProgressCalculate(context.Background(), groups) testutil.Ok(t, err) metrics := ps.CompactProgressMetrics @@ -337,7 +479,6 @@ func TestCompactProgressCalculate(t *testing.T) { func TestDownsampleProgressCalculate(t *testing.T) { reg := prometheus.NewRegistry() - unRegisterer := &receive.UnRegisterer{Registerer: reg} logger := log.NewNopLogger() type groupedResult map[string]float64 @@ -353,6 +494,8 @@ func TestDownsampleProgressCalculate(t *testing.T) { keys[ind] = DefaultGroupKey(meta.Thanos) } + ds := NewDownsampleProgressCalculator(reg) + var bkt objstore.Bucket temp := promauto.With(reg).NewCounter(prometheus.CounterOpts{Name: "test_metric_for_group", Help: "this is a test metric for downsample progress tests"}) grouper := NewDefaultGrouper(logger, bkt, false, false, reg, temp, temp, temp, "") @@ -438,7 +581,6 @@ func TestDownsampleProgressCalculate(t *testing.T) { groups, err := grouper.Groups(blocks) testutil.Ok(t, err) - ds := NewDownsampleProgressCalculator(unRegisterer) err = ds.ProgressCalculate(context.Background(), groups) testutil.Ok(t, err) metrics := ds.DownsampleProgressMetrics From b1c7483ea9be11a81e2c7422f19b811e81b98859 Mon Sep 17 00:00:00 2001 From: 
=?UTF-8?q?J=C3=A9ssica=20Lins?= Date: Fri, 12 Nov 2021 10:47:38 -0300 Subject: [PATCH 4/4] Mixin: Add Query Frontend grafana dashboard (#4856) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Add queryFrontend selector, start dashboard Signed-off-by: Jéssica Lins * Fix naming, generate query_frontend.json Signed-off-by: Jéssica Lins * Add resources row Signed-off-by: Jéssica Lins * Add cache row, fe queries panel Signed-off-by: Jéssica Lins * Update mixin README Signed-off-by: Jéssica Lins * Solve conflicts Signed-off-by: Jéssica Lins * Change to queryFrontend instead of query_frontend Signed-off-by: Jéssica Lins --- CHANGELOG.md | 2 +- examples/dashboards/dashboards.md | 1 + examples/dashboards/queryFrontend.json | 1117 +++++++++++++++++++++ mixin/README.md | 4 + mixin/config.libsonnet | 4 + mixin/dashboards/dashboards.libsonnet | 1 + mixin/dashboards/query_frontend.libsonnet | 82 ++ 7 files changed, 1210 insertions(+), 1 deletion(-) create mode 100644 examples/dashboards/queryFrontend.json create mode 100644 mixin/dashboards/query_frontend.libsonnet diff --git a/CHANGELOG.md b/CHANGELOG.md index 35869f0266..a0556f5078 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -24,7 +24,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re - [#4576](https://github.com/thanos-io/thanos/pull/4576) UI: add filter compaction level to the Block UI. - [#4731](https://github.com/thanos-io/thanos/pull/4731) Rule: add stateless mode to ruler according to https://thanos.io/tip/proposals-accepted/202005-scalable-rule-storage.md/. Continue https://github.com/thanos-io/thanos/pull/4250. - [#4612](https://github.com/thanos-io/thanos/pull/4612) Sidecar: add `--prometheus.http-client` and `--prometheus.http-client-file` flag for sidecar to connect Prometheus with basic auth or TLS. 
-- [#4848](https://github.com/thanos-io/thanos/pull/4848) Compactor: added Prometheus metric for tracking the progress of retention. +- [#4856](https://github.com/thanos-io/thanos/pull/4856) Mixin: Add Query Frontend Grafana dashboard. ### Fixed diff --git a/examples/dashboards/dashboards.md b/examples/dashboards/dashboards.md index 4fae5a1a1c..4cff60678d 100644 --- a/examples/dashboards/dashboards.md +++ b/examples/dashboards/dashboards.md @@ -5,6 +5,7 @@ There exists Grafana dashboards for each component (not all of them complete) ta - [Thanos Overview](overview.json) - [Thanos Compact](compact.json) - [Thanos Querier](query.json) +- [Thanos Query Frontend](queryFrontend.json) - [Thanos Store](store.json) - [Thanos Receiver](receive.json) - [Thanos Sidecar](sidecar.json) diff --git a/examples/dashboards/queryFrontend.json b/examples/dashboards/queryFrontend.json new file mode 100644 index 0000000000..07b4b33346 --- /dev/null +++ b/examples/dashboards/queryFrontend.json @@ -0,0 +1,1117 @@ +{ + "annotations": { + "list": [ ] + }, + "editable": true, + "gnetId": null, + "graphTooltip": 0, + "hideControls": false, + "links": [ ], + "refresh": "10s", + "rows": [ + { + "collapse": false, + "height": "250px", + "panels": [ + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "description": "Shows rate of requests against Query Frontend for the given time.", + "fill": 10, + "id": 1, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 0, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ + { + "alias": "/1../", + "color": "#EAB839" + }, + { + "alias": "/2../", + "color": "#37872D" + }, + { + "alias": "/3../", + "color": "#E0B400" + }, + { + "alias": "/4../", + "color": "#1F60C4" + }, 
+ { + "alias": "/5../", + "color": "#C4162A" + } + ], + "spaceLength": 10, + "span": 3, + "stack": true, + "steppedLine": false, + "targets": [ + { + "expr": "sum by (job, handler, code) (rate(http_requests_total{job=~\"$job\", handler=\"query-frontend\"}[$interval]))", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "{{job}} {{handler}} {{code}}", + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Rate of requests", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + }, + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "description": "Shows rate of queries passing through Query Frontend", + "fill": 10, + "id": 2, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 0, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ + { + "alias": "/1../", + "color": "#EAB839" + }, + { + "alias": "/2../", + "color": "#37872D" + }, + { + "alias": "/3../", + "color": "#E0B400" + }, + { + "alias": "/4../", + "color": "#1F60C4" + }, + { + "alias": "/5../", + "color": "#C4162A" + } + ], + "spaceLength": 10, + "span": 3, + "stack": true, + "steppedLine": false, + "targets": [ + { + "expr": "sum by (job, handler, code) (rate(thanos_query_frontend_queries_total{job=~\"$job\", op=\"query_range\"}[$interval]))", + "format": "time_series", + "intervalFactor": 2, + 
"legendFormat": "{{job}} {{handler}} {{code}}", + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Rate of queries", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + }, + { + "aliasColors": { + "error": "#E24D42" + }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "description": "Shows ratio of errors compared to the total number of handled requests against Query Frontend.", + "fill": 10, + "id": 3, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 0, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ ], + "spaceLength": 10, + "span": 3, + "stack": true, + "steppedLine": false, + "targets": [ + { + "expr": "sum by (job) (rate(http_requests_total{job=~\"$job\", handler=\"query-frontend\",code=~\"5..\"}[$interval])) / sum by (job) (rate(http_requests_total{job=~\"$job\", handler=\"query-frontend\"}[$interval]))", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "error", + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Errors", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "percentunit", + "label": null, + "logBase": 1, + "max":
null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + }, + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "description": "Shows how long it has taken to handle requests, in quantiles.", + "fill": 1, + "id": 4, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ + { + "alias": "p99", + "color": "#FA6400", + "fill": 1, + "fillGradient": 1 + }, + { + "alias": "p90", + "color": "#E0B400", + "fill": 1, + "fillGradient": 1 + }, + { + "alias": "p50", + "color": "#37872D", + "fill": 10, + "fillGradient": 0 + } + ], + "spaceLength": 10, + "span": 3, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "histogram_quantile(0.50, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\"$job\", handler=\"query-frontend\"}[$interval]))) * 1", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "p50 {{job}}", + "logBase": 10, + "max": null, + "min": null, + "step": 10 + }, + { + "expr": "histogram_quantile(0.90, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\"$job\", handler=\"query-frontend\"}[$interval]))) * 1", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "p90 {{job}}", + "logBase": 10, + "max": null, + "min": null, + "step": 10 + }, + { + "expr": "histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~\"$job\", handler=\"query-frontend\"}[$interval]))) * 1", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "p99 {{job}}", + "logBase": 10, + "max": null, + "min": null, + "step": 10 + } + ], + "thresholds":
[ ], + "timeFrom": null, + "timeShift": null, + "title": "Duration", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "s", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + } + ], + "repeat": null, + "repeatIteration": null, + "repeatRowId": null, + "showTitle": true, + "title": "Query Frontend API", + "titleSize": "h6" + }, + { + "collapse": false, + "height": "250px", + "panels": [ + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "description": "Show rate of cache requests.", + "fill": 10, + "id": 5, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 0, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ ], + "spaceLength": 10, + "span": 3, + "stack": true, + "steppedLine": false, + "targets": [ + { + "expr": "sum by (job, tripperware) (rate(cortex_cache_request_duration_seconds_count{job=~\"$job\"}[$interval]))", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "{{job}} {{tripperware}}", + "legendLink": null, + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Requests", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + 
"format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + }, + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "description": "Show rate of Querier cache gets vs misses.", + "fill": 10, + "id": 6, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 0, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ ], + "spaceLength": 10, + "span": 3, + "stack": true, + "steppedLine": false, + "targets": [ + { + "expr": "sum by (job, tripperware) (rate(querier_cache_gets_total{job=~\"$job\"}[$interval]))", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "Cache gets - {{job}} {{tripperware}}", + "legendLink": null, + "step": 10 + }, + { + "expr": "sum by (job, tripperware) (rate(querier_cache_misses_total{job=~\"$job\"}[$interval]))", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "Cache misses - {{job}} {{tripperware}}", + "legendLink": null, + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Querier cache gets vs misses", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + }, + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "description": "Shows rate of cortex fetched keys.", + "fill": 10, + "id": 7, + "legend": { + 
"avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 0, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ ], + "spaceLength": 10, + "span": 3, + "stack": true, + "steppedLine": false, + "targets": [ + { + "expr": "sum by (job, tripperware) (rate(cortex_cache_fetched_keys{job=~\"$job\"}[$interval]))", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "{{job}} {{tripperware}}", + "legendLink": null, + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Cortex fetched keys", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + }, + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "description": "Shows rate of cortex cache hits.", + "fill": 10, + "id": 8, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 0, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ ], + "spaceLength": 10, + "span": 3, + "stack": true, + "steppedLine": false, + "targets": [ + { + "expr": "sum by (job, tripperware) (rate(cortex_cache_hits{job=~\"$job\"}[$interval]))", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "{{job}} {{tripperware}}", + 
"legendLink": null, + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Cortex cache hits", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + } + ], + "repeat": null, + "repeatIteration": null, + "repeatRowId": null, + "showTitle": true, + "title": "Cache Operations", + "titleSize": "h6" + }, + { + "collapse": true, + "height": "250px", + "panels": [ + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "fill": 1, + "id": 9, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ ], + "spaceLength": 10, + "span": 4, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "go_memstats_alloc_bytes{job=~\"$job\"}", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "alloc all {{instance}}", + "legendLink": null, + "step": 10 + }, + { + "expr": "go_memstats_heap_alloc_bytes{job=~\"$job\"}", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "alloc heap {{instance}}", + "legendLink": null, + "step": 10 + }, + { + "expr": "rate(go_memstats_alloc_bytes_total{job=~\"$job\"}[30s])", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "alloc rate all {{instance}}", + "legendLink": null, + "step": 10 + }, + { + "expr": 
"rate(go_memstats_heap_alloc_bytes{job=~\"$job\"}[30s])", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "alloc rate heap {{instance}}", + "legendLink": null, + "step": 10 + }, + { + "expr": "go_memstats_stack_inuse_bytes{job=~\"$job\"}", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "inuse stack {{instance}}", + "legendLink": null, + "step": 10 + }, + { + "expr": "go_memstats_heap_inuse_bytes{job=~\"$job\"}", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "inuse heap {{instance}}", + "legendLink": null, + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Memory Used", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + }, + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "fill": 1, + "id": 10, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ ], + "spaceLength": 10, + "span": 4, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "go_goroutines{job=~\"$job\"}", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "{{instance}}", + "legendLink": null, + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "Goroutines", + "tooltip": { + "shared": false, + "sort": 0,
"value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + }, + { + "aliasColors": { }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "$datasource", + "fill": 1, + "id": 11, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "links": [ ], + "nullPointMode": "null as zero", + "percentage": false, + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [ ], + "spaceLength": 10, + "span": 4, + "stack": false, + "steppedLine": false, + "targets": [ + { + "expr": "go_gc_duration_seconds{job=~\"$job\"}", + "format": "time_series", + "intervalFactor": 2, + "legendFormat": "{{quantile}} {{instance}}", + "legendLink": null, + "step": 10 + } + ], + "thresholds": [ ], + "timeFrom": null, + "timeShift": null, + "title": "GC Time Quantiles", + "tooltip": { + "shared": false, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [ ] + }, + "yaxes": [ + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": 0, + "show": true + }, + { + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ] + } + ], + "repeat": null, + "repeatIteration": null, + "repeatRowId": null, + "showTitle": true, + "title": "Resources", + "titleSize": "h6" + } + ], + "schemaVersion": 14, + "style": "dark", + "tags": [ + "thanos-mixin" + ], + "templating": { + "list": [ + { + "current": { + "text": "default", + "value": "default" + }, 
+ "hide": 0, + "label": null, + "name": "datasource", + "options": [ ], + "query": "prometheus", + "refresh": 1, + "regex": "", + "type": "datasource" + }, + { + "auto": true, + "auto_count": 300, + "auto_min": "10s", + "current": { + "text": "5m", + "value": "5m" + }, + "hide": 0, + "label": "interval", + "name": "interval", + "query": "5m,10m,30m,1h,6h,12h", + "refresh": 2, + "type": "interval" + }, + { + "allValue": null, + "current": { + "text": "all", + "value": "$__all" + }, + "datasource": "$datasource", + "hide": 0, + "includeAll": true, + "label": "job", + "multi": false, + "name": "job", + "options": [ ], + "query": "label_values(up{job=~\".*thanos-query-frontend.*\"}, job)", + "refresh": 1, + "regex": "", + "sort": 2, + "tagValuesQuery": "", + "tags": [ ], + "tagsQuery": "", + "type": "query", + "useTags": false + } + ] + }, + "time": { + "from": "now-1h", + "to": "now" + }, + "timepicker": { + "refresh_intervals": [ + "5s", + "10s", + "30s", + "1m", + "5m", + "15m", + "30m", + "1h", + "2h", + "1d" + ], + "time_options": [ + "5m", + "15m", + "1h", + "6h", + "12h", + "24h", + "2d", + "7d", + "30d" + ] + }, + "timezone": "UTC", + "title": "Thanos / Query Frontend", + "uid": "9bc9f8bb21d4d18193c3fe772b36c306", + "version": 0 +} diff --git a/mixin/README.md b/mixin/README.md index baef01946c..1ecedabb95 100644 --- a/mixin/README.md +++ b/mixin/README.md @@ -88,6 +88,10 @@ This project is intended to be used as a library. 
You can extend and customize d selector: 'job=~".*thanos-query.*"', title: '%(prefix)sQuery' % $.dashboard.prefix, }, + queryFrontend+:: { + selector: 'job=~".*thanos-query-frontend.*"', + title: '%(prefix)sQuery Frontend' % $.dashboard.prefix, + }, store+:: { selector: 'job=~".*thanos-store.*"', title: '%(prefix)sStore' % $.dashboard.prefix, diff --git a/mixin/config.libsonnet b/mixin/config.libsonnet index e4d415d5ef..55e4e9cb17 100644 --- a/mixin/config.libsonnet +++ b/mixin/config.libsonnet @@ -28,6 +28,10 @@ selector: 'job=~".*thanos-query.*"', title: '%(prefix)sQuery' % $.dashboard.prefix, }, + queryFrontend+:: { + selector: 'job=~".*thanos-query-frontend.*"', + title: '%(prefix)sQuery Frontend' % $.dashboard.prefix, + }, store+:: { selector: 'job=~".*thanos-store.*"', title: '%(prefix)sStore' % $.dashboard.prefix, diff --git a/mixin/dashboards/dashboards.libsonnet b/mixin/dashboards/dashboards.libsonnet index 5bd99093f5..d35bbcd843 100644 --- a/mixin/dashboards/dashboards.libsonnet +++ b/mixin/dashboards/dashboards.libsonnet @@ -1,4 +1,5 @@ (import 'query.libsonnet') + +(import 'query_frontend.libsonnet') + (import 'store.libsonnet') + (import 'sidecar.libsonnet') + (import 'receive.libsonnet') + diff --git a/mixin/dashboards/query_frontend.libsonnet b/mixin/dashboards/query_frontend.libsonnet new file mode 100644 index 0000000000..136f7405f4 --- /dev/null +++ b/mixin/dashboards/query_frontend.libsonnet @@ -0,0 +1,82 @@ +local g = import '../lib/thanos-grafana-builder/builder.libsonnet'; +local utils = import '../lib/utils.libsonnet'; + +{ + local thanos = self, + queryFrontend+:: { + selector: error 'must provide selector for Thanos Query Frontend dashboard', + title: error 'must provide title for Thanos Query Frontend dashboard', + dashboard:: { + selector: std.join(', ', thanos.dashboard.selector + ['job=~"$job"']), + dimensions: std.join(', ', thanos.dashboard.dimensions + ['job']), + }, + }, + grafanaDashboards+:: { + [if thanos.queryFrontend != null 
then 'queryFrontend.json']: + local queryFrontendHandlerSelector = utils.joinLabels([thanos.queryFrontend.dashboard.selector, 'handler="query-frontend"']); + local queryFrontendTripperwareSelector = utils.joinLabels([thanos.queryFrontend.dashboard.selector, 'tripperware="query_range"']); + local queryFrontendOpSelector = utils.joinLabels([thanos.queryFrontend.dashboard.selector, 'op="query_range"']); + g.dashboard(thanos.queryFrontend.title) + .addRow( + g.row('Query Frontend API') + .addPanel( + g.panel('Rate of requests', 'Shows rate of requests against Query Frontend for the given time.') + + g.httpQpsPanel('http_requests_total', queryFrontendHandlerSelector, thanos.queryFrontend.dashboard.dimensions) + ) + .addPanel( + g.panel('Rate of queries', 'Shows rate of queries passing through Query Frontend') + + g.httpQpsPanel('thanos_query_frontend_queries_total', queryFrontendOpSelector, thanos.queryFrontend.dashboard.dimensions) + ) + .addPanel( + g.panel('Errors', 'Shows ratio of errors compared to the total number of handled requests against Query Frontend.') + + g.httpErrPanel('http_requests_total', queryFrontendHandlerSelector, thanos.queryFrontend.dashboard.dimensions) + ) + .addPanel( + g.panel('Duration', 'Shows how long it has taken to handle requests, in quantiles.') + + g.latencyPanel('http_request_duration_seconds', queryFrontendHandlerSelector, thanos.queryFrontend.dashboard.dimensions) + ) + ) + .addRow( + g.row('Cache Operations') + .addPanel( + g.panel('Requests', 'Show rate of cache requests.') + + g.queryPanel( + 'sum by (%s) (rate(cortex_cache_request_duration_seconds_count{%s}[$interval]))' % [utils.joinLabels([thanos.queryFrontend.dashboard.dimensions, 'tripperware']), thanos.queryFrontend.dashboard.selector], + '{{job}} {{tripperware}}', + ) + + g.stack + ) + .addPanel( + g.panel('Querier cache gets vs misses', 'Show rate of Querier cache gets vs misses.') + + g.queryPanel( + 'sum by (%s) (rate(querier_cache_gets_total{%s}[$interval]))' %
[utils.joinLabels([thanos.queryFrontend.dashboard.dimensions, 'tripperware']), thanos.queryFrontend.dashboard.selector], + 'Cache gets - {{job}} {{tripperware}}', + ) + + g.queryPanel( + 'sum by (%s) (rate(querier_cache_misses_total{%s}[$interval]))' % [utils.joinLabels([thanos.queryFrontend.dashboard.dimensions, 'tripperware']), thanos.queryFrontend.dashboard.selector], + 'Cache misses - {{job}} {{tripperware}}', + ) + + g.stack + ) + .addPanel( + g.panel('Cortex fetched keys', 'Shows rate of cortex fetched keys.') + + g.queryPanel( + 'sum by (%s) (rate(cortex_cache_fetched_keys{%s}[$interval]))' % [utils.joinLabels([thanos.queryFrontend.dashboard.dimensions, 'tripperware']), thanos.queryFrontend.dashboard.selector], + '{{job}} {{tripperware}}', + ) + + g.stack + ) + .addPanel( + g.panel('Cortex cache hits', 'Shows rate of cortex cache hits.') + + g.queryPanel( + 'sum by (%s) (rate(cortex_cache_hits{%s}[$interval]))' % [utils.joinLabels([thanos.queryFrontend.dashboard.dimensions, 'tripperware']), thanos.queryFrontend.dashboard.selector], + '{{job}} {{tripperware}}', + ) + + g.stack + ) + ) + .addRow( + g.resourceUtilizationRow(thanos.queryFrontend.dashboard.selector, thanos.queryFrontend.dashboard.dimensions) + ), + }, +}
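---

Usage note (not part of the patches above): consumers of the mixin can override the new `queryFrontend` config block added in `mixin/config.libsonnet` the same way as the existing per-component blocks, since it is a hidden mergeable field (`+::`). A minimal sketch — the `thanos-mixin/mixin.libsonnet` import path and the `job` value are assumptions about how you vendor the mixin and name your deployment:

```jsonnet
// render-query-frontend.jsonnet -- hypothetical mixin consumer.
// 'thanos-mixin' is an assumed vendoring path (e.g. via jsonnet-bundler).
local thanos = import 'thanos-mixin/mixin.libsonnet';

// Merge in a selector matching your own Query Frontend job label,
// then render only the new dashboard from grafanaDashboards.
(thanos {
  queryFrontend+:: {
    selector: 'job=~"observability/thanos-query-frontend.*"',
  },
}).grafanaDashboards['queryFrontend.json']
```

Rendering with something like `jsonnet -J vendor render-query-frontend.jsonnet > queryFrontend.json` should then yield a dashboard scoped to that job selector, analogous to the generated `examples/dashboards/queryFrontend.json` in patch 4/4.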