
Commit

format changelog
Signed-off-by: Ben Ye <[email protected]>
yeya24 committed Aug 8, 2020
1 parent f160e12 commit e5f0737
Showing 4 changed files with 144 additions and 74 deletions.
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -13,9 +13,6 @@ We use *breaking* word for marking changes that are not backward compatible (rel

### Changed

- - [#2906](https://github.com/thanos-io/thanos/pull/2906) Tools: Refactor Bucket replicate execution. Removed all `thanos_replicate_origin_.*` metrics.
- - `thanos_replicate_origin_meta_loads_total` can be replaced by `blocks_meta_synced{state="loaded"}`.
- - `thanos_replicate_origin_partial_meta_reads_total` can be replaced by `blocks_meta_synced{state="failed"}`.

### Fixed

@@ -49,6 +46,9 @@ We use *breaking* word for marking changes that are not backward compatible (rel
- [#2902](https://github.com/thanos-io/thanos/pull/2902) ui: React: Separate dedupe and partial response checkboxes per panel.
- [#2931](https://github.com/thanos-io/thanos/pull/2931) Query: Allow passing a `storeMatcher[]` to select matching stores when debugging the querier. See [documentation](https://thanos.io/components/query.md/#store-filtering)
- [#2991](https://github.com/thanos-io/thanos/pull/2991) store: `operation` label value `getrange` changed to `get_range` for `thanos_store_bucket_cache_operation_requests_total` and `thanos_store_bucket_cache_operation_hits_total` to be consistent with bucket operation metrics.
+ - [#2906](https://github.com/thanos-io/thanos/pull/2906) Tools: Refactor Bucket replicate execution. Removed all `thanos_replicate_origin_.*` metrics.
+ - `thanos_replicate_origin_meta_loads_total` can be replaced by `blocks_meta_synced{state="loaded"}`.
+ - `thanos_replicate_origin_partial_meta_reads_total` can be replaced by `blocks_meta_synced{state="failed"}`.

## [v0.14.0](https://github.com/thanos-io/thanos/releases/tag/v0.14.0) - 2020.07.10

102 changes: 68 additions & 34 deletions examples/alerts/alerts.md
@@ -10,7 +10,8 @@ name: thanos-compact.rules
rules:
- alert: ThanosCompactMultipleRunning
  annotations:
- message: No more than one Thanos Compact instance should be running at once. There are {{ $value }}
+ message: No more than one Thanos Compact instance should be running at once. There
+   are {{ $value }}
expr: sum(up{job=~"thanos-compact.*"}) > 1
for: 5m
labels:
@@ -24,7 +25,8 @@ rules:
  severity: warning
- alert: ThanosCompactHighCompactionFailures
  annotations:
- message: Thanos Compact {{$labels.job}} is failing to execute {{ $value | humanize }}% of compactions.
+ message: Thanos Compact {{$labels.job}} is failing to execute {{ $value | humanize
+   }}% of compactions.
expr: |
(
sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~"thanos-compact.*"}[5m]))
@@ -37,7 +39,8 @@ rules:
  severity: warning
- alert: ThanosCompactBucketHighOperationFailures
  annotations:
- message: Thanos Compact {{$labels.job}} Bucket is failing to execute {{ $value | humanize }}% of operations.
+ message: Thanos Compact {{$labels.job}} Bucket is failing to execute {{ $value
+   | humanize }}% of operations.
expr: |
(
sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~"thanos-compact.*"}[5m]))
@@ -51,7 +54,8 @@ rules:
- alert: ThanosCompactHasNotRun
  annotations:
    message: Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.
- expr: (time() - max(max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~"thanos-compact.*"}[24h]))) / 60 / 60 > 24
+ expr: (time() - max(max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~"thanos-compact.*"}[24h])))
+   / 60 / 60 > 24
labels:
severity: warning
```
@@ -74,7 +78,8 @@ rules:
  severity: critical
- alert: ThanosRuleSenderIsFailingAlerts
  annotations:
- message: Thanos Rule {{$labels.job}} {{$labels.pod}} is failing to send alerts to alertmanager.
+ message: Thanos Rule {{$labels.job}} {{$labels.pod}} is failing to send alerts
+   to alertmanager.
expr: |
sum by (job) (rate(thanos_alert_sender_alerts_dropped_total{job=~"thanos-rule.*"}[5m])) > 0
for: 5m
@@ -95,15 +100,17 @@ rules:
  severity: critical
- alert: ThanosRuleHighRuleEvaluationWarnings
  annotations:
- message: Thanos Rule {{$labels.job}} {{$labels.pod}} has high number of evaluation warnings.
+ message: Thanos Rule {{$labels.job}} {{$labels.pod}} has high number of evaluation
+   warnings.
  expr: |
    sum by (job) (rate(thanos_rule_evaluation_with_warnings_total{job=~"thanos-rule.*"}[5m])) > 0
  for: 15m
  labels:
    severity: info
- alert: ThanosRuleRuleEvaluationLatencyHigh
  annotations:
- message: Thanos Rule {{$labels.job}}/{{$labels.pod}} has higher evaluation latency than interval for {{$labels.rule_group}}.
+ message: Thanos Rule {{$labels.job}}/{{$labels.pod}} has higher evaluation latency
+   than interval for {{$labels.rule_group}}.
expr: |
(
sum by (job, pod, rule_group) (prometheus_rule_group_last_duration_seconds{job=~"thanos-rule.*"})
@@ -115,7 +122,8 @@ rules:
  severity: warning
- alert: ThanosRuleGrpcErrorRate
  annotations:
- message: Thanos Rule {{$labels.job}} is failing to handle {{ $value | humanize }}% of requests.
+ message: Thanos Rule {{$labels.job}} is failing to handle {{ $value | humanize
+   }}% of requests.
expr: |
(
sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~"thanos-rule.*"}[5m]))
@@ -129,13 +137,15 @@ rules:
- alert: ThanosRuleConfigReloadFailure
  annotations:
    message: Thanos Rule {{$labels.job}} has not been able to reload its configuration.
- expr: avg(thanos_rule_config_last_reload_successful{job=~"thanos-rule.*"}) by (job) != 1
+ expr: avg(thanos_rule_config_last_reload_successful{job=~"thanos-rule.*"}) by (job)
+   != 1
  for: 5m
  labels:
    severity: info
- alert: ThanosRuleQueryHighDNSFailures
  annotations:
- message: Thanos Rule {{$labels.job}} has {{ $value | humanize }}% of failing DNS queries for query endpoints.
+ message: Thanos Rule {{$labels.job}} has {{ $value | humanize }}% of failing DNS
+   queries for query endpoints.
expr: |
(
sum by (job) (rate(thanos_ruler_query_apis_dns_failures_total{job=~"thanos-rule.*"}[5m]))
@@ -148,7 +158,8 @@ rules:
  severity: warning
- alert: ThanosRuleAlertmanagerHighDNSFailures
  annotations:
- message: Thanos Rule {{$labels.job}} has {{ $value | humanize }}% of failing DNS queries for Alertmanager endpoints.
+ message: Thanos Rule {{$labels.job}} has {{ $value | humanize }}% of failing DNS
+   queries for Alertmanager endpoints.
expr: |
(
sum by (job) (rate(thanos_ruler_alertmanagers_dns_failures_total{job=~"thanos-rule.*"}[5m]))
@@ -161,7 +172,8 @@ rules:
  severity: warning
- alert: ThanosRuleNoEvaluationFor10Intervals
  annotations:
- message: Thanos Rule {{$labels.job}} has {{ $value | humanize }}% rule groups that did not evaluate for at least 10x of their expected interval.
+ message: Thanos Rule {{$labels.job}} has {{ $value | humanize }}% rule groups
+   that did not evaluate for at least 10x of their expected interval.
expr: |
time() - max by (job, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~"thanos-rule.*"})
>
@@ -171,7 +183,8 @@ rules:
  severity: info
- alert: ThanosNoRuleEvaluations
  annotations:
- message: Thanos Rule {{$labels.job}} did not perform any rule evaluations in the past 2 minutes.
+ message: Thanos Rule {{$labels.job}} did not perform any rule evaluations in the
+   past 2 minutes.
expr: |
sum(rate(prometheus_rule_evaluations_total{job=~"thanos-rule.*"}[2m])) <= 0
and
@@ -189,7 +202,8 @@ name: thanos-store.rules
rules:
- alert: ThanosStoreGrpcErrorRate
  annotations:
- message: Thanos Store {{$labels.job}} is failing to handle {{ $value | humanize }}% of requests.
+ message: Thanos Store {{$labels.job}} is failing to handle {{ $value | humanize
+   }}% of requests.
expr: |
(
sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~"thanos-store.*"}[5m]))
@@ -202,7 +216,8 @@ rules:
  severity: warning
- alert: ThanosStoreSeriesGateLatencyHigh
  annotations:
- message: Thanos Store {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for store series gate requests.
+ message: Thanos Store {{$labels.job}} has a 99th percentile latency of {{ $value
+   }} seconds for store series gate requests.
expr: |
(
histogram_quantile(0.9, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~"thanos-store.*"}[5m]))) > 2
@@ -214,7 +229,8 @@ rules:
  severity: warning
- alert: ThanosStoreBucketHighOperationFailures
  annotations:
- message: Thanos Store {{$labels.job}} Bucket is failing to execute {{ $value | humanize }}% of operations.
+ message: Thanos Store {{$labels.job}} Bucket is failing to execute {{ $value |
+   humanize }}% of operations.
expr: |
(
sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~"thanos-store.*"}[5m]))
@@ -227,7 +243,8 @@ rules:
  severity: warning
- alert: ThanosStoreObjstoreOperationLatencyHigh
  annotations:
- message: Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{ $value }} seconds for the bucket operations.
+ message: Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of
+   {{ $value }} seconds for the bucket operations.
expr: |
(
histogram_quantile(0.9, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~"thanos-store.*"}[5m]))) > 2
@@ -255,7 +272,8 @@ rules:
  severity: critical
- alert: ThanosSidecarUnhealthy
  annotations:
- message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.
+ message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value
+   }} seconds.
expr: |
count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 300) > 0
labels:
@@ -270,7 +288,8 @@ name: thanos-query.rules
rules:
- alert: ThanosQueryHttpRequestQueryErrorRateHigh
  annotations:
- message: Thanos Query {{$labels.job}} is failing to handle {{ $value | humanize }}% of "query" requests.
+ message: Thanos Query {{$labels.job}} is failing to handle {{ $value | humanize
+   }}% of "query" requests.
expr: |
(
sum(rate(http_requests_total{code=~"5..", job=~"thanos-query.*", handler="query"}[5m]))
@@ -282,7 +301,8 @@ rules:
  severity: critical
- alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
  annotations:
- message: Thanos Query {{$labels.job}} is failing to handle {{ $value | humanize }}% of "query_range" requests.
+ message: Thanos Query {{$labels.job}} is failing to handle {{ $value | humanize
+   }}% of "query_range" requests.
expr: |
(
sum(rate(http_requests_total{code=~"5..", job=~"thanos-query.*", handler="query_range"}[5m]))
@@ -294,7 +314,8 @@ rules:
  severity: critical
- alert: ThanosQueryGrpcServerErrorRate
  annotations:
- message: Thanos Query {{$labels.job}} is failing to handle {{ $value | humanize }}% of requests.
+ message: Thanos Query {{$labels.job}} is failing to handle {{ $value | humanize
+   }}% of requests.
expr: |
(
sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~"thanos-query.*"}[5m]))
@@ -307,7 +328,8 @@ rules:
  severity: warning
- alert: ThanosQueryGrpcClientErrorRate
  annotations:
- message: Thanos Query {{$labels.job}} is failing to send {{ $value | humanize }}% of requests.
+ message: Thanos Query {{$labels.job}} is failing to send {{ $value | humanize
+   }}% of requests.
expr: |
(
sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~"thanos-query.*"}[5m]))
@@ -319,7 +341,8 @@ rules:
  severity: warning
- alert: ThanosQueryHighDNSFailures
  annotations:
- message: Thanos Query {{$labels.job}} have {{ $value | humanize }}% of failing DNS queries for store endpoints.
+ message: Thanos Query {{$labels.job}} have {{ $value | humanize }}% of failing
+   DNS queries for store endpoints.
expr: |
(
sum by (job) (rate(thanos_querier_store_apis_dns_failures_total{job=~"thanos-query.*"}[5m]))
@@ -331,7 +354,8 @@ rules:
  severity: warning
- alert: ThanosQueryInstantLatencyHigh
  annotations:
- message: Thanos Query {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for instant queries.
+ message: Thanos Query {{$labels.job}} has a 99th percentile latency of {{ $value
+   }} seconds for instant queries.
expr: |
(
histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~"thanos-query.*", handler="query"}[5m]))) > 40
@@ -343,7 +367,8 @@ rules:
  severity: critical
- alert: ThanosQueryRangeLatencyHigh
  annotations:
- message: Thanos Query {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for range queries.
+ message: Thanos Query {{$labels.job}} has a 99th percentile latency of {{ $value
+   }} seconds for range queries.
expr: |
(
histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~"thanos-query.*", handler="query_range"}[5m]))) > 90
@@ -363,7 +388,8 @@ name: thanos-receive.rules
rules:
- alert: ThanosReceiveHttpRequestErrorRateHigh
  annotations:
- message: Thanos Receive {{$labels.job}} is failing to handle {{ $value | humanize }}% of requests.
+ message: Thanos Receive {{$labels.job}} is failing to handle {{ $value | humanize
+   }}% of requests.
expr: |
(
sum(rate(http_requests_total{code=~"5..", job=~"thanos-receive.*", handler="receive"}[5m]))
@@ -375,7 +401,8 @@ rules:
  severity: critical
- alert: ThanosReceiveHttpRequestLatencyHigh
  annotations:
- message: Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.
+ message: Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value
+   }} seconds for requests.
expr: |
(
histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~"thanos-receive.*", handler="receive"}[5m]))) > 10
@@ -387,7 +414,8 @@ rules:
  severity: critical
- alert: ThanosReceiveHighReplicationFailures
  annotations:
- message: Thanos Receive {{$labels.job}} is failing to replicate {{ $value | humanize }}% of requests.
+ message: Thanos Receive {{$labels.job}} is failing to replicate {{ $value | humanize
+   }}% of requests.
expr: |
thanos_receive_replication_factor > 1
and
@@ -409,7 +437,8 @@ rules:
  severity: warning
- alert: ThanosReceiveHighForwardRequestFailures
  annotations:
- message: Thanos Receive {{$labels.job}} is failing to forward {{ $value | humanize }}% of requests.
+ message: Thanos Receive {{$labels.job}} is failing to forward {{ $value | humanize
+   }}% of requests.
expr: |
(
sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~"thanos-receive.*"}[5m]))
@@ -421,7 +450,8 @@ rules:
  severity: warning
- alert: ThanosReceiveHighHashringFileRefreshFailures
  annotations:
- message: Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{ $value | humanize }} of attempts failed.
+ message: Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{
+   $value | humanize }} of attempts failed.
expr: |
(
sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~"thanos-receive.*"}[5m]))
@@ -435,13 +465,15 @@ rules:
- alert: ThanosReceiveConfigReloadFailure
  annotations:
    message: Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.
- expr: avg(thanos_receive_config_last_reload_successful{job=~"thanos-receive.*"}) by (job) != 1
+ expr: avg(thanos_receive_config_last_reload_successful{job=~"thanos-receive.*"})
+   by (job) != 1
  for: 5m
  labels:
    severity: warning
- alert: ThanosReceiveNoUpload
  annotations:
- message: Thanos Receive {{ $labels.instance }} of {{$labels.job}} has not uploaded latest data to object storage.
+ message: Thanos Receive {{ $labels.instance }} of {{$labels.job}} has not uploaded
+   latest data to object storage.
expr: |
(up{job=~"thanos-receive.*"} - 1)
+ on (instance) # filters to only alert on current instance last 3h
@@ -467,7 +499,8 @@ rules:
  severity: critical
- alert: ThanosBucketReplicateErrorRate
  annotations:
- message: Thanos Replicate failing to run, {{ $value | humanize }}% of attempts failed.
+ message: Thanos Replicate failing to run, {{ $value | humanize }}% of attempts
+   failed.
expr: |
(
sum(rate(thanos_replicate_replication_runs_total{result="error", job=~"thanos-bucket-replicate.*"}[5m]))
@@ -479,7 +512,8 @@ rules:
  severity: critical
- alert: ThanosBucketReplicateRunLatency
  annotations:
- message: Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for the replicate operations.
+ message: Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{
+   $value }} seconds for the replicate operations.
expr: |
(
histogram_quantile(0.9, sum by (job, le) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~"thanos-bucket-replicate.*"}[5m]))) > 20
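Every hunk in this diff applies the same transformation: a long annotation string is folded at roughly 80 columns, with the continuation line indented under the key. The commit does not name the formatter it used, but the wrapping pattern can be sketched with the standard library alone:

```python
import textwrap

# One of the long alert annotations reflowed in this commit.
message = ("Thanos Compact {{$labels.job}} is failing to execute "
           "{{ $value | humanize }}% of compactions.")

# Fold at 80 columns and indent continuation lines, mimicking the
# YAML-style line wrapping visible in the hunks above.
lines = textwrap.wrap(message, width=80, subsequent_indent="  ")
print("\n".join(lines))
```

The 80-column width and two-space continuation indent here are assumptions chosen to match the diff's visible output, not settings taken from the commit itself.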
