
Commit

Merge branch 'main' into dimitar/2.14/promote-chunks-streaming-to-stable
dimitarvdimitrov authored Jul 23, 2024
2 parents b7e4a68 + c2716aa commit 2e736f7
Showing 90 changed files with 3,901 additions and 1,740 deletions.
20 changes: 18 additions & 2 deletions CHANGELOG.md
@@ -10,8 +10,12 @@
* [CHANGE] Store-gateway: enabled `-blocks-storage.bucket-store.max-concurrent-queue-timeout` by default with a timeout of 5 seconds. #8496
* [CHANGE] Store-gateway: enabled `-blocks-storage.bucket-store.index-header.lazy-loading-concurrency-queue-timeout` by default with a timeout of 5 seconds. #8667
* [CHANGE] Distributor: Incoming OTLP requests were previously size-limited by using limit from `-distributor.max-recv-msg-size` option. We have added option `-distributor.max-otlp-request-size` for limiting OTLP requests, with default value of 100 MiB. #8574
* [CHANGE] Distributor: remove metric `cortex_distributor_sample_delay_seconds`. #8698
* [CHANGE] Query-frontend: Remove deprecated `frontend.align_queries_with_step` YAML configuration. The configuration option has been moved to per-tenant and default `limits` since Mimir 2.12. #8733 #8735
* [CHANGE] Store-gateway: Change default of `-blocks-storage.bucket-store.max-concurrent` to 200. #8768
* [CHANGE] Added new metric `cortex_compactor_disk_out_of_space_errors_total`, which counts how many times a compaction failed because the compactor ran out of disk space. Alert on any single increase. #8237 #8278
* [CHANGE] Store-gateway: Remove experimental parameter `-blocks-storage.bucket-store.series-selection-strategy`. The default strategy is now `worst-case`. #8702
* [CHANGE] Store-gateway: Rename `-blocks-storage.bucket-store.series-selection-strategies.worst-case-series-preference` to `-blocks-storage.bucket-store.series-fetch-preference` and promote to stable. #8702
* [CHANGE] Querier, store-gateway: remove deprecated `-querier.prefer-streaming-chunks-from-store-gateways=true`. Streaming from store-gateways is now always enabled. #8696
* [FEATURE] Querier: add experimental streaming PromQL engine, enabled with `-querier.query-engine=mimir`. #8422 #8430 #8454 #8455 #8360 #8490 #8508 #8577 #8671
* [FEATURE] Experimental Kafka-based ingest storage. #6888 #6894 #6929 #6940 #6951 #6974 #6982 #7029 #7030 #7091 #7142 #7147 #7148 #7153 #7160 #7193 #7349 #7376 #7388 #7391 #7393 #7394 #7402 #7404 #7423 #7424 #7437 #7486 #7503 #7508 #7540 #7621 #7682 #7685 #7694 #7695 #7696 #7697 #7701 #7733 #7734 #7741 #7752 #7838 #7851 #7871 #7877 #7880 #7882 #7887 #7891 #7925 #7955 #7967 #8031 #8063 #8077 #8088 #8135 #8176 #8184 #8194 #8216 #8217 #8222 #8233 #8503 #8542 #8579 #8657 #8686 #8688 #8703 #8706 #8708 #8738 #8750
@@ -63,15 +67,19 @@

* [CHANGE] Dashboards: set default auto-refresh rate to 5m. #8758
* [ENHANCEMENT] Dashboards: allow switching between using classic or native histograms in dashboards.
* Overview dashboard: status, read/write latency and queries/ingestion per sec panels, `cortex_request_duration_seconds` metric. #7674 #8502
* Writes dashboard: `cortex_request_duration_seconds` metric. #8757
* Overview dashboard: status, read/write latency and queries/ingestion per sec panels, `cortex_request_duration_seconds` metric. #7674 #8502 #8791
* Writes dashboard: `cortex_request_duration_seconds` metric. #8757 #8791
* Reads dashboard: `cortex_request_duration_seconds` metric. #8752
* Rollout progress dashboard. #8779
* Alertmanager dashboard. #8792
* [ENHANCEMENT] Alerts: `MimirRunningIngesterReceiveDelayTooHigh` alert has been tuned to be more reactive to high receive delay. #8538
* [ENHANCEMENT] Dashboards: improve end-to-end latency and strong read consistency panels when experimental ingest storage is enabled. #8543
* [ENHANCEMENT] Dashboards: Add panels for monitoring ingester autoscaling when not using ingest-storage. These panels are disabled by default, but can be enabled using the `autoscaling.ingester.enabled: true` config option. #8484
* [ENHANCEMENT] Dashboards: add panels to show writes to experimental ingest storage backend in the "Mimir / Ruler" dashboard, when `_config.show_ingest_storage_panels` is enabled. #8732
* [ENHANCEMENT] Dashboards: show all series in tooltips on time series dashboard panels. #8748
* [ENHANCEMENT] Dashboards: add compactor autoscaling panels to "Mimir / Compactor" dashboard. The panels are disabled by default, but can be enabled by setting `_config.autoscaling.compactor.enabled` to `true`. #8777
* [ENHANCEMENT] Alerts: added `MimirKafkaClientBufferedProduceBytesTooHigh` alert. #8763
* [ENHANCEMENT] Dashboards: added "Kafka produced records / sec" panel to "Mimir / Writes" dashboard. #8763
* [BUGFIX] Dashboards: fix "current replicas" in autoscaling panels when HPA is not active. #8566
* [BUGFIX] Alerts: do not fire `MimirRingMembersMismatch` during the migration to experimental ingest storage. #8727

@@ -85,10 +93,18 @@
* [ENHANCEMENT] Distributor: increase `-distributor.remote-timeout` when the experimental ingest storage is enabled. #8518
* [ENHANCEMENT] Memcached: Update to Memcached 1.6.28 and memcached-exporter 0.14.4. #8557
* [ENHANCEMENT] Rollout-operator: Allow the rollout-operator to be used as Kubernetes statefulset webhook to enable `no-downscale` and `prepare-downscale` annotations to be used on ingesters or store-gateways. #8743
* [ENHANCEMENT] Do not deploy ingester-zone-c when experimental ingest storage is enabled and `ingest_storage_ingester_zones` is configured to `2`. #8776
* [ENHANCEMENT] Added the config option `ingest_storage_migration_classic_ingesters_no_scale_down_delay` to disable the downscale delay on classic ingesters when migrating to experimental ingest storage. #8775

### Mimirtool

* [CHANGE] Analyze Rules: Count recording rules used in rules group as used. #6133
* [CHANGE] Remove deprecated `--rule-files` flag in favor of CLI arguments for the following commands: #8701
* `mimirtool rules load`
* `mimirtool rules sync`
* `mimirtool rules diff`
* `mimirtool rules check`
* `mimirtool rules prepare`

### Mimir Continuous Test

33 changes: 6 additions & 27 deletions cmd/mimir/config-descriptor.json
@@ -9481,35 +9481,14 @@
},
{
"kind": "field",
"name": "series_selection_strategy",
"name": "series_fetch_preference",
"required": false,
"desc": "This option controls the strategy to selection of series and deferring application of matchers. A more aggressive strategy will fetch less posting lists at the cost of more series. This is useful when querying large blocks in which many series share the same label name and value. Supported values (most aggressive to least aggressive): speculative, worst-case, worst-case-small-posting-lists, all.",
"desc": "This parameter controls the trade-off in fetching series versus fetching postings to fulfill a series request. Increasing the series preference results in fetching more series and reducing the volume of postings fetched. Reducing the series preference results in the opposite. Increase this parameter to reduce the rate of fetched series bytes (see \"Mimir / Queries\" dashboard) or API calls to the object store. Must be a positive floating point number.",
"fieldValue": null,
"fieldDefaultValue": "worst-case",
"fieldFlag": "blocks-storage.bucket-store.series-selection-strategy",
"fieldType": "string",
"fieldCategory": "experimental"
},
{
"kind": "block",
"name": "series_selection_strategies",
"required": false,
"desc": "",
"blockEntries": [
{
"kind": "field",
"name": "worst_case_series_preference",
"required": false,
"desc": "This option is only used when blocks-storage.bucket-store.series-selection-strategy=worst-case. Increasing the series preference results in fetching more series than postings. Must be a positive floating point number.",
"fieldValue": null,
"fieldDefaultValue": 0.75,
"fieldFlag": "blocks-storage.bucket-store.series-selection-strategies.worst-case-series-preference",
"fieldType": "float",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
"fieldDefaultValue": null
"fieldDefaultValue": 0.75,
"fieldFlag": "blocks-storage.bucket-store.series-fetch-preference",
"fieldType": "float",
"fieldCategory": "advanced"
}
],
"fieldValue": null,
6 changes: 2 additions & 4 deletions cmd/mimir/help-all.txt.tmpl
@@ -671,12 +671,10 @@ Usage of ./cmd/mimir/mimir:
Max size - in bytes - of a gap for which the partitioner aggregates together two bucket GET object requests. (default 524288)
-blocks-storage.bucket-store.posting-offsets-in-mem-sampling int
Controls what is the ratio of postings offsets that the store will hold in memory. (default 32)
-blocks-storage.bucket-store.series-fetch-preference float
This parameter controls the trade-off in fetching series versus fetching postings to fulfill a series request. Increasing the series preference results in fetching more series and reducing the volume of postings fetched. Reducing the series preference results in the opposite. Increase this parameter to reduce the rate of fetched series bytes (see "Mimir / Queries" dashboard) or API calls to the object store. Must be a positive floating point number. (default 0.75)
-blocks-storage.bucket-store.series-hash-cache-max-size-bytes uint
Max size - in bytes - of the in-memory series hash cache. The cache is shared across all tenants and it's used only when query sharding is enabled. (default 1073741824)
-blocks-storage.bucket-store.series-selection-strategies.worst-case-series-preference float
[experimental] This option is only used when blocks-storage.bucket-store.series-selection-strategy=worst-case. Increasing the series preference results in fetching more series than postings. Must be a positive floating point number. (default 0.75)
-blocks-storage.bucket-store.series-selection-strategy string
[experimental] This option controls the strategy to selection of series and deferring application of matchers. A more aggressive strategy will fetch less posting lists at the cost of more series. This is useful when querying large blocks in which many series share the same label name and value. Supported values (most aggressive to least aggressive): speculative, worst-case, worst-case-small-posting-lists, all. (default "worst-case")
-blocks-storage.bucket-store.sync-dir string
Directory to store synchronized TSDB index headers. This directory is not required to be persisted between restarts, but it's highly recommended in order to improve the store-gateway startup time. (default "./tsdb-sync/")
-blocks-storage.bucket-store.sync-interval duration
5 changes: 0 additions & 5 deletions docs/sources/mimir/configure/about-versioning.md
@@ -166,7 +166,6 @@ The following features are currently experimental:
- `-query-scheduler.querier-forget-delay`
- Store-gateway
- Use of Redis cache backend (`-blocks-storage.bucket-store.chunks-cache.backend=redis`, `-blocks-storage.bucket-store.index-cache.backend=redis`, `-blocks-storage.bucket-store.metadata-cache.backend=redis`)
- `-blocks-storage.bucket-store.series-selection-strategy`
- Eagerly loading some blocks on startup even when lazy loading is enabled `-blocks-storage.bucket-store.index-header.eager-loading-startup-enabled`
- Read-write deployment mode
- API endpoints:
@@ -211,14 +210,10 @@ For details about what _deprecated_ means, see [Parameter lifecycle]({{< relref

The following features or configuration parameters are currently deprecated and will be **removed in Mimir 2.14**:

- Distributor
- the metric `cortex_distributor_sample_delay_seconds`
- Ingester
- `-ingester.return-only-grpc-errors`
- Ingester client
- `-ingester.client.report-grpc-codes-in-instrumentation-label-enabled`
- Mimirtool
- the flag `--rule-files`

The following features or configuration parameters are currently deprecated and will be **removed in a future release (to be announced)**:

25 changes: 9 additions & 16 deletions docs/sources/mimir/configure/configuration-parameters/index.md
@@ -4145,22 +4145,15 @@ bucket_store:
# CLI flag: -blocks-storage.bucket-store.batch-series-size
[streaming_series_batch_size: <int> | default = 5000]
# (experimental) This option controls the strategy to selection of series and
# deferring application of matchers. A more aggressive strategy will fetch
# less posting lists at the cost of more series. This is useful when querying
# large blocks in which many series share the same label name and value.
# Supported values (most aggressive to least aggressive): speculative,
# worst-case, worst-case-small-posting-lists, all.
# CLI flag: -blocks-storage.bucket-store.series-selection-strategy
[series_selection_strategy: <string> | default = "worst-case"]
series_selection_strategies:
# (experimental) This option is only used when
# blocks-storage.bucket-store.series-selection-strategy=worst-case.
# Increasing the series preference results in fetching more series than
# postings. Must be a positive floating point number.
# CLI flag: -blocks-storage.bucket-store.series-selection-strategies.worst-case-series-preference
[worst_case_series_preference: <float> | default = 0.75]
# (advanced) This parameter controls the trade-off in fetching series versus
# fetching postings to fulfill a series request. Increasing the series
# preference results in fetching more series and reducing the volume of
# postings fetched. Reducing the series preference results in the opposite.
# Increase this parameter to reduce the rate of fetched series bytes (see
# "Mimir / Queries" dashboard) or API calls to the object store. Must be a
# positive floating point number.
# CLI flag: -blocks-storage.bucket-store.series-fetch-preference
[series_fetch_preference: <float> | default = 0.75]
tsdb:
# Directory to store TSDBs (including WAL) in the ingesters. This directory is
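The constraint on `-blocks-storage.bucket-store.series-fetch-preference` (default 0.75, must be a positive floating point number) can be illustrated with a minimal Go sketch built on the standard `flag` package. This is not Mimir's actual implementation: `bucketStoreConfig`, `registerFlags`, `validate`, and `parseConfig` are hypothetical names introduced only for illustration, and the help string is shortened.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// bucketStoreConfig is a hypothetical stand-in for the store-gateway
// bucket-store configuration; only the renamed option is modeled.
type bucketStoreConfig struct {
	SeriesFetchPreference float64
}

var errInvalidPreference = errors.New("series fetch preference must be a positive floating point number")

// registerFlags registers the flag under its new, stable name with the
// default value documented in the diff above.
func (c *bucketStoreConfig) registerFlags(fs *flag.FlagSet) {
	fs.Float64Var(&c.SeriesFetchPreference,
		"blocks-storage.bucket-store.series-fetch-preference", 0.75,
		"Trade-off between fetching series and fetching postings.")
}

// validate rejects zero, negative, and NaN values.
func (c *bucketStoreConfig) validate() error {
	// Written as !(x > 0) so that NaN is rejected as well.
	if !(c.SeriesFetchPreference > 0) {
		return errInvalidPreference
	}
	return nil
}

// parseConfig parses args into a fresh config and validates the result.
func parseConfig(args []string) (bucketStoreConfig, error) {
	var cfg bucketStoreConfig
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage output quiet in this sketch
	cfg.registerFlags(fs)
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	return cfg, cfg.validate()
}

func main() {
	cfg, err := parseConfig([]string{"-blocks-storage.bucket-store.series-fetch-preference=1.5"})
	fmt.Println(cfg.SeriesFetchPreference, err)
}
```

In this sketch, passing one of the removed flags (for example `-blocks-storage.bucket-store.series-selection-strategy`) makes `parseConfig` return an error, because the flag is no longer registered.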
38 changes: 38 additions & 0 deletions docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -639,6 +639,26 @@ How to **investigate**:
- If you encounter any Compactor resource issues, add CPU/Memory as needed temporarily, then scale back later.
- You can also optionally scale replicas and shards further to split the work up into even smaller pieces until the situation has recovered.
### MimirCompactorHasRunOutOfDiskSpace
This alert fires when the compactor has run out of disk space at least once.
When this happens, the compaction fails and, after some time, the compactor retries the failed compaction.
It's very likely that on each retry of the job, the compactor will hit the same disk space limit again and won't be able to recover on its own.
Alternatively, if compactor concurrency is higher than 1, it could have been just an unlucky combination of jobs that caused the compactor to run out of disk space.
How to **investigate**:
- Look at the disk space usage in the compactor's data volumes.
- Look for an error with the string `no space left on device` to confirm that the compactor ran out of disk space.
How to **fix** it:
- The only long-term solution is to give the compactor more disk space, as it requires more space to fit the largest single job into its disk.
- If the number of blocks that the compactor is failing to compact is not very significant and you want to skip compacting them and focus on more recent blocks instead, consider marking the affected blocks for no compaction:
```
./tools/markblocks/markblocks -backend gcs -gcs.bucket-name <bucket> -mark no-compact -tenant <tenant-id> -details "focus on newer blocks"
```
### MimirCompactorSkippedUnhealthyBlocks
This alert fires when the compactor tries to compact a block but finds that the block is unhealthy. This indicates a bug in the Prometheus TSDB library and should be investigated.
@@ -1465,6 +1485,24 @@ How to **investigate**:
- Check if ingesters are processing too many records, and they need to be scaled up (vertically or horizontally).
- Check actual error in logs to see whether the `-ingest-storage.kafka.wait-strong-read-consistency-timeout` or the request timeout has been hit first.
### MimirKafkaClientBufferedProduceBytesTooHigh
This alert fires when the Kafka client buffer, used to write incoming write requests to Kafka, is getting full.
How it **works**:
- Distributor and ruler encapsulate write requests into Kafka records and send them to Kafka.
- The Kafka client has a limit on the total byte size of buffered records that are either not yet sent to Kafka or sent to Kafka but not yet acknowledged.
- When the limit is reached, the Kafka client stops producing more records and fast fails.
- The limit is configured via `-ingest-storage.kafka.producer-max-buffered-bytes`.
- The default limit is configured intentionally high, so that when the buffer utilization gets close to the limit, this indicates that there's probably an issue.
How to **investigate**:
- Query the `cortex_ingest_storage_writer_buffered_produce_bytes{quantile="1.0"}` metric to see the actual buffer utilization peaks.
- If the high buffer utilization is isolated to a small set of pods, then there might be an issue in the client pods.
- If the high buffer utilization is spread across all or most pods, then there might be an issue in Kafka.
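The triage rule in the last two bullets can be sketched in Go. Everything specific here is an assumption made for illustration: the `diagnose` function, the 50% utilization threshold, the reading of "all or most pods" as "more than half", the pod names, and the 1 GiB limit are not taken from Mimir.

```go
package main

import (
	"fmt"
	"sort"
)

type diagnosis string

const (
	noIssue    diagnosis = "no pod above threshold"
	clientSide diagnosis = "isolated: inspect the affected client pods"
	kafkaSide  diagnosis = "widespread: inspect Kafka"
)

// diagnose takes the peak buffered bytes per pod, the configured buffer limit
// in bytes, and a utilization threshold in (0, 1]. It returns a diagnosis and
// the sorted names of the pods at or above the threshold.
func diagnose(peakBufferedBytes map[string]float64, limitBytes, threshold float64) (diagnosis, []string) {
	var affected []string
	for pod, peak := range peakBufferedBytes {
		if peak/limitBytes >= threshold {
			affected = append(affected, pod)
		}
	}
	sort.Strings(affected)

	switch {
	case len(affected) == 0:
		return noIssue, affected
	case 2*len(affected) > len(peakBufferedBytes):
		// Assumption: "all or most pods" means more than half of them.
		return kafkaSide, affected
	default:
		return clientSide, affected
	}
}

func main() {
	const limit = 1 << 30 // hypothetical 1 GiB buffer limit
	peaks := map[string]float64{
		"distributor-0": 0.9 * limit,
		"distributor-1": 0.1 * limit,
		"distributor-2": 0.1 * limit,
		"distributor-3": 0.2 * limit,
	}
	d, pods := diagnose(peaks, limit, 0.5)
	fmt.Println(d, pods)
}
```

The per-pod peaks would come from the `cortex_ingest_storage_writer_buffered_produce_bytes{quantile="1.0"}` query, and the limit from the value configured via `-ingest-storage.kafka.producer-max-buffered-bytes`.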
### Ingester is overloaded when consuming from Kafka
This runbook covers the case where an ingester is overloaded when ingesting metrics data (consuming) from Kafka.
7 changes: 3 additions & 4 deletions go.mod
@@ -20,11 +20,11 @@ require (
github.com/golang/snappy v0.0.4
github.com/google/gopacket v1.1.19
github.com/gorilla/mux v1.8.1
github.com/grafana/dskit v0.0.0-20240718080635-f5bd38371e1c
github.com/grafana/dskit v0.0.0-20240719153732-6e8a03e781de
github.com/grafana/e2e v0.1.2-0.20240118170847-db90b84177fc
github.com/hashicorp/golang-lru v1.0.2 // indirect
github.com/json-iterator/go v1.1.12
github.com/minio/minio-go/v7 v7.0.72
github.com/minio/minio-go/v7 v7.0.74
github.com/mitchellh/go-wordwrap v1.0.1
github.com/oklog/ulid v1.3.1
github.com/opentracing-contrib/go-grpc v0.0.0-20210225150812-73cb765af46e
@@ -102,6 +102,7 @@ require (
github.com/Masterminds/sprig/v3 v3.2.1 // indirect
github.com/bboreham/go-loser v0.0.0-20230920113527-fcc2c21820a3 // indirect
github.com/cenkalti/backoff/v3 v3.2.2 // indirect
github.com/go-ini/ini v1.67.0 // indirect
github.com/go-ole/go-ole v1.2.6 // indirect
github.com/go-test/deep v1.1.0 // indirect
github.com/goccy/go-json v0.10.3 // indirect
@@ -160,7 +161,6 @@ require (
github.com/beorn7/perks v1.0.1 // indirect
github.com/bits-and-blooms/bitset v1.13.0 // indirect
github.com/cenkalti/backoff/v4 v4.3.0 // indirect
github.com/cespare/xxhash v1.1.0 // indirect
github.com/cespare/xxhash/v2 v2.3.0
github.com/coreos/go-semver v0.3.0 // indirect
github.com/coreos/go-systemd/v22 v22.5.0 // indirect
@@ -270,7 +270,6 @@ require (
google.golang.org/genproto v0.0.0-20240528184218-531527333157 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20240528184218-531527333157 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20240711142825-46eb208f015d
gopkg.in/ini.v1 v1.67.0 // indirect
k8s.io/kube-openapi v0.0.0-20240228011516-70dd3763d340 // indirect
k8s.io/utils v0.0.0-20230726121419-3b25d923346b // indirect
sigs.k8s.io/yaml v1.4.0 // indirect