Max bytes read limit (#8670)

**What this PR does / why we need it**:

This PR implements two new per-tenant limits that are enforced on log
and metric queries (both range and instant) when TSDB is used:

- `max_query_bytes_read`: Refuse queries that would read more than the
configured number of bytes. This is an overall limit, applied regardless of
splitting/sharding. The goal is to refuse queries that would take too long.
The default value of 0 disables this limit.

- `max_querier_bytes_read`: Refuse queries in which any subquery, after
splitting and sharding, would read more than the configured number of bytes.
The goal is to prevent a querier from running a query that would load so much
data into memory that it could get OOM-killed. The default value of 0
disables this limit.

These new limits can be configured per tenant and per query (see
#8727).
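
For illustration, per-tenant values could be set in a runtime overrides file
like the sketch below; the tenant name and values are made up:

```yaml
# Hypothetical runtime overrides; tenant name and values are illustrative.
overrides:
  tenant-a:
    max_query_bytes_read: 1TB       # overall cap, regardless of splitting/sharding
    max_querier_bytes_read: 150GB   # cap per subquery after splitting/sharding
```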

The number of bytes a query would read is estimated through TSDB's index
stats. Even though these estimates are not exact, they are good enough for a
rough determination of whether a query is too big to run. For more details
on this, refer to this discussion in the PR:
#8670 (comment).
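
The frontend only needs the byte counts from those stats. Here is a
self-contained sketch of the estimation idea (the `indexStats` type and names
are hypothetical stand-ins, not Loki's internals):

```go
// Sketch: combine the bytes reported by TSDB index stats for each
// sub-request into one approximate figure for the whole query.
package main

import "fmt"

// indexStats mirrors the shape of an index stats result.
type indexStats struct {
	Streams, Chunks, Entries, Bytes uint64
}

// estimateBytes sums the stats gathered for each sub-request; the result
// is approximate, but good enough to decide whether a query is too big.
func estimateBytes(parts []indexStats) uint64 {
	var total uint64
	for _, p := range parts {
		total += p.Bytes
	}
	return total
}

func main() {
	// For example, stats for two 24h slices of the query's time range.
	parts := []indexStats{{Bytes: 40 << 30}, {Bytes: 25 << 30}}
	fmt.Printf("estimated read: %d GiB\n", estimateBytes(parts)>>30)
}
```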

Both limits are implemented in the query frontend. Even though we considered
implementing `max_querier_bytes_read` in the querier, enforcing it in the
frontend keeps the limits for pre- and post-splitting/sharding queries close
to each other in the same component. Moreover, this way we can reduce the
number of index stats requests issued to the index gateways by reusing the
stats gathered while sharding the query.
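
The essence of the two checks, as a minimal sketch (`checkLimits` and its
signature are hypothetical, not the actual frontend code):

```go
// Sketch: enforce both limits given the overall estimate and the largest
// per-shard estimate. A limit of 0 disables the corresponding check.
package main

import "fmt"

func checkLimits(estimatedBytes, bytesPerShard, maxQueryBytesRead, maxQuerierBytesRead uint64) error {
	if maxQueryBytesRead > 0 && estimatedBytes > maxQueryBytesRead {
		return fmt.Errorf("query would read %d bytes, exceeding max_query_bytes_read (%d)",
			estimatedBytes, maxQueryBytesRead)
	}
	if maxQuerierBytesRead > 0 && bytesPerShard > maxQuerierBytesRead {
		return fmt.Errorf("a subquery would read %d bytes, exceeding max_querier_bytes_read (%d)",
			bytesPerShard, maxQuerierBytesRead)
	}
	return nil
}

func main() {
	// 2 TiB estimated overall against a 1 TiB limit -> the query is refused.
	fmt.Println(checkLimits(2<<40, 200<<30, 1<<40, 150<<30))
}
```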

With regard to how index stats requests are issued:
- We parallelize index stats requests by splitting them into sub-requests
that span up to 24h, since our indices are sharded into 24h periods. On top
of that, this prevents a single index gateway from processing one huge
request such as `{app=~".+"}` over 30d (see the splitting sketch after this
list).
- If sharding is enabled and the query is shardable, for
`max_querier_bytes_read` we reuse the stats requests issued by the sharding
middleware. Specifically, we look at the [bytesPerShard][1] to enforce this
limit.
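
A self-contained sketch of the 24h splitting, aligned to daily index period
boundaries (an illustration, not the actual Loki implementation):

```go
// Sketch: split [start, end) into sub-ranges of at most 24h, aligned to
// 24h period boundaries, so index stats requests can be run in parallel.
package main

import (
	"fmt"
	"time"
)

type timeRange struct{ start, end time.Time }

func splitBy24h(start, end time.Time) []timeRange {
	const period = 24 * time.Hour
	var out []timeRange
	for cur := start; cur.Before(end); {
		// Advance to the next 24h boundary, or to the end if that is sooner.
		next := cur.Truncate(period).Add(period)
		if next.After(end) {
			next = end
		}
		out = append(out, timeRange{cur, next})
		cur = next
	}
	return out
}

func main() {
	start := time.Date(2023, 3, 20, 15, 0, 0, 0, time.UTC)
	for _, r := range splitBy24h(start, start.Add(60*time.Hour)) {
		fmt.Println(r.start.Format(time.RFC3339), "->", r.end.Format(time.RFC3339))
	}
}
```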

Note that once we merge this PR and enable these limits, the load of index
stats requests will increase substantially, and we may discover bottlenecks
in our index gateways and TSDB. After speaking with @owen-d, we think this
should be fine: if needed, we can scale up our index gateways and add caching
for index stats requests.

Here's a demo of this working:
<img width="1647" alt="image"
src="https://user-images.githubusercontent.com/8354290/226918478-d4b6c2fd-de4d-478a-9c8b-e38fe148fa95.png">

<img width="1647" alt="image"
src="https://user-images.githubusercontent.com/8354290/226918798-a71b1db8-ea68-4d00-933b-e5eb1524d240.png">


**Which issue(s) this PR fixes**:
This PR addresses grafana/loki-private#674.

**Special notes for your reviewer**:

- @jeschkies has reviewed the changes related to query-time limits.
- I've done some refactoring in this PR:
  - Extracted the logic to get stats for a set of matchers into a new
    function [getStatsForMatchers][2].
  - Extracted the _Handler_ interface implementation for
    [queryrangebase.roundTripper][3] into a new type
    [queryrangebase.roundTripperHandler][4]. This is used to create the
    handler that skips the rest of the configured middlewares when sending
    an index stats request ([example][5]); a generic sketch of the adapter
    idea follows below.
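
A self-contained sketch of that adapter pattern (these names are hypothetical
stand-ins, not the actual `queryrangebase` types):

```go
// Sketch: expose a terminal round-trip function as a Handler so callers can
// hit the transport directly, bypassing the remaining middlewares.
package main

import "fmt"

type request string
type response string

// handler is the middleware-facing interface.
type handler interface {
	Do(req request) (response, error)
}

// roundTripperHandler adapts a send function into a handler.
type roundTripperHandler struct {
	send func(request) (response, error)
}

func (h roundTripperHandler) Do(req request) (response, error) { return h.send(req) }

func main() {
	var h handler = roundTripperHandler{send: func(r request) (response, error) {
		return response("stats for " + string(r)), nil
	}}
	resp, _ := h.Do(`{app="foo"}`)
	fmt.Println(resp)
}
```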

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`


[1]:
https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L179-L186

[2]:
https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L72

[3]:
https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L124

[4]:
https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L163

[5]:
https://github.com/grafana/loki/blob/f422e0a52b743a11209b8276510feb2ab8241486/pkg/querier/queryrange/roundtrip.go#L521
salvacorts authored Mar 23, 2023
1 parent 94725e7 commit d24fe3e
Showing 23 changed files with 1,183 additions and 281 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -47,6 +47,7 @@
* [6675](https://github.com/grafana/loki/pull/6675) **btaani**: Add logfmt expression parser for selective extraction of labels from logfmt formatted logs
* [8474](https://github.com/grafana/loki/pull/8474) **farodin91**: Add support for short-lived S3 session tokens
* [8774](https://github.com/grafana/loki/pull/8774) **slim-bean**: Add new logql template functions `bytes`, `duration`, `unixEpochMillis`, `unixEpochNanos`, `toDateInZone`, `b64Enc`, and `b64Dec`
* [8670](https://github.com/grafana/loki/pull/8670) **salvacorts**: Introduce two new limits to refuse log and metric queries that would read too much data.

##### Fixes

11 changes: 11 additions & 0 deletions docs/sources/configuration/_index.md
@@ -2331,6 +2331,17 @@ The `limits_config` block configures global and per-tenant limits in Loki.
# CLI flag: -frontend.min-sharding-lookback
[min_sharding_lookback: <duration> | default = 0s]

# Max number of bytes a query can fetch. Enforced in log and metric queries only
# when TSDB is used. The default value of 0 disables this limit.
# CLI flag: -frontend.max-query-bytes-read
[max_query_bytes_read: <int> | default = 0B]

# Max number of bytes a query can fetch after splitting and sharding. Enforced
# in log and metric queries only when TSDB is used. The default value of 0
# disables this limit.
# CLI flag: -frontend.max-querier-bytes-read
[max_querier_bytes_read: <int> | default = 0B]

# Duration to delay the evaluation of rules to ensure the underlying metrics
# have been pushed to Cortex.
# CLI flag: -ruler.evaluation-delay-duration
4 changes: 2 additions & 2 deletions pkg/logql/downstream_test.go
@@ -81,7 +81,7 @@ func TestMappingEquivalence(t *testing.T) {
ctx := user.InjectOrgID(context.Background(), "fake")

mapper := NewShardMapper(ConstantShards(shards), nilShardMetrics)
_, mapped, err := mapper.Parse(tc.query)
_, _, mapped, err := mapper.Parse(tc.query)
require.Nil(t, err)

shardedQry := sharded.Query(ctx, params, mapped)
@@ -146,7 +146,7 @@ func TestShardCounter(t *testing.T) {
ctx := user.InjectOrgID(context.Background(), "fake")

mapper := NewShardMapper(ConstantShards(shards), nilShardMetrics)
noop, mapped, err := mapper.Parse(tc.query)
noop, _, mapped, err := mapper.Parse(tc.query)
require.Nil(t, err)

shardedQry := sharded.Query(ctx, params, mapped)
133 changes: 80 additions & 53 deletions pkg/logql/shardmapper.go
@@ -4,21 +4,25 @@ import (
"fmt"

"github.com/go-kit/log/level"
"github.com/grafana/loki/pkg/util/math"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"

"github.com/grafana/loki/pkg/logql/syntax"
"github.com/grafana/loki/pkg/querier/astmapper"
"github.com/grafana/loki/pkg/storage/stores/index/stats"
util_log "github.com/grafana/loki/pkg/util/log"
)

type ShardResolver interface {
Shards(expr syntax.Expr) (int, error)
Shards(expr syntax.Expr) (int, uint64, error)
GetStats(e syntax.Expr) (stats.Stats, error)
}

type ConstantShards int

func (s ConstantShards) Shards(_ syntax.Expr) (int, error) { return int(s), nil }
func (s ConstantShards) Shards(_ syntax.Expr) (int, uint64, error) { return int(s), 0, nil }
func (s ConstantShards) GetStats(_ syntax.Expr) (stats.Stats, error) { return stats.Stats{}, nil }

type ShardMapper struct {
shards ShardResolver
@@ -36,18 +40,18 @@ func NewShardMapperMetrics(registerer prometheus.Registerer) *MapperMetrics {
return newMapperMetrics(registerer, "shard")
}

func (m ShardMapper) Parse(query string) (noop bool, expr syntax.Expr, err error) {
func (m ShardMapper) Parse(query string) (noop bool, bytesPerShard uint64, expr syntax.Expr, err error) {
parsed, err := syntax.ParseExpr(query)
if err != nil {
return false, nil, err
return false, 0, nil, err
}

recorder := m.metrics.downstreamRecorder()

mapped, err := m.Map(parsed, recorder)
mapped, bytesPerShard, err := m.Map(parsed, recorder)
if err != nil {
m.metrics.ParsedQueries.WithLabelValues(FailureKey).Inc()
return false, nil, err
return false, 0, nil, err
}

originalStr := parsed.String()
Expand All @@ -61,21 +65,21 @@ func (m ShardMapper) Parse(query string) (noop bool, expr syntax.Expr, err error

recorder.Finish() // only record metrics for successful mappings

return noop, mapped, err
return noop, bytesPerShard, mapped, err
}

func (m ShardMapper) Map(expr syntax.Expr, r *downstreamRecorder) (syntax.Expr, error) {
func (m ShardMapper) Map(expr syntax.Expr, r *downstreamRecorder) (syntax.Expr, uint64, error) {
// immediately clone the passed expr to avoid mutating the original
expr, err := syntax.Clone(expr)
if err != nil {
return nil, err
return nil, 0, err
}

switch e := expr.(type) {
case *syntax.LiteralExpr:
return e, nil
return e, 0, nil
case *syntax.VectorExpr:
return e, nil
return e, 0, nil
case *syntax.MatchersExpr, *syntax.PipelineExpr:
return m.mapLogSelectorExpr(e.(syntax.LogSelectorExpr), r)
case *syntax.VectorAggregationExpr:
@@ -85,43 +89,47 @@ func (m ShardMapper) Map(expr syntax.Expr, r *downstreamRecorder) (syntax.Expr,
case *syntax.RangeAggregationExpr:
return m.mapRangeAggregationExpr(e, r)
case *syntax.BinOpExpr:
lhsMapped, err := m.Map(e.SampleExpr, r)
lhsMapped, lhsBytesPerShard, err := m.Map(e.SampleExpr, r)
if err != nil {
return nil, err
return nil, 0, err
}
rhsMapped, err := m.Map(e.RHS, r)
rhsMapped, rhsBytesPerShard, err := m.Map(e.RHS, r)
if err != nil {
return nil, err
return nil, 0, err
}
lhsSampleExpr, ok := lhsMapped.(syntax.SampleExpr)
if !ok {
return nil, badASTMapping(lhsMapped)
return nil, 0, badASTMapping(lhsMapped)
}
rhsSampleExpr, ok := rhsMapped.(syntax.SampleExpr)
if !ok {
return nil, badASTMapping(rhsMapped)
return nil, 0, badASTMapping(rhsMapped)
}
e.SampleExpr = lhsSampleExpr
e.RHS = rhsSampleExpr
return e, nil

// We take the maximum bytes per shard of both sides of the operation
bytesPerShard := uint64(math.Max(int(lhsBytesPerShard), int(rhsBytesPerShard)))

return e, bytesPerShard, nil
default:
return nil, errors.Errorf("unexpected expr type (%T) for ASTMapper type (%T) ", expr, m)
return nil, 0, errors.Errorf("unexpected expr type (%T) for ASTMapper type (%T) ", expr, m)
}
}

func (m ShardMapper) mapLogSelectorExpr(expr syntax.LogSelectorExpr, r *downstreamRecorder) (syntax.LogSelectorExpr, error) {
func (m ShardMapper) mapLogSelectorExpr(expr syntax.LogSelectorExpr, r *downstreamRecorder) (syntax.LogSelectorExpr, uint64, error) {
var head *ConcatLogSelectorExpr
shards, err := m.shards.Shards(expr)
shards, bytesPerShard, err := m.shards.Shards(expr)
if err != nil {
return nil, err
return nil, 0, err
}
if shards == 0 {
return &ConcatLogSelectorExpr{
DownstreamLogSelectorExpr: DownstreamLogSelectorExpr{
shard: nil,
LogSelectorExpr: expr,
},
}, nil
}, bytesPerShard, nil
}
for i := shards - 1; i >= 0; i-- {
head = &ConcatLogSelectorExpr{
@@ -137,22 +145,22 @@ func (m ShardMapper) mapLogSelectorExpr(expr syntax.LogSelectorExpr, r *downstre
}
r.Add(shards, StreamsKey)

return head, nil
return head, bytesPerShard, nil
}

func (m ShardMapper) mapSampleExpr(expr syntax.SampleExpr, r *downstreamRecorder) (syntax.SampleExpr, error) {
func (m ShardMapper) mapSampleExpr(expr syntax.SampleExpr, r *downstreamRecorder) (syntax.SampleExpr, uint64, error) {
var head *ConcatSampleExpr
shards, err := m.shards.Shards(expr)
shards, bytesPerShard, err := m.shards.Shards(expr)
if err != nil {
return nil, err
return nil, 0, err
}
if shards == 0 {
return &ConcatSampleExpr{
DownstreamSampleExpr: DownstreamSampleExpr{
shard: nil,
SampleExpr: expr,
},
}, nil
}, bytesPerShard, nil
}
for i := shards - 1; i >= 0; i-- {
head = &ConcatSampleExpr{
@@ -168,120 +176,139 @@ func (m ShardMapper) mapSampleExpr(expr syntax.SampleExpr, r *downstreamRecorder
}
r.Add(shards, MetricsKey)

return head, nil
return head, bytesPerShard, nil
}

// technically, std{dev,var} are also parallelizable if there is no cross-shard merging
// in descendent nodes in the AST. This optimization is currently avoided for simplicity.
func (m ShardMapper) mapVectorAggregationExpr(expr *syntax.VectorAggregationExpr, r *downstreamRecorder) (syntax.SampleExpr, error) {
func (m ShardMapper) mapVectorAggregationExpr(expr *syntax.VectorAggregationExpr, r *downstreamRecorder) (syntax.SampleExpr, uint64, error) {
// if this AST contains unshardable operations, don't shard this at this level,
// but attempt to shard a child node.
if !expr.Shardable() {
subMapped, err := m.Map(expr.Left, r)
subMapped, bytesPerShard, err := m.Map(expr.Left, r)
if err != nil {
return nil, err
return nil, 0, err
}
sampleExpr, ok := subMapped.(syntax.SampleExpr)
if !ok {
return nil, badASTMapping(subMapped)
return nil, 0, badASTMapping(subMapped)
}

return &syntax.VectorAggregationExpr{
Left: sampleExpr,
Grouping: expr.Grouping,
Params: expr.Params,
Operation: expr.Operation,
}, nil
}, bytesPerShard, nil

}

switch expr.Operation {
case syntax.OpTypeSum:
// sum(x) -> sum(sum(x, shard=1) ++ sum(x, shard=2)...)
sharded, err := m.mapSampleExpr(expr, r)
sharded, bytesPerShard, err := m.mapSampleExpr(expr, r)
if err != nil {
return nil, err
return nil, 0, err
}
return &syntax.VectorAggregationExpr{
Left: sharded,
Grouping: expr.Grouping,
Params: expr.Params,
Operation: expr.Operation,
}, nil
}, bytesPerShard, nil

case syntax.OpTypeAvg:
// avg(x) -> sum(x)/count(x)
lhs, err := m.mapVectorAggregationExpr(&syntax.VectorAggregationExpr{
lhs, lhsBytesPerShard, err := m.mapVectorAggregationExpr(&syntax.VectorAggregationExpr{
Left: expr.Left,
Grouping: expr.Grouping,
Operation: syntax.OpTypeSum,
}, r)
if err != nil {
return nil, err
return nil, 0, err
}
rhs, err := m.mapVectorAggregationExpr(&syntax.VectorAggregationExpr{
rhs, rhsBytesPerShard, err := m.mapVectorAggregationExpr(&syntax.VectorAggregationExpr{
Left: expr.Left,
Grouping: expr.Grouping,
Operation: syntax.OpTypeCount,
}, r)
if err != nil {
return nil, err
return nil, 0, err
}

// We take the maximum bytes per shard of both sides of the operation
bytesPerShard := uint64(math.Max(int(lhsBytesPerShard), int(rhsBytesPerShard)))

return &syntax.BinOpExpr{
SampleExpr: lhs,
RHS: rhs,
Op: syntax.OpTypeDiv,
}, nil
}, bytesPerShard, nil

case syntax.OpTypeCount:
// count(x) -> sum(count(x, shard=1) ++ count(x, shard=2)...)
sharded, err := m.mapSampleExpr(expr, r)
sharded, bytesPerShard, err := m.mapSampleExpr(expr, r)
if err != nil {
return nil, err
return nil, 0, err
}
return &syntax.VectorAggregationExpr{
Left: sharded,
Grouping: expr.Grouping,
Operation: syntax.OpTypeSum,
}, nil
}, bytesPerShard, nil
default:
// this should not be reachable. If an operation is shardable it should
// have an optimization listed.
level.Warn(util_log.Logger).Log(
"msg", "unexpected operation which appears shardable, ignoring",
"operation", expr.Operation,
)
return expr, nil
exprStats, err := m.shards.GetStats(expr)
if err != nil {
return nil, 0, err
}
return expr, exprStats.Bytes, nil
}
}

func (m ShardMapper) mapLabelReplaceExpr(expr *syntax.LabelReplaceExpr, r *downstreamRecorder) (syntax.SampleExpr, error) {
subMapped, err := m.Map(expr.Left, r)
func (m ShardMapper) mapLabelReplaceExpr(expr *syntax.LabelReplaceExpr, r *downstreamRecorder) (syntax.SampleExpr, uint64, error) {
subMapped, bytesPerShard, err := m.Map(expr.Left, r)
if err != nil {
return nil, err
return nil, 0, err
}
cpy := *expr
cpy.Left = subMapped.(syntax.SampleExpr)
return &cpy, nil
return &cpy, bytesPerShard, nil
}

func (m ShardMapper) mapRangeAggregationExpr(expr *syntax.RangeAggregationExpr, r *downstreamRecorder) (syntax.SampleExpr, error) {
func (m ShardMapper) mapRangeAggregationExpr(expr *syntax.RangeAggregationExpr, r *downstreamRecorder) (syntax.SampleExpr, uint64, error) {
if hasLabelModifier(expr) {
// if an expr can modify labels this means multiple shards can return the same labelset.
// When this happens the merge strategy needs to be different from a simple concatenation.
// For instance for rates we need to sum data from different shards but same series.
// Since we currently support only concatenation as merge strategy, we skip those queries.
return expr, nil
exprStats, err := m.shards.GetStats(expr)
if err != nil {
return nil, 0, err
}

return expr, exprStats.Bytes, nil
}

switch expr.Operation {
case syntax.OpRangeTypeCount, syntax.OpRangeTypeRate, syntax.OpRangeTypeBytesRate, syntax.OpRangeTypeBytes:
// count_over_time(x) -> count_over_time(x, shard=1) ++ count_over_time(x, shard=2)...
// rate(x) -> rate(x, shard=1) ++ rate(x, shard=2)...
// same goes for bytes_rate and bytes_over_time
return m.mapSampleExpr(expr, r)
default:
return expr, nil
// This part of the query is not shardable, so the bytesPerShard is the bytes for all the log matchers in expr
exprStats, err := m.shards.GetStats(expr)
if err != nil {
return nil, 0, err
}

return expr, exprStats.Bytes, nil
}
}

