Max bytes read limit (#8670)

**What this PR does / why we need it**:

This PR implements two new per-tenant limits that are enforced on log
and metric queries (both range and instant) when TSDB is used:

- `max_query_bytes_read`: Refuse queries that would read more than the
configured number of bytes. This is an overall limit, applied regardless of
splitting/sharding. The goal is to refuse queries that would take too long.
The default value of 0 disables this limit.

- `max_querier_bytes_read`: Refuse queries in which any subquery, after
splitting and sharding, would read more than the configured number of bytes.
The goal is to prevent a querier from running a query that would load so much
data into memory that it could get OOM-killed. The default value of 0
disables this limit.

These new limits can be configured per tenant and per query (see
#8727).
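
For illustration, per-tenant values could be set in a runtime overrides file
like the sketch below; the tenant name and values are made up:

```yaml
# Hypothetical runtime overrides; tenant name and values are illustrative.
overrides:
  tenant-a:
    max_query_bytes_read: 1TB       # overall cap, regardless of splitting/sharding
    max_querier_bytes_read: 150GB   # cap per subquery after splitting/sharding
```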

The number of bytes a query would read is estimated through TSDB's index
stats. Even though these estimates are not exact, they are good enough for a
rough determination of whether a query is too big to run. For more details
on this, refer to this discussion in the PR:
#8670 (comment).
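
The frontend only needs the byte counts from those stats. Here is a
self-contained sketch of the estimation idea (the `indexStats` type and names
are hypothetical stand-ins, not Loki's internals):

```go
// Sketch: combine the bytes reported by TSDB index stats for each
// sub-request into one approximate figure for the whole query.
package main

import "fmt"

// indexStats mirrors the shape of an index stats result.
type indexStats struct {
	Streams, Chunks, Entries, Bytes uint64
}

// estimateBytes sums the stats gathered for each sub-request; the result
// is approximate, but good enough to decide whether a query is too big.
func estimateBytes(parts []indexStats) uint64 {
	var total uint64
	for _, p := range parts {
		total += p.Bytes
	}
	return total
}

func main() {
	// For example, stats for two 24h slices of the query's time range.
	parts := []indexStats{{Bytes: 40 << 30}, {Bytes: 25 << 30}}
	fmt.Printf("estimated read: %d GiB\n", estimateBytes(parts)>>30)
}
```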

Both limits are implemented in the query frontend. Even though we considered
implementing `max_querier_bytes_read` in the querier, enforcing it in the
frontend keeps the limits for pre- and post-splitting/sharding queries close
to each other in the same component. Moreover, this way we can reduce the
number of index stats requests issued to the index gateways by reusing the
stats gathered while sharding the query.
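
The essence of the two checks, as a minimal sketch (`checkLimits` and its
signature are hypothetical, not the actual frontend code):

```go
// Sketch: enforce both limits given the overall estimate and the largest
// per-shard estimate. A limit of 0 disables the corresponding check.
package main

import "fmt"

func checkLimits(estimatedBytes, bytesPerShard, maxQueryBytesRead, maxQuerierBytesRead uint64) error {
	if maxQueryBytesRead > 0 && estimatedBytes > maxQueryBytesRead {
		return fmt.Errorf("query would read %d bytes, exceeding max_query_bytes_read (%d)",
			estimatedBytes, maxQueryBytesRead)
	}
	if maxQuerierBytesRead > 0 && bytesPerShard > maxQuerierBytesRead {
		return fmt.Errorf("a subquery would read %d bytes, exceeding max_querier_bytes_read (%d)",
			bytesPerShard, maxQuerierBytesRead)
	}
	return nil
}

func main() {
	// 2 TiB estimated overall against a 1 TiB limit -> the query is refused.
	fmt.Println(checkLimits(2<<40, 200<<30, 1<<40, 150<<30))
}
```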

With regard to how index stats requests are issued:
- We parallelize index stats requests by splitting them into sub-requests
that span up to 24h, since our indices are sharded into 24h periods. On top
of that, this prevents a single index gateway from processing one huge
request such as `{app=~".+"}` over 30d (see the splitting sketch after this
list).
- If sharding is enabled and the query is shardable, for
`max_querier_bytes_read` we reuse the stats requests issued by the sharding
middleware. Specifically, we look at the [bytesPerShard][1] to enforce this
limit.
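
A self-contained sketch of the 24h splitting, aligned to daily index period
boundaries (an illustration, not the actual Loki implementation):

```go
// Sketch: split [start, end) into sub-ranges of at most 24h, aligned to
// 24h period boundaries, so index stats requests can be run in parallel.
package main

import (
	"fmt"
	"time"
)

type timeRange struct{ start, end time.Time }

func splitBy24h(start, end time.Time) []timeRange {
	const period = 24 * time.Hour
	var out []timeRange
	for cur := start; cur.Before(end); {
		// Advance to the next 24h boundary, or to the end if that is sooner.
		next := cur.Truncate(period).Add(period)
		if next.After(end) {
			next = end
		}
		out = append(out, timeRange{cur, next})
		cur = next
	}
	return out
}

func main() {
	start := time.Date(2023, 3, 20, 15, 0, 0, 0, time.UTC)
	for _, r := range splitBy24h(start, start.Add(60*time.Hour)) {
		fmt.Println(r.start.Format(time.RFC3339), "->", r.end.Format(time.RFC3339))
	}
}
```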

Note that once we merge this PR and enable these limits, the load of index
stats requests will increase substantially, and we may discover bottlenecks
in our index gateways and TSDB. After speaking with @owen-d, we think this
should be fine: if needed, we can scale up our index gateways and add caching
for index stats requests.

Here's a demo of this working:
<img width="1647" alt="image"
src="https://user-images.githubusercontent.com/8354290/226918478-d4b6c2fd-de4d-478a-9c8b-e38fe148fa95.png">

<img width="1647" alt="image"
src="https://user-images.githubusercontent.com/8354290/226918798-a71b1db8-ea68-4d00-933b-e5eb1524d240.png">


**Which issue(s) this PR fixes**:
This PR addresses grafana/loki-private#674.

**Special notes for your reviewer**:

- @jeschkies has reviewed the changes related to query-time limits.
- I've done some refactoring in this PR:
  - Extracted the logic to get stats for a set of matchers into a new
    function [getStatsForMatchers][2].
  - Extracted the _Handler_ interface implementation for
    [queryrangebase.roundTripper][3] into a new type
    [queryrangebase.roundTripperHandler][4]. This is used to create the
    handler that skips the rest of the configured middlewares when sending
    an index stats request ([example][5]); a generic sketch of the adapter
    idea follows below.
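
A self-contained sketch of that adapter pattern (these names are hypothetical
stand-ins, not the actual `queryrangebase` types):

```go
// Sketch: expose a terminal round-trip function as a Handler so callers can
// hit the transport directly, bypassing the remaining middlewares.
package main

import "fmt"

type request string
type response string

// handler is the middleware-facing interface.
type handler interface {
	Do(req request) (response, error)
}

// roundTripperHandler adapts a send function into a handler.
type roundTripperHandler struct {
	send func(request) (response, error)
}

func (h roundTripperHandler) Do(req request) (response, error) { return h.send(req) }

func main() {
	var h handler = roundTripperHandler{send: func(r request) (response, error) {
		return response("stats for " + string(r)), nil
	}}
	resp, _ := h.Do(`{app="foo"}`)
	fmt.Println(resp)
}
```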

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`


[1]:
https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L179-L186

[2]:
https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L72

[3]:
https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L124

[4]:
https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L163

[5]:
https://github.com/grafana/loki/blob/f422e0a52b743a11209b8276510feb2ab8241486/pkg/querier/queryrange/roundtrip.go#L521
salvacorts authored Mar 23, 2023
1 parent 94725e7 commit d24fe3e
Showing 23 changed files with 1,183 additions and 281 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -47,6 +47,7 @@
* [6675](https://github.com/grafana/loki/pull/6675) **btaani**: Add logfmt expression parser for selective extraction of labels from logfmt formatted logs
* [8474](https://github.com/grafana/loki/pull/8474) **farodin91**: Add support for short-lived S3 session tokens
* [8774](https://github.com/grafana/loki/pull/8774) **slim-bean**: Add new logql template functions `bytes`, `duration`, `unixEpochMillis`, `unixEpochNanos`, `toDateInZone`, `b64Enc`, and `b64Dec`
* [8670](https://github.com/grafana/loki/pull/8670) **salvacorts**: Introduce two new limits to refuse log and metric queries that would read too much data.

##### Fixes

11 changes: 11 additions & 0 deletions docs/sources/configuration/_index.md
@@ -2331,6 +2331,17 @@ The `limits_config` block configures global and per-tenant limits in Loki.
# CLI flag: -frontend.min-sharding-lookback
[min_sharding_lookback: <duration> | default = 0s]

# Max number of bytes a query can fetch. Enforced in log and metric queries only
# when TSDB is used. The default value of 0 disables this limit.
# CLI flag: -frontend.max-query-bytes-read
[max_query_bytes_read: <int> | default = 0B]

# Max number of bytes a query can fetch after splitting and sharding. Enforced
# in log and metric queries only when TSDB is used. The default value of 0
# disables this limit.
# CLI flag: -frontend.max-querier-bytes-read
[max_querier_bytes_read: <int> | default = 0B]

# Duration to delay the evaluation of rules to ensure the underlying metrics
# have been pushed to Cortex.
# CLI flag: -ruler.evaluation-delay-duration
4 changes: 2 additions & 2 deletions pkg/logql/downstream_test.go
@@ -81,7 +81,7 @@ func TestMappingEquivalence(t *testing.T) {
ctx := user.InjectOrgID(context.Background(), "fake")

mapper := NewShardMapper(ConstantShards(shards), nilShardMetrics)
_, mapped, err := mapper.Parse(tc.query)
_, _, mapped, err := mapper.Parse(tc.query)
require.Nil(t, err)

shardedQry := sharded.Query(ctx, params, mapped)
@@ -146,7 +146,7 @@ func TestShardCounter(t *testing.T) {
ctx := user.InjectOrgID(context.Background(), "fake")

mapper := NewShardMapper(ConstantShards(shards), nilShardMetrics)
noop, mapped, err := mapper.Parse(tc.query)
noop, _, mapped, err := mapper.Parse(tc.query)
require.Nil(t, err)

shardedQry := sharded.Query(ctx, params, mapped)
133 changes: 80 additions & 53 deletions pkg/logql/shardmapper.go
@@ -4,21 +4,25 @@ import (
"fmt"

"github.com/go-kit/log/level"
"github.com/grafana/loki/pkg/util/math"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"

"github.com/grafana/loki/pkg/logql/syntax"
"github.com/grafana/loki/pkg/querier/astmapper"
"github.com/grafana/loki/pkg/storage/stores/index/stats"
util_log "github.com/grafana/loki/pkg/util/log"
)

type ShardResolver interface {
Shards(expr syntax.Expr) (int, error)
Shards(expr syntax.Expr) (int, uint64, error)
GetStats(e syntax.Expr) (stats.Stats, error)
}

type ConstantShards int

func (s ConstantShards) Shards(_ syntax.Expr) (int, error) { return int(s), nil }
func (s ConstantShards) Shards(_ syntax.Expr) (int, uint64, error) { return int(s), 0, nil }
func (s ConstantShards) GetStats(_ syntax.Expr) (stats.Stats, error) { return stats.Stats{}, nil }

type ShardMapper struct {
shards ShardResolver
@@ -36,18 +40,18 @@ func NewShardMapperMetrics(registerer prometheus.Registerer) *MapperMetrics {
return newMapperMetrics(registerer, "shard")
}

func (m ShardMapper) Parse(query string) (noop bool, expr syntax.Expr, err error) {
func (m ShardMapper) Parse(query string) (noop bool, bytesPerShard uint64, expr syntax.Expr, err error) {
parsed, err := syntax.ParseExpr(query)
if err != nil {
return false, nil, err
return false, 0, nil, err
}

recorder := m.metrics.downstreamRecorder()

mapped, err := m.Map(parsed, recorder)
mapped, bytesPerShard, err := m.Map(parsed, recorder)
if err != nil {
m.metrics.ParsedQueries.WithLabelValues(FailureKey).Inc()
return false, nil, err
return false, 0, nil, err
}

originalStr := parsed.String()
Expand All @@ -61,21 +65,21 @@ func (m ShardMapper) Parse(query string) (noop bool, expr syntax.Expr, err error

recorder.Finish() // only record metrics for successful mappings

return noop, mapped, err
return noop, bytesPerShard, mapped, err
}

func (m ShardMapper) Map(expr syntax.Expr, r *downstreamRecorder) (syntax.Expr, error) {
func (m ShardMapper) Map(expr syntax.Expr, r *downstreamRecorder) (syntax.Expr, uint64, error) {
// immediately clone the passed expr to avoid mutating the original
expr, err := syntax.Clone(expr)
if err != nil {
return nil, err
return nil, 0, err
}

switch e := expr.(type) {
case *syntax.LiteralExpr:
return e, nil
return e, 0, nil
case *syntax.VectorExpr:
return e, nil
return e, 0, nil
case *syntax.MatchersExpr, *syntax.PipelineExpr:
return m.mapLogSelectorExpr(e.(syntax.LogSelectorExpr), r)
case *syntax.VectorAggregationExpr:
@@ -85,43 +89,47 @@ func (m ShardMapper) Map(expr syntax.Expr, r *downstreamRecorder) (syntax.Expr,
case *syntax.RangeAggregationExpr:
return m.mapRangeAggregationExpr(e, r)
case *syntax.BinOpExpr:
lhsMapped, err := m.Map(e.SampleExpr, r)
lhsMapped, lhsBytesPerShard, err := m.Map(e.SampleExpr, r)
if err != nil {
return nil, err
return nil, 0, err
}
rhsMapped, err := m.Map(e.RHS, r)
rhsMapped, rhsBytesPerShard, err := m.Map(e.RHS, r)
if err != nil {
return nil, err
return nil, 0, err
}
lhsSampleExpr, ok := lhsMapped.(syntax.SampleExpr)
if !ok {
return nil, badASTMapping(lhsMapped)
return nil, 0, badASTMapping(lhsMapped)
}
rhsSampleExpr, ok := rhsMapped.(syntax.SampleExpr)
if !ok {
return nil, badASTMapping(rhsMapped)
return nil, 0, badASTMapping(rhsMapped)
}
e.SampleExpr = lhsSampleExpr
e.RHS = rhsSampleExpr
return e, nil

// We take the maximum bytes per shard of both sides of the operation
bytesPerShard := uint64(math.Max(int(lhsBytesPerShard), int(rhsBytesPerShard)))

return e, bytesPerShard, nil
default:
return nil, errors.Errorf("unexpected expr type (%T) for ASTMapper type (%T) ", expr, m)
return nil, 0, errors.Errorf("unexpected expr type (%T) for ASTMapper type (%T) ", expr, m)
}
}

func (m ShardMapper) mapLogSelectorExpr(expr syntax.LogSelectorExpr, r *downstreamRecorder) (syntax.LogSelectorExpr, error) {
func (m ShardMapper) mapLogSelectorExpr(expr syntax.LogSelectorExpr, r *downstreamRecorder) (syntax.LogSelectorExpr, uint64, error) {
var head *ConcatLogSelectorExpr
shards, err := m.shards.Shards(expr)
shards, bytesPerShard, err := m.shards.Shards(expr)
if err != nil {
return nil, err
return nil, 0, err
}
if shards == 0 {
return &ConcatLogSelectorExpr{
DownstreamLogSelectorExpr: DownstreamLogSelectorExpr{
shard: nil,
LogSelectorExpr: expr,
},
}, nil
}, bytesPerShard, nil
}
for i := shards - 1; i >= 0; i-- {
head = &ConcatLogSelectorExpr{
@@ -137,22 +145,22 @@ func (m ShardMapper) mapLogSelectorExpr(expr syntax.LogSelectorExpr, r *downstre
}
r.Add(shards, StreamsKey)

return head, nil
return head, bytesPerShard, nil
}

func (m ShardMapper) mapSampleExpr(expr syntax.SampleExpr, r *downstreamRecorder) (syntax.SampleExpr, error) {
func (m ShardMapper) mapSampleExpr(expr syntax.SampleExpr, r *downstreamRecorder) (syntax.SampleExpr, uint64, error) {
var head *ConcatSampleExpr
shards, err := m.shards.Shards(expr)
shards, bytesPerShard, err := m.shards.Shards(expr)
if err != nil {
return nil, err
return nil, 0, err
}
if shards == 0 {
return &ConcatSampleExpr{
DownstreamSampleExpr: DownstreamSampleExpr{
shard: nil,
SampleExpr: expr,
},
}, nil
}, bytesPerShard, nil
}
for i := shards - 1; i >= 0; i-- {
head = &ConcatSampleExpr{
@@ -168,120 +176,139 @@ func (m ShardMapper) mapSampleExpr(expr syntax.SampleExpr, r *downstreamRecorder
}
r.Add(shards, MetricsKey)

return head, nil
return head, bytesPerShard, nil
}

// technically, std{dev,var} are also parallelizable if there is no cross-shard merging
// in descendent nodes in the AST. This optimization is currently avoided for simplicity.
func (m ShardMapper) mapVectorAggregationExpr(expr *syntax.VectorAggregationExpr, r *downstreamRecorder) (syntax.SampleExpr, error) {
func (m ShardMapper) mapVectorAggregationExpr(expr *syntax.VectorAggregationExpr, r *downstreamRecorder) (syntax.SampleExpr, uint64, error) {
// if this AST contains unshardable operations, don't shard this at this level,
// but attempt to shard a child node.
if !expr.Shardable() {
subMapped, err := m.Map(expr.Left, r)
subMapped, bytesPerShard, err := m.Map(expr.Left, r)
if err != nil {
return nil, err
return nil, 0, err
}
sampleExpr, ok := subMapped.(syntax.SampleExpr)
if !ok {
return nil, badASTMapping(subMapped)
return nil, 0, badASTMapping(subMapped)
}

return &syntax.VectorAggregationExpr{
Left: sampleExpr,
Grouping: expr.Grouping,
Params: expr.Params,
Operation: expr.Operation,
}, nil
}, bytesPerShard, nil

}

switch expr.Operation {
case syntax.OpTypeSum:
// sum(x) -> sum(sum(x, shard=1) ++ sum(x, shard=2)...)
sharded, err := m.mapSampleExpr(expr, r)
sharded, bytesPerShard, err := m.mapSampleExpr(expr, r)
if err != nil {
return nil, err
return nil, 0, err
}
return &syntax.VectorAggregationExpr{
Left: sharded,
Grouping: expr.Grouping,
Params: expr.Params,
Operation: expr.Operation,
}, nil
}, bytesPerShard, nil

case syntax.OpTypeAvg:
// avg(x) -> sum(x)/count(x)
lhs, err := m.mapVectorAggregationExpr(&syntax.VectorAggregationExpr{
lhs, lhsBytesPerShard, err := m.mapVectorAggregationExpr(&syntax.VectorAggregationExpr{
Left: expr.Left,
Grouping: expr.Grouping,
Operation: syntax.OpTypeSum,
}, r)
if err != nil {
return nil, err
return nil, 0, err
}
rhs, err := m.mapVectorAggregationExpr(&syntax.VectorAggregationExpr{
rhs, rhsBytesPerShard, err := m.mapVectorAggregationExpr(&syntax.VectorAggregationExpr{
Left: expr.Left,
Grouping: expr.Grouping,
Operation: syntax.OpTypeCount,
}, r)
if err != nil {
return nil, err
return nil, 0, err
}

// We take the maximum bytes per shard of both sides of the operation
bytesPerShard := uint64(math.Max(int(lhsBytesPerShard), int(rhsBytesPerShard)))

return &syntax.BinOpExpr{
SampleExpr: lhs,
RHS: rhs,
Op: syntax.OpTypeDiv,
}, nil
}, bytesPerShard, nil

case syntax.OpTypeCount:
// count(x) -> sum(count(x, shard=1) ++ count(x, shard=2)...)
sharded, err := m.mapSampleExpr(expr, r)
sharded, bytesPerShard, err := m.mapSampleExpr(expr, r)
if err != nil {
return nil, err
return nil, 0, err
}
return &syntax.VectorAggregationExpr{
Left: sharded,
Grouping: expr.Grouping,
Operation: syntax.OpTypeSum,
}, nil
}, bytesPerShard, nil
default:
// this should not be reachable. If an operation is shardable it should
// have an optimization listed.
level.Warn(util_log.Logger).Log(
"msg", "unexpected operation which appears shardable, ignoring",
"operation", expr.Operation,
)
return expr, nil
exprStats, err := m.shards.GetStats(expr)
if err != nil {
return nil, 0, err
}
return expr, exprStats.Bytes, nil
}
}

func (m ShardMapper) mapLabelReplaceExpr(expr *syntax.LabelReplaceExpr, r *downstreamRecorder) (syntax.SampleExpr, error) {
subMapped, err := m.Map(expr.Left, r)
func (m ShardMapper) mapLabelReplaceExpr(expr *syntax.LabelReplaceExpr, r *downstreamRecorder) (syntax.SampleExpr, uint64, error) {
subMapped, bytesPerShard, err := m.Map(expr.Left, r)
if err != nil {
return nil, err
return nil, 0, err
}
cpy := *expr
cpy.Left = subMapped.(syntax.SampleExpr)
return &cpy, nil
return &cpy, bytesPerShard, nil
}

func (m ShardMapper) mapRangeAggregationExpr(expr *syntax.RangeAggregationExpr, r *downstreamRecorder) (syntax.SampleExpr, error) {
func (m ShardMapper) mapRangeAggregationExpr(expr *syntax.RangeAggregationExpr, r *downstreamRecorder) (syntax.SampleExpr, uint64, error) {
if hasLabelModifier(expr) {
// if an expr can modify labels this means multiple shards can return the same labelset.
// When this happens the merge strategy needs to be different from a simple concatenation.
// For instance for rates we need to sum data from different shards but same series.
// Since we currently support only concatenation as merge strategy, we skip those queries.
return expr, nil
exprStats, err := m.shards.GetStats(expr)
if err != nil {
return nil, 0, err
}

return expr, exprStats.Bytes, nil
}

switch expr.Operation {
case syntax.OpRangeTypeCount, syntax.OpRangeTypeRate, syntax.OpRangeTypeBytesRate, syntax.OpRangeTypeBytes:
// count_over_time(x) -> count_over_time(x, shard=1) ++ count_over_time(x, shard=2)...
// rate(x) -> rate(x, shard=1) ++ rate(x, shard=2)...
// same goes for bytes_rate and bytes_over_time
return m.mapSampleExpr(expr, r)
default:
return expr, nil
// This part of the query is not shardable, so the bytesPerShard is the bytes for all the log matchers in expr
exprStats, err := m.shards.GetStats(expr)
if err != nil {
return nil, 0, err
}

return expr, exprStats.Bytes, nil
}
}

