Certain queries will cause a massive memory leak when running Loki in monolith mode (v3) #13277

Closed
ASatanicPickle opened this issue Jun 20, 2024 · 1 comment · Fixed by #13501
Labels
needs triage (Issue requires investigation) · type/bug (Something is not working as expected)

Comments

ASatanicPickle commented Jun 20, 2024

Describe the bug

A dramatic memory leak occurs with a specific kind of query when running the all-in-one (-target=all) Loki. v3 is affected; the 2.x line is fine. We had pods spike up to 100 GB in a matter of minutes.

A query that causes the leak to happen:

{stream="stdout",name="loki-canary"} |= "p" | json |= "p"

These queries work fine:

{stream="stdout",name="loki-canary"} |= "p" | json
{stream="stdout",name="loki-canary"} |  json  |= "p"

It seems to be specific to having a line filter both before and after the parser expression. (Note: I didn't test much beyond this, though.)

I did some digging and I think the issue is here:

filters = append(filters, f)

in pkg/logql/syntax/ast.go, in the reorderStages() function, and specifically in how combineFilters() handles those collected filters.

The function has a side effect: it modifies the original request's AST, and with multiple queries running in parallel the AST grows very large. There's a heap dump below.

I was able to get it working by changing this line:

for _, s := range m {
    switch f := s.(type) {
    case *LineFilterExpr:
        filters = append(filters, f)

to

for _, s := range m {
    switch f := s.(type) {
    case *LineFilterExpr:
        filters = append(filters, MustClone(f))

Cloning the filter expressions means combineFilters() no longer modifies the original request's AST.
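
To make the side effect and the effect of cloning concrete, here is a minimal, self-contained Go sketch. The types and functions (lineFilter, combine, combineCloned, clone, chainLen) are toy stand-ins rather than Loki's actual LineFilterExpr, combineFilters(), or MustClone, but they show the same hazard: chaining filters by mutating them in place also rewrites nodes that still belong to the caller's AST, while cloning first leaves the original untouched.

package main

import "fmt"

// lineFilter is a toy stand-in for a line filter AST node, chained via left.
type lineFilter struct {
    match string
    left  *lineFilter
}

// clone makes a deep copy, playing the role MustClone plays in the fix above.
func (f *lineFilter) clone() *lineFilter {
    if f == nil {
        return nil
    }
    return &lineFilter{match: f.match, left: f.left.clone()}
}

// chainLen counts how many filters are chained together starting at f.
func chainLen(f *lineFilter) int {
    n := 0
    for ; f != nil; f = f.left {
        n++
    }
    return n
}

// combine chains the filters by mutating them in place: the nodes it touches
// are the very nodes still referenced by the caller's (shared) AST.
func combine(filters []*lineFilter) *lineFilter {
    result := filters[0]
    for _, f := range filters[1:] {
        f.left = result
        result = f
    }
    return result
}

// combineCloned copies each filter first, so the caller's AST is untouched.
func combineCloned(filters []*lineFilter) *lineFilter {
    copies := make([]*lineFilter, 0, len(filters))
    for _, f := range filters {
        copies = append(copies, f.clone())
    }
    return combine(copies)
}

func main() {
    // The two |= "p" filters of the query, as they sit in the parsed AST.
    ast := []*lineFilter{{match: "p"}, {match: "p"}}

    combineCloned(ast)
    fmt.Println("after cloned combine, second filter chain length:", chainLen(ast[1])) // 1: AST unchanged

    combine(ast)
    fmt.Println("after in-place combine, second filter chain length:", chainLen(ast[1])) // 2: AST mutated
}

With many subqueries planning the same parsed expression in parallel, that in-place mutation is what lets the original request's AST keep growing.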

To Reproduce

Steps to reproduce the behavior:

  1. Run Loki with -target=all
  2. Load it with some data (I used the loki-canary program)
  3. Make sure some chunks get flushed to disk; the bug will not occur if no chunks have been flushed to the store.
  4. Run the query: {stream="stdout",name="loki-canary"} |= "p" | json |= "p" (see the example commands after this list)
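
For concreteness, a minimal local run looks roughly like this (a sketch: the flags are standard loki / loki-canary flags, loki-config.yaml stands for wherever you saved the config shown under Environment below, and you need to wait long enough for chunks to flush before querying):

# 1. Start Loki in monolith mode
loki -config.file=loki-config.yaml -target=all

# 2. Feed it data with the canary
loki-canary -addr=localhost:3100

# 3. Wait for max_chunk_age / chunk_idle_period so some chunks get flushed, then
# 4. run the problematic query against the query_range API
curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={stream="stdout",name="loki-canary"} |= "p" | json |= "p"'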

Expected behavior

The query should run normally, without the memory spike.

Environment:

  • Infrastructure: repro'd on my laptop and in a k8s environment

The Loki config that I was using locally to repro:

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  chunk_encoding: snappy
  max_chunk_age: 5m
  chunk_idle_period: 1m

common:
  instance_addr: localhost
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

Heap dump while the leak was occurring:

[image: memleak]

Thanks,
Mark.

@ASatanicPickle changed the title from "Certain queries will cause a memory leak when running Loki in monolith mode" to "Certain queries will cause a memory leak when running Loki in monolith mode (v3)" Jun 20, 2024
@ASatanicPickle changed the title from "Certain queries will cause a memory leak when running Loki in monolith mode (v3)" to "Certain queries will cause a massive memory leak when running Loki in monolith mode (v3)" Jun 24, 2024
@JStickler added the type/bug (Something is not working as expected) and needs triage (Issue requires investigation) labels Jun 24, 2024
@sondrelg

We're seeing something similar. Our primary loki-backend pod is continuously OOMKilled.

[image]
