Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue / holding reference to binaries #459

Open
elvanja opened this issue Sep 12, 2022 · 40 comments
Open

Performance issue / holding reference to binaries #459

elvanja opened this issue Sep 12, 2022 · 40 comments

Comments

@elvanja
Copy link

elvanja commented Sep 12, 2022

Hi,

A Phoenix app I am working on is experiencing binaries related memory leak.
It is related to https://github.com/quantum-elixir triggered jobs, but after some investigation found that it is actually open telemetry related process that holds on to those binaries. Not sure if this is a Quantum or Telemetry issue though (or our usage!).

Binaries memory leak with quantum and (open)telemetry contains all the details.

Any ideas on what might be the cause?

@tsloughter
Copy link
Member

Thanks, haven't had time to do a detailed dive into your post but a quick look are you seeing the issue in h2_connection? If so, maybe you can switch to using http_protobuf instead of grpc for exporting? If it is the http2 connection causing the leak I'll get it fixed, but switching will give immediately relief if that is the issue and you'll be able to know if that is the issue for sure.

@elvanja
Copy link
Author

elvanja commented Sep 12, 2022

Unfortunately it is not possible to move to http_protobuf at this time. What I did find out in the mean time is that another project had similar leak issue but without Quantum scheduled involved. What they did in the end was a bit of a brute force approach. They implemented a watchdog which killed h2_stream processes if there were too many (an arbitrary number). That apparently helped reduce the memory usage and was safe for them to handle (loosing some traces was OK). I will see if there is anything I can do about protobuf. Thank for looking into this.

@tsloughter
Copy link
Member

Ok, this issue sounds familiar actually, there may already be a fix for it I can merge to chatterbox and release a new version. Will get back to you.

@tsloughter
Copy link
Member

See joedevivo/chatterbox#120

@tsloughter
Copy link
Member

@elvanja I've released grpcbox 0.16.0 and chatterbox 0.13.0. Are you are able to bump to those versions and see if anything improves?

@elvanja
Copy link
Author

elvanja commented Sep 14, 2022

Will do, stay tuned for results. The process will take at least a few days to determine if this helps.

@elvanja
Copy link
Author

elvanja commented Sep 19, 2022

Update: app version with grpcbox 0.16.0 and chatterbox 0.13.0 was in production for 5 days now, and I am still seeing the same pattern of binaries being held. I upgraded only those two dependencies, all the other related dependencies stayed the same. Do you feel it would make sense to try with all related dependencies upgraded before giving :http_protobuf a chance? I confirmed in the mean time that I can deploy :http_protobuf as well. A bit more work, but it is doable.

Here's the list of those related dependencies which use telemetry (removed the others to reduce the noise):

app
├── broadway ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4.3 or ~> 1.0 (Hex package)
├── chatterbox ~> 0.13 (Hex package)
├── ecto ~> 3.6 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── ecto_sql ~> 3.6 (Hex package)
│   ├── db_connection ~> 2.2 (Hex package)
│   │   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4.0 (Hex package)
├── ex_aws ~> 2.2 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── finch ~> 0.12 (Hex package) *override*
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── grpcbox ~> 0.16 (Hex package)
│   ├── chatterbox ~>0.13.0 (Hex package)
│   └── gproc ~>0.8.0 (Hex package)
├── mentat ~> 0.7.1 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── opentelemetry ~> 1.0 (Hex package)
│   └── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_ecto ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   ├── opentelemetry_process_propagator ~> 0.1.0 (Hex package)
│   │   └── opentelemetry_api ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── opentelemetry_exporter ~> 1.0 (Hex package)
│   ├── grpcbox >= 0.0.0 (Hex package)
│   ├── opentelemetry ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_phoenix ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   ├── opentelemetry_telemetry ~> 1.0 (Hex package)
│   │   ├── opentelemetry_api ~> 1.0 (Hex package)
│   │   ├── telemetry ~> 1.0 (Hex package)
│   │   └── telemetry_registry ~> 0.3.0 (Hex package)
│   │       └── telemetry ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── phoenix ~> 1.5.8 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── plug ~> 1.11.0 (Hex package) *override*
│   └── telemetry ~> 0.4 (Hex package)
├── plug_cowboy ~> 2.5 (Hex package)
│   ├── cowboy_telemetry ~> 0.3 (Hex package)
│   │   └── telemetry ~> 0.4 (Hex package)
├── prom_ex ~> 1.4.0 (Hex package)
│   ├── telemetry >= 0.4.0 (Hex package)
│   ├── telemetry_metrics ~> 0.6.0 (Hex package)
│   ├── telemetry_metrics_prometheus_core ~> 1.0.1 (Hex package)
│   │   ├── telemetry ~> 0.4 (Hex package)
│   │   └── telemetry_metrics ~> 0.6 (Hex package)
│   └── telemetry_poller ~> 0.5.1 (Hex package)
├── quantum ~> 3.0 (Hex package)
│   └── telemetry ~> 0.4.3 or ~> 1.0.0 (Hex package)
├── telemetry ~> 1.1 (Hex package) *override*
├── telemetry_metrics ~> 0.6 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── telemetry_poller ~> 0.5 (Hex package)
│   └── telemetry ~> 0.4 (Hex package)
├── tesla ~> 1.4.3 (Hex package) *override*
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)

If it helps, here are the details for the largest one:

%{
  binaries_size: 1226523189,
  memory: 239552,
  pid: #PID<0.5407.0>,
  process_info: [
    current_function: {:gen_statem, :loop_receive, 3},
    initial_call: {:proc_lib, :init_p, 5},
    status: :waiting,
    message_queue_len: 0,
    links: [#PID<0.16047.24>, #PID<0.24620.2>, #PID<0.29652.25>,
     #PID<0.31253.1>, #PID<0.32115.25>, #PID<0.32508.24>, #Port<0.121>,
     #PID<0.32245.48>, #PID<0.31690.48>, #PID<0.31829.70>, #PID<0.31763.49>,
     #PID<0.31675.24>, #PID<0.30646.48>, #PID<0.30675.69>, #PID<0.30813.69>,
     #PID<0.30898.70>, #PID<0.30721.49>, #PID<0.30649.24>, #PID<0.29999.24>,
     #PID<0.30020.91>, #PID<0.30083.25>, #PID<0.30000.49>, #PID<0.29777.25>,
     #PID<0.29987.70>, #PID<0.29750.92>, #PID<0.27522.24>, #PID<0.28244.24>,
     #PID<0.29183.24>, #PID<0.29457.25>, #PID<0.29525.92>, #PID<0.29493.48>,
     #PID<0.29415.49>, #PID<0.29246.91>, #PID<0.28777.49>, #PID<0.29171.25>,
     #PID<0.29067.69>, #PID<0.28591.49>, #PID<0.28598.49>, #PID<0.28079.1>,
     #PID<0.28124.24>, #PID<0.27628.70>, ...],
    dictionary: [
      "$ancestors": [#PID<0.4553.0>, #PID<0.4547.0>, :otel_batch_processor,
       :opentelemetry_sup, #PID<0.4498.0>],
      "$initial_call": {:h2_connection, :init, 1}
    ],
    trap_exit: false,
    error_handler: :error_handler,
    priority: :normal,
    group_leader: #PID<0.4497.0>,
    total_heap_size: 28690,
    heap_size: 28690,
    stack_size: 11,
    reductions: 568170927,
    garbage_collection: [
      max_heap_size: %{error_logger: true, kill: true, size: 0},
      min_bin_vheap_size: 46422,
      min_heap_size: 233,
      fullsweep_after: 65535,
      minor_gcs: 0
    ],
    suspending: []
  ]
}

@tsloughter
Copy link
Member

I'd go with using http_protobuf if that can work for you.

There should be a simple fix here, it is just wrong that those stream processes are still alive, something isn't properly closing.

@tsloughter
Copy link
Member

I was hoping I'd find it as simple as checking the number of links a connection process still has after running some unary requests so I could write a test in grpcbox but that doesn't appear to be the case.

@tsloughter
Copy link
Member

Are you sending to the Otel collector? If so, what version? I'm now wondering if it could be the server is not closing their end and that's why those processes are alive.

I'm testing against 0.60.0 of the collector-contrib docker image.

@elvanja
Copy link
Author

elvanja commented Sep 22, 2022

We are actually sending to Honeycomb, via Refinery as proxy. Related links: https://github.com/honeycombio/refinery/tree/v1.17.0 and https://docs.honeycomb.io/manage-data-volume/refinery/.

And, I have started the switch to protobuf, but again it will take a bit of time to actually release and see if it solved the problem. Stay tuned.

@elvanja
Copy link
Author

elvanja commented Oct 11, 2022

Update: so changing the protocols solved the problem. Apologies for a bit late reply, but it took a bit of time to make sure we are in the clear. Not sure why :grpc flow causes this, but at least in our case, the switch to :http_protobuf saved the day. Thank you for your time @tsloughter! 🙇🏼 Not sure if you want to keep the issue open because of :grpc though.

@tsloughter
Copy link
Member

@elvanja good to hear.

We should keep the issue open. I know it is a bug in the grpc client and need to get it fixed.

@davan-nl
Copy link

davan-nl commented Oct 11, 2022

Hey, we are also seeing the same issue. Unfortunately setting protocol to :http_protobuf did not solve the issue
We tested the app disabling the exporter and the issue goes away.
Any tips on how we can debug this further to provide more details or context?

erlang 23.3.4
elixir 1.11.4-otp-23

app
├── ecto_sql ~> 3.4 (Hex package)
│   ├── db_connection ~> 2.2 (Hex package)
│   │   ├── connection ~> 1.0 (Hex package)
│   │   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   ├── ecto ~> 3.7.0 (Hex package)
│   ├── postgrex ~> 0.15.0 or ~> 0.16.0 or ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4.0 or ~> 1.0 (Hex package)
├── ex_aws ~> 2.0 (Hex package)
│   ├── hackney ~> 1.16 (Hex package)
│   │   ├── certifi ~>2.9.0 (Hex package)
│   │   ├── idna ~>6.1.0 (Hex package)
│   │   │   └── unicode_util_compat ~>0.7.0 (Hex package)
│   │   ├── metrics ~>1.0.0 (Hex package)
│   │   ├── mimerl ~>1.1 (Hex package)
│   │   ├── parse_trans 3.3.1 (Hex package)
│   │   ├── ssl_verify_fun ~>1.1.0 (Hex package)
│   │   └── unicode_util_compat ~>0.7.0 (Hex package)
│   ├── jason ~> 1.1 (Hex package)
│   ├── mime ~> 1.2 or ~> 2.0 (Hex package)
│   └── telemetry ~> 0.4.3 or ~> 1.0 (Hex package)
├── ex_aws_sns ~> 2.0 (Hex package)
│   └── ex_aws ~> 2.0 (Hex package)
├── opentelemetry ~> 1.0 (Hex package)
│   └── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_ecto ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   ├── opentelemetry_process_propagator ~> 0.1.0 (Hex package)
│   │   └── opentelemetry_api ~> 1.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── opentelemetry_exporter ~> 1.0 (Hex package)
│   ├── grpcbox >= 0.0.0 (Hex package)
│   │   ├── acceptor_pool ~>1.0.0 (Hex package)
│   │   ├── chatterbox ~>0.11.0 (Hex package)
│   │   │   └── hpack ~>0.2.3 (Hex package)
│   │   ├── ctx ~>0.6.0 (Hex package)
│   │   └── gproc ~>0.8.0 (Hex package)
│   ├── opentelemetry ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   └── tls_certificate_check ~> 1.11 (Hex package)
│       └── ssl_verify_fun 1.1.6 (Hex package)
├── opentelemetry_logger_metadata ~> 0.1.0-rc (Hex package)
│   └── opentelemetry_api ~> 1.0 (Hex package)
├── opentelemetry_phoenix ~> 1.0 (Hex package)
│   ├── opentelemetry_api ~> 1.0 (Hex package)
│   ├── opentelemetry_telemetry ~> 1.0 (Hex package)
│   │   ├── opentelemetry_api ~> 1.0 (Hex package)
│   │   ├── telemetry ~> 1.0 (Hex package)
│   │   └── telemetry_registry ~> 0.3.0 (Hex package)
│   │       └── telemetry ~> 1.0 (Hex package)
│   ├── plug >= 1.11.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── opentelemetry_tesla ~> 1.3.2-rc.4 (Hex package)
│   ├── opentelemetry_api ~> 1.0.0 (Hex package)
│   ├── opentelemetry_telemetry ~> 1.0.0-beta (Hex package)
│   ├── telemetry ~> 0.4 or ~> 1.0 (Hex package)
│   └── tesla ~> 1.4 (Hex package)
├── phoenix ~> 1.5.9 (Hex package)
│   ├── jason ~> 1.0 (Hex package)
│   ├── phoenix_html ~> 2.13 or ~> 3.0 (Hex package)
│   │   └── plug ~> 1.5 (Hex package)
│   ├── phoenix_pubsub ~> 2.0 (Hex package)
│   ├── plug ~> 1.10 (Hex package)
│   ├── plug_cowboy ~> 1.0 or ~> 2.2 (Hex package)
│   ├── plug_crypto ~> 1.1.2 or ~> 1.2 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── phoenix_ecto ~> 4.1 (Hex package)
│   ├── ecto ~> 3.3 (Hex package)
│   ├── phoenix_html ~> 2.14.2 or ~> 3.0 (Hex package)
│   └── plug ~> 1.9 (Hex package)
├── telemetry_metrics ~> 0.6 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)
├── telemetry_poller ~> 1.0 (Hex package)
│   └── telemetry ~> 1.0 (Hex package)
├── tesla ~> 1.4 (Hex package)
│   ├── hackney ~> 1.6 (Hex package)
│   ├── jason >= 1.0.0 (Hex package)
│   ├── mime ~> 1.0 or ~> 2.0 (Hex package)
│   └── telemetry ~> 0.4 or ~> 1.0 (Hex package)

@tsloughter
Copy link
Member

It is most likely not the same issue if it happens even with http_protobuf. Can you open a new issue with any more details you can provide?

@davan-nl
Copy link

Sorry for this question but is this the correct way to configure the exporter:

    otel_batch_processor: %{
      exporter:
        {:opentelemetry_exporter,
         %{endpoints: [{:http, otel_collector_host, otel_collector_port, []}], protocol: :http_protobuf}}
    }

For some reason we are seeing grpcbox_client:recv_end/2 being called (being one of the most memory hungry processes) but the only library in our app that has that as a dependency is the exporter. Is it used for something else even if the protocol is http_protobuf?

@tsloughter
Copy link
Member

@davan-nl technically it can be configured that way but since it is so deeply nested and hard to be sure its correct we have a simpler way to configure the processor and exporter:

# config/runtime.exs
config :opentelemetry,
  span_processor: :batch,
  traces_exporter: :otlp

config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "http://localhost:4318"

These are also the defaults

@tsloughter
Copy link
Member

and no, grpc isn't used for anything else, so you should not see any use of grpcbox_client when the processor is configured to use the http exporter.

@davan-nl
Copy link

Thanks for the prompt responses @tsloughter. I am wondering whats the priority to get grpc working. As you mention using http_protobuf fixes the leak but its not really an option for us. Wish I could help but I dont know erlang 😅

@tsloughter
Copy link
Member

@davan-nl are you sending to the otel collector or a vendor directly?

@tsloughter
Copy link
Member

I try doing a simple example against the collector and I'm not seeing the ever increasing list of links like you show in the process info for the grpc connection process (meaning a stream leak).

@davan-nl
Copy link

We use instana which has a local agent running. Can try investigating if using the otel collector in-between can help. There is a clear difference in memory usage when we switched from grpc to http_protobuf.

@tsloughter
Copy link
Member

Ok, yea, that'd be great if you can reproduce with otel collector. I'm running an app locally publishing to the collector and not reproducing. It could be something with how the vendor agents close, or don't close, streams. Still likely a bug on our end (not ending the stream when we should) but it would help narrow down where the bug is.

@davan-nl
Copy link

davan-nl commented Oct 12, 2022

Just wanted to add this was the output from recon :recon.bin_leak(5)

[
  {#PID<0.30485.5>, -45019,
   [
     current_function: {:grpcbox_client, :recv_end, 2},
     initial_call: {:erlang, :apply, 2}
   ]},
  {#PID<0.2685.0>, -1120,
   [
     :ssl_manager,
     {:current_function, {:gen_server, :loop, 7}},
     {:initial_call, {:proc_lib, :init_p, 5}}
   ]},
  {#PID<0.3093.0>, -622,
   [
     SomeService,
     {:current_function, {:gen, :do_call, 4}},
     {:initial_call, {:proc_lib, :init_p, 5}}
   ]},
  {#PID<0.2711.0>, -537,
   [
     Logger,
     {:current_function, {:re, :postprocess, 5}},
     {:initial_call, {:proc_lib, :init_p, 5}}
   ]},
  {#PID<0.28540.5>, -407,
   [
     current_function: {:gen_statem, :loop_receive, 3},
     initial_call: {:proc_lib, :init_p, 5}
   ]}
]

@tsloughter
Copy link
Member

Thanks, recv_end may, at least partially, mean my guess is right that it gets into an end state but for some reason never finishes the ending.

@davan-nl
Copy link

Just wanted to give 1 last update
We tried Elixir -> Otel Collector -> Instana all over gRPC and everything worked smoothly. If Instana was mishandling things on their end we would see the collector leak too no? Just more details if you suspect the grpc_box lib.

@tsloughter
Copy link
Member

@davan-nl good to know. And yes, it is most likely and issue in grpcbox, but not 100%. It could be both Instana and the Collector use Go's grpc lib and so has the same quirk, but more likely is that something about talking to a grpc server over the internet (could be an intermediary box causing this) results in grpcbox not closing the stream.

But now I know I should be able to reproduce by signing up for a service and giving it a try.

@tsloughter
Copy link
Member

I've got an example exporting to honeycomb, its creating a span every 500 ms (its just a free account so trying not to be cutoff :)) and I'm still unable to see his happening. Going to keep it running and hopefully eventually reproduces.

@tsloughter
Copy link
Member

Oh, oops, forgot for honeycomb it was first going through refinery when @elvanja was seeing this... I guess I need to try running with that to be able to reproduce.

@pprogrammingg
Copy link

Hi, any update with the issue above? we also use Instana and seeing the memory leak issue

@tsloughter
Copy link
Member

@pprogrammingg afraid not, I still need to reproduce so I can debug.

I guess I can look into Instana to see if they have a free trial I can utilize.

@tsloughter
Copy link
Member

Finding it not so easy to get going with Instana.

Are you using the docker image to run the agent?

@davan-nl
Copy link

davan-nl commented Oct 27, 2022

We do run it with docker, this should be enough to help bootstrap instana assuming you are using kubernetes
values.yaml:

agent:
  configuration_yaml: |
    ---
    com.instana.plugin.opentelemetry:
      enabled: true

    com.instana.tracing:
      extra-http-headers:
        - "X-Request-Id"
        - "X-Amzn-Trace-Id"
  downloadKey: <redacted>
  endpointHost: <your-instana-endpoint>
  endpointPort: 443
  key: <redacted>
  mode: APM
cluster:
  name: <your-cluster-name>
fullnameOverride: instana-agent
opentelemetry:
  enabled: true
service:
  create: true

Install Helm chart:

helm repo add instana https://agents.instana.io/helm
helm repo update
helm install instana -n instana-agent --create-namespace -f values.yaml

Use instana-agent.instana-agent.svc.cluster.local:4317 as the OTEL collector endpoint (gRPC)

@tsloughter
Copy link
Member

Thanks. I'm just trying to run with docker so I don't have to setup k8s. I enabled otel in the configuration in the docker image and get a crash of NoClassDefFoundError so wonder if the default docker image doesn't come with otel support.

@davan-nl
Copy link

davan-nl commented Oct 27, 2022

❯ cat configuration-otel.yaml
com.instana.plugin.opentelemetry:
  enabled: true
com.instana.tracing:
  extra-http-headers:
    - "X-Request-Id"
    - "X-Amzn-Trace-Id"

Be sure to set INSTANA_AGENT_KEY and INSTANA_AGENT_ENDPOINT first

❯ docker run --rm --name instana -v $PWD/configuration-otel.yaml:/opt/instana/agent/etc/instana/configuration-otel.yaml -e INSTANA_AGENT_ENDPOINT_PORT=443 -e INSTANA_AGENT_KEY=$INSTANA_AGENT_KEY -e INSTANA_AGENT_ENDPOINT=$INSTANA_AGENT_ENDPOINT icr.io/instana/agent

@davan-nl
Copy link

@tsloughter Did you get a chance to test?

@tsloughter
Copy link
Member

Sorry, I missed that you provided exactly what I needed to do :(. I have to get a new instana account and then I'll be able to test it. I'll definitely get to it this week, if not today.

@tsloughter
Copy link
Member

I contacted Instana on Wednesday and thought I'd have an extension on my trial by now but don't. Guess I have to make a new account if the [email protected] trick works for emails to signup for their trials.

@tsloughter
Copy link
Member

Sorry I forgot to note that I did try again with a new account but did not see the leak. I wish I could address this sooner since I have little doubt it is an issue on our end but its slow going being able to give it attention and when giving it attention actually getting anywhere :(

@tsloughter
Copy link
Member

Oh, I did end up testing this but was unable to reproduce. Sorry for the late notice, I was hoping to keep trying until I was able to and then respond but now that it has been so long I figured I'd better give an update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants