Threads grow indefinitely #460

Closed
akhstash opened this issue Jul 27, 2023 · 4 comments


akhstash commented Jul 27, 2023

We run DataDog using a Helm chart in k8s. We recently encountered a situation where a DataDog process grew to tens of thousands of threads, which caused crashes for all other JVM processes running on the same node. Because we use DataDog to monitor a large number of nodes, this caused widespread crashes.

During the investigation, we used the resource metrics from the process dashboard and saw that a number of processes had unbounded growth in thread counts. Here is one example:

[screenshot: process dashboard showing the thread count for one process growing without bound]

We took the PID associated with the process above, checked it on the host, and saw it was JMXFetch:

    java -XX:+UseContainerSupport -classpath /opt/datadog-agent/bin/agent/dist/jmx/jmxfetch.jar org.datadog.jmxfetch.App --ipc_host localhost --ipc_port 5001 --check_period 15000 --thread_pool_size 3 --collection_timeout 60 --reconnection_timeout 60 --reconnection_thread_pool_size 3 --log_level ERROR --reporter statsd:unix:///var/run/datadog/dsd.socket --statsd_queue_size 4096 collect
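
As a side note, for anyone investigating a similar case, the live thread count for a given PID can be confirmed directly on the host; the PID below is a placeholder for the process identified in the dashboard:

    # Placeholder PID; substitute the process found in the dashboard.
    ls /proc/12345/task | wc -l
    # or, equivalently, report the number of lightweight processes (threads):
    ps -o nlwp= -p 12345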

The DataDog Agent has this configuration for some Debezium monitoring:

    ad_identifiers:
      - strimzi
    init_config:
      is_jmx: true
      new_gc_metrics: true
      collect_default_metrics: true
      service_check_prefix: kafka_connect
    instances:
      - host: source-connect-api.strimzi.svc.cluster.local
        port: 9999
        name: kafka-connect-source
        collect_default_jvm_metrics: true
        tags:
          - owner:data
        service: kafka-connect-source

A week before our production systems were affected, we decommissioned the Debezium setup but did not remove the DataDog monitoring. We suspect there is an edge case in JMXFetch where a monitored service that existed initially and was later removed causes the thread growth. After restarting all of the agents, we have not seen the same issue.
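
For anyone checking a similar setup, one way to confirm that the configured JMX endpoint is no longer reachable after the decommission (host and port taken from the instance config above) is a simple connectivity probe from inside the cluster:

    # After the Debezium decommission this would be expected to fail
    # (DNS resolution error or connection refused).
    nc -zv source-connect-api.strimzi.svc.cluster.local 9999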

It might be similar in nature to the issue reported here.

carlosroman (Contributor) commented

Which version of the Agent/JMXFetch are you running? Running agent status from inside the Agent container will give you both the Agent version and JMXFetch version.
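
For a Helm-based install, one way to do this is to exec into the Agent pod; the pod name below is a placeholder for your actual Agent pod:

    # Placeholder pod name; find yours with `kubectl get pods`.
    kubectl exec -it datadog-agent-xxxxx -c agent -- agent status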

akhstash commented Aug 1, 2023

Helm chart version 3.30.10, which uses Agent version 7.44.1.

JMXFetch
runtime_version : 11.0.18
version : 0.47.8

I just saw the 0.47.9 release - not sure whether the thread leak mentioned there is the issue we encountered?

carlosroman (Contributor) commented

@akhstash I wasn't able to recreate this on the latest version of the Agent using 0.47.9 of JMXFetch. I wonder if your issue was solved with this fix #432?

carlosroman (Contributor) commented

Closing as issue most likely fixed by #432
