Threads grow indefinitely #460

Closed
akhstash opened this issue Jul 27, 2023 · 4 comments


akhstash commented Jul 27, 2023

We run DataDog using a Helm chart in k8s. We recently encountered a situation where a DataDog process grew to tens of thousands of threads, which caused crashes for all other JVM processes running on the same node. Because we use DataDog to monitor a large number of nodes, this caused widespread crashes.

During the investigation, we used the resource metrics from the process dashboard and saw that a number of processes had unbounded growth in thread counts. Here is one example:

[screenshot: process dashboard showing the thread count for one process growing without bound]

We took the PID associated with the process above, checked it on the host, and saw it was JMXFetch:

    java -XX:+UseContainerSupport -classpath /opt/datadog-agent/bin/agent/dist/jmx/jmxfetch.jar org.datadog.jmxfetch.App --ipc_host localhost --ipc_port 5001 --check_period 15000 --thread_pool_size 3 --collection_timeout 60 --reconnection_timeout 60 --reconnection_thread_pool_size 3 --log_level ERROR --reporter statsd:unix:///var/run/datadog/dsd.socket --statsd_queue_size 4096 collect
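
As a side note, for anyone investigating a similar case, the live thread count for a given PID can be confirmed directly on the host; the PID below is a placeholder for the process identified in the dashboard:

    # Placeholder PID; substitute the process found in the dashboard.
    ls /proc/12345/task | wc -l
    # or, equivalently, report the number of lightweight processes (threads):
    ps -o nlwp= -p 12345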

The DataDog Agent has this configuration for some Debezium monitoring:

    ad_identifiers:
      - strimzi
    init_config:
      is_jmx: true
      new_gc_metrics: true
      collect_default_metrics: true
      service_check_prefix: kafka_connect
    instances:
      - host: source-connect-api.strimzi.svc.cluster.local
        port: 9999
        name: kafka-connect-source
        collect_default_jvm_metrics: true
        tags:
          - owner:data
        service: kafka-connect-source

A week before our production systems were affected, we decommissioned the Debezium setup but did not remove the DataDog monitoring. We suspect there is an edge case in JMXFetch where a monitored service that existed initially and was later removed causes the thread growth. After restarting all of the agents, we have not seen the same issue.
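
For anyone checking a similar setup, one way to confirm that the configured JMX endpoint is no longer reachable after the decommission (host and port taken from the instance config above) is a simple connectivity probe from inside the cluster:

    # After the Debezium decommission this would be expected to fail
    # (DNS resolution error or connection refused).
    nc -zv source-connect-api.strimzi.svc.cluster.local 9999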

It might be similar in nature to the issue reported here.

carlosroman (Contributor) commented

Which version of the Agent/JMXFetch are you running? Running agent status from inside the Agent container will give you both the Agent version and JMXFetch version.
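
For a Helm-based install, one way to do this is to exec into the Agent pod; the pod name below is a placeholder for your actual Agent pod:

    # Placeholder pod name; find yours with `kubectl get pods`.
    kubectl exec -it datadog-agent-xxxxx -c agent -- agent status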

akhstash commented Aug 1, 2023

Helm chart version 3.30.10, which uses Agent version 7.44.1.

JMXFetch
runtime_version : 11.0.18
version : 0.47.8

I just saw the 0.47.9 release - not sure whether the thread leak mentioned there is the issue we encountered?

carlosroman (Contributor) commented

@akhstash I wasn't able to recreate this on the latest version of the Agent using 0.47.9 of JMXFetch. I wonder if your issue was solved with this fix #432?

carlosroman (Contributor) commented

Closing as issue most likely fixed by #432
