Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak #402

Closed
terryma opened this issue Jan 2, 2019 · 82 comments
Closed

Memory leak #402

terryma opened this issue Jan 2, 2019 · 82 comments
Labels
bug Something isn't working community

Comments

@terryma
Copy link

terryma commented Jan 2, 2019

We've deployed 0.6.1 across our production service fleets. Out of over a dozen services, we're seeing memory leaks with 2 services. All services are more or less node.js/express based with the following configuration for dd-trace:

import datadogTracer from 'dd-trace';

datadogTracer.init(config);
datadogTracer.use('express', {
    headers: [
      'x-datadog-trace-id',
      'x-datadog-parent-id',
      'x-datadog-sampling-priority',
    ],
  });
datadogTracer.use('http', { splitByDomain: true });
datadogTracer.use('redis', {});
datadogTracer.use('graphql', { service });
datadogTracer.use('pg', {});

Are you aware of any outstanding memory leak issues? I've taken two separate heapdumps with a delta containing the leak, showing a lot of DatadogSpans being retained. I can provide the heapdumps to help troubleshoot.

@rochdev
Copy link
Member

rochdev commented Jan 7, 2019

Were you getting the same memory leak before upgrading to 0.6.1? Which version was used before if any? Do you do custom tracing at all or only rely on the out of the box integrations?

Heap dumps would definitely be useful. I would recommend to open a support ticket and submit the heap dumps there. Also, can you try updating to 0.7.2 to see if it fixes the issue? There is also more logging in that version to detect spans that are not properly finished.

About outstanding memory leaks, the only one I'm aware of is a bug in Node that affects HTTP clients with keep-alive when used with async_hooks, which we use. This however would cause context objects to be kept in memory and not spans. I am not aware of any bug where spans would leak.

@terryma
Copy link
Author

terryma commented Jan 7, 2019

No custom tracing, just out of the box integration for now.

About outstanding memory leaks, the only one I'm aware of is a bug in Node that affects HTTP clients with keep-alive when used with async_hooks, which we use. This however would cause context objects to be kept in memory and not spans.

This actually might be the issue. I just took a look at the heapdump diffs again, and it's definitely a list of Context objects getting retained. Context eventually has a reference to all the spans and I missed this initially

image

Any lights you can shed on this known issue? Any mitigations?

@terryma
Copy link
Author

terryma commented Jan 7, 2019

Is this the issue you're referencing? nodejs/node#19859? Curious to why this only affects 2 of our services, and why it's not more widespread.

@rochdev
Copy link
Member

rochdev commented Jan 7, 2019

Yes, that's the issue I was referencing. It depends on quite a few factors to trigger the leak, so it's possible that only a few services are affected. Of course this is only a theory at this point. A fix was released for this specific bug in Node 8.13.0 and 10.14.2. If you are still getting the same behavior with these versions, it means your issue is a different bug that would probably be in the tracer.

Can you try to redeploy with an updated version of Node?

@terryma
Copy link
Author

terryma commented Jan 9, 2019

Thanks! I updated one of the affected services with node 10.15.0 and so far it looks like the same issue is still happening.

It depends on quite a few factors to trigger the leak

Could you elaborate on that if you have some info? I can try to generate a new heapdump diff with this new version of node, but I'm expecting to see similar behavior.

@rochdev
Copy link
Member

rochdev commented Jan 9, 2019

It depends on quite a few factors to trigger the leak

I meant to trigger the leak from the Node bug, but since you updated it means this is not the issue.

Were you using a version of dd-trace that didn't leak before in these services? If so, what was the version?

If you could share the actual heap dump it would be very helpful. You can send it to me on our public Slack, but I would suggest opening a support ticket as well (you can give me the ticket number on Slack too).

One thing that could be interesting to try to narrow down the issue would be disabling some of the plugins to see if it makes a difference.

@rochdev rochdev added the bug Something isn't working label Jan 9, 2019
@tomonari-t
Copy link

tomonari-t commented Jan 18, 2019

I found the same issue.

In my case, I updated dd-trace from 0.6.2 to 0.7.2 and memory leak happen.

@rochdev
Copy link
Member

rochdev commented Jan 18, 2019

@tomonari-t Are you saying you didn't have any issue before 0.7.2? Most reports so far seem to indicate the issue was present in at least 0.6.0.

@tomonari-t
Copy link

@rochdev Yes. When I used [email protected], memory leak didn't happen.

@rochdev
Copy link
Member

rochdev commented Jan 21, 2019

@tomonari-t Can you try disabling plugins for promise libraries? So far it seems the leak is caused by the scope manager, which could have an impact on these plugins specifically if that's the case. Since these plugins were added in 0.7.0, it would explain why you are just now getting the issue. The plugins that would have to be disabled are bluebird, q and when.

@rochdev
Copy link
Member

rochdev commented Feb 22, 2019

@terryma @tomonari-t Can you try with 0.9.0? We have rewritten some core components that were prone to memory leaks.

@Chris911
Copy link

Just want to add that we're also seeing clear memory leaks with 0.8.0. Will try with 0.9.0 this week and report back.

@Chris911
Copy link

Chris911 commented Feb 26, 2019

Unfortunately we're still seeing memory issues in one of our applications with 0.9.3.

I'll try again with some of the plugins disabled later this week.

@rochdev
Copy link
Member

rochdev commented Feb 26, 2019

In general the most probable causes of leaks are:

  • Scope manager. This should no longer be an issue since 0.9.0 with our new much simpler scope manager.
  • Node memory leak in async_hooks in part versions. If you are using any version lower than 8.15.0 or 10.15.0 it is recommended to update as it may fix the issue.
  • In one of our plugins. Trying to enable/disable them as you're doing is a good way to pinpoint which one.

@Chris911 Are you doing manual instrumentation?

@rochdev
Copy link
Member

rochdev commented Feb 26, 2019

One other thing to keep in mind is some plugins are required to avoid memory leaks. If you are using tracer.init({ plugins: false }) it's possible these don't get enabled. In general unless you have issues it's recommended to not use this configuration option.

@Chris911
Copy link

We do use plugins: false and then enable specific plugins based on configuration. I shared most of our setup with your support team (support ticket ID 204548). I'll report back if I'm able to pinpoint which plugin is causing the issue but it's not easy to debug because we can't reproduce on our dev environments.

@rochdev
Copy link
Member

rochdev commented Feb 26, 2019

@Chris911 Can you try without tracer.init({ plugins: false })?

Another option would be to enable these plugins if any of the corresponding module is used either directly or transitively: bluebird, q, when, generic-pool. Using these modules without the corresponding plugin may cause a memory leak, so worth a try.

@rochdev
Copy link
Member

rochdev commented Feb 26, 2019

@Chris911 Also, we are thinking of deprecating the plugins option as in general it should never be needed and may cause many issues. What is your use case for using it?

@Chris911
Copy link

When we did our initial testing of the APM a few weeks back we were getting memory leaks on our dev environments. They went away when we disabled all plugins and only enable 1-2. I'll try with plugins: true sometime this week.

Also to answer your previous question we're not doing any custom instrumentation.

@freewil
Copy link

freewil commented Feb 28, 2019

We enabled [email protected] with a deployment last night on our cluster serving web traffic (it's an express 4 app) running node v10.15.1 and saw the memory utilization gradually increase over night until we did another deployment/servers restarted this morning.

screen shot 2019-02-28 at 10 59 26 am

@rochdev
Copy link
Member

rochdev commented Feb 28, 2019

@freewil Which version were you using before? Also, which supported modules are you using? Could you provide a snippet of the tracer initialization?

@freewil
Copy link

freewil commented Feb 28, 2019

We weren't using dd-trace before.

Non-native modules in use:

  • express
  • mongodb-core
  • redis
  const ddTrace = require("dd-trace")
  // https://datadog.github.io/dd-trace-js/#tracer-settings
  ddTrace.init({
    enabled: true,
    debug: false,
    service: 'REDACTED',
    hostname: 'REDACTED',
    port: 'REDACTED',
    env: 'REDACTED'
  })

@rochdev
Copy link
Member

rochdev commented Feb 28, 2019

I will try to add an option for tomorrow's release so you can disable individual plugins. This will allow you to try without individual plugins to try and pinpoint the issue. I'll keep you posted when it's available.

@freewil
Copy link

freewil commented Feb 28, 2019

Cool, thanks!

@manuelconvoy
Copy link

manuelconvoy commented Mar 4, 2019

@rochdev I work with @terryma, I upgraded the services showing the memory leak to dd-trace 0.9.3 and the memory leak is gone, thanks for your help!

@freewil
Copy link

freewil commented Mar 5, 2019

@rochdev

I will try to add an option for tomorrow's release so you can disable individual plugins.

Any ETA on this? Was this an option that was removed in a recent release? It appears you can/could do this in the code examples above with tracer.init({ plugins: false }) ?

@rochdev
Copy link
Member

rochdev commented Mar 5, 2019

@freewil The release that was planned last Friday was pushed back and I don't have an ETA right away, but I did release a beta as 0.10.0-beta.0 that contains this functionality by doing tracer.use('express', false) for example.

The reason I prefer to not recommend going the tracer.init({ plugins: false }) route is that it can actually cause more harm than good since it will remove plugins that are important for the health of the tracer.

I would say to try 0.10.0-beta.0 and try disabling plugins one by one to try and pinpoint the problem. Then it will be much easier to try and find the issue.

@freewil
Copy link

freewil commented Mar 5, 2019

Ok - thanks for the update and info!

@rochdev
Copy link
Member

rochdev commented Apr 12, 2019

It would help. We will also release a new version next week that will have a lot of insights into what is happening in the runtime. I'll keep this thread posted, but it should help determine what is causing the leak.

@nodesocket
Copy link

I encountered the memory leak quickly, only took a few days before Node.js processes crash out of memory. Not running the mysql integration though, running mongodb integration.

@bjaworski3
Copy link

Any updates on this memory leak? Just deployed dd-trace 0.10.2 on Node 8 and I'm still seeing a memory leak.

@rochdev
Copy link
Member

rochdev commented May 14, 2019

It looks like there is not a single cause of memory leak but multiple. I'm currently working on fixing some of them.

This is what we found so far:

  • The span from the net plugin is never finished if the socket times out. This causes its trace to stay in memory forever. A fix will be in the next release
  • Older Node versions had an issue where the destroy hook would never be called for some resources in async_hooks. Since we rely on this hook to cleanup the context, it was causing a memory leak. This has been fixed in Node ^8.15 || ^10.15 || >=12. Updating Node fixes this issue.

We have also added a new Runtime Metrics feature which can help us a lot to pinpoint the source of the leak. The feature is in beta and has to be enabled by us for your account. If you want me to investigate your account specifically using Runtime Metrics, please let me know on our public Slack. My handle is rochdev.

The above is also valid for everyone in this thread. Finding the source(s) of the memory leak is high priority for us, but it's very difficult to find without working directly with users get access to the account, so definitely reach out in Slack if you want to work with us on this.

@rochdev
Copy link
Member

rochdev commented Jun 10, 2019

A memory leak in the TCP integration that would impact other integrations has been fixed in 0.11.2, and we have added workarounds for known issues in Node that would cause memory leaks as well in 0.12.0.

I would recommend to anyone in this thread to upgrade the tracer to 0.12.0 and report back if the memory leak is still present or not.

@nodesocket
Copy link

@rochdev I just enabled the Node.js tracer back in our application running 0.12.0. However I don't see anything listed in APM => Services.

I do however see recent activity in APM => Trace list.

Screen Shot 2019-06-10 at 5 03 09 PM

Screen Shot 2019-06-10 at 5 03 17 PM

@rochdev
Copy link
Member

rochdev commented Jun 10, 2019

@nodesocket If you haven't been using the tracer for a while, the service list was reset. New services might take a few minutes to appear.

@nodesocket
Copy link

@rochdev been over 15 minutes, should I wait longer or is something wrong?

@rochdev
Copy link
Member

rochdev commented Jun 10, 2019

@nodesocket Usually it takes around 5-10 minutes, but we've been experiencing unusual delays today, so it's possible it will take more time. In general if you can see the traces and they have the correct service on them the services should eventually appear.

This delay only exists when a new service is added, which is every service when you weren't previously using the tracer.

@xzyfer
Copy link

xzyfer commented Jul 8, 2019

I can confirm a pretty significant memory leak is preset in 0.12.1. We deployed a portion of our cluster with the agent disabled with the following results.

image

Using the following code

if (process.env.DATADOG_AGENT_HOST) {
  const enabled = Math.random() > 0.5 ? true : false;

  tracer.init({
    hostname: process.env.DATADOG_AGENT_HOST,
    enabled: enabled,
    runtimeMetrics: enabled,
    env: process.env.ENVIRONMENT,
    tags: ['app:name', `environment:${process.env.ENVIRONMENT}`],
    sampleRate: 0.1
    plugins: false,
  });
}

We're using Node 11.15.0 for this trace, but the issue persists on 10.16.0 also.

We've determined that using the noop scope prevents the leak. As such we tried using the async-listener scope which also exhibited the same leak as the default async-hooks scope.

@rochdev
Copy link
Member

rochdev commented Jul 8, 2019

@xzyfer At this point I would say there are 2 likely causes for leaks:

  1. There is a pre-existing leak in the application that the tracer is making worse. With the latest version, all memory leaks reported so far were caused by this. We will have an improvement in the next version that will lessen the impact of the tracer in this scenario, but it's still important to find the root cause of the leak.

  2. There is a memory leak in one of our integrations.

In both cases, we would need to take a look at the runtime metrics in your account to confirm what is happening. I would recommend to direct message me on our public Slack (I'm @rochdev) or open a support ticket. You can find the documentation to enable runtime metrics here.

@xzyfer
Copy link

xzyfer commented Jul 8, 2019

@rochdev interestingly it was the runtime metrics that pointed us to looking at the APM library itself.

image

Note that the async resource counter is never decremented until the instance is OOM killed.

There is a memory leak in one of our integrations.

We thought this also so we disabled all plugins and run time metrics just in case.

@xzyfer
Copy link

xzyfer commented Jul 8, 2019

There is a pre-existing leak in the application that the tracer is making worse

After digging deeper into async.resources.by.type I think you may be onto something.

@rochdev
Copy link
Member

rochdev commented Jul 8, 2019

This is a very high number of async resources. I have 2 questions that could help narrow down the issue:

  1. Which async resources have the highest count?
  2. Is there a high count of unfinished spans on the heap or are they almost all finished?

@xzyfer
Copy link

xzyfer commented Jul 8, 2019 via email

@rochdev
Copy link
Member

rochdev commented Jul 8, 2019

  1. Interesting. In the tracer we only use 2 timers that are instantiated only once per process. Is it possible something in your app would hold a reference to a timer?
  2. In the graph above in Heap Spans each line represent a specific span with its state.

@xzyfer
Copy link

xzyfer commented Jul 8, 2019

  1. We have no timers in our code. The application is an Apollo GraphQL Server with a handful of very simple resolvers.
  2. They appear to predominately, by orders of magnitude, be finished spans.

@xzyfer
Copy link

xzyfer commented Jul 9, 2019

I think the timeouts Async resource refers to network request timeouts to downstream

@xzyfer
Copy link

xzyfer commented Jul 9, 2019

I've continuing to augment our reporting. The current candidate looks like node-fetch's timeout handling

https://github.com/bitinn/node-fetch/blob/0c2294ec48fa5b84519f8bdd60f4e2672ebd9b06/src/index.js#L88-L93

@xzyfer
Copy link

xzyfer commented Jul 9, 2019

Ultimately it looks like pm2 is the root cause. There is a known memory leak that was recently patched in https://github.com/keymetrics/pm2-io-apm/releases/tag/4.2.2.

Even with this patch applied we're experiencing a leak but much slower than previously.

Left: @pm2/[email protected]
Middle: no pm2
Right: @pm2/[email protected]

@rochdev
Copy link
Member

rochdev commented Jul 9, 2019

They appear to predominately, by orders of magnitude, be finished spans.

Perfect, this means that it's not spans holding async resources but the other way around.

Even with this patch applied we're experiencing a leak but much slower than previously.

Does it still seem to be caused by ever increasing async resources?

@rochdev
Copy link
Member

rochdev commented Jul 9, 2019

Actually I just noticed that the middle line is stable without PM2. This seems to indicate that there is still memory leak in the latest release of PM2 and not in dd-trace. Was that the outcome of your investigation?

@brian-l
Copy link

brian-l commented Jul 31, 2019

I noticed that the async hooks are being enabled but never disabled. Is this intentional? https://github.com/DataDog/dd-trace-js/blob/master/packages/dd-trace/src/scope/async_hooks.js#L31

I'm seeing memory leaks happen only in my jest test environment (node 8.16, dd-trace 0.13.1, jest 23) when watching for changes; it seems that restarting the tests repeatedly (including dd-trace setup) will eventually cause FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory and the last line of code in the stack trace is related to async hooks Security context: 0x1a91fcd25891 <JSObject> 1: _init [/code/node_modules/dd-trace/packages/dd-trace/src/scope/async_hooks.js:~52] [pc=0x1e881f6f83ee](this=0xc08dc4859a9 <Scope map = 0x3d2e1d02781>,asyncId=1022343,type=0x1b58ae12951 <String[10]: TickObject>,triggerAsyncId=1020161,resource=0x98a3be27f79 <TickObject map = 0x2725aef10e81>) 2: nextTick [internal/process/next_tick.js:~246] [pc=0x1e881f9790a3](this=0x3...

@rochdev
Copy link
Member

rochdev commented Jul 31, 2019

@brian-l This might be because of the way Jest works. For example, if it clears the require cache while the tracer is still active, it will just keep adding new tracers constantly without cleaning up the old one, taking up more and more memory.

This could be addressed by adding a way to enable/disable the tracer at runtime, which could be added after each run for example. This will be available in the coming weeks.

It's possible that it's caused by something else, but in a test environment I would say this is the most likely cause. We will be able to validate this once the functionality to disable lands.

In the meantime, I would recommend to simply disable the tracer in your tests by using the DD_TRACE_ENABLED=false environment variable.

@brian-l
Copy link

brian-l commented Aug 1, 2019

@brian-l This might be because of the way Jest works. For example, if it clears the require cache while the tracer is still active, it will just keep adding new tracers constantly without cleaning up the old one, taking up more and more memory.

This could be addressed by adding a way to enable/disable the tracer at runtime, which could be added after each run for example. This will be available in the coming weeks.

It's possible that it's caused by something else, but in a test environment I would say this is the most likely cause. We will be able to validate this once the functionality to disable lands.

In the meantime, I would recommend to simply disable the tracer in your tests by using the DD_TRACE_ENABLED=false environment variable.

that didn't change anything unfortunately, however this investigation has revealed other memory leaks that need to be fixed as well so our issues are not caused by this library as far as I can tell. thanks.

@rochdev
Copy link
Member

rochdev commented Oct 2, 2019

We've had several reports that the issue is fixed completely, and no new report of memory leaks for several months at this point, so closing this issue.

@rochdev rochdev closed this as completed Oct 2, 2019
@devilleweppenaar
Copy link

We're running dd-trace 0.16.1, and can confirm that there is definitely a memory leak. We tried disabling plugins, but disabling tracing completely is the only thing that stopped our services from running out of memory and crashing.

Should I log a new, separate issue?

@rochdev
Copy link
Member

rochdev commented Nov 22, 2019

@devillexio Please open a new issue. It will make it easier to keep any underlying causes separated.

@devilleweppenaar
Copy link

@rochdev As requested, I logged #759.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working community
Projects
None yet
Development

No branches or pull requests