Memory leak #402
Were you getting the same memory leak before upgrading to 0.6.1? Which version were you using before, if any? Do you do any custom tracing, or do you rely only on the out-of-the-box integrations? Heap dumps would definitely be useful; I would recommend opening a support ticket and submitting the heap dumps there. Also, can you try updating to 0.7.2 to see if it fixes the issue? That version also has more logging to detect spans that are not properly finished. As for outstanding memory leaks, the only one I'm aware of is a bug in Node that affects HTTP clients with keep-alive when used with |
No custom tracing, just out of the box integration for now.
This actually might be the issue. I just took a look at the heapdump diffs again, and it's definitely a list of Any light you can shed on this known issue? Any mitigations? |
Is this the issue you're referencing: nodejs/node#19859? Curious why this only affects 2 of our services, and why it's not more widespread. |
Yes, that's the issue I was referencing. It depends on quite a few factors to trigger the leak, so it's possible that only a few services are affected. Of course, this is only a theory at this point. A fix was released for this specific bug in Node. Can you try to redeploy with an updated version of Node? |
Thanks! I updated one of the affected services with
Could you elaborate on that if you have some info? I can try to generate a new heapdump diff with this new version of node, but I'm expecting to see similar behavior. |
I meant to trigger the leak from the Node bug, but since you updated, it means this is not the issue. Were you using a version of If you could share the actual heap dump, it would be very helpful. You can send it to me on our public Slack, but I would suggest opening a support ticket as well (you can give me the ticket number on Slack too). One thing that could be interesting to try, in order to narrow down the issue, would be disabling some of the plugins to see if it makes a difference. |
I found the same issue. In my case, I updated dd-trace from |
@tomonari-t Are you saying you didn't have any issue before 0.7.2? Most reports so far seem to indicate the issue was present in at least 0.6.0. |
@rochdev Yes. When I used |
@tomonari-t Can you try disabling plugins for promise libraries? So far it seems the leak is caused by the scope manager, which could have an impact on these plugins specifically if that's the case. Since these plugins were added in 0.7.0, it would explain why you are just now getting the issue. The plugins that would have to be disabled are |
@terryma @tomonari-t Can you try with 0.9.0? We have rewritten some core components that were prone to memory leaks. |
Just want to add that we're also seeing clear memory leaks with 0.8.0. Will try with 0.9.0 this week and report back. |
Unfortunately we're still seeing memory issues in one of our applications with 0.9.3. I'll try again with some of the plugins disabled later this week. |
In general the most probable causes of leaks are:
@Chris911 Are you doing manual instrumentation? |
One other thing to keep in mind is that some plugins are required to avoid memory leaks. If you are using |
We do use |
@Chris911 Can you try without Another option would be to enable these plugins if any of the corresponding modules are used, either directly or transitively: |
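To make the allowlist approach above concrete, here is a minimal sketch assuming the standard dd-trace-js plugins: false init option together with tracer.use() calls; the plugin names shown are illustrative examples, not the elided list from the comment.

// A sketch of the allowlist approach: disable automatic plugin loading and
// enable only the integrations the service actually uses, one at a time,
// to narrow down which one leaks. Plugin names here are examples.
const tracer = require('dd-trace').init({
  service: 'my-service', // hypothetical service name
  plugins: false         // do not auto-enable every integration
})

tracer.use('http')
tracer.use('express')
tracer.use('mongodb-core')

module.exports = tracer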
@Chris911 Also, we are thinking of deprecating the |
When we did our initial testing of the APM a few weeks back, we were getting memory leaks in our dev environments. They went away when we disabled all plugins and only enabled 1-2. I'll try with Also, to answer your previous question, we're not doing any custom instrumentation. |
We enabled |
@freewil Which version were you using before? Also, which supported modules are you using? Could you provide a snippet of the tracer initialization? |
We weren't using dd-trace before. Non-native modules in use:
const ddTrace = require("dd-trace")
// https://datadog.github.io/dd-trace-js/#tracer-settings
ddTrace.init({
enabled: true,
debug: false,
service: 'REDACTED',
hostname: 'REDACTED',
port: 'REDACTED',
env: 'REDACTED'
}) |
I will try to add an option in tomorrow's release so you can disable individual plugins. This will let you run without specific plugins to help pinpoint the issue. I'll keep you posted when it's available. |
Cool, thanks! |
Any ETA on this? Was this an option that was removed in a recent release? It appears you can/could do this in the code examples above with |
@freewil The release that was planned for last Friday was pushed back and I don't have an ETA right away, but I did release a beta as The reason I prefer not to recommend going the I would say to try |
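The exact option shipped in that beta isn't preserved in the comment above, so the following is only a hedged sketch of what disabling individual integrations looks like; it assumes the per-plugin enabled flag documented in later dd-trace-js releases, and the promise-library plugin names come from the earlier suggestion in this thread.

// A sketch of turning off individual integrations while keeping the rest of
// the automatic instrumentation on. Whether the beta used exactly this
// { enabled: false } shape is an assumption.
const tracer = require('dd-trace').init()

// Disable only the integrations suspected of leaking (the promise libraries
// mentioned earlier in the thread).
tracer.use('bluebird', { enabled: false })
tracer.use('q', { enabled: false })
tracer.use('when', { enabled: false })

module.exports = tracer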
Ok - thanks for the update and info! |
It would help. We will also release a new version next week that will provide a lot of insight into what is happening in the runtime. I'll keep this thread posted; it should help determine what is causing the leak. |
I encountered the memory leak quickly; it only took a few days before Node.js processes crashed out of memory. I'm not running the mysql integration, though, just the mongodb integration. |
Any updates on this memory leak? Just deployed dd-trace 0.10.2 on Node 8 and I'm still seeing a memory leak. |
It looks like there is not a single cause of the memory leak but multiple. I'm currently working on fixing some of them. This is what we have found so far:
We have also added a new Runtime Metrics feature which can help us a lot in pinpointing the source of the leak. The feature is in beta and has to be enabled by us for your account. If you want me to investigate your account specifically using Runtime Metrics, please let me know on our public Slack; my handle is rochdev. The above is also valid for everyone in this thread. Finding the source(s) of the memory leak is a high priority for us, but it's very difficult without working directly with users who can give us access to their account, so definitely reach out on Slack if you want to work with us on this. |
A memory leak in the TCP integration that would impact other integrations has been fixed in 0.11.2, and we have also added workarounds in 0.12.0 for known issues in Node that would cause memory leaks. I would recommend that anyone in this thread upgrade the tracer to 0.12.0 and report back whether the memory leak is still present. |
@rochdev I just enabled the Node.js tracer again in our application running I do, however, see recent activity in APM => Trace list. |
@nodesocket If you haven't been using the tracer for a while, the service list was reset. New services might take a few minutes to appear. |
@rochdev been over 15 minutes, should I wait longer or is something wrong? |
@nodesocket Usually it takes around 5-10 minutes, but we've been experiencing unusual delays today, so it's possible it will take more time. In general, if you can see the traces and they have the correct service on them, the services should eventually appear. This delay only exists when a new service is added, which in your case means every service, since you weren't previously using the tracer. |
I can confirm a pretty significant memory leak is present in 0.12.1. We deployed a portion of our cluster with the agent disabled, with the following results. Using the following code:

if (process.env.DATADOG_AGENT_HOST) {
  // Randomly enable the tracer on roughly half the instances to compare memory usage
  const enabled = Math.random() > 0.5 ? true : false;
  tracer.init({
    hostname: process.env.DATADOG_AGENT_HOST,
    enabled: enabled,
    runtimeMetrics: enabled,
    env: process.env.ENVIRONMENT,
    tags: ['app:name', `environment:${process.env.ENVIRONMENT}`],
    sampleRate: 0.1,
    plugins: false,
  });
}

We're using Node 11.15.0 for this trace, but the issue persists on 10.16.0 also. We've determined that using the noop scope prevents the leak. We also tried the async-listener scope, which exhibited the same leak as the default async-hooks scope. |
@xzyfer At this point I would say there are 2 likely causes for leaks:
In both cases, we would need to take a look at the runtime metrics in your account to confirm what is happening. I would recommend direct messaging me on our public Slack (I'm @rochdev) or opening a support ticket. You can find the documentation to enable runtime metrics here. |
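For anyone following along, the application-side setup for runtime metrics is roughly the sketch below. It assumes the runtimeMetrics init option already shown in the snippet above, and that the Datadog agent is reachable with DogStatsD enabled (UDP port 8125 by default); treat those details as assumptions rather than a definitive setup.

// A minimal sketch for turning on runtime metrics in the tracer.
const tracer = require('dd-trace').init({
  hostname: process.env.DATADOG_AGENT_HOST, // agent host, as in the other snippets in this thread
  runtimeMetrics: true                      // report heap, event loop and async resource metrics
})

module.exports = tracer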
@rochdev interestingly it was the runtime metrics that pointed us to looking at the APM library itself. Note that the async resource counter is never decremented until the instance is OOM killed.
We thought this also, so we disabled all plugins and runtime metrics just in case. |
After digging deeper into |
This is a very high number of async resources. I have 2 questions that could help narrow down the issue:
1. Which async resources have the highest count?
2. Is there a high count of unfinished spans on the heap, or are they almost all finished?
|
1. The async resources are predominantly timeouts.
2. Not sure how exactly to confirm this? |
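For others trying to answer the first question, one way to see which async resource types keep growing is a small standalone counter built on async_hooks, sketched below. It is not part of dd-trace, and the 30-second dump interval and top-10 cutoff are arbitrary choices.

// count-async-resources.js: tally live async resources by type so the type
// with an ever-growing count stands out.
'use strict'

const async_hooks = require('async_hooks')
const fs = require('fs')

const liveByType = new Map() // type -> count of resources not yet destroyed
const typeById = new Map()   // asyncId -> type, so destroy() can decrement the right bucket

async_hooks.createHook({
  init (asyncId, type) {
    typeById.set(asyncId, type)
    liveByType.set(type, (liveByType.get(type) || 0) + 1)
  },
  destroy (asyncId) {
    const type = typeById.get(asyncId)
    if (type !== undefined) {
      typeById.delete(asyncId)
      liveByType.set(type, liveByType.get(type) - 1)
    }
  }
}).enable()

// Periodically dump the ten types with the most live resources. fs.writeSync
// writes synchronously, so the reporting itself does not add async resources
// to the counts it prints.
setInterval(() => {
  const top = [...liveByType.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10)
  fs.writeSync(1, `live async resources: ${JSON.stringify(top)}\n`)
}, 30000).unref()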
I think the Timeout async resource refers to network request timeouts to downstream |
I've been continuing to augment our reporting. The current candidate looks like node-fetch's timeout handling |
Ultimately it looks like pm2 is the root cause. There is a known memory leak that was recently patched in https://github.com/keymetrics/pm2-io-apm/releases/tag/4.2.2. Even with this patch applied, we're experiencing a leak, but much slower than previously. Graph legend, left: @pm2/[email protected] |
Perfect, this means that it's not spans holding async resources but the other way around.
Does it still seem to be caused by ever increasing async resources? |
Actually I just noticed that the middle line is stable without PM2. This seems to indicate that there is still a memory leak in the latest release of PM2 and not in |
I noticed that the async hooks are being enabled but never disabled. Is this intentional? https://github.com/DataDog/dd-trace-js/blob/master/packages/dd-trace/src/scope/async_hooks.js#L31 I'm seeing memory leaks happen only in my jest test environment (node 8.16, dd-trace 0.13.1, jest 23) when watching for changes; it seems that restarting the tests repeatedly (including dd-trace setup) will eventually cause |
@brian-l This might be because of the way Jest works. For example, if it clears the require cache while the tracer is still active, it will just keep adding new tracers constantly without cleaning up the old ones, taking up more and more memory. This could be addressed by adding a way to enable/disable the tracer at runtime, which could then be called after each run, for example. This will be available in the coming weeks. It's possible that it's caused by something else, but in a test environment I would say this is the most likely cause. We will be able to validate this once the functionality to disable lands. In the meantime, I would recommend simply disabling the tracer in your tests by using the |
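The option being recommended is cut off above, but one low-risk way to keep the tracer out of a Jest run is to gate initialization on the environment, as in this sketch. The NODE_ENV check relies on Jest setting NODE_ENV to 'test' by default, the service name is hypothetical, and the enabled init option is the same one used in the configuration snippets earlier in this thread.

// tracer.js: skip tracing entirely when running under Jest.
const tracer = require('dd-trace').init({
  enabled: process.env.NODE_ENV !== 'test', // Jest sets NODE_ENV to 'test' by default
  service: 'my-service'                     // hypothetical service name
})

module.exports = tracer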
That didn't change anything, unfortunately. However, this investigation has revealed other memory leaks that need to be fixed as well, so our issues are not caused by this library as far as I can tell. Thanks. |
We've had several reports that the issue is fixed completely, and no new reports of memory leaks for several months at this point, so I'm closing this issue. |
We're running dd-trace 0.16.1, and can confirm that there is definitely a memory leak. We tried disabling plugins, but disabling tracing completely is the only thing that stopped our services from running out of memory and crashing. Should I log a new, separate issue? |
@devillexio Please open a new issue. It will make it easier to keep any underlying causes separated. |
We've deployed 0.6.1 across our production service fleets. Out of over a dozen services, we're seeing memory leaks in 2 of them. All services are more or less Node.js/Express based, with the following configuration for dd-trace:
Are you aware of any outstanding memory leak issues? I've taken two separate heapdumps with a delta containing the leak, showing a lot of DatadogSpans being retained. I can provide the heapdumps to help troubleshoot.