
[Major Bug] Memory leak causing increased CPU/Memory usage over time on idle host #71

Closed
mlehner616 opened this issue Feb 24, 2019 · 16 comments

Comments

mlehner616 (Contributor) commented Feb 24, 2019

Environment:
I have two completely idle EC2 Amazon Linux 2 instances, both with Docker installed and no containers running. One has lifecycled v3.0.1 installed; the other, used as a control, does not have lifecycled installed.

Behavior:
See attached metrics:
[metrics graph comparing the lifecycled instance with the control]

htop snapshot from the affected instance, showing heavy CPU/memory usage by lifecycled:
[htop screenshot]

Command line used in systemd unit:

/opt/lifecycled/lifecycled --cloudwatch-group=${LIFECYCLED_CLOUDWATCH_GROUP} --handler=${LIFECYCLED_HANDLER} --sns-topic=${LIFECYCLED_SNS_TOPIC} --json

The instance running lifecycled gradually consumes more memory and CPU over time, while the control instance with just Docker installed remains idle/flat.

Expected behavior:
Instance CPU and memory usage should remain mostly idle/flat over time.

lox (Contributor) commented Feb 24, 2019

Sorry you are having issues @mlehner616! Are you able to provide logs for the lifecycled instances? It's very odd that they are consuming so much CPU and also that there are many processes running 🤔

mlehner616 (Contributor, Author) commented Feb 24, 2019

My logs look completely normal. I ran pprof and have a suspicion that something in SpotListener is the culprit. I've added --no-spot to my instances, since I don't really need the SpotListener in my case, and will report back what I find after a couple of days. If the leak goes away with --no-spot, that should narrow it down a bit.

I did find that SpotListener was creating a new ticker with time.NewTicker on every iteration of its polling loop, so I submitted a PR to move the ticker creation out of the loop. That's the only thing that stands out to me.
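For anyone skimming later, here is a minimal sketch of the leaky shape being described (illustrative only: the function name, interval, and logging are placeholders, not the actual lifecycled source):

```go
package main

import (
	"context"
	"log"
	"time"
)

// leakySpotLoop sketches the problematic pattern: time.NewTicker is called on
// every pass through the loop, so each iteration allocates a brand-new ticker
// whose underlying runtime timer is never stopped and so is never reclaimed.
func leakySpotLoop(ctx context.Context) {
	for {
		select {
		case <-time.NewTicker(5 * time.Second).C: // new, never-stopped ticker each pass
			log.Println("poll the spot termination notice endpoint here")
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	leakySpotLoop(ctx)
}
```

Each abandoned ticker keeps re-arming and firing into a channel nobody reads, so the leaked timers accumulate for as long as the process runs.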

mlehner616 (Contributor, Author) commented:

I'm also going to patch one instance with my branch and leave --no-spot off, to confirm whether that resolves it as well.

Other than this issue, this package works like a dream, so 👍.

lox (Contributor) commented Feb 24, 2019

I am seeing higher-than-expected CPU usage on one of our longer-lived instances. I'm running a patched version of lifecycled with pprof hooks and will see how it goes.

At a guess, this is going to be the cloudwatch logging code.
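For anyone wanting to reproduce this kind of investigation: wiring pprof hooks into a Go daemon is usually just an import of net/http/pprof plus a local HTTP listener. A minimal sketch follows; the port and layout are chosen for illustration and do not necessarily reflect how the patched lifecycled is wired up:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on localhost only, alongside the
	// daemon's normal work (represented here by the blocking select).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {}
}
```

Heap and CPU profiles can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/heap` and `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.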

itsdalmo (Contributor) commented Feb 25, 2019

I'm seeing the same upwards trajectory on CPU utilisation over time on a couple of instances I have running lifecycled v3.0.1 (not using --no-spot). Uptime is currently 10 days, and utilisation started at 1% and is now hovering around 17%, with memory usage <1%.

According to the Go docs, I think @mlehner616 is correct about our use of time.NewTicker being the culprit: we have to explicitly Stop() the ticker to release its resources and allow it to be garbage collected. With that in mind I'd also add a follow-up PR to #72 which defers the Stop() when the listener returns.
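A sketch of the fixed shape, for reference (again illustrative rather than the literal lifecycled code): the ticker is created once outside the loop and stopped via defer when the listener returns.

```go
package main

import (
	"context"
	"log"
	"time"
)

// fixedSpotLoop creates a single ticker outside the loop and defers Stop(),
// so the ticker's resources are released when the listener returns.
func fixedSpotLoop(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			log.Println("poll the spot termination notice endpoint here")
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	fixedSpotLoop(ctx)
}
```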

itsdalmo (Contributor) commented Feb 25, 2019

Added a PR with the tiny fix discussed above.
Thanks for figuring out and fixing this @mlehner616 🙌

mlehner616 (Contributor, Author) commented:

I think I found one more thing, in the cloudwatch code. Moving the ticker creation alone did not seem to fix the issue, though it does look like it slowed the increase down. I also added a patch to the cloudwatch code which seems to resolve the leak, but I'm not sure if it has other side effects; it would be good to have that confirmed by someone more familiar with the code. I'll send a PR for the cloudwatch change in a few minutes for review.

Here's a screenshot of lifecycled with the patched SpotListener:
[screenshot]

Here's a screenshot of lifecycled with both the patched SpotListener and the patched cloudwatch code:
[screenshot]

itsdalmo (Contributor) commented Mar 2, 2019

Is the CPU usage perhaps too low in that last screenshot? It only peaks at 0.03% CPU, even though we have timers, looping API requests, and all log entries being shipped directly to CloudWatch. Ref. my comments in #74: I don't understand how ioutil.Discard could have such a big effect on CPU usage 😅

mlehner616 (Contributor, Author) commented:

Yeah, I was a bit surprised myself about this one 🤗. I'll keep digging; I have a couple more things to try, and I'm starting to wonder if I accidentally uncovered something going on with my version of journald itself.

lox (Contributor) commented Mar 2, 2019

Might it be possible to run a CPU trace and see what the culprit is @mlehner616? (https://blog.golang.org/profiling-go-programs)

lox (Contributor) commented Mar 2, 2019

The new trace stuff is kind of amazing too: https://medium.com/@cep21/using-go-1-10-new-trace-features-to-debug-an-integration-test-1dc39e4e812d
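A minimal runtime/trace harness, for reference (the output file name and the sleep standing in for the workload are placeholders):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Record a runtime trace for the duration of the workload.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	time.Sleep(2 * time.Second) // stand-in for the code being investigated
}
```

The resulting file can then be opened with `go tool trace trace.out`.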

itsdalmo (Contributor) commented Mar 3, 2019

I fired up lifecycled as it exists on the master branch right now, with the spot listener enabled, and left it running for (almost) 24h. The CPU usage, measured at 5-minute intervals, looks good to me:

[CPU usage screenshot, 2019-03-03]

How are things looking on your end @mlehner616?

itsdalmo (Contributor) commented Mar 6, 2019

Having run the example above for ~4 days now, I can't see any increase in CPU usage. I think #72 has solved this, and that we can consider the issue closed without any further code changes - or are you seeing something else on your end, @mlehner616?

itsdalmo (Contributor) commented Mar 7, 2019

@lox - Perhaps a 3.0.2 release should be cut now that the changes in #72 (and #73) have been merged? Any further fixes can just go in another release.

mlehner616 (Contributor, Author) commented:

I've confirmed at this point that #72 and #73 together resolve this issue. I've also confirmed that adding --no-spot is a workaround for versions <=3.0.1 if the spot listener is not required. Thank you for the quick attention on this and thank you for maintaining this awesome tool. 👍

Looking forward to a 3.0.2 release.

lox (Contributor) commented Mar 10, 2019

Thanks @mlehner616 and @itsdalmo, putting out a 3.0.2 release now!

lox closed this as completed on Mar 10, 2019.