
[Major Bug] Memory leak causing increased CPU/Memory usage over time on idle host #71

Closed
mlehner616 opened this issue Feb 24, 2019 · 16 comments

Comments

mlehner616 (Contributor) commented Feb 24, 2019

Environment:
I have two completely idle EC2 Amazon Linux 2 instances, both with Docker installed and no containers running. One has lifecycled v3.0.1 installed; the other, used as a control, does not have lifecycled installed.

Behavior:
See attached metrics:
[metrics graph comparing the lifecycled instance with the control]

htop snapshot from the affected instance, showing heavy CPU/memory usage by lifecycled:
[htop screenshot]

Command line used in systemd unit:

/opt/lifecycled/lifecycled --cloudwatch-group=${LIFECYCLED_CLOUDWATCH_GROUP} --handler=${LIFECYCLED_HANDLER} --sns-topic=${LIFECYCLED_SNS_TOPIC} --json

The instance running lifecycled gradually consumes more memory and CPU over time, while the control instance with just Docker installed remains idle/flat.

Expected behavior:
Instance CPU and memory usage should remain mostly idle/flat over time.

lox (Contributor) commented Feb 24, 2019

Sorry you are having issues @mlehner616! Are you able to provide logs for the lifecycled instances? It's very odd that they are consuming so much CPU and also that there are many processes running 🤔

mlehner616 (Contributor, Author) commented Feb 24, 2019

My logs look completely normal. I ran pprof and have a suspicion that something in SpotListener is the culprit. I've added --no-spot to my instances, since I don't really need the SpotListener in my case, and will report back what I find after a couple of days. If the leak goes away with --no-spot, that should narrow it down a bit.

I did find that SpotListener was creating a new ticker with time.NewTicker on every iteration of its polling loop, so I submitted a PR to move the ticker creation out of the loop. That's the only thing that stands out to me.
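For anyone skimming later, here is a minimal sketch of the leaky shape being described (illustrative only: the function name, interval, and logging are placeholders, not the actual lifecycled source):

```go
package main

import (
	"context"
	"log"
	"time"
)

// leakySpotLoop sketches the problematic pattern: time.NewTicker is called on
// every pass through the loop, so each iteration allocates a brand-new ticker
// whose underlying runtime timer is never stopped and so is never reclaimed.
func leakySpotLoop(ctx context.Context) {
	for {
		select {
		case <-time.NewTicker(5 * time.Second).C: // new, never-stopped ticker each pass
			log.Println("poll the spot termination notice endpoint here")
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	leakySpotLoop(ctx)
}
```

Each abandoned ticker keeps re-arming and firing into a channel nobody reads, so the leaked timers accumulate for as long as the process runs.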

mlehner616 (Contributor, Author) commented:

I'm also going to patch one instance with my branch and leave --no-spot off, to confirm whether that resolves it as well.

Other than this issue, this package works like a dream, so 👍.

lox (Contributor) commented Feb 24, 2019

I am seeing higher-than-expected CPU usage on one of our longer-lived instances. I'm running a patched version of lifecycled with pprof hooks and will see how it goes.

At a guess, this is going to be the cloudwatch logging code.
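For anyone wanting to reproduce this kind of investigation: wiring pprof hooks into a Go daemon is usually just an import of net/http/pprof plus a local HTTP listener. A minimal sketch follows; the port and layout are chosen for illustration and do not necessarily reflect how the patched lifecycled is wired up:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on localhost only, alongside the
	// daemon's normal work (represented here by the blocking select).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {}
}
```

Heap and CPU profiles can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/heap` and `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.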

itsdalmo (Contributor) commented Feb 25, 2019

I'm seeing the same upwards trajectory on CPU utilisation over time on a couple of instances I have running lifecycled v3.0.1 (not using --no-spot). Uptime is currently 10 days, and utilisation started at 1% and is now hovering around 17%, with memory usage <1%.

According to the Go docs, I think @mlehner616 is correct about our use of time.NewTicker being the culprit: we have to explicitly Stop() the ticker to release its resources and allow it to be garbage collected. With that in mind I'd also add a follow-up PR to #72 which defers the Stop() when the listener returns.
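A sketch of the fixed shape, for reference (again illustrative rather than the literal lifecycled code): the ticker is created once outside the loop and stopped via defer when the listener returns.

```go
package main

import (
	"context"
	"log"
	"time"
)

// fixedSpotLoop creates a single ticker outside the loop and defers Stop(),
// so the ticker's resources are released when the listener returns.
func fixedSpotLoop(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			log.Println("poll the spot termination notice endpoint here")
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	fixedSpotLoop(ctx)
}
```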

itsdalmo (Contributor) commented Feb 25, 2019

Added a PR with the tiny fix discussed above.
Thanks for figuring out and fixing this @mlehner616 🙌

mlehner616 (Contributor, Author) commented:

I think I found one more thing, in the cloudwatch code. Moving the ticker creation alone did not seem to fix the issue, though it does look like it slowed the increase down. I also added a patch to the cloudwatch code which seems to resolve the leak, but I'm not sure if it has other side effects; it would be good to have that confirmed by someone more familiar with the code. I'll send a PR for the cloudwatch change in a few minutes for review.

Here's a screenshot of lifecycled with the patched SpotListener:
[screenshot]

Here's a screenshot of lifecycled with both the patched SpotListener and the patched cloudwatch code:
[screenshot]

itsdalmo (Contributor) commented Mar 2, 2019

Is the CPU usage perhaps too low in that last screenshot? It only peaks at 0.03% CPU, even though we have timers, looping API requests, and all log entries being shipped directly to CloudWatch. Ref. my comments in #74: I don't understand how ioutil.Discard could have such a big effect on CPU usage 😅

mlehner616 (Contributor, Author) commented:

Yeah, I was a bit surprised myself about this one 🤗. I'll keep digging; I have a couple more things to try, and I'm starting to wonder if I accidentally uncovered something going on with my version of journald itself.

lox (Contributor) commented Mar 2, 2019

Might it be possible to run a CPU trace and see what the culprit is @mlehner616? (https://blog.golang.org/profiling-go-programs)

lox (Contributor) commented Mar 2, 2019

The new trace stuff is kind of amazing too: https://medium.com/@cep21/using-go-1-10-new-trace-features-to-debug-an-integration-test-1dc39e4e812d
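A minimal runtime/trace harness, for reference (the output file name and the sleep standing in for the workload are placeholders):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Record a runtime trace for the duration of the workload.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	time.Sleep(2 * time.Second) // stand-in for the code being investigated
}
```

The resulting file can then be opened with `go tool trace trace.out`.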

itsdalmo (Contributor) commented Mar 3, 2019

I fired up lifecycled as it exists on the master branch right now, with the spot listener enabled, and left it running for (almost) 24h. The CPU usage, measured at 5-minute intervals, looks good to me:

[CPU usage screenshot, 2019-03-03]

How are things looking on your end @mlehner616?

itsdalmo (Contributor) commented Mar 6, 2019

Having run the example above for ~4 days now, I can't see any increase in CPU usage. I think #72 has solved this, and that we can consider the issue closed without any further code changes - or are you seeing something else on your end, @mlehner616?

itsdalmo (Contributor) commented Mar 7, 2019

@lox - Perhaps a 3.0.2 release should be cut now that the changes in #72 (and #73) have been merged? Any further fixes can just go in another release.

mlehner616 (Contributor, Author) commented:

I've confirmed at this point that #72 and #73 together resolve this issue. I've also confirmed that adding --no-spot is a workaround for versions <=3.0.1 if the spot listener is not required. Thank you for the quick attention on this and thank you for maintaining this awesome tool. 👍

Looking forward to a 3.0.2 release.

lox (Contributor) commented Mar 10, 2019

Thanks @mlehner616 and @itsdalmo, putting out a 3.0.2 release now!

lox closed this as completed on Mar 10, 2019.