Memory leak from fleet jobs #709
I ran 3000 sleeper jobs on Bulbasaur's n2 in 26 minutes, and couldn't see any change in the available memory.
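A batch of sleeper jobs like that can be driven by a short submission loop; the sketch below is only an illustration, and the node name, job count, sleep length, and sbatch flags are assumptions, not the exact options used in the test.

```python
#!/usr/bin/env python
"""Sketch: submit a batch of trivial sleeper jobs to one compute node via Slurm.

The node name, job count, sleep length, and sbatch flags are placeholders,
not the options actually used in the test above.
"""
import subprocess

NODE = "n2"       # target compute node (placeholder)
JOB_COUNT = 3000  # number of sleeper jobs to submit

for _ in range(JOB_COUNT):
    subprocess.check_call([
        "sbatch",
        "--nodelist=" + NODE,  # pin the job to the node under test
        "--wrap=sleep 60",     # the "sleeper" job: do nothing, then exit
    ])
```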
I ran the same
I ran the same
I ran just the
This caused a problem again today on octomore's n1 node, so I rebooted the node. It was down to 1GB of free memory, which relaxed to 4.7GB of free memory after the node drained.
I ran
The options I used for this test were:
I ran a steady stream of Kive jobs in docker containers on bulbasaur overnight, and saw a steady drop in available memory. At 19:00 there was 6.8GB available, and by 11:00 there was 5.95GB available. That's a drop of about 53MB per hour (0.85GB over 16 hours). That compares to the 160MB per hour of
I ran
I ran
I ran
I ran
Interestingly, when nothing particular was running over the weekend, I saw available memory steadily drop by about 2MB per hour.
I ran
When I stopped using gzip, and just read the same file over and over again, launching subprocesses from the main process, it consumed 1.7GB per hour. That makes me think it may just be launching Python in a subprocess that's leaking the memory.
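As a rough illustration of that kind of load, a loop along these lines repeatedly starts a fresh Python interpreter that reads a file and exits. This is only a sketch of the pattern described above, not the exact script used; the file path and iteration count are placeholders.

```python
#!/usr/bin/env python
"""Sketch of the pattern described above: repeatedly launch short-lived
Python subprocesses from a long-running main process.

The file path and iteration count are placeholders.
"""
import subprocess

INPUT_FILE = "/tmp/testfile.dat"  # placeholder input file

for _ in range(100000):
    # Each iteration starts a new Python interpreter that reads the file
    # and then exits immediately.
    subprocess.check_call([
        "python", "-c",
        "open({!r}, 'rb').read()".format(INPUT_FILE),
    ])
```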
OK, I've found a simple script to reproduce the leak:
That leaks memory at 6.7GB per hour, and I see the same leak when I run it under Python 3.
I reproduced the same leak on n2, as well as octomore n0. I tried it on a virtual machine on my workstation and saw very little memory leaked (KB instead of MB). I opened case 00094775 with Penguin Computing to ask for help tracking this down.
Penguin Computing thinks it's a known memory leak, and suggests updating to Scyld ClusterWare 7.3.6.
Once we finally got the update from Penguin working, the memory leak was fixed. Just need to migrate Octomore to the new version.
After upgrading Octomore, the leak is gone.
Thanks for catching this @donkirkby - I'm going to update our cluster as well!
Be careful about upgrading to CentOS 7.4. We had compatibility issues with old compute nodes.
Now that the fleet has stabilised enough to successfully process runs for several days, we're seeing a steady drop in the available memory. This plot shows the available memory over several days on the Bulbasaur head node, a large compute node, and a small compute node. Disk cache doesn't reduce available memory. Before starting the recording, I rebooted all the compute nodes, but not the head node.
You can see that they all steadily decline at a similar rate, but the small nodes run into trouble because they have less memory to start with. The flat sections are where the fleet was down for several hours.
If so, how long until the smallest Octomore node has problems? It starts with 62GB.
- See if different types of jobs change the rate of memory consumption: sleeper, aln2counts, prelim_map, mixed-hcv.
- See if the same thing happens when you run the jobs without Slurm. Just submit jobs to a node using bpsh (see the sketch after this list).
- See if it's just memory fragmentation.
- Any effect when jobs are run under Docker?
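For the without-Slurm comparison, jobs can be pushed straight to a compute node with bpsh. The sketch below only shows an assumed shape for that test; the node number, job count, and sleep length are placeholders.

```python
#!/usr/bin/env python
"""Sketch: run sleeper jobs on one compute node directly through bpsh,
bypassing Slurm, to see whether the leak still appears.

Node number, job count, and sleep length are placeholders.
"""
import subprocess

NODE = "2"        # compute node number, e.g. n2
JOB_COUNT = 3000  # same scale as the earlier sleeper test

for _ in range(JOB_COUNT):
    # bpsh <node> <command> runs the command on the given compute node.
    subprocess.check_call(["bpsh", NODE, "sleep", "1"])
```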