Memory leak from fleet jobs #709

Closed · 7 tasks done
donkirkby opened this issue Sep 25, 2017 · 20 comments

donkirkby (Member) commented Sep 25, 2017

Now that the fleet has stabilised enough to successfully process runs for several days, we're seeing a steady drop in the available memory. This plot shows the available memory over several days on the Bulbasaur head node, a large compute node, and a small compute node. Disk cache doesn't reduce available memory. Before starting the record, I rebooted all the compute nodes, but not the head node.

[Plot: available memory over several days on the Bulbasaur head node, a large compute node, and a small compute node]

You can see that they all steadily decline at a similar rate, but the small nodes run into trouble because they have less memory to start with. The flat sections are where the fleet was down for several hours.
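For reference, a minimal sketch of the kind of sampling loop behind a plot like this (the CSV file name and one-minute interval are arbitrary; MemAvailable already discounts reclaimable disk cache, which is why the cache doesn't show up as a drop):

import csv
import time
from datetime import datetime


def read_available_kb():
    """Return MemAvailable from /proc/meminfo, in kB."""
    with open('/proc/meminfo') as meminfo:
        for line in meminfo:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])
    raise RuntimeError('MemAvailable not found in /proc/meminfo')


if __name__ == '__main__':
    with open('available_memory.csv', 'a') as log_file:
        writer = csv.writer(log_file)
        while True:
            writer.writerow([datetime.now().isoformat(), read_available_kb()])
            log_file.flush()
            time.sleep(60)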

  • See if Octomore has the same problem.
  • If so, how long until the smallest Octomore node has problems? It starts with 62GB.
  • See if different types of jobs change the rate of memory consumption: sleeper, aln2counts, prelim_map, mixed-hcv.
  • See if the same thing happens when you run the jobs without Slurm. Just submit jobs to a node using bpsh.
  • See if it's just memory fragmentation.
  • Any effect when jobs are run under Docker?
  • Migrate Octomore to the new release from Penguin.
donkirkby added the bug label Sep 25, 2017
donkirkby added this to the Near future milestone Sep 25, 2017
donkirkby (Member Author) commented Nov 22, 2017

I ran 3000 sleeper jobs on Bulbasaur's n2 in 26 minutes, and couldn't see any change in the available memory.

donkirkby (Member Author) commented:

I ran the same prelim_map task 227 times, with 10 jobs in the Slurm queue and 8 active at a time, from 13:28 to 15:14, and the available memory dropped from 7.3 to 7.1 GB.

donkirkby (Member Author) commented:

I ran the same prelim_map task 227 times without Slurm, sending it to n2 with bpsh in 8 parallel processes, from 12:18 to 13:16, and the available memory dropped from 7.0 to 6.8 GB. Interesting that Slurm was about 50% slower.
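The 8-way bpsh runs were driven with a loop roughly like this (a sketch only; the prelim_map arguments below are placeholders for the real task's inputs and outputs):

from multiprocessing import Pool
from subprocess import check_call

NODE = '2'  # Bulbasaur's n2


def run_one(job_number):
    # Placeholder arguments: the real test reran the same prelim_map task.
    task_args = ['python', 'prelim_map.py',
                 'reads_R1.fastq', 'reads_R2.fastq',
                 'prelim.{}.csv'.format(job_number)]
    # bpsh runs the command on the chosen compute node and waits for it.
    check_call(['bpsh', NODE] + task_args)
    return job_number


if __name__ == '__main__':
    pool = Pool(processes=8)  # 8 parallel processes, as in the test above
    for finished in pool.imap_unordered(run_one, range(227)):
        print('finished job', finished)
    pool.close()
    pool.join()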

donkirkby (Member Author) commented:

I ran just the bowtie2-align-s command 227 times without Slurm, sending it to n2 with bpsh in 8 parallel processes, from 13:47 to 14:37, and the available memory dropped from 6.8 to 6.7 GB.

donkirkby (Member Author) commented Feb 6, 2018

This caused a problem again today on Octomore's n1 node, so I rebooted the node. It was down to 1 GB of free memory, which recovered to 4.7 GB after the node drained.
Most of the reported failures were just out-of-memory messages, but I did see a few NODE_FAIL messages.

donkirkby (Member Author) commented:

I ran copy_slurm_test.py on bulbasaur for 2.5 hours, and it consumed 400MB of memory, even after the jobs had finished. The script just scans through all the .gz files in Kive's datasets, copies them to a working folder, and then submits a Slurm job that launches a Docker process that unzips the file and counts lines.

The options I used for this test were:

python copy_slurm_test.py --min_size 50 --max_size 200 -n20000 -p70 "/data/kive/Datasets/*/*.gz" copy_test
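The script isn't reproduced here, but its main loop is roughly the following (a sketch of the behaviour described above rather than the real script: the size units, the Docker image, and the sbatch options are placeholders, and the -p70 parallelism is left out):

import glob
import os
import random
import shutil
from subprocess import check_call


def run_test(pattern, work_dir, min_size_kb, max_size_kb, job_count):
    # Scan for .gz datasets in the requested size range (units assumed to be kB).
    candidates = [path for path in glob.glob(pattern)
                  if min_size_kb * 1024 <= os.stat(path).st_size <= max_size_kb * 1024]
    if not os.path.isdir(work_dir):
        os.makedirs(work_dir)
    for i in range(job_count):
        source = random.choice(candidates)
        target = os.path.join(work_dir, 'job{}.gz'.format(i))
        shutil.copy(source, target)  # copy the dataset to the working folder
        # Submit a Slurm job that launches a Docker container to unzip the
        # copy and count its lines (image name is a placeholder).
        shell_command = ('docker run --rm -v {0}:/data alpine '
                         'sh -c "zcat /data/{1} | wc -l"').format(
            os.path.abspath(work_dir), os.path.basename(target))
        check_call(['sbatch', '--wrap', shell_command])


if __name__ == '__main__':
    run_test('/data/kive/Datasets/*/*.gz', 'copy_test',
             min_size_kb=50, max_size_kb=200, job_count=20000)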

donkirkby (Member Author) commented:

I ran a steady stream of Kive jobs in docker containers on bulbasaur overnight, and saw a steady drop in available memory. At 19:00 there was 6.8GB available, and by 11:00 there was 5.95GB available. That's a drop of 53MB per hour. That compares to the 160MB per hour of copy_slurm_test.py.

donkirkby (Member Author) commented:

I ran copy_test.py on bulbasaur for 78 minutes, and it consumed 1.5 GB of memory. That's 1.15 GB per hour when I just run docker, copy files, and unzip them.

donkirkby (Member Author) commented:

I ran copy_test.py on bulbasaur n3 without using docker. It ran for 52 minutes and consumed 210MB of memory. That's 242MB per hour to copy files, run python3, and use the gzip module. Here's the command line I used with the current version of the script:

bpsh 3 python copy_test.py --min_size 50 --max_size 200 -n20000 -p8 "/data/kive/Datasets/*/*.gz" copy_test --test python

donkirkby (Member Author) commented:

I ran copy_test.py on bulbasaur n3 without doing the copy. It just unzipped the same file over and over again in 8 subprocesses. It ran for 18 minutes and consumed 110MB of memory. That's 489MB per hour to run python3 and use the gzip module. The latest version and command line:

bpsh 3 python copy_test.py --min_size 50 --max_size 200 -n20000 -p8 copy_test/0044-AIV3-Unknown_S29_L001_R2_001.fastq.gz copy_test --test python --skip_copy
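In this mode the script boils down to a pool of 8 workers that each launch a fresh python3 process to unzip the same file with the gzip module and count its lines. A sketch (the counting one-liner is a stand-in for whatever the real script runs in the subprocess):

from multiprocessing import Pool
from subprocess import check_output

GZ_PATH = 'copy_test/0044-AIV3-Unknown_S29_L001_R2_001.fastq.gz'
COUNTER_SOURCE = """\
import gzip, sys
with gzip.open(sys.argv[1], 'rb') as f:
    print(sum(1 for _ in f))
"""


def count_lines(job_number):
    # Each call launches a fresh python3 process, as in the test above.
    output = check_output(['python3', '-c', COUNTER_SOURCE, GZ_PATH])
    return int(output)


if __name__ == '__main__':
    pool = Pool(processes=8)
    for _ in pool.imap_unordered(count_lines, range(20000)):
        pass
    pool.close()
    pool.join()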

donkirkby (Member Author) commented Feb 19, 2018

I ran memory_consumer2.py, which reads a file into memory over and over, for three hours, and the available memory dropped by 7MB per hour, which is much slower than in the other tests.
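The core of memory_consumer2.py is roughly this (a sketch; the path is a placeholder for whichever large file the test read):

FILE_PATH = '/data/kive/Datasets/example.fastq.gz'  # placeholder path

while True:
    with open(FILE_PATH, 'rb') as data_file:
        data = data_file.read()  # pull the whole file into memory
    del data  # drop the reference so the buffer can be freed on the next pass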
The leading suspects for the leak are:

  • the subprocess pool
  • gzip module
  • just launching a Python process

Interestingly, when nothing particular was running over the weekend, I saw available memory steadily drop by about 2MB per hour.

donkirkby (Member Author) commented:

I ran copy_test.py on bulbasaur n3 using a single process instead of the process pool, still without the copy, Docker, or Slurm. It ran for 110 minutes and consumed 90MB. That's 49MB per hour, compared to 489/8 ≈ 61MB per hour if the consumption rate had simply scaled with the number of processes being used.

donkirkby (Member Author) commented:

When I stopped using gzip and just read the same file over and over again, launching subprocesses from the main process, it consumed 1.7GB per hour. That makes me think it may just be launching Python in a subprocess that's leaking the memory.

donkirkby (Member Author) commented:

OK, I've found a simple script to reproduce the leak:

from __future__ import print_function
from subprocess import check_output, STDOUT

command_args = ["python3", "-c", "print('Hello, World!')"]

while True:
    report = check_output(command_args, stderr=STDOUT)
    assert report.decode('utf8') == 'Hello, World!\n', report
    print('.', end='')

That leaks memory at 6.7GB per hour, and I see the same leak when I run it under Python 3.

donkirkby (Member Author) commented:

I reproduced the same leak on n2, as well as on Octomore's n0. I tried it on a virtual machine on my workstation and saw very little memory leaked (KB instead of MB).

I opened case 00094775 with Penguin Computing to ask for help tracking this down.

donkirkby (Member Author) commented:

Penguin Computing thinks it's a known memory leak, and suggests updating to Scyld ClusterWare 7.3.6.

donkirkby (Member Author) commented:

Once we finally got the update from Penguin working, the memory leak was fixed. Just need to migrate Octomore to the new version.

donkirkby (Member Author) commented:

After upgrading Octomore, the leak is gone.
Closing the issue.

ArtPoon (Contributor) commented Mar 22, 2018

Thanks for catching this @donkirkby - I'm going to update our cluster as well!

donkirkby (Member Author) commented Mar 22, 2018 via email
