Memory leak from fleet jobs #709

Closed · 7 tasks done
donkirkby opened this issue Sep 25, 2017 · 20 comments

donkirkby (Member) commented Sep 25, 2017

Now that the fleet has stabilised enough to successfully process runs for several days, we're seeing a steady drop in the available memory. This plot shows the available memory over several days on the Bulbasaur head node, a large compute node, and a small compute node. Disk cache doesn't reduce available memory. Before starting the record, I rebooted all the compute nodes, but not the head node.

[Plot: available memory over several days on the Bulbasaur head node, a large compute node, and a small compute node]

You can see that they all steadily decline at a similar rate, but the small nodes run into trouble because they have less memory to start with. The flat sections are where the fleet was down for several hours.
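For reference, a minimal sketch of the kind of sampling loop behind a plot like this (the CSV file name and one-minute interval are arbitrary; MemAvailable already discounts reclaimable disk cache, which is why the cache doesn't show up as a drop):

import csv
import time
from datetime import datetime


def read_available_kb():
    """Return MemAvailable from /proc/meminfo, in kB."""
    with open('/proc/meminfo') as meminfo:
        for line in meminfo:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])
    raise RuntimeError('MemAvailable not found in /proc/meminfo')


if __name__ == '__main__':
    with open('available_memory.csv', 'a') as log_file:
        writer = csv.writer(log_file)
        while True:
            writer.writerow([datetime.now().isoformat(), read_available_kb()])
            log_file.flush()
            time.sleep(60)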

  • See if Octomore has the same problem.
  • If so, how long until the smallest Octomore node has problems? It starts with 62GB.
  • See if different types of jobs change the rate of memory consumption: sleeper, aln2counts, prelim_map, mixed-hcv.
  • See if the same thing happens when you run the jobs without Slurm. Just submit jobs to a node using bpsh.
  • See if it's just memory fragmentation.
  • Any effect when jobs are run under Docker?
  • Migrate Octomore to the new release from Penguin.
donkirkby added the bug label Sep 25, 2017
donkirkby added this to the Near future milestone Sep 25, 2017
donkirkby (Member Author) commented Nov 22, 2017

I ran 3000 sleeper jobs on Bulbasaur's n2 in 26 minutes, and couldn't see any change in the available memory.

donkirkby (Member Author) commented:

I ran the same prelim_map task 227 times, with 10 jobs in the Slurm queue and 8 active at a time, from 13:28 to 15:14, and the available memory dropped from 7.3 to 7.1 GB.

donkirkby (Member Author) commented:

I ran the same prelim_map task 227 times without Slurm, sending it to n2 with bpsh in 8 parallel processes, from 12:18 to 13:16, and the available memory dropped from 7.0 to 6.8 GB. Interesting that Slurm was about 50% slower.
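The 8-way bpsh runs were driven with a loop roughly like this (a sketch only; the prelim_map arguments below are placeholders for the real task's inputs and outputs):

from multiprocessing import Pool
from subprocess import check_call

NODE = '2'  # Bulbasaur's n2


def run_one(job_number):
    # Placeholder arguments: the real test reran the same prelim_map task.
    task_args = ['python', 'prelim_map.py',
                 'reads_R1.fastq', 'reads_R2.fastq',
                 'prelim.{}.csv'.format(job_number)]
    # bpsh runs the command on the chosen compute node and waits for it.
    check_call(['bpsh', NODE] + task_args)
    return job_number


if __name__ == '__main__':
    pool = Pool(processes=8)  # 8 parallel processes, as in the test above
    for finished in pool.imap_unordered(run_one, range(227)):
        print('finished job', finished)
    pool.close()
    pool.join()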

donkirkby (Member Author) commented:

I ran just the bowtie2-align-s command 227 times without Slurm, sending it to n2 with bpsh in 8 parallel processes, from 13:47 to 14:37, and the available memory dropped from 6.8 to 6.7 GB.

donkirkby (Member Author) commented Feb 6, 2018

This caused a problem again today on Octomore's n1 node, so I rebooted the node. It was down to 1 GB of free memory, which recovered to 4.7 GB after the node drained.
Most of the reported failures were just out-of-memory messages, but I did see a few NODE_FAIL messages.

donkirkby (Member Author) commented:

I ran copy_slurm_test.py on bulbasaur for 2.5 hours, and it consumed 400MB of memory, even after the jobs had finished. The script just scans through all the .gz files in Kive's datasets, copies them to a working folder, and then submits a Slurm job that launches a Docker process that unzips the file and counts lines.

The options I used for this test were:

python copy_slurm_test.py --min_size 50 --max_size 200 -n20000 -p70 "/data/kive/Datasets/*/*.gz" copy_test
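The script isn't reproduced here, but its main loop is roughly the following (a sketch of the behaviour described above rather than the real script: the size units, the Docker image, and the sbatch options are placeholders, and the -p70 parallelism is left out):

import glob
import os
import random
import shutil
from subprocess import check_call


def run_test(pattern, work_dir, min_size_kb, max_size_kb, job_count):
    # Scan for .gz datasets in the requested size range (units assumed to be kB).
    candidates = [path for path in glob.glob(pattern)
                  if min_size_kb * 1024 <= os.stat(path).st_size <= max_size_kb * 1024]
    if not os.path.isdir(work_dir):
        os.makedirs(work_dir)
    for i in range(job_count):
        source = random.choice(candidates)
        target = os.path.join(work_dir, 'job{}.gz'.format(i))
        shutil.copy(source, target)  # copy the dataset to the working folder
        # Submit a Slurm job that launches a Docker container to unzip the
        # copy and count its lines (image name is a placeholder).
        shell_command = ('docker run --rm -v {0}:/data alpine '
                         'sh -c "zcat /data/{1} | wc -l"').format(
            os.path.abspath(work_dir), os.path.basename(target))
        check_call(['sbatch', '--wrap', shell_command])


if __name__ == '__main__':
    run_test('/data/kive/Datasets/*/*.gz', 'copy_test',
             min_size_kb=50, max_size_kb=200, job_count=20000)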

donkirkby (Member Author) commented:

I ran a steady stream of Kive jobs in docker containers on bulbasaur overnight, and saw a steady drop in available memory. At 19:00 there was 6.8GB available, and by 11:00 there was 5.95GB available. That's a drop of 53MB per hour. That compares to the 160MB per hour of copy_slurm_test.py.

donkirkby (Member Author) commented:

I ran copy_test.py on bulbasaur for 78 minutes, and it consumed 1.5 GB of memory. That's 1.15 GB per hour when I just run docker, copy files, and unzip them.

donkirkby (Member Author) commented:

I ran copy_test.py on bulbasaur n3 without using docker. It ran for 52 minutes and consumed 210MB of memory. That's 242MB per hour to copy files, run python3, and use the gzip module. Here's the command line I used with the current version of the script:

bpsh 3 python copy_test.py --min_size 50 --max_size 200 -n20000 -p8 "/data/kive/Datasets/*/*.gz" copy_test --test python

donkirkby (Member Author) commented:

I ran copy_test.py on bulbasaur n3 without doing the copy. It just unzipped the same file over and over again in 8 subprocesses. It ran for 18 minutes and consumed 110MB of memory. That's 489MB per hour to run python3 and use the gzip module. The latest version and command line:

bpsh 3 python copy_test.py --min_size 50 --max_size 200 -n20000 -p8 copy_test/0044-AIV3-Unknown_S29_L001_R2_001.fastq.gz copy_test --test python --skip_copy
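In this mode the script boils down to a pool of 8 workers that each launch a fresh python3 process to unzip the same file with the gzip module and count its lines. A sketch (the counting one-liner is a stand-in for whatever the real script runs in the subprocess):

from multiprocessing import Pool
from subprocess import check_output

GZ_PATH = 'copy_test/0044-AIV3-Unknown_S29_L001_R2_001.fastq.gz'
COUNTER_SOURCE = """\
import gzip, sys
with gzip.open(sys.argv[1], 'rb') as f:
    print(sum(1 for _ in f))
"""


def count_lines(job_number):
    # Each call launches a fresh python3 process, as in the test above.
    output = check_output(['python3', '-c', COUNTER_SOURCE, GZ_PATH])
    return int(output)


if __name__ == '__main__':
    pool = Pool(processes=8)
    for _ in pool.imap_unordered(count_lines, range(20000)):
        pass
    pool.close()
    pool.join()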

donkirkby (Member Author) commented Feb 19, 2018

I ran memory_consumer2.py, which reads a file into memory over and over, for three hours, and the available memory dropped by 7MB per hour, which is much slower than in the other tests.
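The core of memory_consumer2.py is roughly this (a sketch; the path is a placeholder for whichever large file the test read):

FILE_PATH = '/data/kive/Datasets/example.fastq.gz'  # placeholder path

while True:
    with open(FILE_PATH, 'rb') as data_file:
        data = data_file.read()  # pull the whole file into memory
    del data  # drop the reference so the buffer can be freed on the next pass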
The leading suspects for the leak are:

  • the subprocess pool
  • gzip module
  • just launching a Python process

Interestingly, when nothing particular was running over the weekend, I saw available memory steadily drop by about 2MB per hour.

donkirkby (Member Author) commented:

I ran copy_test.py on bulbasaur n3 using a single process instead of the process pool, still without the copy, Docker, or Slurm. It ran for 110 minutes and consumed 90MB. That's 49MB per hour, compared to 489/8 ≈ 61MB per hour if the consumption rate had simply scaled with the number of processes being used.

donkirkby (Member Author) commented:

When I stopped using gzip and just read the same file over and over again, launching subprocesses from the main process, it consumed 1.7GB per hour. That makes me think it may just be launching Python in a subprocess that's leaking the memory.

donkirkby (Member Author) commented:

OK, I've found a simple script to reproduce the leak:

from __future__ import print_function
from subprocess import check_output, STDOUT

command_args = ["python3", "-c", "print('Hello, World!')"]

while True:
    report = check_output(command_args, stderr=STDOUT)
    assert report.decode('utf8') == 'Hello, World!\n', report
    print('.', end='')

That leaks memory at 6.7GB per hour, and I see the same leak when I run it under Python 3.

donkirkby (Member Author) commented:

I reproduced the same leak on n2, as well as on Octomore's n0. I tried it on a virtual machine on my workstation and saw very little memory leaked (KB instead of MB).

I opened case 00094775 with Penguin Computing to ask for help tracking this down.

donkirkby (Member Author) commented:

Penguin Computing thinks it's a known memory leak, and suggests updating to Scyld ClusterWare 7.3.6.

donkirkby (Member Author) commented:

Once we finally got the update from Penguin working, the memory leak was fixed. Just need to migrate Octomore to the new version.

donkirkby (Member Author) commented:

After upgrading Octomore, the leak is gone.
Closing the issue.

ArtPoon (Contributor) commented Mar 22, 2018

Thanks for catching this @donkirkby - I'm going to update our cluster as well!

donkirkby (Member Author) commented Mar 22, 2018 via email
