On some inputs RepeatMasker consumes all RAM and crashes (4.1.6 & 4.1.7-p1) #304

Open
KirillKryukov opened this issue Jan 15, 2025 · 1 comment
For some inputs, RepeatMasker fails to complete: it runs for a very long time while consuming an ever-increasing amount of RAM, and eventually crashes (or gets killed). During this final hanging phase, nhmmscan has already finished, and "top" shows the ProcessRepeats process running and taking up the memory.
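The memory growth seen in "top" could also be recorded as a time series, which may help when comparing runs. The following is a minimal sketch, assuming a Linux system with /proc; the helper names rss_kb and watch_rss are made up for this example and are not part of RepeatMasker.

```shell
#!/usr/bin/env bash
# Print the resident set size (in kB) of a process, read from /proc (Linux only).
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Print a timestamped RSS sample every INTERVAL seconds until the process exits.
# Hypothetical helper; the name and the output format are illustrative only.
watch_rss() {
    local pid=$1 interval=${2:-10}
    while [ -r "/proc/$pid/status" ]; do
        printf '%s\t%s\n' "$(date +%T)" "$(rss_kb "$pid")"
        sleep "$interval"
    done
}
```

While RepeatMasker is in its final phase, this could be started from a second terminal with something like: watch_rss "$(pgrep -f ProcessRepeats | head -n 1)" 30 >rss.log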

This reproduces 100% of the time on our various machines, e.g., Ubuntu 24.04 (native or in WSL2), with RepeatMasker 4.1.6, TRF 4.09.1, HMMER 3.4, Dfam 3.8.

We first noticed the bug when analyzing ~20 MB inputs, but managed to reduce the input size to 10 kB. As for the taxon used with the "-species" option, we could only narrow it down to Boreoeutheria. With a smaller taxon, RepeatMasker works fine.

Here is the complete repro script (tested on a freshly installed Ubuntu 24.04). The script installs all tools and the database, downloads the reduced query sequence, and runs RepeatMasker.

sudo apt update
sudo apt upgrade
sudo apt install build-essential python3-pytest python3-h5py
mkdir -p ~/bin

# Setting the entire PATH, to remove junk automatically pulled on WSL2.
# Normally (not on WSL2), "$HOME/bin:$PATH" should be enough.
export PATH="$HOME/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

wget -O ~/bin/trf https://github.com/Benson-Genomics-Lab/TRF/releases/download/v4.09.1/trf409.linux64
chmod a+x ~/bin/trf

mkdir -p ~/hmmer/3.4
cd ~/hmmer/3.4
wget http://eddylab.org/software/hmmer/hmmer-3.4.tar.gz
tar -xvf hmmer-3.4.tar.gz
cd hmmer-3.4
./configure --prefix=$HOME
make
make install

mkdir -p ~/RepeatMasker/4.1.6
cd ~/RepeatMasker/4.1.6
wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.6.tar.gz
tar -xvf RepeatMasker-4.1.6.tar.gz
cd RepeatMasker/Libraries/famdb
wget https://dfam.org/releases/Dfam_3.8/families/FamDB/dfam38-1_full.0.h5.gz
gunzip -k dfam38-1_full.0.h5.gz
cd ../..
./configure --trf_prgm $HOME/bin/trf --hmmer_dir $HOME/bin
ln -s ~/RepeatMasker/4.1.6/RepeatMasker/RepeatMasker ~/bin/RepeatMasker

mkdir ~/test
cd ~/test

# Download the 10 kB sequence reduced from the Acomys russatus genome (accession GCF_903995435.1).
wget https://biokirr.com/Supporting-Data/RepeatMasker-bug-report/a.fna

# Running RepeatMasker.
# With the smaller taxa "Euarchontoglires" or "Laurasiatheria", RepeatMasker works without problems.
# With the Boreoeutheria taxon, however, RepeatMasker fails to complete the analysis.
# It spends a long time in the "ProcessRepeats" step, consuming an increasingly large amount of RAM.
# Eventually it crashes, possibly from running out of memory, after consuming hundreds of gigabytes of RAM.
RepeatMasker -engine hmmer -parallel 1 -species Boreoeutheria -dir . a.fna >a.log 2>a.err
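
The taxon comparison described in the comments above could be scripted so that the hanging case does not block the others. This is only a sketch: it reuses the RepeatMasker options from the command above, relies on timeout from GNU coreutils, and the names RM_BIN and compare_taxa as well as the per-taxon output directories are assumptions introduced for this example.

```shell
#!/usr/bin/env bash
# Run one RepeatMasker search per taxon, each in its own output directory,
# and report whether it finished within the time limit.
# RM_BIN and compare_taxa are hypothetical names used only in this sketch.
RM_BIN=${RM_BIN:-RepeatMasker}

compare_taxa() {
    local input=$1 limit=$2 taxon status
    shift 2
    for taxon in "$@"; do
        mkdir -p "out_$taxon"
        # timeout(1) exits with status 124 when the time limit is reached.
        timeout "$limit" "$RM_BIN" -engine hmmer -parallel 1 \
            -species "$taxon" -dir "out_$taxon" "$input" \
            >"out_$taxon/run.log" 2>"out_$taxon/run.err"
        status=$?
        case $status in
            0)   echo "$taxon: completed" ;;
            124) echo "$taxon: timed out after $limit" ;;
            *)   echo "$taxon: failed with exit status $status" ;;
        esac
    done
}
```

For example: compare_taxa a.fna 30m Euarchontoglires Laurasiatheria Boreoeutheria. The 30-minute limit is an arbitrary choice; on a machine with little RAM the Boreoeutheria run may be killed before the limit is reached and would then be reported as failed rather than timed out.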

The content of the log file "a.log":

Search Engine: HMMER [ 3.4 (Aug 2023) ]

Using Master RepeatMasker Database: /home/kirr/RepeatMasker/4.1.6/RepeatMasker/Libraries/famdb
  Title    : Dfam
  Version  : 3.8
  Date     : 2023-11-14
  Families : 295,590

Species/Taxa Search:
  Boreoeutheria [NCBI Taxonomy ID: 1437010]
  Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa;
           Eumetazoa;Bilateria;Deuterostomia;Chordata;
           Craniata <chordates>;Vertebrata <vertebrates>;
           Gnathostomata <vertebrates>;Teleostomi;Euteleostomi;
           Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota
Including only curated families:
  716 families in ancestor taxa; 9317 lineage-specific families


analyzing file a.fna
identifying Simple Repeats in batch 1 of 1
identifying young abundant SINEs in batch 1 of 1
identifying full-length interspersed repeats in batch 1 of 1
identifying most interspersed repeats in batch 1 of 1
identifying Simple Repeats in batch 1 of 1

(Note: this log file is from a re-run on the same machine, so library formatting is not mentioned in the output above.)

"a.err":

Killed

(This is on WSL2. On other machines the error output can differ, depending on the system and on how the process was killed.)

On a WSL2 VM with 48 GB of RAM, it takes 23 minutes to consume all RAM and get killed. On a physical Linux machine with 1 TB of RAM, it takes much longer.

When run through GNU Time (/usr/bin/time -v ./test.sh), it gives the following report:

        Command being timed: "./test.sh"
        User time (seconds): 1775.03
        System time (seconds): 62.27
        Percent of CPU this job got: 132%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 23:04.82
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 48116696
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 9175
        Minor (reclaiming a frame) page faults: 13947808
        Voluntary context switches: 93548
        Involuntary context switches: 5140
        Swaps: 0
        File system inputs: 8827088
        File system outputs: 17688
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
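
For reference, the peak memory figure in this report can be converted to more readable units. The sketch below assumes the report format shown above; peak_rss_gib is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read a GNU time -v report on stdin and print the peak resident set size in GiB.
# The report gives the value in kilobytes; 1 GiB = 1048576 kB.
peak_rss_gib() {
    awk -F': ' 'index($0, "Maximum resident set size") {
        printf "%.1f GiB\n", $2 / 1048576
    }'
}
```

Applied to the report above, 48116696 kB comes out as 45.9 GiB, which is close to the full 48 GB available to the WSL2 VM.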

Besides the failure itself, another potentially significant impact of this bug is that, by consuming all RAM, it may disrupt other processes running on the same machine.
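
One possible way to limit that impact while the bug is open is to cap the memory available to the run, so that it fails early instead of exhausting the machine. This is a sketch of a generic shell technique, not something RepeatMasker provides; run_capped is a hypothetical helper name. Note that ulimit -v caps the virtual address space of each process rather than its resident memory, so the limit needs to be generous.

```shell
#!/usr/bin/env bash
# Run a command with its virtual address space capped at LIMIT_KB kilobytes.
# The cap is set inside a subshell, so the calling shell's limits stay unchanged.
# Child processes started by the command inherit the same per-process cap.
# run_capped is a hypothetical helper name used only in this sketch.
run_capped() {
    local limit_kb=$1
    shift
    ( ulimit -v "$limit_kb" && exec "$@" )
}
```

For example: run_capped 16000000 RepeatMasker -engine hmmer -parallel 1 -species Boreoeutheria -dir . a.fna >a.log 2>a.err. The 16000000 kB value is an arbitrary assumption and should be adjusted to the machine.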

Please let me know if I need to provide any other information.

Thanks!

KirillKryukov commented Jan 29, 2025

I have now tested and confirmed that the bug also reproduces with RepeatMasker 4.1.7-p1, using essentially the same repro script as above, adapted to 4.1.7-p1 and with an extra step of "rm min_init.0.h5". Let me know if you need the complete separate script for this.

KirillKryukov changed the title from "On some inputs RepeatMasker consumes all RAM and crashes" to "On some inputs RepeatMasker consumes all RAM and crashes (4.1.6 & 4.1.7-p1)" on Jan 29, 2025