
Mosdepth fails when run with less than 8GB RAM on ~50MB CRAM #245

Open
dennishendriksen opened this issue Dec 9, 2024 · 11 comments

@dennishendriksen

Hi @brentp,

Mosdepth is using more memory than expected when run with a .bed file on a ~50MB .cram file:

    local args=()
    args+=("--threads" "1")
    # enabling/disabling this line doesn't make a difference
    # args+=("--no-per-base")
    args+=("--by" "nanopore.bed")
    args+=("--thresholds" "1,5,10,15,20,30,50,100")
    args+=("--fasta" "<cut>/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz")
    args+=("mosdepth")
    args+=("vip_fam0_HG002.cram")
    ${CMD_MOSDEPTH} "${args[@]}"

We are running mosdepth 0.3.8 using a Singularity/Apptainer image built from docker://quay.io/biocontainers/mosdepth:0.3.8--hd299d5a_0

I was expecting mosdepth to run with much less memory, based on the numbers reported in the README and your comment in #161 (comment). Running with 16GB resolves the issue.

Data to reproduce the issue:
https://download.molgeniscloud.org/downloads/vip/_dev/github/mosdepth/
https://download.molgeniscloud.org/downloads/vip/resources/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
https://download.molgeniscloud.org/downloads/vip/images/mosdepth-0.3.8.sif

Is this the behavior that you would expect?

Best regards,
@dennishendriksen

@brentp
Owner

brentp commented Dec 13, 2024

It should require only ~1GB of memory, perhaps 2. Do you see the same memory use if you remove --by and --thresholds?

@dennishendriksen
Author

dennishendriksen commented Dec 16, 2024

Hi @brentp,

I ran some tests, and this is the median mosdepth physical memory usage:

8.3GB   # with    --by and --thresholds
6.8GB   # without --by and --thresholds

So there is a difference, but both values are way outside the 1-2GB range. Any thoughts? I'm not familiar with Nim, but I was wondering if it could be related to a memory-management setting?

Greetings,
@dennishendriksen
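For reference, one way to observe peak physical memory (RSS) on Linux is to poll VmHWM from /proc while the process runs. This is a hedged sketch, not how the numbers above were necessarily collected; `sleep 1` stands in for the real mosdepth command line.

```shell
# Start the command to be measured in the background ('sleep 1' is a stand-in).
sleep 1 &
pid=$!
peak=0
# Poll VmHWM (peak resident set size, in kB) until the Vm* fields disappear,
# which happens once the process has exited.
while cur=$(awk '/^VmHWM/ {print $2}' /proc/"$pid"/status 2>/dev/null); [ -n "$cur" ]; do
  if [ "$cur" -gt "$peak" ]; then peak=$cur; fi
  sleep 0.1
done
wait "$pid"
echo "peak RSS: ${peak} kB"
```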

@LudvigOlsen
Contributor

When running mosdepth on a ~280 MB BAM file (hg38) using a ~2 GB .bed.gz file (~8 GB uncompressed) in --by (and setting --mapq 20 --min-frag-len 100 --max-frag-len 220 --no-per-base), my system uses >40 GB of memory.

When running mosdepth, the "Max mem used" slowly builds up to that and then stops increasing (perhaps when it finishes chromosome 1).

I'm considering whether I should chunk the BED file to reduce the RAM usage, though making multiple calls might increase the running time.
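Chunking the BED file per chromosome could be as simple as an awk one-liner; this is a hypothetical sketch with toy data, and `regions.bed` and the `chunks/` directory are made-up names:

```shell
mkdir -p chunks
# Toy stand-in for the real BED file: two chr1 intervals, one chr2 interval.
printf 'chr1\t0\t10\nchr1\t10\t20\nchr2\t0\t10\n' > regions.bed
# Write each record to a per-chromosome file named after column 1.
awk '{ print > ("chunks/" $1 ".bed") }' regions.bed
wc -l < chunks/chr1.bed    # → 2
```

Each chunk could then be passed to a separate mosdepth call via --by, at the cost of one run per chromosome.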

Note that I am using a fork of mosdepth, but I don't think I've changed anything relevant to RAM usage in the standard use case.

@brentp
Owner

brentp commented Jan 9, 2025

That sounds excessive! Can you verify that mosdepth from the releases here has the same memory usage?

@LudvigOlsen
Contributor

@brentp I just installed mosdepth via bioconda into a new environment. It uses the same amount of RAM, unfortunately.

The files are available online as part of a preprint, so it should be easy for you to reproduce:

The BED file is available here: https://zenodo.org/records/14215762
It is called "whole_genome.mappable.binned_10bp.bed.gz"

The test BAM file is available here: https://zenodo.org/records/13909979

Something like the following:

$ mosdepth --by whole_genome.mappable.binned_10bp.bed.gz --threads 1 --mapq 20 --min-frag-len 100 --max-frag-len 220 --no-per-base <OUTPUT_PATH> IC38.hg38.downsampled.aligned.sorted.markdup.bam

@brentp
Owner

brentp commented Jan 9, 2025

OK, just a quick point: does whole_genome.mappable.binned_10bp.bed.gz have, for example, 24.9 million 10 bp intervals for chromosome 1? That would be the reason. I do not optimize for BED files of this size, so that is likely the issue.
You can also use --by 10 and then filter to the mappable regions afterwards.
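When the mappable BED uses the same 10 bp grid, filtering the --by 10 output back to mappable bins could be done with an awk join (bedtools intersect is another option). This is a sketch on toy data; `bins.bed` mimics mosdepth's fixed-bin regions output and the file names are illustrative:

```shell
# Toy stand-ins: three 10 bp bins with depth values, and the two bins to keep.
printf 'chr1\t0\t10\t31.0\nchr1\t10\t20\t0.0\nchr1\t20\t30\t12.5\n' > bins.bed
printf 'chr1\t0\t10\nchr1\t20\t30\n' > mappable.bed
# First pass stores mappable interval keys; second pass keeps matching bins.
awk 'NR==FNR { keep[$1"\t"$2"\t"$3] = 1; next }
     ($1"\t"$2"\t"$3) in keep' mappable.bed bins.bed > filtered.bed
wc -l < filtered.bed    # → 2
```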

@LudvigOlsen
Contributor

Something on that scale, yes! Would --by 10 have a much lower memory requirement?

I need to be sure I always have the same 10 bp intervals, as I use a lot of other reference information for the same bins. But if that behavior is stable, I could perhaps recompute everything to use the mosdepth bins. It would be a lot of work, but if it heavily reduces memory usage, it could be the best approach.

@brentp
Owner

brentp commented Jan 9, 2025

Yes, --by 10 should have a lower memory requirement, and it is consistent: it will always create 0..10, 10..20, etc.
I am downloading your test data on a slow connection, so it will take a while, but you could try --by 10.
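If it helps to pre-compute the bin coordinates that --by 10 would produce, the same tiling can be reproduced from a "name, length" table such as a .fai index. This is a sketch on made-up data, assuming (as the windowed output suggests) that the last bin is truncated at the chromosome end:

```shell
# Made-up chromosome of length 25 in .fai-style "name<TAB>length" form.
printf 'chrA\t25\n' > toy.fai
# Emit 0..10, 10..20, ... per chromosome, clipping the final bin to the end.
awk -v OFS='\t' '{ for (s = 0; s < $2; s += 10) {
                     e = s + 10; if (e > $2) e = $2
                     print $1, s, e } }' toy.fai > bins10.bed
cat bins10.bed    # → chrA 0 10 / chrA 10 20 / chrA 20 25
```

Existing reference annotations could then be remapped onto exactly these intervals.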

@LudvigOlsen
Contributor

Perfect, thanks! It took 5 min and <4 GB, compared to 15 min and 42 GB. That would probably be worth rerunning everything for!

@brentp
Owner

brentp commented Jan 9, 2025

Phew! Glad to hear it! I am surprised that the BED file is so bad; I am looking into it.

@LudvigOlsen
Contributor

Great! I'm lucky all my intervals are 10 bp, but of course it would be great to have similar performance for BED files, if possible :-)
