bamCoverage slow on tiny BAM file #912

Closed
marcelm opened this issue Jan 29, 2020 · 5 comments

Comments

marcelm commented Jan 29, 2020

A similar issue, #662, was reported previously, but since that one has been closed and the problem appears to persist, I'm reporting it again.

  • Python 3.7.6
  • deepTools/bamCoverage 3.3.2

The following command takes 13 minutes although the in.bam file only contains ~7200 alignments:

bamCoverage -p 1 --normalizeUsing RPGC --effectiveGenomeSize 2150570000 -b in.bam -o out.bw

With -p 4, the wall-clock time is reduced to 8 minutes, but CPU time is 24 minutes (on a four-core machine).

When I remove --normalizeUsing RPGC, the time goes down to less than 2 minutes, which is better, but still surprisingly slow IMO considering the number of reads.

The reference is Mus musculus (mm9). The BAM has 35 @SQ records, so nothing weird.

My use case is running a pipeline on a test dataset during CI. All the other steps taken together finish in less than one minute.

marcelm commented Jan 29, 2020

Here’s the profiling output of running the above command with python3 -m cProfile:

         197613142 function calls (197607172 primitive calls) in 801.479 seconds

   Ordered by: internal time
   List reduced from 2007 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   880221  530.504    0.001  572.900    0.001 pysam/libcalignmentfile.pyx:777(_open)
      569   69.502    0.122  122.382    0.215 .../lib/python3.7/site-packages/deeptools/writeBedGraph.py:173(writeBedGraph_worker)
   880221   34.586    0.000   34.586    0.000 pysam/libchtslib.pyx:523(_open_htsfile)
 54515326   25.133    0.000   25.133    0.000 .../lib/python3.7/site-packages/deeptools/writeBedGraph.py:343(scaleCoverage)
   880221   21.221    0.000   22.647    0.000 pysam/libcalignmentfile.pyx:1680(__dealloc__)
 55397460   11.769    0.000   11.769    0.000 {built-in method builtins.min}
   879647    9.937    0.000  651.022    0.001 .../lib/python3.7/site-packages/deeptools/getFragmentAndReadSize.py:14(getFragmentLength_worker)
   880219    7.953    0.000    7.953    0.000 pysam/libcalignmentfile.pyx:362(__get__)
 1390/928    7.657    0.006    7.667    0.008 {built-in method numpy.core._multiarray_umath.implement_array_function}
  1759797    6.548    0.000   21.628    0.000 pysam/libcalignmentfile.pyx:1017(fetch)
   880221    5.652    0.000  595.025    0.001 .../lib/python3.7/site-packages/deeptools/bamHandler.py:47(openBam)
   880221    5.427    0.000    6.591    0.000 pysam/libcalignmentfile.pyx:1759(__get__)
  2627701    5.285    0.000    5.285    0.000 {built-in method numpy.array}
      569    5.077    0.009   15.908    0.028 .../lib/python3.7/site-packages/deeptools/countReadsPerBin.py:400(count_reads_in_region)
  1759797    5.074    0.000    5.074    0.000 pysam/libcalignmentfile.pyx:516(get_tid)
   880221    4.928    0.000    6.121    0.000 pysam/libcalignmentfile.pyx:1775(__get__)
  1759797    4.557    0.000    6.672    0.000 pysam/libcalignmentfile.pyx:2052(__init__)
  1803274    3.443    0.000    3.443    0.000 pysam/libcalignmentfile.pyx:2074(cnext)
  1761286    3.124    0.000    3.124    0.000 {method 'extend' of 'list' objects}

@dpryan79
Collaborator

As you can see from the profiling, there's considerable overhead on the pysam side when it comes to opening and jumping to a given spot in a BAM file, and that ends up consuming most of the time with sparse files. For CI testing we've found it useful to just put the reads on a single smaller contig, since then we only pay this overhead once. When files have lots of reads, processing them costs more than the overhead, so this multiprocessing strategy tends to prove quicker, but we've found that's not the case with small files. In theory one could implement a heuristic to decide between strategies based on the likely number of alignments, but I don't think anyone on our side has felt particularly compelled to write that, given that we can just construct our CI test files a bit differently.
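
For illustration, a minimal pysam sketch of that single-contig approach could look like the following (the paths and the contig name chr19 are placeholders, and it assumes both mates of a pair map to that contig): the test reads are rewritten against a header that lists only one contig, so far fewer empty genome chunks are generated.

    import pysam

    # Placeholder paths/contig: write the CI test reads against a header
    # that contains a single contig only.
    with pysam.AlignmentFile("in.bam", "rb") as src:
        header = pysam.AlignmentHeader.from_dict({
            "HD": {"VN": "1.6", "SO": "coordinate"},
            "SQ": [{"SN": "chr19", "LN": src.get_reference_length("chr19")}],
        })
        with pysam.AlignmentFile("ci_test.bam", "wb", header=header) as dst:
            for read in src.fetch("chr19"):
                # Rebuild each record so its reference id is valid in the
                # single-contig header (assumes the mate is also on chr19).
                dst.write(pysam.AlignedSegment.from_dict(read.to_dict(), header))
    pysam.index("ci_test.bam")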

marcelm commented Jan 29, 2020

But why does the file need to be opened so many times? Can’t you re-use an already-opened AlignmentFile and seek appropriately? I’m sure the overhead comes only from opening the file, not from seeking to the correct position.

I don’t know anything about the architecture of the tool, but naively I would expect that it’s sufficient to have as many open AlignmentFiles as there are threads/processes.
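
As a rough illustration of that idea (the file name and regions are made up), a single open AlignmentFile can serve many fetch() calls; each call seeks within the already-open file, so the open cost is paid only once:

    import pysam

    # One handle, many regions: fetch() seeks within the open file,
    # so no per-region open/close is needed.
    regions = [("chr1", 0, 1_000_000), ("chr1", 1_000_000, 2_000_000)]

    with pysam.AlignmentFile("in.bam", "rb") as bam:
        for contig, start, end in regions:
            n = sum(1 for _ in bam.fetch(contig, start, end))
            print(contig, start, end, n)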

marcelm commented Jan 29, 2020

So to answer my own question: It appears the reference is split up into chunks and then a worker is spawned for each chunk. Each worker re-opens the BAM file. For the genome that I’m using, it appears there are 880221 chunks, most of which will not contain any reads.

When I change the getFragmentLength_worker() function so that it receives the already-opened AlignmentFile instead of the BAM file name, the runtime goes down to 2 minutes. This only works with -p 1 because there can be only one iterator over an AlignmentFile at a time, so I'm not suggesting this as a fix (there is a multiple_iterators argument to fetch(), but it also effectively re-opens the file, so the overhead would still be there).

To solve this even in the multi-threaded case, one would need a mechanism to share state (i.e., open AlignmentFiles) between workers.

Alternatively, it may help to use pool.map_async with a chunksize greater than 1 (this would require changing all workers so that they accept lists of work units).
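
A rough sketch of how per-worker state could work with multiprocessing.Pool (this is not deepTools' actual code; the helper names, path, and region list are made up): each worker process opens the BAM once in a pool initializer and then reuses that handle for every region it is handed, while chunksize batches the work units.

    import multiprocessing

    import pysam

    _bam = None  # per-process handle, set once by the pool initializer


    def _init_worker(bam_path):
        # Runs once in each worker process, so the open cost is paid
        # per process rather than per chunk.
        global _bam
        _bam = pysam.AlignmentFile(bam_path, "rb")


    def count_region(region):
        contig, start, end = region
        return sum(1 for _ in _bam.fetch(contig, start, end))


    if __name__ == "__main__":
        regions = [("chr1", i * 1_000_000, (i + 1) * 1_000_000) for i in range(50)]
        with multiprocessing.Pool(4, initializer=_init_worker,
                                  initargs=("in.bam",)) as pool:
            counts = pool.map(count_region, regions, chunksize=10)
        print(sum(counts))

Because every worker process holds its own AlignmentFile, the one-iterator-per-handle restriction mentioned above does not apply here.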

I'm not too happy that I have to adjust my test data just because of this, since I mainly want to test our pipeline and I don't yet know whether I can do that in a realistic fashion with a few small contigs. But I accept that fixing this is not a priority at the moment. I also cannot offer any help myself at this time, so I cannot complain ...

Closing this, and don’t worry, I’m thankful the tool exists!

marcelm closed this as completed Jan 29, 2020
@dpryan79
Collaborator

Yeah, I really wish Python had more C-like multithreading. In MethylDackel I use the same sort of create-genomic-chunks-and-process-each strategy, but since that's in C and uses pthreads, the number of file openings equals the number of threads, so the overhead is minuscule.

If I ever get time in the future, I'd like to rewrite the heavier lifting in deepTools in Nim or Go or something like that, which would seriously speed up cases like this... but only time will tell whether I ever have enough free time to do that.
