bamCoverage slow on tiny BAM file #912
Here’s the profiling output of running the above command with
As you can see from the profiling, there's considerable overhead on the pysam side in opening the BAM file and seeking to a given spot, and that ends up consuming most of the time for sparse files. For CI testing we've found it useful to just put the reads on a single smaller contig, since then we only pay this overhead once. When files contain many reads, processing them costs more than the overhead, so the multiprocessing strategy tends to be quicker, but as you've found, that isn't the case with small files. In theory one could implement a heuristic that chooses a strategy based on the likely number of alignments, but I don't think anyone on our side has felt compelled to write that, given that we can just construct our CI test files a bit differently.
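The trade-off described above can be sketched with a back-of-the-envelope cost model. The constants here are illustrative assumptions, not measurements from deepTools or pysam:

```python
# Back-of-the-envelope model of why sparse files are dominated by
# per-chunk file-opening overhead. Both constants are illustrative
# assumptions, not measured values.

OPEN_OVERHEAD_S = 0.001   # assumed cost of one open + seek into the BAM
PER_READ_COST_S = 0.0001  # assumed cost of processing one alignment

def estimated_runtime(n_chunks: int, n_reads: int) -> float:
    """Total time = one open/seek per chunk + per-read processing."""
    return n_chunks * OPEN_OVERHEAD_S + n_reads * PER_READ_COST_S

# Sparse CI-style file: few reads scattered over a whole genome's chunks.
sparse = estimated_runtime(n_chunks=880_221, n_reads=7_200)

# Same reads on a single small contig: the overhead is paid only once.
dense = estimated_runtime(n_chunks=1, n_reads=7_200)

print(f"many chunks: {sparse:.1f}s, one chunk: {dense:.1f}s")
```

Under this model the per-read cost is identical in both cases; only the number of times the fixed open/seek price is paid changes, which is why moving the reads onto one contig helps so much.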
But why does the file need to be opened so many times? Can't you re-use an already-opened `AlignmentFile` and seek appropriately? I'm sure the overhead comes only from opening the file, not from seeking to the correct position. I don't know anything about the architecture of the tool, but naively I would expect that it's sufficient to have as many open `AlignmentFile`s as there are threads/processes.
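The "one open file per process" idea can be sketched with a `multiprocessing.Pool` initializer, which runs once per worker. This self-contained version uses a plain binary file instead of a BAM so it runs anywhere; with pysam the analogous move would be opening one `AlignmentFile` per worker (assuming handles are only reused within a single process):

```python
import multiprocessing as mp
import os
import tempfile

# Pattern sketch: open the file once per worker process (in the pool
# initializer) instead of once per task, then seek for each chunk.

_handle = None  # per-process global, set once when the worker starts

def _init_worker(path):
    global _handle
    _handle = open(path, "rb")  # one open per worker, reused for every task

def _process_chunk(region):
    offset, length = region
    _handle.seek(offset)        # cheap seek instead of a fresh open
    return len(_handle.read(length))

def run_demo(n_workers=4, n_chunks=16, chunk_size=64):
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * (n_chunks * chunk_size))
        path = f.name
    try:
        regions = [(i * chunk_size, chunk_size) for i in range(n_chunks)]
        with mp.Pool(n_workers, initializer=_init_worker,
                     initargs=(path,)) as pool:
            return sum(pool.map(_process_chunk, regions))
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(run_demo())
```

With this layout the number of opens equals the number of workers, regardless of how many chunks are dispatched.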
So to answer my own question: It appears the reference is split up into chunks and then a worker is spawned for each chunk. Each worker re-opens the BAM file. For the genome that I'm using, it appears there are 880221 chunks, most of which will not contain any reads. When I change the

To solve this even in the multi-threaded case, one would need a mechanism to share state (i.e., open `AlignmentFile`s) between workers. Alternatively, it may help to use

I'm not too happy that I have to adjust my test data just because of this, since I mainly want to test our pipeline and I don't yet know whether I can do that realistically with few, small contigs, but I accept that fixing this is not a priority at the moment. I also cannot offer any help myself at this time, so I cannot complain ... Closing this, and don't worry, I'm thankful the tool exists!
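The chunks-then-workers layout described above can be sketched as follows; the chunk size and contig lengths are illustrative assumptions, not the values deepTools actually uses:

```python
# Sketch of the chunking scheme: the genome is cut into fixed-size
# regions and each region becomes one task for a worker. Chunk size and
# contig lengths below are illustrative assumptions.

def make_chunks(contig_lengths, chunk_size):
    """Yield (contig, start, end) regions covering every contig."""
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            yield contig, start, min(start + chunk_size, length)

# Two toy contigs; for a real mammalian genome this scheme easily
# produces hundreds of thousands of chunks, most containing no reads.
toy_genome = {"chr1": 250_000, "chr2": 120_000}
chunks = list(make_chunks(toy_genome, chunk_size=10_000))
print(len(chunks))  # 25 + 12 = 37
```

When a fresh worker (and hence a fresh file open) is spawned per chunk, a mostly-empty BAM still pays the open/seek price once per region.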
Yeah, I really wish Python had more C-like multithreading. In MethylDackel I use the same create-genomic-chunks-and-process-each strategy, but since that's in C and uses pthreads, the number of file openings equals the number of threads, so the overhead is minuscule. If I ever get time in the future I'd like to refactor the heavier lifting in deepTools into Nim or Go or something like that, which would seriously speed up cases like this ... but only time will tell whether I'll ever have enough free time for that.
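The pthreads-style layout can at least be imitated in Python with thread-local file handles, so openings scale with threads rather than with chunks. A minimal sketch, again using a plain binary file (CPython's GIL means this mainly saves the open/seek overhead, not CPU time, which is the point being made above):

```python
import os
import tempfile
import threading

_local = threading.local()   # one handle per thread lives here
_open_count = 0              # counts real opens, for demonstration
_count_lock = threading.Lock()

def _get_handle(path):
    global _open_count
    if not hasattr(_local, "handle"):
        with _count_lock:
            _open_count += 1
        _local.handle = open(path, "rb")  # first use in this thread only
    return _local.handle

def _worker(path, regions, results, idx):
    total = 0
    for offset, length in regions:
        h = _get_handle(path)  # reused for every region in this thread
        h.seek(offset)
        total += len(h.read(length))
    results[idx] = total

def run_demo(n_threads=4, chunks_per_thread=100, chunk_size=8):
    global _open_count
    _open_count = 0
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * (n_threads * chunks_per_thread * chunk_size))
        path = f.name
    try:
        results = [0] * n_threads
        threads = []
        for i in range(n_threads):
            regions = [((i * chunks_per_thread + j) * chunk_size, chunk_size)
                       for j in range(chunks_per_thread)]
            t = threading.Thread(target=_worker,
                                 args=(path, regions, results, i))
            threads.append(t)
            t.start()
        for t in threads:
            t.join()
        return sum(results), _open_count
    finally:
        os.unlink(path)

if __name__ == "__main__":
    total, opens = run_demo()
    print(total, opens)
```

Here 400 chunks are processed, but the file is opened only four times: once per thread.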
A similar issue, #662, has been reported previously, but since it has been closed and the problem appears to persist, I'm reporting it again.
The following command takes 13 minutes although the `in.bam` file only contains ~7200 alignments:

With `-p 4`, the wall-clock time is reduced to 8 minutes, but CPU time is 24 minutes (on a four-core machine).

When I remove `--normalizeUsing RPGC`, the time goes down to less than 2 minutes, which is better, but still surprisingly slow IMO considering the number of reads.

The reference is Mus musculus (mm9). The BAM has 35 `@SQ` records, so nothing weird.

My use case is running a pipeline on a test dataset during CI. All the other steps taken together finish in less than one minute.