bamCoverage very slow, regardless of BAM size, --regions or other shortcuts #662
For us it does not take more than 10 minutes to process millions of reads
aligned to a human reference genome. Are you using a small computer? Maybe
you are running out of memory?
Also, using a bin size of 1 is more time consuming than larger bin sizes. Since
you have a small number of reads, maybe decreasing the number of cores would
help.
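For example, the call quoted below, changed only to use a larger bin size (50 bp here is purely an illustrative value, not one taken from this thread), should run noticeably faster:
$ bamCoverage \
--bam GR.dedup_filtered.sorted.bam \
--binSize 50 \
--verbose \
--extendReads 200 \
--normalizeUsingRPKM \
--outFileFormat bigwig \
--outFileName GR.dedup_filtered.sorted.rpkm.bw \
--numberOfProcessors 1 \
--skipNAs \
--region chr21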
…-fidel
On Wed, Feb 7, 2018 at 8:04 PM Alejandro Barrera wrote:
Hi,
I have noticed that for ChIP-seq BAM files, bamCoverage is extremely slow.
The following bamCoverage call over a BAM of ~2k reads (830Kb) takes more
than 15min. I'm using the latest deeptools available in conda (bamCoverage
2.5.7), but the same is true for my previous deeptools version (2.2.4).
$ bamCoverage \
--bam GR.dedup_filtered.sorted.bam \
--binSize 1 \
--verbose \
--extendReads 200 \
--normalizeUsingRPKM \
--outFileFormat bigwig \
--outFileName GR.dedup_filtered.sorted.rpkm.bw \
--numberOfProcessors 1 \
--skipNAs \
--region chr21
The log:
genome partition size for multiprocessing: 1000000
genome partition size for multiprocessing: 500000
genome partition size for multiprocessing: 250000
genome partition size for multiprocessing: 125000
genome partition size for multiprocessing: 62500
genome partition size for multiprocessing: 50000
normalization: RPKM
Final scaling factor: 51041.241323
genome partition size for multiprocessing: 1050000
minFragmentLength: 0
verbose: True
out_file_for_raw_data: None
numberOfSamples: None
bedFile: None
bamFilesList: ['ctrl_001.dedup_filtered.sorted.bam']
ignoreDuplicates: False
numberOfProcessors: 1
samFlag_exclude: None
save_data: False
stepSize: 1
smoothLength: None
center_read: False
defaultFragmentLength: 200
chrsToSkip: []
region: chr21:1
maxPairedFragmentLength: 800
samFlag_include: None
binLength: 1
blackListFileName: None
maxFragmentLength: 0
minMappingQuality: None
zerosToNans: True
MainProcess, processing 0 (0.0 per sec) reads @ chr21:1-1000001
MainProcess countReadsInRegions_worker: processing 1000000 (7019720.4 per sec) @ chr21:1-1000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:1000001-2000001
MainProcess countReadsInRegions_worker: processing 1000000 (6849643.7 per sec) @ chr21:1000001-2000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:2000001-3000001
MainProcess countReadsInRegions_worker: processing 1000000 (6978364.3 per sec) @ chr21:2000001-3000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:3000001-4000001
MainProcess countReadsInRegions_worker: processing 1000000 (6872517.0 per sec) @ chr21:3000001-4000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:4000001-5000001
MainProcess countReadsInRegions_worker: processing 1000000 (7018087.8 per sec) @ chr21:4000001-5000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:5000001-6000001
MainProcess countReadsInRegions_worker: processing 1000000 (6885299.3 per sec) @ chr21:5000001-6000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:6000001-7000001
MainProcess countReadsInRegions_worker: processing 1000000 (7029426.3 per sec) @ chr21:6000001-7000001
MainProcess, processing 1 (1209.4 per sec) reads @ chr21:7000001-8000001
MainProcess countReadsInRegions_worker: processing 1000000 (6840316.1 per sec) @ chr21:7000001-8000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:8000001-9000001
MainProcess countReadsInRegions_worker: processing 1000000 (7004621.0 per sec) @ chr21:8000001-9000001
MainProcess, processing 1 (2336.7 per sec) reads @ chr21:9000001-10000001
MainProcess countReadsInRegions_worker: processing 1000000 (6999360.9 per sec) @ chr21:9000001-10000001
MainProcess, processing 3 (2292.0 per sec) reads @ chr21:10000001-11000001
MainProcess countReadsInRegions_worker: processing 1000000 (6997504.2 per sec) @ chr21:10000001-11000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:11000001-12000001
MainProcess countReadsInRegions_worker: processing 1000000 (7017395.0 per sec) @ chr21:11000001-12000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:12000001-13000001
MainProcess countReadsInRegions_worker: processing 1000000 (7016808.0 per sec) @ chr21:12000001-13000001
MainProcess, processing 0 (0.0 per sec) reads @ chr21:13000001-14000001
MainProcess countReadsInRegions_worker: processing 1000000 (7030710.6 per sec) @ chr21:13000001-14000001
MainProcess, processing 3 (2434.8 per sec) reads @ chr21:14000001-15000001
MainProcess countReadsInRegions_worker: processing 1000000 (7010849.8 per sec) @ chr21:14000001-15000001
MainProcess, processing 5 (2761.2 per sec) reads @ chr21:15000001-16000001
MainProcess countReadsInRegions_worker: processing 1000000 (6965384.8 per sec) @ chr21:15000001-16000001
MainProcess, processing 7 (3824.9 per sec) reads @ chr21:16000001-17000001
MainProcess countReadsInRegions_worker: processing 1000000 (7007265.7 per sec) @ chr21:16000001-17000001
MainProcess, processing 5 (3787.5 per sec) reads @ chr21:17000001-18000001
MainProcess countReadsInRegions_worker: processing 1000000 (6982895.3 per sec) @ chr21:17000001-18000001
MainProcess, processing 3 (2396.3 per sec) reads @ chr21:18000001-19000001
MainProcess countReadsInRegions_worker: processing 1000000 (7026023.2 per sec) @ chr21:18000001-19000001
MainProcess, processing 6 (4662.1 per sec) reads @ chr21:19000001-20000001
MainProcess countReadsInRegions_worker: processing 1000000 (7025929.1 per sec) @ chr21:19000001-20000001
MainProcess, processing 6 (3401.7 per sec) reads @ chr21:20000001-21000001
MainProcess countReadsInRegions_worker: processing 1000000 (6993105.8 per sec) @ chr21:20000001-21000001
MainProcess, processing 2 (2522.1 per sec) reads @ chr21:21000001-22000001
MainProcess countReadsInRegions_worker: processing 1000000 (6982070.0 per sec) @ chr21:21000001-22000001
MainProcess, processing 2 (1620.7 per sec) reads @ chr21:22000001-23000001
MainProcess countReadsInRegions_worker: processing 1000000 (6992371.3 per sec) @ chr21:22000001-23000001
MainProcess, processing 3 (2286.6 per sec) reads @ chr21:23000001-24000001
MainProcess countReadsInRegions_worker: processing 1000000 (7033151.0 per sec) @ chr21:23000001-24000001
MainProcess, processing 5 (2839.4 per sec) reads @ chr21:24000001-25000001
MainProcess countReadsInRegions_worker: processing 1000000 (7011740.6 per sec) @ chr21:24000001-25000001
MainProcess, processing 5 (3955.4 per sec) reads @ chr21:25000001-26000001
MainProcess countReadsInRegions_worker: processing 1000000 (7021577.1 per sec) @ chr21:25000001-26000001
MainProcess, processing 8 (4444.3 per sec) reads @ chr21:26000001-27000001
MainProcess countReadsInRegions_worker: processing 1000000 (7018980.3 per sec) @ chr21:26000001-27000001
MainProcess, processing 6 (3359.5 per sec) reads @ chr21:27000001-28000001
MainProcess countReadsInRegions_worker: processing 1000000 (7026176.2 per sec) @ chr21:27000001-28000001
MainProcess, processing 4 (3005.1 per sec) reads @ chr21:28000001-29000001
MainProcess countReadsInRegions_worker: processing 1000000 (6985581.8 per sec) @ chr21:28000001-29000001
MainProcess, processing 9 (4912.6 per sec) reads @ chr21:29000001-30000001
MainProcess countReadsInRegions_worker: processing 1000000 (6991287.4 per sec) @ chr21:29000001-30000001
MainProcess, processing 2 (1587.2 per sec) reads @ chr21:30000001-31000001
MainProcess countReadsInRegions_worker: processing 1000000 (7014109.1 per sec) @ chr21:30000001-31000001
MainProcess, processing 3 (2375.5 per sec) reads @ chr21:31000001-32000001
MainProcess countReadsInRegions_worker: processing 1000000 (6998134.6 per sec) @ chr21:31000001-32000001
MainProcess, processing 1 (1270.6 per sec) reads @ chr21:32000001-33000001
MainProcess countReadsInRegions_worker: processing 1000000 (7043497.2 per sec) @ chr21:32000001-33000001
MainProcess, processing 3 (3671.7 per sec) reads @ chr21:33000001-34000001
MainProcess countReadsInRegions_worker: processing 1000000 (7014730.9 per sec) @ chr21:33000001-34000001
MainProcess, processing 6 (4531.9 per sec) reads @ chr21:34000001-35000001
MainProcess countReadsInRegions_worker: processing 1000000 (7009771.9 per sec) @ chr21:34000001-35000001
MainProcess, processing 4 (3002.9 per sec) reads @ chr21:35000001-36000001
MainProcess countReadsInRegions_worker: processing 1000000 (7022070.9 per sec) @ chr21:35000001-36000001
MainProcess, processing 3 (1674.2 per sec) reads @ chr21:36000001-37000001
MainProcess countReadsInRegions_worker: processing 1000000 (6964228.3 per sec) @ chr21:36000001-37000001
MainProcess, processing 4 (3112.7 per sec) reads @ chr21:37000001-38000001
MainProcess countReadsInRegions_worker: processing 1000000 (7022023.9 per sec) @ chr21:37000001-38000001
MainProcess, processing 5 (3834.6 per sec) reads @ chr21:38000001-39000001
MainProcess countReadsInRegions_worker: processing 1000000 (7012866.0 per sec) @ chr21:38000001-39000001
MainProcess, processing 4 (4634.6 per sec) reads @ chr21:39000001-40000001
MainProcess countReadsInRegions_worker: processing 1000000 (7003872.4 per sec) @ chr21:39000001-40000001
MainProcess, processing 3 (3640.9 per sec) reads @ chr21:40000001-41000001
MainProcess countReadsInRegions_worker: processing 1000000 (7010802.9 per sec) @ chr21:40000001-41000001
MainProcess, processing 3 (3501.1 per sec) reads @ chr21:41000001-42000001
MainProcess countReadsInRegions_worker: processing 1000000 (6950091.9 per sec) @ chr21:41000001-42000001
MainProcess, processing 3 (2399.9 per sec) reads @ chr21:42000001-43000001
MainProcess countReadsInRegions_worker: processing 1000000 (7005159.1 per sec) @ chr21:42000001-43000001
MainProcess, processing 4 (2304.2 per sec) reads @ chr21:43000001-44000001
MainProcess countReadsInRegions_worker: processing 1000000 (7009080.7 per sec) @ chr21:43000001-44000001
MainProcess, processing 3 (1727.0 per sec) reads @ chr21:44000001-45000001
MainProcess countReadsInRegions_worker: processing 1000000 (7004913.4 per sec) @ chr21:44000001-45000001
MainProcess, processing 3 (2429.1 per sec) reads @ chr21:45000001-46000001
MainProcess countReadsInRegions_worker: processing 1000000 (7009713.3 per sec) @ chr21:45000001-46000001
MainProcess, processing 1 (2391.3 per sec) reads @ chr21:46000001-46709983
MainProcess countReadsInRegions_worker: processing 709982 (7033488.0 per sec) @ chr21:46000001-46709983
output file: ctrl_001.dedup_filtered.sorted.rpkm.bw
The above command seems to be particularly slow in the first phase (each
iteration of the mapReduce function prints one of the genome partition size
for multiprocessing: XXXXX lines).
I have uploaded the BAM and index files to Google Drive
<https://drive.google.com/open?id=1_81HLp11UXhmpC3sfcPIXf06WQGlIl9r>.
Many thanks in advance!
@fidelram thanks for your reply. The observation is independent of the architecture where this is run. I have access to a powerful HPC where memory is not an issue, and there is no out-of-memory error. As originally noted, I specify --numberOfProcessors 1 in the command above. This does not seem to affect all files, since I have computed the coverage of many RNA-seq BAM files without issues. It must be something in the BAMs; however, I'm not sure what could cause a 2,000-read BAM to take such a long time. Did you get a chance to run the command above without issues?
The performance is so poor because of the small number of reads in the file. deepTools is tuned toward large files and will often perform poorly on very small files as a result. This particular behavior comes from a tuning step in the normalization code that tries to estimate how many reads will be filtered out. It normally works by sampling ~1% of the reads in a file, but for such small files it ends up needing to select every read. There's no short-circuit for "just use every read", though, which is why you see multiple "genome partition size" lines in the output: one for each iteration through the file while it tries to gather enough reads to meet its minimum sampling rate.
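To make that iteration concrete, here is a toy sketch of the loop (written for this explanation only, not deepTools source; the read count and minimum sample size are assumed values):
# toy re-creation of the repeated log lines above -- NOT deepTools code
reads_in_bam=2000        # assumption: roughly the number of reads in the small BAM
target_sample=100000     # assumption: minimum sample the normalization step wants
partition=1000000        # starting partition size, as in the log above
floor=50000              # smallest partition size that appears in the log
while :; do
    echo "genome partition size for multiprocessing: $partition"
    if [ "$reads_in_bam" -ge "$target_sample" ] || [ "$partition" -le "$floor" ]; then
        break   # a large BAM stops on the first check; this tiny BAM only stops at the floor
    fi
    partition=$(( partition / 2 ))
    [ "$partition" -lt "$floor" ] && partition=$floor
done
Each pass through that loop corresponds to another full scan of the region, which is where the time goes.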
I'm working on a fix to bypass this, since (A) it's easy enough to know when to just sample all of the reads and (B) nothing will be filtered out anyway, so we can skip all of that. That will appear in deepTools 3, which should be out soon (the code is finished; I just need to wait for pysam to be updated for it all to work).
This is now changed in
Hello,
Thanks ;) !
@Pacomito Sometimes a thread dies and the program never finishes. Try killing the job and rerunning things.
@dpryan79 Thanks for the answer. I tried re-running it both on the server and on my local machine, but it has already been running for more than 3 hours using 8 processors, so I don't think the issue is a dying thread this time.
That seems odd, and the addition of an aux tag shouldn't make a difference (deepTools ignores those tags). Can you make the file available to me?
Thank you, here is the link to a drive:
Did you mean to have 800,000+ 56-base-long contigs in there? That's why it's so slow.
Oh, it is my bad, thank you, I didn't know the header still listed all of those small contigs. Re-running it with the correct @SQ headers.
@Pacomito Hi! I am also having this problem with a .bam file I obtained from merging other .bam files. However, I did not understand how to correct the problem. I don't know how to fix the @SQ headers in my file or what my headers should look like. You said "Re-running it with the correct @SQ headers". How do you know what the correct headers are? Could you tell me where I can learn how to fix this issue?
@ascarafia Do you also have hundreds of thousands of small contigs in your headers? If not, you have a different problem.
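One quick way to check (a minimal sketch, assuming samtools is installed; sample.bam is a placeholder name for your file):
# count the reference sequences (@SQ lines) declared in the BAM header
samtools view -H sample.bam | grep -c '^@SQ'
# list the shortest declared contigs (length then name) to spot huge numbers of tiny scaffolds
samtools view -H sample.bam \
| awk -F'\t' '$1 == "@SQ" { name = ""; len = ""; for (i = 2; i <= NF; i++) { if ($i ~ /^SN:/) name = substr($i, 4); if ($i ~ /^LN:/) len = substr($i, 4) } print len, name }' \
| sort -n | head
A header dominated by hundreds of thousands of very short sequences is the situation described above; a typical human reference declares at most a few thousand @SQ lines.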
Hi. Thank you for your answer. I'm not sure how to check the number of contigs. I did look at the file, and the first entry I get is chr1 248956422 3785338 0. What puzzles me the most is that I created two .bam files with the same commands and ran bamCoverage on one of them without any problems. It took more than an hour but got the job done. It's the second file that ran for over 12 hours but never got processed. The headers of both look the same.
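If it helps, one way to confirm that the two headers really are identical (a sketch; first.bam and second.bam are placeholder names for your two files):
# no output means the two headers are byte-for-byte identical
diff <(samtools view -H first.bam) <(samtools view -H second.bam)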
Something crashed on your system. Kill the job and rerun it.
@dpryan79
@ascarafia Sounds like a good plan, please open a new issue if you continue running into this problem with deepTools 3.