sampling from bam (plotEnrichment) #530

thomasmanke · 2017-05-16T09:05:21Z

Use case: I need to estimate the fraction of reads overlapping many regions (restriction sites).
Current solution
plotEnrichment --BED RS.bed -b Input.bam --Offset 1 --outRawCounts RS.freq -p 10

is very slow for a BED file with 43M entries. This might be improved by sampling from Input.bam.

Currently --region could be used, but I would prefer to sample independently of chromosomes, i.e., a certain fraction or a given number of reads. Such a sampling parameter could become an extra filter for all tools.
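To make the request concrete, here is a minimal sketch of what a chromosome-independent sampling filter might look like. It is not an existing deepTools option; the function name `keep_read` and its `seed` parameter are hypothetical. The idea is to hash each read name to a number in [0, 1) and keep the read if that number is below the requested fraction, so the decision does not depend on genomic position and both mates of a pair (same name) are kept or dropped together.

```python
import hashlib


def keep_read(read_name, fraction, seed=0):
    """Hypothetical filter: keep roughly `fraction` of reads, chosen by read name.

    The decision is deterministic for a given (seed, read_name), so mates
    sharing a name are always kept or dropped together.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < fraction


# Example: sample about 10% of some made-up read names.
names = [f"read{i}" for i in range(10000)]
kept = [n for n in names if keep_read(n, 0.1)]
print(len(kept))
```

In a real tool the names would come from the BAM records rather than a generated list, and sampling a fixed number of reads (rather than a fraction) would need a different approach, such as reservoir sampling.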

dpryan79 · 2017-05-16T09:54:40Z

Before going down this route, let's do some performance profiling first.

dpryan79 · 2017-05-16T12:54:03Z

The slowness in this case comes from needing a few minutes to read in the BED file in each chunk sent for processing. One option would be to make the chunk size variable and to set it to something like 20 megabases in the case you're experiencing. Note that this won't work if --region is specified, since then whatever is passed in will get ignored and overwritten.

I'll profile the C code to see if there's anything that can be improved there, but I'm not holding my breath, since it makes no assumption of sort order (and that's not something I'll change).

fidelram · 2017-05-17T21:18:48Z

Since the BED file is not modified, it can be shared among processes. Maybe there is a chance here for some optimization.

-- Fidel Ramirez

dpryan79 · 2017-05-17T21:25:59Z

Is there a way to share things that can't be pickled (I look forward to @thomasmanke asking what the heck this means :) )?
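For anyone wondering what "pickled" means here: Python's multiprocessing sends arguments and results between processes by serializing them with the `pickle` module, and objects that wrap low-level handles generally cannot be serialized that way. The sketch below illustrates the distinction; `threading.Lock()` is only a stand-in for an unpicklable object, not the actual structure plotEnrichment uses, and `can_pickle` is a hypothetical helper.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if `obj` survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# Plain Python containers pickle fine and can be sent to worker processes.
plain = {"chr1": [(100, 200), (300, 400)]}

# A lock wraps an OS-level primitive and cannot be pickled; it stands in
# here for any object backed by C-level state.
handle = threading.Lock()

print(can_pickle(plain))
print(can_pickle(handle))
```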

dpryan79 · 2017-07-13T12:31:11Z

Relatedly and courtesy of Fidel, see here: https://github.com/maxplanck-ie/HiCExplorer/blob/develop/hicexplorer/hicBuildMatrix.py#L13-L15

dpryan79 · 2017-07-25T11:58:50Z

Ugh, thank goodness this is only relevant for plotEnrichment, since the method used in HiCExplorer has to explicitly copy everything into shared memory, which is only easy for very, very simple data structures. I have a feeling that the Manager interface in multiprocessing, which basically starts a server process that acts as a data server, will end up being the simplest, if annoying, solution.

I really hate Python's global interpreter lock; it just creates headaches.

dpryan79 · 2017-07-25T14:25:39Z

Hmm, the path of least resistance for this is to just read in the files before forking and make the Enrichment object global, since then the memory is just copied. This is just as memory inefficient as before but is still probably a LOT faster for such large files.
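A minimal sketch of this "load once, then fork" pattern, assuming a Unix system where the `fork` start method is available. The names `REGIONS`, `load_regions`, `count_hits`, and `run` are hypothetical and only stand in for the Enrichment object and the per-chunk work; this is not the code in the develop branch.

```python
import multiprocessing as mp

# Hypothetical stand-in for the global Enrichment object.
REGIONS = None


def load_regions(bed_lines):
    """Parse BED-like lines into {chrom: [(start, end), ...]}."""
    regions = {}
    for line in bed_lines:
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def count_hits(chunk):
    """Count positions in `chunk` (a list of (chrom, pos)) that fall in a region.

    Reads the global REGIONS, which each forked worker inherits from the parent.
    """
    hits = 0
    for chrom, pos in chunk:
        for start, end in REGIONS.get(chrom, ()):
            if start <= pos < end:
                hits += 1
                break
    return hits


def run(bed_lines, chunks, workers=2):
    global REGIONS
    # Load once in the parent, before any worker process exists.
    REGIONS = load_regions(bed_lines)
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux)
    with ctx.Pool(workers) as pool:
        return sum(pool.map(count_hits, chunks))


if __name__ == "__main__":
    bed = ["chr1\t100\t200", "chr2\t0\t50"]
    chunks = [[("chr1", 150), ("chr1", 250)], [("chr2", 10), ("chr3", 5)]]
    print(run(bed, chunks))
```

Only the small per-chunk arguments and integer results pass through pickle; the large region table is never serialized, which is why this sidesteps the pickling question raised earlier.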

dpryan79 · 2017-07-25T21:52:57Z

That turns out to work reasonably well as a solution. I still hate how much memory Python is wasting, but that's at least largely unchanged. This is now implemented in the develop branch.

thomasmanke added the enhancement label May 16, 2017

dpryan79 closed this as completed Jul 25, 2017
