Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sampling from bam (plotEnrichment) #530

Closed
thomasmanke opened this issue May 16, 2017 · 8 comments
Closed

sampling from bam (plotEnrichment) #530

thomasmanke opened this issue May 16, 2017 · 8 comments

Comments

@thomasmanke
Copy link
Collaborator

use case: need to estimate of fraction of reads overlapping with many regions (restriction sites)
Current solution
plotEnrichment --BED RS.bed -b Input.bam --Offset 1 --outRawCounts RS.freq -p 10

is very slow for BED-file with 43M entries. This might be improved by sampling from Input.bam.

Currently --region could be used, but I would prefer to sample independently of chromosomes, i.e a certain fraction or a given number of reads. Such a sampling parameter could become an extra filter for all tools.

@dpryan79
Copy link
Collaborator

Before going down this route, let's do some performance profiling first.

@dpryan79
Copy link
Collaborator

The slowness in this case comes from needing a few minutes to read in the BED file in each chunk sent for processing. One option would be to make that variable and to set it to something like 20 megabases in the case you're experiencing. Note that that won't work if --region is specified, since then whatever is passed in will get ignored and overwritten.

I'll profile the C code to see if there's anything that can be improved there, but I'm not holding my breath, since it makes no assumption of sort order (and that's not something I'll change).

@fidelram
Copy link
Collaborator

fidelram commented May 17, 2017 via email

@dpryan79
Copy link
Collaborator

dpryan79 commented May 17, 2017

Is there a way to share things that can't be pickled (I look forward to @thomasmanke asking what the heck this means :) )?

@dpryan79
Copy link
Collaborator

@dpryan79
Copy link
Collaborator

dpryan79 commented Jul 25, 2017

Ugh, thank goodness this is only relevant for plotEnrichment, since the method used in HiC explorer has to explicitly copy everything into shared memory, which is only easy for very very simple data structures. I have a feeling that the Manager interface in multiprocessing, which basically starts a thread that acts as a data server, will end up being the simplest, if annoying, solution.

I really hate python's global interpreter lock, it just creates headaches.

@dpryan79
Copy link
Collaborator

Hmm, the path of least resistance for this is to just read in the files before forking and make the Enrichment object global, since then the memory is just copied. This is just as memory inefficient as before but is still probably a LOT faster for such large files.

@dpryan79
Copy link
Collaborator

That turns out to work reasonably well as a solution. I still hate how much memory python is wasting, but that's at least largely unchanged. This is now implemented in the develop branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants