Step 4 taking too long #23

Open
re2srm opened this issue Feb 20, 2024 · 7 comments

@re2srm commented Feb 20, 2024

Hello,

I am running some human Hi-C data through the pipeline, and step 4 is taking extremely long. At the end of step 3 I have around 150M uniquely mapping pairs and 35M multi-mapping pairs. So far step 4 has been running for around 24 hours, and the only outputs are the validPairs.MULTI and validPairs.UNI files for chromosomes 1 and 2. At this rate the step would take more than 10 days to finish. Is this normal, and are there any parameters I can tune to make it faster? I am currently using all default parameters from the demo file (binning resolution of 10kb with KR normalization).

I can use up to 24 CPUs and 100G of memory, but this step doesn't seem to have any option to use multiple CPUs. I would appreciate any feedback.

Thanks

@yezhengSTAT (Owner)

Hello,
It seems that you got stuck at the duplicate-removal step, which involves a lot of sorting and merging. You may consider splitting the input file by chromosome into individual files and then running step 4 on each chromosome file in parallel. Otherwise, running at 10kb can be quite heavy for the human genome, depending on the sequencing depth. You may consider starting with the demo data, or running at a much lower resolution (such as 1Mb or even 10Mb), just to make sure all the steps run successfully.
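
For what it's worth, here is a minimal sketch of one way to do that split-and-parallelize, assuming step 4 can be pointed at a per-chromosome pair file. The input name, the column that holds the chromosome (column 2 here), and the run_step4.sh wrapper are placeholders, so adjust them to your actual validPairs layout and however you invoke step 4 locally:

# Sketch only: split a validPairs file by chromosome, then run step 4 per chromosome in parallel.
# Assumptions: the chromosome of read end 1 is in column 2 of the pair file, and
# "run_step4.sh <input>" stands in for your actual step-4 invocation.
input="sample.validPairs"
outdir="perChrom"
mkdir -p "$outdir"

# Write one file per chromosome, keyed on column 2 of the pair file.
awk -v dir="$outdir" '{ print > (dir "/" $2 ".validPairs") }' "$input"

# Run up to 24 step-4 jobs at a time, one per chromosome file.
ls "$outdir"/*.validPairs | xargs -P 24 -I{} bash run_step4.sh {}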

Hope it helps.

Thanks,
Ye

@re2srm (Author) commented Feb 20, 2024

Thanks, I already ran the demo data and everything worked fine. For splitting by chromosome, can I just specify a single chromosome in the chromosome list parameter and then run the step separately for each chromosome (instead of splitting the input file)?

I ran the previous steps at a resolution of 10kb. Would it be OK if I now run step 4 at a coarser resolution (larger bins), or do I have to start from the beginning? Also, are you recommending 1Mb or 10Mb just for testing, or for the actual analysis?

Thanks

@yezhengSTAT (Owner)

> Thanks, I already ran the demo data and everything worked fine. For splitting by chromosome, can I just specify a single chromosome in the chromosome list parameter and then run the step separately for each chromosome (instead of splitting the input file)?

Yes, that should also work!

> I ran the previous steps at a resolution of 10kb. Would it be OK if I now run step 4 at a coarser resolution (larger bins), or do I have to start from the beginning? Also, are you recommending 1Mb or 10Mb just for testing, or for the actual analysis?

Yeah... you have to go back to step 3, as the bin index is assigned in that step. Actually, most of the duplicate-removal procedure has nothing to do with binning: it basically sorts the alignment positions (chrom-start-end) and removes reads with identical genomic coordinates. In other words, you probably won't see much of a speed-up until the normalization step. You may want to monitor your memory usage to see whether that is the bottleneck limiting the sorting, or further split each chromosome into smaller chunks ordered by start position, or try similar strategies.
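
Just to illustrate that idea (this is not mHiC's actual step-4 code), the deduplication is essentially a coordinate sort followed by a uniqueness filter on the paired coordinates. The column positions below are assumptions about the pair-file layout, and the --parallel, -S, and -T options are simply a way to let GNU sort use more cores, memory, and scratch space if sorting is indeed the bottleneck:

# Illustrative sketch only, not the pipeline's own step-4 code.
# Assumed layout: columns 2-3 = chrom/position of read end 1, columns 5-6 = chrom/position of read end 2.
sort -k2,2 -k3,3n -k5,5 -k6,6n --parallel=24 -S 80G -T /scratch/tmp chr1.validPairs \
  | awk '!seen[$2"_"$3"_"$5"_"$6]++' > chr1.validPairs.dedup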

Choosing a resolution depends on the nature of your study. If you want to focus on more general structures like compartments or TADs, 1Mb or 100kb may be sufficient. If your study requires finer structures, or your target is a small region of the genome, then a much higher resolution is needed.

Best,
Ye

@re2srm (Author) commented Feb 21, 2024

Thanks, I am now running it in parallel after splitting by chromosome. It seems the bottleneck is indeed the duplicate-removal step, since using larger bins doesn't speed up the process. I just wanted to ask how I should merge the per-chromosome output files for the next steps. Should I just use 'cat' to concatenate them, like this:

cat *.MULTI.binPair.multi > allchroms.MULTI.binPair.multi
cat *.MULTI.binPair.multiKeys > allchroms.MULTI.binPair.multiKeys
cat *.binPairCount.uni > allchroms.binPairCount.uni
cat *.binPair.multi > allchroms.binPair.multi

@yezhengSTAT (Owner)

Yes, you are right!

@re2srm (Author) commented Feb 29, 2024

Hi,

I am getting this error when running step 6 on some of my chromosomes:

Traceback (most recent call last):
  File "/scratch/mHiC/s6_em.py", line 88, in <module>
    main(priorPath, multiFilePath, multiKeysPath, uniFilePath, outFilePath, threshold)
  File "s6_em_cython.pyx", line 306, in s6_em_cython.main
    (spline, interProb) = read_spline_prior(priorPath)
  File "s6_em_cython.pyx", line 79, in s6_em_cython.read_spline_prior
    interProb = min(spline.values())/2
ValueError: min() arg is an empty sequence

This is only happening on a few chromosomes; most run fine. I ran step 5 with normalization set to 'None' (because I was getting some other errors with KR normalization). Any idea why this might happen?

Thanks

@yezhengSTAT (Owner)

It seems that the spline is not being fitted correctly. It may be because the data is too sparse for those chromosomes. Maybe try larger bins?
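
One quick sanity check before step 6 is to confirm that the step-5 prior (spline) file for each chromosome is non-empty, since an empty spline table is what makes min() fail in read_spline_prior. The "perChrom/*.priors" pattern below is only a placeholder; substitute whatever your step-5 output files are actually named:

# Sketch: flag chromosomes whose step-5 prior file is missing or empty.
# "perChrom/*.priors" is a placeholder pattern, not the pipeline's real file name.
for prior in perChrom/*.priors; do
    if [ ! -s "$prior" ]; then
        echo "WARNING: empty or missing prior: $prior -- step 6 will fail for this chromosome"
    fi
done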

Thanks,
Ye
