Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HiC-Pro and FitHiChIP compatibility #22

Open
pna059 opened this issue Jan 9, 2023 · 10 comments
Open

HiC-Pro and FitHiChIP compatibility #22

pna059 opened this issue Jan 9, 2023 · 10 comments

Comments

@pna059
Copy link

pna059 commented Jan 9, 2023

Hi,
I would like to test including multimappers in my HiChIP datasets which have been analyzed by HiC-Pro and FitHiChIP pipelines before. Can I somehow use .ValidPairs or .allValidPairs from the HiC-Pro to avoid need for re-mapping the data?
Which mHiC output can I use as an input for the FitHiChIP pipeline?
When I ran entire mHiC pipeline including post-mHiC processing with filterT=0.5, the line numbers looked like this:

wc -l s6/*
67617525 s6/HiChIP_merged.validPairs.binPairCount.uniMulti
3654838480 s6/HiChIP_merged.validPairs.binPair.multi.mHiC
24566027 s6/HiChIP_merged.validPairs.binPair.multi.mHiC.binPairCount.multi

in comparison to the line number of the HiC-Pro .allValidPairs file from the same data:

201896336 HiChIP_HiCPro.allValidPairs

I am confused how comprable are these two results. Initially I thought I should use the HiChIP_merged.validPairs.binPair.multi.mHiC.binPairCount.multi, but the line numbers are too different.

Thank you!
Pavla

@yezhengSTAT
Copy link
Owner

Can I somehow use .ValidPairs or .allValidPairs from the HiC-Pro to avoid need for re-mapping the data?
Yes!

Which mHiC output can I use as an input for the FitHiChIP pipeline?
You will need to do some of the filtering based on how stringent you want the multi-mapping results to be and then merge with the uniquely mapping reads. Please see section: https://github.com/yezhengSTAT/mHiC#64-post-mhic-processing.

I am confused how comprable are these two results. Initially I thought I should use the HiChIP_merged.validPairs.binPair.multi.mHiC.binPairCount.multi, but the line numbers are too different.
You should use HiChIP_merged.validPairs.binPairCount.uniMulti file which should contains both the uniquely mapping and multi-mapping read count. To be safe, better to check what is in this file and ensure everything looks reasonable.
HiChIP_HiCPro.allValidPairs if I understand correctly contains the read pairs however, HiChIP_merged.validPairs.binPairCount.uniMulti is the bin pair results. In other words, the genome is binned by a fixed window size, say 1Mb or 100kb, and count how many valid reads fall within each bin-pair. You should expect the line number to be much smaller for the bin pair file.

Thanks,
Ye

@pna059
Copy link
Author

pna059 commented Jan 10, 2023

Dear Ye,
thank you for your reply. Which of the two: .ValidPairs or .allValidPairs shoudl I use and from which step exactly should I kick off the mHiC after HiC-Pro - s4 or s5? Can I just place the .validParis (or allValidPairs) to the respective directory and resume from that step?

@yezhengSTAT
Copy link
Owner

Hello,
I am not sure what does .allValidPairs look like but based on the name, it seems that .allValidPairs should contain all the read pairs. I think you should go with that one. Also, you probably also need to modify the data format (maybe excluding a few unnecessary columns? or adding the missing ones) before putting in step 3 of mHi-C. You can quickly run the demo data in mHi-C to see what the read pair file looks like after step 2 and is used as input for step 3.

Thanks,
Ye

@pna059
Copy link
Author

pna059 commented Apr 5, 2023

Hi Ye,
to proceed with mHiC result to FitHiChIP, I think the .uni.multi file would have to contain different columns to substitute the usual (HiC-Pro) resulting .allValidPair file.

head *.allValidPairs
A00703:228:HNF2FDRXY:2:1222:14642:20024 chr1H   46      +       chr1H   8584    -       237     HIC_chr1H_2     HIC_chr1H_55    18      2
A00703:228:HNF2FDRXY:1:1121:1099:21339  chr1H   46      +       chr1H   8585    -       238     HIC_chr1H_2     HIC_chr1H_55    32      6
A00703:50:HVY73DRXX:2:2125:7211:17738   chr1H   74      +       chr1H   67964526        -       235     HIC_chr1H_3     HIC_chr1H_507523        32      31
A00703:71:H2K2VDRXY:2:2210:28619:29919  chr1H   143     +       chr7H   165762903       -       367     HIC_chr1H_3     HIC_chr7H_1276961       31      42
A00703:50:HVY73DRXX:2:1223:9055:6621    chr1H   205     -       chr7H   372380305       -       182     HIC_chr1H_3     HIC_chr7H_2845290       32      42

while the s6/*.uniMulti file that I should use is just a count file:

head s6/*_merged.validPairs.binPairCount.uniMulti
chr1H 10000 chr1H 50000 4
chr1H 10000 chr1H 70000 5
chr1H 10000 chr1H 90000 2
chr1H 10000 chr1H 110000        1
chr1H 10000 chr1H 150000        1
chr1H 10000 chr1H 170000        2

....and this file is not accepted by FitHiChIP. So I reason, that I need to do the UNI+MULTI merging differently to achieve the right format for my purpose. Could you please advise me on how can I perform the merge step to get the file format analogous to one of .allValidPairs? Or how can I add the missing columns to the file?

Thank you!

@yezhengSTAT
Copy link
Owner

Hello,
It seems to me what you want are read pairs instead of bin pairs (you may notice that there is a "binPairCount" in the file name, such as s6/*_merged.validPairs.binPairCount.uniMulti or the"s4/${name}.validPairs.binPairCount.uni". In other words, the mHi-C results are at the bin-pair level, especially for the multi-mapping reads (binning the genome already). I guess within FitHiChIP, there is also some binning procedure, so I think you may see if they offer any bin-pair option as input or you can just treat each bin-pair as a read-pair.

Best,
Ye

@pna059
Copy link
Author

pna059 commented Apr 6, 2023

Hi Ye,
yes, I found it, finally. This type of format is accepted by FitHiChIP, indeed. I just missed the information that it needs to be placed in a different place in their config file.

"** Option 2: Tab delimited contact matrix format **
Using the Bed= option, user can put a tab delimited text file containing the binned intervals and their contact counts.
Of the format: chr1 start1 end1 chr2 start2 end2 contact"

Thank you so much for navigating me!
best,
Pavla

@pna059 pna059 closed this as completed Apr 6, 2023
@pna059 pna059 reopened this Apr 7, 2023
@pna059
Copy link
Author

pna059 commented Apr 7, 2023

Hi again,
sorry for just a (hopefully last) question:
The *.binPairCount.uniMulti is in the format: chr1 pos1 chr2 pos2 count
I need to convert this into the bin BEDPE format: chr1 start1 end1 chr2 start2 end2 contact
How can I do this correctly? Are the "pos" bin midpoints? My resolution setting was 20 000.
Thanks!

Pavla

@yezhengSTAT
Copy link
Owner

Not sure how did you bin the genome but at 20k if you see the first bin for chr1 is 10k, and followed by 30k, yes, pos refers to the midpoints of the bin.

Thanks,
Ye

@pna059
Copy link
Author

pna059 commented Apr 11, 2023

Great, the analysis of mHiC result using FitHiChIP works nice after converting the *.binPairCount.uniMulti by:

awk -v OFS="\t" '{print $1, $2-10000, $2+10000, $3, $4-10000, $4+10000, $5}' *.binPairCount.uniMulti > *.binPairCount.uniMulti.bedpe

I am surprised to see that the number of significant interactions (leaving the FitHiChIP parameters identical) went down (from 47k to 27k at 20kb resolution). Is it possible that when considering the multimappers, the final number fo interactions decreases? Should I re-analyze the ChIP-seq and re-do peak calling using Permseq [Zeng et al., 2015] or SmartMap (https://github.com/shah-rohan/SmartMap) and re-do peak calling in order to call HiChIP loops correctly for the repetitive regions? My current ChIP-seq peaks for submitting the interactions come from standard pipeline avoiding multimappers.

Best,
Pavla

@yezhengSTAT
Copy link
Owner

Hello Pavla,
For Hi-C, usually, I observe more significant interactions comparing Multi&Uni with the Uni-mapping settings. Maybe you can play around with the mHi-C filtering threshold and start with probability filtering at 0.9 to keep those highly confident multi-mapping reads to see how will the significant interaction number changes. Regardless, theoretically, there is no determinant rule that adding multi-mapping reads will always increase the significant interaction called by FitHiC or FitHiChIP. You can consider the two settings as two independent tests. The background signal and the distribution are completely different between those two tests. Therefore, the p values coming out of the test are not comparable. Therefore, I usually emphasize more on the order of significant interactions instead of p values themselves of those interactions. Namely, you may check among the top 100, 1k, 10k, 20k or 50k significant interactions and see if there are novel interactions called under Multi&Uni settings and if there are interactions deemed non-significant. Then go further and check if the novel ones are more reasonable (HiC map visualization, epigenomic tracks, literature etc).

Using matching ChIP-seq peaks from the Multi&Uni-setting definitely makes more sense, which is also what I did for the paper. However, you can start with the standard pipeline and see if they can already address your scientific target.

Best,
Ye

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants