Why do strands of pairs generated by pairtools parse differ from those in .sam alignments #168

jiangshan529 · 2023-01-20T21:24:21Z

Hello, I am trying to understand the details of how pairtools works. I picked one read pair from my pair.gz file as a test. It looks like this:

SRR710074.10273713      chr1    1054204 chr1    1054648 +       -       UU

Then I looked up the same read in the .bam file and found:

SRR710074.10273713      81      chr1    1054613 60      36M     =       1054204 -445    ACTCCACCCCCCAGCGCCCACCCTTGAGTCAGGGTG
SRR710074.10273713      161     chr1    1054204 60      36M     =       1054613 445     GTCGCTCCAGTCTGAGCCTGGCCGTCGCCTCCAGCA

I used the BLAT tool in the IGV genome browser and found that both reads are actually on the + strand, not one on the + strand and the other on the - strand as shown in the pair.gz file. How should I understand this?

I also searched for 'UU' in the pair.gz file and ran BLAT on some of the 'UU' pairs. Some of them match several sites on the hg38 genome, which is not what 'UU' is defined to mean. So how should I interpret this? Thanks for your help!

agalitsyna · 2023-01-20T21:53:08Z

Maintainer

Hi, @jiangshan529

Firstly, Hi-C is a paired-end sequencing method; the alignments in a pair can originate from opposite sides of your DNA molecule. See the explanations in our docs on bam parsing.

Next, the orientation of each alignment is encoded in the SAM FLAG, the second field of your sam/bam file, and you can decode it, e.g., here: https://broadinstitute.github.io/picard/explain-flags.html
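For readers who prefer to decode the flags locally, here is a minimal sketch. The bit values come from the SAM specification; the helper names `decode_flag` and `strand` are made up for this illustration and are not part of pairtools.

```python
# SAM FLAG bits relevant to paired-end reads (values from the SAM specification).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Return the names of the bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def strand(flag):
    """Strand of the alignment itself: '-' if bit 0x10 is set, otherwise '+'."""
    return "-" if flag & 0x10 else "+"


for flag in (81, 161):
    print(flag, strand(flag), decode_flag(flag))
```

Flag 81 has the reverse-strand bit set for the read itself, while flag 161 only has the mate-reverse bit set, so the read carrying flag 161 is on the + strand.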

In your case:

"SRR710074.10273713 81 chr1 1054613 60 36M" is the first alignment in the pair, originating from the R1 (forward) side of the read, and it's mapped to the - strand. Note that the reported coordinate is the minimum (leftmost) coordinate, which for a - strand alignment is its endpoint; the start (5' end) is 1054613-1+36=1054648. The 1 is subtracted because the 36 aligned bases span 1054613 through 1054648 inclusive. Both bam and pairtools positions are 1-based here, which is why the + strand mate keeps its bam coordinate 1054204 in the pairs file.
"SRR710074.10273713 161 chr1 1054204 60 36M" is the second alignment in the pair, originating from the R2 (reverse) side of the read, and it's mapped to the + strand.

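So the two alignments you report are already mapped to opposite strands of DNA. IGV reports something else, but this might be some IGV convention of reporting for paired-end reads, which is unrelated to the conventions of Hi-C data interpretation. Note also that SAM stores the SEQ of a reverse-strand alignment as its reverse complement, so running BLAT on the SEQ field will match the + strand regardless of the flag.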
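The coordinate arithmetic for the two alignments can be sketched as follows. This is an illustration only, not pairtools' actual code: `reference_length` and `five_prime_pos` are hypothetical helpers, clipping and other corner cases are ignored, and positions are kept in SAM's 1-based system, which reproduces the numbers quoted in this thread.

```python
import re


def reference_length(cigar):
    """Reference bases consumed by a CIGAR string (operations M, D, N, =, X)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")


def five_prime_pos(pos, cigar, flag):
    """1-based reference position of the alignment's 5' end.

    SAM POS is always the leftmost aligned base. For a reverse-strand
    alignment (bit 0x10 set) the 5' end is the rightmost aligned base.
    """
    if flag & 0x10:
        return pos + reference_length(cigar) - 1
    return pos


print(five_prime_pos(1054613, "36M", 81))   # alignment with flag 81
print(five_prime_pos(1054204, "36M", 161))  # alignment with flag 161
```

With the two records from the question, this gives 1054648 for the - strand alignment and 1054204 for the + strand one, matching the positions in the pairs line.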

Finally, let's dig into what standard pairtools parse is doing (and I assume you did not change reporting orientation settings, right?)

It reports two alignments as a pair, and it will be something like:
SRR710074.10273713 chr1 1054648 chr1 1054204 - + UU
However, the left alignment has a larger coordinate than the right one, and pairtools by default performs a flipping procedure so that the alignment with the smaller coordinate always comes first. So the pair will look like this:
SRR710074.10273713 chr1 1054204 chr1 1054648 + - UU
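A rough sketch of the flipping step, under stated assumptions: the function `flip_if_needed` and the `chrom_order` mapping are hypothetical names for this example, and the real ordering rules in pairtools (for instance, how chromosomes are ranked) may differ.

```python
def flip_if_needed(pair, chrom_order):
    """Swap the two sides of a pair so the smaller (chrom, pos) side comes first.

    pair: (chrom1, pos1, chrom2, pos2, strand1, strand2)
    chrom_order: dict mapping chromosome name to its rank (assumed for this sketch)
    """
    chrom1, pos1, chrom2, pos2, strand1, strand2 = pair
    key1 = (chrom_order[chrom1], pos1)
    key2 = (chrom_order[chrom2], pos2)
    if key1 > key2:
        # Positions and strands are swapped together, so each strand
        # stays attached to its own alignment.
        return (chrom2, pos2, chrom1, pos1, strand2, strand1)
    return pair


unflipped = ("chr1", 1054648, "chr1", 1054204, "-", "+")
print(flip_if_needed(unflipped, {"chr1": 0}))
```

Because the strands move with their positions, the - strand alignment at 1054648 ends up on the right side, which is why the pairs file shows "+ -" for this read.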

Let me know if you have further questions. This is also a discussion rather than an issue, so I will transfer it there for you, but I would appreciate it if this kind of help request went directly into Discussions next time. Thanks.
