different number of reads reported and in bam file #50

jdmontenegro · 2020-09-07T00:06:12Z

Hello,

I recently aligned reads from a heterozygous individual to the species' reference genome. After identifying alignment breakpoints, I ran Longshot on the conserved regions. Out of 1900 targets, 1780 were successfully split into haplotypes. However, I have noticed a few things:

The number of reads reported in the log differs from the number of reads in the BAM file produced:

$ tail longshot.log
...
Separate fragments
2020-09-07 01:41:45     11668 reads (23.47%) assigned to haplotype 1
2020-09-07 01:41:45     11576 reads (23.28%) assigned to haplotype 2
2020-09-07 01:41:45     26476 reads (53.25%) unassigned.
2020-09-07 01:41:45 Writing haplotype-assigned reads to bam files...
2020-09-07 01:43:11 Printing VCF file...

that is 49720 reads phased and unphased, but

$ samtools view 10_1-18299080.bam | cut -f 1 | sort | uniq | wc -l
109628

the bam produced by longshot contains 109628 unique reads. That does not add up. Do you know what is going on here? Or am I reading it wrong?

BTW, the number of phased reads (HP:i:1 and HP:i:2) is correct, so the problem is from the unphased reads.

Cheers,

Juan D. Montenegro

jdmontenegro · 2020-09-07T01:09:58Z

Hello,

Looking more closely at this issue, it appears that the BAM file produced by Longshot also contains supplementary and secondary alignments, even though these are filtered out before SNV discovery. After filtering these out, I get 29240 unphased reads, but I should be getting 26476, so there are ~3 thousand additional reads. What else could I be missing?

Cheers,

Juan D. Montenegro

vibansal · 2020-09-09T23:33:57Z

Longshot does output all alignments, so the total number of reads in the output should be exactly the same as in the input BAM. The statistics cover only the reads that remain after filtering. Can you confirm whether the extra reads are duplicates?

jdmontenegro · 2020-09-13T00:05:40Z

Hello,
Sorry for the late reply. The input and output do have the same number of records, so I am probably missing some filter to make the numbers match. Besides A30 and secondary/supplementary alignments, what else is being ignored by Longshot? I am asking because I usually reassemble the phased and unphased reads. What I usually see is that one end of the original contig cannot be phased, while the other end is split into two homologous haplotypes. In some cases, though, I get a very complex partition after assembly, so I guess some of the unphased reads actually help fill gaps between haplotypes but were too short and did not contain enough polymorphisms to be phased.

Any help selecting the appropriate set of unphased reads would be very helpful.

Cheers,

Juan D.

vibansal · 2020-09-17T17:14:51Z

Longshot filters out reads with low mapping quality in addition to secondary/supp. alignments. I have copied the list of filters from the code below:

record.is_quality_check_failed()
|| record.is_duplicate()
|| record.is_secondary()
|| record.is_unmapped()
|| record.mapq() < min_mapq
|| record.is_supplementary()

All these reads will be output as 'unphased'.
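For anyone trying to reproduce the log counts from the BAM, the condition above can be approximated using only the FLAG and MAPQ columns. The following is a minimal Python sketch, not Longshot's code: the flag bit values come from the SAM specification, while `is_filtered`, the example records and the `min_mapq=20` value are assumptions made here for illustration.

```python
# SAM FLAG bits (values from the SAM specification).
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

# All flag-based exclusions combined into a single mask (0xF04).
EXCLUDE_MASK = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE | SUPPLEMENTARY


def is_filtered(flag: int, mapq: int, min_mapq: int) -> bool:
    """Mirror the quoted condition: any excluded flag bit set, or MAPQ below the threshold."""
    return bool(flag & EXCLUDE_MASK) or mapq < min_mapq


# Hypothetical records as (flag, mapq) pairs; min_mapq=20 is an assumed value.
records = [(0, 60), (16, 5), (256, 60), (2048, 60), (1024, 60), (4, 0)]
kept = [r for r in records if not is_filtered(*r, min_mapq=20)]
print(len(kept), "of", len(records), "records pass the filters")
```

With samtools, the same selection should correspond to `samtools view -c -F 0xF04 -q <min_mapq>`, where `<min_mapq>` is whatever threshold Longshot was run with.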

jdmontenegro · 2020-09-17T18:31:52Z

Thank you for your reply. So once I filter all these, the reads that do not have the "HP" tag were not phased because either: 1) they did not have enough variants to be phased, 2) their variants were in conflict with other, more abundant reads, or 3) the locus is actually (mostly) homozygous. Would that be correct?

Cheers,

Juan D. Montenegro
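To check which reads end up in each group, one option is to tally unique read names by `HP` tag among the records that pass the filters listed earlier in the thread. This is a rough Python sketch under stated assumptions: it reads SAM text lines (as printed by `samtools view`), the function name `tally_by_haplotype`, the sample lines and `min_mapq=20` are all made up for illustration, and it is not how Longshot itself computes the log statistics.

```python
from collections import defaultdict

# Unmapped, secondary, QC-fail, duplicate and supplementary FLAG bits.
EXCLUDE_MASK = 0xF04


def tally_by_haplotype(sam_lines, min_mapq):
    """Count unique read names per group: HP1, HP2, unphased, or filtered."""
    names = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & EXCLUDE_MASK or mapq < min_mapq:
            key = "filtered"
        else:
            # Optional tags start at column 12; HP:i:<n> marks a phased read.
            hp = next((f[5:] for f in fields[11:] if f.startswith("HP:i:")), None)
            key = f"HP{hp}" if hp else "unphased"
        names[key].add(fields[0])
    return {k: len(v) for k, v in names.items()}


# Hypothetical SAM lines; r3 has a primary and a supplementary record.
sample = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tHP:i:1",
    "r2\t16\tchr1\t120\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tHP:i:2",
    "r3\t0\tchr1\t140\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r3\t2048\tchr1\t900\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r4\t0\tchr1\t160\t3\t10M\t*\t0\t0\tACGTACGTAC\t*",
]
counts = tally_by_haplotype(sample, min_mapq=20)
print(counts)
```

Note that a read name can appear in more than one group (here `r3` is counted both as unphased and as filtered), which is one way that counts of unique names can diverge from per-record counts.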

bluenote-1577 mentioned this issue Jul 21, 2021

Calling variants from supplementary alignments #68

Closed
