Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

different number of reads reported and in bam file #50

Open
jdmontenegro opened this issue Sep 7, 2020 · 5 comments
Open

different number of reads reported and in bam file #50

jdmontenegro opened this issue Sep 7, 2020 · 5 comments

Comments

@jdmontenegro
Copy link

jdmontenegro commented Sep 7, 2020

Hello,

I recently aligned reads from a heterozygous individual to the specie's reference genome. After identification of alignment breakpoints, I ran longshot on the conserved regions. Out of 1900 targets, 1780 were succesfully split into haplotypes. However, I hav noticed a few things:

  1. there is a different number between the number of reported reads and the number of reads in the bam produced:
$ tail longshot.log
...
Separate fragments
2020-09-07 01:41:45     11668 reads (23.47%) assigned to haplotype 1
2020-09-07 01:41:45     11576 reads (23.28%) assigned to haplotype 2
2020-09-07 01:41:45     26476 reads (53.25%) unassigned.
2020-09-07 01:41:45 Writing haplotype-assigned reads to bam files...
2020-09-07 01:43:11 Printing VCF file...

that is 49720 reads phased and unphased, but

$ samtools view 10_1-18299080.bam | cut -f 1 | sort | uniq | wc -l
109628

the bam produced by longshot contains 109628 unique reads. That does not add up. Do you know what is going on here? Or am I reading it wrong?

BTW, the number of phased reads (HP:i:1 and HP:i:2) is correct, so the problem is from the unphased reads.

Cheers,

Juan D. Montenegro

@jdmontenegro
Copy link
Author

Hello,

Looking more closely to this issue, it appears that the bam file produced by longshot also contains supplementary and secondary alignments, even though these are filtered out before SNV discovery. Filtering these out, I get 29240 unphased reads, but I should be getting 26476, so there are ~ 3 thousand additional reads. Wat else could I be missing?

Cheers,

Juan D. Montenegro

@vibansal
Copy link
Collaborator

vibansal commented Sep 9, 2020

Longshot does output all alignments therefore the total number of reads in the output should be exactly the same as in the input bam. The statistics are only for the filtered reads. Can you confirm if the extra reads are duplicates?

@jdmontenegro
Copy link
Author

Hello,
Sorry for the late reply. The input and output do have the same number of records. So I am probably missing some filter to make the numbers match. Besides A30 and secondary/supplementary alignments, what else is being ignored by Longshot? I amasking this, beacuse I am usually, reassemblying the phased and unphased reads. Usually what I see is that one end of the original contig cannot be phased, while the other end is split into two homologous haplotypes, but in some cases I do get a very complex partition after assemblying, so I guess some of the unphased reads actually help complete gaps between haplotypes but were too short and did not contain enough polymorphisms to be phased.

Any help selecting the appropriate set of unphased reads would be very helpful.

Cheers,

Juan D.

@vibansal
Copy link
Collaborator

Longshot filters out reads with low mapping quality in addition to secondary/supp. alignments. I have copied the list of filters from the code below:

record.is_quality_check_failed()
|| record.is_duplicate()
|| record.is_secondary()
|| record.is_unmapped()
|| record.mapq() < min_mapq
|| record.is_supplementary()

All these reads will be output as 'unphased'.

@jdmontenegro
Copy link
Author

jdmontenegro commented Sep 17, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants