Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bismark_methylation_extractor difference between pair end or single end #579

Closed
yxwucq opened this issue May 11, 2023 · 7 comments
Closed

Comments

@yxwucq
Copy link

yxwucq commented May 11, 2023

Hi Felix!

I'm using bismark for wgbs data. I'm confused about the difference between pair end or single end data in bismark_methylation_extractor function. I can't find out the difference since they both just extract methylation data from bam file.

In this issue #360 you said that Read1 and Read2 follow each other on consecutive lines, so this command will fail

bismark -p 4 --parallel 2 --fastq --non_directional ref/genome -1 test_1_clean.fq.gz -2 test_2_clean.fq.gz
deduplicate_bismark --bam test_1_clean_bismark_bt2_pe.bam
samtools sort test_1_clean_bismark_bt2_pe.deduplicated.bam > sorted.bam # using samtools sort
bismark_methylation_extractor --genome_folder ref/genome sorted.bam # failed

However, when I added --se to bismark_methylation_extractor command, it worked. Then I compared the output generated from the RIGHT pipeline (without samtools sort). It seemed that the RIGHT file kept a few more records.

wc -l CHG_CTOB_sorted.txt 
> 23068
wc -l CHG_CTOB_test_1_clean_bismark_bt2_pe.deduplicated.txt
> 24769

What's the difference between two modes and the output file?

A lot of thanks!

@FelixKrueger
Copy link
Owner

If you use the deduplciation command:

deduplicate_bismark --bam test_1_clean_bismark_bt2_pe.bam

and then use that file as the input for the methylation extractor it should all work fine (in paired-end mode):

bismark_methylation_extractor --genome_folder ref/genome test_1_clean_bismark_bt2_pe.bam

If you extract the methylation information from this file as single-end file, it will just extract all calls from each file, and not take into consideration the paired-end nature of your data - which means that it will not do an overlap detection between R1 and R2, and therefore you would see parts of the read that are present in both R1 and R2 extracted twice (which leads to an undue coverage bias).

I hope this is a good enough explanation, and you won't use samtools sort going forward?

@yxwucq
Copy link
Author

yxwucq commented May 11, 2023

Yeah, that pretty much makes sense! But "parts of the read that are present in both R1 and R2_extracted twice" means extracted sites in --se mode will over-count cytosine site? In that case, line count of pipeline1(after sorting with --se) should be greater than pipeline2(without sorting), which contradicts the wc -l result.

wc -l CHG_CTOB_sorted.txt 
> 23068
wc -l CHG_CTOB_test_1_clean_bismark_bt2_pe.deduplicated.txt
> 24769

@FelixKrueger
Copy link
Owner

My gut feeling is that your last statement is correct, but since this the wrong thing to do on at least two levels I don't really want to invest more energy into this to be perfectly honest. Maybe you want to follow this up some more?

@yxwucq
Copy link
Author

yxwucq commented May 12, 2023

After I checked the bam file, I found that bismark_methylation_extractor in --se mode discarded the second read of a pair, thus resulting the truncated output file. You are right that in any case people shouldn't use --se mode for their PE data. Here's the detail:

In sorted file:
image

In unsorted & pe mode file:
image

the raw bam (two reads):
image

@FelixKrueger
Copy link
Owner

Thanks for persisting. If I am not mistaken that single-end mode won't actually discard these calls, but they will get attributed to a different strand (in your case probably the OB strand), so the calls should be in a different file altogether. Can you just grep for that read ID in the other file?

But yea, don't do it :)

@yxwucq
Copy link
Author

yxwucq commented May 12, 2023

Exactly! Sorry that it didn't come up to me. The information of the second read was stored in OB file instead. In total the site chrM-143 was counted twice. Thanks again for your patience!

image

@FelixKrueger
Copy link
Owner

glad we found an explanation in the end! All the best going forward!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants