bismark_methylation_extractor difference between pair end or single end #579

yxwucq · 2023-05-11T11:54:02Z

Hi Felix!

I'm using bismark for wgbs data. I'm confused about the difference between pair end or single end data in bismark_methylation_extractor function. I can't find out the difference since they both just extract methylation data from bam file.

In this issue #360 you said that Read1 and Read2 follow each other on consecutive lines, so this command will fail

bismark -p 4 --parallel 2 --fastq --non_directional ref/genome -1 test_1_clean.fq.gz -2 test_2_clean.fq.gz
deduplicate_bismark --bam test_1_clean_bismark_bt2_pe.bam
samtools sort test_1_clean_bismark_bt2_pe.deduplicated.bam > sorted.bam # using samtools sort
bismark_methylation_extractor --genome_folder ref/genome sorted.bam # failed

However, when I added --se to bismark_methylation_extractor command, it worked. Then I compared the output generated from the RIGHT pipeline (without samtools sort). It seemed that the RIGHT file kept a few more records.

wc -l CHG_CTOB_sorted.txt 
> 23068
wc -l CHG_CTOB_test_1_clean_bismark_bt2_pe.deduplicated.txt
> 24769

What's the difference between two modes and the output file?

A lot of thanks!

The text was updated successfully, but these errors were encountered:

FelixKrueger · 2023-05-11T13:10:42Z

If you use the deduplciation command:

deduplicate_bismark --bam test_1_clean_bismark_bt2_pe.bam

and then use that file as the input for the methylation extractor it should all work fine (in paired-end mode):

bismark_methylation_extractor --genome_folder ref/genome test_1_clean_bismark_bt2_pe.bam

If you extract the methylation information from this file as single-end file, it will just extract all calls from each file, and not take into consideration the paired-end nature of your data - which means that it will not do an overlap detection between R1 and R2, and therefore you would see parts of the read that are present in both R1 and R2 extracted twice (which leads to an undue coverage bias).

I hope this is a good enough explanation, and you won't use samtools sort going forward?

yxwucq · 2023-05-11T16:24:48Z

Yeah, that pretty much makes sense! But "parts of the read that are present in both R1 and R2_extracted twice" means extracted sites in --se mode will over-count cytosine site? In that case, line count of pipeline1(after sorting with --se) should be greater than pipeline2(without sorting), which contradicts the wc -l result.

wc -l CHG_CTOB_sorted.txt 
> 23068
wc -l CHG_CTOB_test_1_clean_bismark_bt2_pe.deduplicated.txt
> 24769

FelixKrueger · 2023-05-12T06:48:43Z

My gut feeling is that your last statement is correct, but since this the wrong thing to do on at least two levels I don't really want to invest more energy into this to be perfectly honest. Maybe you want to follow this up some more?

yxwucq · 2023-05-12T12:14:54Z

After I checked the bam file, I found that bismark_methylation_extractor in --se mode discarded the second read of a pair, thus resulting the truncated output file. You are right that in any case people shouldn't use --se mode for their PE data. Here's the detail:

In sorted file:

In unsorted & pe mode file:

the raw bam (two reads):

FelixKrueger · 2023-05-12T13:07:57Z

Thanks for persisting. If I am not mistaken that single-end mode won't actually discard these calls, but they will get attributed to a different strand (in your case probably the OB strand), so the calls should be in a different file altogether. Can you just grep for that read ID in the other file?

But yea, don't do it :)

yxwucq · 2023-05-12T13:43:06Z

Exactly! Sorry that it didn't come up to me. The information of the second read was stored in OB file instead. In total the site chrM-143 was counted twice. Thanks again for your patience!

FelixKrueger · 2023-05-12T23:13:03Z

glad we found an explanation in the end! All the best going forward!

FelixKrueger closed this as completed May 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bismark_methylation_extractor difference between pair end or single end #579

bismark_methylation_extractor difference between pair end or single end #579

yxwucq commented May 11, 2023

FelixKrueger commented May 11, 2023

yxwucq commented May 11, 2023

FelixKrueger commented May 12, 2023

yxwucq commented May 12, 2023

FelixKrueger commented May 12, 2023

yxwucq commented May 12, 2023

FelixKrueger commented May 12, 2023

bismark_methylation_extractor difference between pair end or single end #579

bismark_methylation_extractor difference between pair end or single end #579

Comments

yxwucq commented May 11, 2023

FelixKrueger commented May 11, 2023

yxwucq commented May 11, 2023

FelixKrueger commented May 12, 2023

yxwucq commented May 12, 2023

FelixKrueger commented May 12, 2023

yxwucq commented May 12, 2023

FelixKrueger commented May 12, 2023