Skip to content

7. Compare alignments with bamDiff

Frederick Tan edited this page Oct 9, 2015 · 11 revisions

#Introduction to bamDiff

bamDiff will allow you to take the .bam files output by the aligners, along with the .csv obtained from runCsaw.R, and compare their performances. Depending on the options you pass to bamDiff you can see summary statistics for each bam file, see the regions where most reads are mapping, and how well the alignments from each aligner line up.

bamDiff compares the alignments in one mapping of RNAseq data to the alignments output by another, in a pairwise fashion

As shown in the image above, bamDiff will look at each read in the output from Aligner 1, and first checks to see if that read is even being mapped by the other aligner, or if it is unmapped. If it is being mapped, then bamDiff checks for where it is being mapped to. If a lot of reads from the first aligner are being mapped to a consistent region by the second aligner, that region will be reported.

#Using bamDiff

Make sure you are in your home directory: $ cd

cd into the bamdiff directory: $ cd bamdiff

bamDiff takes a number of inputs:

  • Options:
    • -s, -S, -v: These options control the output. We will talk more about them in a minute.
    • -n: This allows you to specify how many regions we want bamDiff to analyze
  • .csv file:
    • This will be your output from runCsaw.R. It needs to come before the .bam files
  • 2+ .bam files:
    • These will be the .bam files produced by the aligners.
  • For more information, check out bamDiff's documentation.

Let us start with a simple case, comparing only two aligners. By default, bamDiff will only analyze the first region in the .csv file:

Commands to type: (This should take approximately 5 minutes to run)

$ cd ~/bamdiff
$ ./bamDiff.py ../csaw/csaw.bam.results.csv ../hisat/hisat.sorted.bam ../star/star.sorted.bam

Screen Output:

File key:
#	NAME
1	../hisat/cached/hisat.sorted.bam
2	../star/cached/star.sorted.bam

Region 20:7889001-7890000
================================
                |       vs. file
File    # Reads |1      2
----------------+---------------
Proportion of unmapped reads
----------------+---------------
1       4       | -         0.00	
2       1036    | 0.99      -	
----------------+---------------
Median # times each read mapped
----------------+---------------
1       4       | -         7.0	
2       1036    | 1.0       -	
----------------+---------------
Median proportion times each read mapped inside region
----------------+---------------
1       4       | -         0.0	
2       1036    | 0.0       -	
================================

Regions where most identified reads are mapping in second bam file
Reads in region from file hisat.sorted.bam mapped to file star.sorted.bam at:
20	34563297	34563299	4
20	34635605	34635607	4
20	34639131	34639133	4
20	43554810	43554812	4
20	34472105	34472106	3
20	34620541	34620542	3
20	43356459	43356460	3

Reads in region from file star.sorted.bam mapped to file hisat.sorted.bam at:
20	7987382	7987383	4
20	14808875	14808876	1
20	7956794	7956795	1
20	17413469	17413470	1
20	8108928	8108929	1
20	8595745	8595746	1

Notice the File Key at the beginning of the output. This tells you which file 1 or 2 represents in the table.

The table itself reports the region being analyzed. The far left column tells you which file was being analyzed. The reads column indicates for that file, for the region being analyzed, how many reads fall in that region. The rest of the columns show the proportion of those reads which were left unmapped in the other files.

Finally, there is a list of regions in the format: chromosome, start position, end position, the number of counts in this region. For the reads in one file, if they mapped to the second, their locations are shown. If you inspect these regions using a service like the UCSC Genome Browser, you can look to see whether these regions contain repetitive regions, pseudo genes, or other elements.

If you pass the --verbose option, you can get additional information, including summary statistics about each of the bam files, the table identifying the proportion of reads in the first file which are unmapped in the second, a table reporting the median value of the number of times each read in the identified region in the first file are mapped in the second file, and the median value of the proportion of those mappings (for a given read) which are actually inside the same region being looked at. Finally is the same list of regions where there is an accumulation of reads from the region in file 1 being mapped to in file 2.

You can try this yourself by running the command

$ ./bamDiff.py --verbose ../csaw/csaw.bam.results.csv ../hisat/hisat.sorted.bam ../star/star.sorted.bam

You can also take a look at the files in the ~/bamdiff/cached/ directory. There are two files (4by4_n1.summary, 4by4_n1.results) which show the results from comparing all 4 aligners in verbose mode. As you can see, the output is broken up into two files (using the -o *file*/--output *file* option). The .summary file contains the simple summary statistics for each file, while the .results file contains the rest of the output. It took 96 minutes to run on all four.

Take a look around using less or cat. E.g.

$ cat ~/bamdiff/cached/4by4_n1.summary
$ cat ~/bamdiff/cached/4by4_n1.results

*Note: If following this tutorial locally (i.e. on your own machine) use these links to see the summary and results.