-
Notifications
You must be signed in to change notification settings - Fork 12
7. Compare alignments with bamDiff
#Introduction to bamDiff
bamDiff
will allow you to take the .bam files output by the aligners, along with the .csv obtained from runCsaw.R, and compare their performances. Depending on the options you pass to bamDiff
you can see summary statistics for each bam file, see the regions where most reads are mapping, and how well the alignments from each aligner line up.
As shown in the image above, bamDiff
will look at each read in the output from Aligner 1, and first checks to see if that read is even being mapped by the other aligner, or if it is unmapped. If it is being mapped, then bamDiff
checks for where it is being mapped to. If a lot of reads from the first aligner are being mapped to a consistent region by the second aligner, that region will be reported.
#Using bamDiff
Make sure you are in your home directory: $ cd
cd
into the bamdiff
directory:
$ cd bamdiff
bamDiff takes a number of inputs:
- Options:
-
-s, -S, -v
: These options control the output. We will talk more about them in a minute. -
-n
: This allows you to specify how many regions we want bamDiff to analyze
-
-
.csv file:
- This will be your output from runCsaw.R. It needs to come before the .bam files
- 2+ .bam files:
- These will be the .bam files produced by the aligners.
- For more information, check out bamDiff's documentation.
Let us start with a simple case, comparing only two aligners. By default, bamDiff will only analyze the first region in the .csv file:
Commands to type: (This should take approximately 5 minutes to run)
$ cd ~/bamdiff
$ ./bamDiff.py ../csaw/csaw.bam.results.csv ../hisat/hisat.sorted.bam ../star/star.sorted.bam
Screen Output:
File key:
# NAME
1 ../hisat/cached/hisat.sorted.bam
2 ../star/cached/star.sorted.bam
Region 20:7889001-7890000
================================
| vs. file
File # Reads |1 2
----------------+---------------
Proportion of unmapped reads
----------------+---------------
1 4 | - 0.00
2 1036 | 0.99 -
----------------+---------------
Median # times each read mapped
----------------+---------------
1 4 | - 7.0
2 1036 | 1.0 -
----------------+---------------
Median proportion times each read mapped inside region
----------------+---------------
1 4 | - 0.0
2 1036 | 0.0 -
================================
Regions where most identified reads are mapping in second bam file
Reads in region from file hisat.sorted.bam mapped to file star.sorted.bam at:
20 34563297 34563299 4
20 34635605 34635607 4
20 34639131 34639133 4
20 43554810 43554812 4
20 34472105 34472106 3
20 34620541 34620542 3
20 43356459 43356460 3
Reads in region from file star.sorted.bam mapped to file hisat.sorted.bam at:
20 7987382 7987383 4
20 14808875 14808876 1
20 7956794 7956795 1
20 17413469 17413470 1
20 8108928 8108929 1
20 8595745 8595746 1
Notice the File Key at the beginning of the output. This tells you which file 1 or 2 represents in the table.
The table itself reports the region being analyzed. The far left column tells you which file was being analyzed. The reads column indicates for that file, for the region being analyzed, how many reads fall in that region. The rest of the columns show the proportion of those reads which were left unmapped in the other files.
Finally, there is a list of regions in the format: chromosome, start position, end position, the number of counts in this region. For the reads in one file, if they mapped to the second, their locations are shown. If you inspect these regions using a service like the UCSC Genome Browser, you can look to see whether these regions contain repetitive regions, pseudo genes, or other elements.
If you pass the --verbose
option, you can get additional information, including summary statistics about each of the bam files, the table identifying the proportion of reads in the first file which are unmapped in the second, a table reporting the median value of the number of times each read in the identified region in the first file are mapped in the second file, and the median value of the proportion of those mappings (for a given read) which are actually inside the same region being looked at. Finally is the same list of regions where there is an accumulation of reads from the region in file 1 being mapped to in file 2.
You can try this yourself by running the command
$ ./bamDiff.py --verbose ../csaw/csaw.bam.results.csv ../hisat/hisat.sorted.bam ../star/star.sorted.bam
You can also take a look at the files in the ~/bamdiff/cached/
directory. There are two files (4by4_n1.summary
, 4by4_n1.results
) which show the results from comparing all 4 aligners in verbose mode. As you can see, the output is broken up into two files (using the -o *file*
/--output *file*
option). The .summary file contains the simple summary statistics for each file, while the .results file contains the rest of the output. It took 96 minutes to run on all four.
Take a look around using less
or cat
. E.g.
$ cat ~/bamdiff/cached/4by4_n1.summary
$ cat ~/bamdiff/cached/4by4_n1.results
*Note: If following this tutorial locally (i.e. on your own machine) use these links to see the summary and results.