################################################################################################
CRISP: Comprehensive Read Analysis for Identification of SNVs (and short indels) from Pooled sequencing data
Contact: Vikas Bansal, [email protected]
Citation:
A statistical method for the detection of variants from next-generation resequencing of DNA pools, V. Bansal. Bioinformatics 26(12), 2010
Efficient and Cost Effective Population Resequencing by Pooling and In-Solution Hybridization. V. Bansal et al. PloS ONE 2011.
#################################################################################################
CRISP is a software program to detect SNPs and short indels from pooled sequencing data. CRISP is designed to detect both rare and common variants by utilizing sequence reads from next-generation sequencing of multiple DNA pools. CRISP uses a cross-pool comparison approach to distinguish sequencing errors from rare variants. Note that the method is not designed for variant detection from a single pool. CRISP has been evaluated on several pooled sequencing datasets (human and bacterial) generated using the Illumina sequencing platform. In principle, it should work for sequence data from other sequencing platforms. The method requires each pool to be sequenced using the same sequencing platform.
CRISP is implemented in C and uses the SAMtools API for reading BAM files. The source code for samtools is included in the source code (subdirectory samtools). Samtools requires certain libraries (libncurses-dev) to compile. Typing "make all" should compile both samtools and the CRISP source code. The compiled executable 'CRISP.binary' should be in the 'bin' subfolder. CRISP has been tested on Debian & Ubuntu Linux.
The first step is to align the reads from the sequencing experiment to the reference genome using BWA (or your favorite aligner) and generate sorted BAM file for each pool. To improve detection of indels, it is recommended to run an indel realigner program (such as GATK) to realign reads and generate a realigned BAM file. Once a single bam file for each pool is generated, CRISP can be run on the BAM files to detect variants by comparing against a reference sequence.
./CRISP [options] --bams file_bam_paths --ref reference.fasta --VCF variantcalls.VCF --poolsize poolsize --bed targets.bed > variantcalls.log
Important Notes:
-
CRISP requires at least four arguments: poolsize, reference fasta file, bamfiles and the output VCF file
-
BAM files can be specified in two ways: (a) --bams: list of all bam files in a single text file (b) --bam: specify individual bam files on the command line
-
The reference sequence file should be indexed using 'samtools faidx' and the index file reference.fasta.fai placed in the same directory as reference.fasta
-
For targeted sequencing studies, it is recommended to use a bed file for variant calling. If a bedfile is not specified, the program will evaluate each base in the genome for variant calling.
-
CRISP requires at least two pools to make variant calls, but at least 5 pools are recommended. CRISP should be run separately on pools sequenced on different sequencing instruments.
-
The number of haplotypes in each pool (--poolsize) is assumed to be the same. For variable poolsizes, the poolsize should be specified inn the input file with the list of bam files (see FAQ for details on the format).
-
CRISP can be parallelized by calling variants on specific chromosomes (or regions) using the --regions option. This requires the bam files to be indexed.
--bams textfile with list of bam file paths (one for each pool)
--bam bam file for one pool, specify filename for each pool using --bam pool1.bam --bam pool2.bam .... --bam pooln.bam
--ref Indexed Reference Sequence file (fasta)
--bed bedfile with intervals in which variants will be called
--poolsize poolsize (number of haploid genomes in each pool), for diploid genomes: 2 x # individuals
--VCF VCF file to which the variant calls will be output
--qvoffset quality value offset, 33 for Sanger format, 64 for Illumina 1.3+ format
--mbq minimum base quality to consider a base for variant calling, default 10
--mmq minimum read mapping quality to consider a read for variant calling, default 20
--regions region(s) in which variants will be called, e.g chr1:654432-763332. BAM files should be indexed for this option with the index file pooln.bam.bai
--minc minimum number of reads with alternate allele required for calling a variant, default 4
--ctpval threshold on the contingency table p-value for calling position as variant (specified as log10), default is -3.5, increase this threshold as number of pools increases
--qvpval threshold on the quality values based p-value for calling position as variant (specified as log10), default is -5
--perms maximum number of permutations for calculating contingency table p-value, default 20,000
--filterreads filter reads with excessive number of mismatches (and gaps) compared to the reference sequence, Default is 1. Set to 0 to disable filtering
CRISP outputs the variants identified to a VCF file. For each variant, CRISP outputs the allele depth (read counts) for the reference and variant alleles at each poisition in the VCF file. This can be used for calculating allele frequencies or doing case-control association analysis. Multi-allelic variants and indels are also reported.
Update: the latest version of CRISP outputs genotypes for each pool (--EM 1, default) as allele counts. The CRISP VCF can be converted to a standard pooled genotype VCF using a script provided in the scripts sub-directory.
NP: Number of Pools With Data VP: Number of Pools with variant allele(s) DP: Total number of reads (+strand,-strand) across all pools (filtered reads only) CT: contingency table p-value for each variant allele in same order as listed in column 5 QVpf (QVpr): quality values based p-value for each variant allele using forward(reverse) strand reads MQ: number of reads with mapping qualities <10 | 10-19 | 20-39 | >=40 VT: variant type, SNV | DELETION | INSERTION HP: length in the ambiguity of positioning of indels (homopolymer length or microsatellite length) FLANKSEQ: This represents the reference haplotype sequence (length spans the homopolymer or microsatellite tract) with 10 bases either side of the variant position (10bases_upstream:reference_haplotype_sequence:10bases_downstream). This is useful to eyeball indels that occur in long homopolymer or microsatellite tracts.
SB: variant demonstrates strand bias PASS: variant passes all filters
AF: frequency of variant alleles(s) in the pool in order listed in column 5 ADf: Number of reads aligned to the forward strand of the genome supporting reference allele and the alternate alleles in the order listed ADr: Number of reads aligned to the reverse strand of the genome supporting reference allele and the alternate alleles in the order listed
Please note that some of these fields may be removed/updated in future releases.