Simplistic pipeline to call PacBio Data
This pipeline uses pbmm2, from PacBio's smrtlink, to align reads to a reference genome. It then uses Google's DeepVariant to call small variants and PacBio's pbsv to call structural variants.
You need a FASTA file that represents the reference genome. This is used by both of the above tools. It needs to be indexed, and because the two tools use different indexes you have to index it twice:
- Index this first using pbmm2, e.g. pbmm2 index ref38.fasta ref38.mmi (the second argument names the output index). This produces a file with an mmi suffix.
- Index using samtools, e.g. samtools faidx ref38.fasta. This produces a file with an fai suffix.
The index files should be in the same directory as the FASTA file. Use symbolic links if necessary.
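If an index was built somewhere else, a symbolic link can place it beside the FASTA file. The following is a minimal sketch, not part of the pipeline: link_index is a helper name introduced here for illustration, and all paths in the usage comments are placeholders.

```shell
#!/bin/sh
# link_index (hypothetical helper): symlink an existing index file into the
# directory that holds the reference FASTA.
#   $1 = existing index file (mmi or fai)
#   $2 = reference FASTA file
link_index() (
    index_file="$1"
    fasta="$2"
    # Resolve the index location to an absolute path so the link stays valid
    # regardless of the directory it is followed from.
    index_dir="$(cd "$(dirname "$index_file")" && pwd)"
    index_name="$(basename "$index_file")"
    fasta_dir="$(dirname "$fasta")"
    # -s makes a symbolic link; -f replaces a stale link of the same name.
    ln -sf "$index_dir/$index_name" "$fasta_dir/$index_name"
)

# Usage (placeholder paths):
# link_index /data/indexes/ref38.mmi       /data/ref/ref38.fasta
# link_index /data/indexes/ref38.fasta.fai /data/ref/ref38.fasta
```

The helper assumes the index is not already in the FASTA directory; if it is, no link is needed.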
To be specified on the command line or in the config file
- --input: A comma-separated list of input directories, with no spaces in the list. Each directory should contain the FASTQ files used as input.
- --ref_seq: The full path to the reference genome. This should be a FASTA file.
- --ref_mmi: The full path to an index file (mmi) of the reference genome. This is produced by pbmm2.
- --ref_fai: The full path to an index file (fai) of the reference genome. This is produced by samtools faidx.
- --bam: Where the BAM files should be placed. Usually an output directory, but see --has_bam.
- --has_bam (default false): Set to true if the bam directory above should be treated as an input.
- --tandem_example: A BED file with tandem repeat annotation to help discovery. Download from https://github.com/PacificBiosciences/pbsv/blob/master/annotations/human_GRCh38_no_alt_analysis_set.trf.bed (the build 37 annotation is also available).
- --bamify_cpus: How many cores creating the BAM file uses (default 16).
- --bamify_mem: How much memory BAM creation needs (default 32GB).
- --call_cpus: How many cores calling requires (default 16).
- --call_mem: How much memory calling requires (default 48GB).
- --output: The name of the jointly called VCF file output by pbsv.
- --chrom_prefix: The default is chr. BAM/VCF files can refer to a chromosome as chr7 or just as 7. The various tools in the pipeline need to know which. For build 38, chr7 is more common and this is the default, but if your data is different you need to set this.
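The options above can be assembled on the command line roughly as follows. This is only a sketch: this README does not name the launch command, so PIPELINE_CMD is a placeholder that defaults to echo (a dry run that prints the arguments), and every path and the output file name are hypothetical.

```shell
#!/bin/sh
# PIPELINE_CMD is a placeholder, not a real command name: set it to however
# you launch this pipeline. With the default (echo) nothing is run; the
# arguments are only printed.
PIPELINE_CMD="${PIPELINE_CMD:-echo}"

# All paths below are hypothetical.
REF=/data/ref/ref38.fasta

# --input is a comma-separated list with no spaces.
# --chrom_prefix already defaults to chr; it is spelled out here for clarity.
# Options left out (--has_bam, --bamify_cpus, ...) keep their defaults.
set -- \
    --input /data/run1,/data/run2 \
    --ref_seq "$REF" \
    --ref_mmi /data/ref/ref38.mmi \
    --ref_fai "$REF.fai" \
    --bam /data/out/bam \
    --tandem_example /data/ref/human_GRCh38_no_alt_analysis_set.trf.bed \
    --output joint_calls.vcf \
    --chrom_prefix chr

# PIPELINE_CMD is left unquoted on purpose so a multi-word launch command
# is split into words.
$PIPELINE_CMD "$@"
```

The same values can instead be placed in the config file.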
Please put module load smrtlink into your .bashrc file.
Google's DeepVariant pipeline relies on TensorFlow, which in turn relies on computers with AVX instruction support. A few of the older nodes on the cluster do not support AVX instructions, so you need to make sure that SLURM gives you the nodes you need. If you see jobs failing with a 252 error, you may have overlooked this. Add the following option:
--constraint=avx2
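To check whether a particular node supports the required instructions, the CPU flags can be inspected. This is a minimal sketch assuming a Linux node that exposes /proc/cpuinfo; has_avx2 is a helper name introduced here for illustration, and my_job.sh is a placeholder script name.

```shell
#!/bin/sh
# has_avx2 (hypothetical helper): succeeds if a cpuinfo-style file lists the
# avx2 flag. Defaults to /proc/cpuinfo, which exists on Linux only.
has_avx2() {
    # -q: no output, exit status only; -w: match avx2 as a whole word.
    grep -qw avx2 "${1:-/proc/cpuinfo}"
}

# Usage, run on the compute node itself (for example in an interactive job):
# if has_avx2; then
#     echo "AVX2 available on $(hostname)"
# else
#     echo "no AVX2 on $(hostname)"
# fi
#
# Requesting suitable nodes when submitting your own job to SLURM:
# sbatch --constraint=avx2 my_job.sh
```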