This pipeline emulates a Primer-BLAST to find primer matching sequences in a nucleotide database and recover the sequences flanked by these matching sites.
With this pipeline you can:
- Filter the query database using an ancestor taxid node
- Set the minimal coverage and identity of the primer sequences for the BLAST step
- Scan the database with an arbitrary number of primers
- Recover a fasta file with flanked sequences
- Produce a BLAST-formated database of flanked sequences
RRW-PrimerBLAST runs in a UNIX environment with BASH (tested on Debian GNU/Linux 10 (buster)) and requires conda and an internet connection (at least for the first run).
Start by getting a copy of this repository on your system, either by downloading and unpacking the archive, or using 'git clone':
cd path/to/repo/
git clone --recurse-submodules https://github.com/CVUA-RRW/RRW-PrimerBLAST.git
Set up a conda environment containing snakemake, python and the pandas library and activate it:
conda create --name snakemake -c bioconda -c anaconda snakemake pandas biopython
conda activate snakemake
RRW-PrimerBLAST requires several databases to run, all are available from the NCBI ftp servers:
- taxdump
- taxdb
- Any nucleotide collection you want to use, this needs to be a searchable BLAST database with taxonomy information. For this you can build a local database from a subset of sequences, for exemple from the BOLD database. Check the BLAST documentation to know how to do this.
RRW-PrimerBLAST should be run using the snakemake command-line application. For this you will need to manually fill the config.yaml file with the paths to the required files. You can also modify the parameters already present in the file.
Then run the pipeline with:
snakemake -s /path/to/FooDMe/Snakefile --configfile path/to/config.yaml --use-conda --conda-prefix path/to/your/conda/envs
Consult snakemake's documentation for more details.
The configuration file contains the following parameters:
# Fill in the path belows with your own specifications:
workdir: # Path to output directory
blast_db: # Path to BLAST-formated database
taxdb: # Path to the folder containing the taxdb files
rankedlineage_dmp: # Path to rankedlineage.dmp
nodes_dmp: # Path to nodes.dmp
primers: # Path to the primer sequence files (in FASTA), any number of primers is acceptable
# Modify the parameters below:
parent_node: 32524 # Ancestor node to pre-filter BLAST database, 1 to ignore
primerBlast_coverage: 100 # Minimal query coverage for primer BLAST (0-100)
primerBlast_identity: 100 # Minimal sequence identity value for primer BLAST (0-100)
Flanked sequences can be seen as in silico PCR amplicons. However note that only the first amplicon for each database sequence will be returned! A flanked sequence is simply defined as the first sequences between a primer match on the plus strand followed by a match on the minus strand. See examples below:
-> indicates a match on the '+' strand
<- indicates a match on the '-' strand
primer matches: <- -> -> <- ->
database sequence: =========================
flanked sequence: ========
primer matches: -> <- -> <- ->
database sequence: =========================
flanked sequence: =====
primer matches: <- <- -> -> ->
database sequence: =========================
flanked sequence:
RRW-PrimerBLAST is built with Snakemake and uses the BLAST+
For new features or to report bugs please submit issues directly on the online repository.
This project is licensed under a BSD 3-Clauses License, see the LICENSE file for details.
For questions about the pipeline, problems, suggestions or requests, feel free to contact:
Grégoire Denay, Chemisches- und Veterinär-Untersuchungsamt Rhein-Ruhr-Wupper