04. Generating Required Inputs for lsaBGC
lsaBGC-Ready.py simplifies the usage of lsaBGC for downstream analysis by taking in precomputed antiSMASH results and, optionally, GCF specifications from a prior BiG-SCAPE analysis.
lsaBGC-Ready.py will first perform genome-wide gene calling (if genomes are provided as FASTAs) and attempt to match gene calls to those in the antiSMASH BGC Genbanks, renaming locus_tags for predicted CDS features to be sample specific. Afterwards, it will extract proteins from the antiSMASH BGC Genbanks and run OrthoFinder2 to determine homologs. Finally, to satisfy some assumptions in lsaBGC's backend, it will attempt to determine paralogs of BGC-associated homolog groups across genomes (mainly to be able to confidently identify whether homolog groups are specific to BGCs or can also be found in background genomic contexts).
The four major outputs of lsaBGC-Ready.py are:
- Homolog group vs. Sample presence/absence matrix
- antiSMASH BGC Listings file
- Full Genome Predicted Proteome & Genbank Listings file
- GCF listings file (optional; produced if BiG-SCAPE results are provided)
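To make the first output more concrete, here is a minimal sketch of how a homolog group vs. sample presence/absence matrix can be derived from per-protein assignments. The input shape, the sample names, and the homolog group identifiers are hypothetical, and the file actually written by lsaBGC-Ready.py may be laid out differently.

```python
from collections import defaultdict


def presence_absence(assignments):
    """Build a homolog group x sample matrix of 0/1 values.

    `assignments` is an iterable of (sample, homolog_group) pairs, one per
    protein. This input shape is an assumption made for illustration only.
    """
    pairs = list(assignments)  # allow generators to be passed in
    samples = sorted({sample for sample, _ in pairs})
    groups = sorted({group for _, group in pairs})

    # Record which samples carry at least one member of each homolog group.
    carriers = defaultdict(set)
    for sample, group in pairs:
        carriers[group].add(sample)

    matrix = {
        group: [1 if sample in carriers[group] else 0 for sample in samples]
        for group in groups
    }
    return samples, matrix


# Hypothetical example data.
example_pairs = [("sampleA", "HG_1"), ("sampleB", "HG_1"), ("sampleA", "HG_2")]
example_samples, example_matrix = presence_absence(example_pairs)
```

Each row of the resulting matrix corresponds to one homolog group, and each column to one sample in sorted order.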
usage: lsaBGC-Ready.py [-h] -i GENOME_LISTING -l ANTISMASH_LISTING -o OUTPUT_DIRECTORY [-b BIGSCAPE_RESULTS] [-d ADDITIONAL_GENOME_LISTING] [-a] [-g] [-lc] [-le] [-c CORES] [-k] [-spe]
Program: lsaBGC-Ready.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
Program to convert existing antiSMASH (and optionally BiG-SCAPE) results to the input used by the lsaBGC
suite (make it "ready" for lsaBGC analysis). Will run OrthoFinder2 on just proteins from antiSMASH BGCs. If
BiG-SCAPE results are not provided, users have the option to run lsaBGC-Cluster instead which implements algorithms
designed for clustering complete instances of BGCs from completed/finished genomic assemblies.
Note, to avoid issues with BiG-SCAPE clustering (if used instead of lsaBGC-Cluster.py), please use distinct output
prefixes for each sample so that BGC names do not overlap across samples (can happen if sample genomes were
assembled by users and do not have unique identifiers).
Hopefully, in the near future users will be also able to draw from ready made GCF predictions made by BiG-SLICE
as provided in the BiG-FAM database.
optional arguments:
-h, --help show this help message and exit
-i GENOME_LISTING, --genome_listing GENOME_LISTING
Tab-delimited, two column file for primary samples (ideally with high-quality or complete genomes) where the first column is the sample/isolate/genome name and the second is the full path to the genome file (Genbank or FASTA)
-l ANTISMASH_LISTING, --antismash_listing ANTISMASH_LISTING
Tab-delimited, two column file listing antiSMASH results for primary samples (those from the "--genome_listing" argument), where the first column is the sample/isolate/genome name and the second is the full path to an antiSMASH BGC prediction in Genbank format.
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Parent output/workspace directory.
-b BIGSCAPE_RESULTS, --bigscape_results BIGSCAPE_RESULTS
Path to BiG-SCAPE results directory for antiSMASH BGCs predicted in complete genomes. Please make sure the sample names match what is provided for "--genome_listing".
-d ADDITIONAL_GENOME_LISTING, --additional_genome_listing ADDITIONAL_GENOME_LISTING
Tab-delimited, two column file for samples with additional/draft genomes (same format as for the "--genome_listing" argument). The genomes/BGCs of these samples won't be used in ortholog-grouping of proteins and clustering of BGCs, but will simply have gene calling run for them. This will enable more sensitive/expanded detection of GCF instances later using lsaBGC-Expansion/AutoExpansion.
-a, --annotate Perform annotation of BGC proteins using KOfam HMM profiles.
-g, --genomes_as_genbanks
Genomes used for initial antiSMASH analysis were in Genbank format with CDS features which have protein translations included.
-lc, --lsabgc_cluster
Run lsaBGC-Cluster with default parameters. Note, we recommend running lsaBGC-Cluster manually and exploring parameters through its ability to generate a user-report for setting clustering parameters.
-le, --lsabgc_expansion
Run lsaBGC-AutoExpansion with default parameters. Assumes either "--bigscape_results" or "--lsabgc_cluster" is specified.
-c CORES, --cores CORES
Total number of cores/threads to use for running OrthoFinder2/prodigal.
-k, --keep_intermediates
Keep intermediate directories / files which are likely not useful for downstream analyses.
-spe, --skip_primary_expansion
Skip expansion on primary genomes as well.
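The two listing files described in the help text above are plain tab-delimited tables, so they are easy to generate from a script. The following is a minimal sketch, not part of lsaBGC itself: the helper names, sample names, and file paths are hypothetical, and only flags that appear in the usage message are used.

```python
import os


def format_listing(rows):
    """Format (sample name, file path) pairs as two tab-delimited columns.

    Paths are made absolute because the help text asks for full paths.
    """
    return "".join(f"{name}\t{os.path.abspath(path)}\n" for name, path in rows)


def ready_command(genome_listing, antismash_listing, output_directory, cores=4):
    """Assemble an lsaBGC-Ready.py argument list using -i, -l, -o, and -c."""
    return [
        "lsaBGC-Ready.py",
        "-i", str(genome_listing),
        "-l", str(antismash_listing),
        "-o", str(output_directory),
        "-c", str(cores),
    ]


# Hypothetical inputs: one genome per sample, one row per antiSMASH BGC Genbank.
genome_text = format_listing([("sampleA", "genomes/sampleA.fasta")])
antismash_text = format_listing([
    ("sampleA", "antismash/sampleA/region001.gbk"),
    ("sampleA", "antismash/sampleA/region002.gbk"),
])
command = ready_command("genomes.txt", "antismash.txt", "lsaBGC_Ready_Results", cores=8)
```

After writing `genome_text` and `antismash_text` to the two listing files, the argument list can be handed to `subprocess.run(command, check=True)` in an environment where lsaBGC is installed.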
lsaBGC-AutoProcess is the first program to run in the lsaBGC suite and simply creates the required inputs for the rest of the suite. Its implementation is also different in that it requires users to specify paths to separate conda environments for the three programs which generate these required inputs: (1) Prokka, (2) antiSMASH, and (3) OrthoFinder2. It is actually a workflow, similar to lsaBGC-Automate.py, and both programs can be found in the workflows/ subdirectory of the suite.
All three programs take a while to run, so it is recommended that users process only complete / high-quality genomic assemblies through lsaBGC-AutoProcess to lay out and identify the major BGCs found in two or more members of a lineage. Additional instances of BGCs belonging to a GCF of interest can later be identified in high throughput across a multitude of draft genomes using lsaBGC-Expansion.py, if desired. To run lsaBGC-Expansion.py, however, you will first need to run the additional (low/medium quality) draft genomes through lsaBGC-AutoProcess.py in a special mode, specified by setting the flags -p (run only Prokka) and -q (avoid deep annotation with Prokka), which avoids running antiSMASH and OrthoFinder for each genomic assembly.
A hopefully convenient option for users with access to high-performance computing resources is the dry-run option (-d), which simply creates task files with commands for each of the three major programs and leaves it to the user to parallelize or launch them on the server.
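On a single multi-core machine, the commands in a task file can be run a few at a time with a short script. This is only a sketch under stated assumptions: it assumes each task file holds one complete shell command per line, which is not confirmed here, and the helper names are hypothetical. On a cluster, a scheduler's job arrays would normally be used instead.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_tasks(text):
    """Return the commands in a task file, skipping blank and '#' lines.

    Assumes one complete shell command per line (an assumption, see above).
    """
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def run_tasks(commands, max_workers=4):
    """Run each command in its own shell, at most `max_workers` at a time.

    Returns (command, return code) pairs in the same order as the input.
    """
    def run_one(command):
        return command, subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

Threads are sufficient here because each worker only waits on an external process. Any command with a non-zero return code in the result should be inspected and rerun before moving on to the next program.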
usage: lsaBGC-AutoProcess.py [-h] -a ASSEMBLY_LISTING -o OUTPUT_DIRECTORY -cp CONDA_PATH -pe PROKKA_ENV_PATH [-oe ORTHOFINDER_ENV_PATH]
[-ae ANTISMASH_ENV_PATH] [-g GENUS] [-c CORES] [-d] [-s] [-q] [-p] [-f]
Program: lsaBGC-Process.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Microbiology and Immunology
This program will automatically run or create task files for running Prokka (gene calling and annotation),
antiSMASH (biosynthetic gene cluster annotation), and OrthoFinder (de novo ortholog group construction).
optional arguments:
-h, --help show this help message and exit
-a ASSEMBLY_LISTING, --assembly_listing ASSEMBLY_LISTING
Tab delimited text file. First column is the sample name and the second is the path to its assembly in FASTA format. Please remove troublesome characters in the sample name.
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Prefix for output files.
-cp CONDA_PATH, --conda_path CONDA_PATH
Path to anaconda/miniconda installation directory itself.
-pe PROKKA_ENV_PATH, --prokka_env_path PROKKA_ENV_PATH
Path to conda environment for Prokka.
-oe ORTHOFINDER_ENV_PATH, --orthofinder_env_path ORTHOFINDER_ENV_PATH
Path to conda environment for OrthoFinder. Optional, if not used, locus tags will be 3 characters instead of just 2.
-ae ANTISMASH_ENV_PATH, --antiSMASH_env_path ANTISMASH_ENV_PATH
Path to conda environment for antiSMASH. Database should be automatically configured for antiSMASH loaded by the environment.
-g GENUS, --genus GENUS
The genus under investigation. The lineage of interest could be species, but for this, just use the genus.
-c CORES, --cores CORES
The number of cores to use.
-d, --dry_run Just create task files with commands for running prodigal, antiSMASH, and OrthoFinder. Useful for parallelizing across an HPC.
-s, --append_singleton_hgs
Append homolog groups with only one protein representative to the Orthogroups.csv homolog group matrix. This enables more reliable detection of homologous rare/singleton BGCs downstream in the pipeline.
-q, --fast_annotation
Skip basic/standard annotation in Prokka.
-p, --only_run_prokka
Only run Prokka for gene annotation and Genbank creation. Skip the rest.
-f, --refined_orthofinder
Only run OrthoFinder on proteins from antiSMASH proteomes only. This has implications downstream on being able to identify multi-copy genes across the genome.
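The help text for --assembly_listing asks for troublesome characters to be removed from sample names without saying which ones. A minimal sketch of one conservative approach follows. Keeping only letters, digits, and underscores is an assumption rather than a documented lsaBGC rule, and the helper names are hypothetical.

```python
import re


def clean_sample_name(name):
    """Replace each run of characters other than letters, digits, and '_' with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())
    return cleaned.strip("_")


def assembly_listing(assemblies):
    """Format (sample name, FASTA path) pairs as tab-delimited lines.

    Raises ValueError if two different names collapse to the same cleaned name,
    since duplicate sample names would be ambiguous downstream.
    """
    seen = {}
    lines = []
    for name, path in assemblies:
        cleaned = clean_sample_name(name)
        if cleaned in seen:
            raise ValueError(f"{name!r} and {seen[cleaned]!r} both map to {cleaned!r}")
        seen[cleaned] = name
        lines.append(f"{cleaned}\t{path}\n")
    return "".join(lines)
```

The returned text can be written directly to the file passed to -a.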
lsaBGC-AutoProcess.py has a flag called --refined_orthofinder which allows users to request that OrthoFinder be run only on proteins identified as belonging to potential biosynthetic gene clusters instead of the full predicted proteome of each sample. OrthoFinder is a major bottleneck when using whole predicted proteomes, so this option allows lsaBGC-AutoProcess.py to be run in full (Prokka, antiSMASH, and OrthoFinder) on a significantly larger number of samples. A consequence, however, is that certain downstream features in lsaBGC, where the copy number of homolog groups is assessed to identify gene-cluster-family-specific markers, will become less reliable (and options for lsaBGC-Expansion.py should be adjusted accordingly).
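The loss of reliability can be illustrated with a toy example. The sketch below is not lsaBGC's actual algorithm. It assumes a hypothetical list of (sample, homolog group, inside-a-BGC) records and shows how a homolog group with a copy outside any BGC looks BGC-specific once only BGC proteins are clustered.

```python
def bgc_specific_groups(proteins):
    """Return homolog groups whose every observed member lies inside a BGC.

    `proteins` holds (sample, homolog_group, in_bgc) triples. This input
    shape is an assumption made for illustration only.
    """
    only_in_bgc = {}
    for _sample, group, in_bgc in proteins:
        only_in_bgc[group] = only_in_bgc.get(group, True) and in_bgc
    return {group for group, flag in only_in_bgc.items() if flag}


# Hypothetical data: HG_1 has one copy inside a BGC and one elsewhere in the genome.
full_proteome = [
    ("sampleA", "HG_1", True),
    ("sampleA", "HG_1", False),
    ("sampleA", "HG_2", True),
]
# With a BGC-only OrthoFinder run, the copy outside the BGC is never observed.
bgc_proteins_only = [record for record in full_proteome if record[2]]

specific_full = bgc_specific_groups(full_proteome)
specific_refined = bgc_specific_groups(bgc_proteins_only)
```

With the full proteome only HG_2 is reported as BGC-specific, whereas with BGC proteins alone both HG_1 and HG_2 are, because the background copy of HG_1 is invisible.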