04. Generating Required Inputs for lsaBGC
lsaBGC-Ready.py simplifies the use of lsaBGC for downstream analysis by taking in precomputed antiSMASH, GECCO, or DeepBGC results (and optionally GCF specifications from a prior BiG-SCAPE analysis) and creating the inputs needed for lsaBGC-AutoAnalyze.py and other programs.
lsaBGC-Ready.py will first perform genome-wide gene calling (if genomes are provided as FASTAs) and attempt to match gene calls to those in antiSMASH BGC Genbanks, renaming locus_tags for predicted CDS features to be sample-specific. Afterwards, it will extract proteins from antiSMASH BGC Genbanks and run OrthoFinder2 to determine homologs. Finally, to satisfy some assumptions in lsaBGC's backend, it will attempt to determine paralogs of BGC-associated homolog groups across genomes (mainly to confidently identify whether homolog groups are specific to BGCs or can also be found in background genomic contexts).
Final results, which vary depending on the options specified to lsaBGC-Ready.py, can be found in the subdirectory Final_Results/
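As a rough illustration of preparing a run, the sketch below parses the two-column, tab-delimited listing files described in the help text below, checks that every sample in the BGC listing also appears in the genome listing, and assembles a lsaBGC-Ready.py command line. The helper names, file paths, and the default of 4 CPUs are hypothetical. The OrthoFinder2 mode is chosen from the throughput guidance in the help text (Genome_Wide below 200 genomes, BGC_Only above that); treating exactly 200 genomes as BGC_Only is an assumption, and the help text gives no guidance beyond 500 genomes.

```python
import io


def read_listing(handle):
    """Parse a two-column, tab-delimited listing into (sample, path) pairs."""
    rows = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated columns")
        rows.append((fields[0], fields[1]))
    return rows


def unmatched_bgc_samples(genome_rows, bgc_rows):
    """Return sample names in the BGC listing that are absent from the genome listing."""
    genome_samples = {sample for sample, _ in genome_rows}
    bgc_samples = {sample for sample, _ in bgc_rows}
    return sorted(bgc_samples - genome_samples)


def pick_orthofinder_mode(n_genomes):
    """Assumed rule of thumb based on the help text: <200 genomes -> Genome_Wide."""
    return "Genome_Wide" if n_genomes < 200 else "BGC_Only"


def build_ready_command(genome_listing, bgc_listing, output_dir, n_genomes,
                        software="antiSMASH", cpus=4):
    """Assemble an argument list using only flags shown in the usage statement."""
    return [
        "lsaBGC-Ready.py",
        "-i", genome_listing,
        "-l", bgc_listing,
        "-o", output_dir,
        "-p", software,
        "-m", pick_orthofinder_mode(n_genomes),
        "-c", str(cpus),
    ]


# Hypothetical listings, held in memory for the demonstration.
genomes = read_listing(io.StringIO(
    "sampleA\t/data/genomes/sampleA.fasta\n"
    "sampleB\t/data/genomes/sampleB.fasta\n"
))
bgcs = read_listing(io.StringIO(
    "sampleA\t/data/antismash/sampleA/region001.gbk\n"
    "sampleC\t/data/antismash/sampleC/region001.gbk\n"
))

print(unmatched_bgc_samples(genomes, bgcs))  # prints ['sampleC']
print(" ".join(build_ready_command("genomes.txt", "bgcs.txt", "Ready_Results/", len(genomes))))
```

The resulting list can be passed to subprocess.run once the mismatch report is empty; a non-empty report is a sign that the listings need manual correction first.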
usage: lsaBGC-Ready.py [-h] -i GENOME_LISTING [-d ADDITIONAL_GENOME_LISTING] -l BGC_GENBANK_LISTING
[-p BGC_PREDICTION_SOFTWARE] [-b BIGSCAPE_RESULTS] -o OUTPUT_DIRECTORY [-m ORTHOFINDER_MODE] [-mc]
[-a] [-t] [-gtm GTOTREE_MODEL] [-lc] [-le] [-c CPUS] [-k] [-spe] [-py]
Program: lsaBGC-Ready.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
Program to convert existing BGC predictions, e.g. from antiSMASH, DeepBGC, and GECCO (and optionally BiG-SCAPE
results), to input used by the lsaBGC suite (make it "ready" for lsaBGC analysis). Will run OrthoFinder2
on just proteins from antiSMASH BGCs. If BiG-SCAPE results are not provided, users have the option to run
lsaBGC-Cluster instead which implements algorithms designed for clustering complete instances of BGCs from
completed/finished genomic assemblies.
There are scripts for creating input listing files from antiSMASH result directories and directories with
genomes (listAllBGCGenbanksInDirectory.py & listAllGenomesInDirectory.py) if you don't want
to write them yourself. Using listing files as inputs instead of directories helps ensure correct sample mapping
between genomes (FASTA or Genbank) and BGC Genbanks; the listings should be manually inspected to ensure proper
linking.
ALGORITHMIC OVERVIEWS/CONSIDERATIONS:
*****************************************************************************************************************
-*- OrthoFinder2 modes:
* Genome_Wide: Run OrthoFinder2 as intended with all primary sample full genome-wide proteomes.
[DEFAULT; LOW-THROUGHPUT (<200 Genomes)].
* BGC_Only: OrthoFinder2 is run across samples/genomes accounting for only BGC embedded proteins.
Genome-wide paralogs for orthogroups are subsequently identified by using orthogroup specific cutoffs
based on the percent identity and coverage thresholds determined for each orthogroup (the minimum
perc. id and coverage observed within BGC proteins belonging to the same orthogroup).
[MEDIUM-THROUGHPUT (>200 but <500 genomes)]. Note, this can currently result in the same protein
being assigned to multiple ortholog groups because of the paralogy search (will aim to fix
this soon, but should have minimal effects I believe).
* COMING SOON: palo - scalable genome-wide orthology determination.
-*- To avoid issues with processing BiG-SCAPE results (if used instead of lsaBGC-Cluster.py), please use distinct
output prefixes for each sample when running antiSMASH so that BGC names do not overlap across samples
(can happen if sample genomes were assembled by users and do not have unique identifiers). If issues persist,
please consider using lsaBGC-Cluster.py. It uses methods similar to BiG-SCAPE, though its algorithms are
mostly designed with complete BGCs in mind, while BiG-SCAPE has some nice settings
to handle fragmented BGCs. lsaBGC-Expansion/AutoExpansion are specifically designed for detecting fragmented
GCF instances in draft assemblies and can be run on the initial "primary" genome set as well.
*****************************************************************************************************************
optional arguments:
-h, --help show this help message and exit
-i GENOME_LISTING, --genome_listing GENOME_LISTING
Tab-delimited, two column file for primary samples (ideally with high-quality or complete genomes)
where the first column is the sample/isolate/genome name and the second is the
full path to the genome file (Genbank or FASTA).
Check note above about available scripts to automatically create this.
-d ADDITIONAL_GENOME_LISTING, --additional_genome_listing ADDITIONAL_GENOME_LISTING
Tab-delimited, two column file for samples with additional/draft
genomes (same format as for the "--genome_listing" argument). The genomes/BGCs of these
samples won't be used in ortholog-grouping of proteins and clustering of BGCs, but will simply have gene
calling run for them. This will enable more sensitive/expanded detection of GCF instances later
using lsaBGC-Expansion/AutoExpansion.
Check note above about available scripts to automatically create this.
-l BGC_GENBANK_LISTING, --bgc_genbank_listing BGC_GENBANK_LISTING
Tab-delimited, two column file listing BGC predictions results for primary samples
(those from the "--genome_listing" argument), where the first column is the sample name and the second
is the full path to BGC prediction in Genbank format.
-p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
Default is antiSMASH.
-b BIGSCAPE_RESULTS, --bigscape_results BIGSCAPE_RESULTS
Path to BiG-SCAPE results directory of antiSMASH/DeepBGC/GECCO results predicted in primary
genomes. Please make sure the sample names match what is provided for "--genome_listing".
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Parent output/workspace directory.
-m ORTHOFINDER_MODE, --orthofinder_mode ORTHOFINDER_MODE
Method for running OrthoFinder2. (Options: Genome_Wide, BGC_Only). Default is Genome_Wide.
-mc, --run_coarse_orthofinder
Use coarse clustering of homolog groups in OrthoFinder instead of the more finely resolved, hierarchically determined homolog groups.
-a, --annotate Perform annotation of BGC proteins using KOfam and PGAP (including TIGR) HMM profiles.
-t, --run_gtotree Whether to create phylogeny and expected sample-vs-sample
divergence for downstream analyses using GToTree.
-gtm GTOTREE_MODEL, --gtotree_model GTOTREE_MODEL
Set of core genes to use for phylogeny construction in GToTree. Default is Universal_Hug_et_al
-lc, --lsabgc_cluster
Run lsaBGC-Cluster with default parameters. Note, we recommend running lsaBGC-Cluster manually
and exploring parameters through its ability to generate a user-report for setting clustering parameters.
-le, --lsabgc_expansion
Run lsaBGC-AutoExpansion with default parameters. Assumes either "--bigscape_results" or
"--lsabgc_cluster" is specified.
-c CPUS, --cpus CPUS Total number of cpus/threads to use for running OrthoFinder2/prodigal.
-k, --keep_intermediates
Keep intermediate directories / files which are likely not useful for downstream analyses.
-spe, --skip_primary_expansion
Skip expansion on primary genomes as well.
-py, --use_pyrodigal Use pyrodigal instead of prodigal.
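The BGC_Only mode described in the help text above derives orthogroup-specific cutoffs from the minimum percent identity and coverage observed among BGC proteins of the same orthogroup, then uses them to find genome-wide paralogs. The following is a minimal sketch of that idea only, not lsaBGC's actual implementation: the tuple layouts for alignment hits and all function names are hypothetical. It also reproduces the caveat noted in the help text, namely that one protein can end up assigned to more than one orthogroup.

```python
from collections import defaultdict


def orthogroup_cutoffs(bgc_hits):
    """Derive per-orthogroup (min identity, min coverage) cutoffs.

    bgc_hits: iterable of (orthogroup, percent_identity, coverage) tuples for
    alignments between BGC proteins in the same orthogroup (assumed layout).
    """
    cutoffs = {}
    for orthogroup, identity, coverage in bgc_hits:
        if orthogroup in cutoffs:
            min_identity, min_coverage = cutoffs[orthogroup]
            cutoffs[orthogroup] = (min(min_identity, identity),
                                   min(min_coverage, coverage))
        else:
            cutoffs[orthogroup] = (identity, coverage)
    return cutoffs


def assign_paralogs(genome_hits, cutoffs):
    """Assign non-BGC proteins to orthogroups whose cutoffs they meet.

    genome_hits: iterable of (protein, orthogroup, percent_identity, coverage)
    tuples for genome-wide proteins aligned to orthogroup members (assumed layout).
    """
    assigned = defaultdict(set)
    for protein, orthogroup, identity, coverage in genome_hits:
        if orthogroup not in cutoffs:
            continue
        min_identity, min_coverage = cutoffs[orthogroup]
        if identity >= min_identity and coverage >= min_coverage:
            assigned[protein].add(orthogroup)
    return dict(assigned)


# Made-up numbers purely for illustration.
bgc_hits = [("OG1", 92.0, 95.0), ("OG1", 71.5, 88.0), ("OG2", 60.0, 80.0)]
genome_hits = [
    ("protA", "OG1", 75.0, 90.0),
    ("protA", "OG2", 65.0, 85.0),  # same protein also passes the OG2 cutoffs
    ("protB", "OG1", 70.0, 99.0),  # identity below the OG1 minimum of 71.5
]

cutoffs = orthogroup_cutoffs(bgc_hits)
paralogs = assign_paralogs(genome_hits, cutoffs)
print(cutoffs)
print(paralogs)
```

Here protA is assigned to both OG1 and OG2, while protB is rejected because its identity falls below the lowest identity seen among OG1's BGC proteins.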
lsaBGC-AutoProcess is the first program to run in the lsaBGC suite and simply creates the required inputs for the rest of the suite. Its implementation also differs in that it requires users to specify paths to separate conda environments for the three programs which generate these required inputs: (1) Prokka, (2) antiSMASH, and (3) OrthoFinder2. It is actually a workflow, similar to lsaBGC-Automate.py, and both programs can be found in the workflows/ subdirectory of the suite.
All three programs take a while to run, so it is recommended that users only process complete / high-quality genomic assemblies through lsaBGC-AutoProcess to lay out and identify the major BGCs found in two or more members of a lineage. Additional instances of BGCs belonging to a GCF of interest can later be identified in high throughput using lsaBGC-Expansion.py across a multitude of draft genomes, if desired. To run lsaBGC-Expansion.py, however, you will need to run the additional (low/medium quality) draft genomes through lsaBGC-AutoProcess.py in a special mode, specified by setting the flags -p (run only Prokka) and -q (avoid deep annotation with Prokka), which avoids running antiSMASH and OrthoFinder for each genomic assembly.
A hopefully convenient option for users with access to high-performance computing resources is the dry-run option, which simply creates task files with commands for each of the three major programs and leaves it to the user to parallelize or initiate these on the server.
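For users without a job scheduler, one way to work through a dry-run task file is sketched below. It assumes, without confirmation from the lsaBGC documentation, that a task file holds one shell command per line. The function names and the worker count are hypothetical, and commands are run through the shell, so only task files you generated yourself should be fed to it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_tasks(path):
    """Read one shell command per line, skipping blank lines and '#' comments."""
    tasks = []
    with open(path) as handle:
        for line in handle:
            command = line.strip()
            if command and not command.startswith("#"):
                tasks.append(command)
    return tasks


def run_task(command):
    """Run a single command through the shell and report its exit code."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return command, result.returncode


def run_task_file(path, max_workers=4):
    """Run all commands in a task file concurrently, preserving input order."""
    tasks = read_tasks(path)
    # Threads suffice here because each worker just waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, tasks))
```

Calling run_task_file on a task file returns (command, exit code) pairs in file order, so failed commands (non-zero exit codes) can be collected and rerun. On a cluster, the same task file would more typically be split across scheduler jobs instead.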
usage: lsaBGC-AutoProcess.py [-h] -a ASSEMBLY_LISTING -o OUTPUT_DIRECTORY -cp CONDA_PATH -pe PROKKA_ENV_PATH [-oe ORTHOFINDER_ENV_PATH]
[-ae ANTISMASH_ENV_PATH] [-g GENUS] [-c CORES] [-d] [-s] [-q] [-p] [-f]
Program: lsaBGC-Process.py
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Microbiology and Immunology
This program will automatically run or create task files for running Prokka (gene calling and annotation),
antiSMASH (biosynthetic gene cluster annotation), and OrthoFinder (de novo ortholog group construction).
optional arguments:
-h, --help show this help message and exit
-a ASSEMBLY_LISTING, --assembly_listing ASSEMBLY_LISTING
Tab delimited text file. First column is the sample name and the second is the path to its assembly in FASTA format. Please remove troublesome characters in the sample name.
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
Prefix for output files.
-cp CONDA_PATH, --conda_path CONDA_PATH
Path to anaconda/miniconda installation directory itself.
-pe PROKKA_ENV_PATH, --prokka_env_path PROKKA_ENV_PATH
Path to conda environment for Prokka.
-oe ORTHOFINDER_ENV_PATH, --orthofinder_env_path ORTHOFINDER_ENV_PATH
Path to conda environment for OrthoFinder. Optional, if not used, locus tags will be 3 characters instead of just 2.
-ae ANTISMASH_ENV_PATH, --antiSMASH_env_path ANTISMASH_ENV_PATH
Path to conda environment for antiSMASH. The database should be automatically configured for the antiSMASH loaded by the environment.
-g GENUS, --genus GENUS
The genus under investigation. The lineage of interest could be species, but for this, just use the genus.
-c CORES, --cores CORES
The number of cores to use.
-d, --dry_run Just create task files with commands for running prodigal, antiSMASH, and OrthoFinder. Useful for parallelizing across an HPC.
-s, --append_singleton_hgs
Append homolog groups with only one protein representative to the Orthogroups.csv homolog group matrix. This enables more reliable detection of homologous rare/singleton BGCs downstream in the pipeline.
-q, --fast_annotation
Skip basic/standard annotation in Prokka.
-p, --only_run_prokka
Only run Prokka for gene annotation and Genbank creation. Skip the rest.
-f, --refined_orthofinder
Run OrthoFinder only on proteins from antiSMASH proteomes. This has implications downstream on being able to identify multi-copy genes across the genome.
lsaBGC-AutoProcess.py has a flag called --refined_orthofinder which allows users to request that OrthoFinder be run only on proteins identified as belonging to potential biosynthetic gene clusters instead of the full predicted proteome of samples. OrthoFinder is a major bottleneck when using whole predicted proteomes, so using this option allows lsaBGC-AutoProcess.py to be run in full (Prokka, antiSMASH, and OrthoFinder) on a significantly larger sample size. A consequence, however, is that certain downstream features in lsaBGC, where the copy number of homolog groups is assessed to identify gene-cluster-family-specific markers, will become less reliable (and options for lsaBGC-Expansion.py should be adjusted accordingly).