
04. Generating Required Inputs for lsaBGC

Rauf Salamzade edited this page Apr 14, 2023 · 18 revisions

Starting from Precomputed antiSMASH - and optionally BiG-SCAPE - Results using lsaBGC-Ready.py

lsaBGC-Ready.py simplifies the usage of lsaBGC for downstream analysis by taking in precomputed antiSMASH, GECCO, or DeepBGC results (and optionally GCF specifications from a prior BiG-SCAPE analysis) and creating inputs needed for lsaBGC-AutoAnalyze.py and other programs.

lsaBGC-Ready.py will first perform genome-wide gene-calling (if genomes are provided as FASTAs) and attempt to match gene-calls to those in antiSMASH BGC Genbanks, renaming locus_tags for predicted CDS features to be sample specific. Afterwards, it will extract proteins from antiSMASH BGC Genbanks and run OrthoFinder2 to determine homologs. To satisfy some assumptions in lsaBGC's backend, it will finally attempt to determine paralogs of BGC associated homolog groups across genomes (mainly to be able to confidently identify whether homolog groups are specific to BGCs or whether they can be found in background genomic contexts).

Final results, which vary depending on the options specified to lsaBGC-Ready.py, can be found in the subdirectory Final_Results/.
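The listing files that lsaBGC-Ready.py takes as input are tab-delimited, two-column files (sample name, then full path to the file). The suite ships listAllGenomesInDirectory.py and listAllBGCGenbanksInDirectory.py for creating them; the sketch below is an independent illustration of the format, not the implementation of those scripts. The function names, the accepted file extensions, and the choice of using the file name without its extension as the sample name are all assumptions made for this example.

```python
from pathlib import Path

# Assumed genome file extensions; adjust to match your own files.
GENOME_SUFFIXES = {".fasta", ".fa", ".fna", ".gbk", ".gbff"}


def write_genome_listing(genome_dir, listing_path):
    """Write one 'sample<TAB>full_path' line per genome file; return the sample names."""
    samples = []
    with open(listing_path, "w") as out:
        for path in sorted(Path(genome_dir).iterdir()):
            if path.is_file() and path.suffix.lower() in GENOME_SUFFIXES:
                # Assumption: the file name without extension is the sample name.
                sample = path.stem
                out.write(f"{sample}\t{path.resolve()}\n")
                samples.append(sample)
    return samples


def read_listing_samples(listing_path):
    """Return the set of sample names (first column) found in a listing file."""
    samples = set()
    with open(listing_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                samples.add(line.split("\t")[0])
    return samples


def unmatched_bgc_samples(genome_listing, bgc_listing):
    """Sample names used in the BGC listing that are missing from the genome listing."""
    return sorted(read_listing_samples(bgc_listing) - read_listing_samples(genome_listing))
```

Running a check like unmatched_bgc_samples() before launching lsaBGC-Ready.py is one way to do the manual inspection of sample linking that the help text recommends; an empty list means every BGC Genbank is tied to a sample that also has a genome.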

Usage

usage: lsaBGC-Ready.py [-h] -i GENOME_LISTING [-d ADDITIONAL_GENOME_LISTING] -l BGC_GENBANK_LISTING
                       [-p BGC_PREDICTION_SOFTWARE] [-b BIGSCAPE_RESULTS] -o OUTPUT_DIRECTORY [-m ORTHOFINDER_MODE] [-mc]
                       [-a] [-t] [-gtm GTOTREE_MODEL] [-lc] [-le] [-c CPUS] [-k] [-spe] [-py]

        Program: lsaBGC-Ready.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        Program to convert existing BGC predictions, e.g. from antiSMASH, DeepBGC, and GECCO, (and optionally BiG-SCAPE)
        results and convert to input used by the lsaBGC suite (make it "ready" for lsaBGC analysis). Will run OrthoFinder2
        on just proteins from antiSMASH BGCs. If BiG-SCAPE results are not provided, users have the option to run
        lsaBGC-Cluster instead which implements algorithms designed for clustering complete instances of BGCs from
        completed/finished genomic assemblies.

    There are scripts for creating input listing files from antiSMASH result directories and directories with
    genomes (listAllBGCGenbanksInDirectory.py & listAllGenomesInDirectory.py) if you don't want
    to write them yourself. Using listing files as inputs instead of directories helps ensure correct sample
    mapping between genomes (FASTA or Genbank) and BGC Genbanks; the listings should be manually inspected to
    ensure proper linking.

        ALGORITHMIC OVERVIEWS/CONSIDERATIONS:
        *****************************************************************************************************************
    -*-  OrthoFinder2 modes:
            * Genome_Wide: Run OrthoFinder2 as intended with all primary sample full genome-wide proteomes.
              [DEFAULT; LOW-THROUGHPUT (<200 Genomes)].
            * BGC_Only: OrthoFinder2 is run across samples/genomes accounting for only BGC embedded proteins.
              Genome-wide paralogs for orthogroups are subsequently identified by using orthogroup specific cutoffs
              based on the percent identity and coverage thresholds determined for each orthogroup (the minimum
              perc. id and coverage observed within BGC proteins belonging to the same orthogroup).
              [MEDIUM-THROUGHPUT (>200 but <500 genomes)]. Note, this can result in the same protein
              being assigned to multiple ortholog groups currently because of the paralogy search (will aim to fix
              this soon, but should have minimal effects I believe).
            * COMING SOON: palo - scalable genome-wide orthology determination.

    -*-  To avoid issues with processing BiG-SCAPE results (if used instead of lsaBGC-Cluster.py), please use distinct
         output prefixes for each sample when running antiSMASH so that BGC names do not overlap across samples
         (can happen if sample genomes were assembled by users and do not have unique identifiers). If issues persist
         please consider using lsaBGC-Cluster.py, we use similar methods to BiG-SCAPE, though the algorithms are
         mostly designed for complete BGCs in mind for lsaBGC-Cluster.py, while BiG-SCAPE has some nice settings
         to handle fragmented BGCs. lsaBGC-Expansion/AutoExpansion are specifically designed for detecting fragmented
         GCF instances in draft assemblies and can be run on the initial "primary" genome set as well.
        *****************************************************************************************************************


optional arguments:
  -h, --help            show this help message and exit
  -i GENOME_LISTING, --genome_listing GENOME_LISTING
                        Tab-delimited, two column file for primary samples (ideally with high-quality or complete genomes)
                        where the first column is the sample/isolate/genome name and the second is the
                        full path to the genome file (Genbank or FASTA).
                        Check note above about available scripts to automatically create this.
  -d ADDITIONAL_GENOME_LISTING, --additional_genome_listing ADDITIONAL_GENOME_LISTING
                        Tab-delimited, two column file for samples with additional/draft
                        genomes (same format as for the "--genome_listing" argument). The genomes/BGCs of these
                        samples won't be used in ortholog-grouping of proteins and clustering of BGCs, but will simply have gene
                        calling run for them. This will enable more sensitive/expanded detection of GCF instances later
                        using lsaBGC-Expansion/AutoExpansion.
                        Check note above about available scripts to automatically create this.
  -l BGC_GENBANK_LISTING, --bgc_genbank_listing BGC_GENBANK_LISTING
                        Tab-delimited, two column file listing BGC predictions results for primary samples
                        (those from the "--genome_listing" argument), where the first column is the sample name and the second
                        is the full path to BGC prediction in Genbank format.
  -p BGC_PREDICTION_SOFTWARE, --bgc_prediction_software BGC_PREDICTION_SOFTWARE
                        Software used to predict BGCs (Options: antiSMASH, DeepBGC, GECCO).
                        Default is antiSMASH.
  -b BIGSCAPE_RESULTS, --bigscape_results BIGSCAPE_RESULTS
                        Path to BiG-SCAPE results directory of antiSMASH/DeepBGC/GECCO results predicted in primary
                        genomes. Please make sure the sample names match what is provided for "--genome_listing".
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Parent output/workspace directory.
  -m ORTHOFINDER_MODE, --orthofinder_mode ORTHOFINDER_MODE
                        Method for running OrthoFinder2. (Options: Genome_Wide, BGC_Only). Default is Genome_Wide.
  -mc, --run_coarse_orthofinder
                        Use coarse clustering of homolog groups in OrthoFinder instead of more resolute hierarchical determined homolog groups.
  -a, --annotate        Perform annotation of BGC proteins using KOfam and PGAP (including TIGR) HMM profiles.
  -t, --run_gtotree     Whether to create phylogeny and expected sample-vs-sample
                        divergence for downstream analyses using GToTree.
  -gtm GTOTREE_MODEL, --gtotree_model GTOTREE_MODEL
                        Set of core genes to use for phylogeny construction in GToTree. Default is Universal_Hug_et_al
  -lc, --lsabgc_cluster
                        Run lsaBGC-Cluster with default parameters. Note, we recommend running lsaBGC-Cluster manually
                        and exploring parameters through its ability to generate a user-report for setting clustering parameters.
  -le, --lsabgc_expansion
                        Run lsaBGC-AutoExpansion with default parameters. Assumes either "--bigscape_results" or
                        "--lsabgc_cluster" is specified.
  -c CPUS, --cpus CPUS  Total number of cpus/threads to use for running OrthoFinder2/prodigal.
  -k, --keep_intermediates
                        Keep intermediate directories / files which are likely not useful for downstream analyses.
  -spe, --skip_primary_expansion
                        Skip expansion on primary genomes as well.
  -py, --use_pyrodigal  Use pyrodigal instead of prodigal.
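As a minimal sketch of how the options above fit together, the following assembles an lsaBGC-Ready.py command line using only flags listed in the help text. The helper names (suggest_orthofinder_mode, build_ready_command) are hypothetical, and the mode choice simply follows the throughput guidance in the help text (Genome_Wide below 200 genomes, BGC_Only otherwise). The help text does not say what to do at exactly 200 genomes or at 500 and above, so the cutoff used here is an assumption.

```python
import shlex


def suggest_orthofinder_mode(n_genomes):
    """Pick an OrthoFinder2 mode from the number of primary genomes.

    Follows the help text: Genome_Wide for <200 genomes, BGC_Only for larger sets.
    Assumption: exactly 200 genomes is treated as BGC_Only. The help text gives
    no recommendation for 500 or more genomes.
    """
    return "Genome_Wide" if n_genomes < 200 else "BGC_Only"


def build_ready_command(genome_listing, bgc_listing, output_dir, n_genomes, cpus,
                        software="antiSMASH"):
    """Return the argument list for an lsaBGC-Ready.py run (documented flags only)."""
    return [
        "lsaBGC-Ready.py",
        "-i", genome_listing,   # primary genome listing
        "-l", bgc_listing,      # BGC Genbank listing for the same samples
        "-p", software,         # antiSMASH, DeepBGC, or GECCO
        "-o", output_dir,       # parent output/workspace directory
        "-m", suggest_orthofinder_mode(n_genomes),
        "-c", str(cpus),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_ready_command("genomes.txt", "bgcs.txt", "lsaBGC_Ready_Results",
                              n_genomes=250, cpus=8)
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once lsaBGC-Ready.py is on the PATH. Flags such as -b, -lc, or -le can be appended to the list in the same way.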

lsaBGC-AutoProcess.py (Now obsolete)

Overview

lsaBGC-AutoProcess is the first program to run in the lsaBGC suite and simply creates the required inputs for the rest of the suite. Its implementation is also different in that it requires users to specify paths to separate conda environments for the three programs which generate these required inputs: (1) Prokka, (2) antiSMASH, and (3) OrthoFinder2. It is actually a workflow, similar to lsaBGC-Automate.py, and both programs can be found in the workflows/ subdirectory of the suite.

All three programs take a while to run, so it is recommended that users only process completed / high-quality genomic assemblies through lsaBGC-AutoProcess to lay out and identify the major BGCs found in two or more members of a lineage. Additional instances of BGCs belonging to a GCF of interest can later be identified in high throughput across a multitude of draft genomes using lsaBGC-Expansion.py, if desired. To run lsaBGC-Expansion.py, however, you will first need to run the additional (low/medium quality) draft genomes through lsaBGC-Process.py in a special mode, specified by setting the flags -p (run only Prokka) and -q (avoid deep annotation with Prokka), which avoids running antiSMASH and OrthoFinder for each genomic assembly.

A hopefully convenient option for users with access to high-performance computing resources is the dry-run option (-d), which simply creates task files with commands for each of the three major programs and leaves it to the user to parallelize or initiate these on the server.
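One way to work through such task files on a single multi-core machine is sketched below. It assumes each task file holds one shell command per line, which this page does not confirm, so check the files produced by the dry run before relying on it. The function names are hypothetical. On a cluster, the same list of commands would more typically be handed to the scheduler as an array job.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_tasks(task_file):
    """Return the commands in a task file, skipping blank lines and '#' comments.

    Assumption: one shell command per line.
    """
    with open(task_file) as handle:
        return [line.strip() for line in handle
                if line.strip() and not line.lstrip().startswith("#")]


def run_tasks(commands, max_workers=2):
    """Run shell commands concurrently; return (command, exit_code) pairs in input order."""
    def run_one(command):
        # Each command runs in its own shell; threads only wait on the child processes.
        return command, subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

Keep max_workers times the per-task thread count within the cores available, and inspect the returned exit codes so that failed samples can be rerun before moving on to the next program.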

Usage

usage: lsaBGC-AutoProcess.py [-h] -a ASSEMBLY_LISTING -o OUTPUT_DIRECTORY -cp CONDA_PATH -pe PROKKA_ENV_PATH [-oe ORTHOFINDER_ENV_PATH]
                             [-ae ANTISMASH_ENV_PATH] [-g GENUS] [-c CORES] [-d] [-s] [-q] [-p] [-f]

        Program: lsaBGC-Process.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Microbiology and Immunology

        This program will automatically run or create task files for running Prokka (gene calling and annotation),
        antiSMASH (biosynthetic gene cluster annotation), and OrthoFinder (de novo ortholog group construction).


optional arguments:
  -h, --help            show this help message and exit
  -a ASSEMBLY_LISTING, --assembly_listing ASSEMBLY_LISTING
                        Tab delimited text file. First column is the sample name and the second is the path to its assembly in FASTA format. Please remove troublesome characters in the sample name.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Prefix for output files.
  -cp CONDA_PATH, --conda_path CONDA_PATH
                        Path to anaconda/miniconda installation directory itself.
  -pe PROKKA_ENV_PATH, --prokka_env_path PROKKA_ENV_PATH
                        Path to conda environment for Prokka.
  -oe ORTHOFINDER_ENV_PATH, --orthofinder_env_path ORTHOFINDER_ENV_PATH
                        Path to conda environment for OrthoFinder. Optional, if not used, locus tags will be 3 characters instead of just 2.
  -ae ANTISMASH_ENV_PATH, --antiSMASH_env_path ANTISMASH_ENV_PATH
                        Path to conda environment for antiSMASH. Database should be automatically configured for antiSMASH loaded by the environment.
  -g GENUS, --genus GENUS
                        The genus under investigation. The lineage of interest could be species, but for this, just use the genus.
  -c CORES, --cores CORES
                        The number of cores to use.
  -d, --dry_run         Just create task files with commands for running prodigal, antiSMASH, and OrthoFinder. Useful for parallelizing across an HPC.
  -s, --append_singleton_hgs
                        Append homolog groups with only one protein representative to the Orthogroups.csv homolog group matrix. This enables more reliable detection of homologous rare/singleton BGCs downstream in the pipeline.
  -q, --fast_annotation
                        Skip basic/standard annotation in Prokka.
  -p, --only_run_prokka
                        Only run Prokka for gene annotation and Genbank creation. Skip the rest.
  -f, --refined_orthofinder
                        Only run OrthoFinder on proteins from antiSMASH proteomes only. This has implications downstream on being able to identify multi-copy genes across the genome.

Run OrthoFinder in Narrow-Scope

lsaBGC-AutoProcess.py has a flag called --refined_orthofinder, which allows users to request that OrthoFinder be run only on proteins identified as belonging to potential biosynthetic gene clusters instead of the full predicted proteome of each sample. OrthoFinder is a major bottleneck when using whole predicted proteomes, so this option allows lsaBGC-AutoProcess.py to be run in full (Prokka, antiSMASH, and OrthoFinder) on a significantly larger number of samples. A consequence, however, is that certain downstream features in lsaBGC that assess the copy number of homolog groups to identify gene-cluster-family-specific markers will become less reliable (and options for lsaBGC-Expansion.py should be adjusted accordingly).
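To make the copy-number caveat concrete, the sketch below counts how many proteins each sample contributes to each homolog group in an OrthoFinder-style orthogroup matrix. It assumes a tab-delimited table whose first column is the homolog group ID, whose remaining columns are samples, and whose cells hold comma-separated protein IDs. The function names are hypothetical, and this is not how lsaBGC itself computes copy number.

```python
import csv
import io


def copy_numbers(table_text):
    """Map each homolog group to {sample: number of proteins in that group}."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    samples = next(reader)[1:]  # header row: group column, then sample names
    counts = {}
    for row in reader:
        if not row:
            continue
        group, cells = row[0], row[1:]
        # Pad short rows so every sample gets a count (missing cell = 0 copies).
        cells = cells + [""] * (len(samples) - len(cells))
        counts[group] = {
            sample: len([p for p in cell.split(",") if p.strip()])
            for sample, cell in zip(samples, cells)
        }
    return counts


def single_copy_groups(counts):
    """Homolog groups present somewhere and never in more than one copy per sample."""
    return sorted(group for group, per_sample in counts.items()
                  if max(per_sample.values(), default=0) == 1)
```

When the matrix was built from BGC proteins only, a group reported as single-copy here may still have additional copies elsewhere in the genome, since those proteins were never given to OrthoFinder. That is why copy-number-based markers are less trustworthy in this mode.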
