04. Generating Required Inputs for lsaBGC

Rauf Salamzade edited this page Jun 10, 2022 · 18 revisions

Starting from Precomputed antiSMASH Results (with or without BiG-SCAPE) using lsaBGC-Ready.py

lsaBGC-Ready.py simplifies the usage of lsaBGC for downstream analysis by taking in precomputed antiSMASH results and optionally GCF specifications from a prior BiG-SCAPE analysis.

lsaBGC-Ready.py first performs genome-wide gene calling (if genomes are provided as FASTAs) and attempts to match the gene calls to those in the antiSMASH BGC Genbanks, renaming the locus_tags of predicted CDS features so that they are sample-specific. It then extracts proteins from the antiSMASH BGC Genbanks and runs OrthoFinder2 to determine homolog groups. Finally, to satisfy some assumptions in lsaBGC's backend, it attempts to identify paralogs of BGC-associated homolog groups across genomes, mainly so that it can be determined with confidence whether a homolog group is specific to BGCs or also occurs in background genomic contexts.

The four major outputs of lsaBGC-Ready.py are:

  • Homolog group vs. Sample presence/absence matrix
  • antiSMASH BGC Listings file
  • Full Genome Predicted Proteome & Genbank Listings file
  • GCF listings file (optional; produced if BiG-SCAPE results are provided)

Usage

usage: lsaBGC-Ready.py [-h] -i GENOME_LISTING -l ANTISMASH_LISTING -o OUTPUT_DIRECTORY [-b BIGSCAPE_RESULTS] [-d ADDITIONAL_GENOME_LISTING] [-a] [-g] [-lc] [-le] [-c CORES] [-k] [-spe]

        Program: lsaBGC-Ready.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        Program to convert existing antiSMASH (and optionally BiG-SCAPE) results and convert to input used by the lsaBGC
        suite (make it "ready" for lsaBGC analysis). Will run OrthoFinder2 on just proteins from antiSMASH BGCs. If
        BiG-SCAPE results are not provided, users have the option to run lsaBGC-Cluster instead which implements algorithms
        designed for clustering complete instances of BGCs from completed/finished genomic assemblies.

        Note, to avoid issues with BiG-SCAPE clustering (if used instead of lsaBGC-Cluster.py), please use distinct output
        prefixes for each sample so that BGC names do not overlap across samples (can happen if sample genomes were
        assembled by users and do not have unique identifiers).

        Hopefully, in the near future users will be also able to draw from ready made GCF predictions made by BiG-SLICE
        as provided in the BiG-FAM database.


optional arguments:
  -h, --help            show this help message and exit
  -i GENOME_LISTING, --genome_listing GENOME_LISTING
                        Tab-delimited, two column file for primary samples (ideally with high-quality or complete genomes) where the first column is the sample/isolate/genome name and the second is the full path to the genome file (Genbank or FASTA)
  -l ANTISMASH_LISTING, --antismash_listing ANTISMASH_LISTING
                        Tab-delimited, two column file listing antiSMASH results for primary samples (those from the "--genome_listing" argument), where the first column is the sample/isolate/genome name the second is the full path to an antiSMASH BGC prediction in Genbank format.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Parent output/workspace directory.
  -b BIGSCAPE_RESULTS, --bigscape_results BIGSCAPE_RESULTS
                        Path to BiG-SCAPE results directory of antiSMASH predicted in complete genomes. Please make sure the sample names match what is provided for "--genome_listings".
  -d ADDITIONAL_GENOME_LISTING, --additional_genome_listing ADDITIONAL_GENOME_LISTING
                        Tab-delimited, two column file for samples with additional/draft genomes (same format as for the "--genome_listing" argument). The genomes/BGCs of these samples won't be used in ortholog-grouping of proteins and clustering of BGCs, but will simply have gene calling run for them. This will enable more sensitive/expanded detection of GCF instances later using lsaBGC-Expansion/AutoExpansion.
  -a, --annotate        Perform annotation of BGC proteins using KOfam HMM profiles.
  -g, --genomes_as_genbanks
                        Genomes used for initial antiSMASH analysis were in Genbank format with CDS features which have protein translations included.
  -lc, --lsabgc_cluster
                        Run lsaBGC-Cluster with default parameters. Note, we recommend running lsaBGC-Cluster manually and exploring parameters through its ability to generate a user-report for setting clustering parameters.
  -le, --lsabgc_expansion
                        Run lsaBGC-AutoExpansion with default parameters. Assumes either "--bigscape_results" or "--lsabgc_cluster" is specified.
  -c CORES, --cores CORES
                        Total number of cores/threads to use for running OrthoFinder2/prodigal.
  -k, --keep_intermediates
                        Keep intermediate directories / files which are likely not useful for downstream analyses.
  -spe, --skip_primary_expansion
                        Skip expansion on primary genomes as well.
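
The listings passed to -i and -l are plain tab-delimited, two-column files, so they are easy to generate with a short script. The sketch below is only an illustration under assumed conditions: the directory layout (one `<sample>.fasta` per genome, one antiSMASH result directory per sample) and the `*region*.gbk` file pattern are hypothetical and should be adapted to how your own antiSMASH results are organized. The function name `write_listings` is also made up for this example.

```python
from pathlib import Path


def write_listings(genome_dir, antismash_dir, genome_listing, antismash_listing):
    """Write the two-column, tab-delimited listings for -i and -l.

    Assumed (hypothetical) layout:
      genome_dir/<sample>.fasta
      antismash_dir/<sample>/*region*.gbk
    """
    genome_dir, antismash_dir = Path(genome_dir), Path(antismash_dir)
    samples = []
    # Genome listing: sample name, then full path to the genome file.
    with open(genome_listing, "w") as out:
        for fasta in sorted(genome_dir.glob("*.fasta")):
            sample = fasta.stem
            samples.append(sample)
            out.write(f"{sample}\t{fasta.resolve()}\n")
    # antiSMASH listing: one line per BGC Genbank, keyed by the same sample name.
    with open(antismash_listing, "w") as out:
        for sample in samples:
            for gbk in sorted((antismash_dir / sample).glob("*region*.gbk")):
                out.write(f"{sample}\t{gbk.resolve()}\n")
    return samples
```

Because both files are written from the same list of sample names, the names in the antiSMASH listing are guaranteed to match those in the genome listing.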

lsaBGC-AutoProcess.py

Overview

lsaBGC-AutoProcess is the first program to run in the lsaBGC suite and simply creates the required inputs for the rest of the suite. Its implementation also differs in that it requires users to specify paths to separate conda environments for the three programs which generate these required inputs: (1) Prokka, (2) antiSMASH, and (3) OrthoFinder2. It is actually a workflow, similar to lsaBGC-Automate.py, and both programs can be found in the workflows/ subdirectory of the suite.

All three programs take a while to run, so it is recommended that users only process complete / high-quality genomic assemblies through lsaBGC-AutoProcess, to lay out and identify the major BGCs found in two or more members of the lineage. Additional instances of BGCs belonging to a GCF of interest can later be identified in high throughput across a multitude of draft genomes using lsaBGC-Expansion.py, if desired. To run lsaBGC-Expansion.py, however, the additional (low/medium-quality) draft genomes must first be run through lsaBGC-AutoProcess.py in a special mode, specified by setting the flags -p (run only Prokka) and -q (avoid deep annotation with Prokka), which avoids running antiSMASH and OrthoFinder for each genomic assembly.
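
As a rough illustration of this Prokka-only mode, the sketch below assembles the argument list for such a run using only flags listed in the lsaBGC-AutoProcess.py usage. The helper name `draft_prokka_only_command` and every path shown are placeholders invented for this example, not values taken from the suite.

```python
import shlex


def draft_prokka_only_command(assembly_listing, output_dir, conda_path,
                              prokka_env_path, cores=4):
    """Return argv for a Prokka-only, fast-annotation run on draft genomes."""
    return [
        "lsaBGC-AutoProcess.py",
        "-a", str(assembly_listing),
        "-o", str(output_dir),
        "-cp", str(conda_path),
        "-pe", str(prokka_env_path),
        "-c", str(cores),
        "-p",  # --only_run_prokka: skip antiSMASH and OrthoFinder
        "-q",  # --fast_annotation
    ]


# All paths below are placeholders.
cmd = draft_prokka_only_command("draft_assemblies.txt", "Draft_Processing/",
                                "/path/to/conda/", "/path/to/conda/envs/prokka/")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed directly to `subprocess.run` once the placeholders are replaced.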

A hopefully convenient option for users with access to high-performance computing resources is the dry-run option (-d / --dry_run), which simply creates task files with commands for each of the three major programs and leaves it to the user to parallelize or initiate them on the server.
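
How the task files are then executed is up to the user. The following is a minimal sketch for running them on a single multi-core machine, and it rests on an assumption not stated on this page: that a task file contains one complete shell command per line. The function name `run_task_file` is hypothetical. On a cluster, a scheduler job array would usually be the better choice.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_task_file(task_file, max_parallel=4):
    """Run each non-empty line of task_file as a shell command.

    Assumes (hypothetically) one complete command per line.
    Returns a list of (command, return_code) tuples in file order.
    """
    with open(task_file) as handle:
        commands = [line.strip() for line in handle if line.strip()]

    def run(command):
        # Threads only wait on the child processes, so they are sufficient here.
        return command, subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run, commands))
```

Checking the returned codes for non-zero values is a simple way to spot samples whose command needs to be rerun.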

Usage

usage: lsaBGC-AutoProcess.py [-h] -a ASSEMBLY_LISTING -o OUTPUT_DIRECTORY -cp CONDA_PATH -pe PROKKA_ENV_PATH [-oe ORTHOFINDER_ENV_PATH]
                             [-ae ANTISMASH_ENV_PATH] [-g GENUS] [-c CORES] [-d] [-s] [-q] [-p] [-f]

        Program: lsaBGC-Process.py
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Microbiology and Immunology

        This program will automatically run or create task files for running Prokka (gene calling and annotation),
        antiSMASH (biosynthetic gene cluster annotation), and OrthoFinder (de novo ortholog group construction).


optional arguments:
  -h, --help            show this help message and exit
  -a ASSEMBLY_LISTING, --assembly_listing ASSEMBLY_LISTING
                        Tab delimited text file. First column is the sample name and the second is the path to its assembly in FASTA format. Please remove troublesome characters in the sample name.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Prefix for output files.
  -cp CONDA_PATH, --conda_path CONDA_PATH
                        Path to anaconda/miniconda installation directory itself.
  -pe PROKKA_ENV_PATH, --prokka_env_path PROKKA_ENV_PATH
                        Path to conda environment for Prokka.
  -oe ORTHOFINDER_ENV_PATH, --orthofinder_env_path ORTHOFINDER_ENV_PATH
                        Path to conda environment for OrthoFinder. Optional, if not used, locus tags will be 3 characters instead of just 2.
  -ae ANTISMASH_ENV_PATH, --antiSMASH_env_path ANTISMASH_ENV_PATH
                        Path to conda environment for antiSMASH. Database should be automatically configured for antiSMASH loaded by the environment.
  -g GENUS, --genus GENUS
                        The genus under investigation. The lineage of interest could be species, but for this, just use the genus.
  -c CORES, --cores CORES
                        The number of cores to use.
  -d, --dry_run         Just create task files with commands for running prodigal, antiSMASH, and OrthoFinder. Useful for parallelizing across an HPC.
  -s, --append_singleton_hgs
                        Append homolog groups with only one protein representative to the Orthogroups.csv homolog group matrix. This enables more reliable detection of homologous rare/singleton BGCs downstream in the pipeline.
  -q, --fast_annotation
                        Skip basic/standard annotation in Prokka.
  -p, --only_run_prokka
                        Only run Prokka for gene annotation and Genbank creation. Skip the rest.
  -f, --refined_orthofinder
                        Only run OrthoFinder on proteins from antiSMASH proteomes only. This has implications downstream on being able to identify multi-copy genes across the genome.
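
The assembly listing for -a asks for sample names free of troublesome characters. The help text does not define which characters are troublesome, so the sketch below makes a conservative assumption: anything other than letters, digits, and underscores is replaced with an underscore. The function names and the `.fasta` suffix are hypothetical choices for this example.

```python
import re
from pathlib import Path


def sanitize_sample_name(name):
    """Replace runs of characters other than letters, digits and '_' with '_'."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")


def write_assembly_listing(assembly_dir, listing_path, suffix=".fasta"):
    """Write tab-delimited lines of sample name and absolute FASTA path.

    Raises ValueError if two files would end up with the same sample name.
    """
    seen = {}
    with open(listing_path, "w") as out:
        for fasta in sorted(Path(assembly_dir).glob(f"*{suffix}")):
            sample = sanitize_sample_name(fasta.name[: -len(suffix)])
            if sample in seen:
                raise ValueError(
                    f"{fasta} and {seen[sample]} both map to {sample!r}")
            seen[sample] = fasta
            out.write(f"{sample}\t{fasta.resolve()}\n")
    return list(seen)
```

The collision check matters because sanitizing can make two distinct file names identical, which would otherwise lead to duplicated sample names in the listing.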

Run OrthoFinder in Narrow-Scope

lsaBGC-AutoProcess.py has a flag called --refined_orthofinder (-f), which allows users to request that OrthoFinder be run only on proteins identified as belonging to potential biosynthetic gene clusters instead of the full predicted proteome of each sample. OrthoFinder is a major bottleneck when whole predicted proteomes are used, so this option allows lsaBGC-AutoProcess.py to be run in full (Prokka, antiSMASH, and OrthoFinder) on a significantly larger number of samples. A consequence, however, is that certain downstream features in lsaBGC that assess the copy number of homolog groups to identify gene-cluster-family-specific markers become less reliable, and options for lsaBGC-Expansion.py should be adjusted accordingly.